Content quality/en

From Strategic Planning

What is the content landscape in which Wikimedia operates?

Landscape definition

As an online encyclopedia built through mass collaboration, Wikipedia takes a social media approach to a type of content that is still dominated by more traditional media. To understand the full content landscape in which it (and other Wikimedia projects) operates, it therefore helps to understand both the traditional and social media landscapes, as well as the intersection between the two.

Tom Cross provides one way of starting to think about this hybrid landscape when he states that, “Wikipedia fills the gap between the real-time news media and the slow publication of authoritative encyclopedic resources by providing a central collection point for data about a recent event that is available immediately”.[1]

Below is an updated attempt to map this content landscape, including some of the most prominent participants and the current positioning of several Wikimedia projects:

While Wikimedia dominates the online reference space, it appears to be positioned between several important online content trends:

  • Organizations increasingly relying on social media to provide new ways for people to share real-time information and news [2]
    • 367K CNN iReports worldwide
    • Facebook now has 250M active members worldwide
    • Over 3M tweets are sent per day

  • Blogs becoming a mainstream source of opinion, news, and expert information
    • 77% of active Internet users read blogs
    • There are an average of 900K blog posts per day

  • Key players expanding the supply of free online books and published works
    • >7K free public domain books at Amazon Kindle store
    • Google Books will soon distribute work released with Creative Commons licenses

  • Increasing momentum behind open educational resources
    • California calls for the adoption of digital math and science textbooks
    • President Obama proposes investing in free online courses to improve community colleges

What do these trends mean for Wikimedia? What other online content trends should the strategic planning process take into account?

Here is a link to one of the many recent articles about Google Books and Google's latest moves towards expanding its digital library

What is Wikimedia's current position in this content landscape?

Internet growth by region

Data on the size and growth of Wikipedia content

The current number of articles available for a sample of Wikipedias can be seen here:[3]

Article growth over time for these same Wikipedias can be seen here:[4]

Based on the number of new articles per day, English Wikipedia’s content growth rate appears to have been slowing since 2007:[5]

On all language Wikipedias, article sizes follow a bimodal distribution: many stub and redirect articles, with a smaller peak around the standard article size, which is 1.5 KB.[6]

Data on content breadth and composition

A study by researchers at PARC titled “What’s in Wikipedia: Mapping Topics and Conflict Using Socially Annotated Category Structure” brings some data to bear on the question of what information is actually contained in English Wikipedia’s 2.96M articles. They found information covering 22M categories, which can be grouped into 11 overall topics with the following distribution and growth (2006–2008):

Culture and the arts is not only the largest topic, and twice the size of the next largest topic, but has also seen the most growth since 2006.

Is this the same for other language Wikipedias? How much content sharing currently goes on (e.g. through translation)?

English Wikipedia’s 2.96M articles, and the fact that the PARC researchers found that content could be mapped to 22M categories, also speak to the breadth of content that mass collaboration has made possible. Information comparing Wikipedia’s content breadth to that of other encyclopedic projects can be found at size comparisons. Some of the comparisons that seem most relevant to English Wikipedia have been updated whenever possible and can be seen here:

Data on "vital articles"

Some people argue, however, that content breadth should not be the true goal of Wikipedia. Instead, Wikipedia should focus on creating and improving the quality of a smaller group of "vital articles" that every encyclopedia needs to have.

Members of the community have been working to create a list of 1,000 "essential articles", or "basic subjects for which Wikipedia should have a corresponding high quality article". The current list of those articles can be seen here

Using a slightly different set of topics, here is one way to look at the distribution of these vital articles:

And here is the same list sorted by number of page views

For analysis of the quality of these current articles, please visit the Quality factbase page.

Data on content usage/affinity

Overall page hits per day for the same sample group of Wikipedias can be seen here:[7]

Average page hits per article per day can then be calculated, as is done here:

However, a closer look at the top 1000 pages on English Wikipedia (by average page hits per day for 2009) shows that the top pages get a disproportionate share of page views, and starts to hint at what happens as you move down the content "tail". The top 1000 pages receive 5% of daily page views while representing significantly less than 1% of total pages.[8]

Note: "Special", "Portal", and "Wikipedia" pages (e.g. Main Page, Search, Citation Needed) have been removed from these calculations in order to focus in on content that is being viewed. Obvious redirects to other sites (e.g. YouTube, Facebook, Twitter, and MySpace) have also been removed for the same reason.
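
The filtering and share calculations described above can be sketched in a few lines of Python. This is only an illustration, assuming page views are available as a mapping from page title to average daily hits. The names `is_content_page`, `average_hits_per_article` and `top_n_share`, the exclusion lists, and the sample numbers are all hypothetical and are not taken from the underlying data or the original analysis.

```python
from typing import Mapping

# Assumed filter: namespace prefixes and titles treated as non-content.
# The exact exclusion list used in the original analysis is not known.
EXCLUDED_PREFIXES = ("Special:", "Portal:", "Wikipedia:")
EXCLUDED_TITLES = frozenset({"Main Page"})


def is_content_page(title: str) -> bool:
    """Return True unless the title is in an excluded namespace or title list."""
    return title not in EXCLUDED_TITLES and not title.startswith(EXCLUDED_PREFIXES)


def average_hits_per_article(total_daily_hits: float, article_count: int) -> float:
    """Average page hits per article per day."""
    if article_count <= 0:
        raise ValueError("article_count must be positive")
    return total_daily_hits / article_count


def top_n_share(daily_hits: Mapping[str, float], n: int) -> float:
    """Fraction of content-page hits received by the n most-viewed content pages."""
    content_hits = sorted(
        (hits for title, hits in daily_hits.items() if is_content_page(title)),
        reverse=True,
    )
    total = sum(content_hits)
    if total == 0:
        return 0.0
    return sum(content_hits[:n]) / total


# Made-up numbers, for illustration only.
sample = {"Special:Search": 1000, "Main Page": 500, "A": 60, "B": 30, "C": 10}
print(top_n_share(sample, 1))  # 0.6
```

Note that this sketch measures the share against content pages only; whether the 5% figure above was computed against all page views or only content page views is not stated in the source.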

For the following analysis, the top 100 pages (by average daily page hits) for a sample of language Wikipedias were assigned to a set of general categories, with the following results:

As a note, all "Special", "Search" and "Portal" pages were removed from the analysis in an attempt to isolate the actual content that users are viewing.

Who is creating Wikipedia content?

The following analysis comes from Jose Felipe Ortega Soto - Wikipedia: A Quantitative Analysis.

Most articles receive only a few contributions, with a few articles attracting the majority of contributions.

Articles that reach featured status tend to be edited by experienced users. Most contributors to featured articles have between 300 and 1000 days of experience on Wikipedia.

Editors are less likely to leave Wikipedia if they are contributing to featured articles. The effect is much stronger if they are contributing to both featured articles and talk pages.

The survivability of editors over time on several Wikipedias. The blue line shows that editors who contribute to both talk pages and featured articles are the most likely to remain active contributors. The black line represents the survivability of Wikipedians who did not contribute to featured articles or talk pages. The red line shows editors who contributed to talk pages, and the green line shows editors who contributed to featured articles.

What options does Wikimedia have for extending the scope of its content?

A preliminary list of broad options includes:

  • Continuing to expand content breadth and diversity (within and across languages)
  • Expanding the depth of existing content
    • Expanding support for research on Wikiversity to include research that is not content-related
  • Expanding to different types of content (different forms of content, for different users)
    • Opening new communities to capture market segments that existing communities have chosen not to serve (e.g. Wikibooks limits itself to textbooks)

What initiatives could Wikimedia consider to support this scope extension?

A preliminary list includes:

  • Content donations
  • Content partnerships (e.g. with content institutions or other online encyclopedias)
  • Providing incentives for the community to focus content creation efforts

What is the potential impact of these content initiatives?

  • Adding more content while the number of frequent editors (>100) stagnates worsens the articles-per-frequent-editor ratio ("AFE ratio"). The English-language edition is a prominent example of the consequences: if the AFE ratio rises beyond a certain value, the community is no longer able to guarantee the reliability of the content (i.e. keep it free of vandalism), and the online ecosystem gets out of control.

The articles-per-frequent-editor ratio ("AFE ratio") answers the question: "How many articles have to be controlled by one core community member?"
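
As a minimal sketch, the ratio is simply the article count divided by the number of frequent editors. The function name `afe_ratio` and the input values below are hypothetical and chosen only to show the arithmetic; they are not actual figures for any language edition.

```python
def afe_ratio(articles: int, frequent_editors: int) -> float:
    """Articles per frequent editor; smaller values mean fewer articles per editor."""
    if frequent_editors <= 0:
        raise ValueError("frequent_editors must be positive")
    return articles / frequent_editors


# Hypothetical inputs, for illustration only (not real edition data).
example_articles = 10_000
example_frequent_editors = 250
print(afe_ratio(example_articles, example_frequent_editors))  # 40.0
```

Under these assumed inputs, each frequent editor would have to watch over 40 articles; adding articles without adding frequent editors raises that number.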

Current AFE ratios (May 2009)

| Language edition | Articles (ch>200) | Frequent editors (e>100) | AFE ratio (smaller=better) |
|---|---|---|---|
| en | pending | pending | pending |
| de | 904,000 | pending | pending |
| sv | 300,000 | pending | pending |

  • Increasing the number of market segments the service addresses might enlarge the pool of editors, through greater market penetration, by attracting editors who are not currently drawn to the projects.


  1. Cross, Tom, "Puppy smoothies: Improving the reliability of open, collaborative wikis" http://outreach.lib.uic.edu/www/issues/issue11_9/cross/index.html
  2. Statistics from
  3. http://meta.wikimedia.org/wiki/List_of_Wikipedias#All_Wikipedias_ordered_by_number_of_articles
  4. Wikipedia:Statistics
  5. Wikipedia:Statistics
  6. Jose Felipe Ortega Soto - Wikipedia: A Quantitative Analysis
  7. [1]
  8. [2]