Editor Trends Study/Results
In October 2010, the Wikimedia Foundation commissioned a study to help better understand the internal dynamics of our communities. The project page provides details on the motivations for the study and the initial data that gave rise to the research question. The following document summarizes the results of the study.
The goal of the Editor Trends Study is to help the Wikimedia movement understand the dynamics of how editors come and go at Wikipedia. The questions driving this study are:
The answers to these questions are of critical importance in determining Wikimedia’s priorities in response to trends in participation. We have performed extensive analysis of participation trends in the English Wikipedia, with some additional analysis of participation trends in other Wikipedia language editions.
The question above is especially relevant because the number of new Wikipedians has been trending downward since mid-2007 in the English Wikipedia and in other languages. If long-time retention of these new users is also dropping, this would argue for a systematic problem with regard to new contributor growth.
Please keep in mind that this research barely scratches the surface of the type of work we can do to better understand the trends within our communities. Our findings support some early and important conclusions very strongly. They also allow us to formulate some more tentative hypotheses. As with any study of this type, please be open to the possibility that conclusions may change as further analysis is conducted.
While additional analysis and validation are required, our findings support the following early conclusions:
- Wikipedia communities are aging, some more rapidly than others. The percentage of editors with less than one year of experience has fallen quickly since 2006. Among the five Wikipedias studied, the German Wikipedia has seen the most rapid drop in the percentage of "young" users, while Russian Wikipedia has seen the least rapid drop.
- The retention of New Wikipedians dropped dramatically from mid 2005 to early 2007, and has since remained low. In the English Wikipedia, this drop coincided with the largest continuing influx of new editors in the history of the project, while the stabilization of retention coincides with the stabilization of editor numbers.
- This trend cannot be simply attributed to increased newbie experimentation or vandalism. The aforementioned trends hold up even when performing analysis on registered editors who have made at least 50 edits.
- Editor retention has not worsened over the past three years. After the drop in retention rates during 2005–07, editor retention has remained relatively stable, albeit at a substantially lower level. It appears as though the factors that caused the sharp drop in retention during 2005–07, while still exerting their force, aren’t causing retention to get significantly worse.
- The retention rate of long-time editors is about 75–80%. We can expect editors who are over three wiki-years old to stay at a retention rate of about 75–80% over the next year (based on historical trends). This retention rate also appears to be relatively stable, suggesting that the participation trends of experienced editors are not undergoing some massive positive or negative change.
Note on defining editors
If Wikipedia is the encyclopedia anyone can edit, then who is and is not "a Wikipedian" is open to interpretation. For the purposes of this study, we stuck with definitions of editing that are consistent with the ones WMF uses to measure its editor population. Different definitions of the term editor provide different views on what is happening with our editor population and we should continue to evaluate metrics as we gain a deeper understanding of the dynamics of our community (see discussion).
In order to be counted as an editor, we use two main criteria:
- Cumulative lifetime edits: Each editor must make a minimum number of cumulative lifetime edits in order to be counted. We used a cumulative cutoff to focus the analysis on users who have a minimum level of involvement. For most analysis, this number is 10 to match our definition of "New Wikipedian".
- Ongoing edit activity (e.g., monthly, yearly): Once an editor has crossed the cumulative lifetime edits threshold, we consider her "Active" in a given period if she made a minimum number of edits within that period. This level of activity can be measured in months (e.g., edits per month) or years (edits per year).
Each of the graphs will indicate the thresholds used.
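The two criteria can be sketched in Python. This is an illustrative sketch only: the threshold constants and function names are assumptions chosen to mirror the definitions above, not part of the study's actual tooling.

```python
from datetime import datetime

# Thresholds mirror the definitions in the text; values vary per analysis.
CUMULATIVE_THRESHOLD = 10   # "New Wikipedian" cutoff (lifetime edits)
ACTIVE_THRESHOLD = 50       # edits per year used in the yearly analyses

def is_counted(edit_timestamps):
    """An editor is counted once lifetime edits reach the cumulative cutoff."""
    return len(edit_timestamps) >= CUMULATIVE_THRESHOLD

def is_active(edit_timestamps, year):
    """An editor is Active in a year if she made enough edits in that year."""
    edits_in_year = sum(1 for t in edit_timestamps if t.year == year)
    return edits_in_year >= ACTIVE_THRESHOLD
```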
Editor Composition within a Wikipedia Project
Wikipedia communities are aging (Finding #1)
To measure the “age” of Wikipedia communities, we first conducted an analysis of the composition of large Wikipedias. We asked the question: “How old, in wiki-years, is our current editor base?”, wiki-years being defined as the time between their last edit and the time they became a New Wikipedian. For example, if an editor makes his first edit in June 2006 and continues editing (i.e., meets the ongoing activity threshold) into late 2010, that user would be 4.5 wiki-years old. We then asked this question at yearly intervals (e.g., age composition of all relevant editors in 2010, 2009, 2008).
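The wiki-age calculation described above can be sketched as a one-line computation. The function name is illustrative, and we assume simple calendar arithmetic rather than the study's exact implementation.

```python
from datetime import date

def wiki_age_years(new_wikipedian_date, last_edit_date):
    """Age in wiki-years: elapsed time between becoming a New Wikipedian
    (crossing the 10-edit threshold) and the editor's most recent edit."""
    return (last_edit_date - new_wikipedian_date).days / 365.25

# The example from the text: an editor starting in June 2006 who is still
# editing in late 2010 is about 4.5 wiki-years old.
age = wiki_age_years(date(2006, 6, 1), date(2010, 12, 1))
```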
Using this approach, we find that editors in 2010 have the following ages in wiki-years:
Note: the cumulative lifetime edits threshold for this analysis is 10 (definition of a New Wikipedian) and the ongoing threshold is 50 edits per year.
This chart shows that 40% of editors who made at least 50 edits in 2010 have been editing Wikipedia for under a year. Approximately 15% have been editing between one and two years (joined around 2009), and another 12% have been editing between two and three years (joined around 2008).
This composition has changed over time, as shown in the following graph:
We see from this chart that the proportion of new editors in the English Wikipedia is shrinking and that this trend appears to accelerate around 2007. Part of this trend is expected – as communities age, so does the tenure of their users. One would expect that early on in the life of Wikipedia (say 2001–04), new editors would predominate. And as the project grows, there will be more and more editors that have been with the project for a longer period of time.
In addition to looking at the percentage breakdown, it is helpful to look at the absolute numbers of editors.
The absolute numbers, which are on a yearly basis, reflect the pattern that is observed on a monthly basis (see wikistats). The total number grows dramatically between 2005–07, peaks in 2007, then shows a gradual decline. We also see that the absolute number of new users (under one year old) declines after 2006.
When we compare across projects, we note that aging patterns are not identical, though they follow a similar trajectory. The following chart shows the percentage of editors within a given year that have been with Wikipedia for a year or less:
The German Wikipedia has seen the greatest decline in the percentage of Wikipedians with less than one year of experience (25% of 2010 editors). The Russian Wikipedia shows the least decline and has the youngest editors (52% have one year of experience or less).
The above analysis gives us a snapshot of the age composition in “wiki-years” of editors of large Wikipedias. But what factors give rise to these distributions? Three main factors drive these distributions:
- How many New Wikipedians come through the doors in a given year
- How these New Wikipedians are retained over time
- How veteran editors are retained over time
The rest of this analysis will focus on the English Wikipedia (with some comparisons to the German and French Wikipedias), but Wikilytics is available for people who want to conduct similar analysis on other projects.
The following chart shows the number of New Wikipedians joining the English Wikipedia per month (from stats.wikimedia.org).
As the chart shows, there was a tremendous influx of New Wikipedians to the English Wikipedia in 2005–07. This influx created a substantial "supply" of editors for the project, one we still see today: these editors comprise a significant proportion of the editing population of the English Wikipedia in 2010. Since 2008, we have seen a decline in the number of New Wikipedians joining the project, which means that the supply of editors coming into the English Wikipedia is slowing.
What happens to these users after they become New Wikipedians? To answer this question, we've done cohort analysis.
Retention rate dropped from mid-2005 to early-2007 and has remained low ever since (Finding #2)
The cohort analysis performed takes groups of users and analyzes their editing activity over time. Using this methodology, the one-year retention rates of editors that joined at different points in time were calculated and compared to the number of Active Editors:
The red line shows the one-year retention rates of users that joined in each month (i.e., became a New Wikipedian in that month). Retention rate is defined as the percentage of the original cohort that made at least one edit in the twelfth month after joining. For example, of the users that became New Wikipedians in January 2006, about 31% were editing one year later. The drop in retention began rather abruptly in mid 2005 and continued to fall rapidly until early 2007. Then, in early 2007, the retention rate appears to bottom out. When we look at the retention rates after 2007, we come to the following finding:
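The one-year retention measure defined above can be expressed as a short Python sketch. The data layout (per-editor lists of monthly edit counts, with month 0 as the joining month) is an assumption for illustration, not the study's actual code.

```python
def one_year_retention(cohorts):
    """cohorts maps a joining month (e.g. "2006-01") to a list of per-editor
    monthly edit counts indexed by months since joining (month 0 = join month).
    Retention = share of the cohort with at least one edit in the twelfth
    month after joining."""
    rates = {}
    for month, editors in cohorts.items():
        retained = sum(1 for edits_by_month in editors
                       if len(edits_by_month) > 12 and edits_by_month[12] >= 1)
        rates[month] = retained / len(editors)
    return rates
```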
Editor retention has not worsened over the past three years (Finding #4)
Since early 2007, the retention rate has fallen slightly, but not nearly to the degree to which it fell from mid 2005 to early 2007.
Retention and Community Growth
The following graph superimposes the retention trend over the Active Editor trend.
As in the prior graph, the red line corresponds to the one-year retention of cohorts that became New Wikipedians at different points in time. The blue line shows the number of monthly Active Editors (i.e., editors that have made at least five edits in a given month).
The beginning of the retention drop in 2005 coincides with the explosion of active editors. From mid 2005 to early 2007, the retention rate continued to fall as rapidly as the Active Editor count grew. Then, in early 2007, as the Active Editor level plateaued, the retention rate bottomed out. Some critical events also coincided with the sharp fall in retention, such as the Seigenthaler controversy and the beginnings of the BLP policy. A deeper exploration of events affecting the community and subsequent changes in policies, norms, and tools would be helpful in increasing our understanding of the causes of the drop in retention.
Note: Retention charts for other Wikipedia projects may be found here.
We also examined specific cohorts of users to get an even deeper understanding of the longitudinal behavior of different groups of editors. To accomplish this, we followed specific cohorts of users that became New Wikipedians in a given month and measured their behavior over time. In the example below, we've taken the group of editors who became New Wikipedians in January 2006 and measured their activity in each month after joining.
The x-axis in the above chart represents months after becoming a New Wikipedian. Since this cohort is for January 2006, 0 = January 2006, 1 = February 2006, 2 = March 2006, etc. The y-axis shows the percentage of this cohort of New Wikipedians that made at least one edit in the given month. So during month 0 (January 2006), 100% of the cohort made at least one edit since, by definition, they would have had to make at least one edit to cross the 10-edit threshold to have become a New Wikipedian during that month. One month later (February 2006), 43% of this cohort of users made at least one edit. As expected, we see a sharp drop-off in the initial months. But as this cohort ages (towards the right of the graph), the curve flattens, meaning that, proportionally, fewer Wikipedians are becoming inactive each month.
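A minimal sketch of how such an activity curve could be computed follows. The input layout (per-editor monthly edit counts, month 0 = the month the editor became a New Wikipedian) is an illustrative assumption.

```python
def activity_curve(edits_by_editor):
    """Percentage of a cohort making >= 1 edit in each month after joining.
    edits_by_editor: list of per-editor monthly edit counts; every editor's
    count for month 0 is >= 1 by definition of becoming a New Wikipedian."""
    n_months = max(len(counts) for counts in edits_by_editor)
    n_editors = len(edits_by_editor)
    curve = []
    for m in range(n_months):
        active = sum(1 for counts in edits_by_editor
                     if m < len(counts) and counts[m] >= 1)
        curve.append(100.0 * active / n_editors)
    return curve
```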
It is important to note that if an individual editor does not appear in a given month (i.e., does not make at least one edit in that month), they are not considered to have "left" Wikipedia. They are simply inactive for that month and may (or may not) become active again in a subsequent month after lapsing. The aggregate data, however, give a reasonably clear picture of how a given group of users' activity changes over time.
We can then compare this January 2006 cohort to editors that became Wikipedians in January of other years.
The above chart shows the percentage of editors making at least one edit in the months after they became a New Wikipedian (i.e., the months measurement is relative to New Wikipedian month of each cohort).
Several distinct patterns emerge from the data. While all January cohorts from 2004–10 show a declining pattern typical of this type of analysis, two distinct groups appear to cluster:
- 2004 and 2005 as one group with a shallow(er) decline
- 2007–10 as another group with a distinctly steeper decline.
The January 2006 cohort appears to be transitional, as expected from the one-year retention chart previously shown.
Visually, we can see that the January 2004 and January 2005 cohorts maintain their level of activity at a much higher rate than the other cohorts, and editors that joined in Jan 2007–10 become less active more quickly.
The following shows the two-month, six-month and 12-month levels of activity for these January cohorts:
January Cohorts: 2-, 6-, and 12-month activity levels
The most significant difference is between the Jan 2006 and Jan 2007 cohorts, where activity rates see their most substantial drop: a difference of 8 percentage points after the first month and 11 points after six months. After a year, the activity level of the Jan 2007 cohort is about half that of the Jan 2006 cohort. The Jan 2006 cohort also retains activity at a lower rate than the Jan 2004 and Jan 2005 cohorts, though the difference is not as significant. We do not know whether this change in activity retention is a result of the types of editors coming in or because of their experience with editing Wikipedia and/or the community. But we do know something drastically changed during this time period, which corresponds to the period of massive influx of New Wikipedians.
Another way to look at this data is to select a point on the y-axis and see how long each cohort takes to get to that level of activity. If we select 20% (80% of editors inactive during the month):
Months Required for January Cohort to become 80% inactive
It takes the January 2010 cohort of the English Wikipedia two to three months to reach levels the 2004 cohort would not reach for five years.
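Finding the month at which a cohort first falls below a chosen activity level can be sketched as a simple scan over the activity curve. This is an illustrative helper, not the study's code; note that a cohort can bounce back above the threshold in later months, so we report only the first crossing.

```python
def months_to_inactivity(curve, level=20.0):
    """First month at which cohort activity falls below `level` percent
    (i.e. 100 - level percent of the cohort is inactive that month).
    Returns None if the cohort never drops that low in the observed window."""
    for month, pct_active in enumerate(curve):
        if pct_active < level:
            return month
    return None
```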
From this data, we also see that activity retention has not significantly worsened since 2008. Six-month activity numbers have been in the mid-teens for the last three years.
This analysis was repeated for April, July, and October (corresponding to calendar quarters) to ensure that the January cohorts are not an aberration. The quarterly charts are located here.
Downward trend in retention cannot be simply attributed to increased newbie experimentation or vandalism (Finding #3)
We can also repeat this analysis for higher levels of editing activity. Raising the bar to 50 cumulative edits (i.e., requiring 50 lifetime edits instead of 10) yields the following:
While the retention rates are slightly higher (as one would anticipate given the higher cut-off), the general pattern is remarkably similar – higher retention rates for the 2004 and 2005 cohorts, lower retention rates for the 2007–10 cohorts, and 2006 as the transitional year.
This general pattern also exists for the German and French Wikipedias, though the differences between cohorts appear slightly less pronounced. Using the original definition of activity (at least one edit per month), we see that the German and French Wikipedias show substantially the same trend:
Comparing the retention rates for the January 2009 cohorts, we find that German Wikipedia has the highest retention, followed by French Wikipedia and then English Wikipedia.
Retention rate of long-time editors is about 75–80% (Finding #5)
Thus far, we have discussed the first two factors affecting editor composition: the rate of new editors coming in and the retention of these new editors. The third factor, how veteran editors are being retained, may be analyzed by asking "How have the cohorts of veteran editors been trending recently?" For example, we can look at how users who joined in 2006 have been trending in 2009/2010. We can do this by focusing on the right-most portion of the curves:
The slope of this curve, while not as steep as the initial period, is non-zero, indicating that each year, we should expect a certain percentage of these long-time editors to stop editing. The following graph shows the activity of the January cohorts from Sep 2009 to Sep 2010, this time with absolute numbers. The more recent cohorts appear at the top of the graph.
Calculating the September year-over-year change shows that we can expect approximately 12–28% of the older editors to stop editing. Because of the variation in editing numbers from month to month, averaging the change over several months provides a more meaningful representation of the trend. If we average the year-over-year change over six months (the year-over-year changes for April through September), the figure is approximately 20–25%.
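The smoothed year-over-year attrition figure can be computed as follows. This is a sketch under the assumption that we have matched monthly active-editor counts for the same calendar months one year apart (e.g., Apr–Sep 2009 versus Apr–Sep 2010).

```python
def avg_yoy_attrition(counts_prev, counts_curr):
    """Average year-over-year attrition of a cohort, smoothed over several
    months to damp month-to-month noise. counts_prev and counts_curr are
    active editor counts for the same calendar months one year apart."""
    changes = [1.0 - curr / prev for prev, curr in zip(counts_prev, counts_curr)]
    return sum(changes) / len(changes)
```

An attrition of 0.20–0.25 under this calculation corresponds to the 75–80% retention rate reported in Finding #5.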
While we would hope to retain as many valuable contributors as possible, 100% retention is not a realistic expectation. Users leave for a variety of reasons, some of which (e.g., having a child, getting a new job) are outside the control of the community. Retention of 75–80% per year seems to be within the range of reasonable expectations. Further, there does not appear to be any significant change in this trend over the past several years, as evidenced by the linear character of the retention curves after the one-year mark.
The following analysis is not central to the Editor Trends Study, but we wanted to publish the data as it may help provide some helpful context around some of the trends.
Using the same dataset, we are also able to get an understanding of how long it takes users to become a New Wikipedian (i.e., amount of time it takes for users to reach their 10th cumulative edit). The following chart shows, by year, how quickly New Wikipedians achieve their first 10 edits for the English Wikipedia:
The y-axis represents the percentage of users who became New Wikipedians in a given year. The x-axis represents days after making their first edit. These data indicate that, across the years, a consistent 30% of New Wikipedians made their 10th edit on the same day they created their account. For the remaining 70% of users, we see the velocity of editing slow down. Users that started editing in 2004 had the highest velocity for their first 10 edits, with over 80% crossing the 10th-edit threshold in their first two months. By 2010, however, only about 63% of editors made their 10th edit in the first two months. The data do not provide any information on the cause, but edit velocity does appear to have slowed down, at least for the first 10 edits.
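The time-to-10th-edit measure can be sketched per user as follows; the function name is illustrative and we assume edit dates are available per user.

```python
from datetime import date

def days_to_tenth_edit(edit_dates):
    """Days between a user's first edit and the edit that crosses the
    10-edit threshold; None if the user never reaches 10 edits."""
    if len(edit_dates) < 10:
        return None
    ordered = sorted(edit_dates)
    return (ordered[9] - ordered[0]).days
```

A result of 0 corresponds to the roughly 30% of New Wikipedians who cross the threshold on the day of their first edit.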
To answer our questions about the editor dynamics of the Wikipedia projects, we must first decide on our data source, our level of analysis and our analysis strategy.
We have three alternatives for our data source:
- The XML data dumps
- Querying the live database
- Randomized data sampling
We have chosen to use the XML data dumps because, of the three options, they are the most complete, the most accessible, and the easiest to use for replication. We think it is important that Wikipedia community members be able to replicate this study (and future studies) with as few obstacles as possible. Querying the live database is not preferred, as this would make it harder for community members to replicate the analysis. Randomized data sampling is a well-known method when the total population to be surveyed is (too) large. However, we know very little about the underlying distribution, and hence it would be very hard for us to judge whether a random sample is in fact random.
Level of Analysis
We decided to conduct the analysis at the project level (an individual Wikipedia project) as opposed to the ecological level (including all Wikipedia projects). There are considerable differences between the Wikipedia projects along a variety of dimensions (e.g., number of editors, age of project, policies). To do justice to these differences, we need to focus on the individual project. The ecological level could obscure some of the more nuanced findings that are specific to an individual project.
As such, we have done a deep dive on the English Wikipedia with some comparisons to the German and French Wikipedias. Some data are provided for the Russian, Spanish, and Japanese Wikipedias. Other projects may use the Wikilytics software developed by Diederik van Liere to run their own analysis.
Over the past years, different Wikipedia community members have developed different tools to analyze Wikipedia projects. We evaluated two of these tools before deciding that we should develop our own tool. We have considered the following tools:
Wikistats gives very detailed information on the growth of editors of all the Wikipedia projects. It uses the XML data dumps as its data source and has a nice command line interface that allows people to run/replicate their own analysis.
The focus of WikiXRay is primarily on articles and editor contributions, in contrast to the editor focus of Wikistats.
Both Wikistats and WikiXRay plot changes over time, but neither follows specific groups of editors over time in a way that allows us to make statements about differences between such groups. Because neither tool offered the possibility of tracking particular groups of editors over time, and both require significant time to run their analyses, we decided to develop Wikilytics.
Wikilytics is a cross-platform, command-line program written in Python that can be used for all the main Wikipedia projects in their different languages. It supports both the ETL and analysis phases of this research (ETL stands for Extraction, Transformation and Loading). First, Wikilytics downloads an XML data dump file for a particular project in a particular language and unzips it. It then extracts the required variables from the XML file and stores them in intermediate CSV files. Once the extraction phase is finished, all the CSV files are sorted and loaded into a Mongo database, and the final phase precomputes a number of variables. Tasks that can be done in parallel are parallelized.
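To make the extraction phase concrete, here is a minimal Python sketch that pulls revision metadata out of a toy, stub-meta-history-style XML fragment into CSV rows. The element names mirror the MediaWiki export format, but this is an illustration only, not Wikilytics code (the real pipeline also handles XML namespaces, sorting, MongoDB loading, and parallelism).

```python
import csv
import io
import xml.etree.ElementTree as ET

# A toy fragment in the shape of a MediaWiki stub dump (namespaces omitted).
SAMPLE = """<mediawiki>
  <page><title>Example</title>
    <revision><timestamp>2006-06-01T12:00:00Z</timestamp>
      <contributor><id>42</id><username>Alice</username></contributor>
    </revision>
  </page>
</mediawiki>"""

def extract_revisions(xml_text):
    """Extract one row per revision: page title, user id, username, timestamp."""
    rows = []
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        title = page.findtext("title")
        for rev in page.iter("revision"):
            contrib = rev.find("contributor")
            rows.append({
                "title": title,
                "user_id": contrib.findtext("id"),
                "username": contrib.findtext("username"),
                "timestamp": rev.findtext("timestamp"),
            })
    return rows

def to_csv(rows):
    """Serialize extracted rows to an intermediate CSV string."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["title", "user_id", "username", "timestamp"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```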
An extensible plugin architecture allows for developing custom queries to run against the database. The plugin architecture can be thought of as a simplified map-reduce system: the researcher supplies a plugin function (the mapper), which is responsible for outputting the data as key–value pairs, and the built-in reducer aggregates the mapper's values. Results are written to a CSV file, to the database, or to both.
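The simplified map-reduce flow described above can be sketched as follows. Function names and the example plugin are illustrative assumptions, not the actual Wikilytics API.

```python
from collections import defaultdict

def run_plugin(mapper, records):
    """Apply the researcher-supplied mapper to each record; the built-in
    reducer aggregates the emitted values per key (here, by summing)."""
    results = defaultdict(int)
    for record in records:
        for key, value in mapper(record):
            results[key] += value  # built-in reducer: sum per key
    return dict(results)

# Example plugin (the mapper): total edits per joining year.
def edits_per_year(record):
    yield record["year_joined"], record["edits"]
```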
Data Used for Graphs
The following Google Spreadsheets contain the data used for the graphs in this document; each is available for download.
- Editor Age Composition Data
- English Wikipedia Cohort Data
- German Wikipedia Cohort Data
- French Wikipedia Cohort Data
- Spanish Wikipedia Cohort Data
- Japanese Wikipedia Cohort Data
- Russian Wikipedia Cohort Data
Limitations of Analysis
As stated previously, while this study provides a new level of insight into community trends, it just scratches the surface of the type of research that can be done. The analysis uses whether a user has edited within a period of time as the primary indicator of activity and retention. As with any metric for activity, it is imperfect. It does not take into account, for example:
- Whether an edit has been reverted
- Whether an edit is a revert action
- The length of the edit
- The quality of the edit
- Edits across other namespaces
The study also requires an editor to make a minimum number of lifetime edits to be included in the analysis. The purpose of this cut-off is to focus the analysis on editors who have made a minimum level of commitment to editing Wikipedia. We suspect, however, that the band of activity prior to reaching the 10-edit cutoff could be significant. From anecdotal evidence, we know that many editors edit Wikipedia anonymously prior to creating an account.
Incorporating more nuanced definitions of activity and retention in subsequent research may offer slightly different views of the trends.
Using stats.wikimedia.org as the reference dataset, we used the number of “New Wikipedians” as a measure to ensure that the data in the Editor Trends Study match, to a reasonable tolerance, the reference dataset. A large discrepancy would indicate issues with either a) differences in the root dataset and/or b) errors in the processing of the data (ETL, business logic, data manipulation).
The dataset used for stats.wikimedia.org is dated January 15, 2011 and may be found here. The scripts to derive the data were run on data up until December 31, 2010.
The stub-meta-history dataset is used for the Editor Trends Study and is dated September 4, 2010 (see this page).
Because of the difference in the root dataset, we expect some degree of variance between the reference data and the Editor Trends Study. The main known difference is in the calculation of the editor count. In both stats.wikimedia.org and the Editor Trends Study, the number of editors is calculated through tabulation based on the Article table. Both toolkits go through the list of articles (excluding deleted articles) in a given project, such as the English Wikipedia, and then tabulate the user accounts that have made at least one edit to a given article. So at any given point in time, the number of editors actually corresponds to the sum of editors who have edited an active (not-deleted) page on a project.
The first source of discrepancy between the two datasets is page deletions. For example, suppose that at time 0, editor A has edited only page X; an editor count at time 0 would include editor A. If, however, at time 1, page X is deleted and editor A has still edited only page X (prior to its deletion), an editor count at time 1 would not include editor A.
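The tabulation logic and the page-deletion effect can be sketched together. The data layout is an illustrative assumption.

```python
def count_editors(articles):
    """Tabulate editors as described above: the union of user accounts with
    at least one edit to a non-deleted article. Deleting an article can
    therefore retroactively remove its sole contributors from the count."""
    editors = set()
    for article in articles:
        if article.get("deleted"):
            continue  # deleted articles are excluded from the tabulation
        editors.update(article["editors"])
    return len(editors)
```

In this sketch, an editor whose only contributions are to deleted articles drops out of the count entirely, which is exactly the discrepancy described above.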
A second source of discrepancy is that Wikistats counts editors using usernames while Wikilytics counts editors using user ids. MediaWiki, the software used to run Wikipedia, has gone through many improvements and iterations, which has led to some data inconsistencies.
The third source of discrepancy is the way Wikistats and Wikilytics exclude edits by bots. Both use a blacklist approach: if a username or user id belongs to a known bot, then that edit is excluded from the analysis. However, the blacklist used by Wikistats is slightly more up-to-date than the one used by Wikilytics.
Comparison of Cumulative New Wikipedians Number
The following is a comparison of the cumulative number of New Wikipedians (editors with at least 10 lifetime edits):
| As of Sep 2010 | English Wikipedia | German Wikipedia | French Wikipedia |
| Editor Trends Study | | | |
Comparison of New Wikipedians per Month Number
The following graphs show the comparisons of the number of New Wikipedians each month for the English, German, and French Wikipedias: