Barriers to growth of South Asian language Wikipedias

Information on the barriers to South Asian language Wikipedias is largely taken from a presentation at Wikimania by BalaSundaraRaman L on the Tamil Wikipedia. It would be great if editors of other South Asian language Wikipedias added their thoughts. Sarah476 16:50, 18 September 2009 (UTC)

//Low awareness of and poor quality of tools for typing Indian scripts on Western-style keyboards//

I think lack of awareness is the main problem. There are very good quality tools actually.

Main barriers I could think of:

  • Lack of awareness of how to type in the local language on a computer.
  • Lack of higher-level socio-economic development. See Maslow's hierarchy of needs.
  • Low penetration of computers and the internet. Use of obsolete operating systems that do not allow reading or writing in the local language.
  • The demographic of typical internet users does not overlap with the demographic of people inclined to work on local languages (typical of India).
  • The medium of instruction and related subject material being available only in English for professional courses and the latest advancements in the world. --Ravidreams 18:55, 18 September 2009 (UTC)

Question: why is it important to encourage people to use non-English languages? Is Hindi, for example, a more authentic medium of communication than English if the individual in question uses both? Some people argue that Indian English deserves its own moniker--i.e. should be regarded as a native/local language in its own right, especially since it does have its own grammatical norms, e.g.

Article count as metric

Language      Official count  > 200 char  Mean bytes  Length > 0.5 KB  Length > 2 KB  Size   Words    Images
Tamil         16 k            16 k        1619        81%              21%            74 MB  3.0 M    3.0 k
Bengali       19 k            12 k        1113        49%              11%            61 MB  3.1 M    8.5 k
Marathi       21 k            6.4 k       623         20%              5%             44 MB  1.8 M    0.769 k
Telugu        42 k            13 k        578         16%              5%             64 MB  3.0 M    2.6 k
Hindi         24 k            14 k        1128        35%              11%            76 MB  4.6 M    1.4 k
Malayalam     8.3 k           7.8 k       2425        78%              30%            58 MB  2.1 M    5.4 k
Kannada       6.1 k           5.3 k       1282        53%              14%            23 MB  0.965 M  0.211 k
Tamil's rank  5               1           2           1                2              2      3        3

Table showing comparison of top Indian language Wikipedias (as of Nov 2008)

Raw article count numbers might be misleading at times. There have been instances of some wikis using bots to add single-word articles (just the title), and of users liberally adding stubs. There are a few other metrics, such as the number of articles above 200 characters or above a certain size in bytes, that discount trivial pages. The slightly outdated table above shows how different the sorting order is depending on the metric chosen. Please note that my intention is to drive home the point about the choice of metric and not to assert anything about particular languages (though the table was prepared elsewhere to analyse Tamil's position). -- Sundar 18:08, 18 September 2009 (UTC)
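
To make the point about metric choice concrete, here is a minimal Python sketch that ranks the seven wikis by three columns of the Nov 2008 table. The values are copied from that table (with "k" expanded to thousands); the names `WIKIS`, `METRICS` and `rank_by` are illustrative only.

```python
# Values copied from the Nov 2008 comparison table; "k" expanded to thousands.
# Tuple layout: (official count, articles > 200 chars, mean bytes per article)
WIKIS = {
    "Tamil":     (16_000, 16_000, 1619),
    "Bengali":   (19_000, 12_000, 1113),
    "Marathi":   (21_000,  6_400,  623),
    "Telugu":    (42_000, 13_000,  578),
    "Hindi":     (24_000, 14_000, 1128),
    "Malayalam": ( 8_300,  7_800, 2425),
    "Kannada":   ( 6_100,  5_300, 1282),
}
METRICS = {"official count": 0, "> 200 chars": 1, "mean bytes": 2}


def rank_by(metric):
    """Return language names ordered from highest to lowest on one metric."""
    col = METRICS[metric]
    return sorted(WIKIS, key=lambda lang: WIKIS[lang][col], reverse=True)


if __name__ == "__main__":
    for metric in METRICS:
        order = rank_by(metric)
        position = order.index("Tamil") + 1
        print(f"{metric:>15}: {', '.join(order)} (Tamil is #{position})")
```

Running it lists a different leader for each metric, which matches the "Tamil's rank" row of the table for these three columns.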

I agree with Sundar. Wikipedians in Malayalam Wikipedia have focused their efforts on ensuring the quality of each article. There have been focused efforts to ensure that every article is useful to a reader and that the number of articles is not artificially inflated. Based on Sundar's table, if we assume that useful articles are those larger than 2 KB, the number of useful articles in each wiki is:
  1. Tamil 3360
  2. Hindi 2640
  3. Malayalam 2490
  4. Telugu 2100
  5. Bengali 2090
  6. Marathi 1050
  7. Kannada 854
As you could observe, this has little correlation with the rank based on the number of articles.
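
The figures in this list appear to be the official article count multiplied by the share of articles above 2 KB in Sundar's table. A short Python sketch of that arithmetic, assuming that reading is right (the names `TABLE` and `useful_articles` are made up for illustration):

```python
# (official count, % of articles larger than 2 KB), copied from the table.
TABLE = {
    "Tamil":     (16_000, 21),
    "Bengali":   (19_000, 11),
    "Marathi":   (21_000, 5),
    "Telugu":    (42_000, 5),
    "Hindi":     (24_000, 11),
    "Malayalam": (8_300, 30),
    "Kannada":   (6_100, 14),
}


def useful_articles(count, pct_over_2k):
    """Estimate articles above 2 KB; integer maths avoids float rounding."""
    return count * pct_over_2k // 100


estimates = {lang: useful_articles(*row) for lang, row in TABLE.items()}
ranking = sorted(estimates.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    for position, (lang, n) in enumerate(ranking, start=1):
        print(f"{position}. {lang} {n}")
```

Since the table's counts are rounded to two significant figures, the results are estimates rather than exact counts.
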
In addition to supporting Sundar's point, my goal is to drive home the point that different wikis use different metrics to evaluate their collaboration. For Malayalam Wikipedia, the metric the Wikipedians have adopted is page depth, since many of us believe that, to a certain extent, it reflects the thought, debate and discussion that went into capturing different perspectives while putting together each article. Many of the articles are related to religion and philosophy, and the wiki has seen a good amount of discussion (as well as a tonne of vandalism from open proxies). As of today, the ranking based on page depth is:
  1. Malayalam 176
  2. Bengali 65
  3. Tamil 26
  4. Kannada 16
  5. Marathi 16
  6. Hindi 14
  7. Telugu 6
Again, little correlation with other rankings. So instead of focusing on rankings, the focus should be on what each Wikipedia community values most, and maybe a graph based on each of those.

--Jacob.jose 01:59, 19 September 2009 (UTC)

Thanks so much for all of this great information. I think that you are definitely right that simple article count is not a good enough metric and that we should also look at the quality of the articles. The information that everyone has provided on quality in South Asian languages is great, and I am going to copy some of it onto the main page. Hopefully, I will soon be able to perform a standardized quality analysis across all of the different language Wikipedias and post the results in their appropriate sections. While I agree that it is important to keep in mind what each Wikipedia community values most, I also think that there is value in being able to compare across Wikipedia communities, especially as the Foundation thinks about how to marshal its scarce resources. Finally, I had a question: I am very interested in the idea of page depth as a metric but am not sure how it is calculated. Would you be able to explain that, Jacob.Jose? Sarah476 19:53, 21 September 2009 (UTC)
Thanks, Sarah, for listening and responding, and Jacob for pitching in. The formula behind the metric is explained at m:Depth. One can see that "non-articles" is a good measure of community participation, but it doesn't always mean more "discussions"; it could be portals, etc. Also, some Wikipedians write entire articles in a single edit, so the number of edits might misrepresent the depth. In my opinion depth is still not a perfect measure, but it is a step in the right direction. -- Sundar 10:07, 22 September 2009 (UTC)
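
For readers who do not want to follow the link, here is a small Python sketch of the depth calculation as it is commonly summarised from m:Depth: edits per article, times non-articles per article, times one minus the stub ratio (articles divided by total pages). Treat the formula as an assumption to be checked against that page; the function name `depth` and the input numbers are hypothetical and not taken from any real wiki.

```python
def depth(edits, articles, total_pages):
    """Depth as summarised above (an assumption; verify against m:Depth)."""
    non_articles = total_pages - articles  # talk pages, portals, etc.
    stub_ratio = articles / total_pages
    return (edits / articles) * (non_articles / articles) * (1 - stub_ratio)


if __name__ == "__main__":
    # Hypothetical wiki: 100,000 edits, 10,000 articles, 30,000 pages in total.
    print(round(depth(100_000, 10_000, 30_000), 1))
```

Under this formula the non-article count enters twice (once directly and once through the stub ratio), so a wiki with many portals or talk pages scores high even if individual articles see few edits, which is the caveat raised in the comment above.
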
Just to put in an update: I have added information about "useful articles" to most of the regional analysis sections. I decided to use articles greater than 1.5 KB, as this was the data available in an Excel file, which allowed me to perform the analysis. I don't think that the difference between 2 KB and 1.5 KB is that great, as the threshold should still eliminate stub articles created by bots. Sarah476 15:35, 30 September 2009 (UTC)

medium of instruction for higher education

//English is the primary language for higher education in South Asia. However, some universities in India use Hindi at the undergraduate level and some universities in Sri Lanka teach in Sinhala.//

In Tamil Nadu, except for professional courses such as engineering and medicine, many of the undergraduate arts and science courses are still taught in Tamil, especially in government colleges. We need to check what the situation is in other non-Hindi-speaking states. --Ravidreams 18:46, 18 September 2009 (UTC)

In Nepal we have two types of elementary schools up to class 10: English medium and Nepali medium. After that, students who choose +2 (classes 11 and 12) have to study in English. At the universities (Tribhuvan University and Kathmandu University) they also have to study in English; except for a few subjects in the arts (humanities), all subjects are taught in English. By the time they complete a Master's, they speak half Nepali and half English mixed together, so people don't understand them... :) Some become neither this nor that... सरोज कुमार ढकाल 20:16, 22 November 2009 (UTC)

Milestones in other Wikimedia projects

Milestones in other Wikimedia projects should be a metric too. Tamil Wiktionary has 96000+ words, ranks among the top 15 Wiktionaries globally, and is number one among Indian languages. --Ravidreams 18:48, 18 September 2009 (UTC)


Is Pakistan the primary country for Punjabi? I thought that there are more Punjabi speakers in India. -- Sundar 11:53, 20 September 2009 (UTC)

I see that it's fixed now. -- Sundar 10:11, 22 September 2009 (UTC)
The Punjabi relevant to Pakistan is pnb, not pa. --Urdutext 01:16, 25 November 2009 (UTC)

File:South Asia are useful articles.png

Hello Sarah476,

Could you please tell me where the data used to prepare this chart was collected from? The information presented in this chart does not match the data available at .

Also, the chart would make more sense if the analysis were based on the percentage of articles larger than 1.5 KB rather than on the number of such articles. The present analysis just shows a subset of the total number of articles. There is no meaning in comparing a Wikipedia with 50,000 articles to another Wikipedia with hardly a few thousand articles.

A Wikipedia with 50,000 articles may of course have 4,000 articles larger than 1.5 KB. But we cannot expect the same number of long articles from a Wikipedia whose total article count is hardly 4,000 or 5,000.

--Shijualex 05:30, 7 October 2009 (UTC)

Dear Shijualex, thanks so much for your comments. The data for this chart comes directly from the .csv files. I believe the files were pulled at the beginning of September. The specific file I used shows the percentage of articles larger than 1.5 KB. I then took that data and combined it with the information I pulled directly from the site on the number of articles. Because these two sets of data were pulled at slightly different times, it is possible that there are some minor inconsistencies. If there are any specific numbers that you think are wrong, please let me know and I will take a look at them and make sure that there are no mistakes in my calculations.

In terms of whether there is value in comparing the number of articles larger than 1.5 KB, I do understand your point that large Wikipedias would be expected to have many more such articles than smaller Wikipedias. However, I also believe that it is important to put these numbers out there. If we are trying to understand which Wikipedias are providing users with useful articles, then it is important to compare these raw numbers. Hopefully, by understanding what the larger Wikipedias with a greater number of articles above 1.5 KB have done to achieve their success, we can learn some lessons and apply them to the development of smaller Wikipedias. Sarah476 14:31, 9 October 2009 (UTC)

need to revisit the potential user calculation figures

The undersigned feels a strong need to revisit the potential user figures calculated at Reach/Regional Analysis/South Asia#Table of Major South Asian languages and their Wikipedias.

The Ethnologue report cited as the data source can be considered very old: the Ethnologue edition referred to claims to have figures from 1997, which means they are more likely to come from the 1991 census. India had a census in 2001, and with India's high rate of population growth the figures cannot be called representative.

Unfortunately, my Firefox crashes when opening the Government of India's official census website, so maybe we need to look for some other resources.

The undersigned could not find the specific International Telecommunication Union 2008 report on which User:Sarah476 has based the figures; a look at some of the reports shows that some of the data is again from 1997/8.

Further, if we take the present figures as correct, then South Asian languages are not likely to have much scope for getting new users, since the potential would already be covered. I sincerely believe that is not correct, so it is my sincere wish that the figures be revisited.

Mahitgar 14:30, 6 November 2009 (UTC)

Article count as metric

A lot of the discussion on this page is centered on "Article count as metric". I do not have any reason to contest these figures, but what we need is sharing of information about which methodologies or steps taken by certain Wikipedias or Wikipedians have led to better performance, what recommendations we can make to other Wikipedias, and whether those recommendations can also be used to further the Wikimedia strategy initiative. Mahitgar 14:35, 6 November 2009 (UTC)

Potential Users

Hello, I changed "Thousands" to "Millions". Yann 19:43, 22 November 2009 (UTC)

Millions doesn't make sense to most of us (people of the Indian subcontinent). We usually use thousands, lakhs, or crores. To avoid ambiguity, it is better to use scientific notation than million or billion. --Shijualex 05:05, 19 January 2010 (UTC)
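
To illustrate the ambiguity, here is a minimal Python sketch that prints one number in Western grouping, Indian (lakh/crore) grouping, and scientific notation. The helper name `indian_grouping` and the sample value are arbitrary choices for illustration, and the helper assumes a non-negative integer.

```python
def indian_grouping(n):
    """Group digits as 3, then 2, 2, ... from the right (non-negative ints)."""
    digits = str(n)
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    if head:
        groups.insert(0, head)
    return ",".join(groups + [tail])


if __name__ == "__main__":
    value = 12_345_678  # arbitrary sample value
    print(f"{value:,}")            # Western grouping (thousands, millions)
    print(indian_grouping(value))  # Indian grouping (thousands, lakhs, crores)
    print(f"{value:.2e}")          # scientific notation
```

One lakh is 10^5 and one crore is 10^7, so the sample value is about 1.23 crore or 12.3 million; the scientific form reads the same under either convention.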

Number of Speakers of Nepali language

The population of Nepal is 22,736,934 and they all speak the Nepali language. In India, the demographic records show there are 10 million native speakers of Nepali, and in Bhutan about 30-40% of the people speak Nepali... I am not ready to accept the data table that is presented here; please have a muse over it... सरोज कुमार ढकाल 10:47, 26 November 2009 (UTC)