Localisation

Problem description

For local language projects to grow, editors must be able to interact with the MediaWiki software in their own language. The software must also support the characters of the local languages and right-to-left scripts.

MediaWiki system messages

As of 25 December 2009, the statistics on localisation of the MediaWiki system messages list 323 different localisations. MediaWiki defines 362 localisations. The following 39 localisations are excluded for various reasons, most commonly because the localisation definition was created for convenience or is still present for backward compatibility: als, be-x-old, ckb, crh, de-at, de-ch, de-formal, dk, en-gb, fiu-vro, gan, got, hif, kk, kk-cn, iu, kk-kz, kk-tr, ko-kp, ku, ku-arab, nb, ruq, simple, sr, tg, tp, tt, ug, zh, zh-classical, zh-cn, zh-sg, zh-hk, zh-min-nan, zh-mo, zh-my, zh-tw, zh-yue.

When the current translation level of MediaWiki system messages is weighted by importance (the most frequently used messages weigh more than other MediaWiki core messages, which in turn weigh more than messages of MediaWiki extensions used by Wikimedia; see example), 49 Wikimedia-supported languages currently score 90 points or more out of 100[1]. The Wikimedia Site Matrix defines 283 language codes, 13 of which are either non-language codes (closed-zh-tw, nomcom, simple, zh-cfr) or language codes without projects, possibly redirected to another language code (cz, dk, epo, jp, minnan, nan, nb, nomcom, simple, tokipona, tp, zh-cfr). Wikimedia therefore hosts projects in 270 languages with their own prefix. Projects that have not yet established themselves beyond the Wikimedia Incubator are excluded from this count.

Among these languages the level of localisation varies from 0% to 100%. There are 2,369 non-optional MediaWiki core system messages and 2,727 messages for extensions used by Wikimedia (as of 2009/12/25). The total number of messages to be translated is therefore (2,369 + 2,727) * 323 = 1,646,008. The average localisation percentage of MediaWiki core is 46.87%. For MediaWiki extensions used by Wikimedia it is 20.61%[2]. This means that about 1,105,757 messages, or roughly 67% of the total, have not been translated.
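The arithmetic above can be reproduced with a short sketch. It assumes that the two average percentages apply uniformly across all 323 localisations; the variable names are illustrative only. Because the percentages are rounded, the result lands within a few hundred messages of the quoted figure rather than matching it exactly.

```python
# Figures quoted in the text (as of 2009/12/25).
CORE_MESSAGES = 2369          # non-optional MediaWiki core messages
EXTENSION_MESSAGES = 2727     # messages of extensions used by Wikimedia
LOCALISATIONS = 323
CORE_TRANSLATED = 0.4687      # average localisation share, core
EXTENSION_TRANSLATED = 0.2061 # average localisation share, extensions

# Total number of (message, language) pairs to translate.
total = (CORE_MESSAGES + EXTENSION_MESSAGES) * LOCALISATIONS

# Untranslated pairs, assuming the averages hold for every localisation.
untranslated = (
    CORE_MESSAGES * LOCALISATIONS * (1 - CORE_TRANSLATED)
    + EXTENSION_MESSAGES * LOCALISATIONS * (1 - EXTENSION_TRANSLATED)
)
untranslated_share = untranslated / total

print(f"total messages:      {total:,}")
print(f"untranslated (est.): {untranslated:,.0f}")
print(f"untranslated share:  {untranslated_share:.1%}")
```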

Character sets

Latin characters are well supported by MediaWiki, but the character set of every language should be supported. For example, according to http://www.africanlocalisation.net/sites/default/files/AtypI08%20African%20fonts.pdf, African languages largely use Latin alphabets, but with a variety of character set extensions. Other character sets used in African languages include Arabic script, Ethiopic, Tifinagh, Nko, Vai, Kikakui, Bamum and Mandombe. There are probably even more character sets that need to be supported once languages outside Africa are considered.

Adding together the number of speakers of the languages that the document linked above gives as examples of extended Latin alphabets yields about 140 million speakers (slides 12-13), which shows that coverage of different character sets is important.

The following section just lists some information. If you are able to, please rewrite this section with a relevant problem description.

Wikimedia software should support the character set of every language. According to this proposal, http://www.africanlocalisation.net/ could provide some advice. Some relevant information from http://www.africanlocalisation.net/sites/default/files/AtypI08%20African%20fonts.pdf follows:

Fonts used by African languages (a search on Wikipedia for many of the languages given as examples suggests that they can be written in several different writing systems)
  • Latin (more than 90% of the languages)
Examples of languages that use basic Latin fonts:
Swahili, Zulu, Shona, Somali, Oromo
Examples of languages that use Latin fonts with additional characters:
Hausa (40M), Fula (25M), Kanuri (4M), Bambara and other Manding languages, Akan (19M)
Examples of languages that use more complex variations:
Yoruba (25M), Lingala (25M), Dinka (3M)
  • Arabic script
  • Ethiopic
  • Tifinagh
  • Nko
  • Vai
  • Kikakui, Bamum, Mandombe
Support for the different African fonts seems to vary between platforms. The following comments may give some indication of where to look for guidance:
  • Microsoft Windows, good support with latest Uniscribe – problem with older version
  • Mac, great since OpenType support
  • Linux, decent support with latest Gtk/Pango
  • Java, uses ICU, good stuff
Variations of fonts seem to be common too. The following somewhat cryptic notes highlight this (see also the examples in the PDF document above, beginning on page 27):
  • 16 variations of Latin or IPA (ex: Manding Languages, Bambara 3M in West Africa, Lingala 10M in Central Africa)
  • 11 with Greek design (Manding languages 10M, Fula in West Africa 15M, Lingala 10M in Central Africa)
  • 31 with hook or tail (ex: Hausa 25M, Fula 25M, Seereer 1M)
  • 10 with bar/stroke (Hausa 25M, Kanuri 4M, languages in Western and Central Africa)
  • 126 pre-composed accented characters (all across Africa)
  • 15 combining diacritics (Igbo 30M, Yoruba 25M, Lingala 10M, Malagasy 20M, tonal languages)
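The difference between pre-composed accented characters, combining diacritics and hooked letters can be illustrated with Python's standard `unicodedata` module. This is only a sketch of how such characters are encoded in Unicode, not a description of how MediaWiki handles them. The sample characters (an "e" with a dot below and an acute tone mark, of the kind used in Yoruba, and the hooked "ɓ" used in Hausa) and the helper name `describe` are chosen here purely for illustration.

```python
import unicodedata

def describe(text):
    """List the code points in a string with their Unicode names."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

# "e" followed by a combining acute accent and a combining dot below.
typed = "e\u0301\u0323"

# NFC composes as far as pre-composed characters exist. There is a
# pre-composed "e with dot below", but none that also carries the acute
# accent, so the accent remains a separate combining character.
composed = unicodedata.normalize("NFC", typed)

# NFD splits everything into base letter plus combining marks,
# placed in canonical order (dot below before acute).
decomposed = unicodedata.normalize("NFD", composed)

# A hooked letter is a character of its own and has no decomposition.
hooked_b = "\u0253"

for label, value in [("typed", typed), ("NFC", composed),
                     ("NFD", decomposed), ("hooked b", hooked_b)]:
    print(label, describe(value))
```

Text that looks identical on screen can therefore consist of different code point sequences, which matters for searching, sorting and comparing page titles unless the text is normalized first.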
Open Source African fonts
  • Charis SIL and Doulos SIL
  • Gentium
  • DejaVu fonts
  • Liberation fonts (in progress)
  • Droid fonts (in progress)

More documents about African languages can be found at http://www.africanlocalisation.net/documents. The following document seems especially interesting:

Characters needed for African orthographies in Latin writing system - http://www.africanlocalisation.net/content/characters-needed-african-orthographies-latin-writing-system

Right to left support

Finally, some languages, such as Arabic, are written from right to left. The MediaWiki software has to fully support these languages as well. A fuller problem description is still needed here.
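As a small illustration of what right-to-left support involves at the lowest level, the sketch below guesses the base direction of a piece of text from its first strongly directional character, using the bidirectional classes exposed by Python's `unicodedata` module. The function name `base_direction` and the fallback to left-to-right are assumptions made for this example; it is a simplification and not a description of how MediaWiki determines text direction.

```python
import unicodedata

def base_direction(text):
    """Guess "rtl" or "ltr" from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):   # right-to-left letter, Arabic letter
            return "rtl"
        if bidi_class == "L":           # left-to-right letter
            return "ltr"
    # Digits, spaces and punctuation are not strongly directional;
    # falling back to left-to-right is an assumption of this sketch.
    return "ltr"

print(base_direction("Wikipedia"))
print(base_direction("ويكيبيديا"))
print(base_direction("123 ويكي"))
```

Interface layout, mixed-direction text and punctuation placement need considerably more than this, which is why the topic requires its own problem description.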

Languages without orthographies

A conservative estimate of the number of languages is ~6000. The methodology for deciding what counts as a language varies a lot. For cultural and political reasons, relatively distant language systems are often included in one "language" (cf. Mandarin Chinese). A number of the languages of the rain forests of the Amazon, Africa and Polynesia have not been "discovered" yet.

Most languages do not have orthographies. At most ~1000 languages have one.

Most languages will disappear during this century. Those languages are of interest to Wikimedia only in the sense of preserving cultural and scientific knowledge.

However, many of them have a fair chance of surviving. Their chances of survival will increase significantly if they have Wikimedia projects, which means that they need to have orthographies.

According to Ethnologue, the number of languages in each of the large groups is:

  1. Niger-Congo (1532 languages)
  2. Austronesian (1257 languages)
  3. Trans–New Guinea (477 languages)
  4. Sino-Tibetan (449 languages)
  5. Indo-European (439 languages)
  6. Afro-Asiatic (374 languages)
  7. Nilo-Saharan (205 languages)
  8. Pama-Nyungan (178 languages)
  9. Oto-Manguean (177 languages)
  10. Austro-Asiatic (169 languages)
  11. Kradai (92 languages)
  12. Dravidian (85 languages)
  13. Tupian (76 languages)

In addition, there are up to 1000 living indigenous languages of the Americas and more than 1000 languages in smaller families, isolates and unknown languages (cf. for example the Sentinelese language).

Of those groups, only the Indo-European, Dravidian and Afro-Asiatic languages mostly have orthographies. All other language groups consist of languages that mostly do not have orthographies. It may be assumed that at least 1000 languages from the set of smaller groups, isolates and unknown languages do not have orthographies.

Strategies for solving these problems

Translation of MediaWiki system messages

Siebrand at translatewiki.net has estimated that a professional translator could translate 100 messages per hour. Translating the approximately 1,100,000 messages that are currently untranslated would therefore take about 11,000 hours. Counting the languages with more than one million native speakers on this list gives 275 such languages. Assuming that the percentage of untranslated messages is similar to that of the list of 323 languages, this means about 940,000 untranslated system messages. (The number of untranslated messages in these languages is probably lower than this, because the 275 largest languages have been selected, which is likely to be reflected in a higher share of finished translations.) At the same translation speed this means about 9,400 translation hours. There are several ways to get these messages translated.
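The scaling from 323 to 275 languages can be sketched as follows. The sketch uses only figures quoted in this document and assumes, as the text does, that the untranslated share is the same in both sets of languages; the variable names are illustrative.

```python
MESSAGES_PER_LANGUAGE = 2369 + 2727   # core + extension messages
UNTRANSLATED_ALL = 1_105_757          # untranslated messages, 323 localisations
LOCALISATIONS_ALL = 323
LANGUAGES_OVER_1M = 275               # languages with > 1 million native speakers
MESSAGES_PER_HOUR = 100               # Siebrand's estimate for a professional

# Share of untranslated messages across all 323 localisations.
untranslated_share = UNTRANSLATED_ALL / (MESSAGES_PER_LANGUAGE * LOCALISATIONS_ALL)

# Apply the same share to the 275 larger languages.
untranslated_large = untranslated_share * MESSAGES_PER_LANGUAGE * LANGUAGES_OVER_1M

hours_all = UNTRANSLATED_ALL / MESSAGES_PER_HOUR
hours_large = untranslated_large / MESSAGES_PER_HOUR

print(f"untranslated, 275 languages: {untranslated_large:,.0f}")
print(f"hours, all 323 localisations: {hours_all:,.0f}")
print(f"hours, 275 languages:         {hours_large:,.0f}")
```

The text rounds these results to 940,000 messages, 11,000 hours and 9,400 hours respectively.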

  • Translators could be hired. Siebrand estimates that the cost of hiring translators would be $85/hour plus 20% in overhead. This would mean that all the messages could be translated for 9,400 hours * $85/hour * 1.2 = $958,800. This is about 13% of the goal of this year's fundraiser.
  • At translatewiki.net, translation rallies have been arranged in which translators were awarded a share of €1000 if they translated a certain minimum (500?) of system messages. This approach has resulted in paying about $0.08/message. If it is possible to translate all 940,000 messages by this method with the same efficiency, it would cost about $75,000, or, compared again to this year's fundraiser, 1% of this year's goal.
  • A third method could be to let translators register with a PayPal account and pay them $0.10 per message they translate. At the translation speed of 100 messages an hour that Siebrand has estimated, translators could earn about $10/hour. For translators in wealthy countries this would not be very high pay and could be seen more as encouragement for volunteers. In some less wealthy countries, probably including many whose languages are currently under-localized, this amount would probably be quite a high hourly wage. For fairly well localized languages, where volunteering is more likely to happen, the money would therefore be more of an encouragement to do volunteer work, while for the less localized languages, where volunteer work is less likely to happen, it would be more of an actual wage. It is, however, a problem how the quality of the translations can be assured with this method. To cover transaction fees and other overhead, there could be a minimum threshold of translations that must be completed before payment, with the first translations left unpaid to cover such expenses. The cost of this method would be $94,000, or about 1.3% of this year's fundraiser goal. The exact amount of $0.10 could of course be changed, which would change the total cost.
  • A fourth method is to run a massive campaign on all Wikimedia projects that highlights the need for localization. Siebrand could work together with WMF to arrange such a campaign. One way to run this campaign could be to use the fundraiser banner space to promote localization. The local chapters could also promote localization. The cost of this method would be very low.
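The cost estimates for the three paid methods above can be compared in a few lines. This is a sketch based only on the rates quoted in the list and on the estimate of 940,000 untranslated messages; the dictionary keys are illustrative labels, and the cost of the campaign method is not modelled because the text only describes it as very low.

```python
UNTRANSLATED = 940_000        # untranslated messages in the 275 languages
MESSAGES_PER_HOUR = 100       # Siebrand's estimate

HOURLY_RATE = 85              # USD per hour for hired translators
OVERHEAD = 0.20               # 20% overhead on hired translators
RALLY_RATE = 0.08             # USD per message observed in translation rallies
PER_MESSAGE_RATE = 0.10       # USD per message in the pay-per-message proposal

hours = UNTRANSLATED / MESSAGES_PER_HOUR

# Rounded to whole dollars to avoid floating-point noise in the output.
costs = {
    "hired translators": round(hours * HOURLY_RATE * (1 + OVERHEAD)),
    "translation rallies": round(UNTRANSLATED * RALLY_RATE),
    "pay per message": round(UNTRANSLATED * PER_MESSAGE_RATE),
}

for method, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{method:20s} ${cost:>9,}")
```

Hiring translators is roughly ten times as expensive as either per-message scheme, which is consistent with treating it as the last resort.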

A three-stage process to get the localization done could be:

  1. Run a campaign to attract a higher number of volunteer translators to translatewiki.net.
  2. When the number of translated messages starts to plateau, put money into bounty rallies and pay-per-message solutions.
  3. Finally, when neither of these solutions is sufficient to translate the remaining messages, consider whether it is worth hiring professional translators to get the last messages translated.

Important note about paying for translation: Even though paying for translation might be an effective way of getting the translation done, it is important to realize that volunteers are likely to stop translating themselves if others get paid for the same work. Hiring professional translators in particular is likely to discourage volunteer translators. The bounty rallies and pay-per-message methods are less likely to discourage volunteers because everyone has a chance to get a share, but it is then still important to ensure that everyone has the same chance of a share. A monetary reward could also decrease intrinsic motivation, as explained in the Wikipedia article on the overjustification effect.

Character set

To solve character set issues it would be a good idea to cooperate with already established open source communities that are trying to solve the same issues for other systems. africanlocalisation.net (http://www.africanlocalisation.net/) is one such community.

Some open source packages for African fonts are

  • Charis SIL and Doulos SIL
  • Gentium
  • DejaVu fonts
  • Liberation fonts (in progress)
  • Droid fonts (in progress)

More documents about African languages can be found at http://www.africanlocalisation.net/documents. The following document seems especially interesting:

Characters needed for African orthographies in Latin writing system - http://www.africanlocalisation.net/content/characters-needed-african-orthographies-latin-writing-system

Right to left support

This task force has not had time to research what the actual issues with right-to-left support are, but it is important to give right-to-left languages the same support as left-to-right languages.

Important additional note: One more thing that has been brought forward is the need to develop, or make available, internationalization tools for media content. The Task Force has not had time to research what these issues are or how they can be solved, but one thing that has been mentioned is that SVG->PNG conversion gives strange results for some characters. Internationalization of the software is, of course, as important as localization.