This is a proposal to create a new Wikimedia project for online Translation Memory.
Translation Memory is a database that stores texts and their translations, with a close mapping between segments of the original and segments of the translation. It can be used for computer-aided translation, linguistic research, language comparison and especially for machine translation.
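To make the definition concrete, here is a minimal sketch of such a database in Python. It is only an illustration under stated assumptions, not a proposed design: the names `TranslationUnit`, `TranslationMemory` and `lookup` are hypothetical, segments are assumed to be single sentences, and fuzzy matching uses the standard library's `difflib` with an arbitrary threshold of 0.75.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class TranslationUnit:
    """One aligned pair: a source segment and its translation (hypothetical name)."""
    source_lang: str
    target_lang: str
    source: str
    target: str


class TranslationMemory:
    """In-memory store of aligned segments with simple fuzzy lookup (illustrative only)."""

    def __init__(self):
        self.units = []

    def add(self, unit):
        self.units.append(unit)

    def lookup(self, text, source_lang, target_lang, threshold=0.75):
        """Return (score, unit) pairs for stored segments similar to `text`, best first.

        The threshold is an assumption chosen for this example.
        """
        matches = []
        for unit in self.units:
            if (unit.source_lang, unit.target_lang) != (source_lang, target_lang):
                continue
            # Character-based similarity between 0.0 and 1.0, ignoring case.
            score = SequenceMatcher(None, text.lower(), unit.source.lower()).ratio()
            if score >= threshold:
                matches.append((score, unit))
        matches.sort(key=lambda match: match[0], reverse=True)
        return matches


tm = TranslationMemory()
tm.add(TranslationUnit("en", "es", "The cat sleeps.", "El gato duerme."))

# A similar, previously unseen sentence still retrieves the stored translation
# as a suggestion for a human translator or a machine translation system.
for score, unit in tm.lookup("The dog sleeps.", "en", "es"):
    print(f"{score:.2f}  {unit.source} -> {unit.target}")
```

A real system would need persistent storage, segmentation of longer texts and a more linguistically informed similarity measure; the sketch only shows the segment-to-segment mapping that distinguishes a Translation Memory from a collection of loosely parallel texts.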
Similar existing projects
There are currently several products that offer functionality that is similar to online translation memory, but none of them has all the needed characteristics of being Free, online and general-purpose:
- translatewiki.net is an online database of translated messages of MediaWiki and several other Free Software packages, which has an easy-to-use interface for entering and editing translations. It is online and Free, and has the characteristics of a Translation Memory system, but it is limited to translating software messages and some tweaking is needed to adapt it to translating general-purpose texts.
- Google Translator Toolkit
- Google Translator Toolkit is an online web application that allows easy sentence-by-sentence translation of various texts. It also has certain integration with Wikipedia. The toolkit was used in several projects in which Google collaborated with Wikipedia communities in languages of developing countries (Swahili, Tamil, Bengali and others). Its disadvantage is that it is not Free Software and the stored pairs of translated texts can only be efficiently reused by Google.
- OmegaT is a computer-aided translation application. It is Free Software and can save the results of the work as a Translation Memory file, but it doesn't offer an online storage service whose contents could be reused by researchers for developing and improving linguistic software.
- OmegaWiki is a freely-licensed project (GFDL and CC-BY) which combines a MediaWiki-based interface with a relational database to create a multilingual dictionary. Its central data object is a "Defined Meaning", which is linked to words in different languages. This means that the project is focused on words rather than sentences or texts, but it's possible that its infrastructure can be reused for sentences and texts, too.
- Apertium is a Free Software package for machine translation, especially suited for closely related languages such as Spanish and Catalan, but it is also used for unrelated languages. It can use Translation Memories internally, but doesn't have an interface for uploading these.
- Tatoeba Project
- The Tatoeba Project has a web site where users can add translations of sentences and paragraphs with a simple interface. All the data is free and open source. The project offers easy download options (currently just CSV format) and is constantly collecting free and open translation memories into its database. If the CSV files were converted into TMX (Translation Memory eXchange) files, this data could be used to augment Apertium.
- Similar to the Tatoeba Project, Traduxio is a web site where people can add translations using a simple interface.
- Wikimedia projects
- Wikipedia uses interlanguage links to map between encyclopedic articles that deal with similar topics. It is not required, however, that the articles be exact translations, which is a basic requirement of Translation Memory. Even if an initial version of an article is a translation from another language, from that point on the two pages develop independently, and it is a basic trait of a Wikipedia article that it is constantly updated to reflect new developments. Coordination and exact sentence-level mapping between different language versions of an article may be a good idea, but it is very hard to implement, at least at this stage. Wikibooks has a similar status in this regard.
- Wikisource texts, unlike Wikipedia articles, are supposed to remain unchanged after they are uploaded and proofread. Wikisource, like Wikipedia, can map between different language versions of one text, although there's no proper mechanism for sentence-level mapping. Also, different language versions have different policies about translated texts - some Wikisource language communities (e.g. Hebrew) accept translations that weren't published earlier, while in others (e.g. English) it is a matter of debate. Finally, there are sometimes several translations of the same text into one language (the Bible is an obvious example, and there are others).
- Wiktionary's structure is very loose and wildly different between different language versions.
- Implementing a central wiki for interlanguage links - using the Interlanguage extension or something else - may facilitate the usage of existing Wikimedia projects as a Translation Memory repository.
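The CSV-to-TMX conversion suggested in the Tatoeba item above could look roughly like the following Python sketch. It assumes a hypothetical two-column CSV layout (source sentence, target sentence) for a single language pair, which is not necessarily how Tatoeba's actual exports are laid out. The function name `csv_to_tmx` and the header values are placeholders chosen for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the standard xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def csv_to_tmx(csv_text, source_lang, target_lang):
    """Convert two-column CSV text (source, target) into a TMX 1.4 string.

    The two-column layout is an assumption made for this sketch.
    """
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "csv_to_tmx-sketch",  # placeholder tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "csv",
        "adminlang": "en",
        "srclang": source_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")

    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        source, target = row[0].strip(), row[1].strip()
        if not source or not target:
            continue  # skip pairs with a missing side
        # One translation unit holds one variant per language.
        unit = ET.SubElement(body, "tu")
        for lang, text in ((source_lang, source), (target_lang, target)):
            variant = ET.SubElement(unit, "tuv", {XML_LANG: lang})
            ET.SubElement(variant, "seg").text = text

    return ET.tostring(tmx, encoding="unicode")


sample = 'Hello.,Hola.\n"Yes, please.","Sí, por favor."\n'
print(csv_to_tmx(sample, "en", "es"))
```

The resulting file contains one `tu` element per sentence pair. Whether Apertium or another tool accepts it as-is would still need to be checked against that tool's own documentation.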
Translation Memory is an important part of modern machine translation software. Good semi-automatic translation may give a strong boost to developing Wikipedia, Wikibooks and Wiktionary projects in underprivileged languages.
- Why should this be a Wikimedia project?
- This should be a Wikimedia project for two reasons:
- A successful execution of such a project will benefit other existing Wikimedia projects.
- Wikimedia, already being a massively multilingual endeavor, has many parts of the infrastructure for implementing this.
- The basic ideas of Translation Memory are fairly simple and described in professional literature. There are no new concepts to invent or to reverse-engineer.
- Existing Wikimedia and related projects have similar infrastructure - MediaWiki+interlanguage links, translatewiki.net, OmegaWiki. There's also OmegaT, which is Free and may be reused.
- Free translation memory - a thread on the foundation-l mailing list, which triggered this idea. The initial version of this page is a summary of that thread. That thread, in turn, was a follow-up to discussions about the usefulness of machine translation and translation memory at Wikimania 2010 and on foundation-l.
Do you have a thought about this proposal? A suggestion? Discuss this proposal by going to Proposal talk:Free Translation Memory.