Proposal talk:Content translated automatically

From Strategic Planning

Ah, this assumes that all the content on Wikipedia is converted into a database (the ultimate database) and that Wikipedia stops being an encyclopedia. This proposal is actually quite close to the proposal "A central repository of all language independent data". - Brya 19:16, 2 September 2009 (UTC)

Impact?

Some proposals will have massive impact on end-users, including non-editors. Some will have minimal impact. What will be the impact of this proposal on our end-users? -- Philippe 00:07, 3 September 2009 (UTC)

Automated translation is impractical

Unfortunately, automated translation is a technology which has not yet arrived, nor will it in the near future. The only environment where it is even moderately accurate is in restricted subject domains, which is quite the opposite of what we have in all-encompassing Wikimedia projects. It's not that machine translation in an unrestricted domain is always gibberish; there are occasional short passages, typically in stilted prose, that can be rendered quite accurately. Rather, it's that in a passage of any length, particularly if it is written in an interesting style, it is virtually certain that some part of it will not be translatable by machine at all. There are a host of different reasons for this, including grammatical features that depend on semantic or discourse context for interpretation, the assumption of a shared culture by the writer, homographs, ambiguities, misspellings and omitted diacritical marks in the source language, idioms, and misinterpretation of proper names as common nouns. Even if a translation were 90% accurate (50% is a more realistic figure), if the remaining 10% is erroneous in a biographical article, it could expose Wikimedia to legal action. In Wikimedia projects, we would also have issues with wikimarkup, template transclusion, etc. And that's just for one additional language. For a more thorough discussion, see this Wikipedia article, or if you really want to be convinced, use Google Translate to translate a random 200-word passage (say, this paragraph) into a language you know—or translate it into Spanish, and then translate the result back into English. Even though the latter test can hide some kinds of translation errors, I predict you'll agree the result is not a sound translation. Unconventional 02:21, 22 September 2009 (UTC)

Update: I went ahead and tried the Google Translate test. The last sentence, in Spanish, came out as "In spite of that the test of the latter can hide some kinds of translation error, I can predict that you will agree the result is not a translation of sound", that is, it used a different sense of "sound". Then, when I translated it back into English, it came out "...is a sound translation", hiding the previous error and now dropping the "not", thus reversing the meaning! There were a dozen other errors as well. Unconventional 02:57, 22 September 2009 (UTC)
This has already been tried for Czech: http://navajo.cz/ (automatic translation from en:). The result is absolutely unusable. 85.70.83.93 23:06, 20 October 2009 (UTC)
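
The point above, that a round trip can hide translation errors, can be illustrated with a toy sketch. The two dictionaries and the function names below are made up for illustration; this is a word-for-word lookup, not how Google Translate or any real system works.

```python
# Toy word-for-word dictionaries (illustrative assumptions, not real MT output).
# "sound" is deliberately mapped to the noise sense, which is wrong here.
EN_TO_ES = {"a": "una", "sound": "sonido", "translation": "traducción"}
ES_TO_EN = {es: en for en, es in EN_TO_ES.items()}


def translate(text, table):
    """Replace each word using the table; unknown words pass through."""
    return " ".join(table.get(word, word) for word in text.split())


def round_trip(text):
    """Translate English to Spanish and back, returning both stages."""
    spanish = translate(text, EN_TO_ES)
    return spanish, translate(spanish, ES_TO_EN)


spanish, back = round_trip("a sound translation")
print(spanish)  # the wrong sense of "sound" is visible here
print(back)     # identical to the input, so the error is hidden
```

Because the reverse table simply inverts the forward one, the wrong sense chosen on the way out is undone on the way back, so comparing only the first and last text would not reveal it.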

Manual translation to in-between language

(EDIT -- this bit's cack; I'd recommend reading 'take 2' below)

OK, as per the above (auto-translation is infeasible):

How about a mechanical 'inter-language' which contains all the data necessary to translate to any language?

For example, take the phrase I miss you. Included in that phrase is:

  • meaning of words:
    • 'I' unambiguously means 'the speaker'
    • 'miss' can mean a few things, though its meaning is quite obvious from the context to a human
    • 'you' means 'the listener' or 'undefined' (informal form of 'one', or German 'man')
  • plurality
    • unambiguously singular with 'I'
    • missing with 'you', could be singular or plural
  • role in sentence
    • subject-verb-etc, deducible from sentence structure (and inflection of 'I'), and the fact that 'miss' is a verb
      • 'I' is missing; 'you' is being missed OR having something (implied) missed at it (I shoot gun, 'gun' is having 'shoot' done TO it; I shoot you, 'you' is having shoot done AT it)
  • temporal inflection of verb 'miss'
    • i.e., just generally true wrt time.

OTOH, some information is NOT present that would be REQUIRED to translate to another language

  • plurality of 'you'
  • sex
    • Japanese informal for 'I': boku (male), atashi (female)
  • age
    • some languages inflect pronouns based on age?
  • formality
    • English pretty much has 'you' and 'sir'; German has 'dich' and 'Sie'; BUT, formality levels are not comparable. If we split 'formality level' into 1, 2 and 3:
      1. English == 'you', German == 'dich'
      2. English == 'you', German == 'Sie'
      3. English == 'Sir', German == 'Sie'
    • also, Japanese etc. have much more complicated formality levels
  • non-temporal inflection
    • did person expect 'miss you', is happy/sad at 'miss you'; I /think/ some other languages inflect like this?
  • etc

So, auto-translation isn't possible due to ambiguities (/which/ 'miss' is happening? is 'you' 2nd person or the less poncy way of saying 'one'?) and omissions (is 'I' an adult male? young girl?).

If you could convert the text into a form that had all of the missing data in:

  • I [1st person singular formality:4 adult (etc)][subject,implied]
    • "1st person singular" as infrequently, 'me' is used in this role: "you and me" for example.
    • "formality:4" is for, e.g., Japanese, which has different formality levels -- watashi, boku, watakushi, etc...
    • "adult" for languages where 'I' inflects by age of speaker
    • "subject" clearly marks the subject
    • "implied" for languages like Japanese which apparently have a tendency to not explicitly state that which can be reasonably assumed
  • miss [meaning3, verb]
    • "meaning 3" differentiates from, e.g., 'did not hit'
  • you [2nd person singular, formality:4, (etc)][umm... direct object?]
    • other languages need to know singular/plural, something not present in the English 'you', in order to translate.

and, e.g., 'I shoot you' would also include something along the lines of "I shoot (IMPLIED:gun) (dative:)you" so that Japanese and English could omit 'the gun', whilst other languages that don't do that can include "the (otherwise not actually present in the text) gun"

With something like the above, in theory translating FROM this to any language becomes a lot easier as no data is missing/ambiguous.
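
A minimal sketch of what such annotated words might look like as data, assuming a made-up `Word` record and tag names (none of this is an existing format). It only shows how an implied word could be kept in the record and dropped or included at output time; it does not attempt word order or inflection.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    """One word of the hypothetical inter-language, with its annotations."""
    text: str
    tags: dict = field(default_factory=dict)
    implied: bool = False  # part of the meaning, but absent from the source text


def render(words, keep_implied):
    """Join the words, optionally leaving out the implied ones."""
    return " ".join(w.text for w in words if keep_implied or not w.implied)


sentence = [
    Word("I", {"person": 1, "number": "singular", "formality": 4,
               "age": "adult", "role": "subject"}),
    Word("shoot", {"sense": "fire-weapon"}),
    Word("gun", {"role": "direct object"}, implied=True),
    Word("you", {"person": 2, "number": "singular", "formality": 4,
                 "role": "dative"}),
]

print(render(sentence, keep_implied=False))  # for languages that omit the gun
print(render(sentence, keep_implied=True))   # for languages that must state it
```

The output with `keep_implied=True` is deliberately not grammatical English; a real generator for each target language would still have to reorder and inflect the words.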

Translating TO could be done wiki-style, e.g. simply show non-inter-language'd words in green, and volunteers can click them and fill in a form. For 'you' the form could ask: speculative|2nd singular|2nd plural? What formality level is appropriate: talking to queen|boss|wife|equal|stranger|underling|scum|someone whose car you've just hit|someone you think is awesome? Is 'you' (mostly) male|female, child|adolescent|adult|animal?

That way you'd gradually 'inter-language' everything. English/American people could write an article, which could be auto-translated into Russian, a Russian could improve it, other Russians could 'inter-language' the improvement, which could then be auto-translated into French for a French editor to improve, etc, unifying the editor base and reducing the translation overhead (and making it less skill-intensive: editors need only know the basics of their own language, not be fluent in another), presumably making every language's articles better and possibly countering cultural bias.

The translating TO workload would obviously be split amongst many people over time. Actually, you could possibly auto-translate TO the inter-language, then have it handle the ambiguities like Ich missen|fail-to-hitten dich|sie(?) so that if it's obvious to any Germans which is meant, they can (check, and) provide the missing data, perhaps helping make it clearer in another language. Or something. --86.148.9.100 07:19, 23 September 2009 (UTC)

While I'm a great believer in making wiki articles more widely available, I don't see this method working. For example, the phrase "I miss you" can be expressed as "Ich vermisse Dich" and, more commonly, "Du fehlst mir". "Missen" means "do without, manage without, give up on" - a somewhat different message to give someone. A German speaker would have no idea what "fail-to-hitten" means, unless he can speak English; in which case, he would have a good laugh - putting the suffix "en" on to the end of an English word is not the solution. Similarly, there is a difference between "Sie" and "sie", which alters the meaning of the sentence from "I miss you" to "I miss them". My conclusion is that, until computers develop a fuzzy logic to be able to identify subtleties in language and translate them equally subtly into the target language, the only solution is a human translator directly translating between languages. 78.152.231.113 14:55, 1 October 2009 (UTC)

take 2

OK, I explained that very poorly, and my examples were marred by the fact that I am monolingual.

Let me try again:

  • Let's say that an article contains the phrase "I miss you".
  • There is ambiguity here:
    • miss: fail to hit, or derive displeasure from the absence of?
    • you: singular or plural?
    • tone: formal or informal?
    • etc, but let's just stick to the ambiguities that mar translation into German
  • this gives 8 possibilities for translating into German (forgive me if I'm wrong, any German-speakers feel free to correct this bit; the verb-inflection is definitely wrong):
    • fail-to-hit,singular,informal -- ich verschiesse dich
    • fail-to-hit,singular,formal -- ich verschiessen Sie
    • fail-to-hit,plural,informal -- ich verschiessen euch
    • fail-to-hit,plural,formal -- ich verschiessen Sie
    • long-after,singular,informal -- ich vermisse dich
    • long-after,singular,formal -- ich vermissen Sie
    • long-after,plural,informal -- ich vermissen euch
    • long-after,plural,formal -- ich vermissen Sie
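
The count of 8 is just 2 × 2 × 2, one factor per ambiguity. A minimal sketch that enumerates the readings (the label strings are taken from the list above; the variable and function names are made up for illustration):

```python
from itertools import product

# The three two-way ambiguities discussed above.
MEANINGS = ("fail-to-hit", "long-after")
NUMBERS = ("singular", "plural")
TONES = ("informal", "formal")


def readings():
    """Return every combination of meaning, number, and tone."""
    return list(product(MEANINGS, NUMBERS, TONES))


for meaning, number, tone in readings():
    print(f"{meaning},{number},{tone}")
```

Each further ambiguity multiplies the count again, which is why the number of candidate translations grows so quickly with sentence length.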

This is ignoring the fact that I miss(long-after) you(sing,informal) might not best be translated literally (ich vermisse dich); another ('literally' different, possibly idiomatic) translation might be better (du fehlst mir -- what's that, 'you missed by me'?)

So, observation 1: a machine-translation would suck. Observation 2: this is because one language might REQUIRE certain information that is absent from another (in the above example -- exact meaning of words (no 1:1 mapping from En : De); formality level; singularity/plurality of 'you'; etc).

Whilst machines can do part of the translation, humans would also be required to do some of the translation work.

Now, just as articles don't need to be done by a small team of volunteer experts -- but rather can be done by a large mainly-amateur group -- translation need not be done by a small team of volunteer translators, but can instead be done 'the wiki way':

  • someone contributes an edit to the English wikipedia
    • that edit is the text 'I miss you'
  • this gets translated into an intermediate machine language as:
    • [ formality=? sentence == I[perspective=1st; plurality=sing; case=subjective] miss[meaning=?] you[perspective=2nd; plurality=?; case=objective] ]
  • this gets translated into German automatically as:
    • ich verschiessen|vermissen euch|dich|Sie
    • this is added to the German wikipedia article
    • The text is, say, green to indicate a partial machine-translation, mouseover reveals the original English, and clicking opens a translation window
  • Now, any Germans who happen along might be able to narrow the possibilities down, either by dint of speaking English, or by the context it's in.
    • let's say the context rules out 'failing to hit';
      • German editor clicks 'verschiessen|vermissen'
      • box pops open, German editor clicks 'clarify verb', then selects 'vermissen'
      • intermediate machine language updated to [ formality=? sentence == I[perspective=1st; plurality=sing; case=subjective] miss[meaning=2] you[perspective=2nd; plurality=?; case=objective] ]
      • German text updated to ich vermissen euch|dich|Sie

Hokay, so what we have so far is the Germans being exposed to (hard-to-understand) contributions from the English editors, and a way for them to aid the translation effort (even, possibly, if they don't speak English).

Carrying on with the wiki/machine translation:

  • [ formality=? sentence == I[perspective=1st; plurality=sing; case=subjective] miss[meaning=2] you[perspective=2nd; plurality=?; case=objective] ]
  • this could be machine-translated into Japanese as:
    • (ok, this totally isn't going to be actual, correct Japanese)
    • watashi|ware|watakushi|boku|atakushi|atashi wa anata|etc|etc shita[?]
      • note that, as with German, auto-translation is marred by the missing 'formality' data -- watashi|watakushi|etc, the exact pronoun to use depends (partially) on formality, as does the ending of the verb shita[?]
      • Note also that the Germans' pinning-down of what exactly the verb means is of benefit to the Japanese, as there exists no single verb (?) in Japanese that could mean both 'fail to hit' and 'long-after'; but, the Germans have pinned down the exact meaning, allowing machine-translation to 'shita[?]'
      • note also that there's MORE missing data here from a Japanese POV:
        • is the speaker ('I') male or female, and what age?
        • verb ending in Japanese requires data on... umm... stuff?
  • anyhow, a Japanese person can tell from the surrounding context how formal this phrase is supposed to be
    • they click the phrase and a window pops-up.
    • select 'set formality level'
    • select formality level from a list of examples
  • intermediate machine language grabs this data and becomes [ formality=2 sentence == I[perspective=1st; plurality=sing; case=subjective] miss[meaning=2] you[perspective=2nd; plurality=?; case=objective] ]
    • this is machine-translated into Japanese as watashi wa anata shita[?]
    • AND, back to de.wikipedia, it is auto-translated (now that there's extra formality-data) as ich vermissen euch|dich (iow, 'Sie' is no longer an option, which we know thanks to the Japanese editors)

Just because they wrote it first doesn't mean that English-speakers can't help:

  • I miss you[translation-clarification requested]
  • click
  • is you singular (you) or plural (you lot)?
  • click
  • [ formality=2 sentence == I[perspective=1st; plurality=sing; case=subjective] miss[meaning=2] you[perspective=2nd; plurality=sing; case=objective] ]
  • there's now enough information (contributed by English, German, and Japanese people) to auto-translate this into German as ich vermissen dich, at which point it would be cool if the machine could spot the better(?) translation of du fehlst mir.
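
How the German pronoun narrows as editors in different languages fill in data can be sketched as a simple filter. The table, the function name, and in particular the rule that formality levels 1 and 2 count as informal are assumptions made only to match the walkthrough above (where formality=2 rules out 'Sie'); real formality levels would not map this neatly.

```python
# Hypothetical table of German object pronouns for "you".
GERMAN_YOU = [
    ("informal", "sing", "dich"),
    ("informal", "plural", "euch"),
    ("formal", "sing", "Sie"),
    ("formal", "plural", "Sie"),
]

# Assumption taken from the walkthrough: formality level 2 excludes "Sie".
INFORMAL_LEVELS = {1, 2}


def you_candidates(formality=None, plurality=None):
    """Return the pronouns still possible given what editors have supplied."""
    tone = None
    if formality is not None:
        tone = "informal" if formality in INFORMAL_LEVELS else "formal"
    result = []
    for row_tone, row_number, word in GERMAN_YOU:
        if tone is not None and row_tone != tone:
            continue
        if plurality is not None and row_number != plurality:
            continue
        if word not in result:
            result.append(word)
    return result


print(you_candidates())                               # nothing known yet
print(you_candidates(formality=2))                    # after the Japanese edit
print(you_candidates(formality=2, plurality="sing"))  # after the English edit
```

Each clarification only removes rows from the table, so the order in which the English, German, and Japanese editors contribute does not matter.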

Obviously, whilst this is going on, Germans, Japanese, and English people could all be contributing to the (same, shared across language-barriers) article, whilst translating each other's work as they go along.

Benefits of doing it this way are twofold:

manpower

The current way is done by a (relatively small) group of bilinguals; the wiki way is done by a much larger group of people. Also, why should bilinguals translate, when they could just read the original article in whichever language? Done the wiki way, the work is done by people who (presumably) have to decipher the text anyway to understand what they're reading (iow, these readers have more reason to help with translation than volunteer bilinguals do).

Both of these should result in more work being done on translation.

unification of wiki-editor base to work on one unified wikipedia

Not only does editor-base unification increase the amount of manpower dedicated to any given article (presumably increasing the speed of article improvement) by not splitting people up based on language, but it also allows, e.g., Germans access to German-language articles on topics whose references are mainly in French (i.e., French people write it, based on verifiable French references, but the Germans still have access to it), or simply lets all languages have access to articles on Japanese stuff written by the Japanese (so, presumably a better job would be done than if written by non-Japanese). It would also counter bias: foo.wikipedia is currently biased towards foo-o-phone ways of thinking and what's notable in the foo-o-sphere; unifying the editor base would reduce or eliminate this.

thoughts? oh, and treat the above as a wiki article (i.e., tinker with/fix it if you want) --81.152.234.187 17:49, 2 October 2009 (UTC)

Statistical approach

While I am certainly in favour of making use of computerised translation facilities, I am not sure that this should be done fully automatically without appropriate editing. My suggestion is that in the case of Wikipedia, a statistical approach could be implemented on the basis, say, of the priority and quality of articles (e.g. talk page criteria) as well as the number of accesses to articles. This would provide a means for deciding on which articles seemed to deserve inclusion, for example, in the English-language Wikipedia and which could be drawn from the English Wikipedia for inclusion in other languages. For the latter, articles in English about the country or countries in which a given language is spoken would merit special attention. The whole question of proper referencing could also be handled along these lines. This would mean that instead of finding a long string of articles in other languages on a given topic, those which have been given high marks for quality and priority would be distinguished from the others (perhaps by colour coding) and bi- or multi-lingual editors would be encouraged to draw on computer translations as a means for creating or expanding articles in other languages.
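
One way the ranking suggested above might be sketched: combine the talk-page quality and priority ratings with access counts into a single score and sort by it. The rating scales, the weights, and the use of a logarithm to damp page views are all assumptions made for illustration, not an existing Wikipedia mechanism.

```python
import math
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    quality: int        # hypothetical 0-5 rating from talk-page assessment
    priority: int       # hypothetical 0-4 importance rating
    monthly_views: int  # number of accesses to the article


def score(article, w_quality=1.0, w_priority=1.0, w_views=0.5):
    """Weighted sum; views are log-scaled so popularity does not dominate."""
    views_term = math.log10(1 + article.monthly_views)
    return (w_quality * article.quality
            + w_priority * article.priority
            + w_views * views_term)


def rank(articles):
    """Order articles from most to least deserving of translation effort."""
    return sorted(articles, key=score, reverse=True)


candidates = [
    Article("Stub about a village", quality=1, priority=1, monthly_views=9),
    Article("Featured country article", quality=5, priority=4,
            monthly_views=99_999),
]
for article in rank(candidates):
    print(article.title)
```

The top of such a list would be the articles offered first to bi- or multi-lingual editors, who would still edit the computer translation before it is published.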