Does Google Translate speak "like a 10-year-old"?

[Thanks to Martin W. Lewis for inspiration for this post]

In several earlier posts (see here, here and here), I’ve already touched on the topic of Google Translate and its failures to… translate. But the argument continues, with more and more GT propaganda pieces appearing in the popular media on a regular basis. Here’s one of the latest examples: an article by Jeremy Kingsley in Slate. According to Mr. Kingsley, Google Translate “already speaks 57 languages as well as a 10-year-old” — but does it?!

The typical defence offered by Google Translate advocates is that it allows one to “get the gist of it”, but as I showed in the earlier posts, to say that, one has to define “the gist” very loosely. Here’s an additional example:

One question though: what interests our government have promoted to integer island. Nothing in their history, customs and Muslim clan, political and economic … neighbor, do not close to us except to ostracize them fly abroad. This department will remain a burden for our country. So to answer this question: why this referendum proposed pipe?

Did you get the gist of it? This is the GT-produced French-to-“English” “translation” (in fact, it’s neither English, nor a translation) of a comment from a forum discussion of Mayotte becoming an overseas department (département d’outre-mer) of France (the event itself is discussed in detail in an excellent post in Martin W. Lewis’s GeoCurrents blog).

Judging by GT’s translation, the commentator is not pleased with Mayotte’s incorporation into France, but why? It is hardly clear from this feeble attempt at translation. So here’s the original French passage:

“Une question tout de même : quels intérêts notre gouvernement a-t-il favorisés pour intéger cette île . Rien dans leur histoire , moeurs musulmanes et claniques , contexte politique et économique voisin… , ne les rapproche de nous sauf pour ostraciser l’étranger qui les vole . Ce département restera un poids pour notre pays. Donc qui répondra à cette question : pourquoi avoir proposé ce référendum pipé ?”

And here’s a human-made translation:

A question all the same: what interests has our government encouraged to integrate this island? Nothing in their history, Muslim and clannish morals, or local political or economic context moves them closer to us except to ostracize the foreigner who robs them. This department will remain a weight on our country. So who will respond to this question: why was this loaded referendum proposed?

This exercise in translation, GT and human, highlights the falsity of Mr. Kingsley’s statement (which he argues for throughout the article) that GT “speaks like a 10-year-old”. Nothing could be further from the truth. In fact, GT and a human child (speaking any language, it doesn’t matter which) handle language completely differently. And it shows in the results.

As you can see in the passage above, GT does best with “the big words”, rather than “the small words” or the grammar. It correctly handles gouvernement, politique, économique, ostraciser, référendum (what 10-year-old really knows such words?!). In contrast, it fails to translate moeurs, contexte and vole; is thrown by the typo in intéger (for intégrer); and mishandles the idiomaticity of pipé, among other things.

The reason that GT can handle “the big words”, which would get you a lot of points in Scrabble, but not the “small words” is directly related to length. Longer words tend to be newer and less frequent in the language (as words “shrivel” with age and use). Such words are also less likely to be polysemous (i.e. have multiple meanings) or be part of homonym or homograph sets (homonyms are words that are pronounced the same but mean different things, e.g. mussel and muscle; homographs are spelled the same, but are pronounced and interpreted differently, e.g. the verb tear and the noun tear). Statistically speaking, the longer the word, the less likely it is to coincide in pronunciation (or in spelling) with another word. And, of course, the fewer different meanings, the easier it is to “translate”.

This is illustrated beautifully with the verb voler (vole in the above French passage). As a transitive verb, it means ‘to steal, to rob’ and as an intransitive verb, it means ‘to fly’. Any human — including a 10-year-old and even younger children — will be able to choose the correct translation because humans process the structure of the sentence. We just can’t help it; GT just can’t do it. In our passage, the subject of vole is the relative pronoun qui and the object is a pronominal clitic les (which appears before the verb, as pronominal clitics are known to do in French). Thus, this verb here is unmistakably ‘to rob’, not ‘to fly’.
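To make the rule concrete, here is a deliberately tiny Python sketch of the structural cue just described: if a direct-object clitic sits right before a form of voler, pick ‘rob’; otherwise fall back to ‘fly’. The function name gloss_voler, the clitic list and the pre-tokenized input are all my own assumptions for illustration. This is not how GT or any real translation system works, and a real parser would also have to handle full noun-phrase objects, elided forms such as l’, and much more.

```python
# Toy illustration only: a hypothetical rule, not a real translation system.
# Assumed, deliberately incomplete list of direct-object clitics.
OBJECT_CLITICS = {"me", "te", "le", "la", "les"}


def gloss_voler(tokens, verb_index):
    """Pick an English gloss for a form of 'voler' at tokens[verb_index].

    A preverbal direct-object clitic signals the transitive reading 'rob';
    without one, this sketch falls back to the intransitive reading 'fly'.
    """
    previous = tokens[verb_index - 1].lower() if verb_index > 0 else ""
    return "rob" if previous in OBJECT_CLITICS else "fly"


# The relative clause from the passage: subject 'qui', object clitic 'les'.
print(gloss_voler(["l’étranger", "qui", "les", "vole"], 3))
# A made-up intransitive example for contrast.
print(gloss_voler(["l’oiseau", "vole"], 1))
```

Even this crude check picks ‘rob’ for qui les vole, which is exactly the distinction GT missed. The point is not that the fix is easy in general, only that the cue is structural rather than statistical.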

This also goes to show that, contrary to Mr. Kingsley’s claim that “there are more exceptions, qualifications, and ambiguities than rules and laws to follow”, ambiguities are often subject to rules too, even if these rules are more subtle than Mr. Kingsley would like.

Let’s note also that “the big words” are less likely to be grammatically irregular than “the small words”, which again — given the inability of GT to handle grammar — makes “the big words” easier to translate. Hence, the future tense form of the verb rester ‘to remain’ (restera in the passage) is translated by GT properly whereas the future tense form of the verb répondre ‘to respond’ (répondra in the passage) is not.

Furthermore, GT fails to analyze and/or render the syntactic structure of the original passage. It mishandles a direct question in the first sentence; two instances of coordinate structures in the second sentence; two instances of transitive structures (treated by GT as intransitives), also in the second sentence; another instance of a direct question in the fourth sentence; a complex analytical tense, also in the fourth sentence; and a participial modifier at the very end. In fact, only the third sentence, Ce département restera un poids pour notre pays, is translated correctly or with anything resembling grammatical English.

In contrast to GT, children handle grammar even if they don’t understand some of “the big words”. All the aspects of grammar that GT fails to “understand” (questions, coordinations, modification structures, verb-argument structures) are easily understood by 10-year-olds and even younger children. In fact, an average 5-year-old can do better than mistake a transitive verb for an intransitive one, even if the child does not know what the meanings (transitive or intransitive) are. Moreover, there is clear evidence for so-called syntactic bootstrapping: children learn the meaning of words that they don’t yet know by working them out from their syntactic context, much as we process Lewis Carroll’s famous lines:

’Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

In other words, although children acquiring their native tongue do not (yet) know terminology like “subjects, nouns, verbs”, they do, contrary to Mr. Kingsley, “deconstruct sentence structure” and figure out the patterns behind those structures.

So “will Google’s computers understand language better than humans?” Hardly, if they don’t even attempt to understand language but simply look for the statistically best matches.

