The Malformed Language Tree of Bouckaert and His Colleagues

May 24, 2014 by

[This post was originally published in October 2012]

In the previous post, we examined the first element in the research of Bouckaert and his colleagues: identifying and comparing cognate sets, which is a prerequisite for the second step, namely constructing the linguistic family tree and putting a time scale on it. In this post, we will focus on the problems surrounding their diagram of the branching Indo-European languages; the next post will focus on dating issues. But first let’s consider how their methodology works.

Essentially, the method they employ is based on comparing cognate sets and calculating the number of shared cognates, which allows the grouping of languages into subsets based on the assumption that the more cognates a given pair of languages shares, the closer their relationship. A phylogenetic tree is constructed to represent those relative degrees of relatedness. The number of shared cognates also allows the researchers to estimate relative dating of splits on the tree: if language A and language B share 97% of cognates (i.e. only 3 out of 100 items are not cognates), whereas language C and language D share merely 85% of cognates (i.e. 15 out of 100 items are not cognates), the split between C and D is taken to be five times older than the split between A and B. After relative dating has been so established, absolute timing is determined by factoring in known dates of specific historical events that are thought to be associated with splits on the tree. A caveat must be made here: any dating of the branching patterns of a linguistic tree presupposes that splits between separate languages are discrete events that happen at a certain point in time. In actuality, they are not, as language divergence is a gradual process. Certain historical dates can be assumed as approximate divergence dates, but only to a limited extent. For example, 1492 CE, the year Jews were expelled from Spain, can be taken as the divergence date between Spanish and Balkan Ladino.* Prior to the expulsion of the Jews from Spain, their language, though containing some borrowings from Hebrew, was virtually indistinguishable grammatically from that of the Christian Spaniards (see also chapter 12 of Languages of the World: An Introduction). Another example of a historical event used to estimate the time of a divergence on the tree is the split of Dutch and Afrikaans, dated to the establishment of the Cape Colony (though Afrikaans did not derive from Dutch in general but rather from certain specific dialects of that language).
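The relative-dating arithmetic described above can be made concrete with a minimal Python sketch. It assumes, as a deliberate simplification, that the age of a split is proportional to the number of non-cognate items. The function names and the 500-year calibration figure are hypothetical, chosen only for illustration, and the sketch should not be read as the model Bouckaert et al. actually ran.

```python
def count_shared(classes_a, classes_b):
    """Count meanings whose cognate class is the same in both word lists.

    Each argument maps a meaning (e.g. "stone") to a cognate-class label.
    """
    common_meanings = classes_a.keys() & classes_b.keys()
    return sum(1 for m in common_meanings if classes_a[m] == classes_b[m])


def split_age_ratio(shared_recent, shared_old, total_items):
    """Return how many times older the second split is than the first.

    Simplifying assumption: age is proportional to the number of
    non-cognate items out of `total_items`.
    """
    non_cognate_recent = total_items - shared_recent
    non_cognate_old = total_items - shared_old
    if non_cognate_recent <= 0:
        raise ValueError("the more recent pair must differ in at least one item")
    return non_cognate_old / non_cognate_recent


def calibrate(ratio, known_age_years):
    """Turn a relative age into years, given the known age of the reference split."""
    return ratio * known_age_years


# Figures from the text: A/B share 97 of 100 items, C/D share 85 of 100.
ratio = split_age_ratio(97, 85, 100)   # 15 non-cognates against 3
print(ratio)

# Hypothetical calibration: if the A/B split were known to be 500 years old.
print(calibrate(ratio, 500))
```

Note that the whole chain rests on `count_shared` being fed true cognates: every borrowing miscounted as a cognate makes a pair of languages look younger and closer than it is.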


In the ideal situation, the procedure described above can produce a reasonably serviceable model of language spread and divergence through time. But things are rarely as smooth in practice. As a brief demonstration of the potential pitfalls encountered by such a model, let’s consider how it works, and how it breaks down, in regard to a small selection of lexical items from seven languages spoken in Vanuatu and nearby regions of Papua New Guinea, shown in the image on the left. Colored cells in the table represent words in each lexical set that appear to be similar, whether through common descent (true cognates) or borrowing (which we cannot distinguish at this point, as is the case in regard to many of the supposed cognates used in Bouckaert et al.’s article as well). As can be seen, the Motu, Sowa, Mota, and Raga words for ‘wind’ are similar, but the Hiw, Waskia, and Amara counterparts are not. Mota and Raga have the most similar items, eight each. Next come Sowa with six similar items, Hiw with five, and Motu with four. Waskia and Amara appear to have no shared vocabulary, at least not in this selection.


Based on this data alone, we can hypothesize that Mota and Raga are the most closely related languages, with Sowa a more distant “cousin”, Hiw more distant yet, and Motu least closely related of all; Waskia and Amara, on the other hand, have to be treated as unrelated to each other and to the rest of the languages under consideration. (Relative dating of the splits can also be established, but we will not do so here.)
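The count-based reasoning of this step can be sketched in a few lines of Python. The counts are the ones given in the text for the sample table; the function name `naive_ranking` is hypothetical, and the rule it applies (more similar items means closer kinship, zero means unrelated) is exactly the naive assumption under discussion, not a recommended method.

```python
# Number of similar-looking items per language, as reported for the sample table.
similar_items = {
    "Mota": 8,
    "Raga": 8,
    "Sowa": 6,
    "Hiw": 5,
    "Motu": 4,
    "Waskia": 0,
    "Amara": 0,
}


def naive_ranking(counts):
    """Order languages from most to least similar items.

    Languages with no similar items are set aside as 'unrelated',
    which is what a purely count-based approach would conclude.
    """
    related = sorted(
        (lang for lang, n in counts.items() if n > 0),
        key=lambda lang: -counts[lang],
    )
    unrelated = [lang for lang, n in counts.items() if n == 0]
    return related, unrelated


related, unrelated = naive_ranking(similar_items)
print("closest to most distant:", related)
print("treated as unrelated:", unrelated)
```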





But if one examines additional vocabulary elements along with grammatical properties of these languages, as well as their known histories, a very different picture is revealed, schematized in the chart on the left. While Hiw, Sowa, Mota, and Raga are indeed grouped together into the East Vanuatu subset, Motu is more closely related to Amara than to the first four languages. Both Motu and Amara belong to the Western Oceanic grouping, although they belong to distinct subsets within that grouping, which accounts in part for their failure to share any similar words in our sample. Together with other clusters of languages, East Vanuatu and Western Oceanic are members of the Oceanic branch of the Austronesian family. The Oceanic branch includes over 500 languages, among them the Polynesian tongues discussed in an earlier post. Waskia, the seventh language in our set, belongs to the Trans-New Guinea language family, completely unrelated to Austronesian. The discrepancy between the tree produced by comparing potential cognates and the tree established by a more thorough analysis highlights an essential point: while words are an important element of any language, grammatical patterns (that is, how words are put together) are equally important. We shall return to this crucial point below.


As with our Austronesian/Papuan model, the lexicostatistical methods employed by Bouckaert et al. produce a tree that has a questionable configuration. For example, the authors group Armenian and Tocharian together, with the split of the hypothesized Tocharo-Armenian proto-language dated to approximately 3200 BCE (5,200 years ago). As noted by Alexei Kassian of the Institute of Linguistics, Russian Academy of Sciences in his critique, “nobody has ever proposed such a grouping, which also directly contradicts not only traditional linguistic arguments, but also formal lexicostatistics as well”. The study also groups Frisian with Flemish and Dutch, rather than English, which many Germanic scholars find objectionable. This grouping is probably due to the fact that Frisian and Dutch have heavily borrowed from one another. Thus, Frisian being grouped together with Flemish and Dutch suggests that a large number of borrowings were mistaken by Bouckaert et al. for cognates. As we shall see below, many other odd configurations on Bouckaert et al.’s tree likely derive from the same problem. It should also be pointed out that the authors of the Science article make a critical mistake in that they do not distinguish shared innovation from shared retention. It has become a cornerstone of historical linguistics that only shared innovations have probative value when it comes to language classification. Otherwise, a variety of English which uses stone has to be considered closer to German (which uses the cognate stein) than to a variety of English which uses (the French loanword) rock. However, because stone and stein are shared archaisms, they are of no worth for classification.
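The stone/stein/rock example can be spelled out as a small Python sketch contrasting a naive cognate match with one that counts only shared innovations. The `Entry` type, the class labels, and the status tags are hypothetical conveniences for illustration; the etymological statuses simply follow the description given here.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    form: str           # the word as used in that variety
    cognate_class: str  # label of the etymological set it belongs to
    status: str         # "retention", "innovation", or "loan"


# One meaning ('stone') in three varieties, tagged as described in the text.
stone_words = {
    "English (stone)": Entry("stone", "STONE", "retention"),
    "German": Entry("stein", "STONE", "retention"),
    "English (rock)": Entry("rock", "ROCK", "loan"),
}


def naive_match(a, b):
    """Count any shared cognate class as evidence of closeness."""
    return a.cognate_class == b.cognate_class


def probative_match(a, b):
    """Count a shared cognate class only if it is an innovation in both."""
    return naive_match(a, b) and a.status == b.status == "innovation"


eng_stone = stone_words["English (stone)"]
german = stone_words["German"]
eng_rock = stone_words["English (rock)"]

# The naive count pairs one English variety with German rather than with English.
print(naive_match(eng_stone, german), naive_match(eng_stone, eng_rock))
# Restricted to shared innovations, this item is no evidence either way.
print(probative_match(eng_stone, german), probative_match(eng_stone, eng_rock))
```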


Without going through each split on the tree one by one, I will point out some of the most glaring errors, starting with those concerning the Romance languages. According to Bouckaert et al.’s tree, Romanian was the first language to separate from Latin, while Sardinian is grouped together with Italian, Romansch, and Ladin in a significantly later split (we will return to the Romanian problem in the next post). Most traditional classifications, such as the one schematized in this Wikipedia chart, treat Sardinian as the first language to have branched off the Romance sub-tree, and for good reason: Sardinian has a substantial non-Indo-European substrate, which probably indicates that Sardinian Latin began to diverge from classical Latin as soon as the language was imposed on the island in 238 BCE. In regard to the tongues of the Italian Peninsula, Friulian is usually grouped with Romansch, Ladin, and other speech varieties spoken in Northern Italy, which are more closely related to Gallo-Romance (including French) than to standard Italian, as the latter is based on the Tuscan dialect, belonging to the South Romance grouping. Among the commonalities that Northern Italian dialects share with (Standard) French are grammatical effects resulting from contacts with the Germanic languages of the Franks, Burgundians, and Longobards, such as using subject-auxiliary inversion to form yes/no questions (illustrated by English Have you…? or French Avez-vous…?) and the obligatory nature of a subject in every sentence (illustrated by the ungrammaticality of *Rains in English or the corresponding *Pleut in French). Northern Italian dialects share these patterns with English and French; southern Italian dialects do not.


Another unsupportable configuration of the Science model concerns the internal grouping of the Slavic languages. According to the classification employed by Bouckaert et al., Byelorussian and Polish are sibling tongues. They contend that Byelorussian is more distantly related to Ukrainian and Russian than it is to Polish, but also that Polish is more closely related to Ukrainian and Russian than it is to Czech, Slovak, or Lusatian. Such a notion contradicts the well-established classification of Slavic languages into South Slavic (including Slovenian, Serbo-Croatian, Bulgarian, and Macedonian), East Slavic (including Russian, Ukrainian, and Byelorussian), and West Slavic (including Czech, Slovak, Lusatian, and Polish!). This established classification scheme is based not merely on cognates but also on phonology and grammar. For example, one of the sound patterns that characterizes East Slavic languages, but not Polish, is so-called pleophony (in Russian, polnoglasie), that is, having an extra vowel/syllable in words like the Russian moloko, Byelorussian malako, and Ukrainian moloko for ‘milk’ (all stressed on the last syllable). Polish, like other West Slavic and South Slavic languages, does not have this feature; hence, the Polish mleko, the Czech mleko, the Slovak mlieko, the Slovenian, Serbo-Croatian, and Macedonian mleko, and the Bulgarian mljako. Additional arguments, which are numerous, for classifying Polish as a West Slavic language, more closely related to Czech or Slovak than to Byelorussian, Ukrainian, or Russian, are discussed in detail in Sussex & Cubberley (2006). Even earlier glottometric approaches similar to those employed by Bouckaert et al. failed to group Polish with Byelorussian.** Bouckaert et al. appear to have hypothesized a close tie between Polish and Byelorussian due to a high degree of “horizontal transmission” (i.e. borrowing) that characterized the two languages for centuries. From the 14th century to the late 18th century, the Byelorussian lands were politically tied to Poland, first as part of the Grand Duchy of Lithuania, in personal union with Poland, and later as part of the Polish–Lithuanian Commonwealth. Significant ethnic and linguistic mixing characterized these lands as recently as 100 years ago.

In regard to the higher level of Slavic classification, Bouckaert et al. group West Slavic and East Slavic languages together, separating them from the South Slavic languages. While this classification is supported by many patterns in sound structure, morphology, and word order, other historically based patterns support different classification schemes. Some scholars group East and South Slavic together as opposed to West Slavic; others draw a distinction between North Slavic (including Polish, Sorbian, and the three East Slavic languages) and South Slavic (including, perhaps confusingly, the traditional South Slavic languages as well as Czech and Slovak). Confusion here is generated by the fact that some linguistic features cut across internal Slavic boundaries: for instance, fixed-stress languages include those of the West Slavic branch and Macedonian. Some of these cross-cutting features are due to religious affiliations: languages whose speakers tend to be Orthodox Christians favor Greek-based lexis, while languages whose speakers are non-Orthodox “often show a greater preference for indigenous or Western lexis” (Sussex & Cubberley 2006, p. 9). The use of a Cyrillic or Latin-based alphabet also lines up with religious affiliation. Other grammatical similarities that cut across the traditional Slavic language-family categories are due to extensive borrowing, as has been demonstrated in the case of the Balkan sprachbund (i.e. an area where linguistic features are shared across family boundaries). Two Slavic languages found in the Balkan sprachbund area, Bulgarian and Macedonian, share a number of features with non-Slavic neighboring languages such as Romanian and Albanian, most notably their reliance on suffixed articles. Compare the Bulgarian grad-at ‘the city’ with the Romanian counterpart oraş-ul (the hyphen is used to show the boundary between the root and the suffixed article): although the suffixed articles are themselves different (Bulgarian -at vs. Romanian -ul), the use of such suffixed articles in general is distinctive, limited in Europe to the Balkans, Scandinavia, and Lithuania. The difficulties encountered in trying to fix the higher-order classification of the Slavic languages indicate that a tree-based model may not be the ideal tool to represent language relatedness. As a result, newer network- and wave-based models are gaining ground in historical linguistics.


The internal groupings within the Indo-Aryan branch of the Indo-European family proposed by Bouckaert et al. are similarly unexpected. As can be seen in the figure to the left, which takes a segment of their tree and color-codes each language to represent the established classification scheme, hardly any of the traditional groupings are reproduced by Bouckaert et al.’s algorithm. The only cluster of languages that corresponds to the schema created by scholars of the Indic languages consists of Assamese, Oriya, and Bengali, which are all members of the Eastern Zone. Bihari, also traditionally classified as a member of this group, finds itself loosely related to Hindi and Urdu (both Central Zone languages), as well as to Lahnda, a Northwestern Zone language. Singhalese, which is typically treated as a member of a different grouping altogether (not color-coded here), is linked by Bouckaert et al. with Kashmiri, another Northwestern Zone language, more closely related to Lahnda and Sindhi. (Many linguists classify Kashmiri as a Dardic language, which would place it on a highly distinct branch of the Indo-Aryan languages.) Breaking from all established classification schemes, Bouckaert et al. group Sindhi with Marwari, a member of the Central Zone, and Lahnda with Urdu. They further claim that Urdu and Hindi are not particularly closely related, their split dating to about 1200 CE. In actuality, Hindi and Urdu are basically mutually intelligible, and as a result many people consider them to be different forms of the same language. We know from unassailable historical sources, moreover, that these two languages broke from the Hindustani dialect continuum only in the 19th century. Similar mistakes are found elsewhere on the tree. Bouckaert et al., for example, treat Gujarati and Marathi as relatively closely related, but scholars with actual expertise in this area argue that Gujarati is more closely related to Hindi, Urdu, and Marwari than it is to Marathi.

Yet the biggest oddity of the modeled linguistic tree involves its treatment of Romani, the language of the Gypsies, or Roma people. According to Bouckaert et al., Romani was the first Indo-Aryan language to split off from the rest of the tree, around 1500 BCE (3,500 years ago). The improbability of this date was noted by biologist and blogger Razib Khan. What Khan does not mention, however, is that linguistic analysis alone can demonstrate its absurdity. Here one needs to examine the evolution of the Romani grammatical gender system and of the corresponding systems in other Indo-Aryan languages. Earlier forms of these languages had three genders (masculine, feminine, and neuter) inherited from proto-Indo-European. But for reasons that need not concern us here, the Indo-Aryan languages, including Hindustani and others, lost the neuter gender. From written sources we know that this change occurred some time around 1000 CE. Once the neuter gender was lost, the formerly neuter nouns were reassigned to either the masculine or feminine gender, seemingly at random. Modern Romani too has only two genders, masculine and feminine. As in Hindustani, and hence modern Hindi, the majority of the formerly neuter nouns became masculine in Romani. Crucially, gender reassignment occurred in the same manner in Romani as in Hindi. For instance, agni ‘fire’, which was neuter in Prakrit, the ancestral language of modern Indo-Aryan languages, became the feminine āga ‘fire’ in Hindi and likewise the feminine jag ‘fire’ in Romani. Such parallel changes apply to hundreds of formerly neuter nouns; thus, it is statistically all but impossible that they could all be reassigned to the same gender in Hindi and Romani purely by chance. The simplest explanation is that Romani separated from the languages of northern India after the loss of the neuter gender around 1000 CE and the reassignment of nouns, which happened only once, with both Hindi and Romani inheriting the novel forms. As a result, Romani could not have branched off from the languages of northern India before the 11th century CE.
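To get a feel for how improbable a chance match would be, here is a rough Python sketch. It assumes (purely for illustration) that each formerly neuter noun was reassigned independently in the two languages, becoming masculine with the same probability in each; the function name, the probabilities, and the noun counts of 100 and 200 are hypothetical stand-ins for the "hundreds" of nouns mentioned above.

```python
def chance_of_full_agreement(n_nouns, p_masculine=0.5):
    """Probability that two languages assign the same gender to every noun by chance.

    Assumes independent reassignment in each language, with each noun
    becoming masculine with probability `p_masculine` and feminine otherwise.
    """
    # A single noun agrees if both languages pick masculine or both pick feminine.
    per_noun_agreement = p_masculine ** 2 + (1 - p_masculine) ** 2
    return per_noun_agreement ** n_nouns


# Illustrative values only: a fair coin, then a strong bias toward masculine.
for n in (100, 200):
    for p in (0.5, 0.8):
        print(n, p, chance_of_full_agreement(n, p))
```

Even with a heavy bias toward the masculine, which makes accidental agreement more likely, the probability of a complete match across a hundred nouns remains vanishingly small under these assumptions.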

Dating the Romani split to a period 2,500 years later than the one proposed by Bouckaert et al. receives further support from genetic studies, which place the “founding event” (that is, the Roma exodus from India) “approximately 32-40 generations ago”. Assuming 25-30 years per generation, this figure nicely matches the 1000 CE date derived from linguistic studies (see Morar et al. 2004). Bouckaert et al. probably classify Romani as the “odd man out” among the Indo-Aryan languages due to the fact that it picked up much of its vocabulary from languages it came into contact with during its journey from India to Europe: Armenian, Turkish, Persian, Kurdish, and especially Greek. Among the many Greek loanwords in Romani are drom ‘road’ from the Greek drómos ‘road’, zumin ‘soup’ from the Greek zumí ‘soup’, xoli ‘anger’ from the Greek xolí ‘anger’, as well as grammatical loanwords like pale ‘again’ from the Greek pale ‘again’, komi ‘still’ from the Greek akómi ‘still’ and numerals efta ‘seven’, oxto ‘eight’ and enja ‘nine’.
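The generation arithmetic behind the genetic estimate is easy to check with a short Python sketch. The reference year of 2004 (the publication year of Morar et al.) is an assumption made here to convert "generations ago" into calendar dates, and the function name is hypothetical.

```python
def founding_window(gen_min, gen_max, years_min, years_max, reference_year):
    """Return (earliest, latest) calendar years for an event dated in generations.

    The earliest date pairs the most generations with the longest generation
    time; the latest date pairs the fewest with the shortest.
    """
    earliest = reference_year - gen_max * years_max
    latest = reference_year - gen_min * years_min
    return earliest, latest


# 32-40 generations at 25-30 years each, counted back from 2004 (assumed).
earliest, latest = founding_window(32, 40, 25, 30, 2004)
print(earliest, latest)
```

The window of 800 to 1,200 years before the reference year comfortably brackets the date of around 1000 CE suggested by the loss of the neuter gender.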

Yet again, we see that accepting the model proposed by Bouckaert and his colleagues requires one to believe not just three but actually dozens of impossible things before breakfast (with apologies to Lewis Carroll). We shall examine more of their linguistic failings in tomorrow’s post.



* After 1492, Ladino in the Balkans was wholly isolated from Spanish (making it a perfect separation point), whereas Ladino in North Africa (esp. Morocco) remained in contact with Spanish.

** Both Polish and Byelorussian were shown by these earlier studies to have close ties to Ukrainian, but lesser connection to each other, as reported in Sussex & Cubberley (2006, p. 474).


Morar, Bharti; David Gresham; Dora Angelicheva; Ivailo Tournev; Rebecca Gooding; Velina Guergueltcheva; Carolin Schmidt; Angela Abicht; Hanns Lochmuller; Attila Tordai; Lajos Kalmar; Melinda Nagy; Veronika Karcagi; Marc Jeanpierre; Agnes Herczegfalvi; David Beeson; Viswanathan Venkataraman; Kim Warwick Carter; Jeff Reeve; Rosario de Pablo; Vaidutis Kucinskas and Luba Kalaydjieva (2004) “Mutation history of the Roma/Gypsies”. American Journal of Human Genetics 75(4): 596-609.

Pereltsvaig, Asya (2012) Languages of the World: An Introduction. Cambridge University Press.

Sussex, Roland and Paul Cubberley (2006) The Slavic Languages. Cambridge University Press.

