Meeting Darwin’s Last Challenge: Matching Genes, Languages, and Geography (Part 1)

Sep 29, 2015

[Note: this is the first post of a three-part mini-series on this issue of genes, languages, and geography.]

In The Origin of Species, Charles Darwin conjectured that “the cultural transmission and differentiation of languages over the period of human history matches the biological transmission and differentiation of the genetic characters which define the populations of the world”. This issue has been taken up by the ERC-funded LanGeLin (Language and Gene Lineages) research project, led by Giuseppe Longobardi and involving linguists, population geneticists, and molecular anthropologists in York, Ferrara, and Bologna. The team has published a number of articles, mostly in linguistic journals (cf. Longobardi & Guardiano 2009, Longobardi et al. 2013); their most recent publication, “Across Language Families: Genome Diversity Mirrors Linguistic Variation Within Europe”, appeared in the American Journal of Physical Anthropology (Longobardi et al. 2015). While exploring the Russian-language website Генофонд.рф, dedicated to the peopling of northern Eurasia (the topic of one of my current courses!), I found a summary of Longobardi et al. (2015), written by Nadežda Markina. What particularly caught my attention, however, were two short critiques posted under Markina’s summary, one by linguist Svetlana Burlak (“Grammar is no better than lexicon”) and the other by geneticist Oleg Balanovsky (“Two similar studies arrived at opposite conclusions”). Below I offer my own thoughts on these two critiques, reproducing the relevant passages in my own translation. (Before I proceed, I must commend our Russian colleagues, whose financial circumstances are even more precarious than those of Western scholars, but who manage to produce and maintain high-quality websites aimed at popularizing today’s cutting-edge science, among them the site I mentioned above, Генофонд.рф.)


The main contribution of Longobardi et al. (2015) is in comparing linguistic diversity “inferred from data on both Indo-European and non-Indo-European languages” of Europe with genetic diversity and geographical distances between the relevant populations. It is important to note that in order to determine the linguistic distances among languages, the researchers relied not only on a refined lexical database with “several lexical roots … often listed for the same meaning” (also employed by Bouckaert et al. 2012), but crucially also on a grammatical comparison tool, PCM (Parametric Comparison Method), developed in their earlier work. The idea behind this method is to compare the values of binary syntactic parameters “drawn from a supposedly universal list, defining a structured variation space within the human capacity often labeled ‘universal grammar’ (UG) or ‘faculty of language.’” The authors point out that this method allows comparing “all languages, no matter how lexically distant … bypassing many problems arising with word collation”. Based on the lexical and the syntactic comparisons, corresponding trees of language relatedness, reproduced on the left, were constructed. The matrices of lexical and syntactic distances were then compared to corresponding measures of genetic and geographical distances, leading the authors to conclude that “contrary to previous observations, on the European scale, language proved a better predictor of genomic differences than geography”.
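To make the logic of such a comparison concrete, here is a minimal sketch in Python. It is not the authors' implementation: the language names, the parameter settings, the "genetic" distances, and the choice of distance measure (the share of differing values among parameters set in both languages) are all assumptions made for illustration. The correlation between the two distance tables is the kind of statistic that underlies a Mantel-style comparison; significance testing by permutation is omitted.

```python
from itertools import combinations
from math import sqrt

# Hypothetical settings: True/False for the two values of a binary parameter,
# None where a parameter is not applicable in a language (an assumption here).
PARAMS = {
    "LangA": [True, True, False, True, None, False],
    "LangB": [True, True, False, False, None, False],
    "LangC": [False, True, True, False, True, True],
}

def param_distance(x, y):
    """Share of differing values among parameters set in both languages."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no comparable parameters")
    return sum(a != b for a, b in pairs) / len(pairs)

def distance_table(settings):
    """Pairwise distances keyed by alphabetically ordered language pairs."""
    return {(p, q): param_distance(settings[p], settings[q])
            for p, q in combinations(sorted(settings), 2)}

def pearson(xs, ys):
    """Plain Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def matrix_correlation(d1, d2):
    """Correlate two distance tables over the same language pairs."""
    keys = sorted(d1)
    return pearson([d1[k] for k in keys], [d2[k] for k in keys])

syntactic = distance_table(PARAMS)

# Made-up "genetic" distances between the corresponding populations.
GENETIC = {
    ("LangA", "LangB"): 0.02,
    ("LangA", "LangC"): 0.09,
    ("LangB", "LangC"): 0.08,
}

r = matrix_correlation(syntactic, GENETIC)
print(syntactic)
print(round(r, 3))
```

In this toy setup, a value of `r` close to 1 would mean that populations whose languages differ on more parameters also tend to be genetically more distant; the same function could be applied to a table of geographical distances to see which predictor does better.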

Let’s now turn to Svetlana Burlak’s critique. Her first challenge concerns what she claims to be Longobardi et al.’s main hypothesis, namely that “grammar reflects language relatedness better than the lexicon”. She writes (translation mine):

“It appears not to be so.

1) Syntax (and grammar more generally) is more vulnerable than the lexicon to the influences arising during long-term bilingual contacts. If the contacts are shallow, then grammar is retained while the lexicon is borrowed—but that is the case with the culture-related lexicon, whereas works on glottochronology use basic lexicon, which is selected especially because of its much lower probability of borrowing than in the culture-related lexicon. And when the contacts are such that peoples live together for a long time and everybody knows, and frequently uses, both languages, grammar acquires many common features.”

Several things need clarifying here. First of all, as Giuseppe Longobardi pointed out to me in personal communication, the idea that grammar is a better reflection of language relatedness than the lexicon is not the working hypothesis of Longobardi et al. (2015) at all. Rather, they show that the results of lexical and syntactic comparisons “correlate a lot”.

But as Martin Lewis and I discuss in chapters 3 and 4 of our recent book, The Indo-European Controversy: Facts and Fallacies in Historical Linguistics, the idea that grammar may be a better indicator of language relatedness than the lexicon goes back over a hundred years, and for a good reason. Meillet (1908: 126) noted that “Les coincidences de vocabulaire n’ont en general qu’une très petite valeur probante” [“Coincidences of vocabulary are in general of very little probative value”]. One of the reasons for this has to do with borrowing: much of the literature on language contact reiterates that grammatical borrowing is far more limited and rare than lexical borrowing (cf. Moravcsik 1978, Thomason and Kaufman 1988, Matras 2000, Curnow 2001, Aikhenvald 2006, Matras 2009, Haspelmath and Tadmor 2009). As Burlak herself notes, grammatical “borrowing” occurs only in certain relatively rare circumstances of “long-term bilingual contacts”. Lexical items, in contrast, are borrowed much more readily.

Moreover, it must be noted that even the so-called basic lexicon, including kinship terms, pronouns, and the like, is not impervious to borrowing. For example, Yiddish words for ‘grandmother’, ‘grandfather’, and ‘nephew’ are borrowed from Slavic, while its word for ‘widow’ is a loanword from Hebrew. A better-known example is the English pronouns they, them, their, which were borrowed from Old Norse. Even the numerals 1 through 10, which James Parsons, a pioneer of historical/comparative philology, considered “convenient to every nation, their names … most likely to continue nearly the same, even though other parts of languages might be liable to change and alteration”, can be borrowed, as is the case with Romani numerals 7-9, borrowed from Greek. While the overall probability of borrowing in the basic lexicon is lower than in the culture-related lexicon (e.g. names of plants and animals, cultural innovations and technologies, and the like), it is not possible to exclude all borrowed lexical items a priori. As we have argued extensively in our book, despite all explicit efforts to exclude borrowings from a lexical dataset, results of studies based solely on lexical material exhibit biases due to loanwords unintentionally left in the dataset. As a result, malformed trees emerge, such as those that show Romani as an outlier within the Indo-Aryan branch of Indo-European or Russian as more distantly related to Ukrainian than Polish is. These discrepancies with the well-established relationships within the Indo-European family can be explained by high levels of lexical borrowing in languages whose position in such lexically-based phylogeny trees appears “wrong” (e.g. Romani and Russian).

Figure 6

But just as lexical borrowing can misshape family trees, wouldn’t the same be true of grammatical “borrowing”? Note that here and below, I put the term “borrowing” in quotes when it is applied to grammatical phenomena because grammatical “borrowing” has been shown to be radically different in nature from lexical borrowing (cf. Van Coetsem 1988, 2000, Thomason & Kaufman 1988, Louden 2000, Lucas 2012, Pereltsvaig 2015): lexical borrowing is typically driven by “recipient language agentivity”, while grammatical “borrowing” more often than not occurs as a result of “source language agentivity” (or “interference through shift”). Theoretically, it is indeed possible that some syntactic tree configurations would reflect horizontal transmission rather than family relationships. However, there are two reasons to believe that such grammatical “borrowing” would have a lesser impact on the tree configuration than borrowing in the lexical domain. First, Longobardi and his team have shown that “diachronic resetting of syntactic parameters is slower than lexical replacement” (Longobardi et al. 2015); in other words, given a period of time and a set of related languages, the grammatical dataset would contain fewer innovations relative to the shared common ancestor (including innovations due to both horizontal transmission and language-internal causes) than would the lexical dataset. Second, of the innovations found in the grammatical dataset, a smaller share is due to horizontal transmission than among innovations found in the lexical dataset because grammatical “borrowing” is relatively rare (as Burlak mentions in her critique). Thus, given the same list of languages and the same number of data points (parameters or lexical sets), a grammatical dataset will contain fewer borrowings than a corresponding lexical dataset, and so the impact of horizontal transmission will necessarily be less significant in the case of a grammatical comparison.
For example, the only departure from the traditional family tree in Longobardi and Guardiano (2009) that can be accounted for by grammatical “borrowing” concerns the position of Bulgarian: rather than being depicted as more closely related to Serbo-Croatian, it is shown as an outlier in the Slavic grouping (see the image on the left, reproduced from Pereltsvaig & Lewis 2015: 241; see also discussion on pp. 226-227). In Longobardi et al. (2015), the only two placements on the syntactic tree that can be potentially attributed to “malformation due to borrowing” are those of Irish as related to Germanic and Greek as related to Slavic. The Irish-Germanic connection could be an artefact of the putative grammatical influence of Celtic on English (cf. McWhorter 2008). (Most scholars, however, group Celtic languages with Italic/Romance ones, after the Germanic branch split off.) The Greek-Slavic link in Longobardi et al. (2015) may be a reflection of the Balkan Sprachbund. Note that in their tree based on lexical distances, both Irish and Greek are depicted as outliers among the Indo-European languages of Europe.
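The two-step argument above (fewer innovations overall, and a smaller share of those innovations due to contact) amounts to simple arithmetic. The sketch below spells it out in Python; the function name and every number in it are invented for illustration and are not taken from Longobardi et al. or any other study.

```python
def expected_borrowings(n_items, innovation_rate, contact_share):
    """Expected number of data points changed through contact.

    n_items         -- number of data points (parameters or lexical sets)
    innovation_rate -- fraction of data points changed since the common ancestor
    contact_share   -- fraction of those changes due to horizontal transmission
    """
    return n_items * innovation_rate * contact_share

# Illustrative values only: same number of data points in both datasets,
# with the grammatical dataset changing more slowly and less often via contact.
N = 200
lexical = expected_borrowings(N, innovation_rate=0.40, contact_share=0.30)
grammatical = expected_borrowings(N, innovation_rate=0.20, contact_share=0.10)

print(lexical, grammatical)
```

With these made-up rates, the lexical dataset would be expected to contain roughly 24 contact-induced items against roughly 4 in the grammatical one. The exact figures do not matter; the point is that the two effects multiply, so even modest differences in each rate produce a sizeable difference in the number of borrowings left in the data.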


The relative reliability of PCM, as compared with lexical comparison, is further buttressed by the fact that when three non-Indo-European languages are added to the mix, the syntactic tree of the 12 Indo‑European languages under consideration does not change. Earlier studies that examined the reliability of lexical comparison methods have noted that addition (or subtraction) of a language into the computation may result in a radical restructuring of the tree as concerns other languages (cf. Ringe et al. 2002). In Longobardi et al. (2015), the addition of Finnish, Hungarian, and Basque does not change the configuration for the 12 Indo-European languages. (The lexical tree does not include these three non‑Indo-European languages because it is based on an exclusively Indo-European lexical database; in theory, however, it is not impossible to create parallel lexical datasets for these three languages as well.)
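The stability check described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not say here which tree-building algorithm was used, so simple average-linkage clustering (UPGMA) stands in for it, and the five "languages" A to E and their distances are invented. The sketch builds a tree for a core set of languages, rebuilds it after adding an outlier, and checks whether the groupings among the core languages are the same.

```python
from itertools import combinations

def upgma_clades(names, dist):
    """Cluster by average linkage (UPGMA); return every merged group.

    dist maps frozenset({a, b}) to the distance between languages a and b.
    """
    def avg(c1, c2):
        # Mean pairwise distance between the members of two clusters
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [frozenset([n]) for n in names]
    clades = set()
    while len(clusters) > 1:
        # Merge the two closest clusters and record the new group
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merged = c1 | c2
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
        clades.add(merged)
    return clades

def restrict(clades, keep):
    """Project groupings onto a subset of languages, dropping trivial ones."""
    keep = frozenset(keep)
    return {c & keep for c in clades if len(c & keep) > 1}

def table(pairs):
    """Turn {(a, b): d} into the frozenset-keyed form used above."""
    return {frozenset(k): v for k, v in pairs.items()}

# Invented distances among four "core" languages ...
CORE = ["A", "B", "C", "D"]
core_dist = table({
    ("A", "B"): 0.10, ("C", "D"): 0.20,
    ("A", "C"): 0.50, ("A", "D"): 0.55,
    ("B", "C"): 0.52, ("B", "D"): 0.60,
})
# ... and between each of them and an added outlier E.
extra = table({
    ("A", "E"): 0.90, ("B", "E"): 0.92,
    ("C", "E"): 0.88, ("D", "E"): 0.91,
})

before = upgma_clades(CORE, core_dist)
after = upgma_clades(CORE + ["E"], {**core_dist, **extra})
stable = restrict(after, CORE) == before
print(stable)
```

If `stable` is true, adding the extra language has left the groupings among the core languages untouched, which is the property reported for the syntactic tree; a method sensitive to the added language would yield different groupings in `after` once it is restricted to the core set.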

Finally, note that the study reported in Longobardi et al. (2015) can be seen as a “verification and validation” stage for the PCM methodology, as it is being applied here to a set of languages whose relatedness is relatively well-understood. Unlike earlier studies based purely on lexical data, such as Gray and Atkinson (2003) or Bouckaert et al. (2012), Longobardi et al.’s syntactic tree does not contain any really weird groupings. In contrast, their lexical tree shows some “surprises”, such as Germanic being more closely related to Slavic than to Italic; this configuration, like other “malformations” mentioned above, can also potentially be accounted for by borrowing, in this case from Germanic into Slavic, at a fairly early stage.

To be continued…




Aikhenvald, A. (2006) Grammars in contact: a cross-linguistic perspective. In: A. Aikhenvald and R.M.W. Dixon (eds.) Grammars in contact: A cross-linguistic typology. Oxford: Oxford University Press. Pp. 1–66.

Bouckaert, R.; et al. (2012) Mapping the Origins and Expansion of the Indo-European Language Family. Science 337: 957-960.

Curnow, T. J. (2001) What can be ‘borrowed’? In: A. Aikhenvald and R.M.W. Dixon (eds.) Areal diffusion and genetic inheritance: Problems in comparative linguistics. Oxford: Oxford University Press. Pp. 412–436.

Gray, R. D. and Q. D. Atkinson (2003) Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426: 435-439.

Haspelmath, M. and U. Tadmor (2009) Loanwords in the world’s languages: A comparative handbook. Berlin: de Gruyter.

Longobardi, G. and C. Guardiano (2009) Evidence for syntax as a signal of historical relatedness. Lingua 119: 1679-1706.

Longobardi, G.; et al. (2013) Toward a syntactic Phylogeny of modern Indo-European Languages. Journal of Historical Linguistics 3(1): 122-152.

Longobardi, G.; et al. (2015) Across Language Families: Genome Diversity Mirrors Linguistic Variation Within Europe. American Journal of Physical Anthropology.

Louden, M. (2000) Contact-Induced Phonological Change in Yiddish: Another Look at Weinreich’s Riddles. Diachronica 17: 85-110.

Lucas, C. (2012) Contact-induced grammatical change. Towards an explicit account. Diachronica 29(3): 275-300.

Matras, Y. (2000) How predictable is contact-induced change in grammar? In: C. Renfrew, et al. (eds.) Time depth in historical linguistics. Cambridge, UK: McDonald Institute for Archaeological Research. Pp. 563-583.

Matras, Y. (2009) Language contact. Cambridge, UK: Cambridge University Press.

McWhorter, J. (2008) Our Magnificent Bastard Tongue: The Untold History of English. Gotham Books.

Meillet, A. (1908) Les Dialectes Indo-Européens. Paris: Librairie Ancienne, Honoré Champion, Editeur. Librairie de la Société de Linguistique de Paris.

Moravcsik, E. (1978) Universals of language contact. In: J. H. Greenberg (ed.) Universals of Human Language. Stanford, CA: Stanford University Press. Pp. 94-122.

Pereltsvaig, A. (2015) The Emergence of Embedded V2 in Yiddish from the Parametric Perspective. Paper presented at the Comparative Germanic Syntax Workshop, Chicago.

Pereltsvaig, A. and M. W. Lewis (2015) The Indo-European Controversy: Facts and Fallacies in Historical Linguistics. Cambridge University Press.

Ringe, D.; et al. (2002) Indo-European and Computational Cladistics. Transactions of the Philological Society 100(1): 59-129.

Thomason, S. G. and T. Kaufman (1988) Language Contact, Creolization and Genetic Linguistics. Berkeley, CA: University of California Press.

Van Coetsem, F. C. (1988) Loan Phonology and the Two Transfer Types in Language Contact. Dordrecht: Foris.

Van Coetsem, F. C. (2000) A General and Unified Theory of the Transmission Process in Language Contact. Heidelberg: Winter.

