Meeting Darwin’s Last Challenge: Matching Genes, Languages, and Geography (Part 2)

In the previous post, I began to analyze Svetlana Burlak’s critique of Longobardi et al. (2015), posted on the Генофонд.рф website. As I pointed out, the idea that grammar is a better indicator of language relatedness finds new support if we compare the lexical and syntactic relatedness trees they produce for the same 12 Indo-European languages of Europe. (Note, however, that Longobardi and his team stress not that grammar is better than the lexicon but that the two types of comparisons correlate very closely in their study.) In my opinion, although their lexical and syntactic relatedness trees exhibit some effects of horizontal transmission (aka borrowing), the syntactic tree accords better with the traditional classification of these languages than does the lexical tree. Burlak’s second point of critique concerns the selection of syntactic parameters in this study; she writes (translation mine):

“The authors consider 56 grammatical features, but each grammar contains many more features. On what criteria do the authors base their selection of these particular features as significant?”

But the authors answer this question in their paper, or rather refer the curious reader to their earlier work (Longobardi and Guardiano 2009, Longobardi et al. 2013) and the supporting materials freely available online. (Has Burlak only read Nadežda Markina’s Russian-language summary of Longobardi et al.’s 2015 article, with which her critique is posted, I wonder?) Thus, Longobardi and Guardiano (2009: 1687) are in agreement with Burlak in assuming that numerous syntactic parameters exist: “UG parameters number at least in the hundreds, although we are too far from being able to make precise estimates”. (Note, however, that not all researchers agree; for example, Mark Baker has claimed that the number of parameters is much smaller.) Yet, since an exhaustive list of syntactic parameters that describe all the variation among all the world’s languages that ever existed has not yet been devised, a study based on the PCM (Parametric Comparison Method) by necessity relies on a subset of the eventual list of syntactic parameters. In order to avoid cherry-picking random parameters from those that have been proposed in the syntactic literature (which, as discussed in Nichols 1996, “poses serious probabilistic problems”), Longobardi and Guardiano (2009), as well as subsequent studies by the team, take an alternative route: they limit their study to a universal subdomain which is “intrinsically well defined within syntactic theory itself (and sufficiently vast to hopefully be representative)”. The subdomain they select is that of the nominal structure. All 56 parameters they examine concern features expressed in noun phrases, such as gender, number, and definiteness, and the order of noun-phrase-internal elements such as numerals, attributive adjectives, and the like. Since the application of these parameters is limited to the nominal domain, they are presumably unrelated to other parameters that concern the clausal domain.
The choice of the nominal domain parameters is also attributable to the fact that these parameters have been relatively well‑studied in the literature since Abney’s (1987) seminal dissertation, across groups of unrelated languages, as well as on the micro-comparative level (see, e.g., Julien 2005 on noun phrases in Scandinavian languages).

A detailed discussion of each of the 56 parameters examined by the LanGeLin project would take too long, so I will mention just three of them to give my readers a flavor of what these parameters are about. For example, parameter 9 (p9) concerns the need for overt material in D° in the case of proper names. In languages with the “+” value, either the proper name itself or a (placeholder, “dummy”) article must appear before a possessor or prenominal adjective, whereas in languages with the “−” value, there is no need for overt material in D°. Italian has the “+” value, while English has the “−” value. Thus, the exact counterpart of the English old McDonald (as in “Old McDonald had a farm…”) in Italian is ungrammatical: *vecchio McDonald. Instead, either the proper name itself must raise into D° (cf. Longobardi 1994), as in McDonald vecchio (literally, ‘McDonald old’), or an article must appear in that position, as in il vecchio McDonald (literally, ‘the old McDonald’).

Another parameter, p13, concerns the presence of double-definiteness in languages employing a suffixal definiteness marker (e.g. Scandinavian languages). For example, in Danish, a language with the “−” value of this parameter, the presence of a free prenominal definiteness marker (i.e. article det), which in turn occurs in the presence of another prenominal element such as an adjective, precludes the appearance of the suffixal definiteness marker: hence, hus-et (lit. ‘house-the’) but det nye hus ‘the new house’. In contrast, in Norwegian, a language with the “+” value of this parameter, the suffixal definiteness marker appears even in the presence of a free prenominal definiteness marker: hus-et (lit. ‘house-the’) and det nye hus‑et (lit. ‘the new house-the’). Omitting this suffixal definiteness marker leads to ungrammaticality in Norwegian, but not in Danish.

The last parameter I will mention here, p16, concerns the expression of plurality on nouns in the presence of a numeral (‘two’ or above). In English, which has the “+” value of this parameter, a noun co‑occurring with a numeral appears in the plural form: e.g. five boys. Omitting the plural marker on the noun (i.e. five boy) is ungrammatical in standard English (although it is commonly found in non-standard varieties of English, such as on Tristan da Cunha, Falkland Islands, and St. Helena). In Hungarian, which has the “−” value of this parameter, a numeral appears with the singular form of the noun: öt fiú (lit. ‘five boy’). The plural form of the noun, marked by the suffix –k, is incompatible with a numeral: *öt fiúk (lit. ‘five boys’).
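To make the “+”/“−” notation concrete, here is a minimal Python sketch of how two languages might be compared parameter by parameter. It records only the values mentioned in the three examples above; every other cell is left unstated rather than guessed. The function names and the distance formula (the share of differing values among the parameters set in both languages) are assumptions made for illustration, not the LanGeLin team's actual data table or procedure.

```python
# Parameter values quoted in the three examples above ("-" stands for "−").
# Anything not mentioned in the text is simply absent, not guessed.
values = {
    "Italian":   {"p9": "+"},
    "English":   {"p9": "-", "p16": "+"},
    "Danish":    {"p13": "-"},
    "Norwegian": {"p13": "+"},
    "Hungarian": {"p16": "-"},
}


def compare(lang_a, lang_b):
    """Return (identities, differences) over parameters set in both languages."""
    a, b = values[lang_a], values[lang_b]
    shared = set(a) & set(b)
    identities = sum(1 for p in shared if a[p] == b[p])
    differences = len(shared) - identities
    return identities, differences


def distance(lang_a, lang_b):
    """Assumed formula: differences / (identities + differences).

    Returns None when the two languages have no parameter in common,
    since no comparison is possible in that case.
    """
    identities, differences = compare(lang_a, lang_b)
    total = identities + differences
    if total == 0:
        return None
    return differences / total


print(distance("Italian", "English"))    # differ on p9
print(distance("English", "Hungarian"))  # differ on p16
print(distance("Italian", "Hungarian"))  # no shared parameter here
```

With a single comparable parameter per pair the result is of course trivial; the point is only that each parameter contributes one discrete identity or difference, which is what makes a table of 56 such values usable as input for a tree-building program.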


Burlak’s final criticism concerns the outliers and what they mean for the reliability of the grammatical comparison (recall that the overall theme of her criticism is “Grammar is no better than lexicon”). She writes (translation mine):

“If Greek falls out of the Indo-European cluster, such an analysis cannot be taken to be reliable. If we apply it, for example, to some languages of Amazonia, and it turns out that languages A, B, and C form a cluster, while languages D, E, and F are not members of that same cluster, is it guaranteed that language D is not a member of the same language family as A, B, and C, just as Greek is a member of the Indo-European family? And so how can we distinguish which of the languages that do not cluster together are members of the family in question (like Greek) and which are not (like Hungarian or Basque)?”

Figure 6

In other words, when we have several (layered) outliers, or small outlying clusters, where do we draw the boundary between outliers that are members of the same family as the languages in the main cluster and those that are not? Consider, for example, the image on the left, reproduced from Pereltsvaig & Lewis (2015: 241). In this adaptation of a tree from Longobardi and Guardiano (2009), are Slavic languages and Hindi (which cluster together, for reasons I will not discuss here, but see Pereltsvaig & Lewis 2015: 226-227) members of the same language family as Germanic, Romance, and Greek languages that cluster together? Are Celtic languages to be included in the same language family? Are Finno-Ugric languages also in the same language family? Are Semitic languages? Is Wolof? Is Basque? We know the answers to these questions according to the traditional, widely accepted view that includes Slavic, Indo-Iranian, and Celtic languages (but not any of the other languages) in the Indo-European family. Some more controversial theories include Finno-Ugric languages in the same “Eurasiatic” family, or both Finno-Ugric and Semitic languages in the same “Nostratic” family. Wolof and Basque are generally considered not to be members of the same family as the other languages. But note that the tree itself provides us with no clear way to make such cut-off decisions. Just by looking at the configuration of this tree itself, one could say that Celtic languages are not members of the (truncated) Indo-European family. Conversely, one could suppose that all languages except Wolof and Basque are members of the same language family, as do advocates of the Nostratic hypothesis.
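The cut-off problem can be sketched with a toy example. The tree below is entirely hypothetical: it uses Burlak's abstract languages A through F, and the branching heights are invented for illustration, not taken from any of the published trees. The sketch shows that one and the same tree yields different “families” depending on where the cut is made, and that nothing in the tree itself picks out the right threshold.

```python
# Hypothetical tree over abstract languages A-F (invented, not real data).
# An inner node is (height, left, right); a leaf is a plain string.
# A, B, C form a tight cluster; D, E, F are successively more distant outliers.
tree = (1.0,
        (0.8,
         (0.5,
          (0.2, "A", (0.1, "B", "C")),
          "D"),
         "E"),
        "F")


def leaves(node):
    """List all languages under a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def cut(node, threshold):
    """Split the tree into clusters at the given height.

    A subtree is kept whole when its root joins at or below the threshold;
    otherwise it is split and each branch is examined in turn.
    """
    if isinstance(node, str) or node[0] <= threshold:
        return [sorted(leaves(node))]
    _, left, right = node
    return cut(left, threshold) + cut(right, threshold)


print(cut(tree, 0.3))  # → [['A', 'B', 'C'], ['D'], ['E'], ['F']]
print(cut(tree, 0.6))  # → [['A', 'B', 'C', 'D'], ['E'], ['F']]
print(cut(tree, 0.9))  # → [['A', 'B', 'C', 'D', 'E'], ['F']]
```

Whether D counts as a member of the family of A, B, and C depends only on whether the threshold is set at 0.3 or 0.6, and the choice between those values has to come from outside the tree.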

Thus, Burlak’s argument is in principle valid. However, note that the “Greek problem” that Burlak describes does not apply with respect to the syntactic tree of Indo-European languages in Longobardi et al. (2015), where all languages, including Greek, cluster with others. The problem does emerge, however, with respect to their lexical tree of the same 12 languages. Here, not only Greek but potentially also Irish are positioned as outliers, and one could, in theory, cut them off as not belonging to the same language family. (In fact, their configuration with respect to the other Indo-European languages is exactly the same as that of Wolof and Basque in the figure above.) Thus, by Burlak’s own argument, a lexically-based phylogeny is less reliable than a syntactically-based one, contrary to the overall theme of her critique.


When comparing the syntactic tree of only 12 Indo-European languages with that of a larger set including three non-Indo-European languages, as in the image reproduced on the left, one could say that Burlak’s “Greek problem” re-emerges with respect to the larger set. I think, however, that this is an artefact of selecting only three non-Indo-European languages, of which two cluster together. Had the researchers included a larger number of Finno-Ugric (or Uralic) languages, my hunch is that we would see two clear large clusters on the syntactic tree, representing these two families: Indo-European and Finno-Ugric (or Uralic). Importantly, adding such other Finno-Ugric languages would not solve the cut-off problem with respect to the lexically-based phylogeny: even if two large clusters were to emerge on that tree for the Indo-European and Finno-Ugric families, we would still have no way to know whether Greek and Irish are outliers within the Indo-European family or whether one or both of these languages belong to neither the Indo-European nor the Finno-Ugric family. Note that, given their proximity to Indo-European languages and hence the possibility of lexical borrowing to and from them, it would be possible to get this configuration even if Greek and/or Irish were not themselves Indo-European languages.


All in all, I disagree with Burlak’s critique that lexically-based approaches are better for determining language relatedness than those based on grammatical datasets, especially the PCM, which works with universal and discrete comparanda. In my opinion, lexically-based approaches show only word-relatedness, not really language relatedness in general. In the earlier days of comparative linguistics, lexically-based approaches had to be used for practical reasons, yet they are merely first-pass approximations of more holistic language phylogenies. As Longobardi and Guardiano (2009: 1684) point out, progress in theoretical linguistics in recent decades, which parallels the rise of molecular genetics in biology, means that linguists no longer need to limit themselves to easily observable, yet overly simplistic and highly unreliable comparanda like words.




