When Did Roma Leave India?—New Discovery or Corroboration of Old Theories?

[This post was originally published in January 2013.]

(Thanks to Yaron Matras for his help with researching this post.)

As was highlighted in my earlier mini-series on the history of English, popular media reports on scientific issues involving human history, migrations, and languages habitually pick studies whose claims contradict the current consensus; such studies are further sensationalized, while other work on the topic is generally ignored. Another example is the popular media coverage of a genetics study on the exodus of the Roma people (Gypsies) from India, recently published in Current Biology (“Reconstructing the Indian Origin and Dispersal of the European Roma: A Maternal Genetic Perspective”, 22(24): 2342-2349). According to a short article by Sindya N. Bhanoo in the New York Times, titled “Genomic Study Traces Roma to Northern India”, this “wide-ranging genomic study appears to confirm that the Roma came from a single group that left northwestern India about 1,500 years ago”. In actuality, the article in Current Biology makes no such claims. Instead, its contribution is much more modest. The main focus of the article is the different groups of Roma in Europe. The researchers examined genetic data from approximately 200 Roma individuals from the Iberian Peninsula, particularly their mtDNA (which traces maternal descent), which showed genetic similarity to the Roma from the Balkans region. Their conclusion is that the Roma of Spain and Portugal migrated via Southeastern Europe, contrary to popular views that some of the Roma came to the Iberian Peninsula via North Africa. A large part of the study concerns the issues of genetic affinities among Roma groups, the degree of admixture with neighboring populations, and migration routes followed since the first arrival in Europe.

When it comes to determining the prehistoric homeland of the Roma people, the New York Times article contradicts itself: the headline claims that the Roma have been traced to Northern India, whereas the third sentence in the main text (cited above) places the putative homeland in Northwestern India. While this geographic discrepancy may seem insignificant at first glance, it is not: the genetic study in question showed that the probability of the Roma homeland being located in Northwestern India (that is, Himachal Pradesh, Kashmir, and Punjab) is 72%, whereas the corresponding figure for Northern India (specifically, Uttar Pradesh and Madhya Pradesh) is only 2%. One may defend the author of the headline by appealing to typographic limitations in the number of characters, but this defense fails: adding four characters by replacing “Northern” with “Northwestern” would make this headline only 47 characters long, whereas the headline of Nicholas Wade’s piece on the Science article on Indo-European origins, “Family Tree of Languages Has Roots in Anatolia, Biologists Say”, is a whopping 62 characters long.
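The character counts above are easy to check. The following is a minimal sketch in Python; the variable names are my own, and the "corrected" headline is a hypothetical rewording, not something the New York Times printed.

```python
# Headlines as quoted in the text above.
nyt_headline = "Genomic Study Traces Roma to Northern India"
wade_headline = "Family Tree of Languages Has Roots in Anatolia, Biologists Say"

# Hypothetical corrected headline: swap "Northern" for "Northwestern".
corrected_headline = nyt_headline.replace("Northern", "Northwestern")

# Print each headline with its length in characters (spaces included).
for text in (nyt_headline, corrected_headline, wade_headline):
    print(f"{len(text):2d}  {text}")
# 43  Genomic Study Traces Roma to Northern India
# 47  Genomic Study Traces Roma to Northwestern India
# 62  Family Tree of Languages Has Roots in Anatolia, Biologists Say
```

The counts include spaces and punctuation, which is presumably how a headline length limit would be measured.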

To determine the location of the putative Roma homeland in India, the authors of the Current Biology article compared the DNA samples of European Roma with an extensive existing database of Indian sequences, conducting the analysis on both the regional and the state level. Their results “pointed at Punjab state (in North-Western India) as the most probable candidate to be the ancestral homeland of the Roma mtDNA types”, with a probability of 54%. Contrary to the way that the New York Times article spins this finding, it is not a fresh discovery, but rather is “in agreement with previous linguistic and anthropological studies”, as the authors of the Current Biology paper readily admit in the abstract. Curiously, the second most probable location of the Roma homeland according to this genetic study is in Eastern India, specifically Bihar, Orissa, and West Bengal. While the probability of this region is only 20%, compared to Northwestern India’s 72%, it is significant in light of the fact that the rest of the regions (Northern India, Western India, Southwestern India, Southeastern India, and Northeastern India) together account for only 8% of the probability. So far, there is nothing in the historical, linguistic, or genetic record to indicate a connection of the European Roma to Eastern India, but as we shall see below, a linguistic connection does exist between Romani and Central Indic languages, spoken in what the Current Biology team classifies as Northern India, specifically Uttar Pradesh and Madhya Pradesh.


As for the timing of the Roma exodus from India, the New York Times article claims that this latest genetics study dates it at around 1,500 years ago, that is, around 500 CE. Note that this hypothesis contradicts both the accepted consensus in the linguistic community that places the Roma exodus five hundred years later, around 1000 CE, and the much earlier date produced by the Gray/Atkinson model (see Bouckaert et al. 2012): 3,500 years ago (or 1500 BCE). However, the genetic study in Current Biology makes no claims whatsoever as to when the Roma actually left India. All that the geneticists can confirm is that the ancestors of the Roma were based in Northwestern India “2,158±1,178 years” ago (that is, between 1324 BCE and 1032 CE). This finding is “in agreement with previous historical records that locate the Roma in Europe at least 1,000 years ago”. Note that other genetic studies (e.g. Morar et al. 2004) place the Roma exodus “approximately 32-40 generations ago”, which, assuming 25-30 years per generation, matches the 1000 CE date derived from linguistic studies.
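The date arithmetic in this paragraph can be worked through in a few lines. This is only a rough sketch under two assumptions of mine: that "years ago" is counted from 2012, the publication year of the Current Biology paper, and that calendar years can be obtained by plain subtraction (ignoring the missing year zero). The function name calendar_label is hypothetical.

```python
REFERENCE_YEAR = 2012  # assumption: "years ago" counted from the paper's publication year


def calendar_label(years_ago, reference_year=REFERENCE_YEAR):
    """Convert 'years ago' into a rough BCE/CE label by plain subtraction.

    Simplification: the missing year zero is ignored, so BCE labels
    may be off by one year.
    """
    year = reference_year - years_ago
    return f"{year} CE" if year > 0 else f"{-year} BCE"


# Interval quoted from the genetic study: 2,158 +/- 1,178 years ago.
estimate, margin = 2158, 1178
oldest = calendar_label(estimate + margin)    # 3,336 years ago
youngest = calendar_label(estimate - margin)  # 980 years ago
print(oldest, "to", youngest)  # 1324 BCE to 1032 CE

# Morar et al. (2004): 32-40 generations at 25-30 years per generation.
span_years_ago = (32 * 25, 40 * 30)
print(span_years_ago)  # (800, 1200)
```

The generation-based span of 800 to 1,200 years ago brackets the figure of roughly 1,000 years ago, which is why it is compatible with the linguistic dating.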

Indeed, examining the Romani language and its connections to other languages has been instrumental in demystifying where the Roma originated and when they left India on their way to Europe. The similarities between Romani and other Indic languages were first noticed in the late 1700s. The first published work that postulates an Indian origin of the Romani language is Johann Christian Christoph Rüdiger’s On the Indic Language and Origin of the Gypsies, published in 1782, four years before Sir William Jones’s famous pronouncement about the affinity of Sanskrit with Ancient Greek and Latin (see Matras 1999 for a detailed analysis). Rüdiger used surprisingly modern methodology, collecting his Romani data directly from a native speaker (which he admitted to finding “tiresome and boring”) and his Hindi data from a manual written by a missionary. He compared a significant number of corresponding words from the two languages, as well as grammatical structures, noting that

“as regards the grammatical part of the language the correspondence is no less conspicuous, which is an even more important proof of the close relation between the languages.”

Subsequent linguistic studies focused on identifying the more specific location of the Roma homeland, as well as on dating the Roma exodus, by examining in detail phonological and morphological patterns of various Indic languages. For example, the evolution of the grammatical gender system in Indic languages indicates that Romani must have been spoken in India around 1000 CE. Earlier forms of Indic languages, known as Middle Indic, had three genders: masculine, feminine, and neuter. However, by the turn of the 2nd millennium CE, the neuter gender was lost (in some languages), with most formerly neuter nouns becoming masculine and a few becoming feminine. This change, along with several others discussed below, characterizes the transition from the Middle Indic period to the so-called New Indic phase. The Romani language fits the profile of a New Indic language: it has only two genders, masculine and feminine. More importantly, most of the formerly neuter nouns in Romani were reassigned to the same gender as their cognates in other New Indic languages, such as Hindi. For instance, the neuter agni ‘fire’ in the Prakrit language (a Middle Indic language) became the feminine āga ‘fire’ in Hindi and likewise the feminine jag in Romani. Given that there are several dozen formerly neuter nouns retained in Romani and reassigned to the same gender as in Hindi, the probability of the same change happening independently in the two languages is vanishingly small. The more likely explanation is that Romani was spoken in India at the turn of the 2nd millennium CE, so that the loss of the neuter gender and the reassignment of formerly neuter nouns to masculine or feminine genders occurred before Romani split off from the rest of the Indic family. Thus, the Romani exodus must be dated to around 1000 CE.

The only potential problem with using the gender system to date the Roma exodus from India is the fact that not all modern Indic languages have lost the neuter gender. Marathi and Oriya, for example, have retained the three-way gender system. But fortunately, Romani exhibits a number of other phonological and morphological properties that characterize it as a New Indic language. These include such grammatical developments as the loss of the elaborate nominal case endings present in Old and Middle Indic and their reduction to a simple opposition between nominative and oblique. For example, the word ‘boy’ in Romani has only two forms: the nominative raklo and the oblique rakles-, comparable to the Hindi laṛkā and laṛke-, respectively. Other case-like meanings are expressed by former postpositions, repurposed as clitics, and attaching to the oblique form (some of these clitics have been subsequently grammaticalized into suffixes in Romani). Another development shared by Romani with its Indic brethren is the disappearance of the Old and Middle Indic past tense conjugation and the use of the past participle instead, still visible in some dialects of Romani. The past participle shows agreement in gender, as in ov gelo ‘he went’ vs. oj geli ‘she went’. These forms are comparable to the Hindi vo gayā ‘he went’ vs. vo gayī ‘she went’. These and other shared patterns indicate that early Romani was part of the Indic dialect continuum during the transition period to early New Indic, which took place in medieval times, perhaps as early as the 8th or 9th century CE or as late as the 10th century CE.


With respect to locating the Roma homeland, linguistic studies are in agreement with (and often predate) genetic studies. Certain structural features of Romani are shared with the so-called Dardic languages of Northwestern India, such as Kashmiri. Several of these features are retentions from the earlier form of Indic. One is the retention of consonant clusters such as tr and št in words like patrin ‘leaf’ (from Old Indic patra‑) and mišto ‘good’ (from Old Indic mr̥ṣṭa); see the map from the Manchester Romani Project website. Other shared retentions include the retention of consonantal endings such as -s and -n in oblique case endings, and the retention of -n- in words like dand ‘tooth’ (from Old Indic danta, but compare with the Hindi dẫt). But as discussed in my earlier post, shared innovations are more important than shared retentions in determining the classification of languages. Luckily, Romani also shares an important innovation with the Dardic languages: the emergence of a new past-tense conjugation, based on the attachment of enclitic pronouns to the participle. Consider the Romani past tense forms such as kerdjom ‘I did’, kerdjas ‘he/she did’ and the like. They derive from combinations like *kerdo-jo-me ‘done-by-me’, *kerdo-jo-se ‘done by him/her’, and so on (see Matras 2001). This shared innovation in Romani and Dardic languages lends further support to the view that early Romani was spoken in the extreme northwestern areas of the Indian subcontinent in medieval times.

Linguistic analysis also allows us to shed light on an even earlier stage of Romani and to show that the ancestors of the Roma came from Central India before they lived in the Northwestern part of the subcontinent. The connection to languages of Central India was discovered by Turner in 1926. His evidence came from a number of shared early developments that are confined to the forerunners of the Central Indian languages, such as the form šun- ‘to hear’ from Old Indic (Sanskrit) śr̥n- and jakh ‘eye’ (via *akkhi) from Old Indic akṣi- (see the maps from the Manchester Romani Project website). The phonological shape of the nominalizing suffix -ipen (as in sastipen ‘health’), cognate to Central Indic -ippan (from Old Indic -itvana) rather than to Northwestern Indic -ittan, is another link between Romani and languages of Central India. These features emerged during the early transition stage from Old to Middle Indic, sometime after 500 BCE. That Romani shares these developments proves that it began its history as a Central Indian language. But Romani does not share all the developments that happened in the Central Indic languages, retaining some Old Indic features instead. As mentioned above, Romani retains the consonant combinations tr and št, which were simplified in the Central Indic languages during the transition to the Middle Indic period, producing patta and miṭṭha, respectively. Thus, it appears that speakers of what was to become Romani left the Central Indian region at some point during the first half of the first millennium CE, before the clusters were simplified, and migrated to the northwest, an area that remained unaffected by these changes.

From northwestern India, the Roma migrated to Europe through southwestern Asia, evidently making prolonged stopovers along the way. Evidence for their route comes chiefly in the form of loanwords and grammatical borrowings from a number of languages then spoken in Asia Minor: mostly from Byzantine Greek and to a lesser extent from Armenian and from several Iranian languages. (Some of the borrowings from Iranian might be attributed to any one of several Iranian languages, including both Persian and Kurdish, while others may be localized more precisely.) The immense Greek influence on Romani testifies not only to widespread bilingualism among the Roma and to their minority status, but also to a long period of intense contact with Greek-speaking populations. Crucially, the Greek influence permeated all areas of Romani, including its lexicon, morphology, and syntax. Among Greek loanwords in Romani are nouns like drom ‘road’ from the Greek drómos ‘road’, zumin ‘soup’ from the Greek zumí ‘soup’, xoli ‘anger’ from the Greek xolí ‘anger’, luludi ‘flower’, fóros ‘town’, kókalo ‘bone’, skamín ‘chair’ and many more, as well as grammatical words like pale ‘again’ from the Greek pale ‘again’, komi ‘still’ from the Greek akómi ‘still’, and numerals efta ‘seven’, oxto ‘eight’, and enja ‘nine’. Morphological borrowings from Greek into Romani include the marker of ordinal numbers -to (as in pandžto ‘fifth’), nominal endings as in prezident-os ‘president’, slug-as ‘slave’, čač-imos ‘truth’, and endings that identify loan verbs as in mog-in-ava ‘I can’, intr-iz-ava ‘I enter’. Greek has also had an immense impact on the syntax of Romani. The Greek influence can be seen in the emergence of a definite article placed before the noun (e.g. o čhavo ‘the boy’) and the shift from Object-Verb (as in the rest of Indic languages) to Verb-Object order (e.g. xav manřo ‘I eat bread’, where the verb xav indicates that the subject is ‘I’). 
Other features that can be attributed to Greek influence are postposed relative clauses introduced by a general relativizer kaj (as in o manuš kaj giljavel ‘the man who sings’) and the contrast between a factual complementizer kaj and a non-factual one te.

Influences from other languages once spoken in southwestern Asia include numerous Iranian loanwords in Romani such as diz ‘fortress, town’ from Persian diz, zor ‘strength’ from Persian or Kurdish zor, and baxt ‘luck’ from Persian or Kurdish baxt. Finally, another important contact language was Armenian, which contributed to Romani words like bov ‘oven’, kotor ‘piece’, and grast ‘horse’. Some scholars have argued that contact with Iranian and Armenian occurred before contact with Greek, chiefly due to the geographical locations of the languages in our present era. But since eastern Anatolia, where both Iranian languages and Armenian were spoken, was part of the Byzantine Empire up to the late eleventh century, it is also possible that Greek, Iranian and Armenian influences were all acquired during the same period. (See Yaron Matras’ Romani: A Linguistic Introduction for a more detailed description of the Romani language and history, or check out the Manchester Romani Project online DVD “The Romani language: an interactive journey”, which tells the history of the Romani language and is accompanied by interactive illustrations, games, and sound samples; this program is available in 18 languages.)

Thus, linguistic rather than genetic studies allow us to gain most insight into the past of the Roma people, their migrations, and interactions with other peoples. Still, it is important that recent genetic studies confirm rather than contradict linguistic insights.



