Language as a window into human history
Human societies leave many kinds of traces: pottery, bones, genes. But language is the only trace transmitted faithfully through every generation for thousands of years. Its patterns encode who interacted with whom, when populations split or merged, and how technologies, cosmologies and lifeways travelled across the world. Where archaeology uncovers what people made, and genetics reveals who they descended from, the comparative study of languages reveals how communities communicated, moved, interacted, and transformed over time. The DLCE’s integrative research program treats linguistic diversity as a global time machine, one that opens an unparalleled window onto the layers of human history. We analyse these layers by using computational phylogenetic methods to infer dated language family trees. This involves generating new, high-quality, cognate-coded lexical datasets and applying novel phylogenetic methods tailored to the nuances of linguistic data.
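To make "cognate-coded" concrete, the following is a minimal sketch of one common way such data can be prepared for phylogenetic analysis: each (meaning, cognate class) pair becomes a binary character that a language either has or lacks. The function name, the record layout and the toy records are illustrative assumptions, not the DLCE's actual pipeline or data.

```python
from collections import defaultdict


def binarize_cognates(records):
    """Turn (language, meaning, cognate_class) records into binary characters.

    Returns (languages, characters, matrix), where matrix maps each language
    to a string of "1" (has the cognate class), "0" (has a different class
    for that meaning) or "?" (no data for that meaning).
    """
    records = list(records)
    languages = sorted({lang for lang, _, _ in records})
    characters = sorted({(meaning, cls) for _, meaning, cls in records})

    attested = defaultdict(set)  # language -> meanings with at least one form
    present = set()              # (language, meaning, cognate_class) triples
    for lang, meaning, cls in records:
        attested[lang].add(meaning)
        present.add((lang, meaning, cls))

    matrix = {}
    for lang in languages:
        row = []
        for meaning, cls in characters:
            if meaning not in attested[lang]:
                row.append("?")  # missing data, not absence
            elif (lang, meaning, cls) in present:
                row.append("1")
            else:
                row.append("0")
        matrix[lang] = "".join(row)
    return languages, characters, matrix


# Toy records; the class labels A-D are hypothetical identifiers.
toy_records = [
    ("English", "hand", "A"), ("German", "hand", "A"),
    ("French", "hand", "B"), ("Spanish", "hand", "B"),
    ("English", "water", "C"), ("German", "water", "C"),
    ("French", "water", "D"),  # Spanish "water" deliberately left uncoded
]

languages, characters, matrix = binarize_cognates(toy_records)
for lang in languages:
    print(f"{lang:8s} {matrix[lang]}")
```

The sketch keeps missing data ("?") distinct from absence ("0"), because treating an uncoded meaning as an absent cognate would make poorly documented languages look more divergent than they are.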
Recent work in the DLCE has tackled the 200-year-old debate about the origin of the Indo-European languages. Bayesian phylogenetic analyses of core vocabulary have produced conflicting results: some support a farming expansion out of Anatolia ~9,000 yr B.P., whereas others support a spread with horse-based pastoralism out of the Pontic-Caspian Steppe ~6,000 yr B.P. In this project, we diagnosed and resolved the causes of this discrepancy. First, the datasets employed had limited language sampling and widespread coding inconsistency. Second, some analyses enforced the assumption that modern spoken languages derive directly from ancient written languages rather than from parallel spoken varieties. Together, these methodological problems distorted branch length estimates and, thus, date inferences. To address them, we built an extensive new dataset of core vocabulary across 162 Indo-European languages to infer the chronology and divergence sequence of the family. The dataset eliminates past inconsistencies and provides a fuller, more balanced language sample, including 52 non-modern languages, for a denser set of time-calibration points. We applied ancestry-enabled Bayesian phylogenetic analysis to test rather than enforce direct ancestry assumptions. This analysis indicates that few ancient languages are direct ancestors of modern clades and produces a root age of ~8,120 yr B.P. for the family. Although this date is not consistent with the Steppe hypothesis, it does not rule out an initial homeland south of the Caucasus, followed by a northward branch onto the steppe and then across Europe. This hybrid hypothesis is consistent with recently published ancient DNA evidence from the steppe and the northern Fertile Crescent.
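The effect of an enforced ancestry assumption on dates can be illustrated with simple arithmetic. The sketch below is a toy under strong assumptions (a strict clock, a rate calibrated from a single ancient-modern pair, no saturation); all numbers and names are invented for illustration and none come from the study, whose Bayesian analyses are far richer. It shows only the basic mechanism: if an ancient language is really a sister of a modern language's ancestor, forcing it onto the direct line compresses the elapsed time between them, inflates the implied rate of change, and so can pull deeper dates towards the present.

```python
def forced_ancestry_bias(true_rate, t_ancient, t_split, t_root):
    """Toy strict-clock arithmetic for an enforced direct-ancestry constraint.

    true_rate : expected lexical changes per year along one lineage
    t_ancient : age (years B.P.) of the written ancient language A
    t_split   : age of the true common ancestor of A and modern language M
    t_root    : true age of the root of the whole family
    Returns (forced_rate, forced_root_age).
    """
    if not (0 < t_ancient <= t_split <= t_root):
        raise ValueError("expected 0 < t_ancient <= t_split <= t_root")

    # True evolutionary path between A and M: up from A to the shared
    # ancestor, then down to the present.
    true_path = (t_split - t_ancient) + t_split
    expected_changes = true_rate * true_path

    # If A is forced to be M's direct ancestor, the same number of changes
    # must fit into only t_ancient years, so the implied rate goes up.
    forced_rate = expected_changes / t_ancient

    # Two modern languages separated by the root differ by about
    # rate * 2 * t_root changes; dating that with the inflated rate
    # yields a root that is too young.
    root_changes = true_rate * 2 * t_root
    forced_root_age = root_changes / (2 * forced_rate)
    return forced_rate, forced_root_age


# Hypothetical values chosen only to make the arithmetic easy to follow.
rate, root_age = forced_ancestry_bias(
    true_rate=1e-4, t_ancient=2000, t_split=2500, t_root=8000
)
print(f"implied rate: {rate:.2e} per year, implied root age: {root_age:.0f} yr B.P.")
```

When `t_split` equals `t_ancient`, that is, when the ancient language really is a direct ancestor, the function returns the true rate and the true root age, which is why testing the ancestry assumption rather than enforcing it matters.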
A common objection to the use of tree models in the analysis of language evolution is that words are often borrowed across language lineages. While borrowing is substantially lower in core vocabulary and is frequently easy to detect and remove from analyses, a persistent doubt remains about the robustness of phylogenetic inferences to moderate levels of undetected borrowing. To assess this impact, we have recently re-analysed our Indo-European data with the “Contact Trees” model in BEAST 2, which simultaneously infers a set of trees and the borrowing events. Current results suggest that our age and subgrouping estimates for the Indo-European family are robust to such borrowing.
While the Indo-European language family has been a preoccupation for many historical linguists, we have extended our phylogenetic analyses to several other regions of the world where major language family expansions have occurred. Previously, we modelled the spread of the Bantu language family across Africa, the Dravidian family in India and tonal changes in Mixtec languages in Mexico. In 2023, we extended this approach to analyse the spread of Uto-Aztecan languages. We tested two competing hypotheses. The “Northern origin” hypothesis posits that speakers of Proto-Uto-Aztecan were hunter-gatherers in south-eastern California, Arizona, and north-western Mexico, diversifying between 3,000 and 5,000 yr B.P. and adopting agriculture as they spread south. The “Southern origin”, or “devolution”, hypothesis posits that they were maize farmers in central Mexico, diversifying between 3,000 and 4,500 yr B.P. and losing agriculture in the southwest of North America. We inferred Proto-Uto-Aztecan to be around 4,100 years old and identified the most likely homeland to be near what is now Southern California. We also reconstructed the most probable subsistence strategy in the ancestral Uto-Aztecan society and inferred no casual or intensive cultivation, an absence of cereal crops, and a primary subsistence mode of gathering (rather than agriculture). Our results therefore match the timing, geography, and cultural practices predicted by the Northern origin hypothesis. We are currently working on similar analyses of the history of the Mayan and Pano languages in the Americas, testing to what extent the linguistic history matches archaeological evidence.
The Austronesian expansion across the far-flung islands of the Pacific and Indian Oceans is the largest language expansion in human history and one of the most recent. It poses particular challenges for phylogenetic analyses because of its size (over 1,200 languages), rapidity, huge variation in retention rates, and regions where language linkages rather than tree-like structures abound. Our work on this language family has recently focused on substantially expanding the number of languages analysed, inferring deep relationships and investigating the causes of linkages. We have also been investigating the role of social processes in driving language divergences in Vanuatu, which has more languages per capita than anywhere else in the world. Our preliminary results using the new “spike model” of lineage divergences indicate that group identity and boundary formation (schismogenesis) account for a substantial amount of change in the lexicon and phonology, but far less in grammar.
Representative publications
Heggarty, P., et al. (2023). Language trees with sampled ancestors support a hybrid model for the origin of Indo-European languages. Science, 381(6656), eabg0818.
Greenhill, S.J., Haynie, H., Ross, R., Chira, A., List, J., Campbell, L., Botero, C.A., Gray, R.D. (2023). A recent northern origin for the Uto-Aztecan family. Language, 99(1), 81-107. doi:10.1353/lan.0.0276