Creating Phylogenic Language Trees
Michele Fretta
Saint Michael’s College
Colchester, Vermont

Phylogenic Trees
Trees are a useful way to visualize the connections between things. The methods we will use here are often used in biological genetics.

Language Trees
Have you ever seen something like this?
[Figure: tree with Latin at the root and French, Italian, Portuguese, and Spanish as leaves]
It’s a language tree, which describes the way several languages are related. Latin is the ancestor, and the “leaves” are the contemporary languages.

We will construct tree graphs which model the relationships between the words of 12 languages.

Warning!
The trees in this presentation are examples only. The data they are based on is not statistically significant.

In (language) trees, it is assumed that a genetic relationship implies an evolutionary relationship.

Suppose we have data of words from several different languages. How can we use the data to construct a visual representation of the relationships among them (i.e., a tree)?

Branch and Bound Method
Based on data from several languages, we will construct a sample tree. First, we will use matrices to organize the data…

Sample Cognate Table
Cognates are words in two different languages which have similar sound and meaning.

Dutch       ALLES     EN   DIER     ASCH
English     ALL       AND  ANIMAL   ASHES
French      TOUT      E    ANIMAL   SENDRE
German      ALLES     UND  TIER     ASCHE
Hindi       SEB       OR   JANVER   RAKH
Italian     TUTTO     E    ANIMALE  SENERE
Nepalese    SAB       AU   JANAWAR  KHAG
Persian     HAME      VA   HEYVAN   KHAKESTAR
Polish      WSZYSTEK  I    ZWIERZE  POPIAL
Portuguese  TODO      E    ANIMAL   SINDRA
Russian     VES       I    ZVER     PYOPEL
Spanish     TODO      I    ANIMAL   SENIZA

Cognate Matrix Sample
“1” indicates that there is a cognate between the two languages.
“0” indicates that there is no cognate.

Pair-wise Percentage Similarity
Then, we use the cognate matrices to find the percentage of similarity between each pair of languages.
Group and average.
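The two steps above (percentage similarity, then "group and average") can be sketched in Python. This is only an illustration under stated assumptions: the cognate judgments in `CLASSES` are hand-made for this sketch and are not the presentation's cognate matrices, and "group and average" is read here as average-linkage grouping, which the slides do not spell out. The function names are hypothetical.

```python
from itertools import combinations

# ASSUMPTION: illustrative cognate classes for four of the twelve languages.
# Two languages share a label in a column when their words for that meaning
# are judged cognate. These judgments are made up for the sketch.
CLASSES = {
    "Dutch":   ["a", "a", "a", "a"],   # ALLES, EN,  DIER,   ASCH
    "English": ["a", "a", "b", "a"],   # ALL,   AND, ANIMAL, ASHES
    "French":  ["b", "b", "b", "b"],   # TOUT,  E,   ANIMAL, SENDRE
    "German":  ["a", "a", "a", "a"],   # ALLES, UND, TIER,   ASCHE
}

def cognate_matrix(classes, word_index):
    """0/1 entry per language pair for one meaning (1 = cognate)."""
    langs = sorted(classes)
    return {(p, q): int(classes[p][word_index] == classes[q][word_index])
            for p, q in combinations(langs, 2)}

def percent_similarity(classes):
    """Percentage of meanings for which each pair of languages is cognate."""
    n_words = len(next(iter(classes.values())))
    totals = {}
    for i in range(n_words):
        for pair, flag in cognate_matrix(classes, i).items():
            totals[pair] = totals.get(pair, 0) + flag
    return {pair: 100.0 * count / n_words for pair, count in totals.items()}

def pair_sim(sim, p, q):
    """Look up a similarity regardless of the order of the pair."""
    return sim[(p, q)] if (p, q) in sim else sim[(q, p)]

def group_and_average(sim):
    """Repeatedly merge the two most similar groups (average linkage)."""
    langs = sorted({lang for pair in sim for lang in pair})
    # Each cluster is (member languages, nested tree built so far).
    clusters = [((lang,), lang) for lang in langs]

    def avg(a, b):
        vals = [pair_sim(sim, p, q) for p in a[0] for q in b[0]]
        return sum(vals) / len(vals)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda ab: avg(*ab))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append((a[0] + b[0], (a[1], b[1])))
    return clusters[0][1]

sim = percent_similarity(CLASSES)
print(sim[("Dutch", "German")])    # 100.0
print(sim[("English", "French")])  # 25.0
print(group_and_average(sim))      # ('French', ('English', ('Dutch', 'German')))
```

With only four meanings the percentages are coarse, which matches the warning that the sample data is not statistically significant.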
Building the tree, group by group
[Figure: tree built group by group, with leaves Spanish, French, Italian, Portuguese, Russian, Polish, German, Dutch, English, Hindi, Nepali, Persian]

Here’s another method, which is often more reliable than Branch & Bound:

Kruskal’s Algorithm Method
Parsimony trees are based on the assumption that the tree requiring the fewest evolutionary steps is the most likely. We will construct a graph which represents all possible language relationships, and we will use a greedy algorithm to find the shortest (most likely) relationships.

Pair-wise Cognate Percentages
Begin with our matrix of cognate percentages:

            Dutch      English    French     German     Hindi      Italian    Nepalese   Persian    Polish     Portuguese Russian    Spanish
Dutch       x
English     30%        x
French      10%        10%        x
German      70%        10%        1%         x
Hindi       1%         1%         1%         1%         x
Italian     10%        10%        90%        1%         1%         x
Nepalese    1%         10%        1%         1%         60%        1%         x
Persian     1%         11%        1%         1%         22%        1%         33%        x
Polish      20%        1%         20%        10%        1%         20%        1%         1%         x
Portuguese  10%        10%        80%        1%         1%         70%        1%         1%         20%        x
Russian     20%        1%         20%        10%        1%         20%        1%         1%         40%        20%        x
Spanish     10%        10%        99%        1%         1%         90%        1%         1%         30%        70%        20%        x

The Complete Graph
Each edge represents an entry in our cognate-percentage matrix. The edges are color-coded by percentage similarity.
[Figure: complete graph on the 12 languages; color legend: 99%, 90%, 80%, 70%, 60%, 40%, 33%, 30%, 22%, 20%, 11%, 10%, 1%]

Implementing Kruskal’s Algorithm
[Figure: Kruskal’s algorithm applied to the graph of the 12 languages]

Our Minimum Spanning Tree
[Figure: minimum spanning tree on the 12 languages]

Modifying the Tree
Remember the tree with Latin? Latin is an internal vertex because it is an ancestor. Some of the vertices in our minimum spanning tree are internal vertices, but we want them all to be leaves. This is because leaves will represent present-day languages, which we are working with in this case.

Modifying the Tree
Attach a leaf to each internal vertex. The former internal vertices will serve as ancestors, like Latin. Sometimes, linguists know very little about these “protolanguages”.
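As a rough sketch, Kruskal's algorithm and the leaf-attaching step can be run on the percentage matrix above. Two choices here are assumptions, not something the slides state: each edge gets the distance 100 minus its cognate percentage, so that the most similar pairs are the shortest edges, and the former internal vertices are relabeled with a hypothetical "proto-" prefix. Several edges tie (many pairs share 20%, 10%, or 1%), so the tree this code picks may differ from the one pictured, although its total length does not depend on how ties are broken.

```python
LANGS = ["Dutch", "English", "French", "German", "Hindi", "Italian",
         "Nepalese", "Persian", "Polish", "Portuguese", "Russian", "Spanish"]

# Lower triangle of the cognate-percentage matrix, one row per language.
LOWER = [
    [],
    [30],
    [10, 10],
    [70, 10, 1],
    [1, 1, 1, 1],
    [10, 10, 90, 1, 1],
    [1, 10, 1, 1, 60, 1],
    [1, 11, 1, 1, 22, 1, 33],
    [20, 1, 20, 10, 1, 20, 1, 1],
    [10, 10, 80, 1, 1, 70, 1, 1, 20],
    [20, 1, 20, 10, 1, 20, 1, 1, 40, 20],
    [10, 10, 99, 1, 1, 90, 1, 1, 30, 70, 20],
]

def kruskal(langs, lower):
    """Greedy minimum spanning tree; returns a list of (u, v, distance)."""
    # ASSUMPTION: distance = 100 - percentage similarity.
    edges = sorted((100 - pct, langs[j], langs[i])
                   for i, row in enumerate(lower)
                   for j, pct in enumerate(row))
    parent = {lang: lang for lang in langs}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for dist, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:              # skip edges that would close a cycle
            parent[ru] = rv
            tree.append((u, v, dist))
    return tree

def attach_leaves(tree):
    """Turn every internal vertex into an ancestor with the language as a leaf."""
    degree = {}
    for u, v, _ in tree:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    internal = {n for n, d in degree.items() if d > 1}

    def rename(n):
        # Hypothetical label for the former internal vertex.
        return f"proto-{n}" if n in internal else n

    edges = [(rename(u), rename(v)) for u, v, _ in tree]
    edges += [(f"proto-{n}", n) for n in sorted(internal)]
    return edges

tree = kruskal(LANGS, LOWER)
print(len(tree))                      # 11
print(sum(d for _, _, d in tree))     # 537
for edge in attach_leaves(tree):
    print(edge)
```

After `attach_leaves`, every one of the 12 languages has exactly one neighbor, so all of them are leaves, and the relabeled vertices play the role that Latin plays in the earlier example.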
Modifying the Tree (cont’d)
[Figure: the modified tree, labeled Russian, Portuguese, Nepali, Persian, Dutch, Spanish, French, Polish, English, Hindi, Italian, German]

Comparing our trees…
They are rooted differently, but they are actually quite similar!
[Figure: the two trees side by side, each labeled with the 12 languages]

Rooting the Tree
Our two trees would look more similar if they had the same root. The root should be the oldest protolanguage, the ancestor of all of the tree’s contemporary languages. Historians do this. Based on historical records, they can rearrange the tree so that the older languages are closer to the root.

Class Problem
There is a creole in India which is based on Portuguese. If this language began as a synthesis of Portuguese and Hindi, how would you place it in this tree?
[Figure: the tree of the 12 languages]
Hint: There might be a problem with this ...

Supplements…

Applications
Parallels with modeling biological evolution
Mapping the migrations of human populations
Modeling the genetic similarities among human populations
Time divergences: using a decomposition “clock” to estimate the number of years which have separated languages.

Biology vs. Linguistics: Inheritance and transference
In biological trees, genes are compared to find these relationships.
In language trees, words are compared.
[Figure, from Searls, Nature: two trees, each descending from a common ancestor. Biological tree: Bacteria, Eukarya, Archaea, with gene transfer; labels Glycolysis, Replication, Electron Transport, Transcription, Photosynthesis, Translation. Language tree: English, French, German, with word transfer; labels Boeuf, Beef, Cow, Kuh; Porc, Pork, Swine, Schwein; Mouton, Mutton, Sheep, Schaf.]

Disadvantages: Branch and Bound
This method may group together slow-changing languages rather than related languages.
It also does not necessarily yield the best solution.

Disadvantages: Kruskal’s
Not necessarily the best tree.
The parsimony method guarantees only that the tree’s “parsimony length” is at most twice that of the best tree.

Some languages break the rules of a tree. Creoles cause a cycle in the tree because they have more than one parent language.
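The cycle problem can be checked mechanically. The sketch below assumes a toy fragment in which a hypothetical vertex named "Ancestor" joins Portuguese and Hindi; it is not the tree from the slides. Linking a creole to both parent languages then closes a loop, so the result is no longer a tree.

```python
def has_cycle(edges):
    """Return True if the undirected graph given by `edges` contains a cycle."""
    parent = {}

    def find(x):
        # Union-find root lookup; unseen vertices start as their own root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:          # u and v are already connected
            return True
        parent[ru] = rv
    return False

# ASSUMPTION: toy fragment with a hypothetical common ancestor.
tree = [("Ancestor", "Portuguese"), ("Ancestor", "Hindi")]
creole_links = [("Portuguese", "Creole"), ("Hindi", "Creole")]

print(has_cycle(tree))                 # False
print(has_cycle(tree + creole_links))  # True
```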