Creating Phylogenic Language Trees
Michele Fretta
Saint Michael’s College
Colchester, Vermont
Phylogenic Trees

Trees are a useful way to visualize the
connections between things.

The methods we will use here are often used in
biological genetics.
Language Trees
Have you ever seen something like this?
It’s a language tree, which describes the way several languages are related.
Latin is the ancestor, and the “leaves” are the contemporary languages.
[Figure: tree with Latin at the root and the leaves French, Italian, Portuguese, Spanish]

We will construct tree graphs which model the
relationships between the words of 12
languages.


Warning! The trees in this presentation are examples only. The data they are based on is
not statistically significant.
In (language) trees, it is assumed that a genetic relationship implies an
evolutionary relationship.

Suppose we have data of words from several
different languages.

How can we use the data to construct a visual
representation of the relationships among them
(i.e., a tree)?
Branch and Bound Method
• Based on data from several languages, we will construct a sample tree.
• First, we will use matrices to organize the data…
Sample Cognate Table

Cognates are words in two different languages that share a common origin.
Here we identify them by their similar sound and meaning.
Language    "all"     "and"  "animal"  "ashes"
Dutch       ALLES     EN     DIER      ASCH
English     ALL       AND    ANIMAL    ASHES
French      TOUT      E      ANIMAL    SENDRE
German      ALLES     UND    TIER      ASCHE
Hindi       SEB       OR     JANVER    RAKH
Italian     TUTTO     E      ANIMALE   SENERE
Nepalese    SAB       AU     JANAWAR   KHAG
Persian     HAME      VA     HEYVAN    KHAKESTAR
Polish      WSZYSTEK  I      ZWIERZE   POPIAL
Portuguese  TODO      E      ANIMAL    SINDRA
Russian     VES       I      ZVER      PYOPEL
Spanish     TODO      I      ANIMAL    SENIZA
Cognate Matrix Sample


“1” indicates that there is a cognate between the two languages.
“0” indicates that there is no cognate.
Pair-wise Percentage Similarity


Then, we use the cognate matrices to find the
percentage of similarity between each pair of
languages.
Group and average.
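
The percentage step can be sketched in Python. This is only an illustration: the function name `pairwise_similarity` and the 0/1 judgments below are assumptions made for the example, not the data behind the slides. Each word gets one cognate matrix, and the similarity of a language pair is the share of words for which that pair is marked "1".

```python
from itertools import combinations

def pairwise_similarity(languages, cognate_matrices):
    """Return {(lang_a, lang_b): percent of words marked cognate}."""
    n_words = len(cognate_matrices)
    result = {}
    for i, j in combinations(range(len(languages)), 2):
        # Count the words whose matrix has a 1 for this pair of languages.
        shared = sum(matrix[i][j] for matrix in cognate_matrices)
        result[(languages[i], languages[j])] = 100.0 * shared / n_words
    return result

languages = ["Dutch", "English", "German"]

# One 0/1 matrix per word (all, and, animal, ashes).
# Illustrative judgments only, loosely based on the sample table.
everything_matches = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
animal = [[1, 0, 1],   # DIER vs ANIMAL: 0, DIER vs TIER: 1
          [0, 1, 0],
          [1, 0, 1]]
cognate_matrices = [everything_matches, everything_matches,
                    animal, everything_matches]

similarity = pairwise_similarity(languages, cognate_matrices)
for pair, percent in similarity.items():
    print(pair, f"{percent:.0f}%")
```

With only four words per language, the percentages are very coarse, which is one reason the trees in this presentation are examples only.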
Building the tree, group by group
[Figure: tree built group by group, with leaves Spanish, French, Italian, Portuguese, Russian, Polish, German, Dutch, English, Hindi, Nepali, Persian]
Here’s another method, which is often more reliable than Branch & Bound:
Kruskal’s Algorithm Method



Parsimony trees are based on the assumption that the tree with the fewest
evolutionary steps is the most likely.
We will construct a graph which represents all possible language relationships,
and we will use a greedy algorithm to find the shortest (most likely)
relationships.
Pair-wise Cognate Percentages

Begin with our matrix
of cognate percentages:
            Dut  Eng  Fre  Ger  Hin  Ita  Nep  Per  Pol  Por  Rus  Spa
Dutch         x
English     30%    x
French      10%  10%    x
German      70%  10%   1%    x
Hindi        1%   1%   1%   1%    x
Italian     10%  10%  90%   1%   1%    x
Nepalese     1%  10%   1%   1%  60%   1%    x
Persian      1%  11%   1%   1%  22%   1%  33%    x
Polish      20%   1%  20%  10%   1%  20%   1%   1%    x
Portuguese  10%  10%  80%   1%   1%  70%   1%   1%  20%    x
Russian     20%   1%  20%  10%   1%  20%   1%   1%  40%  20%    x
Spanish     10%  10%  99%   1%   1%  90%   1%   1%  30%  70%  20%    x
The Complete Graph
Each edge represents an entry in our cognate-percentage matrix. They are color-coded by
percentage similarity.
[Figure: complete graph on the 12 languages. Color legend: 99%, 90%, 80%, 70%, 60%, 40%, 33%, 30%, 22%, 20%, 11%, 10%, 1%]
Implementing Kruskal’s Algorithm
[Figure: the 12 language vertices (German, Hindi, French, English, Italian, Nepali, Dutch, Persian, Spanish, Russian, Portuguese, Polish), joined edge by edge]
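
Here is a minimal Python sketch of the greedy step, using the cognate-percentage matrix above. It rests on one assumption the slides do not state: each edge length is taken to be 100 minus the percentage similarity, so the most similar pairs are the shortest edges. The names `kruskal_language_tree`, `NAMES` and `LOWER` are made up for this example. Many entries are tied (1%, 10%, 20%), so another tie-breaking order can give a different but equally short tree.

```python
NAMES = ["Dutch", "English", "French", "German", "Hindi", "Italian",
         "Nepalese", "Persian", "Polish", "Portuguese", "Russian", "Spanish"]

# Lower triangle of the cognate-percentage matrix, one row per language.
LOWER = [
    [],
    [30],
    [10, 10],
    [70, 10, 1],
    [1, 1, 1, 1],
    [10, 10, 90, 1, 1],
    [1, 10, 1, 1, 60, 1],
    [1, 11, 1, 1, 22, 1, 33],
    [20, 1, 20, 10, 1, 20, 1, 1],
    [10, 10, 80, 1, 1, 70, 1, 1, 20],
    [20, 1, 20, 10, 1, 20, 1, 1, 40, 20],
    [10, 10, 99, 1, 1, 90, 1, 1, 30, 70, 20],
]

def kruskal_language_tree(names, lower):
    """Return spanning-tree edges as (language_a, language_b, similarity)."""
    # Assumed edge length: 100 - similarity (shorter means more similar).
    edges = sorted((100 - lower[i][j], j, i)
                   for i in range(len(names)) for j in range(i))
    parent = list(range(len(names)))  # union-find forest over the vertices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for length, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:          # skip edges that would close a cycle
            parent[root_a] = root_b
            tree.append((names[a], names[b], 100 - length))
    return tree

tree = kruskal_language_tree(NAMES, LOWER)
for a, b, similarity in tree:
    print(f"{a} - {b}: {similarity}%")
```

The loop takes the shortest remaining edge each time and keeps it only if it joins two groups that are not yet connected, which is the greedy rule behind Kruskal's algorithm.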
Our Minimum Spanning Tree
[Figure: minimum spanning tree on Russian, Portuguese, Nepali, Persian, Dutch, Spanish, French, Polish, English, Hindi, Italian, German]
Modifying the Tree




Remember the tree with Latin?
Latin is an internal vertex because it is an ancestor.
Some of the vertices in our minimum spanning tree are internal vertices, but we
want them all to be leaves.
This is because leaves will represent present-day languages, which we are
working with in this case.
Modifying the Tree

Attach a leaf to each internal vertex.

The former internal vertices will serve as
ancestors, like Latin. Sometimes, linguists
know very little about these “protolanguages”.
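
One way to sketch this step in Python is shown below. The function name `attach_leaves` and the "proto-" labels are assumptions made for illustration, and the three-edge tree is a hypothetical fragment rather than the tree from the slides. Every vertex with more than one neighbor is renamed as an unnamed ancestor, and the present-day language is hung from it as a new leaf.

```python
from collections import Counter

def attach_leaves(edges):
    """Turn every internal vertex into an ancestor with the language as a leaf."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    internal = {v for v, d in degree.items() if d > 1}

    def rename(v):
        # "proto-" is only a placeholder label for the unknown ancestor.
        return f"proto-{v}" if v in internal else v

    new_edges = [(rename(a), rename(b)) for a, b in edges]
    # Attach the present-day language as a leaf of its former vertex.
    new_edges += [(f"proto-{v}", v) for v in sorted(internal)]
    return new_edges

# Hypothetical fragment of a spanning tree.
edges = [("French", "Spanish"), ("French", "Italian"), ("Spanish", "Portuguese")]
modified = attach_leaves(edges)
for edge in modified:
    print(edge)
```

After the change, every present-day language has exactly one neighbor, so all of them are leaves.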
Modifying the Tree (cont’d)
[Figure: the modified tree, with Russian, Portuguese, Nepali, Persian, Dutch, Spanish, French, Polish, English, Hindi, Italian, German all as leaves]
Comparing our trees…
They are rooted differently, but they are actually quite similar!
[Figure: the two trees side by side, each with the 12 languages as leaves]
Rooting the Tree

Our two trees would look more similar if they had the
same root.

The root should be the oldest protolanguage, the
ancestor of all of the tree’s contemporary languages.

Historians do this. Based on historical records, they can rearrange the tree so
that the older languages are closer to the root.
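
Rooting does not change which vertices are connected, only which one we treat as the top. As a rough sketch (the function name `root_tree` is an assumption, and the small Latin tree is the one from the earlier slide), a breadth-first walk from the chosen root assigns each vertex its parent:

```python
from collections import defaultdict, deque

def root_tree(edges, root):
    """Return {vertex: parent} for the tree hung from the chosen root."""
    neighbors = defaultdict(list)
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in parent:     # not visited yet
                parent[nxt] = node
                queue.append(nxt)
    return parent

edges = [("Latin", "French"), ("Latin", "Italian"),
         ("Latin", "Portuguese"), ("Latin", "Spanish")]

rooted_at_latin = root_tree(edges, "Latin")
rooted_at_french = root_tree(edges, "French")
print(rooted_at_latin["French"])    # Latin
print(rooted_at_french["Latin"])    # French
```

The same edges give two different pictures, which is why two trees with different roots can look less alike than they really are.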
Class Problem


There is a creole in India which is based on Portuguese.
If this language began as a synthesis of Portuguese and Hindi,
how would you place it in this tree?
Hint: There might be a problem with this ...
[Figure: the tree with leaves Portuguese, Nepali, Russian, Dutch, Persian, French, Spanish, Polish, Italian, English, Hindi, German]
Supplements…
Applications




Parallels with modeling biological evolution
Mapping the migrations of human populations
Modeling the genetic similarities among human populations
Time divergences: using a decomposition “clock” to estimate the number of years which have separated languages.
Biology vs. Linguistics:
Inheritance and transference


In biological trees, genes are compared to find these relationships.
In language trees, words are compared.
[Figure, biology panel: Bacteria, Eukarya and Archaea descend from a common ancestor, with gene transfer between branches; labels: Glycolysis, Replication, Electron Transport, Transcription, Photosynthesis, Translation]
[Figure, linguistics panel: English, French and German descend from a common ancestor, with word transfer between branches; labels: Boeuf, Beef, Cow, Kuh; Porc, Pork, Swine, Schwein; Mouton, Mutton, Sheep, Schaf]
Diagram from Searls, Nature
Disadvantages—Branch and Bound

This method may group together slow-changing languages rather than related
languages.

It also does not necessarily yield the best
solution.
Disadvantages—Kruskal’s

Not necessarily the best tree.


The parsimony method only guarantees that the “parsimony length” of the tree
is at most twice that of the best tree.
Some languages break the rules of a tree.

Creoles cause a cycle in the tree because they have
more than one parent language.
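
A small Python check makes the problem concrete. The tree below is a hypothetical fragment, and `has_cycle` and the vertex name "Creole" are names chosen for this sketch. Joining the creole to both of its parent languages closes a loop, so the result is no longer a tree.

```python
def has_cycle(edges):
    """Return True if the undirected edges contain a cycle (union-find check)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:   # a and b are already connected
            return True
        parent[root_a] = root_b
    return False

# Hypothetical fragment of a language tree.
tree = [("Portuguese", "Spanish"), ("Spanish", "English"), ("English", "Hindi")]
# The creole has two parent languages, so it needs two edges.
creole_edges = [("Creole", "Portuguese"), ("Creole", "Hindi")]

print(has_cycle(tree))                 # False
print(has_cycle(tree + creole_edges))  # True
```

With only one of the two edges the graph stays a tree, but then one of the creole's parent languages is left out.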