Institut für Kulturgeschichte der Antike, Wien
Thesaurus Linguae Latinae, München
JOHANN RAMMINGER
THE THESAURUS LINGUAE LATINAE
DIGITAL PERSPECTIVES
THESAURUS LINGUAE LATINAE
2ND LATIN LEXICOGRAPHY SUMMER SCHOOL
Digitized/machine readable dictionaries of Latin
Lewis & Short (Perseus)
Georges (www.zeno.org)
Gaffiot (Logeion)
Forcellini (commercial)
Oxford Latin Dictionary (comm.)
DuCange (ducange.enc.sorbonne.fr)
D. of Medieval Latin from British Sources (Logeion)
NLW Dictionary of Early Modern Latin
Thesaurus Linguae Latinae Database (DeGruyter, comm.)
Thesaurus Linguae Latinae (pdf, open access)
ThLL – search mask
The ThLL as a machine readable/digital object
Manual lexicography has produced extraordinary results for Greek and
Latin, but it cannot in the immediate future provide for all texts the same
level of coverage available for the most heavily studied materials. As
we build a cyberinfrastructure for Classics in the future, we must
explore the role that automatic methods can play within it. Using
technologies inherited from the disciplines of computational linguistics
and computer science, we can create a complement to these traditional
reference works - a dynamic lexicon that presents statistical information
about a word’s usage in context, including information about its sense
distribution within various authors, genres and eras, and syntactic
information as well.
Abstract from Bamman & Crane 2009
+ ThLL
ThLL – advantages for non-traditional research
Thesaurus – is a random text corpus for research other than the
original function
– has validated data (best text, no mistakes)
– has lemmatized data: homographs (lemma) and
ambiguous forms (paradigm) distinguished, different
stems etc. of the same word collected
– has chronologically fixed data (→ Index librorum)
But: double reduction process:
texts > material in ThLL archive > material printed in word article
Inflection of Latin verbs
Matteo Pellegrini & Marco Passarotti, "LatInfLexi: an Inflected Lexicon of
Latin Verbs", Proceedings of the Fifth Italian Conference on
Computational Linguistics (CLiC-it 2018). CEUR Workshop Proceedings
2253. URL: ceur-ws.org/Vol-2253/
• 254 possible forms x 3348 verbs = 850,392 paradigm cells
• 850,392 paradigm cells – 97,855 impossibles = 752,537 paradigm cells
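The arithmetic on this slide can be checked in a few lines. This is only a sketch that restates the three figures quoted from Pellegrini & Passarotti; the variable names are illustrative and not taken from LatInfLexi.

```python
# Figures as quoted on the slide (Pellegrini & Passarotti 2018).
FORMS_PER_VERB = 254        # possible inflected forms in one verb paradigm
VERBS = 3348                # verbs covered by the lexicon
IMPOSSIBLE_CELLS = 97_855   # paradigm cells excluded as impossible

# Every verb multiplied by every possible form gives the raw grid.
total_cells = FORMS_PER_VERB * VERBS

# Removing the impossible cells leaves the cells that could be filled.
possible_cells = total_cells - IMPOSSIBLE_CELLS

print(f"{total_cells:,}")     # 850,392
print(f"{possible_cells:,}")  # 752,537
```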
Inflection of Latin verbs - reality
• 752,537 paradigm cells max
• Paul Tombeur, Thesaurus Formarum Totius Latinitatis (1998). CD-Rom
TFTL epoch             unattested forms   (%)
Antiquitas             544,395            72.34%
Aetas Patrum           482,324            64.1%
Medium Aeuum           484,421            64.37%
Recentior Latinitas    640,552            85.12%
all epochs             401,690            53.38%
(numbers from Pellegrini & Passarotti 2018)
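The percentages can be recomputed from the counts, taking the 752,537 possible paradigm cells as the denominator. The sketch below assumes that denominator (the slide does not state it explicitly next to the table); the function names are illustrative.

```python
# Maximum number of possible paradigm cells, from the previous slide.
POSSIBLE_CELLS = 752_537

# Unattested forms per TFTL epoch, as given in the table.
UNATTESTED = {
    "Antiquitas": 544_395,
    "Aetas Patrum": 482_324,
    "Medium Aeuum": 484_421,
    "Recentior Latinitas": 640_552,
    "all epochs": 401_690,
}


def unattested_share(epoch: str) -> float:
    """Percentage of possible cells with no attested form in this epoch."""
    return round(100 * UNATTESTED[epoch] / POSSIBLE_CELLS, 2)


def attested_share(epoch: str) -> float:
    """Complement of the unattested share, i.e. cells that are attested."""
    return round(100 - unattested_share(epoch), 2)


for epoch in UNATTESTED:
    print(f"{epoch:<20} {unattested_share(epoch):6.2f}% unattested, "
          f"{attested_share(epoch):6.2f}% attested")
```

Read this way, roughly 28% of the possible verb forms are attested for Antiquitas, and a little under half across all epochs.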
can that be right?
which the ThLL has done (+Zettelarchiv!)
Length of article
(bar chart, values in axis order)
verb          quotations   length (kB)
docere        6231         537
dicere        4234         296
ducere        3506         279
dicare         408          34
degenerare     178          16
diruere        206          14
Length / quotations / different forms
(bar chart, values in axis order; max forms per verb: 254)
verb          quotations   quot.s w. lemma   forms
docere        6231         3202              172
dicere        4234         4087              132
ducere        3506         1608              132
dicare         408          209               62
degenerare     178           78               34
diruere        206           88               31
Quotations and different Forms
(bar chart, values in axis order; max forms per verb: 254)
verb          quot.s w. lemma   forms   share of max
docere        3202              172     70%
dicere        4087              132
ducere        1608              132
dicare         209               62
degenerare      78               34
diruere         88               31     12%
attested forms (Pellegrini & Passarotti): Antiquitas 28%
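The share of the 254 possible forms that each article documents can be computed as below. This is a sketch that assumes the chart values belong to the verbs in the order the axis labels appear (docere, dicere, ducere, dicare, degenerare, diruere); that assignment is read off the chart and is not stated in the text.

```python
# Maximum number of forms in one verb paradigm (Pellegrini & Passarotti).
MAX_FORMS = 254

# Different forms found in the quotations of each ThLL article.
# Assumption: values assigned to verbs in chart axis order.
FORMS_IN_ARTICLE = {
    "docere": 172,
    "dicere": 132,
    "ducere": 132,
    "dicare": 62,
    "degenerare": 34,
    "diruere": 31,
}

# Percentage of the paradigm that is documented in each article.
coverage = {
    verb: round(100 * forms / MAX_FORMS, 1)
    for verb, forms in FORMS_IN_ARTICLE.items()
}

for verb, pct in coverage.items():
    print(f"{verb:<12} {FORMS_IN_ARTICLE[verb]:>3} forms  {pct:5.1f}% of max")
```

On these figures docere covers about two thirds of the paradigm (shown as 70% on the chart) and diruere about 12%.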
Attested forms – further research
How does Latin function?
– Which forms are used most frequently?
– Which paradigm positions degenerate (>1 orth.)? – ThLL ‘De formis’
– How often are ambiguous forms actually used? Avoided?
– Do verbs with Greek roots behave differently (limited paradigm)?
– ...
Integrate into the framework of variational linguistics:
diaphasic var.: genres~styles~registers (prose/poetry, iur., coins...)
diastratic variation: sociolects
diatopic variation: regional substrate languages (Greek, Celtic, ... )
diachronic variation: language change (ThLL gives first attestations)
Integrate with text databases
Collocations and the Thesaurus
In corpus linguistics, a collocation is a sequence of words or terms
that co-occur more often than would be expected by chance.
There are about six main types of collocations: adjective+noun,
noun+noun (such as collective nouns), verb+noun, adverb+adjective,
verbs+prepositional phrase (phrasal verbs), and verb+adverb.
Collocation extraction is a computational technique that finds
collocations in a document or corpus, using various computational
linguistics elements resembling data mining.
(Wikipedia)
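One common way to operationalise "more often than would be expected by chance" is pointwise mutual information (PMI) over adjacent word pairs. The following is a minimal sketch of that idea, not the method of any particular tool; the function name, the `min_count` cutoff and the toy token sequence are all invented for illustration.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information.

    PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ); a high value means the
    pair co-occurs more often than independent words would.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:      # rare pairs give unreliable PMI values
            continue
        p_pair = count / n_pairs
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented toy sequence of lemmata, only to exercise the function.
toy = ("spiritus sanctus venit et spiritus sanctus manet "
       "et pater venit et filius manet").split()

ranked = pmi_bigrams(toy)
for pair, score in ranked:
    print(pair, round(score, 2))
```

For Latin, the input would have to be lemmatized first, and a window wider than adjacent pairs would be needed because of free word order.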
Collocations in the Thesaurus: privatus (oppos.)
Collocations in the Thesaurus: primatus
What to do with collocations?
David Bamman & Gregory Crane, “Computational Linguistics and
Classical Lexicography”, Changing the Center of Gravity:
Transforming Classical Studies Through Cyberinfrastructure.
digital humanities quarterly 3.1 (2009)
URL: http://www.digitalhumanities.org/dhq/vol/3/1/000033/000033.html
Collocations and meaning: spiritus
Latin context word   English        Probability spiritus = spirit
Sanctus              Holy           99.9%
Testis               Witness        99.9%
Vivifico             Make alive     99.9%
Omnipotens           All-powerful   99.9%

Latin context word   English        Probability spiritus = wind
Mons                 Mountain       98.3%
Commotio             Commotion      98.3%
Ventus               Wind           95.2%
Ala                  Wing           95.2%
(from Bamman & Crane 2009)
Collocations and meaning: the Thesaurus
promittere = (to send out), to let grow: capillus, barba, coma
promittere = to promise: caelum, victoria, astra : nihil
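The promittere example suggests a very simple use of collocates: look at the lemmata around the verb and let them vote for a sense. The sketch below does only that, using the collocates listed on the slide; the function name and the voting rule are assumptions for illustration, and treating nihil as a collocate of the "promise" sense is one reading of the slide.

```python
from collections import Counter

# Collocates of promittere as listed on the slide, mapped to a sense.
SENSE_BY_COLLOCATE = {
    "capillus": "to let grow",
    "barba": "to let grow",
    "coma": "to let grow",
    "caelum": "to promise",
    "victoria": "to promise",
    "astra": "to promise",
    "nihil": "to promise",
}


def guess_sense(context_lemmas):
    """Return the sense supported by most known collocates, or None."""
    votes = Counter(
        SENSE_BY_COLLOCATE[lemma]
        for lemma in context_lemmas
        if lemma in SENSE_BY_COLLOCATE
    )
    if not votes:
        return None          # no known collocate in the context
    return votes.most_common(1)[0][0]


print(guess_sense(["barba", "longus"]))      # to let grow
print(guess_sense(["victoria", "miles"]))    # to promise
print(guess_sense(["domus"]))                # None
```

A statistical approach such as the one in Bamman & Crane replaces the hand-made table with probabilities estimated from a corpus.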
Putting numbers on lexical semantics
Imitari in
Lewis & Short
Oxford Latin Dictionary
Thesaurus Linguae Latinae
Lewis & Short (1879) – 45 quotations/2 groups
Oxford Latin Dictionary (1982) – 62 quotations/6 groups
Thesaurus Linguae Latinae (1936) – 449 quotations/33 groups
imitari: number of examples
(bar chart, vertical axis 0 to 500)
Lewis&Short   46
OLD           66
TLL           449
imitari: number of examples and granularity of information
              examples   definitions   ex. / def.
Lewis&Short   45         2             23
OLD           62         6             10
TLL           449        33            13
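The "ex. / def." column is simply the number of examples divided by the number of definitions (groups). A short sketch with the figures for imitari given on the preceding slides; the dictionary layout is illustrative only.

```python
# imitari: (number of examples, number of definitions/groups) per dictionary,
# as given on the slides.
IMITARI = {
    "Lewis & Short": (45, 2),
    "OLD": (62, 6),
    "TLL": (449, 33),
}

# Average number of examples that support one definition.
examples_per_definition = {
    name: round(examples / definitions, 1)
    for name, (examples, definitions) in IMITARI.items()
}

for name, ratio in examples_per_definition.items():
    examples, definitions = IMITARI[name]
    print(f"{name:<14} {examples:>3} ex. / {definitions:>2} def. = {ratio}")
```

The TLL thus has by far the most definitions, while the number of examples per definition stays in the same order of magnitude as in the other two dictionaries.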
Language statistics and social value system
• Hypothesis: the number of attestations of a lexeme reflects its importance within the value system of a culture
• Latin is an inflected language → lemmatization is needed to measure the frequency of lexemes → ThLL
• The length of an article in the ThLL reflects the frequency of the word's use in Latin texts
• ThLL material is complete only for 'short words' (lexemes with few attestations)
• The ThLL entry reduces the material in proportion to its length: the longer an entry, the more material has been left out
divitiae, dives, dito, divito, ditifico, ditesco, ditator, ditabilis, ditificus
luxuria/es, luxurio, luxus, luxuriosus, luxurius, luxuriator, luxurialis, luxuriamen
pauper, pauperculus, paupertinus, paupertatula, pauperesco, pauperto, pauperium, pauperasco, paupertas, pauperies, paupero
divitiae – luxuria – opulentia – paupertas
(bar chart, vertical axis 0 to 900)
Frequency: Zipf's law
Zipf’s Law is a statistical distribution in certain data
sets, such as words in a linguistic corpus, in which the
frequencies of certain words are inversely proportional to
their ranks. Named for linguist George Kingsley Zipf,
who around 1935 was the first to draw attention to this
phenomenon, the law examines the frequency of words
in natural language and how the most common word
occurs twice as often as the second most frequent word,
three times as often as the subsequent word and so on
until the least frequent word. The word in the position n
appears 1/n times as often as the most frequent one.
source: https://whatis.techtarget.com/definition/Zipfs-Law
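The last sentence of the definition can be written as f(n) = f(1) / n, where f(n) is the frequency of the word at rank n. The sketch below generates such an idealised distribution and estimates the slope of a ranked frequency list on a log-log scale, where ideal Zipf data give a slope of -1. Function names and the sample top frequency of 1200 are invented for illustration.

```python
import math


def zipf_expected(top_frequency, ranks):
    """Idealised Zipf frequencies: the word at rank n occurs f(1)/n times."""
    return [top_frequency / n for n in ranks]


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    `frequencies` must be positive and sorted from most to least frequent.
    A slope near -1 is what Zipf's law predicts.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


print(zipf_expected(1200, [1, 2, 3, 4]))
print(round(loglog_slope(zipf_expected(1200, range(1, 51))), 3))
```

Applied to real counts, such as quotations per lemma, the slope indicates how closely the data follow the law.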
Frequency: Zipf's law - English
source: https://phys.org/news/2017-08-unzipping-zipf-law-solution-century-old.html
Frequency: quotations per lemma
size of subst.-lemmata in ThLL vol. 3, 5.1, 5.2, 7.2, 9.2
(chart: horizontal axis lemmata 1 to 6043, vertical axis quotations per lemma 0 to 7000; marked levels: 100, 50, 10 quotations / lemma, 1 quotation)
number of quotations per lemma: max 6413 (dies)
number of examined ThLL-articles: 6043
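The marked levels on the chart (100, 50, 10 quotations per lemma, 1 quotation) correspond to the rank at which the ranked curve falls to that level. A sketch of how such ranks could be found from a list of counts follows; apart from the maximum of 6413 (dies), the sample counts are hypothetical and do not come from the ThLL.

```python
def rank_where_count_drops_to(counts, threshold):
    """1-based rank of the first lemma with at most `threshold` quotations.

    Lemmata are ranked from the largest to the smallest count.
    Returns None if every lemma has more quotations than the threshold.
    """
    ranked = sorted(counts, reverse=True)
    for rank, count in enumerate(ranked, start=1):
        if count <= threshold:
            return rank
    return None


# Hypothetical quotation counts; only 6413 (dies) is taken from the slide.
sample_counts = [6413, 900, 120, 100, 60, 50, 12, 10, 3, 1, 1]

for level in (100, 50, 10, 1):
    print(level, rank_where_count_drops_to(sample_counts, level))
```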
Future research
Combine quantitative and qualitative research
(e.g. etym. dives < divinus: rich is being like a god)
ThLL has largely standardized language of description to
describe semantic and syntactic phenomena: 'proprie',
'translate', 'cum colore', 'acc. praedic.’ ...
Combine with the chronological data of the ThLL (Index
librorum)
Future research
e.g. granularity of meaning: number of subdivisions in an article
assumption:
the longer an article, the more subdivisions there will be
are there articles which are long and have fewer subdivisions
than average? conclusion: formulaic use
are there articles which are short and have more subdivisions
than average? conclusion: insecurity about meaning? repeated
new formation of the word? check diachronic and diatopic
distribution ...
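The two questions above could be approached by comparing each article's density of subdivisions with the typical density. The sketch below is one possible reading, not an established procedure: the lemma names and counts are hypothetical, and both the use of medians and the factor of 2 are arbitrary assumptions.

```python
from statistics import median


def flag_unusual_articles(articles, factor=2.0):
    """Flag articles whose subdivision density departs from the median.

    `articles` maps a lemma to (quotations, subdivisions).
    Long articles with a low density and short articles with a high
    density are returned with a tentative label.
    """
    density = {lemma: subs / quots for lemma, (quots, subs) in articles.items()}
    median_density = median(density.values())
    median_length = median(quots for quots, _ in articles.values())

    flagged = {}
    for lemma, (quots, _) in articles.items():
        if quots > median_length and density[lemma] < median_density / factor:
            flagged[lemma] = "long, few subdivisions (formulaic use?)"
        elif quots < median_length and density[lemma] > median_density * factor:
            flagged[lemma] = "short, many subdivisions (unstable meaning?)"
    return flagged


# Hypothetical articles: lemma -> (quotations, subdivisions).
sample = {
    "lemma_a": (400, 30),
    "lemma_b": (420, 5),
    "lemma_c": (40, 3),
    "lemma_d": (35, 12),
    "lemma_e": (100, 8),
}

flagged = flag_unusual_articles(sample)
for lemma, label in flagged.items():
    print(lemma, "->", label)
```

Any article flagged this way would still need the qualitative check named on the slide, in particular its diachronic and diatopic distribution.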