Institut für Kulturgeschichte der Antike, Wien
Thesaurus Linguae Latinae, München

JOHANN RAMMINGER
THE THESAURUS LINGUAE LATINAE: DIGITAL PERSPECTIVES
THESAURUS LINGUAE LATINAE, 2ND LATIN LEXICOGRAPHY SUMMER SCHOOL

Digitized/machine-readable dictionaries of Latin
– Lewis & Short (Perseus)
– Georges (www.zeno.org)
– Gaffiot (Logeion)
– Forcellini (commercial)
– Oxford Latin Dictionary (commercial)
– DuCange (ducange.enc.sorbonne.fr)
– Dictionary of Medieval Latin from British Sources (Logeion)
– NLW Dictionary of Early Modern Latin
– Thesaurus Linguae Latinae Database (DeGruyter, commercial)
– Thesaurus Linguae Latinae (pdf, open access)

ThLL – search mask

The ThLL as a machine-readable/digital object
Manual lexicography has produced extraordinary results for Greek and Latin, but it cannot in the immediate future provide for all texts the same level of coverage available for the most heavily studied materials. As we build a cyberinfrastructure for Classics in the future, we must explore the role that automatic methods can play within it. Using technologies inherited from the disciplines of computational linguistics and computer science, we can create a complement to these traditional reference works - a dynamic lexicon that presents statistical information about a word’s usage in context, including information about its sense distribution within various authors, genres and eras, and syntactic information as well.
Abstract from Bamman & Crane 2009
+ ThLL

ThLL – advantages for non-traditional research
Thesaurus
– is a random text corpus for research other than the original function
– has validated data (best text, no mistakes)
– has lemmatized data: homographs (lemma) and ambiguous forms (paradigm) distinguished, different stems etc. of the same word collected
– has chronologically fixed data (→ Index librorum)
But: double reduction process: texts > material in ThLL archive > material printed in word article

Inflection of Latin verbs
Matteo Pellegrini & Marco Passarotti, "LatInfLexi: an Inflected Lexicon of Latin Verbs", Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). CEUR Workshop Proceedings 2253.
URL: ceur-ws.org/Vol-2253/
• 254 possible forms x 3348 verbs = 850,392 paradigm cells
• 850,392 paradigm cells – 97,855 impossibles = 752,537 paradigm cells

Inflection of Latin verbs - reality
• 752,537 paradigm cells max
• Paul Tombeur, Thesaurus Formarum Totius Latinitatis (1998). CD-Rom
• TFTL

epoch                  unattested forms (%)
Antiquitas             544,395 (72.34%)
Aetas Patrum           482,324 (64.1%)
Medium Aeuum           484,421 (64.37%)
Recentior Latinitas    640,552 (85.12%)
all epochs             401,690 (53.38%)
(numbers from Pellegrini & Passarotti 2018)

can that be right?
which the ThLL has done (+Zettelarchiv!)

Length of article
[Chart: article length for docere, dicere, ducere, dicare, degenerare, diruere, in number of quotations and in kilobyte. Values shown: quotations 6231, 4234, 3506, 408, 178, 206; kilobyte 537k, 296k, 279k, 34k, 16k, 14k.]

Length of article & number of quotations w/ lemma
[Chart: same verbs. Values shown for quotations with lemma: 3202, 4087, 1608, 209, 78, 88.]

Length / quotations / different forms
[Chart: same verbs. Values shown for different forms: 172, 132, 132, 62, 34, 31; max forms: 254.]

Quotations and different forms
attested forms (Pellegrini & Passarotti): Antiquitas 28%
[Chart: different forms per verb against max forms 254. Labels shown: 172 (70%), 31 (12%, diruere).]

Attested forms – further research
How does Latin function?
– Which forms are used most frequently?
– Which paradigm positions degenerate (>1 orth.) – ThLL ‘De formis‘
– How often are ambiguous forms actually used? Avoided?
– Do verbs with Greek roots behave differently (limited paradigm)?
– ...
Integrate into the framework of variational linguistics:
diaphasic variation: genres~styles~registers (prose/poetry, iur., coins...)
diastratic variation: sociolects
diatopic variation: regional substrate languages (Greek, Celtic, ...)
diachronic variation: language change (ThLL gives first attestations)
Integrate with text databases

Collocations and the Thesaurus
In corpus linguistics, a collocation is a sequence of words or terms that co-occur more often than would be expected by chance. There are about six main types of collocations: adjective+noun, noun+noun (such as collective nouns), verb+noun, adverb+adjective, verbs+prepositional phrase (phrasal verbs), and verb+adverb. Collocation extraction is a computational technique that finds collocations in a document or corpus, using various computational linguistics elements resembling data mining. (Wikipedia)

Collocations in the Thesaurus: privatus (oppos.)
Collocations in the Thesaurus: primatus

What to do with collocations?
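One possible starting point, as a minimal sketch: the definition above ("co-occur more often than would be expected by chance") can be made concrete with pointwise mutual information (PMI), a common association measure that is an assumption here and not a method named on the slides. The function name `pmi_scores`, the `min_count` threshold and the token list are all invented for illustration; real input would be lemmatized text.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI compares how often a pair actually occurs with how often it
    would occur if the two words were independent of each other.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable scores
        p_pair = count / n_pairs
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Invented toy sample of lemmatized tokens, for illustration only.
tokens = (
    "spiritus sanctus et pater et spiritus sanctus "
    "et filius et pater et filius"
).split()

scores = pmi_scores(tokens)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 2))
```

On a real corpus the threshold matters: PMI tends to overrate pairs that occur only once or twice, which is why the sketch discards pairs below `min_count`.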
David Bamman & Gregory Crane, “Computational Linguistics and Classical Lexicography”, Changing the Center of Gravity: Transforming Classical Studies Through Cyberinfrastructure. Digital Humanities Quarterly 3.1 (2009). URL: http://www.digitalhumanities.org/dhq/vol/3/1/000033/000033.html

Collocations and meaning: spiritus

Latin context word    English        Probability spiritus = spirit
Sanctus               Holy           99.9%
Testis                Witness        99.9%
Vivifico              Make alive     99.9%
Omnipotens            All-powerful   99.9%

Latin context word    English        Probability spiritus = wind
Mons                  Mountain       98.3%
Commotio              Commotion      98.3%
Ventus                Wind           95.2%
Ala                   Wing           95.2%
(from Bamman & Crane 2009)

Collocations and meaning: the Thesaurus
promittere = (to send out), to let grow: capillus, barba, coma
promittere = to promise: caelum, victoria, astra : nihil

Putting numbers on lexical semantics
Imitari in
– Lewis & Short
– Oxford Latin Dictionary
– Thesaurus Linguae Latinae

Lewis & Short (1879) – 45 quotations/2 groups
Oxford Latin Dictionary (1982) – 62 quotations/6 groups
Thesaurus Linguae Latinae (1936) – 449 quotations/33 groups

imitari: number of examples, granularity of information

dictionary       examples   definitions   ex. / def.
Lewis & Short    45         2             23
OLD              62         6             10
TLL              449        33            13

Language statistics and social value system
Hypothesis: The number of attestations of a lexeme is a reflection of its importance within the value system of a culture
Latin is an inflected language --> lemmatization is needed to measure the frequency of lexemes --> ThLL
The length of an article in the ThLL reflects the frequency of the word's use in Latin texts
ThLL material is complete only for 'short words' (lexemes with few attestations)
The ThLL entry reduces the material proportionally to its length: the longer an entry, the more material has been left out

divitiae, dives, dito, divito, ditifico, ditesco, ditator, ditabilis, ditificus
luxuria/es, luxurio, luxus, luxuriosus, luxurius, luxuriator, luxurialis, luxuriamen
pauper, pauperculus, paupertinus, paupertatula, pauperesco, pauperto, pauperium, pauperasco, paupertas, pauperies, paupero

[Chart: divitiae – luxuria – opulentia – paupertas; vertical axis 0 to 900]

Frequency: Zipf's law
Zipf’s Law is a statistical distribution in certain data sets, such as words in a linguistic corpus, in which the frequencies of certain words are inversely proportional to their ranks. Named for linguist George Kingsley Zipf, who around 1935 was the first to draw attention to this phenomenon, the law examines the frequency of words in natural language and how the most common word occurs twice as often as the second most frequent word, three times as often as the subsequent word and so on until the least frequent word. The word in the position n appears 1/n times as often as the most frequent one.
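The rank-frequency relation in the definition above can be sketched in a few lines. This is only an illustration of the idealized 1/n rule: the function names `zipf_expected` and `rank_frequency_table` are hypothetical, and the word counts are invented and not taken from any corpus or from the ThLL.

```python
def zipf_expected(top_frequency, rank):
    """Frequency predicted by the idealized rule: the word at rank n
    occurs 1/n times as often as the most frequent word."""
    return top_frequency / rank


def rank_frequency_table(counts):
    """Rank words by observed frequency and pair each observed count
    with the count predicted by the 1/n rule."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    top_frequency = ranked[0][1]
    return [
        (rank, word, observed, zipf_expected(top_frequency, rank))
        for rank, (word, observed) in enumerate(ranked, start=1)
    ]


# Invented counts, for illustration only.
counts = {"et": 1200, "in": 610, "est": 395, "non": 300}

table = rank_frequency_table(counts)
for rank, word, observed, expected in table:
    print(rank, word, observed, expected)
```

Comparing the observed and predicted columns shows at a glance where a word list departs from the idealized curve.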
source (definition of Zipf's law): https://whatis.techtarget.com/definition/Zipfs-Law

Frequency: Zipf's law - English
source: https://phys.org/news/2017-08-unzipping-zipf-law-solution-century-old.html

Frequency: quotations per lemma
size of subst.-lemmata in ThLL vol. 3, 5.1, 5.2, 7.2, 9.2
number of quotations per lemma: max 6413 (dies)
number of examined ThLL-articles: 6043
[Chart: lemmata ranked by number of quotations; horizontal axis rank 1 to about 6000, vertical axis 0 to 7000 quotations; markers at 100, 50 and 10 quotations per lemma and at 1 quotation]

Future research
Combine quantitative and qualitative research (e.g. etym. dives < divinus: rich is being like a god)
ThLL has largely standardized language of description to describe semantic and syntactic phenomena: 'proprie', 'translate', 'cum colore', 'acc. praedic.' ...
Combine with the chronological data of the ThLL (Index librorum)

Future research
e.g. granularity of meaning: number of subdivisions in an article
assumption: the longer an article, the more subdivisions there will be
are there articles which are long and have fewer subdivisions than average? conclusion: formulaic use
are there articles which are short and have more subdivisions than average? conclusion: insecurity about meaning? repeated new formation of the word? - check diachronic and diatopic distribution
...
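The two questions about article length and subdivisions could be operationalized roughly as follows. This is a sketch under stated assumptions: "long" and "short" are taken to mean above or below the median number of quotations, and "average" the mean number of subdivisions; both thresholds are choices made here, not part of the slides. Apart from imitari (449 quotations, 33 groups, as given earlier), all entries are invented placeholders, and the function name `flag_articles` is hypothetical.

```python
from statistics import mean, median


def flag_articles(articles):
    """Split out the two kinds of outlier named in the research question.

    articles: list of (lemma, n_quotations, n_subdivisions) tuples.
    Returns (long_with_few_subdivisions, short_with_many_subdivisions).
    """
    median_length = median(q for _, q, _ in articles)
    mean_subdivisions = mean(s for _, _, s in articles)

    long_few = [
        lemma for lemma, q, s in articles
        if q > median_length and s < mean_subdivisions
    ]
    short_many = [
        lemma for lemma, q, s in articles
        if q < median_length and s > mean_subdivisions
    ]
    return long_few, short_many


# imitari uses the figures quoted for the ThLL article;
# lemma_b to lemma_e are invented placeholders.
articles = [
    ("imitari", 449, 33),
    ("lemma_b", 500, 4),
    ("lemma_c", 20, 3),
    ("lemma_d", 15, 12),
    ("lemma_e", 30, 2),
]

long_few, short_many = flag_articles(articles)
print("long, few subdivisions:", long_few)
print("short, many subdivisions:", short_many)
```

The flagged lemmata are only candidates: as the slide notes, each would still need to be checked against its diachronic and diatopic distribution.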