KONVENS Wien, 15 Sep 2004
EXMARaLDA – A modeling and visualization framework for the computer-assisted transcription of spoken language
Thomas Schmidt
SFB 538 “Mehrsprachigkeit”, University of Hamburg

Background
• Multilingual Database, SFB 538 “Mehrsprachigkeit” (Multilingualism), University of Hamburg
• EXMARaLDA (Extensible Markup Language for Discourse Annotation)
• Dissertation project “Computer-based transcription of spoken language as a modelling and visualisation process” (Supervisor: Angelika Storrer)

Background
• Transcription of spoken language
– Interviewer / child interaction
– Classroom interaction
– Interpreted doctor-patient discourse
– for discourse / conversation analysis
– for (child) language acquisition studies

Background
• Problem: Diversity of Transcription Data
– Theoretical diversity:
  • Entities of transcription (utterances, turns, non-verbal activities etc.)
  • Relations between entities (temporal, hierarchical, features, ...)
  • Presentation formats (partitur notation, column notation, ...)
– Technological diversity:
  • Storage formats (text, binary, RDB)
  • Software (syncWriter, HIAT-DOS, DBM-Systems, word processors, ...)
  • Operating Systems (Windows, Mac OS)
• Aim: A common platform for computer-assisted transcription
– Exchange, reuse, archive transcription data
– Merge corpora
– Use different software tools with one piece of data
• (Elements of a) Solution
– XML technology
– Three level architecture
– Separate form from content
– Separate logical from physical structure

Topics of this talk
1. Some methodological considerations:
– Linguistic methods
– Computer science methods
– “Computing in the humanities”
– Interdisciplinary communication
2. Components of the developed system

Methodological considerations
[Diagram; its layout is not recoverable from the transcript. Labels: Transcript, Computer; Transcription as..., Quality criteria; “Verschriftlichung”, Theory; Readability, Adequacy; Visualisation, Modelling; Analogue model, Symbolic model; Application vs. Logical layer; E/R model; Document...; View; Form, Content; Established view, Modified view, Model theory view, Database view, Text technology view]

Methodological considerations
• Transcription as Modeling and Visualization of spoken language
– Accordance with text-technological concepts
– One model, different visualizations
– No tradeoff between readability and adequacy
– No tradeoff between human and computer processability
• No “Standardization” of models
– a common modelling framework, not a common model
– no ontological specifications
– XML = Standardization of physical representation

Visualization to Model
Structural relations:
1. Temporal sequence
2. Simultaneity
3. Equivalence (Entity Feature)
4. Hierarchy (Containment)

Modeling framework
• Relational? Sequence? Simultaneity?
• OHCO? Simultaneity?
• DAG: Annotation Graphs? Complexity?
  Transcription Graphs

System architecture

Application: Input tools
• EXMARaLDA Partitur-Editor
• Simple EXMARaLDA Text file
• TASX annotator
• PRAAT
• EUDICO Linguistic Annotator (ELAN)

Application: Visualization
• ... as a wrapped partitur
• ... as a line transcript
• ... in column notation

Application: Corpus management
• EXMARaLDA Corpus Manager (COMA)

Application: Query/Analysis
• Search and Query Instrument for EXMARaLDA (SQUIRREL)

Project status
• Software past beta stage
• Five projects at our own institution use EXMARaLDA for their corpus work
• Around 800 users in research and teaching outside SFB
• Used at the IDS in Mannheim
• Submitted a suggestion for integration of the data model into P5 of the TEI Guidelines

Summary
• Transcription as theory and “Verschriftlichung” (rendering speech in writing)
• Computer-assisted transcription as modelling and visualisation
• Interdisciplinary bridge / Methodology of computational techniques in “classical” linguistics
• Concrete practical improvements for work with transcription data
• EXMARaLDA and Database “Multilingualism”
• Data model, formats and tools building on the separation of model and visualisation

Fin.
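
The idea of “one model, different visualizations” can be illustrated with a small sketch. It is not EXMARaLDA's actual data model or file format: the names `Event`, `Tier`, `render_partitur` and `render_lines`, and the sample dialogue, are hypothetical and chosen only for illustration. The sketch assumes a single shared timeline of numbered points, with each event anchored between two points and no overlapping events within a tier. Temporal sequence then follows from the point order, and simultaneity from events in different tiers sharing a start point.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    start: int  # index of the timeline point where the event begins
    end: int    # index of the timeline point where the event ends
    text: str


@dataclass
class Tier:
    speaker: str
    events: list = field(default_factory=list)


def column_widths(tiers, n_points):
    """Give every timeline interval enough width for the events spanning it."""
    widths = [0] * (n_points - 1)
    events = sorted((e for t in tiers for e in t.events), key=lambda e: e.end)
    for e in events:
        # Widen the last spanned interval if the text (plus one space) does not fit.
        missing = len(e.text) + 1 - sum(widths[e.start:e.end])
        if missing > 0:
            widths[e.end - 1] += missing
    return widths


def render_partitur(tiers, n_points):
    """One row per tier; simultaneous events are vertically aligned."""
    widths = column_widths(tiers, n_points)
    offsets = [0]
    for w in widths:
        offsets.append(offsets[-1] + w)
    label_width = max(len(t.speaker) for t in tiers) + 2
    rows = []
    for t in tiers:
        row = [" "] * offsets[-1]
        for e in t.events:
            pos = offsets[e.start]
            row[pos:pos + len(e.text)] = e.text
        rows.append((t.speaker + ":").ljust(label_width) + "".join(row).rstrip())
    return "\n".join(rows)


def render_lines(tiers):
    """One line per event, ordered by start point, then by tier order."""
    items = sorted(
        (e.start, i, t.speaker, e.text)
        for i, t in enumerate(tiers)
        for e in t.events
    )
    return "\n".join(f"{speaker}: {text}" for _, _, speaker, text in items)


# Invented sample data: interviewer (INT) and child (CHI) on a four-point timeline.
N_POINTS = 4
tiers = [
    Tier("INT", [Event(0, 1, "So you"), Event(1, 2, "went home?")]),
    Tier("CHI", [Event(1, 2, "yes"), Event(2, 3, "I did")]),
]

if __name__ == "__main__":
    print(render_partitur(tiers, N_POINTS))
    print()
    print(render_lines(tiers))
```

Both renderings read the same `tiers` object. The partitur view aligns the child's “yes” under the interviewer's “went home?”, while the line view flattens the same events into a sequence, so the choice of presentation does not change the underlying data.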