A roadmap for MT: four « keys » to handle more languages, for all kinds of tasks, while making it possible to improve quality (on demand)

International Conference on Universal Knowledge and Language (ICUKL2002), Goa, 25-29 November 2002
Christian Boitet
GETA, CLIPS, IMAG, 385 av. de la bibliothèque, BP 53, F-38041 Grenoble cedex 9, France
[email protected], http://clips.imag.fr/geta

Outline
• Basic concepts
    What is MT?
    Goals: Quality / User
    Architectures: Vauquois' triangle
• State of the art
    MT of texts: examples, problems
    MT of spoken dialogs
• The future of MT
    Goals
    4 keys

Ch. Boitet ICUKL2002, Goa, 25-29/11/2002

What is M(a)T?
• At least 3 types of automation
    MT = Machine Translation
    MAT = Machine Assisted Translation
    MAHT = Machine Aided Human Translation
• A scientific technology
    Informatics (computer science)
    Linguistics
    Mathematics

Goals: Quality / User
• Quality: rough, quick
    linguistically naive user: MT for access (special fields: atom, chemistry…; general information)
    linguistically specialized user: MT for translators (helps: lexicons, proposals from a translation memory…)
• Quality: from raw to very good
    linguistically naive user: MT for individual authors (with interactive disambiguation)
    linguistically specialized user: MT for revisors (posteditors) (raw MT, polishable)

Architectures: Vauquois' triangle
[Figure: Vauquois' triangle]

Architectures: Vauquois' triangle (larger)
[Figure: Vauquois' triangle, enlarged]

Formal intermediate structures
• Linguistic level(s): surface, deep; 1-level, n-level
• Linguistic main organization: syntagms (constituents), dependencies, logical and semantic relations
• Geometrical structure: string, structured string, chain graph (chart), tree structure, graph / network, hypergraph
• Algebraic structure: labels, Boolean features, structured attributes, feature structures
• Correspondence structure-text: concrete (text readable from structure) (almost all); abstract (Ariane-G5, Sygmart) (e.g. UNL)
• Scope: sentence, paragraph, page, document

How to produce an MT system
• Choose an architecture
• Program the "tools"
    Specialized languages for linguistic programming (SLLP)
    Development environment (MT shell)
• Build the "lingware"
    Lexical data / rules / weights
    Grammatical data / rules / weights
    Possible specialization to a typology ("sublanguage")
• How?
    Human work ± computer help / support
    Automatic learning (weights, likeliness…)

State of affairs
• Only a small number of language pairs is covered by MT systems designed for information access
    Systran EC (2000): 19/110 language pairs, 8 OK for intended use
    See also examples by Ronaldo Martins
• Even fewer are capable of quality translation or speech translation
• Now a few examples…

Examples: MT for access, Web (1)
Three versions of the same text: ENGLISH (human version), FRENCH (human version), ENGLISH (Systran FRE-ENG version)

ENGLISH (human version)
The European-Heritage.net thesaurus covers the fields of archaeology and architecture as defined in the Council of Europe conventions signed in Granada (1985) and Malta (1992). It encompasses information ranging from the partners involved, categories of cultural assets and legislation, to activities, skills and funding. It is supplemented by a number of specific thesauruses compiled by each member state on a particular topic, such as the thesaurus on Andalusian heritage or the architectural thesaurus from the Mérimée database in France. This new, open-ended search tool will come on line shortly, together with a management and administration system shared among the various contributors.
FRENCH (human version)
Le thesaurus European-Heritage.net couvre les champs de l'archéologie et de l'architecture au sens des conventions du Conseil de l'Europe de Grenade (1985) et de Malte (1992). Il prend en compte des aspects aussi variés que les acteurs, les catégories de biens culturels, la législation ou encore les interventions, les métiers et les financements. Il est complété et prolongé par des thesaurus spécifiques développés par chaque Etat membre sur tel ou tel sujet spécifique, comme le thesaurus du patrimoine historique andalou ou le thesaurus d'architecture de la base de données documentaire Mérimée en France. Cet instrument de recherche, forcément évolutif, sera mis prochainement en ligne accompagné d'un dispositif de gestion et d'administration réparti entre les différents contributeurs.

ENGLISH (Systran FRE-ENG version)
The European-Heritage.net thesaurus covers the fields of archaeology and architecture within the meaning of conventions of the Council of Europe of Grenade (1985) and Malta (1992). It takes into account aspects as varied as the actors, the categories of cultural goods, the legislation or the interventions, the trades and the financings. It is supplemented and prolonged by thesaurus specific developed by each Member State on such or such specific subject, like the thesaurus of the Andalusian historical inheritance or the thesaurus of architecture of the documentation data base Mérimée in France. This instrument of search, inevitably evolutionary, will be put soon on line accompanied by a device of management and administration distributed between the various contributors.

Examples: MT for access, Web (2)
• FE (French-English) is quite "easy" compared with EG (English-German), and above all with FG (French-German)

GERMAN (Systran ENG-GER version)
Der European-Heritage.netthesaurus umfaßt die Felder von archaeology und von Architektur, wie in den Europaratvereinbarungen definiert, die in Granada (1985) unterzeichnet werden und in Malta (1992). Er gibt die Informationen um, die von den betroffenen Partnern, von den Kategorien der kulturellen Werte und der Gesetzgebung, bis zu Aktivitäten, von den Fähigkeiten und von der Finanzierung reichen. Er wird durch eine Anzahl von den spezifischen Thesauren ergänzt, die durch jeden Mitgliedsstaat auf einem bestimmten Thema, wie dem Thesaurus auf Andalusian Erbe oder dem architektonischen Thesaurus von der Datenbank Mérimée in Frankreich kompiliert werden. Dieses neue, offene Suchhilfsmittel kommt auf Zeile kurz, zusammen mit einem Managementund Leitungssystem, das unter den verschiedenen Mitwirkenden geteilt wird.

GERMAN (Systran FRE-GER version)
Der European-Heritage.net-Thesaurus bedeckt die Felder der Archäologie und der Architektur im Sinne der Übereinkommen des Europarats von Granada (1985) und von Malta (1992). Er berücksichtigt Aspekte dermaßen variierte, daß die Beteiligten, die Kategorien kultureller Güter, die Gesetzgebung oder noch die Interventionen, die Berufe und die Finanzierungen. Er wird vervollständigt und wird durch ein spezifische Thesaurus entwickelt durch jeder Mitgliedstaat über das eines oder andere spezifische Thema verlängert, als der Thesaurus des andalusischen historischen Kulturgutes oder der Thesaurus der Architektur der urkundlichen Datenbank Mérimée in Frankreich. Dieses notgedrungen entwicklungsfähige Forschungsinstrument wird gestellt demnächst online begleitet von einer Verwaltungs- und Verwaltungsvorrichtung, die aufgeteilt unter den verschiedenen Beitragenden.

Comparison: raw vs rough MT
Two raw Spanish-English outputs of the same text are compared: SpanAm and Reverso.

SpanAm raw Spanish-English output
Message of the Director-General of the World Health Organization
From its discovery, antibiotics have completely transformed the perspective of humankind with respect to infectious diseases.
Today the use of antibiotics, combined with improvements in sanitation, housing, and nutrition, together with the advent of the vaccination programs generalized, have caused a notable reduction of infectious diseases that previously were common and annihilated entire populations. Scourges that terrified millions of people, as plague, whooping cough, poliomyelitis, and the scarlatina, have been controlled or are on the verge of being controlled. Now, in the dawn of a new millennium, humankind faces another crisis. Previously curable diseases as the gonorrhea and typhoid fever are becoming rapidly difficult to treat, while old assassins as tuberculosis and malaria now are armed of the increasingly impenetrable resistance to the antimicrobial drugs. This phenomenon is potentially contenible. The problem is increasingly profound and complex, accelerated by the abuse of antibiotics in the developed countries and the paradoxical underutilization of the quality antimicrobial drugs in the developing countries due to the poverty and to the scarcity resulting from an effective health care.

Reverso raw Spanish-English output
Message of the Chief operating officer of the World Organization of the Health
From his{*its*} discovery, the antibiotics have transformed completely the perspective of the humanity with regard to the infectious diseases. Today the use of the antibiotics, cocktail with improvements in the reparation, the housing and the nutrition, together with the advent of the programs of widespread vaccination, they have given place to a notable decrease of infectious diseases that before were common and were annihilating entire populations. Scourges that terrified million persons, as the pest, the savage cough, the poliomyelitis and the scarlatina, they have been controlled or are on the verge of be controlling. Now, in the dawn of a new millenium, the humanity faces with another crisis. Diseases before curable as the gonorrhea and the fever tifoidea they are becoming rapidly difficult to treat, whereas killer old men as the tuberculosis and the malaria are armed{*assembled*} now with the increasing impenetrable resistance the antimicrobial ones. This phenomenon is potentially contenible. The problem is increasingly deep and complex, accelerated by the abuse of the antibiotics in the developed countries and the paradoxical subutilization of the antimicrobial ones of quality in the countries in development due to the poverty and the resultant shortage of an attention of effective health.

Examples: MT for revisors…
[Figure]

…with BV-aero/FE (2)
[Figure]

MT of spoken dialogs
• Specialized systems are already usable
    e.g. ATR/Matsushita, IBM, CSTAR/Nespole!…
    Much "noise" and "ungrammaticalities"
    But specializing is very helpful!
• General systems are also possible
    e.g. NEC/Xroad, Linguatec/Talk&Translate
    Speech recognition is already good enough
    Rough may be good enough (e.g. for chatting)
• Interpretation is different from translation…
    …and participants are intelligent!
    Similarity with access-oriented MT

French-Korean through IF (1)
[Figure]

French-Korean through IF (2)
[Figure]

French-Korean through IF (3)
[Figure]

A road map… to which goals?
• MT of adequate quality
• Not only for access
• For all languages

Four keys
• 2 on the technical side
    Compromise: a far wider coverage, a somewhat smaller asymptotic quality
    • Automatic learning techniques
    • Using non-textual pivots (intermediate formal descriptors)
• 2 on the organizational side
    Democratization, cooperation
    • Cooperative development of open source linguistic resources on the Web
    • Towards systems where quality can be improved "on demand" by users

Learning techniques
• Extend the use of hybrid techniques
    symbolic, numerical, or mixed
    ==> they have demonstrated their potential at the research level
    • stochastic grammars
    • weighted (or "neural") dictionaries
• or build new tools, intrinsically numerical
    inspired by speech recognition
• 2 examples
    learning analyzers: text -> semantic tree (IBM)
    learning an implicit, very detailed dependency grammar (DG) from a tree bank (NAIST)

Using non-textual pivots
• Semantico-pragmatic (ontological) pivots
    task & domain oriented ==> limited applicability
• Abstract linguistic descriptors
    the most precise, but often too sophisticated
    depend on each language
• Anglo-semantic pivot: UNL, "the HTML of linguistic content"
    • in UNL, a hypergraph represents the abstract structure of a (supposedly) equivalent English utterance
    less precise but "robust"
    symbols constructed from English ==> usable by all developers

A simple UNL graph
[Figure: UNL graph of the sentence below]
    Nodes: score(icl>event,agt>human,fld>sport).@entry.@past.@complete; Ronaldo(icl>proper noun); head(pof>body).@def; corner(icl>thing).@def; goal(icl>abstract thing); goal(icl>concrete thing); left(aoj<thing)
    Relation labels on the arcs: agt, obj, ins, plt, pos (twice), mod
• Ronaldo has headed the ball into the left corner of the goal
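
As a minimal sketch (not an official UNL format or API), the graph on this slide can be stored as a list of labelled arcs between universal words. The node labels and relation names come from the slide. Which two nodes each arc connects is an assumption, because the drawing itself did not survive extraction, and the names `Node`, `UNLGraph` and `ronaldo_graph` are hypothetical. A full UNL hypergraph also lets a subgraph act as a node, which this sketch leaves out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A universal word (UW) plus its attributes, e.g. ('@entry', '@past')."""
    uw: str
    attrs: tuple = ()

    def __str__(self):
        return self.uw + "".join("." + a for a in self.attrs)


@dataclass
class UNLGraph:
    """A plain labelled directed graph; scopes (hyper-nodes) are not modelled."""
    arcs: list = field(default_factory=list)  # (relation, source, target)

    def add(self, rel, src, dst):
        self.arcs.append((rel, src, dst))

    def relations_from(self, node):
        """Relation labels on the arcs leaving `node`, in insertion order."""
        return [rel for rel, src, _ in self.arcs if src == node]

    def dump(self):
        """One 'rel(source, target)' string per arc."""
        return [f"{rel}({src}, {dst})" for rel, src, dst in self.arcs]


def ronaldo_graph():
    """Build the slide's example; the arc endpoints are ASSUMED, not sourced."""
    score = Node("score(icl>event,agt>human,fld>sport)",
                 ("@entry", "@past", "@complete"))
    ronaldo = Node("Ronaldo(icl>proper noun)")
    head = Node("head(pof>body)", ("@def",))
    corner = Node("corner(icl>thing)", ("@def",))
    goal_abstract = Node("goal(icl>abstract thing)")
    goal_concrete = Node("goal(icl>concrete thing)")
    left = Node("left(aoj<thing)")

    g = UNLGraph()
    g.add("agt", score, ronaldo)         # who scores
    g.add("obj", score, goal_abstract)   # what is scored
    g.add("ins", score, head)            # instrument
    g.add("plt", score, corner)          # final place
    g.add("pos", head, ronaldo)          # whose head
    g.add("pos", corner, goal_concrete)  # corner of which goal
    g.add("mod", corner, left)           # which corner
    return g, score


graph, entry = ronaldo_graph()
for line in graph.dump():
    print(line)
```

Keeping the arcs in a flat list mirrors the way UNL expressions are usually written, as one binary relation per line, and makes it easy to correct a single relation without touching the rest of the graph.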

Cooperative development
• of open source linguistic resources
• on the Web
Sharing (mutualization) is necessary, at least for lexical knowledge
    too costly even for the leaders
    size (#entries) has to grow for each language (300K, 3M?)
    #languages has to increase dramatically (11 -> 20 -> 180?)
Integration of human- and machine-oriented knowledge is useful
    e.g. to produce mixed MT/MAHT systems

A contribution: the Papillon project
• Goal: produce many open source dictionaries from a central lexical database
• Means:
    build rich (DiCo) monolingual dictionaries of lexies (senses)
    interlink lexies by interlingual links (axies)
    use XML & associated tools as a basis to generate many formats
    • for humans and for machines
    start from (free) digital resources
    induce "consumers" to become "producers" (contributors)
• Quality control:
    private accounts
    central validating/integrating group

Papillon database macrostructure
[Figure. Elements: Users; Interaction with the dictionaries; Dictionaries; Extraction of dictionaries; Lexical database; Human contributors; Integration of existing resources; Resources]

PAPILLON diagram
[Figure. Monolingual DiCo dictionaries (French, English, Japanese, Thai) linked through interlingual acceptions]
    French DiCo: Vocable carte n.f.; Lexie carte.1 (carte à jouer); Lexie carte.2 (carte géographique)
    Engl. DiCo: Vocable card N; Lexie card.1 (playing card); Lexie card.2 (money card)
    Japan. DiCo: カード, 地図
    Thai DiCo
    Interlingual links:
        Acception 343, UNL: card(icl>play), card(icl>thing)…
        Acception 345, UNL: map(fld>geography)
        Acception 1002, UNL: card(fld>money)
    Interlingual links based on translations = "AXIEs"
    Possibility to link 1 lexie with >1 acceptions
    References to other semantic systems: AXIE (1) -> (n) UW
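
The lexie/axie organization of the Papillon diagram can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the Papillon project's actual schema or XML format: the class `LexicalDB`, its methods, and the language codes are hypothetical. The lexies and acception numbers are those of the diagram, but which lexie is attached to which acception is assumed, since the links were drawn as lines that did not survive. The English glosses of the Japanese entries are also added here for readability.

```python
from collections import defaultdict


class LexicalDB:
    """Monolingual lexies linked through interlingual acceptions (axies)."""

    def __init__(self):
        self.gloss = {}                   # (lang, lexie) -> gloss
        self.axies_of = defaultdict(set)  # (lang, lexie) -> {axie ids}
        self.members = defaultdict(set)   # axie id -> {(lang, lexie)}
        self.uws = {}                     # axie id -> UNL universal words

    def add_lexie(self, lang, lexie, gloss, axies):
        """Register a lexie and attach it to one or more axies."""
        self.gloss[(lang, lexie)] = gloss
        for axie in axies:
            self.axies_of[(lang, lexie)].add(axie)
            self.members[axie].add((lang, lexie))

    def translations(self, lang, lexie, target):
        """Lexies of `target` sharing at least one axie with the given lexie."""
        found = {lx
                 for axie in self.axies_of[(lang, lexie)]
                 for (lg, lx) in self.members[axie]
                 if lg == target}
        return sorted(found)

    def bilingual(self, source, target):
        """Extract a source -> target dictionary from the central database."""
        return {lx: self.translations(lg, lx, target)
                for (lg, lx) in list(self.gloss) if lg == source}


db = LexicalDB()
db.uws = {
    343: ["card(icl>play)", "card(icl>thing)"],  # the slide's list continues
    345: ["map(fld>geography)"],
    1002: ["card(fld>money)"],
}
# Lexie-to-axie attachments below are ASSUMED for illustration.
db.add_lexie("fra", "carte.1", "carte à jouer", [343])
db.add_lexie("fra", "carte.2", "carte géographique", [345])
db.add_lexie("eng", "card.1", "playing card", [343])
db.add_lexie("eng", "card.2", "money card", [1002])
db.add_lexie("eng", "map", "map", [345])
db.add_lexie("jpn", "カード", "card", [343, 1002])  # 1 lexie, more than 1 axie
db.add_lexie("jpn", "地図", "map", [345])

print(db.bilingual("fra", "eng"))
print(db.translations("jpn", "カード", "eng"))
```

Because no lexie points directly at a lexie of another language, adding a new language only requires linking its lexies to existing axies, after which every bilingual dictionary involving that language can be extracted from the same database.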

PAPILLON diagram (continued)
    Engl. DiCo: Vocable=lexie map

Construct systems where quality can be improved "on demand" by users
• a priori, through interactive disambiguation in the source language
• or a posteriori, by correcting the pivot representation (UNL or other) through any language (as in MultiMeteo)
    ==> In both cases, all versions (in all languages) are improved
• possibility to merge
    MT
    multilingual generation
    computer-aided authoring

Conclusion
• 4 keys to open the door to MT of adequate quality for all languages
• On the technical side,
    dramatically increase the use of learning techniques
    use pivot architectures, the most universally usable pivot being UNL
• On the organizational side,
    cooperatively develop open source linguistic resources on the Web
    construct systems where quality can be improved "on demand" by users
• On the practical side, seek keys to unlock
    private investment, public funding, voluntary cooperation
    could this conference become a decisive turning point?
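
The recommendation to use pivot architectures can be made concrete with a short count of how many translation components each architecture needs as the number of languages grows. The sketch assumes one transfer system per directed language pair, and one analyzer plus one generator per language for a pivot such as UNL. The function names are hypothetical, and the language counts are the "11 -> 20 -> 180?" figures quoted in the talk. Reading the "110" of the Systran EC figure as the 11 x 10 directed pairs among 11 languages is also an assumption.

```python
def transfer_modules(n_languages: int) -> int:
    """Directed language pairs: one transfer system per ordered pair."""
    return n_languages * (n_languages - 1)


def pivot_modules(n_languages: int) -> int:
    """One analyzer into the pivot and one generator out of it per language."""
    return 2 * n_languages


def comparison(sizes=(11, 20, 180)):
    """(languages, directed pairs, pivot modules) for each language count."""
    return [(n, transfer_modules(n), pivot_modules(n)) for n in sizes]


for n, pairs, modules in comparison():
    print(f"{n:>3} languages: {pairs:>6} directed pairs, {modules:>4} pivot modules")
```

With 11 languages the count is 110 directed pairs against 22 pivot modules, and the gap widens quadratically from there. This ties in with the trade-off stated on the "Four keys" slide: a far wider coverage in exchange for a somewhat smaller asymptotic quality.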