Punjabi Text Generation using Interlingua
approach in Machine Translation
A thesis
Submitted in partial fulfillment of the
requirements for the award of the degree
of
Master of Engineering
in
Software Engineering
Under the Supervision of
Dr. R. K. Sharma
Asst. Professor
School of Mathematics & Computer Applications
Thapar Institute of Engineering and Technology, Patiala
Submitted By
SACHIN KALRA
(8023114)
Computer Science & Engineering Department
Thapar Institute of Engineering & Technology
(Deemed University), Patiala-147004 (India)
June 2004
Declaration
I hereby certify that the work which is being presented in the thesis entitled,
“Punjabi Text Generation using Interlingua approach in Machine Translation”, in
partial fulfillment of the requirements for the award of the degree of Master of Engineering
in Software Engineering submitted in Computer Science and Engineering Department
of Thapar Institute of Engineering and Technology (Deemed University), Patiala, is an
authentic record of my own work carried out under the supervision of Dr. R. K. Sharma.
The matter presented in this thesis has not been submitted by me for the award of
any other degree of this or any other University.
SACHIN KALRA
This is to certify that the above statement made by the candidate is correct and true to the best of my knowledge.
Dr. R. K. Sharma
Asst. Professor
School of Mathematics & Computer Applications
Thapar Institute of Engg. & Technology, Patiala
Countersigned by
(Dr. D.S. Bawa)
Dean (Academic Affairs)
Thapar Institute of Engg. & Technology,
Patiala.
(Ms. Seema Bawa)
Assistant Professor & Head,
Computer Sc. & Engg. Department,
Thapar Institute of Engg. & Technology,
Patiala.
Acknowledgement
A journey is easier when traveled together. Interdependence is certainly more valuable than
independence. This thesis is the result of work carried out during the final year of my
course whereby I have been accompanied and supported by many people. It is a pleasant
aspect that I have now the opportunity to express my gratitude for all of them.
No amount of words can adequately express the debt I owe to Dr. R. K. Sharma, Assistant Professor, School of Mathematics & Computer Applications, for his kind support, motivation and inspiration that spurred me on in the thesis work. I owe him a great deal of gratitude for having shown me this path of research.
I wish to express my gratitude to Ms. Seema Bawa, Assistant Professor & Head, Computer Science & Engineering Department, for her excellent guidance and encouragement right from the beginning of this course. I am also thankful to all the faculty and staff members of the Computer Sc. & Engg. Department for providing me with all the facilities required for the completion of this work.
No thesis could be written without being influenced by the thoughts of others. I would like to thank my friends Harsimran Singh and Surinder Pal Singh, who were always there in the hour of need and provided all the help and support I needed. I am grateful to my brother Deepak Kalra, who helped me with his kind suggestions.
Last but not least, I would like to thank "The Creator of Destinies" for not letting me down in times of crisis and showing me the silver lining in the dark clouds.
SACHIN KALRA
(8023114)
Abstract
The scientific art of Machine Translation (MT) is the attempt to automate all, or part, of the process of translating from one human language to another. At its simplest, translation is nothing more than word substitution (determined by the dictionary) and reordering (determined by reordering rules). However, translating a text well requires not only a good knowledge of the vocabulary of both the source and target languages, but also of their grammar, i.e. the system of rules which specifies whether a sentence is well-formed in a particular language or not. Additionally, it requires some element of real-world knowledge — knowledge of the nature of things out in the world and how they work together — and technical knowledge of the text's subject area.
Interlingua and transfer based approaches to machine translation have long been in use in competing and complementary ways. The former proves economical in situations where translation among multiple languages is involved, while the latter is used for pair-specific translation tasks. The additional attraction of an interlingua is that it can be used as a knowledge representation scheme. But given a particular interlingua, its adoption depends on its ability to (a) capture the knowledge in texts precisely and accurately and (b) handle cross-language divergences.
The aim of this thesis is to design a Machine Translation (MT) system which translates sentences from an interlingual representation of English sentences into Punjabi sentences. The input to the system is an interlingual representation that follows the structure of the family of target Indian languages. The interlingual form is a knowledge representation that contains most of the semantic information needed to construct the text in the Punjabi language. The generator takes an interlingual representation of meaning as input and produces a sentence with that meaning as output in the Punjabi language.
The implementation is done in the C language on the Windows platform. The output sentences are written in Punjabi, with appropriate changes and certain assumptions.
CONTENTS
Certificate ....................................................................... i
Acknowledgement ................................................................... ii
Abstract .......................................................................... iii
List of Figures ................................................................... vii
Chapter 1: Introduction ........................................................... 1
1.1 Introduction to Artificial Intelligence ....................................... 1
1.2 Introduction to Natural Language Processing (NLP) ............................. 3
1.3 Applications of NLP ........................................................... 7
Chapter 2: Machine Translation .................................................... 9
2.1 Definition .................................................................... 9
2.2 Types of Machine Translation .................................................. 10
2.2.1 Machine-Aided Human Translation ............................................. 10
2.2.2 Human-Aided Machine Translation ............................................. 10
2.2.3 Fully-automated Machine Translation (FAMT) .................................. 11
2.3 Historical Review of MT ....................................................... 11
2.3.1 Before the computer ......................................................... 11
2.3.2 The pioneers, 1947-1954 ..................................................... 12
2.3.3 The decade of optimism, 1954-1966 ........................................... 12
2.3.4 The aftermath of the ALPAC report, 1966-1980 ................................ 13
2.3.5 The 1980s ................................................................... 14
2.3.6 The early 1990s ............................................................. 15
2.3.7 The late 1990s .............................................................. 15
2.4 Various Strategies to Machine Translation ..................................... 16
2.4.1 Direct MT system ............................................................ 17
2.4.2 Indirect MT system .......................................................... 19
2.4.3 Knowledge-based MT (KBMT) ................................................... 22
2.4.4 Example-Based Machine Translation (EBMT) .................................... 24
2.4.5 Statistical MT .............................................................. 27
2.4.6 Hybrid Machine Translation Paradigms ........................................ 29
2.5 What makes MT so difficult? ................................................... 30
2.5.1 Linguistic Problems ......................................................... 31
Chapter 3: Role of Interlingua in Machine Translation ............................. 38
3.1 Interlingua ................................................................... 38
3.2 Machine Translation with and without an Interlingua ........................... 39
3.3 Advantages of Translating with an Interlingua ................................. 39
3.4 Grain Size of Meaning: The Challenge of Interlingua Design .................... 43
Chapter 4: Angla Bharti System Overview ........................................... 44
4.1 System Overview ............................................................... 44
4.2 PLIL: Pseudo-Lingua for Indian Languages ...................................... 49
4.2.1 PLIL Structure .............................................................. 51
4.2.2 Examples .................................................................... 53
Chapter 5: Implementation ......................................................... 55
5.1 Why Morphological Analysis? ................................................... 55
5.2 Morphological Generation Using Paradigms ...................................... 57
5.2.1 Algorithm: Forming paradigm table ........................................... 58
5.2.2 Algorithm: Generating a word form ........................................... 59
5.3 The Generator Module .......................................................... 59
5.3.1 Introduction to Punjabi Language ............................................ 60
5.3.2 PLIL Examples ............................................................... 61
5.4 Results ....................................................................... 64
Chapter 6: Conclusion and Future Scope ............................................ 66
6.1 Conclusion .................................................................... 66
6.2 Future Scope .................................................................. 67
Chapter 1
Introduction
1.1 Introduction to Artificial Intelligence
Artificial Intelligence (AI) is the branch of Computer Science that is primarily concerned
with the ability of machines to adapt and react to different situations as humans do. In order
to achieve artificial intelligence, we must first understand the nature of human intelligence.
Human intelligence is a behavior that incorporates a sense of purpose in actions and
decisions. Intelligent behavior is not a static procedure. Learning, defined as behavioral
changes over time that better fulfill an intelligent being’s sense of purpose, is a fundamental
aspect of intelligence. An understanding of intelligent behavior will be realized when either
intelligence is replicated using machines, or conversely when we prove why human
intelligence cannot be replicated.
In an attempt to gain insight into intelligence, researchers have identified three processes
that comprise intelligence: searching, knowledge representation, and knowledge
acquisition. The field of AI can be broken down into five smaller components, each of
which relies on these three processes to be performed properly. They are: game playing,
expert systems, neural networks, natural language processing, and robotics programming
[2] [28].
Game playing is concerned with programming computers to play games, such as chess,
against human or machine opponents. This sub-field of AI relies mainly on the speed and
computational power of machines. Game playing is essentially a search problem because
the machine is required to consider a multitude of possibilities. While the computational
power of machines is greater than that of the human brain, machines are unable to solve
search problems perfectly because the size of the search space grows exponentially with the
depth of the search, making the problem intractable.
Expert systems are programmed systems that allow trained machines to make decisions
within a very limited and specific domain. Expert systems rely on a huge database of
information, guidelines, and rules that suggest the correct decision for the situation at hand.
Although they mainly rely on their working memory and knowledge base, the systems must
make some inferences. The vital importance of storing information in the database in a
manner such that the computer can “understand” it creates a knowledge representation
problem.
The field of neural networks, inspired by the human brain, attempts to accurately define learning
procedures by simulating the physical neural connections of the human brain. A unique
aspect of this field is that the networks change by themselves, adapting to new inputs with
respect to the learning procedures they have previously developed. The learning procedures
can vary and incorporate many different forms of learning, which include learning by
recording cases, by analyzing differences, or by building identity trees (trees that represent
hierarchical classification of data).
Natural Language Processing (NLP) and robotics programming are two fields that simulate
the way humans acquire information, an integral part of intelligence formation. The two are
separate sub-fields because of the drastic difference in the nature of their inputs. Language,
the input of NLP, is a more complex form of information to process than the visual and
tactile input of robotics. Robotics typically transforms its input into motion, whereas NLP
has no such state associated transformation.
A perfection of each of the sub-fields is not necessary to replicate human intelligence
because a fundamental characteristic of humans is to err. However, it is necessary to form a
system that puts these components together in an interlocking manner, where the outputs of
some of these fields should be inputs for others, to develop a high-level system of
understanding. To date, this technology does not exist over a broad domain.
1.2 Introduction to Natural Language Processing (NLP)
One of the most widely researched applications of Artificial Intelligence is Natural
Language Processing.
NLP’s goal, as previously stated, is to determine a system of
symbols, relations and conceptual information that can be used by computer logic to
communicate with humans. This implementation requires the system to have the capacity
to translate, analyze and synthesize language. With the goal of NLP well defined, one must
clearly understand the problem of NLP. Natural language is any human “spoken or written
language governed by sets of rules and conventions sufficiently complex and subtle enough
for there to be frequent ambiguity in syntax and meaning.” The processing of language
entails the analysis of the relationship between the mental representation of language and its
manifestation into spoken or written form [16].
Humans can process a spoken command into its appropriate action. We can also translate
different subsets of human language (e.g. English to Hindi).
If the results of these
processes are accurate, then the processor (the human) has understood the input. The main
tasks of artificial NLP are to replace the human processor with a machine processor and to
get a machine to understand the natural language input and then transform it appropriately.
Currently, humans have learned computer languages (e.g. C, Perl, and Java) and can
communicate with a machine via these languages. Machine languages (MLs) are a set of
instructions that a computer can execute. These instructions are unambiguous and have
their own syntax, semantics and morphology. The main advantage of machine languages, and the major difference between MLs and NLs, is MLs' unambiguous nature, which is derived from their mathematical foundation. They are also easier to learn because their
grammar and syntax are constrained by the finite set of symbols and signals. Developing a
means of understanding (a compiler) for these languages is remarkably easy compared to
the degree of difficulty of developing a means of understanding for natural languages.
An understanding of natural languages would be much more difficult to develop because of the numerous ambiguities and levels of meaning in natural language. The ambiguity of
language is essentially why NLP is so difficult. There are five main categories into which
language ambiguities fall: syntactic, lexical, semantic, referential and pragmatic [12].
The syntactic level of analysis is strictly concerned with the grammar of the language and
the structure of any given sentence. A basic rule of the English language is that each
sentence must have a noun phrase and a verb phrase. Each noun phrase may consist of a
determiner and a noun, and each verb phrase may consist of a verb, preposition and noun
phrase. There are various different valid syntactic structures, and rules such as this, make
up the grammar of a language and must be represented in a concrete manner for the
computer. Secondly, there must exist a parser, which is a system that determines the
grammatical structure of an input sentence by comparing it to the existing rules. A parser
must break the input down into words and determine by categorizing each word if the
sentence is grammatically sound.
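To make the parsing step concrete, here is a minimal sketch in C of a recursive-descent parser for the toy grammar just described (a sentence is a noun phrase followed by a verb phrase). The grammar and the tiny word lists are illustrative assumptions, not material from the thesis:

```c
#include <stdio.h>
#include <string.h>

/* Toy grammar: S -> NP VP, NP -> (Det) Noun, VP -> Verb (NP).
   Word lists are illustrative assumptions, not a real lexicon. */
static const char *dets[]  = { "the", "a", NULL };
static const char *nouns[] = { "boy", "book", "duck", NULL };
static const char *verbs[] = { "reads", "sees", NULL };

static const char *toks[16];
static int ntoks, pos;

static int in_list(const char *w, const char **list) {
    for (int i = 0; list[i]; i++)
        if (strcmp(w, list[i]) == 0) return 1;
    return 0;
}

/* NP -> (Det) Noun */
static int parse_np(void) {
    if (pos < ntoks && in_list(toks[pos], dets)) pos++;
    if (pos < ntoks && in_list(toks[pos], nouns)) { pos++; return 1; }
    return 0;
}

/* VP -> Verb (NP) */
static int parse_vp(void) {
    if (pos < ntoks && in_list(toks[pos], verbs)) {
        pos++;
        int save = pos;
        if (!parse_np()) pos = save;  /* the NP after the verb is optional */
        return 1;
    }
    return 0;
}

int main(void) {
    char line[] = "the boy reads the book";
    for (char *t = strtok(line, " "); t; t = strtok(NULL, " "))
        toks[ntoks++] = t;
    pos = 0;
    /* S -> NP VP, and every token must be consumed */
    int ok = parse_np() && parse_vp() && pos == ntoks;
    printf("\"the boy reads the book\" is %s\n",
           ok ? "grammatically sound" : "ungrammatical");
    return 0;
}
```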
The lexical level of analysis concerns the meanings of the words that comprise each
sentence. Ambiguity increases when a word has more than one meaning (homonyms). For
example “duck” could either be a type of bird, or an action involving bending down. Since
these two meanings have different grammatical categories (noun and verb) the issue can be
resolved by syntactic analysis. The sentence's structure will be grammatically sound with one of these parts of speech in place. From this information, a machine can determine the definition that appropriately conveys the sense of the word within the sentence. However,
this process does not resolve all lexical ambiguities. Many words have multiple meanings
within the same part of speech, or a part of speech can have sub-categories that also need to
be analyzed. The verb “can” can be considered an auxiliary verb or a primary verb. If it is
to be considered a primary verb, it can convey different meanings. The primary verb “can”
can either mean "to fire (someone)" or "to put something into a container". In order to
resolve these ambiguities we must resort to semantic analysis.
The semantic level of analysis addresses the contextual meanings of the words as they
relate to word definitions. In the “can” example, if another verb follows the word, then it is
most likely an auxiliary verb. Otherwise, if the other words in the sentence are related to
jobs or work then the former definition of the real verb should be taken. If the other words
were related to preserves or jams, the latter definition would be more suitable. The field of
statistical analysis provides methodology to resolve this ambiguity. When this type of
ambiguity arises, we must rely on the meaning of the word to be defined by the
circumstances of its use. Statistical Natural Language Processing (SNLP) looks at language
as a non-categorical phenomenon and can use the current domain and environment to
determine the meanings of words.
SNLP can also be used to gather another type of contextual information. It can track the
slow evolution of word meanings. For example, years ago the word "like" was used in
comparisons, as a conjunction or a verb. Currently, it is often inadvertently used as a
colloquialism. This is the type of contextual information that is necessary in order to
resolve pragmatic ambiguities. Pragmatic ambiguities are cultural phrases or idioms that
have not been developed according to any set rules. For example, in the English language,
when a person asks, "Do you know what time it is?" he usually is not wondering if you are
aware of the hour, but more likely wants you to tell him the time.
Referential ambiguities deal with the way clauses of a sentence are linked together. For
example, the sentence "Ram hit the man with the hammer" has referential ambiguity
because it does not specify if Ram used a hammer to hit a man, or if Ram hit the man who
had a hammer. Referential ambiguities in a sentence are very difficult to reduce because
there may be no other clues in the sentence. In order to determine which clauses of the
sentence refer to or describe each other (in the example, who the hammer belongs to), the
processor would have to increase its scope of analysis and consider surrounding sentences
to look for clarification.
There are many tasks that require an understanding of Natural Language. Database queries,
fact retrieval, robot command, machine translation and automatic text summarization are
just a small subset of the tasks.
Although complete understanding has not yet been
achieved, there are imperfect versions of NLP technologies on the market.
1.3 Applications of NLP
One important application of NLP is Machine Translation (MT): "the automatic translation of text…from one [natural] language to another." The existing MT systems are far from perfect; they usually output a buggy translation, which requires human post-editing. These systems are useful only to those people who are familiar enough with the output language to decipher the inaccurate translations. The inaccuracies are in part a result of the imperfect NLP systems. Without the capacity to understand a text, it is difficult to translate it. Many of the difficulties in realizing MT will be resolved when a system to resolve pragmatic, lexical, semantic and syntactic ambiguities of natural languages is developed [19].
There are currently three approaches to Machine Translation: direct, semantic transfer and interlingual. Direct translation entails a word-for-word translation and syntactic analysis. The word-for-word translation is based on the results of a bilingual dictionary query, and syntactic analysis parses the input and regenerates the sentences according to the output language's syntax rules. For example, the sentence "He reads the book." could be accurately translated into "vaha pusawaka paxawA hai" using this technology. This kind of translation is most common today in commercial systems, such as AltaVista. However, this approach to MT does not account for semantic ambiguities in translation.
The semantic transfer approach is more advanced than the direct translation method
because it involves representing the meaning of sentences and contexts, not just equivalent
word substitutions. This approach consists of a set of templates to represent the meaning of
words, and a set of correspondence rules that form an association between word meanings
and possible syntax structures in the output language. Semantics, as well as syntax and
morphology, are considered in this approach. This is useful because different languages use
different words to convey the same meaning. However, one limitation of this approach is
that each system must be tailored for a particular pair of languages [3].
The third and closest-to-ideal (thus inherently most difficult) approach to MT is translation via an interlingua. "An interlingua is a knowledge representation formalism that is independent of the way particular languages express meaning." This approach would form the intermediate step for translation between all languages and enable fluent communication across cultures. This technology, however, greatly depends on the development of a complete NLP system, in which all levels of analysis and all ambiguities in natural language are resolved in a cohesive manner.
Chapter 2
Machine Translation
2.1 Definition
"Machine Translation (MT) can be defined as a translation where the initiative is with a
computer system, either autonomously (FAHQT = Fully Automatic High Quality
Translation) or where the user is asked to apply post-editing or pre-editing, or to answer
clarification/disambiguation dialogues [8]."
The term "Machine Translation" (MT) refers to the use of a machine for aiding or performing translation tasks involving more than one human language. Bearing this definition in mind, work on MT in fact started in the 17th century, when the use of mechanical dictionaries was first suggested. The machine translation (MT) 'systems' invented in those days were merely mechanical dictionaries for aiding human translation. The whole translation process relied very much on human effort. Though the MT 'systems' invented in the 17th century were referred to as mechanical dictionaries, they were not aiming merely at providing the meaning of words in the lexicon. They were aiming at forming an unambiguous language, based on logical principles and iconic symbols, which would allow people to communicate with each other without fear of misunderstanding. Since then, research on MT focused on producing different proposals for this kind of unambiguous language. A well-known unambiguous language of this kind is Esperanto, and it has been used as an interlingua in some interlingual MT systems and multi-lingual dictionary programs.
2.2 Types of Machine Translation
Hutchins and Somers [6] divided MT systems into three different categories:
2.2.1 Machine-Aided Human Translation
In this category, we can include the following:
• Spell, grammar or style checkers
• Monolingual or bilingual dictionaries, thesauri and encyclopedias
• Optical Character Recognition (OCR) programs and automatic term lookup
• Machine pre-translation: replacing source language (SL) words and phrases that are unambiguous
2.2.2 Human-Aided Machine Translation
A Human-Aided Machine Translation system generally consists of the following:
• Pre-editing: checking through the source text for foreseeable problems for MT and attempting to remove them, e.g. marking grammatical categories of homographs or substituting unknown words. The use of a controlled language can also be considered a form of pre-editing.
• Interactive MT: an MT system which pauses and asks the user to resolve ambiguities.
• Post-editing: correcting the output of MT to an agreed standard, e.g. amending the style of the output sentences, or making any minimal amendments which will make the text more readable.
2.2.3 Fully-automated Machine Translation (FAMT)
The source language text is fed into the computer as a file, and the computer produces a
translation automatically without any human intervention. This is sometimes referred to as
batch mode. There are two types of fully automatic machine translation: fully automatic high-quality machine translation (FAHQMT) and low-quality machine translation.
2.3 Historical Review of MT
2.3.1 Before the computer
It is possible to trace ideas about mechanizing translation processes back to the seventeenth
century, but realistic possibilities came only in the 20th century. In the mid 1930s, a
French-Armenian Georges Artsrouni and a Russian Petr Troyanskii applied for patents for
‘translating machines’ . Of the two, Troyanskii'
s was the more significant, proposing not
only a method for an automatic bilingual dictionary, but also a scheme for coding
interlingual grammatical rules (based on Esperanto) and an outline of how analysis and
synthesis might work. However, Troyanskii’ s ideas were not known about until the end of
the 1950s. Before then, the computer had been born.
2.3.2 The pioneers, 1947-1954
Soon after the first appearance of 'electronic calculators', research began on using computers as aids for translating natural languages. The beginning may be dated to a letter in March 1947 from Warren Weaver of the Rockefeller Foundation to cyberneticist Norbert
Wiener. Two years later, Weaver wrote a memorandum (July 1949), putting forward
various proposals, based on the wartime successes in code breaking, the developments by
Claude Shannon in information theory and speculations about universal principles
underlying natural languages. Within a few years research on machine translation (MT) had
begun at many US universities, and in 1954 the first public demonstration of the feasibility
of machine translation was given (a collaboration by IBM and Georgetown University).
Although using a very restricted vocabulary and grammar, it was sufficiently impressive to
stimulate massive funding of MT in the United States and to inspire the establishment of
MT projects throughout the world [30].
2.3.3 The decade of optimism, 1954-1966
The earliest systems consisted primarily of large bilingual dictionaries where entries for
words of the source language gave one or more equivalents in the target language, and
some rules for producing the correct word order in the output. It was soon recognized that
specific dictionary-driven rules for syntactic ordering were too complex and increasingly ad
hoc, and the need for more systematic methods of syntactic analysis became evident.
Optimism remained at a high level for the first decade of research, with many predictions of
imminent "breakthroughs". However, disillusion grew as researchers encountered "semantic barriers" for which they saw no straightforward solutions. There were some operational systems – the Mark II system (developed by IBM and Washington University) installed at the USAF Foreign Technology Division, and the Georgetown University system at the US Atomic Energy Authority and at Euratom in Italy – but the quality of output was disappointing (although satisfying many recipients' needs for rapidly produced
information). By 1964, the US government sponsors had become increasingly concerned at
the lack of progress; they set up the Automatic Language Processing Advisory Committee
(ALPAC), which concluded in a famous 1966 report that MT was slower, less accurate and
twice as expensive as human translation and that "there is no immediate or predictable
prospect of useful machine translation." It saw no need for further investment in MT
research; and instead it recommended the development of machine aids for translators, such
as automatic dictionaries, and the continued support of basic research in computational
linguistics.
2.3.4 The aftermath of the ALPAC report, 1966-1980
Although widely condemned as biased and short-sighted, the ALPAC report brought a
virtual end to MT research in the United States for over a decade, and it had a great impact elsewhere, in the Soviet Union and in Europe. However, research did continue in Canada, in
France and in Germany. Within a few years the Systran system was installed for use by the
USAF (1970), and shortly afterwards by the Commission of the European Communities for
translating its rapidly growing volumes of documentation (1976). In the same year, another
successful operational system appeared in Canada, the Meteo system for translating weather
reports, developed at Montreal University [30].
In the 1960s in the US and the Soviet Union, MT activity had concentrated on Russian-English and English-Russian translation of scientific and technical documents for a
relatively small number of potential users, who would accept the crude unrevised output for
the sake of rapid access to information. From the mid-1970s onwards the demand for MT
came from quite different sources with different needs and different languages. The
administrative and commercial demands of multilingual communities and multinational
trade stimulated the demand for translation in Europe, Canada and Japan beyond the
capacity of the traditional translation services. The demand was now for cost-effective
machine-aided translation systems that could deal with commercial and technical
documentation in the principal languages of international commerce.
2.3.5 The 1980s
The 1980s witnessed the emergence of a wide variety of MT system types, from a widening number of countries. First there were a number of mainframe systems, whose use
continues to the present day. Apart from Systran, now operating in many pairs of
languages, there was Logos (German-English and English-French); the internally developed
systems at the Pan American Health Organization (Spanish-English and English-Spanish);
the Metal system (German-English); and major systems for English-Japanese and Japanese-English translation from Japanese computer companies.
Throughout the 1980s research on more advanced methods and techniques continued. For
most of the decade, the dominant strategy was that of ‘indirect’ translation via intermediary
representations, sometimes interlingual in nature, involving semantic as well as
morphological and syntactic analysis and sometimes non-linguistic 'knowledge bases'. The most notable projects of the period were the GETA-Ariane (Grenoble), SUSY
(Saarbrücken), Mu (Kyoto), DLT (Utrecht), Rosetta (Eindhoven), the knowledge-based
project at Carnegie-Mellon University (Pittsburgh), and two international multilingual
projects: Eurotra, supported by the European Communities, and the Japanese CICC project
with participants in China, Indonesia and Thailand.
2.3.6 The early 1990s
The end of the decade was a major turning point. Firstly, a group from IBM published the
results of experiments on a system (Candide) based purely on statistical methods. Secondly,
certain Japanese groups began to use methods based on corpora of translation examples, i.e.
using the approach now called ‘example-based’ translation. In both approaches the
distinctive feature was that no syntactic or semantic rules were used in the analysis of texts or in the selection of lexical equivalents; both approaches differed from earlier 'rule-based' methods in their exploitation of large text corpora.
Another feature of the early 1990s was the changing focus of MT activity from ‘pure’
research to practical applications, to the development of translator workstations for
professional translators, to work on controlled language and domain-restricted systems, and
to the application of translation components in multilingual information systems.
2.3.7 The late 1990s
These trends have continued into the later 1990s. In particular, the use of MT and
translation aids (translator workstations) by large corporations has grown rapidly – a
particularly impressive increase is seen in the area of software localization (i.e. the
adaptation and translation of equipment and documentation for new markets). There has
been a huge growth in sales of MT software for personal computers (primarily for use by
non-translators) and even more significantly, the growing availability of MT from on-line
networked services (e.g. AltaVista, and many others). The demand has been met not just by
new systems but also by ‘downsized’ and improved versions of previous mainframe
systems. While in these applications, the need may be for reasonably good quality
translation (particularly if the results are intended for publication), there has been even
more rapid growth of automatic translation for direct Internet applications (electronic mail,
Web pages, etc.), where the need is for fast real-time response with less importance
attached to quality. With these developments, MT software is becoming a mass-market
product, as familiar as word processing and desktop publishing.
2.4 Various Strategies to Machine Translation
The history of MT is dominated by two generations of MT systems. First generation MT systems refer generally to the ones which were constructed before the 1960s. These systems employed a direct approach to MT which was mainly based on word-to-word and/or phrase-to-phrase translations. A simple word-to-word translation cannot resolve the ambiguities arising in MT. A more thorough analysis of the source language text is required to produce a better translation. As the major problem of the first generation of MT was the lack of linguistic information about the source text, researchers moved on to finding ways to capture this information. This gave rise to the development of the indirect MT systems, which are generally regarded as second generation MT systems. This section reviews the characteristics of the first and second generations of MT systems and explains how these systems attempt to tackle the problem of ambiguity. A brief summary of the relationship between these systems is shown in Figure 2.1.
Figure 2.1: The Vauquois Triangle
2.4.1 Direct MT system
A direct MT system (also known as a transformer) simply translates source language text to
the corresponding target language (TL) text in a word-for-word or phrase-to-phrase manner
by means of bilingual dictionary lookup. Then the resulting TL words are reorganized
according to the target language sentence format. In order to improve the output quality,
some direct MT systems perform some morphological analysis before the bilingual
dictionary lookup but they rarely analyze the sentence structure of the source language (SL)
text.
Figure 2.2: Typical building blocks of a direct MT system
Direct MT systems were developed in the 1950s. In those days, computers were very primitive and processing times were very long. This explains why direct MT systems are so simple and do not analyze the linguistics of sentences before performing the translation. Owing to its simple nature, the direct MT approach is very straightforward and easy to implement. It supports the translation of SL sentences which have both matching source-to-target language words and structures similar to those of the TL sentences. However, as very little effort, if any, has been put into disambiguating SL sentences, this approach does not support the translation of ambiguous sentences. This approach also fails to translate sentences into a language which has very different syntactic structures and/or a different use of words/phrases from the source language. The main problem of the direct MT approach is that it analyzes neither the linguistic information nor the meaning of source sentences before performing the translation. Without this information, the resulting MT system cannot resolve the ambiguities that arise in the source sentence and/or during the translation. Thus, this approach fails to translate ambiguous sentences (e.g. "Ram saw a bank on the bank of a river."). As a result, the first generation of MT systems cannot provide a quality translation of the source language text.
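As a minimal sketch of the transformer design just described (illustrative only; the toy dictionary and the transliterated target glosses are assumptions, not drawn from any real system), direct translation reduces to dictionary lookup followed by a fixed reordering rule:

```c
#include <stdio.h>
#include <string.h>

/* A toy bilingual dictionary; the transliterated target-language
   glosses are illustrative assumptions, not data from a real system. */
struct entry { const char *src, *tgt; };
static const struct entry dict[] = {
    { "he",    "vaha" },
    { "reads", "paxawA hai" },
    { "the",   "" },            /* the target language drops the article */
    { "book",  "pusawaka" },
};

static const char *lookup(const char *w) {
    for (size_t i = 0; i < sizeof dict / sizeof dict[0]; i++)
        if (strcmp(w, dict[i].src) == 0)
            return dict[i].tgt;
    return w;                   /* pass unknown words through unchanged */
}

int main(void) {
    /* Source sentence, already tokenized: Subject Verb Det Object */
    const char *src[] = { "he", "reads", "the", "book" };
    /* One hard-coded reordering rule: SVO -> SOV */
    const int order[] = { 0, 2, 3, 1 };

    for (int i = 0; i < 4; i++) {
        const char *t = lookup(src[order[i]]);
        if (*t) printf("%s ", t);   /* skip empty translations */
    }
    printf("\n");                   /* -> "vaha pusawaka paxawA hai" */
    return 0;
}
```

Everything beyond the one reordering rule is missing, which is exactly why such systems fail on ambiguous or structurally divergent input.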
2.4.2 Indirect MT system
Owing to the fact that linguistic information helps an MT system to disambiguate SL
sentences and to produce better quality target language translation, with the advance of
computing technology, MT researchers started to develop methods to capture and process
the linguistics of sentences. This was when the era of indirect MT systems started. Hutchins
and Somers [6] identified two kinds of second-generation MT systems: transfer-based and
interlingual systems as shown in Figures 2.3 and 2.4.
Figure 2.3: Typical building blocks of a transfer-based MT system
Figure 2.4: Building blocks of an interlingual MT system
The structures of these systems are fairly similar. The module ‘Source Text Analysis’ aims
at capturing the required linguistic information about the SL sentences for aiding the
translation. The transfer-based approach uses the information obtained from the analysis module directly to look up the corresponding TL words. The interlingual approach involves
the use of an intermediate language (i.e. an interlingua) for the transfer -- with the SL text
translated to the interlingua and the interlingua translated to the TL text. As suggested by
Hutchins and Somers, an interlingua is an intermediate ‘meaning’ representation and this
representation:
‘‘includes all information necessary for the generation of the target text without ‘looking
back’ to the original text. The representation is thus a projection from the source text and at
the same time acts as the basis for the generation of the target text; it is an abstract
representation of the target text as well as a representation of the source text [6]. ’’
Some researchers used an existing artificial language (e.g. Esperanto) as the interlingua
because it is generally believed to be more regular and consistent, both lexically and
structurally, than natural languages and could capture the characteristics of any natural
language in a relatively precise way. In addition, as these artificial languages had already
been developed, they can be incorporated into an interlingual MT system directly. No
additional effort is required to define the interlingua. The use of an interlingua enables an
MT system to perform the translation without looking back at and referring to the original
SL text. After translating the SL words to their TL forms, the job of the ‘Target Text
Generation’ module is to synthesize the resulting TL words to form the target sentences.
One advantage of the transfer-based approach is that it allows the source language text to be
analyzed according to what is required for facilitating its translation to a target language.
Thus, much less effort, if any, would be wasted in analyzing the unnecessary features of the
SL sentences. In addition, this approach also facilitates a close examination of the
differences between a language pair. This, in turn, will facilitate the design and
implementation of the required MT system. The interlingual approach, however, is more time-consuming, as a lot of processing time is consumed in the 'double transfer'. It also gives ambiguities a double chance to occur -- during both the translation into and out of the interlingua. However, if a multilingual MT system is to be built, this approach reduces the time and effort needed to produce a transfer module for each language pair (as required in the transfer-based approach), as shown in Figure 2.5.
Figure 2.5: Interlingual system (analysis modules for English, Hindi and Punjabi text feed a common interlingua, from which generation modules produce English, Hindi and Punjabi text)
The system structures of both the transfer-based and interlingual approaches allow a systematic analysis and processing of the linguistic information about sentences. However, these approaches do not provide an immediate solution to the problems of ambiguity and language difference. A lot of detailed investigation into resolving the linguistic problems that occur during translation is still required [5].
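The economy of the interlingual design can be pictured as module composition: each language contributes one analyzer (text to interlingua) and one generator (interlingua to text), so N languages need only 2N modules instead of the N(N-1) transfer modules of a pairwise design. The following C sketch is purely illustrative; the function names and the string-valued interlingua are assumptions:

```c
#include <stdio.h>

/* Hypothetical module signatures: each language contributes one
   analyzer (text -> interlingua) and one generator (interlingua -> text). */
typedef struct { const char *meaning; } Interlingua;

typedef Interlingua (*Analyzer)(const char *text);
typedef void        (*Generator)(Interlingua il);

static Interlingua analyze_english(const char *text) {
    (void)text;  /* a stand-in for real source-text analysis */
    Interlingua il = { "READ(agent:HE, object:BOOK, tense:PRESENT)" };
    return il;
}

static void generate_punjabi(Interlingua il) {
    printf("Punjabi generator consumes: %s\n", il.meaning);
}

static void generate_hindi(Interlingua il) {
    printf("Hindi generator consumes: %s\n", il.meaning);
}

int main(void) {
    Analyzer  analyze   = analyze_english;
    Generator targets[] = { generate_punjabi, generate_hindi };

    /* One analysis of the source text serves every target language. */
    Interlingua il = analyze("He reads the book.");
    for (size_t i = 0; i < sizeof targets / sizeof targets[0]; i++)
        targets[i](il);
    return 0;
}
```

Adding a new language here means writing one new analyzer and one new generator, never touching the existing modules.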
2.4.3 Knowledge-based MT (KBMT)
Arnold et al. define it as
‘‘The term knowledge-based MT has come to describe a rule-based system displaying
extensive semantic and pragmatic knowledge of a domain, including an ability to reason, to
some limited extent, about concepts in the domain" [8].
The assumption behind KBMT is that high-quality translation requires an in-depth understanding of the text. A domain model which supports this in-depth understanding of the meaning and relationships of words in the text is therefore used to aid the translation process. The motivation behind KBMT is that post-editing is time-consuming and expensive; it is therefore worth putting more effort into designing an MT system which can produce high-quality output without human intervention, so that no post-editing is needed to obtain a very high-quality translation. KBMT tends to be domain-specific (especially for domains which are relatively less ambiguous, e.g. technical documents) because it is very complicated and difficult to represent complete knowledge about the whole world.
Some basic components of a KBMT system are:
• An ontology of concepts (serves as an interlingua)
• An SL lexicon and grammar for the analysis process
• A TL lexicon and grammar for the generation process
• Mapping rules between the interlingua and SL/TL syntax
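Pictured as a single structure, these four components might be wired together as in the following C skeleton (hypothetical names and fields, not drawn from any cited system):

```c
#include <stdio.h>

/* Hypothetical skeleton tying together the components listed above;
   the type names, field names and contents are illustrative assumptions. */
typedef struct { const char *concepts;  } Ontology;     /* the interlingua */
typedef struct { const char *language;  } Lexicon;      /* words + grammar */
typedef struct { const char *direction; } MappingRules; /* IL <-> syntax   */

typedef struct {
    Ontology     ontology;
    Lexicon      sl_lexicon, tl_lexicon;
    MappingRules sl_to_il, il_to_tl;
} KBMTSystem;

int main(void) {
    KBMTSystem sys = {
        { "domain concepts" },
        { "English" }, { "Punjabi" },
        { "English syntax -> interlingua" },
        { "interlingua -> Punjabi syntax" },
    };
    printf("analysis:    %s lexicon, %s\n",
           sys.sl_lexicon.language, sys.sl_to_il.direction);
    printf("interlingua: %s\n", sys.ontology.concepts);
    printf("generation:  %s lexicon, %s\n",
           sys.tl_lexicon.language, sys.il_to_tl.direction);
    return 0;
}
```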
Strengths of the KBMT approach are:
• It supports the production of very high quality translation.
• It allows good modularity in the resulting MT system; the development of a parser can be completely independent of the generator of the MT system.
• Because the parser and the generators are independent of each other, the development of the SL and TL components can overlap. This, in turn, reduces the system development time.
• As no explicit SL-TL mapping is required for each language pair, any source language supported by the system can be translated to any target language defined in the system.
• It makes the addition of a new language to the existing system easier. A new language can be added through the implementation of a new parser and/or generator module(s) which link(s) this language to the interlingua. The newly incorporated parser and/or generator will then be able to co-operate with the other parsers and generators to produce the required translation.
Some weaknesses of the KBMT approach are:
• Given that the main reason for the inadequacy of many existing MT systems is the lack of adequate analysis and understanding of the SL text, the idea of using deep textual understanding for MT is perhaps one of the best ways to improve existing MT technology. However, an effective KBMT system relies on a good means of knowledge acquisition and representation, which is not readily available.
• The use of an interlingua for meaning representation reduces the amount of effort required for developing a multilingual MT system. However, it is not easy to select or to define an adequate interlingua. Without an adequate interlingua, deep textual understanding will not be supported by the resulting KBMT system, and its effectiveness will be reduced significantly.
• The success of a KBMT system depends on a large amount of hand-coded lexical knowledge. This hand-coding process is time-consuming and labour-intensive. Some means of alleviating this problem is required.
2.4.4 Example-Based Machine Translation (EBMT)
According to Turcato et al., "EBMT is essentially translation by analogy. EBMT is also regarded as a case-based reasoning approach to MT, where previously resolved translation cases are reused to translate new SL text" [14].
The basic assumption of EBMT is: "If a previously translated sentence occurs again, the same translation is likely to be correct again." This idea is sometimes thought to be reminiscent of how human translators proceed when using a bilingual dictionary: looking at the examples given to find the SL example that best approximates what they are trying to translate, and constructing a translation on the basis of the TL example that is given. Konstantinidis presented the general architecture of an EBMT system as shown in Figure 2.6.
Figure 2.6: EBMT Architecture
The EBMT approach proposed by Nagao [1] uses raw, unanalyzed, unannotated bilingual data and a set of SL and TL lexical equivalences mainly expressed in terms of word pairs (with SL and TL verb equivalences expressed in terms of case frames) as the linguistic backbone of the translation process. The translation process is mainly a matching process which aims at locating the best match, in terms of semantic similarity, between the input sentence and the available examples in the database.
In EBMT, instead of using explicit mapping rules for translating sentences from one language to another, the translation process is basically a procedure of matching the input sentence against the stored example translations. The basic idea is to collect a bilingual corpus of translation pairs and then use a best-match algorithm to find the closest example to the source phrase in question. This gives a translation template, which can then be filled in by word-for-word translation. The distance calculation, for finding the best match for a source phrase, can involve calculating the closeness of items in a hierarchy of terms and concepts provided by a thesaurus.
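The best-match step can be sketched as follows. In this toy C program the example base and the crude word-position distance are illustrative assumptions; a real EBMT system would use thesaurus-based semantic distances as described above:

```c
#include <stdio.h>
#include <string.h>

/* A toy example base; the transliterated glosses are illustrative
   assumptions, not real corpus data. */
struct example { const char *src, *tgt; };
static const struct example base[] = {
    { "he reads the book",    "vaha pusawaka paxawA hai" },
    { "she sings a song",     "vaha gIwa gAwI hai" },
    { "he writes the letter", "vaha pawra likhawA hai" },
};

static int split(char *s, char *words[]) {
    int n = 0;
    for (char *t = strtok(s, " "); t; t = strtok(NULL, " "))
        words[n++] = t;
    return n;
}

/* Crude distance: word positions at which the sentences differ,
   plus the difference in length. */
static int distance(const char *a, const char *b) {
    char ca[64], cb[64], *wa[16], *wb[16];
    strcpy(ca, a);
    strcpy(cb, b);
    int na = split(ca, wa), nb = split(cb, wb);
    int n = na < nb ? na : nb, d = na > nb ? na - nb : nb - na;
    for (int i = 0; i < n; i++)
        if (strcmp(wa[i], wb[i]) != 0) d++;
    return d;
}

int main(void) {
    const char *input = "he reads the letter";
    size_t best = 0;
    int bestd = 1 << 30;
    for (size_t i = 0; i < sizeof base / sizeof base[0]; i++) {
        int d = distance(input, base[i].src);
        if (d < bestd) { bestd = d; best = i; }
    }
    printf("closest example:   \"%s\" (distance %d)\n", base[best].src, bestd);
    printf("template to adapt: \"%s\"\n", base[best].tgt);
    return 0;
}
```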
Strengths of the EBMT approach are:
• EBMT is not domain-specific. As the example set becomes more complete, the quality of translation will improve incrementally without the need to update and improve detailed grammatical and lexical descriptions.
• This approach can be (in principle) very efficient, since in the best case there is no complex rule application to perform -- all one has to do is find the appropriate example and (sometimes) calculate distances.
• An EBMT system is potentially multilingual: an EBMT program can be implemented in such a way that it reads in any bilingual translation data and processes them in order to produce the database for translation.
Some weaknesses of the EBMT approach are:
• This method is dependent on the collection of good bilingual data, which might not be readily available.
• The calculation of the best match might be a complicated and lengthy process -- for instance, as suggested by Arnold et al., "when there are a number of different examples each of which matches part of the string, but where the parts they match overlap, and/or do not cover the whole string. In such cases, calculating the best match can involve considering a large number of possibilities" [8].
• In terms of improving the translation quality, the more examples covering different translation cases the better. However, more examples stored in the translation database means a longer time for searching through the database in order to locate the best match.
• In some cases, especially when an input sentence is relatively unambiguous, a simple rule-based system which analyses the linguistic information about the input sentence would be less complicated and thus more efficient.
2.4.5 Statistical MT
The use of statistical data for MT has been suggested since the age of first generation MT. However, this approach was not pursued extensively at the time, perhaps mainly because computers in those days were not powerful enough to support such a computationally intensive approach. Statistical approaches to MT can mean:
• approaches which do not use explicitly formulated linguistic knowledge to perform MT (i.e. pure statistical MT); or
• the application of statistical techniques, or techniques for calculating probabilities, to aid parts of the MT task (e.g. word sense disambiguation).
The idea behind the pure statistical MT approach is to let a computer learn automatically how to translate text from one language to another by examining large amounts of parallel bilingual text, i.e. documents which are nearly exact translations of each other. The statistical MT approach uses statistical data (e.g. which SL lexical unit is translated to which TL word(s), and how often this translation occurs) to perform translation. This statistical data is obtained from an analysis of a vast amount of bilingual text. Different probabilities are extracted from the bilingual texts automatically by a computer, i.e.:
• The probability of a source sentence occurring in the texts,
• The probability of a source word being translated as one, two, three, etc. target words,
• The translation probabilities for each word in each language, and
• The probability that a word in a given position in the SL sentence corresponds to a TL word in a different position in the target sentence (i.e. the distortion probability).
These probabilities are vital to the translation process, as they are the sole information for calculating how an SL sentence should be translated into the TL form. In a pure statistical MT system, no bilingual dictionary or any explicit linguistic information is required to aid the translation. Therefore, techniques for aligning the bilingual text (i.e. bilingual phrase or even word alignment) are required to help the system learn how to perform translation. If there is more than one TL equivalent for an SL word, the frequency of each translation is used for calculating the probability of the use of each translation. In order to cope with the translation of an ambiguous word, the probabilities for the current and neighboring words in a sentence are combined and used to resolve the ambiguity.
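As a toy illustration of combining such probabilities (all figures are invented for illustration, not estimated from any corpus), the following C sketch chooses between two translation senses of the ambiguous word "bank" by weighting each sense's translation probability with evidence from a neighboring word:

```c
#include <stdio.h>
#include <string.h>

/* Toy figures: translation probabilities for the two senses of the
   ambiguous English word "bank". All numbers are illustrative
   assumptions, not estimates from a real corpus. */
struct sense { const char *gloss; double p_trans; };
static const struct sense senses[] = {
    { "financial-institution", 0.70 },
    { "river-edge",            0.30 },
};

/* P(sense | neighbor): how strongly a neighboring word selects a sense. */
static double p_context(int sense, const char *neighbor) {
    if (strcmp(neighbor, "river") == 0)
        return sense == 1 ? 0.95 : 0.05;
    if (strcmp(neighbor, "money") == 0)
        return sense == 0 ? 0.95 : 0.05;
    return 0.5;  /* uninformative neighbor */
}

int main(void) {
    const char *neighbor = "river";   /* ... bank of a river ... */
    int best = 0;
    double bestp = 0.0;
    for (int s = 0; s < 2; s++) {
        /* combine the unigram translation probability with context evidence */
        double p = senses[s].p_trans * p_context(s, neighbor);
        printf("P(%s) = %.3f\n", senses[s].gloss, p);
        if (p > bestp) { bestp = p; best = s; }
    }
    printf("chosen translation sense: %s\n", senses[best].gloss);
    return 0;
}
```

Here the context evidence outweighs the unigram preference, so the "river-edge" sense wins despite its lower standalone probability.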
Strengths of the statistical approach are:
• Even if an exact match for a translation is not listed in the bilingual corpus, the MT system can still use the translation probabilities to approximate a possible translation.
• Provided that a good corpus of bilingual texts is available, statistical MT offers a fast and less costly approach to MT.
• The IBM team involved in the Candide project also demonstrated that knowledge of both the source and target languages is not essential for this approach, as the members of the IBM team knew either very little French or no French at all [8].
• The fact that pure statistical MT learns how to perform MT by observing the translation behavior of a vast amount of bilingual text means that this method is language-independent.
Some weaknesses of the statistical approach are:
• One limitation of statistical MT is that unless the corpus is very large and contains text from different domains (e.g. technical text, newspapers, novels, etc.), the statistics generated tend to be domain-specific; that is, the system produces less accurate results for text from domains other than that of the training data.
• One major drawback of statistical MT is that its translation performance is rather poor: out of 100 short test sentences, only 39% of the translations were correct [8].
• If this approach is used in real-life MT tasks, a lot of post-editing of the resulting translations will be required, which makes this approach very costly.
2.4.6 Hybrid Machine Translation Paradigms
Current thinking in MT circles suggests that significant progress in the field of MT is
unlikely to be achieved by refining any single approach. It has therefore become a common
interest to merge different MT paradigms into one system in order to yield better translation
results. In recent years, one can consequently see the development of an increasing number
of hybrid MT systems with the aim of combining the strengths of each individual approach
and improving overall translation quality as a result. At this point in time, the extent to
which such hybrid MT paradigms can improve the performance of MT engines is not yet
fully known, since the work carried out in this field is still in its infancy.
2.5 What makes MT so difficult?
Natural language translation is not an easy task. Due to the versatile usage of words and phrases, sometimes even a well-trained and experienced human translator has difficulty in translating a piece of source language (SL) text correctly. By no means can a computer in the present age compare with an average human being in terms of understanding knowledge of the real world. Without the ability to understand real-world knowledge, it is all the more difficult to 'teach' a computer to perform a task which even well-trained and experienced human translators find difficult at times. In addition to this inability, there are other problems which impede a computer system in performing high-quality natural language translation. Here we discuss some of these problems. As with a human translator, before a computer can translate text from one language to another, some means is required to 'teach' the computer to perform translation. The simplest way to perform MT is to find the corresponding target language (TL) equivalent for each word present in the source text, one by one.
Direct translation is simple and straightforward: no syntactic analysis is required on the SL sentence, and the source-to-target language equivalents obtained build up the required TL sentence without the need for further processing. Though the word-for-word translation method works fine in translating the English sentence "Ram loves Sita." to Chinese, if this method is used to translate the same English sentence to, say, Hindi, the output sentence obtained will be syntactically incorrect. If an ambiguous SL sentence is translated by the simple word-for-word translation method, the output translation might seem like a piece of junk text to a target language speaker, or worse still, convey a wrong meaning and cause misunderstanding. Therefore, more detailed processing is required for effective MT. A more effective MT approach contains three parts: SL text analysis, source-to-target language transfer and TL text generation. This approach allows a more thorough analysis of the source language text so as to help resolve the ambiguity within it before the required source-to-target language translations are looked up during the source-to-target language transfer. This method also allows the reorganization and/or deletion of selected TL words and the introduction of additional TL words, so that the output sentences conform to the TL grammar. Even though this three-stage MT method can give rise to better MT output, due to the complexity of natural languages there still exist many problems which affect the effectiveness of MT systems. Here we discuss how and why the three-stage MT method is inadequate in catering for real-life translation needs.
2.5.1 Linguistic Problems
If every word within a natural language had only one interpretation (i.e. one syntactic, semantic and pragmatic analysis), MT would become a much simpler task. An MT system could obtain the TL translation by simply analyzing each word within a SL sentence and generating the target sentence according to the TL grammar. However, this is not the case with any natural language. A word not only can have more than one interpretation, it can also combine with other constituents within a sentence to form further interpretations. For instance, a word may appear in more than one syntactic category, e.g. the word ‘ships’ can be a noun or a verb. A word may combine with other word(s) to form a new lexical unit, e.g. the phrasal verb ‘fish for’ as in ‘‘Ram fished for invitations’’. Even within the same syntactic category, a word can have more than one meaning, e.g. the noun ‘saw’ can mean a tool for cutting, or a short, well-known saying or proverb. The existence of ambiguous words makes it more difficult for an MT system to capture the appropriate meaning of a source sentence so as to produce the required translation. Here we briefly discuss the different kinds of problems occurring in natural languages which affect the effectiveness of MT systems.
1. Lexical Ambiguity
Lexical ambiguity occurs when a word possesses more than one meaning. One famous kind of lexical ambiguity is caused by homographs. A homograph is a word (i.e. a sequence of characters) with more than one meaning. For instance, the English word ‘saw’ is commonly used as the past tense of the verb ‘see’, but it can also mean a tool for cutting, the action of using this tool for cutting, as well as a short, well-known saying or proverb. It is not always difficult to disambiguate a homograph. Some homographs have only one meaning within a single syntactic category. For instance, as a noun, the word ‘minute’ means a unit for measuring time; as a verb, it means to make a written record of what is said or decided during a meeting; as an adjective, it means tiny.
• One minute has sixty seconds.
• Part of the job of a secretary is to minute meetings.
• There is only a minute difference between these pictures.
It is relatively easy to disambiguate this kind of homograph: by analyzing the syntactic structure of the sentence and finding the syntactic category of the homograph within it, the appropriate meaning can be obtained. Knowing the syntactic category of a word does not always help the disambiguation of homographs, however, because some homographs have more than one meaning even when they are used in the same syntactic category. For instance, the noun ‘ball’ can mean a dance party or a round object for sports. With this kind of homograph, where more than one meaning exists in the same syntactic category, one way to disambiguate is to consider its semantic properties in relation to the semantic properties of other words in the sentence. For instance, the meaning of ‘ball’ in the sentence ‘‘Ram kicked a ball.’’ must be a round physical object for sports, because the verb ‘kick’ requires physical contact with a physical object, whereas a dance party is an abstract event which cannot be kicked. However, even comparing the semantic properties of words within a sentence does not always help. As pointed out by Hutchins and Somers [6], with a sentence like ‘‘When you hold a ball, ...’’ in which both senses of the verb ‘hold’ (i.e. to grasp and to organize) can be used with the different senses of the noun ‘ball’, it would be difficult to obtain the appropriate meanings unless the later part of the sentence provides more clues to disambiguate these senses.
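The strategy of disambiguating by syntactic category can be sketched with off-the-shelf tools. The fragment below is a minimal illustration, not part of the thesis system; it assumes the NLTK library with its pre-trained tokenizer and tagger data, and the exact tags obtained depend on the tagger used.

import nltk

# With NLTK's default tagger, 'minute' typically comes out as a noun
# (NN) after 'One' and as a verb (VB) after the modal 'will', which is
# enough to separate the time-unit sense from the record-keeping sense.
for sentence in ("One minute has sixty seconds.",
                 "The secretary will minute the meeting."):
    tags = dict(nltk.pos_tag(nltk.word_tokenize(sentence)))
    print(sentence, "->", tags["minute"])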
2. Structural Ambiguity
Structural ambiguity is concerned with the syntactic representation of sentences. It occurs when more than one syntactic structure can be associated with a sequence of words. For instance, a well-known example of this kind is the sentence ‘‘Flying planes can be dangerous.’’ in which the word ‘flying’ can function as a noun or an adjective, resulting in more than one meaning for this sentence [6]:
• It can be dangerous to fly planes.
• Planes that are flying can be dangerous.
Each of these interpretations results in a different translation of the sentence. With this kind of structural ambiguity, which human translators would find difficult to disambiguate without knowledge of the actual event, it is very unlikely that computers can perform the required disambiguation without any human intervention. With ambiguous sentences of this kind, where both analyses result in a valid meaning, it is perhaps impossible to translate the sentence appropriately without knowing the author's intended meaning or the context of the sentence. In such a situation, perhaps an MT system should generate two translations for this sentence. However, not all potential structural ambiguities trigger the need to generate more than one target language translation. For instance, as suggested by Arnold et al. [8], if the modal of the above sentence is replaced by the appropriate tense, i.e.:
• Flying planes is dangerous.
• Flying planes are dangerous.
the syntactic structure of the word sequence ‘flying planes’ can then be disambiguated by analyzing the number agreement between the subject and the verb of each sentence. In some cases, structural ambiguity can even be resolved by analyzing the phrasal structure of the sentences. For instance, consider the following sentences:
• The tape measures are all sold out.
• The tape measures five inches long.
The word sequence ‘tape measures’ has two interpretations in the above sentences: a noun group (i.e. noun modifier + noun) and a noun followed by a verb, respectively. Upon analyzing the structure of each sentence, the appropriate reading of ‘tape measures’ can be obtained. Syntactic processing alone is adequate to perform this kind of disambiguation.
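The ‘‘Flying planes’’ ambiguity can be reproduced mechanically. The toy context-free grammar below, written purely for this illustration (it assumes the NLTK library), licenses both readings of ‘flying’, so a chart parser returns two distinct trees for the same word sequence.

import nltk

# Toy grammar: 'flying planes' is either a gerund verb plus its object
# ('to fly planes') or an adjective modifying a noun ('planes that fly').
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Ger N | Adj N
VP -> Modal V Adj
Ger -> 'flying'
Adj -> 'flying' | 'dangerous'
N -> 'planes'
Modal -> 'can'
V -> 'be'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("flying planes can be dangerous".split()):
    print(tree)
# (S (NP (Ger flying) (N planes)) (VP (Modal can) (V be) (Adj dangerous)))
# (S (NP (Adj flying) (N planes)) (VP (Modal can) (V be) (Adj dangerous)))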
3. Multiword Units
A word can possess more than one meaning and thus cause problems of ambiguity in an MT system. When a word is used in conjunction with other word(s), even if each of these words possesses only one meaning, the combination can also become ambiguous. Two common examples of this kind in the English language are phrasal verbs and idioms. According to the Collins COBUILD English Grammar,
‘‘Phrasal verbs are a special group of verbs which are made up of a verb and an adverb and/or a preposition which are used to extend or change the meaning of a verb.’’
As a phrasal verb often constitutes a meaning that is different from the literal meaning of its constituents, sentences with this kind of verb have a higher chance of being ambiguous. For instance, consider the phrasal verb ‘eat in’ and the co-occurrence of the verb ‘eat’ and the preposition ‘in’, as in [6]:
• Ram eats in on Sundays.
• Ram eats in a restaurant on weekdays.
The first ‘eats in’ has the modified meaning ‘‘to eat at home’’, whereas the second uses the literal meaning of the verb ‘eat’ and the preposition ‘in’, i.e. ‘‘to eat in a particular place’’. One way to disambiguate the above sentences is by analyzing their syntactic structures: the word ‘in’ functions as an adverb, which does not govern an object, and as a preposition, which governs the object noun phrase (NP) ‘‘a restaurant’’, respectively. However, not all phrasal verbs can be disambiguated by simply analyzing the syntax of a sentence. For instance:
• Ram fell for Sita.
• Ram fell for a lie.
where the phrasal verb ‘fall for’ means ‘‘to be attracted towards’’ and ‘‘to be tricked’’ respectively. In both cases, the phrasal verb ‘fall for’ governs an object (i.e. ‘Sita’ and ‘a lie’ respectively) and the sentences have the same structure. Disambiguating this kind of phrasal verb requires the analysis of lexical semantics (i.e. the meaning of words).
4. Language Differences
The problems we have looked at so far concern finding the appropriate word senses used in the SL text. Even when a SL sentence can be successfully disambiguated by an MT system, other problems can hinder the production of an appropriate TL translation. One common problem in translation is that a word in one language might not have an immediate equivalent in another language; such gaps are called lexical holes. For instance, the English verb ‘stab’ has no immediate equivalent in Spanish [6]. One possible way to translate this kind of word is to express the meaning with several TL words, e.g. the English verb ‘stab’ can be translated to a Spanish phrase meaning ‘give knife wound to’. However, some lexical holes might be too difficult to fill (i.e. no TL expression can adequately express the meaning of the SL word) and the only way is to leave the word untranslated. Deciding whether or not to leave a SL word or phrase untranslated is not an easy task, and a computer cannot make this decision on its own. If all lexical holes had to be filled by additional dictionary entries and translation rules while developing an MT system, system development and processing time would be prolonged.
Different languages also seem to classify the world differently. For example, even though both Americans and Brits speak English, there is still room for misunderstanding due to different usage of words. For instance, Brits call the front engine cover of a car ‘bonnet’ and the storage space at the back of a car ‘boot’. Americans, however, use different words to express the same meanings; in fact, to the average American, a ‘bonnet’ is a kind of hat whereas a ‘boot’ is a kind of footwear. Therefore, the sentence ‘‘I unlocked the boot and laid the tools on the bonnet’’, which sounds normal to a Brit, might sound funny to the average American.
Chapter 3
Role of Interlingua in Machine Translation
3.1 Interlingua
The approach to machine translation (MT) known as Interlingual MT requires the
composition of an unambiguous language-neutral representation of the meaning of the
source text from which an equivalent text in a target language may be generated. Thus, a
sub problem for any Interlingua (IL)-based MT system is that of decoding the lexical and
compositional meaning of the source language (SL) text.
There are a number of clear attractions to an interlingual architecture. First, from a purely intellectual or scientific point of view, the idea of an interlingua is interesting and exciting. Second, from a more practical point of view, an interlingual system promises to be much easier to extend by adding new language pairs than a transfer system (or a transformer system). This is because, provided the interlingua is properly designed, it should be possible to add a new language to the system simply by adding analysis and synthesis components for it. Compare this with a transfer system, where one needs not only analysis and synthesis but also transfer components into all the other languages involved in the system. Since there is one transfer component for each ordered language pair, N languages require N × (N − 1) transfer components (one does not need a transfer component from a language into itself) [9].
3.2 Machine Translation with and without an Interlingua
Machine translation methodologies are commonly categorized as direct, transfer, and interlingual. The methodologies differ in the depth of analysis of the source language and the extent to which they attempt to reach a neutral representation of meaning or intent between the source and target languages. Direct translation involves very little analysis of the source language, often only looking up the words in a bilingual dictionary. Transfer usually involves some analysis of the source language; however, in transfer systems, the representation of the source language sentence may not be identical to the representation of the target language sentence. The two representations would be related to each other by transfer rules, i.e. rules that specify which source language structures correspond to which target language structures. Interlingual MT may involve the deepest analysis of the source language: the analysis must be deep enough to neutralize the differences between the source and target languages. Of course, in practice, the boundaries between the three methodologies are not sharp. For example, many transfer systems perform quite deep analysis of the source language.
3.3 Advantages of Translating with an Interlingua
The choice of direct, transfer, or interlingual MT depends on the application of MT and on
the available resources. For example, direct MT may not be able to re-order the words in
the target language and may not provide good translations for idioms and other
constructions for which a word-by-word substitution is not adequate. However, direct MT
may be quick to implement and may be useful for applications for which getting the gist of
the meaning is sufficient. Furthermore, direct MT may be the only option when the only
resource available is a bilingual glossary.
Interlingual MT is particularly advantageous in multi-lingual applications involving more
than two languages. The reason is that interlingual MT requires fewer components in order
to relate each source language to each target language.
An interlingual system is illustrated schematically in Figure 3.1. For each language, there is an analyzer and a generator. The analyzer takes as input a source language sentence and produces as output an interlingual representation of its meaning. The generator takes an interlingual representation of meaning as input and produces a sentence with that meaning as output. To translate from L1 to L2, L1's analyzer produces an interlingual representation and L2's generator produces an L2 sentence with the same meaning.
If there are n languages and we want to be able to translate from each language to each language, n analyzers and n generators are needed, for a total of 2n components. In contrast, a transfer-based system or a direct system might require up to n(n − 1) components: rules that map L1 to L2, L2 to L1, L1 to L3, L3 to L1, L2 to L3, L3 to L2, etc.
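As a quick check of these counts, the small sketch below (illustrative only) compares the number of components needed by the two architectures as the number of languages grows.

def components(n):
    # Interlingual: one analyzer and one generator per language.
    # Transfer/direct: one component per ordered pair of distinct languages.
    return 2 * n, n * (n - 1)

for n in (2, 5, 10):
    interlingual, pairwise = components(n)
    print(n, "languages:", interlingual, "vs", pairwise)
# 2 languages: 4 vs 2
# 5 languages: 10 vs 20
# 10 languages: 20 vs 90

Note that the interlingual architecture only pays off once more than two languages are involved, which is consistent with the observation above that it is particularly advantageous in multi-lingual applications.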
Figure 3.1 shows eight analyzers (for Punjabi, English, Bangla, Marathi, Asamiya, Hindi, Oriya and Gujrati) mapping an input sentence, e.g. the English ‘‘The pain started three days ago.’’, onto a single interlingua representation:

c:give-information+occurrence+health-status
  (health-status=pain, phase=start, e-time=previous,
   time=(relative-time=(time-distance=(quantity=3, time-unit=day),
   time-relation=before)))

and generators for the same eight languages producing output sentences from that representation, e.g. the Hindi ‘‘Xaraxa wIn xin pahale SUrU hUA.’’

Figure 3.1: Multilingual Translation with an Interlingua
There are other advantages of interlingual MT. First, related to the point we have already made, it takes fewer components to add a new language. For example, suppose we want to add language Lm to the system shown in Figure 3.1, and we want all-ways translation between all of the languages. We only need to add an analyzer for Lm and a generator for Lm. Once Lm is connected to the interlingua with an analyzer and a generator, it is automatically connected for input and output to L1-Ln.
Another advantage of the interlingua approach is that the analyzers and generators can be written by monolingual system developers. For example, building an MT system for Hindi and Punjabi does not require anyone to be bilingual in Hindi and Punjabi. It only requires that the Hindi speakers connect Hindi to the interlingua and the Punjabi speakers connect Punjabi to the interlingua.
Interlingual MT also supports paraphrase of the input in the original language. When an English speaker says ‘‘The pain started three days ago’’, the analysis process produces the interlingua shown in Figure 3.1. The interlingua is a system-internal representation which is not of interest to most users, and so is not visible to users. The generator may then produce a target language sentence like ‘‘xaraxa wIn xin pahale SUrU hUA’’. The source language speaker, however, does not know whether the target language translation is correct (because s/he presumably does not speak the target language). In order to give the source language speaker a chance to check the translation, the source language generator can produce a source language sentence from the same interlingua. Since an interlingua represents the meaning of the sentence, the generator might produce a syntactically different sentence such as ‘‘Pain is for the last three days’’, but the meaning of the input sentence should be preserved. The source language speaker can then verify that the meaning is correct.
Of course, paraphrase from the same interlingua might not always reveal a problem. Suppose the target language generator malfunctions, producing an incorrect Hindi sentence, but the source language generator works properly and produces a correct paraphrase of the original sentence. In that case, the source language speaker will not be alerted to the problem with the target language generator. Conversely, the source language generator may malfunction, giving the speaker the mistaken impression that there is a problem with the source language analyzer or the target language generator.
3.4 Grain Size of Meaning: The Challenge of Interlingua Design
The biggest problem of interlingua design is that ‘‘meaning’’ is a bottomless pit. It is always possible to add more detail to a meaning representation, but in order to implement an MT system, the details must end at some point. Many interlingua developers find that the most time-consuming part of interlingua design is deciding when to stop refining the meaning representation. For example, should there be a slightly different shade of meaning for ‘‘I have high blood pressure’’ (more likely to be a persistent condition) and ‘‘My blood pressure is high’’ (more likely to be a temporary current condition)?
Chapter 4
Angla Bharti System Overview
4.1 System Overview
As AnglaHindi is a derivative of Anglabharti, let us first look at the Anglabharti methodology. As pointed out earlier, Anglabharti is a machine-aided translation methodology specifically designed for translating English to Indian languages. English is an SVO language while Indian languages are SOV and have relatively free word order. Instead of designing translators from English to each Indian language, Anglabharti uses a pseudo-interlingua approach. It analyses English sentences only once and creates an intermediate structure with most of the disambiguation performed. The intermediate language structure has the word and word-group order as per the structure of the group of target languages. The intermediate structure is then converted to each Indian language through a process of text generation. The effort in analyzing the English sentences is about 70% and the text generation accounts for the remaining 30%. Thus, with only an additional 30% effort, a new English to Indian language translator can be built. Anglabharti is a pattern-directed rule-based system with a context-free-grammar-like structure for analysis of English as the source language. The analysis generates a ‘pseudo-target’ applicable to a group of Indian languages. A set of rules obtained through corpus analysis is used to identify plausible constituents with respect to which movement rules for the ‘pseudo-target’ are constructed. The idea of using a ‘pseudo-target’ is primarily aimed at incorporating advantages similar to those of the interlingua approach by exploiting structural similarity: Indian languages are verb-ending, have free word-group order, and share a great deal of structural similarity.
Indian languages can be classified into four broad groups according to their origin and similarity [13]: the Indo-Aryan family (Hindi, Bangla, Asamiya, Punjabi, Marathi, Oriya, Gujrati, etc.); the Dravidian family (Tamil, Telugu, Kannada and Malayalam); the Austro-Asiatic family; and the Tibeto-Burman family. Within each group, there is a high degree of structural similarity. The Paninian framework, based on Sanskrit grammar and using the Karak (similar to ‘case’) relationship, provides a uniform way of designing the Indian language text generators using selectional constraints and preferences.
A block schematic diagram of the Anglabharti methodology is depicted in Figure 4.1. A brief description of some of the major building blocks of Anglabharti is given in the following paragraphs [26].
Rule-base: This contains rules for mapping structures of sentences from English to Indian languages. This database of pattern transformations from English to Indian languages is entrusted with the job of making a surface-tree to surface-tree transformation, bypassing the task of obtaining a deep tree of the sentence to be translated. The database of structural transformation rules from English to Indian languages forms the heart of the Anglabharti system. The system is designed to cater to compound, complex, imperative, interrogative and other constructs such as headings. As mentioned earlier, by making a generic rule-base for Indian languages, Anglabharti exhibits a potential benefit while translating from English. This module is also responsible for picking up the correct sense of each word in the source language, to the extent feasible, using an interleaved semantic interpreter. Further disambiguation and the choice of the right construct and lexical preferences are performed by the target language text-generator module. Many a time, multiple rules may get invoked, leading to multiple interpretations of the input sentence. The rules are ordered in terms of their preferences and an upper limit is put on the number of alternatives produced.
These multiple translations are available for further post-editing.
Multi-lingual dictionary/lexical database and sense disambiguator: The lexical database is the fuel of the translation engine. It contains various details for each word in English, such as its syntactic categories, possible senses, keys to disambiguate its senses, and the corresponding words in the target languages with their tags. A number of ontological/semantic tags are used to resolve sense ambiguity in the source language. Most of the disambiguation rules are in the form of syntacto-semantic constraints. We use semantics to resolve most of the intra-sentence anaphora/pronoun references. Alternative meanings for the unresolved ambiguities are retained in the pseudo target language. The lexical database is hierarchically organized to allow domain-specific meanings and also to prioritize meanings as per users' requirements.
Target text generators and corrector for ill-formed sentences [10] [7]: These form the tail end of the system. Their function is to generate the translated output for the corresponding target languages. A text generator module for each of the target languages transforms the pseudo target language to the target language. These transformations do lead to sentences which may be ill-formed. The ill-formed sentences are target language specific and are usually related to incorrect placement of emphasizers, negation and forms denoting cultural dependence (such as plurals being used for persons to whom one pays respect). A corrector for ill-formed sentences is used for each of the target languages. Finally, a human-engineered post-editing package is used to make the final corrections. It is our experience that for more than 50% of normal text, the human post-editor needs to know only the target language, as humans use a lot of contextual information in making the right choice. For resolving structural ambiguity, one needs to consult the source language. It may be noted that by having different text generators using the same rule-base and sense disambiguator, a generic MT system is obtained for a host of target languages. We have used the Paninian framework with a verb-centric, expectation-driven methodology [4] with selectional restrictions/semantic constraints for synthesizing the Indian language text.

Figure 4.1: System Architecture of ANGLABHARTI
AnglaHindi, besides using all the modules of Anglabharti, also makes use of an abstracted example-base for translating frequently encountered noun phrases and verb phrasals. The example-based approach developed by the author's group, named ANUBHARTI [11] [18], is invoked before the rule-based approach is applied. The example-base is statistically derived from the corpus. Ambiguities in the meanings of the verb phrasals are also resolved using an appropriate distance function in the example-base [21]. AnglaHindi accepts unconstrained text [22] [20]. The text may be made up of headings, parenthesized texts, text under quote marks, currencies, varying numeral and date conventions, acronyms, unknowns and other frequently encountered constructs. The performance of the system has been evaluated by human translators. The system generates approximately 90% acceptable translation in the case of simple, compound and complex sentences up to a length of 20 words [10].
The current version of AnglaHindi is not tuned to any specific domain of application or topic. However, it has user-friendly interfaces which allow hierarchical structuring of the lexical database, leading to preferences on lexical choice. Similarly, it has provisions for augmenting its abstracted example-base specific to an application domain. This not only eliminates the alternative translations but also generates more accurate and acceptable translations. Currently, the alternate translations are ranked with respect to the ordering of the rule-base. This can be further enhanced by using domain-specific information and target language statistics. The alternate translations can be ranked based on a hidden Markov model of Hindi in the specific domain. For each alternate translation, the language model yields a figure of merit reflecting preferences for style and lexical choice. Overall, the AnglaHindi system attempts to integrate the example-based approach [15] with a rule-base and human-engineered post-editing. An attempt is made to fuse modern artificial intelligence techniques with the classical Paninian framework based on Sanskrit grammar.
4.2 PLIL: Pseudo-Lingua for Indian Languages
The Anglabharti system architecture exploits the structural similarity of Indian languages. This structural similarity is more homogeneous within each family of languages, such as within the Indo-Aryan family, the Dravidian family and others. The Anglabharti system translates the English source language into an intermediate language that follows the structure of the family of target Indian languages. It contains most of the semantic information needed to construct the text in the final target Indian language within the class of languages. This intermediate language has been referred to as a pseudo-lingua. It plays a role similar to the interlingua of an MT system using the interlingua approach, but it is not an interlingua in the real sense, as it caters only to the class of languages for which it has been designed, and the source language is assumed to be English. An interlingual MT system envisages embodying a knowledge representation schema wherein all ambiguities of the source language are assumed to have been resolved. This is an ideal situation that is hard to achieve. PLIL, on the other hand, does not claim to have a representation wherein all ambiguities have been resolved. The English to PLIL encoder generates a structure which is as per the requirement of the target Indian language. Thus the PLIL to target language decoder design becomes a text generation task.
PLIL consists of two major components [25]:
A multi-lingual lexical database of English to Indian languages: An English root word/lexicon is mapped onto the corresponding target language lexicon along with its associated grammatical and semantic information. A root word may have multiple categories and/or multiple meanings. A lexicon in a language represents a certain mental concept as visualized by the native speaker. The mapping into the target language meaning is to the lexicon representing the closest concept of the speaker. In PLIL, a concept is uniquely represented by the syntacto-semantic information associated with a root word and its meaning.
A grammar representing the family/class of target languages: This grammar has been loosely defined around a CFG formalism which generates the word order for the class of languages. A sentence in PLIL is defined in terms of NP, VP and other constructs as expected in the case of any natural language. In addition, a number of keywords/terms are used to denote the nature of the sentence, connectives and indicators that help in lexical choice or invoke functions in the process of target language synthesis. Many of these symbols are self-explanatory and are taken directly from the English sentence; these keywords/symbols can be found in map.c, tam_gen_rules.c, verb_para, the *.pl files marked with the keyword TLDC, the *.txt files, phrasals.txt and already_hindi. An explanation of some of the additional symbols used is given in the appendix. Anglabharti uses a pattern-directed rule-base to convert the input English sentence structure into the PLIL structure. The constituents of PLIL are formally explained below and some examples are included. In many of the PLIL examples, only one alternative is shown for explanation.
4.2.1 PLIL Structure:
<np>:
{<det> <adj> (<lexicon><grammatical category><GNP> [<semantic type>] [<list of
meanings>:<gender><paradigm number>] [<other language>] [<other language>] ) }
<adj>:
{ <det> ( <lexicon> <grammatical category> <degree> [<semantic>] [<list of meanings>] [<other language>] [<other language>] ) }
<pp>:
{ pp <np> ( <lexicon> <grammatical category> [prep_name] ) }
<vp>:
{verb_type <verb_pattern> }
<verb_pattern>:
(<lexicon> <verb form> <pattern_type> <auxiliary> <GNP> [<list of meanings>] <verb paradigm number> [<other language>] [<other language>] )
[(verb_types: Active, Passive), (verb forms: verb_1 (e.g., eat), verb_2 (e.g., eats), verb_3 (e.g., ate), verb_4 (e.g., eaten), verb_5 (e.g., eating)), (pattern_type: see tam_gen_rules.c), (auxiliary: am, was, is, are, were, has, have, had, has_been, have_been, had_been, will, will_be, will_have, will_have_been, etc.)]
<adv>:
(<lexicon> <grammatical category> [<list of meanings>] [<other language>] [<other
language>])
<S>:
<adv> <verb_pattern> <toinf_pattern> < <sen_type> <sub_np> <connective> <pp>
<connective> <obj_np> <connective> <toinf_pattern> <verb_pattern> <adv> <vp>
>.sviram
<comp_sentence>:<S><sentence_connectors><S>
<sub_np>:<np>
<obj_np>:<np>
<toinf>:
{toinf <np> <connective> <verb_pattern> to_in}
| {toinf (verb_pattern) to_in} | {toinf}
<sen_type>:
aff: affirmative (negative sentences preceded by ‘not’ in VP)
imp: imperative type
com: complex type
let: let type
qs qwhat: interrogative, yes/no type
qs: interrogative, wh-type
com if: if-then type
com sen_either: either-or type
com prfxas: sentence starting with ‘as’
complex: multi component type
(the list is partial)
<connective>: k1 | k2 | k3 | k4 | other markers (map.c)
4.2.2 Examples:
Present Simple:
English sentence: They speak Greek.
Hindi translation: ve griika BARA bolawe hain.
PLIL:
<aff {sub_np ( they noun dont_care plural third [human] [ve: m 8] [] [] ) } {obj1_np ( greek sadjnoun dont_care singular third [topic] [grIka BARA : f 3] [] [] ) } k1 {main_vp_active ( speak verb_1 normal normal dont_care plural third [bola] 11 [] [] ) } > . sviram
Present Progressive:
English sentence: He is writing a letter.
Hindi translation: vaha eka pawra liKa rahA hai.
PLIL:
<aff { sub_np ( he noun masculine singular third [human] [vaha: m 8] [] [] ) } { obj1_np ( a
det [eka/{}] [tamil_a] [telgu_a] ) ( letter noun neuter singular third [topic] [pawra : m 6] []
[] ) } k1 { main_vp_active ( write verb_5 normal is masculine singular third [liKa] 11 [] [] )
} > . sviram
Chapter 5
Implementation
In order to implement a Machine Aided Translation System, we have to perform
Morphological Analysis. We start this chapter with a brief note on Morphological Analysis.
5.1 Why Morphological Analysis?
The first question is why we need to perform morphological analysis at all. If we had an exhaustive lexicon which listed all the word forms of all the roots and, along with each word form, its feature values, then clearly we would not need a morphological analyzer. Given a word, all we would need to do is look it up in the lexicon and retrieve its feature values. For example, suppose an exhaustive lexicon for Hindi contains the following entries related to the roots ‘laDakA’ and ‘kapaDA’, as in Figure 5.1 [24]:
Word Form   Category   Root     Gender   Number   Person   Case
laDakA      noun       laDakA   masc.    sg.      3rd      direct
laDake      do.        do.      do.      pl.      do.      do.
laDake      do.        do.      do.      sg.      do.      oblique
laDakoM     do.        do.      do.      pl.      do.      do.
kapaDA      noun       kapaDA   masc.    sg.      3rd      direct
kapaDe      do.        do.      do.      pl.      do.      do.
kapaDe      do.        do.      do.      sg.      do.      oblique
kapaDoM     do.        do.      do.      pl.      do.      do.

Figure 5.1: Example of an exhaustive lexicon for Hindi
Now, given a word, it can be looked up and its feature values returned.
This method has several problems. First, it is extremely wasteful of memory space. Every
form of the word is listed which contributes to the large number of entries in such a lexicon.
Even when two roots follow the same rule, the present system stores the same information
redundantly.
Second, it does not show relationships among different roots that have similar word forms. Thus, it fails to represent a linguistic generalization. This is necessary if the system is to have the capability of understanding (or even guessing) an unknown word. (In fact, human beings routinely deal with word forms they have never heard before when they know the root and the affixes separately.) In the generation process, this linguistic knowledge can be used if the system needs to coin a new word.
Third, some languages have a rich and productive morphology. The number of word forms might well be infinite in such a case. Clearly, the above method cannot deal with such languages.
There is another criterion by which to judge a morphological analyzer or a scheme for morphological analysis: the speed with which it performs the analysis. In the case of the exhaustive lexicon, the time spent in analysis is zero; the only time needed is in searching for and retrieving a word from the lexicon. As the analysis scheme becomes more sophisticated, it is also likely to take more time. A proper balance may therefore have to be struck. The schemes popular in NLP have chosen speed over the requirements of dealing with unknown words.
5.2 Morphological Generation Using Paradigms
For morphological generation, we should have different tables of word forms covering the words in a language. Each table of word forms covers a set of roots, which means that these roots follow the pattern (or paradigm) implicit in the table for generating their word forms. For example, in Hindi the paradigm for ‘laDakA’ and other roots in its class can be specified by giving its word forms. Other roots such as ‘kapaDA’ (cloth) behave like ‘laDakA’ and belong to the same paradigm.
The paradigm can be extracted from the word forms of ‘laDakA’ by identifying the number of characters to be deleted from the root and the characters to be added to obtain each word form. For example, we can say that to obtain the plural oblique case of the root ‘laDakA’, delete the last character (‘A’) and add ‘oM’ at the end:
[root = laDakA, number = plural, case = oblique] → laDakoM
This can be expressed as:
Number      Case
            Direct    Oblique
Singular    (0,∅)     (1,e)
Plural      (1,e)     (1,oM)

Figure 5.2: Paradigm table for the ‘laDakA’ class
5.2.1 Algorithm: Forming paradigm table
a) Create an empty table PT of the same dimensionality, size and labels as the word forms table WFT.
b) For every entry w in WFT (where r is the root), do
a. if w = r
i. then store (0,∅) in the corresponding position in PT
b. else begin
i. let i be the position of the first character at which w and r differ,
ii. store (size(r)-i+1, suffix(i, w)) at the corresponding position in PT.
c. end.
c) Return PT.
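A straightforward Python rendering of this algorithm is sketched below; the function and variable names are illustrative rather than taken from the thesis implementation. Run on the word forms of ‘laDakA’, it reproduces the paradigm table of Figure 5.2.

def form_paradigm_table(root, word_forms):
    # word_forms maps a (number, case) label to a surface form, e.g.
    # ("pl", "oblique") -> "laDakoM"; each paradigm entry is a pair
    # (number of characters to delete from the root, suffix to add).
    pt = {}
    for label, w in word_forms.items():
        if w == root:
            pt[label] = (0, "")
        else:
            i = 0  # position of the first character where w and root differ
            while i < min(len(w), len(root)) and w[i] == root[i]:
                i += 1
            pt[label] = (len(root) - i, w[i:])
    return pt

forms = {("sg", "direct"): "laDakA", ("pl", "direct"): "laDake",
         ("sg", "oblique"): "laDake", ("pl", "oblique"): "laDakoM"}
print(form_paradigm_table("laDakA", forms))
# {('sg', 'direct'): (0, ''), ('pl', 'direct'): (1, 'e'),
#  ('sg', 'oblique'): (1, 'e'), ('pl', 'oblique'): (1, 'oM')}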
Along with the roots, the types and other grammatical information that is common to all the
associated endings (i.e. word forms) can be stored. Figure 5.3 shows some example roots
together with common gender information.
Root      Type           Gender
laDakA    (n, laDakA)    m
kapaDA    (n, laDakA)    m
bhASA     (n, bhASA)     f
roTii     (n, laDakii)   f
laDakii   (n, laDakii)   f

Figure 5.3: Dictionary of roots
Here the endings of type (n, laDakA) are applicable to ‘laDakA’ as well as ‘kapaDA’ (cloth), ‘ghoDA’ (horse), etc. Similar is the case with ‘roTii’ (bread), ‘laDakii’ (girl), ‘lakaDii’ (wood), etc. The paradigm table can be used with any of the roots in the same class to generate its word forms. For example, ‘kapaDoM’ can be generated from the root ‘kapaDA’, number plural, and case oblique, by deletion and addition as specified by the paradigm table.
This leads to efficient storage because there is only one paradigm table for a class of roots
rather than a separate word forms table for each root.
5.2.2 Algorithm: Generating a word form
a) If root r belongs to the dictionary of indeclinable words (DI), then return the word stored in DI for r (irrespective of the feature values FV).
b) Let p = paradigm type of r as obtained from the dictionary of roots (DR).
c) Let PT = paradigm table for p.
d) Let (n, s) = entry in PT for feature values FV.
e) w := r minus n characters at the end.
f) w := w plus suffix s.
In fact, the word form table given by the language expert is from the point of view of generation. It is set up so that, given a root and the desired features, one can locate the right table and then look up the right entry. It is not surprising, therefore, that the paradigm table is also set up for generation.
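The generation algorithm admits an equally direct sketch; again the names are illustrative and the dictionaries below are toy stand-ins for the system's lexical databases.

def generate_word_form(root, fv, di, dr, paradigm_tables):
    # di: dictionary of indeclinable words; dr: dictionary of roots,
    # mapping each root to its paradigm type, e.g. ("n", "laDakA").
    if root in di:
        return root                       # indeclinables never change form
    p = dr[root]
    n, s = paradigm_tables[p][fv]         # (characters to delete, suffix)
    w = root[:len(root) - n] if n else root
    return w + s

pt = {("pl", "oblique"): (1, "oM")}
print(generate_word_form("kapaDA", ("pl", "oblique"),
                         set(), {"kapaDA": ("n", "laDakA")},
                         {("n", "laDakA"): pt}))
# kapaDoM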
5.3 The Generator Module
The PLIL or intermediate representation contains all relevant syntactic and semantic information. The translation of the text is performed with this PLIL as input. Here we give a description of the rules of Punjabi grammar which are of relevance to this system design. The modifications of the nouns and the verbs depending on their use are discussed. The rules for all these modifications have been incorporated in the system.
5.3.1 Introduction to Punjabi Language
Punjabi uses a different word order from English. The main differences are that verbs are placed at the end of the sentence and that Punjabi (like other Indian languages) uses postpositions instead of prepositions. Postpositions are like prepositions except that they are written after the noun.
Affirmative Sentences
English: Subject Verb Object → I learn Punjabi.
Punjabi: Subject Object Verb → I Punjabi learn.
English: Subject Verb Preposition Object → I go to shop.
Punjabi: Subject Object Postposition Verb → I shop to go.
Imperative Sentences
English: Verb Place Adverb → Come here now.
Punjabi: Place Adverb Verb → Here now come.
English: Verb Negative Verb Adverb → Do not eat quickly.
Punjabi: Adverb Negative Verb → Quickly not eat.
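These reordering patterns are simple enough to state directly in code. The toy function below is purely illustrative (the actual generator works on the PLIL structure rather than on word lists) and applies the affirmative-sentence patterns above.

def svo_to_sov(subject, verb, obj, preposition=None):
    # English S V (P) O  ->  Punjabi-style S O (P) V; an English
    # preposition becomes a postposition, written after its noun.
    if preposition:
        return [subject, obj, preposition, verb]
    return [subject, obj, verb]

print(" ".join(svo_to_sov("I", "learn", "Punjabi")))   # I Punjabi learn
print(" ".join(svo_to_sov("I", "go", "shop", "to")))   # I shop to go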
5.3.2 PLIL Examples
1) We write books.
<aff {sub_np ( we noun dont_care plural first [human] [asiM :m 8] [] [] ) } {obj1_np
(books noun neuter plural third [thing] [kiwAba : f 9] [] [])} k1 {main_vp_active ( write
verb_1 normal normal dont_care plural first [lika] 11 [] [] ) } > . sviram
2) He is writing a letter.
<aff {sub_np ( he noun masculine singular third [human] [Oha:m 8] [] [] ) } {obj1_np (a
det [ika/{}] [tamil_a] [telgu_a]) (letter noun neuter singular third [topic] [pawara : m 6] []
[])} k1 {main_vp_active ( write verb_5 normal is masculine singular third [lika] 11 [] [] )
} > . sviram
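To give a feel for how the generator consumes such strings, here is a minimal sketch (illustrative only; the system's actual PLIL parser is not reproduced in this thesis) that uses a regular expression to pull the root, gender, number and target-language meaning out of one PLIL constituent.

import re

# Matches one PLIL constituent of the shape
#   ( root category gender number person [semantic] [meaning ...] ... )
CONSTITUENT = re.compile(
    r"\(\s*(?P<root>\S+)\s+(?P<cat>\S+)\s+(?P<gender>\S+)\s+"
    r"(?P<number>\S+)\s+(?P<person>\S+)\s+\[(?P<sem>[^\]]*)\]\s+"
    r"\[(?P<meaning>[^\]]*)\]")

plil = ("<aff {sub_np ( he noun masculine singular third [human] "
        "[Oha:m 8] [] [] ) } ... > . sviram")
m = CONSTITUENT.search(plil)
print(m.group("root"), m.group("gender"), m.group("number"),
      "->", m.group("meaning"))
# he masculine singular -> Oha:m 8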
The Noun Phrase
The noun is modified from the root form suitably to indicate the number information as well. There are two cases in which a word is declined: direct and oblique. The direct form of a word does not undergo any change from its original form when used in a sentence. The oblique form of a word most often reflects a change in the last consonant or vowel of the word when used in a sentence. The declination rules incorporated in the system are shown below in Figure 5.4.
Noun Paradigm Examples: Apawwi 5, rAwa 9, ladakA 11 (Root word, Paradigm number)
Root        num   case   num_dl_ch   suffix   del_ch
"Apawwi"    "s"   "d"    0           ""       ""
"Apawwi"    "p"   "d"    0           "yAz"    ""
"Apawwi"    "s"   "o"    0           ""       ""
"Apawwi"    "p"   "o"    0           "yoM"    ""
"rAwa"      "s"   "d"    0           ""       ""
"rAwa"      "p"   "d"    1           "eM"     "a"
"rAwa"      "s"   "o"    0           ""       ""
"rAwa"      "p"   "o"    1           "oM"     "a"
"ladakA"    "s"   "d"    0           ""       ""
"ladakA"    "p"   "d"    1           "e"      "A"
"ladakA"    "s"   "o"    1           "e"      "A"
"ladakA"    "p"   "o"    1           "oM"     "A"

Figure 5.4: Some declination rules incorporated in the system
num → Number
num_dl_ch → Number of characters to be deleted
del_ch → Character to be deleted
The Verb Phrase
The form of the verb normally depends on the number and gender of the AGENT. Consider as examples the following group of sentences:
ladkA bAzAra jAtA hai.
ladkein bAzAra jAtein hain.
ladkI bAzAra jAtI hai.
ladkiyAna bAzAra jAti hain.
Thus the form of the verb ‘jA’ (go) changes according to the gender and number of the agent. If, however, the tense is past, past perfect, present perfect or future perfect, the form of the verb depends on the number and gender of the OBJECT. This is illustrated by the following examples:
ladke ne santrA khAyA.
ladke ne santrein khAyin.
The verb ‘khA’ (eat) is modified according to the gender and number of the object.
Apart from these modifications, the verb is also modified according to the tense of the sentence. For example, if the verb is ‘khA’ (eat), then the simple present tense form of the verb is indicated by ‘khAtA hai’. For the past tense the verb appears as ‘khAyA’. Thus the verb form varies according to the tense of the sentence.
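The agreement rule just described can be stated compactly in code. The sketch below is illustrative only; the function and variable names are invented for this example, and the tense inventory is the one listed above.

def agreement_source(tense, agent, obj):
    # In past, past perfect, present perfect and future perfect the
    # verb agrees with the OBJECT; otherwise it agrees with the AGENT.
    perfective = {"past", "past_perfect", "present_perfect", "future_perfect"}
    return obj if tense in perfective else agent

agent = {"gender": "m", "number": "sg"}   # e.g. ladkA
obj   = {"gender": "m", "number": "sg"}   # e.g. santrA
print(agreement_source("present", agent, obj) is agent)  # True
print(agreement_source("past", agent, obj) is obj)       # True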
5.4 Results
Example 1:
English Sentence: We write books.
PLIL form:
<aff {sub_np ( we noun dont_care plural first [human] [asiM : m 8] [] [] ) } {obj1_np
(books noun neuter plural third [thing] [kiwAba : f 9] [] [])} k1 {main_vp_active ( write
verb_1 normal normal dont_care plural first [lika] 11 [] [] ) } > . sviram
Generated Punjabi Sentence: asiM kiwAbAm likade haM.
Example 2:
English Sentence: He is writing a letter.
PLIL form:
<aff {sub_np ( he noun masculine singular third [human] [Oaha:m 8] [] [] ) } {obj1_np (a
det [ika] [] []) (letter noun neuter singular third [topic] [pawara : m 6] [] [])} k1
{main_vp_active ( write verb_5 normal is masculine singular third [lika] 11 [] [] ) } > .
sviram
Generated Punjabi Sentence: Oaha ika pawara lika reha hE.
Example 3:
English Sentence: They speak Greek.
PLIL form: <aff {sub_np ( they noun dont_care plural third [human] [Oaha : m 8] [] [] ) }
{obj1_np (greek sadjnoun dont_care plural third [human] [grIka bolI : f 3] [] [])} k1
{main_vp_active ( speak verb_1 normal normal dont_care plural third [bola] 11 [] [] ) } >
. sviram
Generated Punjabi Sentence: Oaha grIka bolI bolade ne.
Example 4:
English Sentence: He was reading the book.
PLIL form:
<aff {sub_np ( he noun masculine singular third [human] [Oaha:m 8] [] [] ) } {obj1_np (the
det [] [] []) (book noun neuter singular third [thing] [kiwAba : f 9] [] [])} k1
{main_vp_active ( read verb_5 normal was masculine singular third [paDa] 11 [] [] ) } > .
sviram
Generated Punjabi Sentence: Oaha kiwaAba paDa reha si.
Chapter 6
Conclusion and Future Scope
6.1 Conclusion
MT is relatively new in India, being about a decade old. In comparison with MT efforts in Europe and Japan, which are at least three decades old, it would seem that Indian MT has a long way to go. However, this can also be an advantage, because Indian researchers can learn from the experience of their global counterparts.
The system uses the interlingua approach for transforming an English language sentence into the corresponding Punjabi language sentence. The system is capable of translating simple English sentences given in the interlingua form. Though the module that has been implemented performs translations from English to Punjabi, the underlying principles are general enough to be used for translation from English to any Indian language.
The implemented system is helpful, but not perfect. There are linguistic problems that cannot be handled by the system. In future, the system can be upgraded to solve these linguistic problems.
As we realize that perfect automatic translation cannot be expected with the current technology, for the time being we have to promote a systematization of machine translation and consider post-editing as a part of the system, while continuing efforts to improve the accuracy of translation.
6.2 Future Scope
The designed system is just an example, a prototype of an MT system. This system can be further expanded to incorporate more features. Sentences which are linked together (e.g. ‘‘I met Ram. He was going to market.’’) cannot be handled by the system as of now, so it can be extended to support this feature. The system handles only affirmative sentences; it can be expanded to handle more complex sentences and a greater variety of sentence types. Clauses are not handled here. There are some words which can be used as a noun, adjective or verb depending on their use in a sentence. This type of ambiguity can be resolved by making modifications to the present system.
REFERENCES
[1] Nagao, M., ‘‘A Framework of a Mechanical Translation between Japanese and English by Analogy Principle’’, in Artificial and Human Intelligence, Elithorn, A. and Banerji, R. (eds.), Elsevier Science Publishers, B.V., 1984.
[2] Jackson, Philip C., ‘‘Introduction to Artificial Intelligence’’, 2nd ed., New York: Dover Publications, 1985.
[3] Nagao, Makoto, ‘‘Machine Translation: How Far Can It Go?’’, Oxford University Press, 1989.
[4] Sinha, R. M. K., ‘‘A Sanskrit based Word-expert model for machine translation among Indian languages’’, in Proc. Workshop on Computer Processing of Asian Languages, AIT, Bangkok, Thailand, Sept. 26-28, pp. 82-91, 1989.
[5] Raman, S. and Alwar, N., ‘‘An AI-Based Approach to Machine Translation in Indian Languages’’, Communications of the ACM, Vol. 33, No. 5, May 1990.
[6] Hutchins, W. J. and Somers, H. L., ‘‘An Introduction to Machine Translation’’, Academic Press, London, 1992.
[7] Sinha, R. M. K. and Sanyal, C., ‘‘Correcting ill-formed Hindi sentences in machine translated output’’, in Proc. Natural Language Processing Pacific Rim Symposium NLPRS'93, Fukuoka, Japan, pp. 109-119, 1993.
[8] Arnold, D., Balkan, L., Humphreys, R. L., Meijer, S. and Sadler, L., ‘‘Machine Translation: An Introductory Guide’’, Blackwells/NCC, London, 1994. http://www.essex.ac.uk/linguistics/clmt/MTbook/HTML/book.html
[9] Lonsdale, Deryle W., Franz, Alexander M. and Leavitt, John R. R., ‘‘Large-Scale Machine Translation: An Interlingua Approach’’, Center for Machine Translation, Carnegie Mellon University, Pittsburgh, PA, USA, 1994.
[10] Sinha, R. M. K., Srivastava, R. and Agrawal, A., ‘‘Designing Hindi Text Generator for Machine Translation’’, in Proc. Symposium on Natural Language Processing SNLP'95, Bangkok, Thailand, pp. 286-296, 1995.
[11] Jain, Renu, Sinha, R. M. K. and Jain, A., ‘‘Role of Examples in Machine Translation’’, in Proc. IEEE International Conference on Systems, Man and Cybernetics, Vancouver, Canada, pp. 1615-1620, 1995.
[12] Finlay, Janet and Dix, Alan, ‘‘An Introduction to Artificial Intelligence’’, London: UCL Press, 1996.
[13] Jain, R., Sinha, R. M. K. and Jain, A., ‘‘Translation between English and Indian Languages’’, Journal of Computer Science and Informatics, pp. 19-25, 1997.
[14] Turcato, D., McFetridge, P., Popowich, F. and Toole, J., ‘‘A unified example-based and lexicalist approach to machine translation’’, in Proc. 8th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI '99), Chester, 1999.
[15] Sinha, R. M. K., ‘‘Hybridizing Rule-Based and Example-Based Approaches in Machine Aided Translation System’’, in Proc. International Conference on Artificial Intelligence IC-AI'2000, June 26-29, Las Vegas, USA, 2000.
[16] Manning, C. and Schutze, H., ‘‘Foundations of Statistical Natural Language Processing’’, Cambridge: The MIT Press, 2000.
[17] Sinha, R. M. K., Jain, Renu and Jain, Ajai, ‘‘Translation from English to Indian Languages: ANGLABHARTI Approach’’, in Proc. Symposium on Translation Support Systems STRANS2001, February 15-17, Kanpur, India, 2001.
[18] Jain, Renu, Sinha, R. M. K. and Jain, Ajai, ‘‘ANUBHARTI: Using Hybrid Example-Based Approach for Machine Translation’’, in Proc. Symposium on Translation Support Systems STRANS2001, February 15-17, Kanpur, India, 2001.
[19] Generation5, ‘‘An Introduction to Natural Language Theory’’, 24 April 2001. http://www.generation5.org/nlp.shtml
[20] Sinha, R. M. K., ‘‘Dealing with Unknown Lexicons in Machine Translation from English to Hindi’’, in Proc. IASTED International Conference on Artificial Intelligence and Soft Computing, May 21-24, Cancun, Mexico, pp. 333-336, 2001.
Vartika Bhandari, R.M.K. Sinha and Ajai J