Punjabi Text Generation using Interlingua
approach in Machine Translation
A thesis
Submitted in partial fulfillment of the
requirements for the award of the degree
of
Master of Engineering
in
Software Engineering
Under the Supervision of
Dr. R. K. Sharma
Asst. Professor
School of Mathematics & Computer Applications
Thapar Institute of Engineering and Technology, Patiala
Submitted By
SACHIN KALRA
(8023114)
Computer Science & Engineering Department
Thapar Institute of Engineering & Technology
(Deemed University), Patiala-147004 (India)
June 2004
Declaration
I hereby certify that the work which is being presented in the thesis entitled,
“Punjabi Text Generation using Interlingua approach in Machine Translation”, in
partial fulfillment of the requirements for the award of the degree of Master of Engineering
in Software Engineering submitted in Computer Science and Engineering Department
of Thapar Institute of Engineering and Technology (Deemed University), Patiala, is an
authentic record of my own work carried out under the supervision of Dr. R. K. Sharma.
The matter presented in this thesis has not been submitted by me for the award of
any other degree of this or any other University.
SACHIN KALRA
This is to certify that the above statement made by the candidate is correct and true to the best of my knowledge.
Dr. R. K. Sharma
Asst. Professor
School of Mathematics & Computer Applications
Thapar Institute of Engg. & Technology, Patiala
Countersigned by
(Dr. D.S. Bawa)
Dean (Academic Affairs)
Thapar Institute of Engg. & Technology,
Patiala.
(Ms. Seema Bawa)
Assistant Professor & Head,
Computer Sc. & Engg. Department,
Thapar Institute of Engg. & Technology,
Patiala.
Acknowledgement
A journey is easier when traveled together. Interdependence is certainly more valuable than
independence. This thesis is the result of work carried out during the final year of my
course whereby I have been accompanied and supported by many people. It is a pleasant
aspect that I have now the opportunity to express my gratitude for all of them.
No amount of words can adequately express the debt I owe to Dr. R. K. Sharma, Assistant Professor, School of Mathematics & Computer Applications, for his kind support, motivation and inspiration that spurred me on in the thesis work. I owe him a great deal of gratitude for having shown me this path of research.
I wish to express my gratitude to Ms. Seema Bawa, Assistant Professor & Head, Computer Science & Engineering Department, for her excellent guidance and encouragement right from the beginning of this course. I am also thankful to all the faculty and staff members of the Computer Sc. & Engg. Department for providing me with all the facilities required for the completion of this work.
No thesis could be written without being influenced by the thoughts of others. I would like to thank my friends Harsimran Singh and Surinder Pal Singh, who were always there in the hour of need and provided all the help and support I needed. I am grateful to my brother Deepak Kalra, who helped me with his kind suggestions.
Last but not least, I would like to thank "The Creator of Destinies" for not letting me down in times of crisis and showing me the silver lining in the dark clouds.
SACHIN KALRA
(8023114)
Abstract
The scientific art of Machine Translation (MT) is the attempt to automate all, or part, of the process of translating from one human language to another. At its simplest, translation is nothing more than word substitution (determined by the dictionary) and reordering (determined by reordering rules). However, translating a text well requires not only a good knowledge of the vocabulary of both the source and target languages, but also of their grammar, i.e. the system of rules which specifies whether a sentence is well-formed in a particular language or not. Additionally, it requires some element of real-world knowledge — knowledge of the nature of things out in the world and how they work together — and technical knowledge of the text's subject area.
Interlingua and transfer based approaches to machine translation have long been in use in competing and complementary ways. The former proves economical in situations where translation among multiple languages is involved, while the latter is used for pair-specific translation tasks. The additional attraction of an interlingua is that it can be used as a knowledge representation scheme. But given a particular interlingua, its adoption depends on its ability to (a) capture the knowledge in texts precisely and accurately and (b) handle cross-language divergences.
The aim of this thesis is to design a Machine Translation (MT) system which translates sentences from an interlingual representation of English sentences into Punjabi sentences. The input to the system is an interlingual representation that follows the structure of the family of target Indian languages. The interlingual form is a knowledge representation that contains most of the semantic information needed to construct the text in the Punjabi language. The generator takes an interlingual representation of meaning as input and produces a sentence with that meaning as output in the Punjabi language.
The implementation is done in the C language on the Windows platform. The output sentences are written in Punjabi, with appropriate changes and certain assumptions.
CONTENTS
Certificate ....................................................................... i
Acknowledgement ................................................................... ii
Abstract .......................................................................... iii
List of Figures ................................................................... vii
Chapter 1: Introduction ........................................................... 1
1.1 Introduction to Artificial Intelligence ....................................... 1
1.2 Introduction to Natural Language Processing (NLP) ............................. 3
1.3 Applications of NLP ........................................................... 7
Chapter 2: Machine Translation .................................................... 9
2.1 Definition .................................................................... 9
2.2 Types of Machine Translation .................................................. 10
2.2.1 Machine-Aided Human Translation ............................................. 10
2.2.2 Human-Aided Machine Translation ............................................. 10
2.2.3 Fully-automated Machine Translation (FAMT) .................................. 11
2.3 Historical Review of MT ....................................................... 11
2.3.1 Before the computer ......................................................... 11
2.3.2 The pioneers, 1947-1954 ..................................................... 12
2.3.3 The decade of optimism, 1954-1966 ........................................... 12
2.3.4 The aftermath of the ALPAC report, 1966-1980 ................................ 13
2.3.5 The 1980s ................................................................... 14
2.3.6 The early 1990s ............................................................. 15
2.3.7 The late 1990s .............................................................. 15
2.4 Various Strategies to Machine Translation ..................................... 16
2.4.1 Direct MT system ............................................................ 17
2.4.2 Indirect MT system .......................................................... 19
2.4.3 Knowledge-based MT (KBMT) ................................................... 22
2.4.4 Example-Based Machine Translation (EBMT) .................................... 24
2.4.5 Statistical MT .............................................................. 27
2.4.6 Hybrid Machine Translation Paradigms ........................................ 29
2.5 What makes MT so difficult? ................................................... 30
2.5.1 Linguistic Problems ......................................................... 31
Chapter 3: Role of Interlingua in Machine Translation ............................. 38
3.1 Interlingua ................................................................... 38
3.2 Machine Translation with and without an Interlingua ........................... 39
3.3 Advantages of Translating with an Interlingua ................................. 39
3.4 Grain Size of Meaning: The Challenge of Interlingua Design .................... 43
Chapter 4: Angla Bharti System Overview ........................................... 44
4.1 System Overview ............................................................... 44
4.2 PLIL: Pseudo-Lingua for Indian Languages ...................................... 49
4.2.1 PLIL Structure .............................................................. 51
4.2.2 Examples .................................................................... 53
Chapter 5: Implementation ......................................................... 55
5.1 Why Morphological Analysis? ................................................... 55
5.2 Morphological Generation Using Paradigms ...................................... 57
5.2.1 Algorithm: Forming paradigm table ........................................... 58
5.2.2 Algorithm: Generating a word form ........................................... 59
5.3 The Generator Module .......................................................... 59
5.3.1 Introduction to Punjabi Language ............................................ 60
5.3.2 PLIL Examples ............................................................... 61
5.4 Results ....................................................................... 64
Chapter 6: Conclusion and Future Scope ............................................ 66
6.1 Conclusion .................................................................... 66
6.2 Future Scope .................................................................. 67
Chapter 1
Introduction
1.1 Introduction to Artificial Intelligence
Artificial Intelligence (AI) is the branch of Computer Science that is primarily concerned
with the ability of machines to adapt and react to different situations as humans do. In order
to achieve artificial intelligence, we must first understand the nature of human intelligence.
Human intelligence is a behavior that incorporates a sense of purpose in actions and
decisions. Intelligent behavior is not a static procedure. Learning, defined as behavioral
changes over time that better fulfill an intelligent being’s sense of purpose, is a fundamental
aspect of intelligence. An understanding of intelligent behavior will be realized when either
intelligence is replicated using machines, or conversely when we prove why human
intelligence cannot be replicated.
In an attempt to gain insight into intelligence, researchers have identified three processes
that comprise intelligence: searching, knowledge representation, and knowledge
acquisition. The field of AI can be broken down into five smaller components, each of
which relies on these three processes to be performed properly. They are: game playing,
expert systems, neural networks, natural language processing, and robotics programming
[2] [28].
Game playing is concerned with programming computers to play games, such as chess,
against human or machine opponents. This sub-field of AI relies mainly on the speed and
computational power of machines. Game playing is essentially a search problem because
the machine is required to consider a multitude of possibilities. While the computational
power of machines is greater than that of the human brain, machines are unable to solve
search problems perfectly because the size of the search space grows exponentially with the
depth of the search, making the problem intractable.
Expert systems are programmed systems that allow trained machines to make decisions
within a very limited and specific domain. Expert systems rely on a huge database of
information, guidelines, and rules that suggest the correct decision for the situation at hand.
Although they mainly rely on their working memory and knowledge base, the systems must
make some inferences. The vital importance of storing information in the database in a
manner such that the computer can “understand” it creates a knowledge representation
problem.
The field of neural networks, inspired by the human brain, attempts to accurately define learning
procedures by simulating the physical neural connections of the human brain. A unique
aspect of this field is that the networks change by themselves, adapting to new inputs with
respect to the learning procedures they have previously developed. The learning procedures
can vary and incorporate many different forms of learning, which include learning by
recording cases, by analyzing differences, or by building identity trees (trees that represent
hierarchical classification of data).
Natural Language Processing (NLP) and robotics programming are two fields that simulate
the way humans acquire information, an integral part of intelligence formation. The two are
separate sub-fields because of the drastic difference in the nature of their inputs. Language,
the input of NLP, is a more complex form of information to process than the visual and
tactile input of robotics. Robotics typically transforms its input into motion, whereas NLP
has no such state associated transformation.
A perfection of each of the sub-fields is not necessary to replicate human intelligence
because a fundamental characteristic of humans is to err. However, it is necessary to form a
system that puts these components together in an interlocking manner, where the outputs of
some of these fields should be inputs for others, to develop a high-level system of
understanding. To date, this technology does not exist over a broad domain.
1.2 Introduction to Natural Language Processing (NLP)
One of the most widely researched applications of Artificial Intelligence is Natural
Language Processing.
NLP’s goal, as previously stated, is to determine a system of
symbols, relations and conceptual information that can be used by computer logic to
communicate with humans. This implementation requires the system to have the capacity
to translate, analyze and synthesize language. With the goal of NLP well defined, one must
clearly understand the problem of NLP. Natural language is any human “spoken or written
language governed by sets of rules and conventions sufficiently complex and subtle enough
for there to be frequent ambiguity in syntax and meaning.” The processing of language
entails the analysis of the relationship between the mental representation of language and its
manifestation into spoken or written form [16].
Humans can process a spoken command into its appropriate action. We can also translate
different subsets of human language (e.g. English to Hindi).
If the results of these
processes are accurate, then the processor (the human) has understood the input. The main
tasks of artificial NLP are to replace the human processor with a machine processor and to
get a machine to understand the natural language input and then transform it appropriately.
Currently, humans have learned computer languages (e.g. C, Perl, and Java) and can
communicate with a machine via these languages. Machine languages (MLs) are a set of
instructions that a computer can execute. These instructions are unambiguous and have
their own syntax, semantics and morphology. The main advantage of machine languages, and the major difference between MLs and NLs, is MLs' unambiguous nature, which is derived from their mathematical foundation. They are also easier to learn because their
grammar and syntax are constrained by the finite set of symbols and signals. Developing a
means of understanding (a compiler) for these languages is remarkably easy compared to
the degree of difficulty of developing a means of understanding for natural languages.
An understanding of natural languages would be much more difficult to develop because of the numerous ambiguities and levels of meaning in natural language. The ambiguity of
language is essentially why NLP is so difficult. There are five main categories into which
language ambiguities fall: syntactic, lexical, semantic, referential and pragmatic [12].
The syntactic level of analysis is strictly concerned with the grammar of the language and
the structure of any given sentence. A basic rule of the English language is that each
sentence must have a noun phrase and a verb phrase. Each noun phrase may consist of a
determiner and a noun, and each verb phrase may consist of a verb, preposition and noun
phrase. There are various different valid syntactic structures, and rules such as this, make
up the grammar of a language and must be represented in a concrete manner for the
computer. Secondly, there must exist a parser, which is a system that determines the
grammatical structure of an input sentence by comparing it to the existing rules. A parser
must break the input down into words and determine by categorizing each word if the
sentence is grammatically sound.
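To make the parsing step concrete, here is a minimal sketch in C of a recursive-descent parser for the toy grammar just described (a sentence is a noun phrase followed by a verb phrase). The grammar and the tiny word lists are illustrative assumptions, not material from the thesis:

```c
#include <stdio.h>
#include <string.h>

/* Toy grammar: S -> NP VP, NP -> (Det) Noun, VP -> Verb (NP).
   Word lists are illustrative assumptions, not a real lexicon. */
static const char *dets[]  = { "the", "a", NULL };
static const char *nouns[] = { "boy", "book", "duck", NULL };
static const char *verbs[] = { "reads", "sees", NULL };

static const char *toks[16];
static int ntoks, pos;

static int in_list(const char *w, const char **list) {
    for (int i = 0; list[i]; i++)
        if (strcmp(w, list[i]) == 0) return 1;
    return 0;
}

/* NP -> (Det) Noun */
static int parse_np(void) {
    if (pos < ntoks && in_list(toks[pos], dets)) pos++;
    if (pos < ntoks && in_list(toks[pos], nouns)) { pos++; return 1; }
    return 0;
}

/* VP -> Verb (NP) */
static int parse_vp(void) {
    if (pos < ntoks && in_list(toks[pos], verbs)) {
        pos++;
        int save = pos;
        if (!parse_np()) pos = save;  /* the NP after the verb is optional */
        return 1;
    }
    return 0;
}

int main(void) {
    char line[] = "the boy reads the book";
    for (char *t = strtok(line, " "); t; t = strtok(NULL, " "))
        toks[ntoks++] = t;
    pos = 0;
    /* S -> NP VP, and every token must be consumed */
    int ok = parse_np() && parse_vp() && pos == ntoks;
    printf("\"the boy reads the book\" is %s\n",
           ok ? "grammatically sound" : "ungrammatical");
    return 0;
}
```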
The lexical level of analysis concerns the meanings of the words that comprise each
sentence. Ambiguity increases when a word has more than one meaning (homonyms). For
example “duck” could either be a type of bird, or an action involving bending down. Since
these two meanings have different grammatical categories (noun and verb) the issue can be
resolved by syntactic analysis. The sentence's structure will be grammatically sound with one of these parts of speech in place. From this information, a machine can determine the definition that appropriately conveys the sense of the word within the sentence. However,
this process does not resolve all lexical ambiguities. Many words have multiple meanings
within the same part of speech, or a part of speech can have sub-categories that also need to
be analyzed. The verb “can” can be considered an auxiliary verb or a primary verb. If it is
to be considered a primary verb, it can convey different meanings. The primary verb “can”
can either mean "to fire (someone)" or "to put something into a container". In order to
resolve these ambiguities we must resort to semantic analysis.
The semantic level of analysis addresses the contextual meanings of the words as they
relate to word definitions. In the “can” example, if another verb follows the word, then it is
most likely an auxiliary verb. Otherwise, if the other words in the sentence are related to
jobs or work then the former definition of the real verb should be taken. If the other words
were related to preserves or jams, the latter definition would be more suitable. The field of
statistical analysis provides methodology to resolve this ambiguity. When this type of
ambiguity arises, we must rely on the meaning of the word to be defined by the
circumstances of its use. Statistical Natural Language Processing (SNLP) looks at language
as a non-categorical phenomenon and can use the current domain and environment to
determine the meanings of words.
SNLP can also be used to gather another type of contextual information. It can track the
slow evolution of word meanings. For example, years ago the word "like" was used in
comparisons, as a conjunction or a verb. Currently, it is often inadvertently used as a
colloquialism. This is the type of contextual information that is necessary in order to
resolve pragmatic ambiguities. Pragmatic ambiguities are cultural phrases or idioms that
have not been developed according to any set rules. For example, in the English language,
when a person asks, "Do you know what time it is?" he usually is not wondering if you are
aware of the hour, but more likely wants you to tell him the time.
Referential ambiguities deal with the way clauses of a sentence are linked together. For
example, the sentence "Ram hit the man with the hammer" has referential ambiguity
because it does not specify if Ram used a hammer to hit a man, or if Ram hit the man who
had a hammer. Referential ambiguities in a sentence are very difficult to reduce because
there may be no other clues in the sentence. In order to determine which clauses of the
sentence refer to or describe each other (in the example, who the hammer belongs to), the
processor would have to increase its scope of analysis and consider surrounding sentences
to look for clarification.
There are many tasks that require an understanding of Natural Language. Database queries,
fact retrieval, robot command, machine translation and automatic text summarization are
just a small subset of the tasks.
Although complete understanding has not yet been
achieved, there are imperfect versions of NLP technologies on the market.
1.3 Applications of NLP
One important application of NLP is Machine Translation (MT): "the automatic translation of text…from one [natural] language to another." The existing MT systems are far from perfect; they usually output a buggy translation, which requires human post-editing. These systems are useful only to those people who are familiar enough with the output language to decipher the inaccurate translations. The inaccuracies are in part a result of the imperfect NLP systems. Without the capacity to understand a text, it is difficult to translate it. Many of the difficulties in realizing MT will be resolved when a system to resolve pragmatic, lexical, semantic and syntactic ambiguities of natural languages is developed [19].
There are currently three approaches to Machine Translation: direct, semantic transfer and interlingual. Direct translation entails a word-for-word translation and syntactic analysis. The word-for-word translation is based on the results of a bilingual dictionary query, and syntactic analysis parses the input and regenerates the sentences according to the output language's syntax rules. For example, the sentence "He reads the book." could be accurately translated into "vaha pusawaka paxawA hai" using this technology. This kind of translation is most common today in commercial systems, such as AltaVista. However, this approach to MT does not account for semantic ambiguities in translation.
The semantic transfer approach is more advanced than the direct translation method
because it involves representing the meaning of sentences and contexts, not just equivalent
word substitutions. This approach consists of a set of templates to represent the meaning of
words, and a set of correspondence rules that form an association between word meanings
and possible syntax structures in the output language. Semantics, as well as syntax and
morphology, are considered in this approach. This is useful because different languages use
different words to convey the same meaning. However, one limitation of this approach is
that each system must be tailored for a particular pair of languages [3].
The third and closest-to-ideal (thus inherently most difficult) approach to MT is translation via an interlingua. "An interlingua is a knowledge representation formalism that is independent of the way particular languages express meaning." This approach would form the intermediate step for translation between all languages and enable fluent communication across cultures. This technology, however, greatly depends on the development of a complete NLP system, in which all levels of analysis and all ambiguities in natural language are resolved in a cohesive manner.
Chapter 2
Machine Translation
2.1 Definition
"Machine Translation (MT) can be defined as a translation where the initiative is with a
computer system, either autonomously (FAHQT = Fully Automatic High Quality
Translation) or where the user is asked to apply post-editing or pre-editing, or to answer
clarification/disambiguation dialogues [8]."
The term "Machine Translation" (MT) refers to the use of a machine for aiding or performing translation tasks involving more than one human language. Bearing this definition in mind, work on MT in fact started in the 17th century, when the use of mechanical dictionaries was first suggested. The machine translation (MT) 'systems' invented in those days were merely mechanical dictionaries for aiding human translation. The whole translation process relied very much on human effort. Though the MT 'systems' invented in the 17th century were referred to as mechanical dictionaries, they were not aiming merely at providing the meaning of words in the lexicon. They were aiming at forming an unambiguous language, based on logical principles and iconic symbols, which would allow people to communicate with each other without fear of misunderstanding. Since then, research on MT focused on producing different proposals for this kind of unambiguous language. A well-known unambiguous language of this kind is Esperanto, and it has been used as an interlingua in some interlingual MT systems and multi-lingual dictionary programs.
2.2 Types of Machine Translation
Hutchins and Somers [6] divided MT systems into three different categories:
2.2.1 Machine-Aided Human Translation
In this category, we can include the following:
• Spell, grammar or style checkers
• Monolingual or bilingual dictionaries, thesauri and encyclopedias
• Optical Character Recognition (OCR) programs and automatic term lookup
• Machine pre-translation: replacing source language (SL) words and phrases that are unambiguous
2.2.2 Human-Aided Machine Translation
A Human-Aided Machine Translation system generally consists of the following:
• Pre-editing: checking through the source text for foreseeable problems for MT and attempting to remove them, e.g. marking grammatical categories of homographs or substituting unknown words. The use of a controlled language can also be considered a form of pre-editing.
• Interactive MT: an MT system which pauses and asks the user to resolve ambiguities.
• Post-editing: correcting the output of MT to an agreed standard, e.g. amending the style of the output sentences, or making any minimal amendments which will make the text more readable.
2.2.3 Fully-automated Machine Translation (FAMT)
The source language text is fed into the computer as a file, and the computer produces a
translation automatically without any human intervention. This is sometimes referred to as
batch mode. There are two types of fully automatic machine translation: fully automatic high-quality machine translation (FAHQMT) and low-quality machine translation.
2.3 Historical Review of MT
2.3.1 Before the computer
It is possible to trace ideas about mechanizing translation processes back to the seventeenth
century, but realistic possibilities came only in the 20th century. In the mid 1930s, a
French-Armenian Georges Artsrouni and a Russian Petr Troyanskii applied for patents for
‘translating machines’ . Of the two, Troyanskii'
s was the more significant, proposing not
only a method for an automatic bilingual dictionary, but also a scheme for coding
interlingual grammatical rules (based on Esperanto) and an outline of how analysis and
synthesis might work. However, Troyanskii’ s ideas were not known about until the end of
the 1950s. Before then, the computer had been born.
2.3.2 The pioneers, 1947-1954
Soon after the first appearance of 'electronic calculators', research began on using computers as aids for translating natural languages. The beginning may be dated to a letter in March 1947 from Warren Weaver of the Rockefeller Foundation to cyberneticist Norbert
Wiener. Two years later, Weaver wrote a memorandum (July 1949), putting forward
various proposals, based on the wartime successes in code breaking, the developments by
Claude Shannon in information theory and speculations about universal principles
underlying natural languages. Within a few years research on machine translation (MT) had
begun at many US universities, and in 1954 the first public demonstration of the feasibility
of machine translation was given (a collaboration by IBM and Georgetown University).
Although using a very restricted vocabulary and grammar, it was sufficiently impressive to
stimulate massive funding of MT in the United States and to inspire the establishment of
MT projects throughout the world [30].
2.3.3 The decade of optimism, 1954-1966
The earliest systems consisted primarily of large bilingual dictionaries where entries for
words of the source language gave one or more equivalents in the target language, and
some rules for producing the correct word order in the output. It was soon recognized that
specific dictionary-driven rules for syntactic ordering were too complex and increasingly ad
hoc, and the need for more systematic methods of syntactic analysis became evident.
Optimism remained at a high level for the first decade of research, with many predictions of
imminent "breakthroughs". However, disillusion grew as researchers encountered "semantic barriers" for which they saw no straightforward solutions. There were some operational systems – the Mark II system (developed by IBM and Washington University) installed at the USAF Foreign Technology Division, and the Georgetown University system at the US Atomic Energy Authority and at Euratom in Italy – but the quality of output was disappointing (although satisfying many recipients' needs for rapidly produced
information). By 1964, the US government sponsors had become increasingly concerned at
the lack of progress; they set up the Automatic Language Processing Advisory Committee
(ALPAC), which concluded in a famous 1966 report that MT was slower, less accurate and
twice as expensive as human translation and that "there is no immediate or predictable
prospect of useful machine translation." It saw no need for further investment in MT
research; and instead it recommended the development of machine aids for translators, such
as automatic dictionaries, and the continued support of basic research in computational
linguistics.
2.3.4 The aftermath of the ALPAC report, 1966-1980
Although widely condemned as biased and short-sighted, the ALPAC report brought a
virtual end to MT research in the United States for over a decade, and it had a great impact elsewhere, in the Soviet Union and in Europe. However, research did continue in Canada, in
France and in Germany. Within a few years the Systran system was installed for use by the
USAF (1970), and shortly afterwards by the Commission of the European Communities for
translating its rapidly growing volumes of documentation (1976). In the same year, another
successful operational system appeared in Canada, the Meteo system for translating weather
reports, developed at Montreal University [30].
In the 1960s in the US and the Soviet Union, MT activity had concentrated on Russian-English and English-Russian translation of scientific and technical documents for a
relatively small number of potential users, who would accept the crude unrevised output for
the sake of rapid access to information. From the mid-1970s onwards the demand for MT
came from quite different sources with different needs and different languages. The
administrative and commercial demands of multilingual communities and multinational
trade stimulated the demand for translation in Europe, Canada and Japan beyond the
capacity of the traditional translation services. The demand was now for cost-effective
machine-aided translation systems that could deal with commercial and technical
documentation in the principal languages of international commerce.
2.3.5 The 1980s
The 1980s witnessed the emergence of a wide variety of MT system types, from a widening number of countries. First there were a number of mainframe systems, whose use
continues to the present day. Apart from Systran, now operating in many pairs of
languages, there was Logos (German-English and English-French); the internally developed
systems at the Pan American Health Organization (Spanish-English and English-Spanish);
the Metal system (German-English); and major systems for English-Japanese and Japanese-English translation from Japanese computer companies.
Throughout the 1980s research on more advanced methods and techniques continued. For
most of the decade, the dominant strategy was that of ‘indirect’ translation via intermediary
representations, sometimes interlingual in nature, involving semantic as well as
morphological and syntactic analysis and sometimes non-linguistic 'knowledge bases'. The most notable projects of the period were the GETA-Ariane (Grenoble), SUSY
(Saarbrücken), Mu (Kyoto), DLT (Utrecht), Rosetta (Eindhoven), the knowledge-based
project at Carnegie-Mellon University (Pittsburgh), and two international multilingual
projects: Eurotra, supported by the European Communities, and the Japanese CICC project
with participants in China, Indonesia and Thailand.
2.3.6 The early 1990s
The end of the decade was a major turning point. Firstly, a group from IBM published the
results of experiments on a system (Candide) based purely on statistical methods. Secondly,
certain Japanese groups began to use methods based on corpora of translation examples, i.e.
using the approach now called ‘example-based’ translation. In both approaches the
distinctive feature was that no syntactic or semantic rules were used in the analysis of texts or in the selection of lexical equivalents; both approaches differed from earlier 'rule-based' methods in their exploitation of large text corpora.
Another feature of the early 1990s was the changing focus of MT activity from ‘pure’
research to practical applications, to the development of translator workstations for
professional translators, to work on controlled language and domain-restricted systems, and
to the application of translation components in multilingual information systems.
2.3.7 The late 1990s
These trends have continued into the later 1990s. In particular, the use of MT and
translation aids (translator workstations) by large corporations has grown rapidly – a
particularly impressive increase is seen in the area of software localization (i.e. the
adaptation and translation of equipment and documentation for new markets). There has
been a huge growth in sales of MT software for personal computers (primarily for use by
non-translators) and even more significantly, the growing availability of MT from on-line
networked services (e.g. AltaVista, and many others). The demand has been met not just by
new systems but also by ‘downsized’ and improved versions of previous mainframe
systems. While in these applications, the need may be for reasonably good quality
translation (particularly if the results are intended for publication), there has been even
more rapid growth of automatic translation for direct Internet applications (electronic mail,
Web pages, etc.), where the need is for fast real-time response with less importance
attached to quality. With these developments, MT software is becoming a mass-market
product, as familiar as word processing and desktop publishing.
2.4 Various Strategies to Machine Translation
The history of MT is dominated by two generations of MT systems. First generation MT systems refer generally to the ones which were constructed before the 1960s. These systems employed a direct approach to MT which was mainly based on word-to-word and/or phrase-to-phrase translations. A simple word-to-word translation cannot resolve the ambiguities arising in MT. A more thorough analysis of the source language text is required to produce a better translation. As the major problem of the first generation of MT was the lack of linguistic information about the source text, researchers moved on to finding ways to capture this information. This gave rise to the development of the indirect MT systems, which are generally regarded as second generation MT systems. This section reviews the characteristics of the first and second generations of MT systems and explains how these systems attempt to tackle the problem of ambiguity. A brief summary of the relationship between these systems is shown in Figure 2.1.
Figure 2.1: The Vauquois Triangle
2.4.1 Direct MT system
A direct MT system (also known as a transformer) simply translates source language text to
the corresponding target language (TL) text in a word-for-word or phrase-to-phrase manner
by means of bilingual dictionary lookup. Then the resulting TL words are reorganized
according to the target language sentence format. In order to improve the output quality,
some direct MT systems perform some morphological analysis before the bilingual
dictionary lookup but they rarely analyze the sentence structure of the source language (SL)
text.
Figure 2.2: Typical building blocks of a direct MT system
Direct MT systems were developed in the 1950s. In those days, computers were very primitive and processing times were very long. This explains why direct MT systems are so simple and do not analyze the linguistics of sentences before performing the translation. Owing to its simple nature, the direct MT approach is very straightforward and easy to implement. It supports the translation of SL sentences which have both matching source-to-target language words and structures similar to those of the TL sentences. However, as very little effort, if any, has been put into disambiguating SL sentences, this approach does not support the translation of ambiguous sentences. This approach also fails to translate sentences into a language which has very different syntactic structures and/or a different use of words/phrases from the source language. The main problem of the direct MT approach is that it analyzes neither the linguistic information nor the meaning of source sentences before performing the translation. Without this information, the resulting MT system cannot resolve the ambiguities that arise in the source sentence and/or during the translation. Thus, this approach fails to translate ambiguous sentences (e.g. "Ram saw a bank on the bank of a river."). As a result, the first generation of MT systems cannot provide a quality translation of the source language text.
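As a minimal sketch of the transformer design just described (illustrative only; the toy dictionary and the transliterated target glosses are assumptions, not drawn from any real system), direct translation reduces to dictionary lookup followed by a fixed reordering rule:

```c
#include <stdio.h>
#include <string.h>

/* A toy bilingual dictionary; the transliterated target-language
   glosses are illustrative assumptions, not data from a real system. */
struct entry { const char *src, *tgt; };
static const struct entry dict[] = {
    { "he",    "vaha" },
    { "reads", "paxawA hai" },
    { "the",   "" },            /* the target language drops the article */
    { "book",  "pusawaka" },
};

static const char *lookup(const char *w) {
    for (size_t i = 0; i < sizeof dict / sizeof dict[0]; i++)
        if (strcmp(w, dict[i].src) == 0)
            return dict[i].tgt;
    return w;                   /* pass unknown words through unchanged */
}

int main(void) {
    /* Source sentence, already tokenized: Subject Verb Det Object */
    const char *src[] = { "he", "reads", "the", "book" };
    /* One hard-coded reordering rule: SVO -> SOV */
    const int order[] = { 0, 2, 3, 1 };

    for (int i = 0; i < 4; i++) {
        const char *t = lookup(src[order[i]]);
        if (*t) printf("%s ", t);   /* skip empty translations */
    }
    printf("\n");                   /* -> "vaha pusawaka paxawA hai" */
    return 0;
}
```

Everything beyond the one reordering rule is missing, which is exactly why such systems fail on ambiguous or structurally divergent input.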
2.4.2 Indirect MT system
Owing to the fact that linguistic information helps an MT system to disambiguate SL
sentences and to produce better quality target language translation, with the advance of
computing technology, MT researchers started to develop methods to capture and process
the linguistics of sentences. This was when the era of indirect MT systems started. Hutchins
and Somers [6] identified two kinds of second-generation MT systems: transfer-based and
interlingual systems as shown in Figures 2.3 and 2.4.
Figure 2.3: Typical building blocks of a transfer-based MT system
Figure 2.4: Building blocks of an interlingual MT system
The structures of these systems are fairly similar. The module ‘Source Text Analysis’ aims
at capturing the required linguistic information about the SL sentences for aiding the
translation. The transfer-based approach uses the information obtained from the analysis module directly to look up the corresponding TL words. The interlingual approach involves
the use of an intermediate language (i.e. an interlingua) for the transfer -- with the SL text
translated to the interlingua and the interlingua translated to the TL text. As suggested by
Hutchins and Somers, an interlingua is an intermediate ‘meaning’ representation and this
representation:
‘‘includes all information necessary for the generation of the target text without ‘looking
back’ to the original text. The representation is thus a projection from the source text and at
the same time acts as the basis for the generation of the target text; it is an abstract
representation of the target text as well as a representation of the source text [6]. ’’
Some researchers used an existing artificial language (e.g. Esperanto) as the interlingua
because it is generally believed to be more regular and consistent, both lexically and
structurally, than natural languages and could capture the characteristics of any natural
language in a relatively precise way. In addition, as these artificial languages had already
been developed, they can be incorporated into an interlingual MT system directly. No
additional effort is required to define the interlingua. The use of an interlingua enables an
MT system to perform the translation without looking back at and referring to the original
SL text. After translating the SL words to their TL forms, the job of the ‘Target Text
Generation’ module is to synthesize the resulting TL words to form the target sentences.
One advantage of the transfer-based approach is that it allows the source language text to be
analyzed according to what is required for facilitating its translation to a target language.
Thus, much less effort, if any, would be wasted in analyzing the unnecessary features of the
SL sentences. In addition, this approach also facilitates a close examination of the
differences between a language pair. This, in turn, will facilitate the design and
implementation of the required MT system. The interlingual approach, however, is more time-consuming, as a lot of processing time is consumed in the 'double transfer'. It also gives ambiguities a double chance to occur -- during both the translation into and out of the interlingua. However, if a multilingual MT system is to be built, this approach reduces the time and effort needed to produce a transfer module for each language pair (as required in the transfer-based approach), as shown in Figure 2.5.
Figure 2.5: Interlingual system (analysis modules for English, Hindi and Punjabi text feed a common interlingua, from which generation modules produce English, Hindi and Punjabi text)
The system structures of both the transfer-based and interlingual approaches allow a systematic analysis and processing of the linguistic information about sentences. However, these approaches do not provide an immediate solution to the problems of ambiguity and language difference. A lot of detailed investigation into resolving the linguistic problems that occur during translation is still required [5].
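The economy of the interlingual design can be pictured as module composition: each language contributes one analyzer (text to interlingua) and one generator (interlingua to text), so N languages need only 2N modules instead of the N(N-1) transfer modules of a pairwise design. The following C sketch is purely illustrative; the function names and the string-valued interlingua are assumptions:

```c
#include <stdio.h>

/* Hypothetical module signatures: each language contributes one
   analyzer (text -> interlingua) and one generator (interlingua -> text). */
typedef struct { const char *meaning; } Interlingua;

typedef Interlingua (*Analyzer)(const char *text);
typedef void        (*Generator)(Interlingua il);

static Interlingua analyze_english(const char *text) {
    (void)text;  /* a stand-in for real source-text analysis */
    Interlingua il = { "READ(agent:HE, object:BOOK, tense:PRESENT)" };
    return il;
}

static void generate_punjabi(Interlingua il) {
    printf("Punjabi generator consumes: %s\n", il.meaning);
}

static void generate_hindi(Interlingua il) {
    printf("Hindi generator consumes: %s\n", il.meaning);
}

int main(void) {
    Analyzer  analyze   = analyze_english;
    Generator targets[] = { generate_punjabi, generate_hindi };

    /* One analysis of the source text serves every target language. */
    Interlingua il = analyze("He reads the book.");
    for (size_t i = 0; i < sizeof targets / sizeof targets[0]; i++)
        targets[i](il);
    return 0;
}
```

Adding a new language here means writing one new analyzer and one new generator, never touching the existing modules.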
2.4.3 Knowledge-based MT (KBMT)
Arnold et al. define it as
‘‘The term knowledge-based MT has come to describe a rule-based system displaying
extensive semantic and pragmatic knowledge of a domain, including an ability to reason, to
some limited extent, about concepts in the domain" [8].
The assumption behind KBMT is that high-quality translation requires an in-depth understanding of the text. A domain model which supports this in-depth understanding of the meaning and relationships of words in the text is therefore used to aid the translation process. The motivation behind KBMT is that post-editing is time-consuming and expensive; it is therefore worth putting more effort into designing an MT system which can produce high-quality output without human intervention, so that no post-editing is needed to obtain a very high-quality translation. KBMT tends to be domain-specific (especially for domains which are relatively less ambiguous, e.g. technical documents) because it is very complicated and difficult to represent complete knowledge about the whole world.
Some basic components of a KBMT system are:
• An ontology of concepts (serves as an interlingua)
• An SL lexicon and grammar for the analysis process
• A TL lexicon and grammar for the generation process
• Mapping rules between the interlingua and SL/TL syntax
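Pictured as a single structure, these four components might be wired together as in the following C skeleton (hypothetical names and fields, not drawn from any cited system):

```c
#include <stdio.h>

/* Hypothetical skeleton tying together the components listed above;
   the type names, field names and contents are illustrative assumptions. */
typedef struct { const char *concepts;  } Ontology;     /* the interlingua */
typedef struct { const char *language;  } Lexicon;      /* words + grammar */
typedef struct { const char *direction; } MappingRules; /* IL <-> syntax   */

typedef struct {
    Ontology     ontology;
    Lexicon      sl_lexicon, tl_lexicon;
    MappingRules sl_to_il, il_to_tl;
} KBMTSystem;

int main(void) {
    KBMTSystem sys = {
        { "domain concepts" },
        { "English" }, { "Punjabi" },
        { "English syntax -> interlingua" },
        { "interlingua -> Punjabi syntax" },
    };
    printf("analysis:    %s lexicon, %s\n",
           sys.sl_lexicon.language, sys.sl_to_il.direction);
    printf("interlingua: %s\n", sys.ontology.concepts);
    printf("generation:  %s lexicon, %s\n",
           sys.tl_lexicon.language, sys.il_to_tl.direction);
    return 0;
}
```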
Strengths of the KBMT approach are:
• It supports the production of very high quality translation.
• It allows good modularity in the resulting MT system; the development of a parser can be completely independent of the generator of the MT system.
• Because the parser and the generators are independent of each other, the development of the SL and TL components can overlap. This, in turn, reduces the system development time.
• As no explicit SL-TL mapping is required for each language pair, any source language supported by the system can be translated to any target language defined in the system.
• It makes the addition of a new language to the existing system easier. A new language can be added through the implementation of a new parser and/or generator module(s) which link(s) this language to the interlingua. The newly incorporated parser and/or generator will then be able to co-operate with the other parsers and generators to produce the required translation.
Some weaknesses of the KBMT approach are:
• Given that the main reason for the inadequacy of many existing MT systems is the lack of adequate analysis and understanding of the SL text, the idea of using deep textual understanding for MT is perhaps one of the best ways to improve existing MT technology. However, an effective KBMT system relies on a good means of knowledge acquisition and representation, which is not readily available.
• The use of an interlingua for meaning representation reduces the amount of effort required for developing a multilingual MT system. However, it is not easy to select or to define an adequate interlingua. Without an adequate interlingua, deep textual understanding will not be supported by the resulting KBMT system, and its effectiveness will be reduced significantly.
• The success of a KBMT system depends on a large amount of hand-coded lexical knowledge. This hand-coding process is time-consuming and labour-intensive. Some means of alleviating this problem is required.
2.4.4 Example-Based Machine Translation (EBMT)
According to Turcato et al., "EBMT is essentially translation by analogy. EBMT is also regarded as a case-based reasoning approach to MT, where previously resolved translation cases are reused to translate new SL text" [14].
The basic assumption of EBMT is: "If a previously translated sentence occurs again, the same translation is likely to be correct again." This idea is sometimes thought to be reminiscent of how human translators proceed when using a bilingual dictionary: looking at the examples given to find the SL example that best approximates what they are trying to translate, and constructing a translation on the basis of the TL example that is given. Konstantinidis presented the general architecture of an EBMT system as shown in Figure 2.6.
Figure 2.6: EBMT Architecture
The EBMT approach proposed by Nagao [1] uses raw, unanalyzed, unannotated bilingual data and a set of SL and TL lexical equivalences mainly expressed in terms of word pairs (with SL and TL verb equivalences expressed in terms of case frames) as the linguistic backbone of the translation process. The translation process is mainly a matching process which aims at locating the best match, in terms of semantic similarity, between the input sentence and the available examples in the database.
In EBMT, instead of using explicit mapping rules for translating sentences from one language to another, the translation process is basically a procedure of matching the input sentence against the stored example translations. The basic idea is to collect a bilingual corpus of translation pairs and then use a best-match algorithm to find the closest example to the source phrase in question. This gives a translation template, which can then be filled in by word-for-word translation. The distance calculation, for finding the best match for a source phrase, can involve calculating the closeness of items in a hierarchy of terms and concepts provided by a thesaurus.
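The best-match step can be sketched as follows. In this toy C program the example base and the crude word-position distance are illustrative assumptions; a real EBMT system would use thesaurus-based semantic distances as described above:

```c
#include <stdio.h>
#include <string.h>

/* A toy example base; the transliterated glosses are illustrative
   assumptions, not real corpus data. */
struct example { const char *src, *tgt; };
static const struct example base[] = {
    { "he reads the book",    "vaha pusawaka paxawA hai" },
    { "she sings a song",     "vaha gIwa gAwI hai" },
    { "he writes the letter", "vaha pawra likhawA hai" },
};

static int split(char *s, char *words[]) {
    int n = 0;
    for (char *t = strtok(s, " "); t; t = strtok(NULL, " "))
        words[n++] = t;
    return n;
}

/* Crude distance: word positions at which the sentences differ,
   plus the difference in length. */
static int distance(const char *a, const char *b) {
    char ca[64], cb[64], *wa[16], *wb[16];
    strcpy(ca, a);
    strcpy(cb, b);
    int na = split(ca, wa), nb = split(cb, wb);
    int n = na < nb ? na : nb, d = na > nb ? na - nb : nb - na;
    for (int i = 0; i < n; i++)
        if (strcmp(wa[i], wb[i]) != 0) d++;
    return d;
}

int main(void) {
    const char *input = "he reads the letter";
    size_t best = 0;
    int bestd = 1 << 30;
    for (size_t i = 0; i < sizeof base / sizeof base[0]; i++) {
        int d = distance(input, base[i].src);
        if (d < bestd) { bestd = d; best = i; }
    }
    printf("closest example:   \"%s\" (distance %d)\n", base[best].src, bestd);
    printf("template to adapt: \"%s\"\n", base[best].tgt);
    return 0;
}
```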
Strengths of the EBMT approach are:
• EBMT is not domain-specific. As the example set becomes more complete, the quality of translation will improve incrementally without the need to update and improve detailed grammatical and lexical descriptions.
• This approach can be (in principle) very efficient, since in the best case there is no complex rule application to perform -- all one has to do is find the appropriate example and (sometimes) calculate distances.
• An EBMT system is potentially multilingual: an EBMT program can be implemented in such a way that it reads in any bilingual translation data and processes them in order to produce the database for translation.
Some weaknesses of the EBMT approach are:
• This method is dependent on the collection of good bilingual data, which might not be readily available.
• The calculation of the best match might be a complicated and lengthy process -- for instance, as suggested by Arnold et al., "when there are a number of different examples each of which matches part of the string, but where the parts they match overlap, and/or do not cover the whole string. In such cases, calculating the best match can involve considering a large number of possibilities" [8].
• In terms of improving the translation quality, the more examples covering different translation cases the better. However, more examples stored in the translation database means a longer time for searching through the database in order to locate the best match.
• In some cases, especially when an input sentence is relatively unambiguous, a simple rule-based system which analyses the linguistic information about the input sentence would be less complicated and thus more efficient.
2.4.5 Statistical MT
The use of statistical data for MT has been suggested since the age of first generation MT. However, this approach was not pursued extensively at the time, perhaps mainly because computers in those days were not powerful enough to support such a computationally intensive approach. Statistical approaches to MT can mean:
• approaches which do not use explicitly formulated linguistic knowledge to perform MT (i.e. pure statistical MT); or
• the application of statistical techniques, or techniques for calculating probabilities, to aid parts of the MT task (e.g. word sense disambiguation).
The idea behind the pure statistical MT approach is to let a computer learn automatically how to translate text from one language to another by examining large amounts of parallel bilingual text, i.e. documents which are nearly exact translations of each other. The statistical MT approach uses statistical data (e.g. which SL lexical unit is translated to which TL word(s), and how often this translation occurs) to perform translation. This statistical data is obtained from an analysis of a vast amount of bilingual text. Different probabilities are extracted from the bilingual texts automatically by a computer, i.e.:
• The probability of a source sentence occurring in the texts,
• The probability of a source word being translated as one, two, three, etc. target words,
• The translation probabilities for each word in each language, and
• The probability that a word in a given position in the SL sentence corresponds to a TL word in a different position in the target sentence (i.e. the distortion probability).
These probabilities are vital to the translation process, as they are the sole information for calculating how an SL sentence should be translated into the TL form. In a pure statistical MT system, no bilingual dictionary or any explicit linguistic information is required to aid the translation. Therefore, techniques for aligning the bilingual text (i.e. bilingual phrase or even word alignment) are required to help the system learn how to perform translation. If there is more than one TL equivalent for an SL word, the frequency of each translation is used for calculating the probability of the use of each translation. In order to cope with the translation of an ambiguous word, the probabilities for the current and neighboring words in a sentence are combined and used to resolve the ambiguity.
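As a toy illustration of combining such probabilities (all figures are invented for illustration, not estimated from any corpus), the following C sketch chooses between two translation senses of the ambiguous word "bank" by weighting each sense's translation probability with evidence from a neighboring word:

```c
#include <stdio.h>
#include <string.h>

/* Toy figures: translation probabilities for the two senses of the
   ambiguous English word "bank". All numbers are illustrative
   assumptions, not estimates from a real corpus. */
struct sense { const char *gloss; double p_trans; };
static const struct sense senses[] = {
    { "financial-institution", 0.70 },
    { "river-edge",            0.30 },
};

/* P(sense | neighbor): how strongly a neighboring word selects a sense. */
static double p_context(int sense, const char *neighbor) {
    if (strcmp(neighbor, "river") == 0)
        return sense == 1 ? 0.95 : 0.05;
    if (strcmp(neighbor, "money") == 0)
        return sense == 0 ? 0.95 : 0.05;
    return 0.5;  /* uninformative neighbor */
}

int main(void) {
    const char *neighbor = "river";   /* ... bank of a river ... */
    int best = 0;
    double bestp = 0.0;
    for (int s = 0; s < 2; s++) {
        /* combine the unigram translation probability with context evidence */
        double p = senses[s].p_trans * p_context(s, neighbor);
        printf("P(%s) = %.3f\n", senses[s].gloss, p);
        if (p > bestp) { bestp = p; best = s; }
    }
    printf("chosen translation sense: %s\n", senses[best].gloss);
    return 0;
}
```

Here the context evidence outweighs the unigram preference, so the "river-edge" sense wins despite its lower standalone probability.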
Strengths of the statistical approach are:
• Even if an exact match for a translation is not listed in the bilingual corpus, the MT system can still use the translation probabilities to approximate a possible translation.
• Provided that a good corpus of bilingual texts is available, statistical MT offers a fast and less costly approach to MT.
• The IBM team involved in the Candide project also demonstrated that knowledge of both the source and target languages is not essential for this approach, as the members of the IBM team knew either very little French or no French at all [8].
• The fact that pure statistical MT learns how to perform MT by observing the translation behavior of a vast amount of bilingual text means that this method is language-independent.
Some weaknesses of the statistical approach are:
• One limitation of statistical MT is that unless the corpus is very large and contains text from different domains (e.g. technical text, newspapers, novels, etc.), the statistics generated tend to be domain-specific; that is, the system produces less accurate results for text from domains other than that of the training data.
• One major drawback of statistical MT is that its translation performance is rather poor: out of 100 short test sentences, only 39% of the translations were correct [8].
• If this approach is used in real-life MT tasks, a lot of post-editing of the resulting translations will be required, which makes this approach very costly.
2.4.6 Hybrid Machine Translation Paradigms
Current thinking in MT circles suggests that significant progress in the field of MT is
unlikely to be achieved by refining any single approach. It has therefore become a common
interest to merge different MT paradigms into one system in order to yield better translation
results. In recent years, one can consequently see the development of an increasing number
of hybrid MT systems with the aim of combining the strengths of each individual approach
and improving overall translation quality as a result. At this point in time, the extent to
which such hybrid MT paradigms can improve the performance of MT engines is not yet
fully known, since the work carried out in this field is still in its infancy.
2.5 What makes MT so difficult?
Natural language translation is not an easy task. Due to the versatile usage of words and phrases, sometimes even a well-trained and experienced human translator has difficulty in translating a piece of source language (SL) text correctly. By no means can a computer in the present age compare with an average human being in terms of understanding knowledge of the real world. Without the ability to understand real-world knowledge, it is all the more difficult to 'teach' a computer to perform a task which even well-trained and experienced human translators find difficult at times. In addition to this inability, there are other problems which impede a computer system in performing high-quality natural language translation. Here we discuss some of these problems. As with a human translator, before a computer can translate text from one language to another, some means is required to 'teach' the computer to perform translation. The simplest way to perform MT is to find the corresponding target language (TL) equivalent for each word present in the source text, one by one.
Direct translation is simple and straightforward: no syntactic analysis is required on the SL sentence, and the source-to-target language equivalents obtained build up the required TL sentence without the need for further processing. Though the word-for-word translation method works fine in translating the English sentence "Ram loves Sita." to Chinese, if this method is used to translate the same English sentence to, say, Hindi, the output sentence obtained will be syntactically incorrect. If an ambiguous SL sentence is translated by the simple word-for-word translation method, the output translation might seem like a piece of junk text to a target language speaker, or worse still, convey a wrong meaning and cause misunderstanding. Therefore, more detailed processing is required for effective MT. A more effective MT approach contains three parts: SL text analysis, source-to-target language transfer and TL text generation. This approach allows a more thorough analysis of the source language text so as to help resolve the ambiguity within it before the required source-to-target language translations are looked up during the source-to-target language transfer. This method also allows the reorganization and/or deletion of selected TL words and the introduction of additional TL words, so that the output sentences conform to the TL grammar. Even though this three-stage MT method can give rise to better MT output, due to the complexity of natural languages there still exist many problems which affect the effectiveness of MT systems. Here we discuss how and why the three-stage MT method is inadequate in catering for real-life translation needs.
2.5.1 Linguistic Problems
If every word within a natural language had only one interpretation (i.e. one syntactic, semantic and pragmatic analysis), MT would become a much simpler task. An MT system could obtain the TL translation by simply analyzing each word within a SL sentence and generating the target sentence according to the TL grammar. However, this is not the case with any natural language. A word not only can have more than one interpretation, it can also combine with other constituents within a sentence to form further interpretations. For instance, a word may appear in more than one syntactic category, e.g. the word ‘ships’ can be a noun or a verb. A word may combine with other word(s) to form a new lexical unit, e.g. the phrasal verb ‘fish for’ as in ‘‘Ram fished for invitations’’. Even within the same syntactic category, a word can have more than one meaning, e.g. the noun ‘saw’ can mean a tool for cutting, or a short, well-known saying or proverb. The existence of ambiguous words makes it more difficult for an MT system to capture the appropriate meaning of a source sentence so as to produce the required translation. Here we briefly discuss the different kinds of problems occurring in natural languages which affect the effectiveness of MT systems.
1. Lexical Ambiguity
Lexical ambiguity occurs when a word possesses more than one meaning. One famous kind of lexical ambiguity is caused by homographs. A homograph is a word (i.e. a sequence of characters) with more than one meaning. For instance, the English word ‘saw’ is commonly used as the past tense of the verb ‘see’, but it can also mean a tool for cutting, the action of using this tool for cutting, as well as a short, well-known saying or proverb. It is not always difficult to disambiguate a homograph. Some homographs have only one meaning within a single syntactic category. For instance, as a noun, the word ‘minute’ means a unit for measuring time; as a verb, it means to make a written record of what is said or decided during a meeting; as an adjective, it means tiny.
• One minute has sixty seconds.
• Part of the job of a secretary is to minute meetings.
• There is only a minute difference between these pictures.
It is relatively easy to disambiguate this kind of homograph: by analyzing the syntactic structure of the sentence and finding the syntactic category of the homograph within it, the appropriate meaning can be obtained. Knowing the syntactic category of a word does not always help the disambiguation of homographs, however, because some homographs have more than one meaning even when they are used in the same syntactic category. For instance, the noun ‘ball’ can mean a dance party or a round object for sports. With this kind of homograph, where more than one meaning exists in the same syntactic category, one way to disambiguate is to consider its semantic properties in relation to the semantic properties of other words in the sentence. For instance, the meaning of ‘ball’ in the sentence ‘‘Ram kicked a ball.’’ must be a round physical object for sports, because the verb ‘kick’ requires physical contact with a physical object, whereas a dance party is an abstract event which cannot be kicked. However, even comparing the semantic properties of words within a sentence does not always help. As pointed out by Hutchins and Somers [6], with a sentence like ‘‘When you hold a ball, ...’’ in which both senses of the verb ‘hold’ (i.e. to grasp and to organize) can be used with the different senses of the noun ‘ball’, it would be difficult to obtain the appropriate meanings unless the later part of the sentence provides more clues to disambiguate these senses.
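The strategy of disambiguating by syntactic category can be sketched with off-the-shelf tools. The fragment below is a minimal illustration, not part of the thesis system; it assumes the NLTK library with its pre-trained tokenizer and tagger data, and the exact tags obtained depend on the tagger used.

import nltk

# With NLTK's default tagger, 'minute' typically comes out as a noun
# (NN) after 'One' and as a verb (VB) after the modal 'will', which is
# enough to separate the time-unit sense from the record-keeping sense.
for sentence in ("One minute has sixty seconds.",
                 "The secretary will minute the meeting."):
    tags = dict(nltk.pos_tag(nltk.word_tokenize(sentence)))
    print(sentence, "->", tags["minute"])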
2. Structural Ambiguity
Structural ambiguity is concerned with the syntactic representation of sentences. It occurs when more than one syntactic structure can be associated with a sequence of words. For instance, a well-known example of this kind is the sentence ‘‘Flying planes can be dangerous.’’ in which the word ‘flying’ can function as a noun or an adjective, resulting in more than one meaning for this sentence [6]:
• It can be dangerous to fly planes.
• Planes that are flying can be dangerous.
Each of these interpretations results in a different translation of the sentence. With this kind of structural ambiguity, which human translators would find difficult to disambiguate without knowledge of the actual event, it is very unlikely that computers can perform the required disambiguation without any human intervention. With ambiguous sentences of this kind, where both analyses result in a valid meaning, it is perhaps impossible to translate the sentence appropriately without knowing the author's intended meaning or the context of the sentence. In such a situation, perhaps an MT system should generate two translations for this sentence. However, not all potential structural ambiguities trigger the need to generate more than one target language translation. For instance, as suggested by Arnold et al. [8], if the modal of the above sentence is replaced by the appropriate tense, i.e.:
• Flying planes is dangerous.
• Flying planes are dangerous.
the syntactic structure of the word sequence ‘flying planes’ can then be disambiguated by analyzing the number agreement between the subject and the verb of each sentence. In some cases, structural ambiguity can even be resolved by analyzing the phrasal structure of the sentences. For instance, consider the following sentences:
• The tape measures are all sold out.
• The tape measures five inches long.
The word sequence ‘tape measures’ has two interpretations in the above sentences: a noun group (i.e. noun modifier + noun) and a noun followed by a verb, respectively. Upon analyzing the structure of each sentence, the appropriate reading of ‘tape measures’ can be obtained. Syntactic processing alone is adequate to perform this kind of disambiguation.
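The ‘‘Flying planes’’ ambiguity can be reproduced mechanically. The toy context-free grammar below, written purely for this illustration (it assumes the NLTK library), licenses both readings of ‘flying’, so a chart parser returns two distinct trees for the same word sequence.

import nltk

# Toy grammar: 'flying planes' is either a gerund verb plus its object
# ('to fly planes') or an adjective modifying a noun ('planes that fly').
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Ger N | Adj N
VP -> Modal V Adj
Ger -> 'flying'
Adj -> 'flying' | 'dangerous'
N -> 'planes'
Modal -> 'can'
V -> 'be'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("flying planes can be dangerous".split()):
    print(tree)
# (S (NP (Ger flying) (N planes)) (VP (Modal can) (V be) (Adj dangerous)))
# (S (NP (Adj flying) (N planes)) (VP (Modal can) (V be) (Adj dangerous)))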
3. Multiword Units
A word can possess more than one meaning and thus cause problems of ambiguity in an MT system. When a word is used in conjunction with other word(s), even if each of these words possesses only one meaning, the combination can also become ambiguous. Two common examples of this kind in the English language are phrasal verbs and idioms. According to the Collins COBUILD English Grammar,
‘‘Phrasal verbs are a special group of verbs which are made up of a verb and an adverb and/or a preposition which are used to extend or change the meaning of a verb.’’
As a phrasal verb often constitutes a meaning that is different from the literal meaning of its constituents, sentences with this kind of verb have a higher chance of being ambiguous. For instance, consider the phrasal verb ‘eat in’ and the co-occurrence of the verb ‘eat’ and the preposition ‘in’, as in [6]:
• Ram eats in on Sundays.
• Ram eats in a restaurant on weekdays.
The first ‘eats in’ has the modified meaning ‘‘to eat at home’’, whereas the second uses the literal meaning of the verb ‘eat’ and the preposition ‘in’, i.e. ‘‘to eat in a particular place’’. One way to disambiguate the above sentences is by analyzing their syntactic structures: the word ‘in’ functions as an adverb, which does not govern an object, and as a preposition, which governs the object noun phrase (NP) ‘‘a restaurant’’, respectively. However, not all phrasal verbs can be disambiguated by simply analyzing the syntax of a sentence. For instance:
• Ram fell for Sita.
• Ram fell for a lie.
where the phrasal verb ‘fall for’ means ‘‘to be attracted towards’’ and ‘‘to be tricked’’ respectively. In both cases, the phrasal verb ‘fall for’ governs an object (i.e. ‘Sita’ and ‘a lie’ respectively) and the sentences have the same structure. Disambiguating this kind of phrasal verb requires the analysis of lexical semantics (i.e. the meaning of words).
4. Language Differences
The problems we have looked at so far concern finding the appropriate word senses used in the SL text. Even when a SL sentence can be successfully disambiguated by an MT system, other problems can hinder the production of an appropriate TL translation. One common problem in translation is that a word in one language might not have an immediate equivalent in another language; such gaps are called lexical holes. For instance, the English verb ‘stab’ has no immediate equivalent in Spanish [6]. One possible way to translate this kind of word is to express the meaning with several TL words, e.g. the English verb ‘stab’ can be translated to a Spanish phrase meaning ‘give knife wound to’. However, some lexical holes might be too difficult to fill (i.e. no TL expression can adequately express the meaning of the SL word) and the only way is to leave the word untranslated. Deciding whether or not to leave a SL word or phrase untranslated is not an easy task, and a computer cannot make this decision on its own. If all lexical holes had to be filled by additional dictionary entries and translation rules while developing an MT system, system development and processing time would be prolonged.
Different languages also seem to classify the world differently. For example, even though both Americans and Brits speak English, there is still room for misunderstanding due to different usage of words. For instance, Brits call the front engine cover of a car ‘bonnet’ and the storage space at the back of a car ‘boot’. Americans, however, use different words to express the same meanings; in fact, to the average American, a ‘bonnet’ is a kind of hat whereas a ‘boot’ is a kind of footwear. Therefore, the sentence ‘‘I unlocked the boot and laid the tools on the bonnet’’, which sounds normal to a Brit, might sound funny to the average American.
Chapter 3
Role of Interlingua in Machine Translation
3.1 Interlingua
The approach to machine translation (MT) known as Interlingual MT requires the
composition of an unambiguous language-neutral representation of the meaning of the
source text from which an equivalent text in a target language may be generated. Thus, a
sub problem for any Interlingua (IL)-based MT system is that of decoding the lexical and
compositional meaning of the source language (SL) text.
There are a number of clear attractions to an interlingual architecture. First, from a purely intellectual or scientific point of view, the idea of an interlingua is interesting and exciting. Second, from a more practical point of view, an interlingual system promises to be much easier to extend by adding new language pairs than a transfer system (or a transformer system). This is because, provided the interlingua is properly designed, it should be possible to add a new language to the system simply by adding analysis and synthesis components for it. Compare this with a transfer system, where one needs not only analysis and synthesis but also transfer components into all the other languages involved in the system. Since there is one transfer component for each ordered language pair, N languages require N × (N − 1) transfer components (one does not need a transfer component from a language into itself) [9].
3.2 Machine Translation with and without an Interlingua
Machine translation methodologies are commonly categorized as direct, transfer, and interlingual. The methodologies differ in the depth of analysis of the source language and the extent to which they attempt to reach a neutral representation of meaning or intent between the source and target languages. Direct translation involves very little analysis of the source language, often only looking up the words in a bilingual dictionary. Transfer usually involves some analysis of the source language; however, in transfer systems, the representation of the source language sentence may not be identical to the representation of the target language sentence. The two representations would be related to each other by transfer rules, i.e. rules that specify which source language structures correspond to which target language structures. Interlingual MT may involve the deepest analysis of the source language: the analysis must be deep enough to neutralize the differences between the source and target languages. Of course, in practice, the boundaries between the three methodologies are not sharp. For example, many transfer systems perform quite deep analysis of the source language.
3.3 Advantages of Translating with an Interlingua
The choice of direct, transfer, or interlingual MT depends on the application of MT and on
the available resources. For example, direct MT may not be able to re-order the words in
the target language and may not provide good translations for idioms and other
constructions for which a word-by-word substitution is not adequate. However, direct MT
may be quick to implement and may be useful for applications for which getting the gist of
the meaning is sufficient. Furthermore, direct MT may be the only option when the only
resource available is a bilingual glossary.
Interlingual MT is particularly advantageous in multi-lingual applications involving more
than two languages. The reason is that interlingual MT requires fewer components in order
to relate each source language to each target language.
An interlingual system is illustrated schematically in Figure 3.1. For each language, there is an analyzer and a generator. The analyzer takes as input a source language sentence and produces as output an interlingual representation of its meaning. The generator takes an interlingual representation of meaning as input and produces a sentence with that meaning as output. To translate from L1 to L2, L1's analyzer produces an interlingual representation and L2's generator produces an L2 sentence with the same meaning.
If there are n languages and we want to be able to translate from each language to each language, n analyzers and n generators are needed, for a total of 2n components. In contrast, a transfer-based system or a direct system might require up to n(n − 1) components: rules that map L1 to L2, L2 to L1, L1 to L3, L3 to L1, L2 to L3, L3 to L2, etc.
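As a quick check of these counts, the small sketch below (illustrative only) compares the number of components needed by the two architectures as the number of languages grows.

def components(n):
    # Interlingual: one analyzer and one generator per language.
    # Transfer/direct: one component per ordered pair of distinct languages.
    return 2 * n, n * (n - 1)

for n in (2, 5, 10):
    interlingual, pairwise = components(n)
    print(n, "languages:", interlingual, "vs", pairwise)
# 2 languages: 4 vs 2
# 5 languages: 10 vs 20
# 10 languages: 20 vs 90

Note that the interlingual architecture only pays off once more than two languages are involved, which is consistent with the observation above that it is particularly advantageous in multi-lingual applications.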
Figure 3.1 shows eight analyzers (for Punjabi, English, Bangla, Marathi, Asamiya, Hindi, Oriya and Gujrati) mapping an input sentence, e.g. the English ‘‘The pain started three days ago.’’, onto a single interlingua representation:

c:give-information+occurrence+health-status
  (health-status=pain, phase=start, e-time=previous,
   time=(relative-time=(time-distance=(quantity=3, time-unit=day),
   time-relation=before)))

and generators for the same eight languages producing output sentences from that representation, e.g. the Hindi ‘‘Xaraxa wIn xin pahale SUrU hUA.’’

Figure 3.1: Multilingual Translation with an Interlingua
There are other advantages of interlingual MT. First, related to the point we have already made, it takes fewer components to add a new language. For example, suppose we want to add language Lm to the system shown in Figure 3.1, and we want all-ways translation between all of the languages. We only need to add an analyzer for Lm and a generator for Lm. Once Lm is connected to the interlingua with an analyzer and a generator, it is automatically connected for input and output to L1-Ln.
Another advantage of the interlingua approach is that the analyzers and generators can be written by monolingual system developers. For example, building an MT system for Hindi and Punjabi does not require anyone to be bilingual in Hindi and Punjabi. It only requires that the Hindi speakers connect Hindi to the interlingua and the Punjabi speakers connect Punjabi to the interlingua.
Interlingual MT also supports paraphrase of the input in the original language. When an English speaker says ‘‘The pain started three days ago’’, the analysis process produces the interlingua shown in Figure 3.1. The interlingua is a system-internal representation which is not of interest to most users, and so is not visible to users. The generator may then produce a target language sentence like ‘‘xaraxa wIn xin pahale SUrU hUA’’. The source language speaker, however, does not know whether the target language translation is correct (because s/he presumably does not speak the target language). In order to give the source language speaker a chance to check the translation, the source language generator can produce a source language sentence from the same interlingua. Since an interlingua represents the meaning of the sentence, the generator might produce a syntactically different sentence such as ‘‘Pain is for the last three days’’, but the meaning of the input sentence should be preserved. The source language speaker can then verify that the meaning is correct.
Of course, paraphrase from the same interlingua might not always reveal a problem. Suppose the target language generator malfunctions, producing an incorrect Hindi sentence, but the source language generator works properly and produces a correct paraphrase of the original sentence. In that case, the source language speaker will not be alerted to the problem with the target language generator. Conversely, the source language generator may malfunction, giving the speaker the mistaken impression that there is a problem with the source language analyzer or the target language generator.
3.4 Grain Size of Meaning: The Challenge of Interlingua Design
The biggest problem of interlingua design is that ‘‘meaning’’ is a bottomless pit. It is always possible to add more detail to a meaning representation, but in order to implement an MT system, the details must end at some point. Many interlingua developers find that the most time-consuming part of interlingua design is deciding when to stop refining the meaning representation. For example, should there be a slightly different shade of meaning for ‘‘I have high blood pressure’’ (more likely to be a persistent condition) and ‘‘My blood pressure is high’’ (more likely to be a temporary current condition)?
Chapter 4
Angla Bharti System Overview
4.1 System Overview
As AnglaHindi is a derivative of Anglabharti, let us first look at the Anglabharti methodology. As pointed out earlier, Anglabharti is a machine-aided translation methodology specifically designed for translating English to Indian languages. English is an SVO language while Indian languages are SOV and have relatively free word order. Instead of designing translators from English to each Indian language, Anglabharti uses a pseudo-interlingua approach. It analyses English sentences only once and creates an intermediate structure with most of the disambiguation performed. The intermediate language structure has the word and word-group order as per the structure of the group of target languages. The intermediate structure is then converted to each Indian language through a process of text generation. The effort in analyzing the English sentences is about 70% and the text generation accounts for the remaining 30%. Thus, with only an additional 30% effort, a new English to Indian language translator can be built. Anglabharti is a pattern-directed rule-based system with a context-free-grammar-like structure for analysis of English as the source language. The analysis generates a ‘pseudo-target’ applicable to a group of Indian languages. A set of rules obtained through corpus analysis is used to identify plausible constituents with respect to which movement rules for the ‘pseudo-target’ are constructed. The idea of using a ‘pseudo-target’ is primarily aimed at incorporating advantages similar to those of the interlingua approach by exploiting structural similarity: Indian languages are verb-ending, have free word-group order, and share a great deal of structural similarity.
Indian languages can be classified into four broad groups according to their origin and similarity [13]: the Indo-Aryan family (Hindi, Bangla, Asamiya, Punjabi, Marathi, Oriya, Gujrati, etc.); the Dravidian family (Tamil, Telugu, Kannada and Malayalam); the Austro-Asiatic family; and the Tibeto-Burman family. Within each group, there is a high degree of structural similarity. The Paninian framework, based on Sanskrit grammar and using the Karak (similar to ‘case’) relationship, provides a uniform way of designing the Indian language text generators using selectional constraints and preferences.
A block schematic diagram of the Anglabharti methodology is depicted in Figure 4.1. A brief description of some of the major building blocks of Anglabharti is given in the following paragraphs [26].
Rule-base: This contains rules for mapping structures of sentences from English to Indian languages. This database of pattern transformations from English to Indian languages is entrusted with the job of making a surface-tree to surface-tree transformation, bypassing the task of obtaining a deep tree of the sentence to be translated. The database of structural transformation rules from English to Indian languages forms the heart of the Anglabharti system. The system is designed to cater to compound, complex, imperative, interrogative and other constructs such as headings. As mentioned earlier, by making a generic rule-base for Indian languages, Anglabharti exhibits a potential benefit while translating from English. This module is also responsible for picking up the correct sense of each word in the source language, to the extent feasible, using an interleaved semantic interpreter. Further disambiguation and the choice of the right construct and lexical preferences are performed by the target language text-generator module. Many a time, multiple rules may get invoked, leading to multiple interpretations of the input sentence. The rules are ordered in terms of their preferences and an upper limit is put on the number of alternatives produced.
These multiple translations are available for further post-editing.
Multi-lingual dictionary/lexical database and sense disambiguator: The lexical database is the fuel of the translation engine. It contains various details for each word in English, such as its syntactic categories, possible senses, keys to disambiguate its senses, and the corresponding words in the target languages with their tags. A number of ontological/semantic tags are used to resolve sense ambiguity in the source language. Most of the disambiguation rules are in the form of syntacto-semantic constraints. We use semantics to resolve most of the intra-sentence anaphora/pronoun references. Alternative meanings for the unresolved ambiguities are retained in the pseudo target language. The lexical database is hierarchically organized to allow domain-specific meanings and also to prioritize meanings as per users' requirements.
Target text generators and corrector for ill-formed sentences [10] [7]: These form the tail end of the system. Their function is to generate the translated output for the corresponding target languages. A text generator module for each of the target languages transforms the pseudo target language to the target language. These transformations do lead to sentences which may be ill-formed. The ill-formed sentences are target language specific and are usually related to incorrect placement of emphasizers, negation and forms denoting cultural dependence (such as plurals being used for persons to whom one pays respect). A corrector for ill-formed sentences is used for each of the target languages. Finally, a human-engineered post-editing package is used to make the final corrections. It is our experience that for more than 50% of normal text, the human post-editor needs to know only the target language, as humans use a lot of contextual information in making the right choice. For resolving structural ambiguity, one needs to consult the source language. It may be noted that by having different text generators using the same rule-base and sense disambiguator, a generic MT system is obtained for a host of target languages. We have used the Paninian framework with a verb-centric, expectation-driven methodology [4] with selectional restrictions/semantic constraints for synthesizing the Indian language text.

Figure 4.1: System Architecture of ANGLABHARTI
AnglaHindi, besides using all the modules of Anglabharti, also makes use of an abstracted example-base for translating frequently encountered noun phrases and verb phrasals. The example-based approach developed by the author's group, named ANUBHARTI [11] [18], is invoked before the rule-based approach is applied. The example-base is statistically derived from the corpus. Ambiguities in the meanings of the verb phrasals are also resolved using an appropriate distance function in the example-base [21]. AnglaHindi accepts unconstrained text [22] [20]. The text may be made up of headings, parenthesized texts, text under quote marks, currencies, varying numeral and date conventions, acronyms, unknowns and other frequently encountered constructs. The performance of the system has been evaluated by human translators. The system generates approximately 90% acceptable translation in the case of simple, compound and complex sentences up to a length of 20 words [10].
The current version of AnglaHindi is not tuned to any specific domain of application or topic. However, it has user-friendly interfaces which allow hierarchical structuring of the lexical database, leading to preferences on lexical choice. Similarly, it has provisions for augmenting its abstracted example-base specific to an application domain. This not only eliminates the alternative translations but also generates more accurate and acceptable translations. Currently, the alternate translations are ranked with respect to the ordering of the rule-base. This can be further enhanced by using domain-specific information and target language statistics. The alternate translations can be ranked based on a hidden Markov model of Hindi in the specific domain. For each alternate translation, the language model yields a figure of merit reflecting preferences for style and lexical choice. Overall, the AnglaHindi system attempts to integrate the example-based approach [15] with a rule-base and human-engineered post-editing. An attempt is made to fuse modern artificial intelligence techniques with the classical Paninian framework based on Sanskrit grammar.
4.2 PLIL: Pseudo-Lingua for Indian Languages
The Anglabharti system architecture exploits the structural similarity of Indian languages. This structural similarity is more homogeneous within each family of languages, such as within the Indo-Aryan family, the Dravidian family and others. The Anglabharti system translates the English source language into an intermediate language that follows the structure of the family of target Indian languages. It contains most of the semantic information needed to construct the text in the final target Indian language within the class of languages. This intermediate language has been referred to as a pseudo-lingua. It plays a role similar to the interlingua of an MT system using the interlingua approach, but it is not an interlingua in the real sense, as it caters only to the class of languages for which it has been designed, and the source language is assumed to be English. An interlingual MT system envisages embodying a knowledge representation schema wherein all ambiguities of the source language are assumed to have been resolved. This is an ideal situation that is hard to achieve. PLIL, on the other hand, does not claim to have a representation wherein all ambiguities have been resolved. The English to PLIL encoder generates a structure which is as per the requirement of the target Indian language. Thus the PLIL to target language decoder design becomes a text generation task.
PLIL consists of two major components [25]:
A multi-lingual lexical database of English to Indian languages: An English root word/lexicon is mapped onto the corresponding target language lexicon along with its associated grammatical and semantic information. A root word may have multiple categories and/or multiple meanings. A lexicon in a language represents a certain mental concept as visualized by the native speaker. The mapping into the target language meaning is to the lexicon representing the closest concept of the speaker. In PLIL, a concept is uniquely represented by the syntacto-semantic information associated with a root word and its meaning.
A grammar representing the family/class of target languages: This grammar has been loosely defined around a CFG formalism which generates the word order for the class of languages. A sentence in PLIL is defined in terms of NP, VP and other constructs as expected in the case of any natural language. In addition, a number of keywords/terms are used to denote the nature of the sentence, connectives and indicators that help in lexical choice or invoke functions in the process of target language synthesis. Many of these symbols are self-explanatory and are taken directly from the English sentence; these keywords/symbols can be found in map.c, tam_gen_rules.c, verb_para, the *.pl files marked with the keyword TLDC, the *.txt files, phrasals.txt and already_hindi. An explanation of some of the additional symbols used is given in the appendix. Anglabharti uses a pattern-directed rule-base to convert the input English sentence structure into the PLIL structure. The constituents of PLIL are formally explained below and some examples are included. In many of the PLIL examples, only one alternative is shown for explanation.
4.2.1 PLIL Structure:
<np>:
{<det> <adj> (<lexicon><grammatical category><GNP> [<semantic type>] [<list of
meanings>:<gender><paradigm number>] [<other language>] [<other language>] ) }
<adj>:
{ <det> ( <lexicon> <grammatical category> <degree> [<semantic>] [<list of meanings>] [<other language>] [<other language>] ) }
<pp>:
{ pp <np> ( <lexicon> <grammatical category> [prep_name] ) }
<vp>:
{verb_type <verb_pattern> }
<verb_pattern>:
(<lexicon> <verb form> <pattern_type> <auxiliary> <GNP> [<list of meanings>] <verb paradigm number> [<other language>] [<other language>] )
[(verb_types: Active, Passive), (verb forms: verb_1 (e.g., eat), verb_2 (e.g., eats), verb_3 (e.g., ate), verb_4 (e.g., eaten), verb_5 (e.g., eating)), (pattern_type: see tam_gen_rules.c), (auxiliary: am, was, is, are, were, has, have, had, has_been, have_been, had_been, will, will_be, will_have, will_have_been, etc.)]
<adv>:
(<lexicon> <grammatical category> [<list of meanings>] [<other language>] [<other
language>])
<S>:
<adv> <verb_pattern> <toinf_pattern> < <sen_type> <sub_np> <connective> <pp>
<connective> <obj_np> <connective> <toinf_pattern> <verb_pattern> <adv> <vp>
>.sviram
<comp_sentence>:<S><sentence_connectors><S>
<sub_np>:<np>
<obj_np>:<np>
<toinf>:
{toinf <np> <connective> <verb_pattern> to_in}
| {toinf (verb_pattern) to_in} | {toinf}
<sen_type>:
aff: affirmative (negative sentences preceded by ‘not’ in VP)
imp: imperative type
com: complex type
let: let type
qs qwhat: interrogative, yes/no type
qs: interrogative, wh-type
com if: if-then type
com sen_either: either-or type
com prfxas: sentence starting with ‘as’
complex: multi component type
(the list is partial)
<connective>: k1 | k2 | k3 | k4 | other markers (map.c)
4.2.2 Examples:
Present Simple:
English sentence: They speak Greek.
Hindi translation: ve griika BARA bolawe hain.
PLIL:
<aff {sub_np ( they noun dont_care plural third [human] [ve: m 8] [] [] ) } {obj1_np ( greek sadjnoun dont_care singular third [topic] [grIka BARA : f 3] [] [] ) } k1 {main_vp_active ( speak verb_1 normal normal dont_care plural third [bola] 11 [] [] ) } > . sviram
Present Progressive:
English sentence: He is writing a letter.
Hindi translation: vaha eka pawra liKa rahA hai.
PLIL:
<aff { sub_np ( he noun masculine singular third [human] [vaha: m 8] [] [] ) } { obj1_np ( a
det [eka/{}] [tamil_a] [telgu_a] ) ( letter noun neuter singular third [topic] [pawra : m 6] []
[] ) } k1 { main_vp_active ( write verb_5 normal is masculine singular third [liKa] 11 [] [] )
} > . sviram
Chapter 5
Implementation
In order to implement a Machine Aided Translation System, we have to perform
Morphological Analysis. We start this chapter with a brief note on Morphological Analysis.
5.1 Why Morphological Analysis?
The first question is why we need to perform morphological analysis at all. If we had an exhaustive lexicon which listed all the word forms of all the roots and, along with each word form, its feature values, then clearly we would not need a morphological analyzer. Given a word, all we would need to do is look it up in the lexicon and retrieve its feature values. For example, suppose an exhaustive lexicon for Hindi contains the following entries related to the roots ‘laDakA’ and ‘kapaDA’, as in Figure 5.1 [24]:
Word Form   Category   Root     Gender   Number   Person   Case
laDakA      noun       laDakA   masc.    sg.      3rd      direct
laDake      do.        do.      do.      pl.      do.      do.
laDake      do.        do.      do.      sg.      do.      oblique
laDakoM     do.        do.      do.      pl.      do.      do.
kapaDA      noun       kapaDA   masc.    sg.      3rd      direct
kapaDe      do.        do.      do.      pl.      do.      do.
kapaDe      do.        do.      do.      sg.      do.      oblique
kapaDoM     do.        do.      do.      pl.      do.      do.

Figure 5.1: Example of an exhaustive lexicon for Hindi
Now, given a word, it can be looked up and its feature values returned.
This method has several problems. First, it is extremely wasteful of memory space. Every
form of the word is listed which contributes to the large number of entries in such a lexicon.
Even when two roots follow the same rule, the present system stores the same information
redundantly.
Second, it does not show relationships among different roots that have similar word forms. Thus, it fails to represent a linguistic generalization. This is necessary if the system is to have the capability of understanding (or even guessing) an unknown word. (In fact, human beings routinely deal with word forms they have never heard before when they know the root and the affixes separately.) In the generation process, this linguistic knowledge can be used if the system needs to coin a new word.
Third, some languages have a rich and productive morphology. The number of word forms might well be infinite in such a case. Clearly, the above method cannot deal with such languages.
There is another criterion by which to judge a morphological analyzer or a scheme for morphological analysis: the speed with which it performs the analysis. In the case of the exhaustive lexicon, the time spent in analysis is zero; the only time needed is in searching for and retrieving a word from the lexicon. As the analysis scheme becomes more sophisticated, it is also likely to take more time. A proper balance may therefore have to be struck. The schemes popular in NLP have chosen speed over the requirements of dealing with unknown words.
5.2 Morphological Generation Using Paradigms
For morphological generation, we should have different tables of word forms covering the words in a language. Each table of word forms covers a set of roots, which means that these roots follow the pattern (or paradigm) implicit in the table for generating their word forms. For example, in Hindi the paradigm for ‘laDakA’ and other roots in its class can be specified by giving its word forms. Other roots such as ‘kapaDA’ (cloth) behave like ‘laDakA’ and belong to the same paradigm.
The paradigm can be extracted from the word forms of ‘laDakA’ by identifying the number of characters to be deleted from the root and the characters to be added to obtain each word form. For example, we can say that to obtain the plural oblique case of the root ‘laDakA’, delete the last character (‘A’) and add ‘oM’ at the end:
[root = laDakA, number = plural, case = oblique] → laDakoM
This can be expressed as:
Number      Case
            Direct    Oblique
Singular    (0,∅)     (1,e)
Plural      (1,e)     (1,oM)

Figure 5.2: Paradigm table for the ‘laDakA’ class
5.2.1 Algorithm: Forming paradigm table
a) Create an empty table PT of the same dimensionality, size and labels as the word forms table WFT.
b) For every entry w in WFT (where r is the root), do
a. if w = r
i. then store (0,∅) in the corresponding position in PT
b. else begin
i. let i be the position of the first character at which w and r differ,
ii. store (size(r)-i+1, suffix(i, w)) at the corresponding position in PT.
c. end.
c) Return PT.
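A straightforward Python rendering of this algorithm is sketched below; the function and variable names are illustrative rather than taken from the thesis implementation. Run on the word forms of ‘laDakA’, it reproduces the paradigm table of Figure 5.2.

def form_paradigm_table(root, word_forms):
    # word_forms maps a (number, case) label to a surface form, e.g.
    # ("pl", "oblique") -> "laDakoM"; each paradigm entry is a pair
    # (number of characters to delete from the root, suffix to add).
    pt = {}
    for label, w in word_forms.items():
        if w == root:
            pt[label] = (0, "")
        else:
            i = 0  # position of the first character where w and root differ
            while i < min(len(w), len(root)) and w[i] == root[i]:
                i += 1
            pt[label] = (len(root) - i, w[i:])
    return pt

forms = {("sg", "direct"): "laDakA", ("pl", "direct"): "laDake",
         ("sg", "oblique"): "laDake", ("pl", "oblique"): "laDakoM"}
print(form_paradigm_table("laDakA", forms))
# {('sg', 'direct'): (0, ''), ('pl', 'direct'): (1, 'e'),
#  ('sg', 'oblique'): (1, 'e'), ('pl', 'oblique'): (1, 'oM')}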
Along with the roots, the types and other grammatical information that is common to all the
associated endings (i.e. word forms) can be stored. Figure 5.3 shows some example roots
together with common gender information.
Root      Type           Gender
laDakA    (n, laDakA)    m
kapaDA    (n, laDakA)    m
bhASA     (n, bhASA)     f
roTii     (n, laDakii)   f
laDakii   (n, laDakii)   f

Figure 5.3: Dictionary of roots
Here the endings of type (n, laDakA) are applicable to ‘laDakA’ as well as ‘kapaDA’ (cloth), ‘ghoDA’ (horse), etc. Similar is the case with ‘roTii’ (bread), ‘laDakii’ (girl), ‘lakaDii’ (wood), etc. The paradigm table can be used with any of the roots in the same class to generate its word forms. For example, ‘kapaDoM’ can be generated from the root ‘kapaDA’, number plural, and case oblique, by deletion and addition as specified by the paradigm table.
This leads to efficient storage because there is only one paradigm table for a class of roots
rather than a separate word forms table for each root.
5.2.2 Algorithm: Generating a word form
a) If root r belongs to the dictionary of indeclinable words (DI), then return the word stored in DI for r (irrespective of the feature values FV).
b) Let p = paradigm type of r as obtained from the dictionary of roots (DR).
c) Let PT = paradigm table for p.
d) Let (n, s) = entry in PT for feature values FV.
e) w := r minus n characters at the end.
f) w := w plus suffix s.
In fact, the word form table given by the language expert is from the point of view of generation. It is set up so that, given a root and the desired features, one can locate the right table and then look up the right entry. It is not surprising, therefore, that the paradigm table is also set up for generation.
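The generation algorithm admits an equally direct sketch; again the names are illustrative and the dictionaries below are toy stand-ins for the system's lexical databases.

def generate_word_form(root, fv, di, dr, paradigm_tables):
    # di: dictionary of indeclinable words; dr: dictionary of roots,
    # mapping each root to its paradigm type, e.g. ("n", "laDakA").
    if root in di:
        return root                       # indeclinables never change form
    p = dr[root]
    n, s = paradigm_tables[p][fv]         # (characters to delete, suffix)
    w = root[:len(root) - n] if n else root
    return w + s

pt = {("pl", "oblique"): (1, "oM")}
print(generate_word_form("kapaDA", ("pl", "oblique"),
                         set(), {"kapaDA": ("n", "laDakA")},
                         {("n", "laDakA"): pt}))
# kapaDoM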
5.3 The Generator Module
The PLIL or intermediate representation contains all relevant syntactic and semantic information. The translation of the text is performed with this PLIL as input. Here we give a description of the rules of Punjabi grammar which are of relevance to this system design. The modifications of the nouns and the verbs depending on their use are discussed. The rules for all these modifications have been incorporated in the system.
5.3.1 Introduction to Punjabi Language
Punjabi uses a different word order from English. The main differences are that verbs are placed at the end of the sentence and that Punjabi (like other Indian languages) uses postpositions instead of prepositions. Postpositions are like prepositions except that they are written after the noun.
Affirmative Sentences
English: Subject Verb Object → I learn Punjabi.
Punjabi: Subject Object Verb → I Punjabi learn.
English: Subject Verb Preposition Object → I go to shop.
Punjabi: Subject Object Postposition Verb → I shop to go.
Imperative Sentences
English: Verb Place Adverb → Come here now.
Punjabi: Place Adverb Verb → Here now come.
English: Verb Negative Verb Adverb → Do not eat quickly.
Punjabi: Adverb Negative Verb → Quickly not eat.
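These reordering patterns are simple enough to state directly in code. The toy function below is purely illustrative (the actual generator works on the PLIL structure rather than on word lists) and applies the affirmative-sentence patterns above.

def svo_to_sov(subject, verb, obj, preposition=None):
    # English S V (P) O  ->  Punjabi-style S O (P) V; an English
    # preposition becomes a postposition, written after its noun.
    if preposition:
        return [subject, obj, preposition, verb]
    return [subject, obj, verb]

print(" ".join(svo_to_sov("I", "learn", "Punjabi")))   # I Punjabi learn
print(" ".join(svo_to_sov("I", "go", "shop", "to")))   # I shop to go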
5.3.2 PLIL Examples
1) We write books.
<aff {sub_np ( we noun dont_care plural first [human] [asiM :m 8] [] [] ) } {obj1_np
(books noun neuter plural third [thing] [kiwAba : f 9] [] [])} k1 {main_vp_active ( write
verb_1 normal normal dont_care plural first [lika] 11 [] [] ) } > . sviram
2) He is writing a letter.
<aff {sub_np ( he noun masculine singular third [human] [Oha:m 8] [] [] ) } {obj1_np (a
det [ika/{}] [tamil_a] [telgu_a]) (letter noun neuter singular third [topic] [pawara : m 6] []
[])} k1 {main_vp_active ( write verb_5 normal is masculine singular third [lika] 11 [] [] )
} > . sviram
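To give a feel for how the generator consumes such strings, here is a minimal sketch (illustrative only; the system's actual PLIL parser is not reproduced in this thesis) that uses a regular expression to pull the root, gender, number and target-language meaning out of one PLIL constituent.

import re

# Matches one PLIL constituent of the shape
#   ( root category gender number person [semantic] [meaning ...] ... )
CONSTITUENT = re.compile(
    r"\(\s*(?P<root>\S+)\s+(?P<cat>\S+)\s+(?P<gender>\S+)\s+"
    r"(?P<number>\S+)\s+(?P<person>\S+)\s+\[(?P<sem>[^\]]*)\]\s+"
    r"\[(?P<meaning>[^\]]*)\]")

plil = ("<aff {sub_np ( he noun masculine singular third [human] "
        "[Oha:m 8] [] [] ) } ... > . sviram")
m = CONSTITUENT.search(plil)
print(m.group("root"), m.group("gender"), m.group("number"),
      "->", m.group("meaning"))
# he masculine singular -> Oha:m 8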
The Noun Phrase
The noun is modified from the root form suitably to indicate the number information as well. There are two cases in which a word is declined: direct and oblique. The direct form of a word does not undergo any change from its original form when used in a sentence. The oblique form of a word most often reflects a change in the last consonant or vowel of the word when used in a sentence. The declination rules incorporated in the system are shown below in Figure 5.4.
Noun Paradigm Examples: Apawwi 5, rAwa 9, ladakA 11 (Root word, Paradigm number)
Root        num   case   num_dl_ch   suffix   del_ch
"Apawwi"    "s"   "d"    0           ""       ""
"Apawwi"    "p"   "d"    0           "yAz"    ""
"Apawwi"    "s"   "o"    0           ""       ""
"Apawwi"    "p"   "o"    0           "yoM"    ""
"rAwa"      "s"   "d"    0           ""       ""
"rAwa"      "p"   "d"    1           "eM"     "a"
"rAwa"      "s"   "o"    0           ""       ""
"rAwa"      "p"   "o"    1           "oM"     "a"
"ladakA"    "s"   "d"    0           ""       ""
"ladakA"    "p"   "d"    1           "e"      "A"
"ladakA"    "s"   "o"    1           "e"      "A"
"ladakA"    "p"   "o"    1           "oM"     "A"

Figure 5.4: Some declination rules incorporated in the system
num → Number
num_dl_ch → Number of characters to be deleted
del_ch → Character to be deleted
The Verb Phrase
The form of the verb normally depends on the number and gender of the AGENT. Consider as examples the following group of sentences:
ladkA bAzAra jAtA hai.
ladkein bAzAra jAtein hain.
ladkI bAzAra jAtI hai.
ladkiyAna bAzAra jAti hain.
Thus the form of the verb ‘jA’ (go) changes according to the gender and number of the agent. If, however, the tense is past, past perfect, present perfect or future perfect, the form of the verb depends on the number and gender of the OBJECT. This is illustrated by the following examples:
ladke ne santrA khAyA.
ladke ne santrein khAyin.
The verb ‘khA’ (eat) is modified according to the gender and number of the object.
Apart from these modifications, the verb is also modified according to the tense of the sentence. For example, if the verb is ‘khA’ (eat), then the simple present tense form of the verb is indicated by ‘khAtA hai’. For the past tense the verb appears as ‘khAyA’. Thus the verb form varies according to the tense of the sentence.
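The agreement rule just described can be stated compactly in code. The sketch below is illustrative only; the function and variable names are invented for this example, and the tense inventory is the one listed above.

def agreement_source(tense, agent, obj):
    # In past, past perfect, present perfect and future perfect the
    # verb agrees with the OBJECT; otherwise it agrees with the AGENT.
    perfective = {"past", "past_perfect", "present_perfect", "future_perfect"}
    return obj if tense in perfective else agent

agent = {"gender": "m", "number": "sg"}   # e.g. ladkA
obj   = {"gender": "m", "number": "sg"}   # e.g. santrA
print(agreement_source("present", agent, obj) is agent)  # True
print(agreement_source("past", agent, obj) is obj)       # True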
5.4 Results
Example 1:
English Sentence: We write books.
PLIL form:
<aff {sub_np ( we noun dont_care plural first [human] [asiM : m 8] [] [] ) } {obj1_np
(books noun neuter plural third [thing] [kiwAba : f 9] [] [])} k1 {main_vp_active ( write
verb_1 normal normal dont_care plural first [lika] 11 [] [] ) } > . sviram
Generated Punjabi Sentence: asiM kiwAbAm likade haM.
Example 2:
English Sentence: He is writing a letter.
PLIL form:
<aff {sub_np ( he noun masculine singular third [human] [Oaha:m 8] [] [] ) } {obj1_np (a
det [ika] [] []) (letter noun neuter singular third [topic] [pawara : m 6] [] [])} k1
{main_vp_active ( write verb_5 normal is masculine singular third [lika] 11 [] [] ) } > .
sviram
Generated Punjabi Sentence: Oaha ika pawara lika reha hE.
Example 3:
English Sentence: They speak Greek.
PLIL form: <aff {sub_np ( they noun dont_care plural third [human] [Oaha : m 8] [] [] ) }
{obj1_np (greek sadjnoun dont_care plural third [human] [grIka bolI : f 3] [] [])} k1
{main_vp_active ( speak verb_1 normal normal dont_care plural third [bola] 11 [] [] ) } >
. sviram
Generated Punjabi Sentence: Oaha grIka bolI bolade ne.
Example 4:
English Sentence: He was reading the book.
PLIL form:
<aff {sub_np ( he noun masculine singular third [human] [Oaha:m 8] [] [] ) } {obj1_np (the
det [] [] []) (book noun neuter singular third [thing] [kiwAba : f 9] [] [])} k1
{main_vp_active ( read verb_5 normal was masculine singular third [paDa] 11 [] [] ) } > .
sviram
Generated Punjabi Sentence: Oaha kiwaAba paDa reha si.
Chapter 6
Conclusion and Future Scope
6.1 Conclusion
MT is relatively new in India, being about a decade old. In comparison with MT efforts in Europe and Japan, which are at least three decades old, it would seem that Indian MT has a long way to go. However, this can also be an advantage, because Indian researchers can learn from the experience of their global counterparts.
The system uses the interlingua approach for transforming an English language sentence into the corresponding Punjabi language sentence. The system is capable of translating simple English sentences given in the interlingua form. Though the module that has been implemented performs translations from English to Punjabi, the underlying principles are general enough to be used for translation from English to any Indian language.
The implemented system is helpful, but not perfect. There are linguistic problems that cannot be handled by the system. In future, the system can be upgraded to solve these linguistic problems.
As we realize that perfect automatic translation cannot be expected with the current technology, for the time being we have to promote a systematization of machine translation and consider post-editing as a part of the system, while continuing efforts to improve the accuracy of translation.
6.2 Future Scope
The designed system is just an example, a prototype of an MT system. This system can be further expanded to incorporate more features. Sentences which are linked together (e.g. ‘‘I met Ram. He was going to market.’’) cannot be handled by the system as of now, so it can be extended to support this feature. The system handles only affirmative sentences; it can be expanded to handle more complex sentences and a greater variety of sentence types. Clauses are not handled here. There are some words which can be used as a noun, adjective or verb depending on their use in a sentence. This type of ambiguity can be resolved by making modifications to the present system.
REFERENCES
[1] Nagao, M., ‘‘A Framework of a Mechanical Translation between Japanese and English by Analogy Principle’’, in Artificial and Human Intelligence, Elithorn, A. and Banerji, R. (eds.), Elsevier Science Publishers, B.V., 1984.
[2] Jackson, Philip C., ‘‘Introduction to Artificial Intelligence’’, 2nd ed., New York: Dover Publications, 1985.
[3] Nagao, Makoto, ‘‘Machine Translation: How Far Can It Go?’’, Oxford University Press, 1989.
[4] Sinha, R. M. K., ‘‘A Sanskrit based Word-expert model for machine translation among Indian languages’’, in Proc. Workshop on Computer Processing of Asian Languages, AIT, Bangkok, Thailand, Sept. 26-28, pp. 82-91, 1989.
[5] Raman, S. and Alwar, N., ‘‘An AI-Based Approach to Machine Translation in Indian Languages’’, Communications of the ACM, Vol. 33, No. 5, May 1990.
[6] Hutchins, W. J. and Somers, H. L., ‘‘An Introduction to Machine Translation’’, Academic Press, London, 1992.
[7] Sinha, R. M. K. and Sanyal, C., ‘‘Correcting ill-formed Hindi sentences in machine translated output’’, in Proc. Natural Language Processing Pacific Rim Symposium NLPRS'93, Fukuoka, Japan, pp. 109-119, 1993.
[8] Arnold, D., Balkan, L., Humphreys, R. L., Meijer, S. and Sadler, L., ‘‘Machine Translation: An Introductory Guide’’, Blackwells/NCC, London, 1994. http://www.essex.ac.uk/linguistics/clmt/MTbook/HTML/book.html
[9] Lonsdale, Deryle W., Franz, Alexander M. and Leavitt, John R. R., ‘‘Large-Scale Machine Translation: An Interlingua Approach’’, Center for Machine Translation, Carnegie Mellon University, Pittsburgh, PA, USA, 1994.
[10] Sinha, R. M. K., Srivastava, R. and Agrawal, A., ‘‘Designing Hindi Text Generator for Machine Translation’’, in Proc. Symposium on Natural Language Processing SNLP'95, Bangkok, Thailand, pp. 286-296, 1995.
[11] Jain, Renu, Sinha, R. M. K. and Jain, A., ‘‘Role of Examples in Machine Translation’’, in Proc. IEEE International Conference on Systems, Man and Cybernetics, Vancouver, Canada, pp. 1615-1620, 1995.
[12] Finlay, Janet and Dix, Alan, ‘‘An Introduction to Artificial Intelligence’’, London: UCL Press, 1996.
[13] Jain, R., Sinha, R. M. K. and Jain, A., ‘‘Translation between English and Indian Languages’’, Journal of Computer Science and Informatics, pp. 19-25, 1997.
[14] Turcato, D., McFetridge, P., Popowich, F. and Toole, J., ‘‘A unified example-based and lexicalist approach to machine translation’’, in Proc. 8th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI '99), Chester, 1999.
[15] Sinha, R. M. K., ‘‘Hybridizing Rule-Based and Example-Based Approaches in Machine Aided Translation System’’, in Proc. International Conference on Artificial Intelligence IC-AI'2000, June 26-29, Las Vegas, USA, 2000.
[16] Manning, C. and Schutze, H., ‘‘Foundations of Statistical Natural Language Processing’’, Cambridge: The MIT Press, 2000.
[17] Sinha, R. M. K., Jain, Renu and Jain, Ajai, ‘‘Translation from English to Indian Languages: ANGLABHARTI Approach’’, in Proc. Symposium on Translation Support Systems STRANS2001, February 15-17, Kanpur, India, 2001.
[18] Jain, Renu, Sinha, R. M. K. and Jain, Ajai, ‘‘ANUBHARTI: Using Hybrid Example-Based Approach for Machine Translation’’, in Proc. Symposium on Translation Support Systems STRANS2001, February 15-17, Kanpur, India, 2001.
[19] Generation5, ‘‘An Introduction to Natural Language Theory’’, 24 April 2001. http://www.generation5.org/nlp.shtml
[20] Sinha, R. M. K., ‘‘Dealing with Unknown Lexicons in Machine Translation from English to Hindi’’, in Proc. IASTED International Conference on Artificial Intelligence and Soft Computing, May 21-24, Cancun, Mexico, pp. 333-336, 2001.
Vartika Bhandari, R.M.K. Sinha and Ajai J