A COMPUTATIONAL GRAMMAR OF SINHALA FOR
ENGLISH-SINHALA MACHINE TRANSLATION
B. Hettige
(08/8021)
Degree of Master of Philosophy
Department of Information Technology
University of Moratuwa
Sri Lanka
December 2010
A COMPUTATIONAL GRAMMAR OF SINHALA FOR
ENGLISH-SINHALA MACHINE TRANSLATION
Budditha Hettige
(08/8021)
Thesis submitted in partial fulfillment of the requirements for the degree
Master of Philosophy
Department of Information Technology
University of Moratuwa
Sri Lanka
December 2010
Declaration of the Candidate and the Supervisor
I declare that this is my own work and this thesis does not incorporate any material
previously submitted for a Degree or Diploma in any other University or institute of
higher learning, without acknowledgement. It does not contain any material
previously published or written by another person except where the
acknowledgement is made in the text to the best of my knowledge and belief.
Also, I hereby grant to University of Moratuwa the non-exclusive right to reproduce
and distribute my thesis, in whole or in part in print, electronic or other medium. I
retain the right to use this content in whole or part in future works (such as articles or
books).
Signed
………………………………
…………………..
Budditha Hettige
Date
Candidate
The above candidate has carried out research for the M. Phil. dissertation under my
supervision.
……………………………………………..
…………………..
Prof. Asoka S. Karunananda
Date
…………………………………………….
……………..
Dr.
Abstract
Communication is fundamental to the evolution and development of all living
beings. Languages are arguably the most remarkable artifacts ever
developed by mankind to enable communication. The computer has also become a unique
machine because of its capacity to communicate with humans through languages. The
languages understood by computers and humans are quite different, yet
people can communicate with computers. This is possible because the computer is
fundamentally an artifact that can translate one language into another. Therefore, computers
should be able to do language translation better than any other computing task. Nowadays,
computing is evolving to enable machine-to-machine communication with little or no human
intervention, yet humans continue to face what is called the language barrier to
communication. In particular, the vast collection of world knowledge written in English has
been inaccessible to communities who cannot communicate in English. Such communities
are also unable to contribute to the development of world knowledge because of the language barrier.
As a result, many people have embarked on research in computer-aided natural language
translation. This area is commonly known as Machine Translation. Among others, Apertium,
Babel Fish, Google Translate, SYSTRAN, EDR, Anusaaraka, AnglaHindi, AnglaBharti
and Mantra are examples of popular machine translation systems. These systems use
various approaches, including Human-assisted, Rule-based, Corpus-based, Knowledge-based,
Hybrid and Agent-based, to translate from one language to another. However, due to the
inherent diversity of natural languages, a generic machine translation approach is far
from reality.
This thesis presents a computational grammar for the Sinhala language, used to develop an
English to Sinhala machine translation system with an underlying theoretical basis. This system is
known as BEES, an acronym for Bilingual Expert for English to Sinhala machine translation.
The concept of Varanegeema (conjugation) in the Sinhala language has been taken as the
philosophical basis of this approach to the development of BEES. Varanegeema
can handle a large number of language primitives associated with
nouns and verbs, such as person,
gender, tense, number, preposition and subjectivity/objectivity. More importantly,
Varanegeema allows all associated word forms to be derived from a given base word. This
drastically reduces the size of the Sinhala dictionary. Since the concept of
Varanegeema can be expressed as a set of rules, it fits well with a rule-based
implementation of machine translation systems. BEES implements 85 grammar rules for
Sinhala nouns and 18 rules for Sinhala verbs. BEES comprises seven modules,
namely the English Morphological Analyzer, English Parser, English to Sinhala Base Word
Translator, Sinhala Morphological Generator, Sinhala Parser, Transliteration module and
Intermediate Editor. In addition to the main modules, the system comprises four dictionaries,
namely the English dictionary, the Sinhala dictionary, the English-Sinhala Bilingual dictionary and the
Concept dictionary. BEES primarily shares features with the Rule-based, Context-based
and Human-assisted approaches to machine translation. BEES has been implemented
using Java and SWI-Prolog to run on both Linux and Windows environments.
The English to Sinhala machine translation system, BEES, has been evaluated to test the
hypothesis that the concepts of Varanegeema can be used to drive English to Sinhala machine
translation. The evaluation was carried out in
three steps. As the first step, all the language processing modules, such as the morphological
analyzers, parsers, translator and the transliteration module, were tested through a
white box testing approach. In order to test each module, several online testing tools,
including an English morphological analyzer, an English parser and a Sinhala word generator, were
implemented. Using these online tools, each module was tested thoroughly
against a carefully created test plan. In addition, an online evaluation test bed was
implemented to continuously capture feedback from online users. This test
bed lets users make different types of sentences from a given set of words. The Word
Error Rate and the Sentence Error Rate were calculated from these evaluation results.
Finally, intelligibility and accuracy tests were conducted with human
support.
In order to evaluate the intelligibility and the accuracy of the English to Sinhala machine
translation system, the following steps were followed. Two hundred sample sentences were
collected and grouped into 20 sets of 10 sentences each. Each sentence was then
translated using the English to Sinhala machine translation system. Each set was given to
human translators and scored. The intelligibility and the accuracy were calculated
from these scores. The experimental results show that the English
morphological analyzer, English parser, English to Sinhala base word translator, Sinhala
morphological generator and Sinhala sentence generator work with more
than 90% accuracy. The overall result of the evaluation shows 89% accuracy, with a word error
rate of 7.2% and a sentence error rate of 5.4%.
BEES successfully translates English sentences with simple or complex subjects and
objects. The system handles the most commonly used tense patterns,
including active and passive voice forms.
Acknowledgements
This thesis is the result of four years of devoted work, during which I have been
accompanied and supported by many people. It is a pleasure to now have
the opportunity to express my gratitude to all of them.
I am grateful to the University of Moratuwa, especially to the Faculty of
Information Technology, for providing me the opportunity to carry out this research study.
The first person I would like to thank is my supervisor, Prof. Asoka Karunananda,
for whom a few lines are too short to give a complete account of my deep
appreciation. This study would not have been such a success without his
commonsense knowledge and perceptiveness. I owe him much gratitude for
showing me this way of research. Apart from being an excellent supervisor,
Prof. Karunananda has been an understanding teacher, and he has supported me
in every aspect of this research.
I am also grateful to Dr. Sarath Bannayake, Head, Department of Statistics
and Computer Science, University of Sri Jayawardenepura, for the assistance he
gave me during the research work.
With great pleasure and a deep sense of gratitude, I acknowledge Mr. P. Dias,
former Head and Senior Lecturer, Department of Statistics and Computer Science,
University of Sri Jayawardenepura, for his great help in devising a method
of evaluation.
I would also like to thank Mr. Niranjan Bandara, Lecturer, Department of Sinhala
and Mass Communication, University of Sri Jayawardenepura, for his valuable
support in correcting some Sinhala language issues.
I would like to express my deep gratitude to Venerable
Kirioruwe Dhamananada thera, Venerable Kukulpane Sudassi thera and Venerable
Mattumagala Chandanada thera for their valuable support in solving
Sinhala and English language problems by sharing their knowledge of Sinhala, Pali
and Sanskrit language structures.
I am deeply indebted to Mr. Duminda de Silva, Head, Department of Mathematics
and Computer Science, The Open University of Sri Lanka, for the encouragement
extended to me throughout this study.
I wish to extend my sincere gratitude to Ms. G. S. Makalanda, Dr. T.G.I. Fernando
and Dr. E. A. T. A. Edirisooriya for their great support and encouragement
throughout this study.
My deepest gratitude goes to my mother and my wife for their unconditional support;
without it, this work would have been impossible. Again, I must give
a big thank-you to my wife Lakshimi for tolerating my busy schedule during the research
work. Last but not least, I thank all who supported me in making this work a success.
January 3, 2011
Budditha Hettige
Table of Contents
Declaration of the Candidate and the Supervisor
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Preamble
1.2 English to Sinhala Machine Translation
1.3 What are Machine Translation Systems?
1.4 Aim of the Research
1.5 Objectives of the Research
1.6 Scope of the Project
1.7 Hypothesis
1.8 Structure of the Thesis
1.9 Summary
Chapter 2 State of the Art of Machine Translations
2.1 Introduction
2.2 Fundamentals of the Natural Language Processing
2.3 Machine Translation Systems
2.4 Current Approaches to Machine Translation
2.4.1 Human-assisted Machine Translation
2.4.2 Rule-based Machine Translation
2.4.2.1 Transfer-based Machine Translation
2.4.2.2 Interlingua Machine Translation
2.4.2.3 Dictionary based Machine Translation
2.4.3 Statistical Machine Translation
2.4.4 Example-based Machine Translation
2.4.5 Knowledge-based Machine Translation
2.4.6 Hybrid Machine Translation
2.4.7 Agent-based Machine Translation
2.5 Existing English to Sinhala Machine Translation Systems
2.6 Concepts and Techniques for Machine Translation
2.6.1 Morphological Analysis
2.6.2 Syntax Analysis
2.7 Problem Definition
2.8 Summary
Chapter 3 Overview of the English and Sinhala Languages
3.1 Introduction
3.2 The English Language
3.3 The English Language Morphology
3.3.1 English Noun Morphology
3.3.2 English Verb Morphology
3.3.3 English Adjective Morphology
3.4 Syntax of the English Language
3.4.1 The English Sentence Subject
3.4.2 The English Predicate
3.4.3 Verb Tense
3.4.4 The Complement
3.5 Semantics of English Language
3.5.1 Word Level Semantics
3.5.2 Sentence Level Semantics
3.5.3 Paragraph Level Semantics
3.6 The Sinhala Language
3.6.1 Sinhala Alphabet
3.7 Sinhala Language Morphology
3.7.1 Sinhala Noun Morphology
3.7.2 Sinhala Verb Morphology
3.8 Syntax of the Sinhala Language
3.9 Semantics of the Sinhala Language
3.10 Comparison Between English and Sinhala
3.10.1 Fundamental Differences
3.10.2 Morphological Differences
3.10.3 Syntax in the two Languages
3.11 Language Issues
3.11.1 Grammatical Issues
3.11.2 Text Manipulation Issues
3.12 Challenges in English to Sinhala Machine Translation
3.12.1 Word and Sentence Segmentation
3.12.2 Lexical Selection
3.12.3 Conjugation
3.12.4 Tense Detection
3.12.5 Article Insertion
3.12.6 Sentence boundaries
3.12.7 Word Order
3.13 Summary
Chapter 4 Novel Approach to Machine Translation
4.1 Introduction
4.2 A Theoretical-based Approach to Machine Translation
4.3 Computational Model of Grammar for Sinhala
4.3.1 Computational Model for Sinhala Morphology
4.3.2 Context-Free Grammar for Sinhala language
4.4 Hypothesis
4.5 Approach in a Nutshell
4.6 Features of BEES
4.7 Input for BEES
4.8 Output of BEES
4.9 Process of BEES
4.10 Summary
Chapter 5 Design of BEES
5.1 Introduction
5.2 Design of BEES
5.2.1 English Morphological Analyzer
5.2.2 English Parser
5.2.3 English to Sinhala Base Word Translator
5.2.4 Sinhala Morphological Generator
5.2.5 Sinhala Parser
5.2.6 Transliteration module
5.2.7 Intermediate Editor
5.2.8 Lexical Resources
5.3 Supporting modules
5.3.1 Dictionary Updater
5.3.2 Sinhala Word Generator
5.3.3 Online Search module
5.4 Summary
Chapter 6 Implementation
6.1 Introduction
6.2 Development Stages
6.3 Implementation of the BEES
6.3.1 English Morphological Analyzer
6.3.2 English Parser
6.3.3 English to Sinhala Bilingual Translator
6.3.4 Sinhala Morphological Generator
6.3.5 Sinhala Sentence Composer
6.3.6 Transliteration Module
6.3.7 Intermediate Editor
6.3.8 Lexical Resources
6.3.8.1 English Dictionary
6.3.8.2 Sinhala dictionary
6.3.8.3 English-Sinhala Bilingual dictionary
6.3.8.4 Concept Dictionary
6.4 Supporting modules
6.4.1 Online Updater
6.4.2 Sinhala Word Generator
6.4.3 Online Search module
6.5 Summary
Chapter 7 BEES in Action
7.1 Introduction
7.2 BEES as an Online Translator
7.3 BEES as a Web Page Translator
7.4 BEES as a Selected Sentence Translator
7.5 BEES as a Desktop Application
7.6 Summary
Chapter 8 Evaluation
8.1 Introduction
8.2 Evaluation of MT systems
8.3 BEES Evaluation
8.4 Stage 1: Module Testing
8.4.1 English Morphological Analyzer
8.4.2 English Parser
8.4.3 English to Sinhala Base Word Translator
8.4.4 Sinhala Morphological Generator
8.4.5 Sinhala Sentence Composer
8.4.6 Transliteration Module
8.5 Stage 2: Performance Testing
8.6 Stage 3: Accuracy Testing
8.7 Result of the Experiments
8.8 Summary
Chapter 9 Conclusion and Further Work
9.1 Introduction
9.2 Revisited Objectives
9.3 Limitations
9.4 Further Works
9.5 Summary
References
Appendix A: English Morphological analyzer - Test plan
Appendix B: Conjugation Table for Sinhala Language
Appendix C: Context-Free Grammar for Sinhala Language
Appendix D: Finite State Transducer for Sinhala Transliteration
Appendix E: Sample Evaluation form
Appendix F: Sample of evaluator’s Comments
List of Figures
Figure 2.1: Architecture for a rule-based machine translation system
Figure 4.1: Finite State Automata for Kaputu Ganaya
Figure 4.2: Parser tree for the sample sentence
Figure 5.1: Design of the BEES
Figure 5.2: FST for Vowels in model 1 transliteration
Figure 5.3: Design of the three supporting modules
Figure 6.1: The Intermediate Editor
Figure 7.1: Web based architecture for the BEES
Figure 7.2: User interface of the Online BEES
Figure 7.3: A web page translator
Figure 7.4: BEES as a web page translator
Figure 7.5: Selected sentence translator
Figure 7.6: Desktop screen for selected sentence translation
Figure 7.7: User interface of the BEES
Figure 8.1: English Morphological analyzer with test results
Figure 8.2: Sinhala word conjugator
Figure 8.3: User interface of the evaluation test bed
Figure 8.4: Online evaluation form
Figure 8.5: Translation accuracy
List of Tables
Table 2.1: Existing Machine translation systems
Table 3.1: Regular and irregular forms of the English Noun
Table 3.2: English Noun Morphological rules
Table 3.3: English verb Morphology
Table 3.4: Morphological rules for English Verbs
Table 3.5: Tense patterns (Active voice)
Table 3.6: The Sinhala Alphabet
Table 3.7: Vocalic Strokes and their position
Table 3.8: The consonant ‘l’ with vocalic strokes
Table 3.9: Sample case markers in Sinhala
Table 3.10: Conjugation table for ‘we;a’ ganaya
Table 3.11: Inflection form of the Sinhala verbs (Active)
Table 3.12: Inflection form of the Sinhala verbs (Passive)
Table 4.1: Paradigm table for Kaputu Ganaya
Table 6.1: Grammatical notations for the English Dictionary
Table 8.1: Sample test plan for English Morphological analyzer
Table 8.2: Sample test plan for English parser
Table 8.3: Sample Sinhala Morphological rules
Table 8.4: Results for module testing
Table 8.5: Human evaluation results
Table 8.6: Accuracy results
Table 8.7: Final evaluation results
Chapter 1
INTRODUCTION
1.1 Preamble
A natural language is one of the most marvelous artifacts ever invented by mankind. It is a
cornerstone of all kinds of communication. Each natural language plays the role of
describing the thoughts of humans in a particular environment. As such, a natural
language has a strong bearing on the culture and the environment within which a
certain community lives. This is why we find a large number of different
natural languages worldwide. Despite the differences in languages, people still want
to communicate with those who use different languages. Differences in languages
have thus become a barrier to cross-cultural communication. In particular, many nations
have not been able to access the huge reservoir of world knowledge written in English
unless they have a sound knowledge of English. On the other hand, people
who do not know English are not able to contribute to world knowledge. The
importance of the mother tongue for the discovery and creation of new
systems of knowledge is indisputable. Together, these factors have resulted in what is called the language
barrier to communication. In fact, this issue exists not only between English and other
languages, but between any two languages.
Of course, people have long practiced a solution to this issue:
translation between two languages by someone who knows both. However, can we
really expect everyone to know every language? Undoubtedly, this is impractical.
The emergence of digital computer technology in the early 1950s gave rise to the
concept of machine translation, that is, seeking assistance from computers to meet the
long-felt language needs of humans. Since then, hundreds of research projects have
been conducted on translation between natural languages. Machine translation is
a branch of Natural Language Processing, which comes under the broad area of
Artificial Intelligence. It is commonly said that machine translation has been one of
the least successful areas of Artificial Intelligence over the last sixty years. As such, a
generic approach to machine translation has remained an unrealized dream of researchers,
and machine translation approaches have become highly language specific.
1.2 English to Sinhala Machine Translation
This thesis presents research conducted to develop an English to Sinhala machine
translation system. Sinhala is a member of the Indo-Aryan language family and is the
spoken language of 74% of the people in Sri Lanka. Sinhala is also one of the
constitutionally recognized official languages of Sri Lanka [53]. A number of
statistical results show that more than 80% of the Sinhala-speaking community does not
have the ability to read and write in English [46][126]. While encouraging the
learning of English, one cannot devalue the importance of the mother tongue for the
discovery of knowledge for the betterment of mankind.
In the Asian region, many countries, including India, Thailand, Malaysia and Japan,
have conducted a considerable amount of research in machine translation. Although Sri
Lanka has been working on various machine translation projects, it is still a little behind
similar research conducted elsewhere in the Asian region. Weerasinghe
[154] pioneered machine translation research in Sri Lanka. This project will
thus contribute to extending machine translation initiatives in Sri Lanka. The project presents
a theoretical-based translation approach, which would also be beneficial to machine
translation projects that handle languages close to Sinhala.
Before presenting the aim and objectives of the project, a brief introduction to the field
of machine translation is given in Section 1.3.
1.3 What are Machine Translation Systems?
A machine translation system is computer software that translates text or
speech from one natural language into another, with or without human assistance [73]
[154]. According to their design, machine translation systems can be broadly
categorized into two groups, namely direct translation systems and indirect
translation systems. A direct translation system translates the source language into the
target language by using word-to-word or phrase-to-phrase mapping. In contrast,
indirect translation systems use an interlingua or some kind of transfer method. This
approach starts with an analysis of the source text and performs a synthesis to generate the
corresponding text in the target language. Figure 1.1 gives the classic pyramid showing the
relationship between these two approaches to machine translation.
Figure 1.1: Relationship between direct and indirect translations
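The word-to-word mapping used by direct translation systems can be illustrated with a minimal sketch. The lexicon below is a hypothetical toy with romanized Sinhala glosses chosen only for illustration; it is not taken from the BEES dictionaries, and the function name direct_translate is an assumption.

```python
# Toy bilingual lexicon (hypothetical entries, romanized Sinhala glosses).
LEXICON = {"i": "mama", "eat": "kanawa", "rice": "bath"}


def direct_translate(sentence: str) -> str:
    """Translate by word-to-word lookup, keeping the source word order.

    Words missing from the lexicon are passed through unchanged.
    """
    words = sentence.lower().split()
    return " ".join(LEXICON.get(word, word) for word in words)


if __name__ == "__main__":
    print(direct_translate("I eat rice"))
```

Because the sketch keeps the English subject-verb-object order, it cannot produce the verb-final order that Sinhala normally uses. This is one reason indirect approaches analyze the source sentence before generating the target text.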
Under the above two broad areas, several approaches have been used to develop
hundreds of machine translation systems all over the world. Among them,
the Human-assisted, Rule-based, Statistical, Example-based, Knowledge-based,
Hybrid and Agent-based approaches are commonly cited as the most successful
approaches to machine translation.
Comparing the existing machine translation systems and their approaches, many of
these systems use a sequential architecture for Natural Language Processing and
machine translation [59]. This sequence comprises steps such as preprocessing,
morphological analysis, syntax analysis, semantic analysis, pragmatic analysis and
post processing.
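A sequential architecture of this kind can be sketched as a chain of stages, each receiving the state produced by the previous one. The sketch below is only a structural illustration under stated assumptions: the names make_stage and run_pipeline are hypothetical, and each stage is a placeholder that records its name instead of performing real linguistic analysis.

```python
# Stage names follow the sequence listed in the text.
STAGE_NAMES = [
    "preprocessing",
    "morphological analysis",
    "syntax analysis",
    "semantic analysis",
    "pragmatic analysis",
    "post processing",
]


def make_stage(name):
    """Build a placeholder stage that only records that it ran."""
    def stage(state):
        # A real stage would add its analysis results to the state here.
        return {**state, "trace": state["trace"] + [name]}
    return stage


PIPELINE = [make_stage(name) for name in STAGE_NAMES]


def run_pipeline(text):
    """Pass the input through every stage in order and return the final state."""
    state = {"text": text, "trace": []}
    for stage in PIPELINE:
        state = stage(state)
    return state


if __name__ == "__main__":
    print(run_pipeline("I eat rice")["trace"])
```

Keeping every stage behind the same state-in, state-out interface means a single module can be replaced or tested in isolation without touching the others.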
Although many attempts have been made to develop machine translation systems,
the area has so far achieved very little. In fact, because of the long-felt need for machine
translation, some people have rushed to develop such systems without a proper
conceptual or theoretical basis for their approaches. This has resulted in
many machine translation systems that go through ad hoc processes to translate
between languages, which in turn constrains development in the field of
machine translation.
1.4 Aim of the Research
This thesis proposes to design and develop an English to Sinhala machine translation
system with a theoretical basis.
1.5 Objectives of the Research
In order to reach the above aim, the following key objectives have been identified.
These objectives range from critical review of existing approaches to machine
translation to evaluation of the proposed theoretical-based approach to machine
translation.
Objective 1
Critically review the existing systems, concepts and tools for machine
translation.
Objective 2
Develop a computational grammar for the Sinhala language.
Objective 3
Design and develop an English to Sinhala machine translation system.
Objective 4
Evaluate the system.
1.6 Scope of the Project
The scope of the project is limited to developing a computational grammar for the Sinhala
language, as per the concept of Varanegeema, that handles the most commonly used 27 noun
forms and 36 verb forms.
1.7 Hypothesis
In order to achieve the above aim and objectives, the hypothesis employed in the
thesis can be stated as follows: the concepts of “Varanegeema” (conjugation) in the Sinhala
language can be used to drive English to Sinhala machine translation.
1.8 Structure of the Thesis
The thesis is structured in nine chapters. The following is the structure of
the thesis with a brief explanation of the contents of each chapter.
Chapter 1 provides an overall introduction to the whole research project. It
briefly explains the research problem addressed in the thesis, gives an overview of machine
translation, and states the aim, objectives and hypothesis of the thesis.
Chapter 2 reports on the literature survey of Machine Translation, with a detailed
description leading to the problem addressed in the thesis. This chapter also
provides a detailed study of the state of the art in Natural Language Processing by
describing the different approaches adopted.
Chapter 3 gives an overview of the English and Sinhala languages in terms of the
morphology, syntax and semantics of both languages. This chapter also
gives a comparison between the English and Sinhala languages, showing issues
related to machine translation.
Chapter 4 discusses the novel approach taken to develop the English to Sinhala machine
translation system. It first presents the hypothesis of the project. The
chapter then explains the mechanism of the translation process, the nature of the input and
output, and the key features of the system.
Chapter 5 describes the design of the proposed English to Sinhala machine translation
system. Each module of the design is explained separately by
describing its functionality and its relation to the other modules.
Chapter 6 presents the implementation of the English to Sinhala machine translation
system. This chapter gives implementation details of the Prolog-based modules, the Java-based
user interface, the Intermediate Editor and the ontology of the lexical databases.
Chapter 7 presents how BEES works in practice when translating a given English
text. This chapter also explains applications of BEES as a standalone translator, an
on-demand translator, a web page translator and a selected text translator.
Chapter 8 reports the evaluation of the English to Sinhala machine translation system. The
evaluation methodology, evaluation steps, participants and results are
given in this chapter.
Chapter 9 concludes the thesis by referring to the achievement of each objective. The
chapter also presents the limitations of the research and further work.
1.9 Summary
This chapter provided an overview of the entire project by describing the problem
to be addressed and the aim, objectives and hypothesis of the thesis. It briefly
introduced the proposed English to Sinhala machine translation system. The structure of the rest
of the thesis was also presented.
The next chapter reports a critical review of the existing approaches to machine
translation, together with major machine translation systems based on these
approaches.
Chapter 2
STATE OF THE ART OF MACHINE TRANSLATIONS
2.1 Introduction
The previous chapter presented an overview of the thesis. This chapter gives the state
of the art of Natural Language Processing with special attention to Machine
Translation. Some related fundamental aspects of Machine Translation are
also discussed in this chapter.
2.2 Fundamentals of the Natural Language Processing
Natural Language Processing (NLP) is a field of computer science and
linguistics concerned with the interactions between computers and human (natural)
languages [107]. It is also a subfield of Artificial Intelligence (AI) within
Computer Science [128]. According to many electronic resources, the history of
Natural Language Processing began with Turing's article “Computing
Machinery and Intelligence” [151], which proposed what is now known as the Turing test as a
criterion of intelligence. Later, in 1957, Noam Chomsky, regarded in the academic and scientific
community as one of the fathers of modern linguistics, introduced Syntactic
Structures for grammar [31]. It is recognized as one of the most important texts in the field of
linguistics. It subsequently became a fundamental theory for Natural Language
Processing, and many Machine Translation systems use this syntactic
structure [31][33].
NLP is used for several tasks, including machine translation,
automatic summarization, information retrieval, optical character recognition, speech
recognition and text-to-speech [107][128][147].
Depending on the task, Natural Language Processing systems must address several issues,
such as natural language understanding, natural language generation, speech and
text segmentation, part-of-speech tagging and word sense disambiguation [84].
2.3 Machine Translation Systems
A Machine Translation system is computer software that translates text or speech from
one natural language to another [161][162]. Machine translation is a sub-area of
Natural Language Processing that was identified during the early days of Artificial
Intelligence (AI). Because of various issues associated with the complexity of languages,
Machine Translation has, for more than sixty years, been identified as one of the least
successful areas in computing [74]. These issues range from the morphology to the
semantics of the source and target languages.
The history of Machine Translation dates back to the late 1940s. A look-up dictionary
at Birkbeck College in London, dating from 1948, has been cited as an early work in machine
translation. From 1950 to 1960, many researchers attempted to develop
Machine Translation systems using a trial-and-error approach [75], especially for
Russian to English translation. In 1950, the first machine translation system was developed
to translate Russian sentences into English.
In 1958, the first practical machine translation system was implemented by the IBM
Corporation for the US Air Force under the direction of Gilbert King [76]. This system
translated Russian text into English and worked successfully until 1970. In the
meantime, the RAND Corporation circulated current linguistic theory and emphasized
statistical analysis. They prepared bilingual glossaries with grammatical
information, and grammar rules with the first parser based on dependency
grammar.
In 1970, SYSTRAN [144] implemented a new Russian-English machine
translation system, which replaced the previous system of the US Air
Force. This system translated more than 100,000 pages per year. In the meantime,
many researchers were attempting to develop machine translation systems. Among
others, a syntactic transfer system for English-French was one of the strong research efforts in
the field. Further, the principal experimental effort focused on Interlingua
approaches, with more attention paid to syntactic aspects [75].
In 1980, many computer companies attempted to develop computer-aided
translation systems, especially for Japanese-English. These were low-level direct
translation systems confined to morphological and syntactic analysis. After
1980, machine translation research developed in many directions. The corpus-based
machine translation approach has been the most popular approach since then.
However, due to the complexity of natural languages, the development of
machine translation systems remains a research challenge. In addition, many
researchers have noted that operational syntax, idioms and universal syntactic
categories are some of the completely unsolved linguistic problems in machine
translation [171].
2.4 Current Approaches to Machine Translation
Considering the translation approaches, machine translation systems can be
classified into seven categories, namely Human-assisted, Rule-based, Statistical,
Example-based, Knowledge-based, Hybrid and Agent-based. The Statistical, Example-based,
Knowledge-based and Hybrid approaches use corpora for machine
translation and are therefore referred to as corpus-based approaches. All of
these machine translation approaches have their own strengths and weaknesses, and
the success rate of a translation depends on the approach. Each
approach is discussed below.
2.4.1 Human-assisted Machine Translation
The human-assisted approach is widely used in machine translation, particularly in
Indian machine translation systems. It relies on human interaction at the pre-editing,
post-editing and/or intermediate editing stages [85], and uses human support for
semantic handling during translation. A number of machine translation systems have
been developed using this approach.
In the Indian region, a number of machine translation systems have used this
approach, including Anusaaraka, ManTra, MaTra and AnglaBharti [133][38][146].
Anusaaraka [4][7] is a popular human-assisted translation system for Indian
languages that makes text in one Indian language accessible in another Indian
language. This system uses the Paninian grammar model [6] for its language analysis.
The Anusaaraka project [16] has been developed to translate Punjabi, Bengali,
Telugu, Kannada and Marathi into Hindi, while the English-Hindi Anusaaraka
translates English text into Hindi. The approach and lexicon are general, but the
system has mainly been applied to children’s stories [95].
MaTra is a human-assisted transfer-based translation system for English to Hindi
[11]. It uses general-purpose lexicons and has been applied mainly in the news
domain. MaTra follows a structural and lexical transfer approach to machine
translation. It aims to produce understandable output with wide coverage,
rather than perfect output for a limited range of sentences.
ManTra [106] is a machine-assisted translation tool that translates English text into
Hindi in several domains. ManTra is based on Tree Adjoining Grammar (TAG).
The ManTra system started with the translation of administrative documents, such
as appointment letters, notifications and circulars issued by the central government,
from English to Hindi.
AnglaBharti [103] is also a human-assisted machine translation system used in
India. Since India has many languages, there is a variety of machine translation
systems. For example, AnglaHindi [133] translates English to Hindi using a
machine-aided translation methodology. The human-aided approach is a common
feature of most Indian machine translation systems. In addition, these systems use
both pre-editing and post-editing as the means of human intervention in the
machine translation process.
Chandrashekhar Research Centre [20] has developed a machine-aided translation
system for Tamil to Hindi. The Tamil to Hindi translator is based on the Anusaaraka
machine translation system: the input text is in Tamil and the output is produced as
Hindi text. Stand-alone, API and web-based online versions have been developed.
A Tamil morphological analyzer and a Tamil-Hindi bilingual dictionary are
by-products of this system [133].
In addition to the above, KSHALT is a human-assisted machine translation
system that translates English into Korean [85]. This translation system
contains four phases, namely the English parser, the English analyzer, English to
Korean transfer and Korean generation.
2.4.2 Rule-based Machine Translation
The rule-based approach is another approach to machine translation. It produces
grammatically correct translations by applying a set of rules. A rule-based machine
translation system typically contains a source language morphological analyzer, a
source language parser, a translator, a target language morphological analyzer, a
target language parser and several lexicon dictionaries. The source language
morphological analyzer analyzes a source language word and provides its
morphological information. The source language parser is a syntax analyzer that
analyzes source language sentences. The translator translates a source language word
into the target language. The target language morphological analyzer works as a
generator: it generates the appropriate target language words for the given
grammatical information. Similarly, the target language parser works as a composer
and composes a suitable target language sentence. This type of machine translation
system needs a minimum of three dictionaries, namely the source language dictionary,
the bilingual dictionary and the target language dictionary. The source language
morphological analyzer uses the source language dictionary for morphological
analysis, the translator uses the bilingual dictionary to translate from the source
language into the target language, and the target language morphological generator
uses the target language dictionary to generate target language words. Figure 2.1
presents the general architecture of a rule-based machine translation system.
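The flow described above can be sketched as a small pipeline. This is only a minimal illustration and not the architecture of any cited system: the three toy dictionaries, the suffix-stripping rule, the placeholder target words (`T_CAT`, `T_FISH`, `T_EAT`) and the subject-object-verb composition rule are all assumptions made for the example.

```python
# Toy rule-based pipeline: analyze -> parse -> transfer -> generate -> compose.
# Every lexicon entry and rule below is a hypothetical placeholder.

SOURCE_DICT = {"cat": "N", "fish": "N", "eat": "V"}                  # source language dictionary
BILINGUAL_DICT = {"cat": "T_CAT", "fish": "T_FISH", "eat": "T_EAT"}  # bilingual dictionary
TARGET_DICT = {("T_EAT", "3sg"): "T_EAT+3sg"}                        # target language dictionary


def analyze(word):
    """Source morphological analyzer: return (root, category, features)."""
    w = word.lower()
    if w in SOURCE_DICT:
        return w, SOURCE_DICT[w], ""
    if w.endswith("s") and w[:-1] in SOURCE_DICT:   # toy rule: strip a final -s
        root = w[:-1]
        feats = "3sg" if SOURCE_DICT[root] == "V" else "pl"
        return root, SOURCE_DICT[root], feats
    raise KeyError(f"unknown word: {word}")


def parse(analyses):
    """Source parser: accept only the toy pattern N V N (subject, verb, object)."""
    cats = [cat for _, cat, _ in analyses]
    if cats != ["N", "V", "N"]:
        raise ValueError(f"unsupported sentence pattern: {cats}")
    subject, verb, obj = analyses
    return {"subject": subject, "verb": verb, "object": obj}


def transfer(tree):
    """Translator: replace each source root using the bilingual dictionary."""
    return {role: (BILINGUAL_DICT[root], cat, feats)
            for role, (root, cat, feats) in tree.items()}


def generate(root, feats):
    """Target morphological generator: inflected form if listed, else the root."""
    return TARGET_DICT.get((root, feats), root)


def compose(tree):
    """Target sentence composer: emit subject-object-verb order (assumed)."""
    return " ".join(generate(tree[role][0], tree[role][2])
                    for role in ("subject", "object", "verb"))


def translate(sentence):
    analyses = [analyze(w) for w in sentence.split()]
    return compose(transfer(parse(analyses)))


print(translate("cat eats fish"))  # T_CAT T_FISH T_EAT+3sg
```

Each stage consults only its own dictionary, which mirrors the three-dictionary minimum stated in the text.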
A number of machine translation systems have been designed using the rule-based
approach. Among others, Apertium [18] is an open-source rule-based machine
translation system that can be used to translate between any two related languages.
The Apertium engine follows a shallow-transfer approach and consists of eight
pipelined modules: a de-formatter, a morphological analyzer, a part-of-speech (PoS)
tagger, a lexical transfer module, a structural transfer module, a morphological
generator, a post-generator and a re-formatter.
[Figure 2.1 shows the following flow: source language text, source language
morphological analyzer (with the source language dictionary), source language parser,
bilingual translator (with the bilingual dictionary), target language morphological
generator (with the target language dictionary), target language sentence generator,
target language text.]
Figure 2.1: Architecture for a rule-based machine translation system
The Toshiba system [145] is another rule-based machine translation system, for
English to Japanese and vice versa. To translate a given source text, the system
performs morphological analysis, syntax analysis, translation word selection and
structural transformation, syntax transformation and morphological generation. It can
translate open-domain written texts using the rule-based approach. The system uses
three dictionaries, namely a common word dictionary, a technical term dictionary and
a user-defined dictionary. The common word dictionary covers both English-Japanese
and Japanese-English translation. The technical term dictionary contains
domain-specific technical terms. The user-defined dictionary stores user-provided
information, such as information about unknown words.
Further, rule-based machine translation approaches can be categorized into three
groups, namely transfer-based, interlingua and dictionary-based. The transfer-based
and interlingua approaches share the same idea: both use an intermediate
representation that captures the "meaning" of the original sentence [10][84][56].
The difference is that an interlingua-based system uses a language-independent
intermediate representation, whereas a transfer-based system uses a
language-dependent one. Most of these machine translation systems include
morphological analysis, lexical categorization, lexical transfer, structural transfer
and morphological generation. A dictionary-based machine translation system
translates using a dictionary, with or without morphological or syntax analysis. This
type of system is ideally suited to translating long lists of phrases. A number of
machine translation systems have been developed under these three broad headings.
2.4.2.1 Transfer-based Machine Translation
Lavie and others [96] have applied the transfer-based approach to a Hindi-to-English
translation system named Xfer, which was trained under an extremely limited data
scenario. The Xfer system uses IIITMorpher, a morphological analyzer [79], to
analyze Hindi words into the root and other features such as gender, number and
tense. It uses 70 transfer rules, including a rather large verb paradigm with 58 verb
sequence rules, ten recursive noun phrase rules and two prepositional phrase rules.
The authors note that this approach is particularly suitable for languages with very
limited data resources.
An Arabic to English machine translation system named Npae-Rbmt has been
developed using the transfer-based approach [120]. Npae-Rbmt uses an intermediate
representation that captures the “meaning” of the original sentence in order to
generate the correct translation. The system has been evaluated on 88 thesis titles
and journals from the computer science domain, with a reported accuracy of 94.6%.
The Apertium platform follows a transfer-based machine translation model [18]. Using
this shallow-transfer approach, a Swedish to Danish machine translation system has
been developed [125]. The system uses two morphological dictionaries for analysis
and generation. It is the first free software translator from Swedish to Danish.
Using an affix-transfer-based approach, a unidirectional Tagalog-to-Cebuano machine
translation system has been developed [170]. Its morphological analysis is based
on TagSA (Tagalog Stemming Algorithm) and is focused on an affix
correspondence-based POS (part-of-speech) tagger.
Opentrad is an open-source transfer-based machine translation system intended for
both related and less similar language pairs [3][48]. Opentrad uses a different
translation method for each language pair: for related languages it uses shallow
transfer, whereas for non-related pairs it uses deep transfer [49]. Opentrad also
uses the open-source machine translation engine Matxin [101] as a translation engine.
OpenLogos is the open-source version of the Logos machine translation system
[122], one of the earliest and longest-running commercial machine translation
products in the world. This system accepts documents in various formats and
produces high-quality translations [136]. OpenLogos translates from English and
German into the major European languages, including Spanish, Italian, French and
Portuguese.
2.4.2.2 Interlingua Machine Translation
The interlingua approach uses a language-independent meaning representation for
translating from the source language to the target language. The interlingua provides
a single meaning representation for all languages, and constructing it has been
regarded as an extremely difficult task in practice [135]. Nevertheless, the
interlingua approach has several advantages; in particular, adding a new language is
easier than with any other method. It also has several disadvantages. The meaning
representation is the critical element of the approach: if it is too simple, meaning
will be lost in translation; if it is too complex, analysis and generation become too
difficult.
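The ease of adding a language can be made concrete by counting components. Assuming every ordered language pair must be covered, a transfer design needs one transfer module per ordered pair, n(n-1) in total, whereas an interlingua design needs one analyzer and one generator per language, 2n in total. The function names below are hypothetical and the count ignores how hard each component is to build.

```python
def transfer_modules(n):
    """Transfer modules needed to cover every ordered pair of n languages."""
    return n * (n - 1)


def interlingua_modules(n):
    """One analyzer into and one generator out of the interlingua per language."""
    return 2 * n


for n in (3, 5, 10):
    print(n, transfer_modules(n), interlingua_modules(n))
```

Under these assumptions, adding one language to a system that already covers n languages requires 2n new transfer modules but only two new interlingua components.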
A number of machine translation systems have been developed using the interlingua
approach. Abdelhadi and others have developed an English to Arabic machine
translation system based on the interlingua approach [1]. They use a system that maps
the intermediate representation to Arabic. This mapping contains three steps, namely
selecting lexical items for each interlingua concept, mapping the semantic roles, and
mapping the semantic features of each interlingua concept to the appropriate
syntactic features in the feature structure.
ICENT is an interlingua-based Chinese-English natural language translation system
[167]. The work describes how Chinese language analysis is realized, covering
syntactic parsing and semantic analysis, and gives the design of the interlingua in
detail.
A Thai to English machine translation system is another successful
interlingua-based system [29]. This system translates Thai sentences into an
interlingua based on a Thai LFG tree, using an LFG grammar and a bottom-up parser.
2.4.2.3 Dictionary based Machine Translation
Dictionary-based machine translation systems are commonly used in cross-language
retrieval systems [77]. The dictionary-based approach uses dictionary lookup to
generate the equivalent target language query for a given source language query.
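The query translation step can be sketched as a term-by-term dictionary lookup. This is a minimal illustration under stated assumptions: the dictionary entries are placeholders (`s_rice`, `s_price` stand in for source language terms), and passing unknown terms through unchanged is a fallback chosen for the example, not a documented feature of the cited systems.

```python
# Hypothetical bilingual dictionary: source term -> candidate target terms.
BILINGUAL = {
    "s_rice": ["rice", "paddy"],
    "s_price": ["price", "cost"],
}


def translate_query(query, dictionary=BILINGUAL):
    """Replace each source term by all of its dictionary translations.

    Terms missing from the dictionary are passed through unchanged
    (an assumption of this sketch, useful for names and numbers).
    """
    target_terms = []
    for term in query.lower().split():
        target_terms.extend(dictionary.get(term, [term]))
    return target_terms


print(translate_query("s_rice s_price 2010"))
# ['rice', 'paddy', 'price', 'cost', '2010']
```

Keeping every candidate translation widens the target query; a real system would usually need some form of disambiguation or weighting on top of this.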
Mandal and others [105] have developed a cross-language retrieval system
for retrieving English documents in response to queries in Bengali and Hindi.
This dictionary-based machine translation system is used to generate the equivalent
English query from Indian language topics.
Thenmozhi and Aravindan have developed a Tamil-English Cross Lingual
Information Retrieval System for Agriculture Society [149]. This system was
developed for the farmers of Tamil Nadu; it helps them to specify their information
needs in Tamil and to retrieve documents in English. It uses a morphological analyzer
to obtain the root terms of the source query. The system retrieves pages with a mean
average precision of 95%.
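For readers unfamiliar with the metric, mean average precision can be computed as follows. This sketch uses the standard textbook definition; the evaluation protocol of the cited system is not described here, and the document identifiers in the example are made up.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k at which a relevant document appears."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical queries with made-up document identifiers.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),   # AP = (1/1 + 2/3) / 2
    (["d4", "d5"], {"d5"}),               # AP = (1/2) / 1
]
print(mean_average_precision(runs))
```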
2.4.3 Statistical Machine Translation
The statistical machine translation approach is by far the most widely studied machine
translation method in the field of natural language processing. This approach
generates translations using statistical methods based on bilingual text corpora [84].
A large number of machine translation systems have been developed using the
statistical approach.
Moses is a statistical machine translation system that allows translation models to be
trained automatically for any language pair [108]. The Moses system has several
features. It offers two types of translation model, namely phrase-based and
tree-based. Moses also supports factored translation models, which enable the
integration of linguistic and other information at the word level.
Babel Fish [168] is a web-based application developed by AltaVista which
translates text or web pages from one language into another. The translation
technology for Babel Fish is provided by SYSTRAN [144], whose technology also
powers the translator at Google and a number of other sites. It can translate among
English, Simplified Chinese, Traditional Chinese, Dutch, French, German, Greek,
Italian, Japanese, Korean, Portuguese, Russian, and Spanish. A number of sites have
sprung up that used the Babel Fish service to translate back and forth between one or
more languages.
Bing Translator [112] is a service provided by Microsoft as part of its Bing
services that allows users to translate texts or entire web pages into different
languages. All translation pairs are powered by Microsoft Translation, developed by
Microsoft Research; it uses Microsoft's own syntax-based statistical machine
translation technology.
Google Translator [51] translates a section of text, or a webpage, into another
language. It does not always deliver accurate translations and does not apply
grammatical rules, since its algorithms are based on statistical analysis rather than
traditional rule-based analysis.
In the Indian region, Udupa and Faruquie have developed an English-Hindi
statistical machine translation system [152]. This machine translation system is
based on IBM Models 1, 2 and 3. The system has been tested on an English-Hindi
parallel corpus consisting of 150,000 sentence pairs.
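To give a feel for the simplest of these models, the sketch below estimates the word translation probabilities t(f | e) of IBM Model 1 with the standard expectation-maximization procedure. It is not the implementation of the cited system: the function name, the three-sentence corpus and the placeholder tokens (`T_the`, `T_house`, ...) are assumptions made for illustration, and Models 2 and 3 are not covered.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t(f | e) with EM. pairs: list of (e_tokens, f_tokens)."""
    f_vocab = {f for _, fs in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of each (f, e) link
        total = defaultdict(float)   # expected count of each e
        # E-step: distribute each f over the e words of its sentence pair
        for es, fs in pairs:
            es = ["NULL"] + es       # lets an f word align to no e word
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalise the expected counts
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


# Hypothetical toy corpus; the T_* tokens stand in for target language words.
corpus = [
    (["the", "house"], ["T_the", "T_house"]),
    (["the", "book"], ["T_the", "T_book"]),
    (["a", "book"], ["T_a", "T_book"]),
]
t = train_ibm_model1(corpus)
print(round(t[("T_house", "house")], 3), round(t[("T_the", "house")], 3))
```

Because `T_the` co-occurs with "the" in two sentence pairs, EM gradually assigns it to "the", which leaves `T_house` as the preferred translation of "house".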
Singh and Bandyopadhyay have developed a Manipuri-English bidirectional
statistical machine translation system [133]. The system uses four translation
factors, namely case markers and POS tag information on the source side, and suffixes
and dependency relations on the target side. This translation system has been
evaluated using the BLEU score.
2.4.4 Example-based Machine Translation
An example-based machine translation system uses a bilingual corpus of parallel
text for machine translation. These systems are trained on bilingual parallel
corpora, which contain sentence pairs. The example-based approach is particularly
useful for detecting context in the text. This approach also uses translation
memories [13]. A number of machine translation systems have been developed all over
the world using this approach.
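The lookup step of a translation memory can be sketched as a nearest-example search. The sketch below uses word-level similarity from Python's standard `difflib` module; the memory contents, the `T(...)` placeholder translations and the 0.6 threshold are assumptions for illustration, and a real example-based system would also recombine fragments of the retrieved examples.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: source sentence -> stored translation.
MEMORY = {
    "the weather is fine today": "T(the weather is fine today)",
    "the train leaves at noon": "T(the train leaves at noon)",
}


def best_match(sentence, memory=MEMORY, threshold=0.6):
    """Return (stored_source, stored_translation, score) for the closest example,
    or None when no stored example reaches the similarity threshold."""
    words = sentence.lower().split()
    best = None
    for source, target in memory.items():
        # Compare word sequences rather than characters
        score = SequenceMatcher(None, words, source.split()).ratio()
        if best is None or score > best[2]:
            best = (source, target, score)
    return best if best and best[2] >= threshold else None


print(best_match("the weather is bad today"))
print(best_match("completely unrelated words here"))
```

The first query shares four of its five words with a stored example and is matched; the second shares none and returns `None`.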
Among others, OpenMaTrEx is an open-source example-based machine
translation system that is freely available on the OpenMaTrEx web site [121].
OpenMaTrEx is based on the marker hypothesis and comprises a marker-driven
chunker, a collection of chunk aligners and two engines.
Kyoto-U is a successful example-based machine translation system for
English-Japanese translation [119]. This system uses a morphological analyzer and a
dependency analyzer to detect Japanese sentence structures, which are converted into
dependency structures. In addition, Japanese and English parsers and a bilingual
dictionary are used as external resources.
At present, many researchers are working on example-based machine translation
systems that use the World Wide Web as a parallel corpus [55]. wEBMT is one such
example-based machine translation (EBMT) system [13].
2.4.5 Knowledge-based Machine Translation
The knowledge-based machine translation approach uses knowledge for machine
translation and extends the idea of example-based machine translation. It relies on
linguistic and computational instructions supplied by a human. A number of
commercial-quality machine translation systems have used the knowledge-based
approach. Among others, EDR [150] and KANT [86] are the major knowledge-based
machine translation systems.
EDR (Electronic Dictionary Research) [114], developed in Japan, is the most
successful machine translation system. It takes a knowledge-based approach in
which the translation process is supported by several dictionaries and a huge corpus
[115]. While using the knowledge-based approach, EDR is governed by a process of
statistical machine translation. Compared with other machine translation systems,
EDR is more than a mere translation system, as it also provides a great deal of
related information.
KANT (Knowledge-based Accurate Natural-language Translation) is a knowledge-based
machine translation system for specific domains [86]. The KANT prototype has been
implemented in the domain of technical electronics manuals and successfully
translates from English into Japanese, French and German. KANT is currently being
extended in a large-scale commercial application [118].
2.4.6 Hybrid Machine Translation
The hybrid machine translation approach combines the rule-based and statistical
machine translation approaches. This hybrid approach has several advantages.
Among others, SYSTRAN is the market-leading provider of language translation
software products and solutions for the desktop, enterprise and Internet, facilitating
communication in 52 language combinations and in 20 vertical domains [124].
By combining self-learning and linguistic technologies, SYSTRAN has developed a
hybrid machine translation system [144] named SYSTRAN Enterprise Server 7.
An English to Arabic machine translation system has also been developed using a
hybrid approach that combines the rule-based and example-based approaches [133].
2.4.7 Agent-based Machine Translation
Agent technology, more specifically multi-agent systems, has also been used to
handle machine translation. Multi-agent systems provide tools for building
artificial complex adaptive systems [131]. In general, any multi-agent system
contains four key components, namely the multi-agent engine, the virtual world, the
ontology and the interfaces [130][131]. The multi-agent engine provides run-time
support for agents and is started as the first step of the system. The virtual world
is the environment of the multi-agent system; within it, agents cooperate and compete
with each other as they construct and modify the current scene. The ontology contains
the conceptual problem domain knowledge of each agent.
There are a number of NLP systems that have been developed using multi agent
system technology [175][129][130][113][36]. Most of these systems use agents to
handle semantics in the translation.
Minakow and others [113] have developed a multi-agent-based text understanding
system for the car insurance domain. The system understands a given text in four
steps, namely morphological analysis, syntax analysis, semantic analysis and
pragmatic analysis. For analysis, the whole text is divided into sentences, and the
first three stages are applied to each sentence. After each paragraph has been
analyzed, the text is passed to pragmatic analysis.
Stefanini and others have developed a multi-agent-based general natural language
processing system named TALISMAN [141]. TALISMAN agents communicate with
each other without central control and can exchange information directly using an
interaction language. Linguistic agents are governed by a set of local rules.
TALISMAN deals with ambiguities and provides a distributed algorithm for resolving
conflicts that arise from uncertain information.
2.5 Existing English to Sinhala Machine Translation Systems
During the past few years, many Sri Lankan researchers have contributed to the
development of machine translation systems for local languages. Among others, the
University of Colombo has carried out significant research toward English to Sinhala
and Sinhala-Tamil machine translation systems, together with several local language
resources such as a Sinhala corpus [99][159], a Sinhala text-to-speech system [160],
a part-of-speech tagger [45] and an OCR system for the Sinhala language [158]. As a
first attempt, Weerasinghe and others have been working on a Sinhala to Tamil machine
translation system using the corpus-based approach [157]. This translation system was
evaluated using the BLEU score metric [123], and reasonable results were achieved.
At present they are working on an English to Sinhala machine translation system
based on translation memories [156]. They have designed a translation tool named
OpenTM, which is based on translation memories, and state that OpenTM is suitable
for any language pair in which at least one language requires complex script support.
Further, many other local researchers have developed prototype English to
Sinhala machine translation systems using several approaches. In 2003, Vithanage
and others developed an English to Sinhala machine translation system for the
weather forecasting domain [153]. Vithanage’s translation system can translate
simple sentences and works on a limited set of words and a limited set of sentence
patterns. The system is fundamentally rule-based and uses paragraph and sentence
tokenization, simple parsers (English and Sinhala), translators and Sinhala sentence
generators for English to Sinhala translation.
In 2008, Fernando and others developed an English to Sinhala machine translation
system using artificial neural networks [47]. A probabilistic neural network, based
on Bayesian classifiers, is used to identify the English grammar. The system achieved
50% accuracy in grammatical translation. It was tested on 84 test cases covering 12
tenses, and it is capable of translating only simple sentences.
In addition to the above, researchers elsewhere in the world have attempted to develop
machine translation systems for Sinhala. Among others, Herath and others have
attempted to develop a translation system from Japanese to modern Sinhalese [57]. The
system has a limited vocabulary and handles translations only within its domain.
2.6 Concepts and Techniques for Machine Translation
In the previous sections the author has discussed several existing approaches to
machine translation. Many of these machine translation systems use morphological
analysis and syntax analysis to analyze the source language. Morphological analysis
and syntax analysis are performed by morphological analyzers and parsers, which play
a major role in any machine translation system. The following subsections therefore
give a brief description of morphological analysis and syntax analysis.
2.6.1 Morphological Analysis
Morphological analysis is the identification of the structure of morphemes and other
units of meaning in a language, such as words, affixes and parts of speech
[84][162][176]. Historically, the first attempt at morphological analysis was made
by the ancient Indian linguist Panini, who formulated the 3,959 rules of Sanskrit
morphology (Vyakarana). This Panini grammar [24] is the basis of all the Indian
families of languages, including Hindi, Sinhala, Pali and Sanskrit. Using the Panini
grammar model, many researchers have developed morphological analyzers for their
languages [5][6].
Morphological analyzers for the English language have been developed by many
researchers. Koskenniemi’s two-level morphology was the first practical and most
general model in the history of computational linguistics for the analysis of
morphologically complex languages [92][93]. Koskenniemi’s Pascal implementation
of morphological analysis was quickly followed by others, the most influential of
which was the KIMMO system by Lauri Karttunen and his students at the University
of Texas. PC-KIMMO is another morphological analysis tool, based on Koskenniemi’s
work and implemented in C [87]; it can be run on a PC [93]. PC-KIMMO is reported to
be the only freely available English morphological analyzer with wide coverage [34].
Its lexicon covers verbs, pronouns, nouns, prepositions, adverbs and adjectives.
PC-KIMMO accepts an input word from the user and provides all possible morphological
details of the word. In addition, many European and Scandinavian countries have
developed morphological analyzers for their languages and have exploited the power of
computer technology for machine translation.
Asian countries including India, Japan and Thailand have also developed
morphological analyzers for computer-based natural language processing [5][6]. For
example, the Anusaaraka system includes morphological analyzers for six Indian
languages [16]. Anusaaraka has been designed to translate among major Indian
languages, and its morphological analysis is based on paradigms. A paradigm is
used both for word analysis and for word generation. Akshar Bharati and others have
also developed a generic morphological analysis shell that can be used to develop
morphological analyzers for different minority languages [5]. This shell uses finite
state transducers (FSTs) with features to give the analysis of a given word, and it
integrates paradigms with augmented FSTs. The current model has been developed for
sample data of Hindi, Telugu, Tamil and Russian. The shell uses dictionaries, a
paradigm table and paradigm classes.
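The idea that one paradigm serves both analysis and generation can be sketched with a small table. This is a minimal illustration, not the Anusaaraka or shell implementation: the paradigm classes, feature labels and English roots are hypothetical, and the finite state machinery is replaced here by a plain table scan.

```python
# Hypothetical paradigm table: paradigm class -> {suffix: feature label}.
PARADIGMS = {
    "noun_regular": {"": "singular", "s": "plural"},
    "verb_regular": {"": "base", "s": "3sg-present", "ed": "past", "ing": "progressive"},
}

# Hypothetical root dictionary: root -> paradigm class.
ROOTS = {"book": "noun_regular", "walk": "verb_regular"}


def analyze(word):
    """Return every (root, paradigm, features) whose root + suffix equals the word."""
    results = []
    for root, paradigm in ROOTS.items():
        for suffix, features in PARADIGMS[paradigm].items():
            if root + suffix == word:
                results.append((root, paradigm, features))
    return results


def generate(root, features):
    """Inverse direction: build the surface form for a root and a feature label."""
    for suffix, label in PARADIGMS[ROOTS[root]].items():
        if label == features:
            return root + suffix
    raise ValueError(f"{features!r} is not in the paradigm of {root!r}")


print(analyze("walked"))           # [('walk', 'verb_regular', 'past')]
print(generate("book", "plural"))  # books
```

Adding a new regular word then only requires a dictionary entry that names its paradigm class, which is what makes the paradigm approach economical.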
2.6.2 Syntax Analysis
Syntax analysis is used to analyze the structure of a text and to determine
whether or not the text conforms to an expected format [84][91]. From the machine
translation point of view, syntax analysis is performed by the parser, which analyzes
the given text (sentences). To do so, parsers use several techniques that fall under
top-down and bottom-up parsing.
Top-down parsers analyze the input from left to right and search for parse trees
using a top-down expansion [162]. Several types of parser have been developed using
the top-down approach, including the recursive descent parser, the LL parser, the
Earley parser and the X-SAIGA parser. Each of these parsers has its own properties in
addition to the general top-down parsing features.
The recursive descent parser is the most straightforward form of top-down parsing
[97]. The LL parser also uses top-down parsing: it parses the input from left to
right and constructs a leftmost derivation of the sentence. ANTLR [148] is a popular
LL parser generator, used especially for compilers. An LL(k) parser uses k tokens of
lookahead to parse sentences without backtracking. Earley parsers are especially
suitable for ambiguous grammars and are used for parsing in computational
linguistics. Many of these parsers have already been implemented in C, Java, Perl
and Python. The X-SAIGA parsers were developed under the X-SAIGA project, which aims
to create algorithms and implementations that enable the construction of language
processors such as recognizers, parsers, interpreters and translators; several
algorithms were implemented at various stages of the project [166].
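A recursive descent parser of the kind described above can be sketched with one method per nonterminal. The grammar and lexicon below are toy assumptions chosen so that a single token of lookahead is enough, which makes this a predictive (LL(1)-style) parser with no backtracking; the class and function names are hypothetical.

```python
# Toy grammar (an assumption of this sketch):
#   S  -> NP VP
#   NP -> Det N | N
#   VP -> V NP | V
LEXICON = {"the": "Det", "a": "Det", "dog": "N", "cat": "N",
           "sees": "V", "sleeps": "V"}


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """Category of the next token, or None at the end of the input."""
        if self.pos < len(self.tokens):
            return LEXICON.get(self.tokens[self.pos])
        return None

    def expect(self, category):
        """Consume one token of the given category or fail."""
        if self.peek() != category:
            raise SyntaxError(f"expected {category} at position {self.pos}")
        word = self.tokens[self.pos]
        self.pos += 1
        return (category, word)

    def parse_s(self):
        return ("S", self.parse_np(), self.parse_vp())

    def parse_np(self):
        if self.peek() == "Det":                  # one token of lookahead
            return ("NP", self.expect("Det"), self.expect("N"))
        return ("NP", self.expect("N"))

    def parse_vp(self):
        verb = self.expect("V")
        if self.peek() in ("Det", "N"):           # an object NP follows
            return ("VP", verb, self.parse_np())
        return ("VP", verb)


def parse(sentence):
    parser = Parser(sentence.lower().split())
    tree = parser.parse_s()
    if parser.pos != len(parser.tokens):
        raise SyntaxError("unexpected trailing input")
    return tree


print(parse("the dog sees a cat"))
```

Each method expands its nonterminal from the top of the tree downwards and consumes tokens strictly from left to right, so the tree that is built corresponds to a leftmost derivation.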
A bottom-up parser attempts to identify the most fundamental units first and then
builds trees upwards toward the start symbol. These parsers are used to analyze
both natural languages and computer languages. Several types of parser have been
developed using the bottom-up approach, including operator precedence parsers,
LR parsers and CYK parsers.
The operator precedence parser is a bottom-up parser that interprets an
operator-precedence grammar [162]. The LR parser [132] also uses bottom-up parsing:
it parses the input from left to right and constructs a rightmost derivation of the
sentence in reverse. CYK parsers use the Cocke–Younger–Kasami algorithm, which is
also based on bottom-up parsing, and operate on context-free grammars given in
Chomsky normal form (CNF) [31][32].
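The bottom-up table filling of the CYK algorithm can be sketched as a recognizer. The grammar below is a toy assumption written directly in Chomsky normal form (bare-noun phrases and intransitive verbs are encoded as lexical rules because CNF does not allow unit productions), and the function name is hypothetical.

```python
from itertools import product

# Toy grammar in Chomsky normal form (an assumption of this sketch).
BINARY = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
LEXICAL = {"the": {"Det"}, "a": {"Det"}, "dog": {"N", "NP"}, "cat": {"N", "NP"},
           "sees": {"V"}, "sleeps": {"VP"}}


def cyk_recognize(tokens, start="S"):
    """Return True if the token list can be derived from the start symbol."""
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive tokens[i : j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(tokens):
        table[i][i] = set(LEXICAL.get(word, ()))
    for span in range(2, n + 1):              # length of the substring
        for i in range(n - span + 1):         # start index
            j = i + span - 1                  # end index (inclusive)
            for k in range(i, j):             # split between k and k + 1
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY.get((left, right), set())
    return start in table[0][n - 1]


print(cyk_recognize("the dog sees a cat".split()))  # True
print(cyk_recognize("the sees dog".split()))        # False
```

The table is filled from single words up to the whole sentence, so every longer constituent is built only from shorter constituents that have already been recognized.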
In addition to the above, parsers have been developed in several programming
languages, especially Prolog [25], and a number of tools are used to develop parsers,
including ANTLR, Yacc and JavaCC. Using these programming languages and
development tools, many parsers have been developed for natural languages as well as
for computer programming languages.
2.7 Problem Definition
The existing machine translation systems that use the approaches described above
cannot directly translate English text into Sinhala. Since each natural language is
built on its own building blocks and structures, two languages cannot necessarily be
handled in the same manner. Although some Indian languages have features in common
with Sinhala, they are not identical. Moreover, such systems do not provide an
underlying theory for generalizing machine translation. As a result, it is
impossible to determine exactly which building blocks or structures should be
customized to create an English to Sinhala machine translation system. The lack of a
theoretically based approach to machine translation has therefore led to the
development of ad hoc translation systems.
2.8 Summary
This chapter gave a detailed discussion of machine translation systems and the
approaches they use. Table 2.1 shows selected successful machine translation
systems with their language pair, approach and system type.
Table 2.1: Existing Machine translation systems
System              Language pair                 Approach & Type
Anusaaraka          Among Indian languages        Human-assisted, Application
AnglaBharti         English to Indian languages   Human-assisted, Rule-based, Application
AnglaHindi          English to Hindi              Machine-aided, Rule-based/example-based, Web-based
ManTra              English to Hindi              Human-aided, Web-based
English to Urdu MT  English to Urdu               Example-based, Application
MaTra               English to Hindi              Human-aided, Transfer-based, Application
Google TR           Several languages             Statistical, Web-based
Babel Fish          Several languages             Systran technology, Web-based
Yahoo TR            Several languages             Statistical, Web-based
Apertium            Related languages             Rule-based, Application
EDR                 English/Japanese              Knowledge-based, Application
26
According to the literature survey, the author has identified that human-assisted
and rule-based approaches are more suitable for unrelated language pairs such as
English and Sinhala. The next chapter reviews the features of the English and
Sinhala languages with a view to identifying issues related to machine
translation from English to Sinhala.
Chapter 3
OVERVIEW OF THE ENGLISH AND SINHALA LANGUAGES
3.1 Introduction
The previous chapter discussed machine translation systems in detail and pointed
out the issues in adapting an existing translation system to construct an
English to Sinhala machine translation system. The literature review also
revealed that the development of a machine translation system depends heavily on
the structure of the source and target languages. Therefore, this chapter
studies the primitives and structures of the English and Sinhala languages. This
study provides insight into how translation from English to Sinhala can be done.
3.2 The English Language
English is the language of international communication, and more than 53
countries use it as an official language. It is a West Germanic language that
originated from the Anglo-Frisian and Old Saxon dialects brought to Britain
[162]. The English alphabet contains 26 letters, including 5 vowels [116].
English has eight parts of speech: noun, adjective, pronoun, verb, adverb,
preposition, conjunction and interjection [8][165]. The rest of this section
describes the morphology, syntax and semantics of the English language.
3.3 The English Language Morphology
Morphology is the study of the way words are built up from smaller
meaning-bearing units called morphemes; a morpheme is often defined as the
minimal meaning-bearing unit in a language [84]. For example, the word boy
consists of a single morpheme, while the word boys consists of two morphemes,
boy and -s. Further, from the morphological point of view there are two types of
morphemes: stems and affixes. In the previous example the morpheme boy is a stem
and -s is an affix. Stems and affixes take part in both the inflection and the
derivation of words, which together are called word formation [109]. Inflection
provides the various forms of a single word, such as singular and plural (e.g.
singular man, plural men in English). Derivation creates new words from old ones
(e.g. the creation of dogcatcher from ‘dog’, ‘catch’ and ‘er’ is a derivational
process) [117][84]. Compared with other Indo-European languages, English has
minimal inflection. Therefore, English morphology is simpler than that of the
other Indo-European languages. With the exception of pronouns, English words
have relatively few forms.
3.3.1 English Noun Morphology
The English noun has two types of inflection: number and possessive case. For
number, nouns generally have only two forms, singular and plural. In the
possessive case, words usually end in ( ’s ) or ( ’ ), for example boy’s and
boys’.
English nouns inflect either regularly or irregularly. Regular inflection gives
the general forms of the singular, plural and possessive cases. Table 3.1 shows
a regular and an irregular noun with their inflected forms.
Table 3.1: Regular and irregular forms of the English Noun
Grammar rule           Regular   Irregular
Singular               boy       man
Plural                 boys      men
Singular Possessive    boy's     man's
Plural Possessive      boys'     men's
The English noun has a very limited number of morphological rules for
inflection. Table 3.2 shows some of these rules. Basically, the plural noun is
formed by adding a suffix such as s, es, ies or ves to the singular noun. The
possessive case is formed by adding ’s or ’.
Table 3.2: English Noun Morphological rules
No   Morphological structure           Base word   Example
1    Singular noun                     boy         boy
2    Plural: Base + s                  boy         boys
3    Plural: Base + es                 class       classes
4    Plural: Base - y + ies            baby        babies
5    Plural: Base - f + ves            knife       knives
6    Singular possessive: Base + ’s    school      school’s
7    Plural possessive: Plural + ’     boy         boys’
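The rules of Table 3.2 can be sketched as a few lines of Python. This is only an illustrative sketch: the function names, the small irregular-noun list and the exact spelling conditions used to choose between s, es, ies and ves are assumptions of this example, not rules stated in the thesis.

```python
VOWELS = "aeiou"
# Irregular plurals have to be listed explicitly (small illustrative sample).
IRREGULAR_PLURALS = {"man": "men"}


def plural(noun):
    """Form the plural with the suffix rules of Table 3.2."""
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"                       # class -> classes
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"                 # baby -> babies
    if noun.endswith("fe"):
        return noun[:-2] + "ves"                 # knife -> knives
    if noun.endswith("f"):
        return noun[:-1] + "ves"                 # leaf -> leaves
    return noun + "s"                            # boy -> boys


def possessive(noun, plural_form=False):
    """Form the singular or plural possessive (rules 6 and 7)."""
    word = plural(noun) if plural_form else noun
    # A plural that already ends in "s" takes only an apostrophe.
    if plural_form and word.endswith("s"):
        return word + "'"
    return word + "'s"


for base in ("boy", "class", "baby", "knife", "man"):
    print(base, plural(base), possessive(base), possessive(base, plural_form=True))
```

The f/fe rule over-generates for nouns such as roof, so a practical analyzer would need an exception list in the same way as for irregular plurals.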
3.3.2 English Verb Morphology
The English verb has five inflected forms: infinitive, simple present, past
tense, past participle and present participle. In regular verbs, the 3rd person
singular ends with ‘s’, the past tense and past participle end with ‘ed’, and
the present participle ends with ‘ing’. Note that English has a large number of
irregular verbs, which do not fit this pattern. The personal pronoun has
different forms depending on number (singular and plural), case (subject,
object, possessive, etc.) and person (1st, 2nd and 3rd person). In the 3rd
person singular, there is gender too. Table 3.3 shows all the forms of the
English verbs play (regular) and eat (irregular).
From the morphological point of view, English regular verbs follow a small
number of rules. Table 3.4 shows the morphological rules for the English verb.
Most regular verbs follow these simple inflection rules. Irregular verbs,
however, use different patterns and follow the regular rules only for the simple
present (adding s) and the present participle (adding ing).
3.3.3 English Adjective Morphology
Adjectives have comparative and superlative forms: comparative adjectives end
with 'er' and superlative adjectives end with 'est'. For example, higher and
highest are the comparative and superlative forms of the adjective ‘high’. The
other parts of speech (adverb, preposition, conjunction and interjection) do not
show inflection.
Table 3.3: English verb Morphology
Morphological structure    Regular verb   Irregular verb
Infinitive                 play           eat
Past                       played         ate
Present Participle         playing        eating
Past Participle            played         eaten
Present: I                 play           eat
Present: You               play           eat
Present: He, She, It       plays          eats
Present: We                play           eat
Present: You               play           eat
Present: They              play           eat
Table 3.4: Morphological rules for English Verbs
No   Morphological structure            Regular verb   Irregular verb
1    Infinitive verb (base verb)        play           eat
2    Simple present (base + s)          plays          eats
3    Past (base + ed)                   played         ate
4    Present participle (base + ing)    playing        eating
5    Past participle (base + ed)        played         eaten
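The rules of Table 3.4 can be sketched as follows. This is a minimal illustration, assuming that irregular forms are stored in a lookup table which overrides the regular rules; the names inflect_verb and IRREGULAR_VERBS are hypothetical, and spelling adjustments (for example write, writing) are deliberately left out.

```python
# Irregular forms override the regular rules (illustrative sample only).
IRREGULAR_VERBS = {
    "eat": {"past": "ate", "past_participle": "eaten"},
}


def inflect_verb(base):
    """Return the five forms of Table 3.4 for a base verb."""
    forms = {
        "infinitive": base,
        "simple_present": base + "s",        # rule 2: base + s
        "past": base + "ed",                 # rule 3: base + ed
        "present_participle": base + "ing",  # rule 4: base + ing
        "past_participle": base + "ed",      # rule 5: base + ed
    }
    # Only the forms listed for an irregular verb are replaced, so the
    # simple present and present participle still follow the regular rules.
    forms.update(IRREGULAR_VERBS.get(base, {}))
    return forms


print(inflect_verb("play"))
print(inflect_verb("eat"))
```

Overriding only the listed forms mirrors the observation in the text that irregular verbs keep the regular simple present and present participle.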
3.4 Syntax of the English Language
Syntax is the study of the rules that give sentences their structure [162].
English has its own sentence format, which differs from that of Sinhala. This
section gives a brief description of English sentence syntax, based on the
Scientific Psychic web site [172][174]. English has four main sentence types:
declarative, interrogative, imperative and conditional. An English sentence may
be simple or compound. A compound sentence consists of two or more simple
sentences joined by conjunctions.
A declarative sentence consists of a subject and a predicate. The subject may be
a simple subject or a compound subject. A simple subject consists of a noun
phrase or a nominative personal pronoun. Compound subjects are formed by
combining several simple subjects with conjunctions. All the sentences in this
paragraph are declarative sentences.
Interrogative sentences are used to form questions. One form of interrogative
sentence is a declarative sentence followed by a question mark; other forms
start with a question word such as what, who or which.
Imperative sentences are commands. They consist of predicates that contain only
verbs in the infinitive form. Generally, imperative sentences are terminated
with an exclamation mark instead of a period.
Conditional sentences are used to describe the consequences of a specific
action, or the dependency between events or conditions. A conditional sentence
consists of an independent clause and a dependent clause.
In addition to the above, developing a machine translation system requires a
deeper structural analysis of the English source sentence, especially of its
subject, object, predicate and sentence pattern. This information is very useful
for building English phrases.
3.4.1 The English Sentence Subject
The subject is the part of the sentence that performs an action or is
associated with the action. The subject may be simple or compound. A simple
subject may be a noun phrase or a nominative personal pronoun. (The nominative
personal pronouns are I, you, he, she, it, we and they.)
3.4.2 The English Predicate
The predicate is the part of the sentence that contains a verb or verb phrase and its
complements. English has three main kinds of verbs: auxiliary verbs, linking verbs,
and action verbs.
3.4.3 Verb Tense
Verb tenses are inflectional forms of verbs or verb phrases that are used to
express time distinctions [8]. Table 3.5 shows the structure of some common
tenses.
Table 3.5: Tense patterns (Active voice)
Tense                         Example
Simple present                I write a book
                              The boy sings a new song
Present continuous            I am writing a book
                              The boy is singing a new song
Present perfect               I have written a book
                              The boy has sung a new song
Present perfect continuous    I have been writing a book
                              The boy has been singing a new song
Past tense                    I wrote a book
                              The boy sang a new song
Past continuous               I was writing a book
                              The boy was singing a new song
Past perfect                  I had written a book
                              The boy had sung a new song
Past perfect continuous       I had been writing a book
                              The boy had been singing a new song
Future tense                  I will write a book
                              The boy will sing a new song
Future continuous             I shall be writing a book
                              The boy will be singing a new song
Future perfect                I shall have written a book
                              The boy will have sung a new song
Future perfect continuous     I shall have been writing a book
                              The boy will have been singing a new song
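The tense patterns in Table 3.5 are regular enough to be generated from the principal parts of a verb and a few auxiliaries. The sketch below is an assumption-laden illustration: it handles only the subjects "I" and a third-person singular noun phrase, it uses "will" throughout where the table alternates between "shall" and "will", and the names VERB_FORMS and verb_phrase are hypothetical.

```python
# Principal parts of the two verbs used in Table 3.5 (illustrative lexicon).
VERB_FORMS = {
    "write": {"base": "write", "s": "writes", "past": "wrote",
              "ing": "writing", "pp": "written"},
    "sing": {"base": "sing", "s": "sings", "past": "sang",
             "ing": "singing", "pp": "sung"},
}


def verb_phrase(verb, tense, third_singular=False):
    """Build the active-voice verb phrase for "I" or a 3rd person singular subject."""
    f = VERB_FORMS[verb]
    be = "is" if third_singular else "am"
    have = "has" if third_singular else "have"
    patterns = {
        "simple present": f["s"] if third_singular else f["base"],
        "present continuous": f"{be} {f['ing']}",
        "present perfect": f"{have} {f['pp']}",
        "present perfect continuous": f"{have} been {f['ing']}",
        "past": f["past"],
        "past continuous": f"was {f['ing']}",
        "past perfect": f"had {f['pp']}",
        "past perfect continuous": f"had been {f['ing']}",
        "future": f"will {f['base']}",
        "future continuous": f"will be {f['ing']}",
        "future perfect": f"will have {f['pp']}",
        "future perfect continuous": f"will have been {f['ing']}",
    }
    return patterns[tense]


print("I", verb_phrase("write", "present perfect"), "a book")
print("The boy", verb_phrase("sing", "past perfect continuous", True), "a new song")
```

Read in the other direction, the same patterns indicate what a tense detector has to recognize: a sequence of auxiliaries followed by a particular form of the main verb.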
3.4.4 The Complement
The predicate consists of a verb or verb phrase and its complements, if any. A verb
that requires no complements is called intransitive. A verb that requires one or two
complements is called transitive.
3.5 Semantics of English Language
Semantics is the study of meaning. It typically focuses on the relation between
signifiers, such as words, phrases, signs and symbols, and what they stand for
[162]. Semantics can be classified into three levels: word level meaning,
sentence level meaning and paragraph level meaning.
3.5.1 Word Level Semantics
Word level semantics refers to the meaning carried by the individual words of a
sentence. As an example, consider the following sentences: “This is a red rose”,
“This paper is red” and “The supervisor flashes the red light for his student”.
The word ‘red’ gives a different meaning in each sentence.
3.5.2 Sentence Level Semantics
Sentence level semantics refers to the meaning that depends on the sentence as a
whole. Analyzing sentence level semantics is very important in many areas [37].
3.5.3 Paragraph Level Semantics
Paragraph level semantic analysis [173] is a solution to word sense ambiguity
[80]. Further, many researchers have studied paragraph level semantics [127].
3.6 The Sinhala Language
The Sinhala language is constitutionally recognized as the official language of
Sri Lanka, along with Tamil. Sinhala is the mother tongue of the Sinhalese. The
Sinhala language has its own writing system, which is an offspring of the Brahmi
script [22]. Dhivehi, spoken in the Maldives, is the language most closely
related to Sinhala. Further, the Sinhala script is ranked as the world’s 16th
most creative alphabet among today’s functional languages [35]. The most
important historical chronicle of the Sinhalese, the Mahavansa [102], notes that
Prince Vijaya and his entourage, who came from India in the 5th century BC,
merged with the native Hela tribes known as Yakka and Naga, who spoke the Elu
language (the ancient form of the Sinhalese language), and that the new nation
called ‘Sinhala’ came into existence together with the Sinhala language.
Further, Sinhala differs from all other Indo-Aryan languages in that it contains
a pair of vowel sounds that are unique to it: the short vowel ‘we’ (ae) and the
long vowel ‘wE’ (aae). Sinhala also contains a set of five nasal sounds known as
“half nasals” or “prenasalized stops”. These sounds, as represented in modern
Sinhala writing, and their Romanized notations are as follows: Õa (nng), `ca
(ndj), `â (nnd), |a (nd), ò (mb) [88].
The next subsections briefly describe the Sinhala alphabet and the morphology
and syntax of the Sinhala language.
3.6.1 Sinhala Alphabet
The Sinhala alphabet consists of 61 letters: 18 vowels, 41 consonants and 2
semi-consonants [40][22]. These symbols represent 40 sounds: 14 vowel sounds and
26 consonant sounds. The alphabet is quite similar to other Indic alphabets, as
all of them appear to be offshoots of the Sanskrit alphabet [50]. Table 3.6
shows the Sinhala alphabet.
Table 3.6: The Sinhala Alphabet
Letter Type        Sinhala Letters
Vowels             w, wd, we, wE, b, B, W, W! ,Ì, Ï iD, iDD, t, ta, ft, T, ´, T!
Consonants         l, L, . , >, V, Õ, p, P, c, Cv [, {, P, g, G, v, V, K, Ë, ; , : , o, O, k, |, m, M, n, N, u, U, h, r, ,, j, Y, I, i, y, <, *
Semi-Consonants    x, (
Furthermore, some graphical symbols, called strokes, are used in conjunction
with consonants. They are used in writing some vowels too (for example, wd" ta"
ft). Unlike in English, a stroke may be positioned on any of the four sides of
the base letter. Table 3.7 shows the Sinhala strokes and their positions [42].
Table 3.7: Vocalic Strokes and their position
No   Stroke   Name               Position   Example
1    A        Al-lakuna1         Upper      ia
     A        Al-lakuna2         Upper      ¾
2    D        Aela-pilla         Right      ld
3    E        Kettiaedapilla     Right      le
4    E        Digaaedapilla      Right      lE
5    S        Ketti ispilla      Upper      ls
6    S        Diga ispilla       Upper      lS
7    Q        Kettipaa pilla1    Lower      nq
     =        Kettipaa pilla2    Lower      l=
8    Q        Digapaa pilla1     Lower      nQ
     +        Digapaa pilla2     Lower      l+
9    D        Gaettapilla        Right      iD
10   f        Kombuva            Left       fu
11   !        Gayanukitta        Right      T!
In addition to the above, Sinhala letters (characters) are generated from
vowels, from consonants and from consonants combined with strokes. Table 3.8
shows the combinations of the consonant l (k) with the vocalic strokes.
Table 3.8: The consonant ‘l’ with vocalic strokes
No   Character    Letter
1    la           la
2    la + w       l
3    la + wd      ld
4    la + we      le
5    la + wE      lE
6    la + b       ls
7    la + B       lS
8    la + W       l=
9    la + W!      l+
10   la + iD      lD
11   la + iDD     lDD
12   la + t       fl
13   la + ta      fla
14   la + ft      ffl
15   la + T       fld
16   la + ´       flda
17   la + T!      fl!
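The combination of a consonant with a vowel shown in Table 3.8 can be sketched in Python. Note the assumptions: the sketch uses Unicode Sinhala code points rather than the legacy font encoding in which the tables above are printed, it covers only a subset of the vowels, and the names combine and VOWEL_SIGNS are hypothetical.

```python
HAL = "\u0DCA"   # al-lakuna: marks a bare consonant without a vowel
KA = "\u0D9A"    # the consonant "k" of Table 3.8

# Independent vowel -> dependent vowel sign (stroke). The inherent vowel "a"
# needs no sign, so it maps to the empty string.
VOWEL_SIGNS = {
    "\u0D85": "",          # a
    "\u0D86": "\u0DCF",    # aa  (aela-pilla)
    "\u0D87": "\u0DD0",    # ae  (ketti aeda-pilla)
    "\u0D88": "\u0DD1",    # aae (diga aeda-pilla)
    "\u0D89": "\u0DD2",    # i   (ketti is-pilla)
    "\u0D8A": "\u0DD3",    # ii  (diga is-pilla)
    "\u0D8B": "\u0DD4",    # u   (ketti paa-pilla)
    "\u0D8C": "\u0DD6",    # uu  (diga paa-pilla)
    "\u0D91": "\u0DD9",    # e   (kombuva)
    "\u0D92": "\u0DDA",    # ee  (diga kombuva)
    "\u0D94": "\u0DDC",    # o   (kombuva and aela-pilla)
    "\u0D95": "\u0DDD",    # oo  (kombuva and diga aela-pilla)
}


def combine(bare_consonant, vowel):
    """Replace the al-lakuna of a bare consonant with the sign of the vowel."""
    if not bare_consonant.endswith(HAL):
        raise ValueError("expected a consonant ending in al-lakuna")
    return bare_consonant[:-1] + VOWEL_SIGNS[vowel]


for vowel in VOWEL_SIGNS:
    print(vowel, combine(KA + HAL, vowel))
```

In Unicode the vowel sign is always stored after the consonant, even for the kombuva, which is drawn to the left of the base letter; in the legacy encoding of Table 3.8 the kombuva glyph (f) is typed before the consonant.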
3.7 Sinhala Language Morphology
Sinhala is an inflectionally rich language; its nouns and verbs undergo
inflection, derivation and conjugation. Inflection is the modification of a word
to express different grammatical categories such as tense, mood, voice, aspect,
person, number, gender and case [54]. Derivation is “used to form new words, as
with happiness and un-happy from happy, or determination from determine” [162].
Conjugation refers to the creation of derived forms of a verb from its principal
parts by inflection. Conjugation may be affected by person, number, gender,
tense, aspect, mood, voice or other grammatical categories. A table giving all
the conjugated variants of a verb in a given language is called a conjugation
table or a verb paradigm.
3.7.1 Sinhala Noun Morphology
The Sinhala noun corresponds to the noun, the pronoun and the adjective of the
English language. The Sinhala noun has four types of inflection: gender
(lingaya), number (wachana), person (purusha) and case (vibhakthi). There are
three genders: masculine, feminine and neuter. There are two numbers, singular
and plural, and three persons: first person (uthtamapurusha), second person
(maddamapurusha) and third person (prathamapurusha). There are also nine cases
in Sinhala: Nominative (prathama), Accusative (karma), Instrumental (kaththru),
Auxiliary (karana), Dative (sampadana), Ablative (avadhi), Genitive (sambanda),
Locative (adara) and Vocative (alapana) [54][134]. A single base noun generates
27 inflected forms, determined by the nine cases (vibhakthi), the article and
the number. For example, the Sinhala base word ‘.j’ inflects as ‘.jhd’,
‘.jfhda’, ‘.jfhla’, etc. The base word is directly affected by the nine cases.
Some case suffixes are written together with the base word and some are written
separately. Table 3.9 shows sample case markers of the Sinhala noun. A number of
case marker forms are available in Sinhala, depending on the gender of the noun.
From the morphological point of view, Sinhala nouns participate in inflection
and derivation. Sinhala nouns can be divided into three categories: simple,
complex and compound. A simple noun contains only a prakurthi (base form), a
complex noun contains a prakurthi and a suffix [41], and a compound noun
contains two or more prakurthi. A prakurthi is the base, uninflected form of a
word.
A simple noun contains only a nama prakurthi [38]. Nama prakurthi is one of the
five kinds of prakurthi, such as nama, kriya, guna vilasa and nipatha. Nama
prakurthi is an adjective form of a noun. Nama prathya, vibakthi prathya,
upasarga and thadhitha are used to inflect a prakurthi. An upasarga is a prefix
and the others are suffixes. The following example shows how nouns are
inflected.
ñksid    = ñksia + wd       = Prakurthi + Nama prathya
ñksidg   = ñksia + wd + g   = Prakurthi + Nama prathya + Vibakthi prathya
fkdñksia = fkd + ñksia      = Upasarga + Prakurthi
ñksis    = ñksia + b        = Prakurthi + Thadhitha
Note that in the above, the word ñksia is a prakurthi, ‘wd’ is a nama prathya,
‘g’ is a vibakthi suffix, ‘fkd’ is an upasarga and ‘b’ is a thadhitha [43]. The
nama prakurthi is a base form, and the nama prathya is one of the inflectional
parts of the noun. The vibakthi suffix is also an inflectional part. Table 3.9
shows some case markers of Sinhala nouns. Upasarga and thadhitha change the
meaning of the noun. Note that any morphologically complex word can be broken up
into several meaningful units called morphs. Therefore prakurthi, nama prathya,
vibakthi prathya, thadhitha and upasarga are morphs in Sinhala.
Table 3.9: Sample case markers in Sinhala
No.   Case           Suffix
1     Nominative     -
2     Accusative     -
3     Instrumental   jsiska
4     Auxiliary      f.ka
5     Dative         g$ yg
6     Ablative       f.ka
7     Genitive       f.a
8     Locative       flfrys
9     Vocative       -
There are 27 forms of a noun that can be generated by inflecting a single root
word (prakurthi). This inflection is called ‘Nama varanagilla’ (word
conjugation). Sinhala has more than a hundred rules for conjugating a noun from
a given base form (prakurthi). Fifteen conjugation patterns, called ‘gana’, have
been identified for generating Sinhala nouns. Six of them (aeth ganaya, ali
ganaya, tara ganaya, vasu ganaya, kaputu ganaya and bamara ganaya) [41] are used
to generate masculine nouns, and nine (poth, akshara, basha, pili, akuru, polo,
sulan, nuwara and mutu) are used to generate neuter nouns. Table 3.10 shows some
rules for noun conjugation in the aeth ganaya.
Table 3.10: Conjugation table for ‘we;a’ ganaya
            ksh; tal                  wksh; Wla;                  wksh; wkqla;
m%lD;sh     A     R     Example       A       R     Example       A       R     Example
we;a        d     A     we;d          f;la    ;a    wef;la        l=      a     wef;l=
fldla       d     A     fldld         flla    la    fldflla       fll=    a     fldfll=
f.dka       d     A     f.dkd         fkla    ka    f.dfkla       fkl=    a     f.dfkl=
kslï        d     A     kslud         fula    ï     kslfula       ful=    ï     kslful=
lsUq,a      d     A     lsUq,d        f,la    ,a    lsUqf,la      f,l=    ,a    lsUqf,l=
ñksia       d     A     ñksid         fila    ia    ñksfila       fil=    ia    ñksfil=
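One way to read the conjugation rules in Table 3.10 is as suffix-replacement rules: a string is removed from the end of the base form and another string is appended. The sketch below rests on that reading; interpreting column A as the string to add and column R as the string to remove is an assumption of this example, the name apply_rule is hypothetical, and the strings are copied in the legacy font encoding used by the table. Only the middle column group is used, because its entries follow the pattern consistently.

```python
def apply_rule(base, remove, add):
    """Strip the suffix `remove` from the base form and append `add`."""
    if not base.endswith(remove):
        raise ValueError(f"{base!r} does not end with {remove!r}")
    return base[: len(base) - len(remove)] + add


# (base form, R, A, expected form) taken from the middle column group.
ROWS = [
    ("we;a", ";a", "f;la", "wef;la"),
    ("fldla", "la", "flla", "fldflla"),
    ("f.dka", "ka", "fkla", "f.dfkla"),
    ("kslï", "ï", "fula", "kslfula"),
    ("lsUq,a", ",a", "f,la", "lsUqf,la"),
    ("ñksia", "ia", "fila", "ñksfila"),
]

for base, remove, add, expected in ROWS:
    form = apply_rule(base, remove, add)
    print(base, "->", form, "(matches table)" if form == expected else "(differs)")
```

Storing each pattern as a (remove, add) pair keeps the rule table declarative, so a new ganaya can be supported by adding data rather than code.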
Furthermore, sandhi rules are morpho-graphemic rules that describe the changes
that occur when different morphemes are concatenated. There are ten sandhi rules
in the Sinhala language: purwasswara lopa, parasawara lopa, swara, swaradesha,
gatradesha, purwarupa, pararupa, gathashwara lopa, agama and dithwarupa. Nouns
also undergo derivation. Derivation creates new words from pre-existing words,
often of different syntactic categories. The sandhi rules are used for
derivation.
3.7.2 Sinhala Verb Morphology
Sinhala verbs are divided into two general classes: transitive (sakarmaka) and
intransitive (akarmaka). Both classes are inflected for voice (karaka), mood
(vidi), tense (kala), number (wachana) and person (purusha). Voice can be either
active or passive. There are four moods: indicative, optative, imperative and
conditional [54]. The Sinhala language has only three tenses: past, present and
future. The main verb (akkyathaya) takes three types of inflection: person,
number and gender. Tables 3.11 and 3.12 show the inflected forms of a verb in
the active voice and the passive voice.
Furthermore, the structure of the Sinhala verb is different from that of the
English verb. The Sinhala language has only three tenses, present (varthamana),
past (athitha) and future (anagatha), whereas English has 20 tenses across the
active and passive voices. Note that more than 18 inflected forms can be
generated from a Sinhala base verb through inflection for tense, number and
person. In addition, there are four moods (indicative, optative, imperative and
conditional) and two participles, the present participle (misrakriya) and the
past participle (purvakriya). For example, hñka and f.dia are inflected forms of
these two participles.
In addition to the above, the other parts of speech, namely nipatha and
upasarga, do not take any inflection.
Table 3.11: Inflection form of the Sinhala verbs (Active)
Person   Number     Present   Past      Future
First    Singular   n,ñ       ne,Sñ     n,kafkñ
First    Plural     n,uq      ne,Suq    n,kafkuq
Second   Singular   n,ys      ne,Sys    n,kafkys
Second   Plural     n,yq      ne,Syq    n,kafkyq
Third    Singular   n,hs      ne,S      n,kafkah
Third    Plural     n,;s      ne,Q      n,kafkdah
Table 3.12: Inflection form of the Sinhala verbs (Passive)
Person   Number     Present   Past       Future
First    Singular   nef,ñ     ne,sKsñ    nef,kafkñ
First    Plural     nef,uq    ne,sKsuq   nef,kafkuq
Second   Singular   nef,ys    ne,sKsys   nef,kafkys
Second   Plural     nef,yq    ne,sKsyq   nef,kafkyq
Third    Singular   nef,hs    ne,sKs     nef,kafkah
Third    Plural     nef,;s    ne,qKq     nef,kafkdah
From the morphological point of view, a verb contains two parts: a base verb
and a suffix. The base verb is a prakurthi, and it is named kriya prakurthi.
Different verb forms are generated by adding different suffixes to the kriya
prakurthi.
3.8 Syntax of the Sinhala Language
Syntax teaches how sentences are constructed in conformity with the rules of
grammar [54].
According to Sinhala grammar, Sinhala sentences can be categorized into six
types: simple, complex, contracted, collateral, compound and elliptical. A
simple sentence contains only one subject and one finite verb. A complex
sentence contains a principal sentence with one or more dependent or subordinate
clauses. Subordinate clauses can be divided into three kinds: substantive
clauses, adjective clauses and adverbial clauses.
There are 36 syntax rules in the Sinhala language for generating grammatically
correct Sinhala sentences. Most of these rules express subject-verb agreement.
3.9 Semantics of the Sinhala Language
Machine translation is the process of translating meaning from one language (the
source language) to another (the target language) [44]. Linguistically, meaning
exists on a number of levels, such as grammatical meaning, phrasal meaning,
contextual meaning, idiomatic meaning, restricted meaning and proverbial
meaning.
The grammatical meaning refers to the grammatical category of the word, such as
noun, verb or adjective. As a simple example, the Sinhala word ‘fldiai’ has
several Sinhala meanings such as (msrsisÿ lsrSug .kakd fldiai" ;=jd,hl we;s
fldiai) etc.
The phrasal meaning is analyzed in terms of the grammatical function within the
phrase. In the Sinhala language, there is a large number of phrasal meanings.
From the machine translation point of view, identifying the contextual,
idiomatic, restricted and proverbial meanings is a more complex and difficult
task. It is also a critical challenge in any machine translation system.
3.10 Comparison Between English and Sinhala
This section gives a brief comparison of the English and Sinhala languages.
Sinhala and English come from entirely different branches of the Indo-European
language family, and they have many differences and some similarities. The
differences between the two languages can be categorized in several ways,
including the fundamental, morphological and syntactic levels.
Considering the deep structure and the surface structure of the two languages,
there are a number of similarities. Both languages are written from left to
right, and the deep structures [30][31] of the two languages have some
similarities. The following sections describe the differences in the surface
structure of the two languages.
3.10.1 Fundamental Differences
There are several fundamental differences between the English and Sinhala
languages. At the word level, English has 8 parts of speech and Sinhala has only
4. English nouns, pronouns and adjectives can be directly mapped to the Sinhala
noun (nama). Most English verbs map directly to the Sinhala verb (kriya), but
some auxiliary verbs, which are used to form tenses, have no direct Sinhala
equivalent. (For example, in “I shall eat rice”, uu n;a lkafkñ, there is no
direct meaning for the English word shall.) The English adverb can be directly
mapped to the Sinhala adverb. Some Sinhala grammar specialists group Sinhala
adverbs into the kriya [54], while others treat adverbs as a separate part of
speech [88][89]. In addition, English prepositions and conjunctions can be
mapped to the Sinhala preposition (nipatha). Further, some English prepositions
have no direct meaning in the Sinhala sentence and only change the case form of
the noun. (For example, “to boy” is mapped to the Sinhala noun “<uhdg”.) This is
one of the challenging areas in English to Sinhala machine translation. English
interjections and other elements (prefixes and suffixes) can be directly mapped
to the Sinhala avya pada (other words).
From the alphabetical point of view, English has only 5 vowels and 21
consonants, whereas Sinhala has 18 vowels, 41 consonants and 2 semi-consonants.
Sinhala also has two unique vowels, the short vowel ‘we’ (ae) and the long vowel
‘wE’ (aae), and a set of five nasal sounds: Õa (nng), `ca (ndj), `â (nnd), |a
(nd), ò (mb).
In addition, the inflection and derivation forms of the two languages differ
from each other. These differences are discussed in the next section.
3.10.2 Morphological Differences
There are some differences between English morphology and Sinhala morphology.
English uses prefixes and suffixes to generate words. For example, the most
commonly used suffixes in English are s (boys), es (boxes), ed (played) and ing
(playing). There are also several prefixes, such as “non” (non-usable) and “un”
(uninstall).
In Sinhala, however, words are generated in different ways. The Sinhala part of
speech named upasarga acts as the prefix of Sinhala words, and sandhi rules are
used to combine two or more words.
When comparing English and Sinhala nouns, English nouns have only three types of
inflection: number, case and person. The Sinhala noun has four types of
inflection: number, person, gender and determination. (English determiners are
used as separate words: a boy, the boy, etc.)
The Sinhala verb is inflectionally richer than the English verb. The English
verb normally has 5 forms, including the simple present, past and past
participle. The Sinhala verb, however, has more than 36 inflected forms across
the two voices (active and passive) and the person and number inflections.
Sinhala also has 4 moods: indicative, optative, imperative and conditional
[54].
3.10.3 Syntax in the two Languages
The syntax of the two languages differs significantly, mainly in word order,
tense and sentence structure. English uses SVO word order and Sinhala uses SOV
word order. In addition, prepositional phrases appear in reverse order in the
Sinhala language.
Sinhala has only three tenses, whereas English uses 12 tense forms. Further, the
sentence structures of the two languages differ fundamentally.
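The word order difference described above can be sketched as a simple structural transfer step. This is a toy illustration only: the Clause type and the to_sov function are hypothetical names, English glosses stand in for the Sinhala words, and a real system would also need lexical transfer and inflection of every word.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Clause:
    """A simple analysed clause; all fields hold English glosses."""
    subject: str
    verb: str
    obj: Optional[str] = None


def to_sov(clause: Clause) -> List[str]:
    """Reorder an analysed SVO clause into SOV constituent order."""
    parts = [clause.subject]
    if clause.obj is not None:   # intransitive clauses have no object
        parts.append(clause.obj)
    parts.append(clause.verb)
    return parts


print(to_sov(Clause("I", "eat", "rice")))   # ['I', 'rice', 'eat']
print(to_sov(Clause("the boy", "sings")))   # ['the boy', 'sings']
```

Reordering whole constituents rather than single words matters because the subject or the object may be a multi-word phrase.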
3.11 Language Issues
The previous sections discussed the morphology, syntax and semantics of both
English and Sinhala and showed some of the differences between them. These
differences give rise to a number of issues in translating between the two
languages. This section describes the issues that need to be handled in English
to Sinhala machine translation.
3.11.1 Grammatical Issues
Several issues have been identified. Because English and Sinhala have different
language structures, translating English into Sinhala is a difficult process.
English is a West Germanic language that originated in Anglo-Saxon England.
Sinhala belongs to the Indo-Aryan branch of the Indo-European languages. The
following list shows some grammatical issues in the two languages.
• The literary language and the spoken language differ from each other in
Sinhala.
• Sinhala uses SOV (Subject Object Verb) word order and English uses SVO
(Subject Verb Object) word order.
• Sinhala nouns have five types of inflection: gender, number, person, case and
article (definite/indefinite). English nouns have four types of inflection:
gender, number, person and case.
• Sinhala has nine cases, which differ from those of English.
• In Sinhala there is a difference between the noun and the adjective form of
the noun. In English there is no such difference.
• The Sinhala language has only three tenses, while English has 12 tenses.
• A Sinhala sentence contains 8 components, namely ukkthavishshana (adjunct of
subject), ukthaya (subject), karma vishashenaya (attributive adjunct of object),
karmaya (object), akkyanaya, etc. These structures differ from the English
sentence structure.
3.11.2 Text Manipulation Issues
A source language document contains many tags as well as text. Some of the text
does not form complete sentences. The text appears in several forms, such as:
• Complete sentences
• Noun phrases
• URLs
• Equations
• Numbers etc.
The translation system needs to handle all of these for target language
generation. Identifying complete sentences is one of the critical problems in
machine translation. An English sentence normally ends with a dot (.), and the
dot is followed by a space. Using this two-character combination, the system
identifies the end of a sentence. However, names cause a problem (for example,
A. B. Fernando): here “A.” is not a sentence ending, so the HTML/text parser
needs an internal mechanism to resolve such cases. In addition, noun phrase
identification is another issue in translation. As an example, consider the
phrase “A Computer Science Subject”, which is translated as “mrs.Kl jsoHd
jsIhla”. Because of the grammatical differences between English and Sinhala,
word-level translation cannot be used. This is because a Sinhala noun differs
between its noun form and its adjective form (“mrs.Klh” is a noun form and
mrs.Kl is an adjective form). Also, in the Sinhala language the article comes
with the noun.
3.12 Challenges in English to Sinhala Machine Translation
The below section describes more about Sinhala and English languages including
Morphology, syntax, semantics and some language issues. This section describes
some challenges for the English to Sinhala machine translation including
segmentation, lexical selection, conjugation, tense detection, article insertion,
sentence boundaries, word order, measure words and translation divergences.
3.12.1 Word and Sentence Segmentation
A machine translation system must segment text into sentences and words before the
translation process starts. Word segmentation is the problem of dividing a string
of written language into its component words [162]. A number of studies are
available in this area for text or voice summarization and machine translation [82].
In English and many other languages that use some form of the Latin alphabet, the
space is a good approximation of a word delimiter. However, this segmentation is not
always successful. For example, sentences can be identified through the “dot and
space” sequence in a paragraph, but this rule fails for sentences such as
“Mr. Fernando is a lecturer”.
3.12.2 Lexical Selection
Lexical selection is another issue in machine translation, especially for the
statistical approach [140]. Lexical selection is more complex for inflectionally
rich languages. As an example, the English verb read has a number of forms:
infinitive read, past read, present participle reading, past participle read and
simple present reads. The infinitive, past and past participle forms are identical,
so identifying which one is intended is difficult. To address this issue, a more
powerful source-language morphological analyzer is needed.
3.12.3 Conjugation
Conjugation is another issue for machine translation: the system needs to generate
a number of word forms from a single base word. To address this issue, a machine
translation system needs a reliable word generator that produces the appropriate
word form. For English to Sinhala machine translation, the author uses a Sinhala
morphological generator to handle conjugation.
3.12.4 Tense Detection
Tense detection is another issue in machine translation. Tenses and sentence
patterns differ from language to language. For example, English has 12 tenses in
the active voice and 8 in the passive voice, whereas Sinhala has only 3 tenses.
3.12.5 Article Insertion
Article insertion [83] is a further problem when translating English text,
particularly in complex sentences. In English, articles appear as separate words.
In Sinhala there are no separate words for articles; the effect of the article is
carried by the noun. Therefore, Sinhala noun generation needs to take the article
into account.
3.12.6 Sentence boundaries
Detecting the boundaries of source sentences is another problem in machine
translation. A number of studies have been carried out in this area [9].
Identifying sentence boundaries is an important prerequisite for any machine
translation system.
3.12.7 Word Order
Word order is another problem in machine translation. Some languages, such as
English, use SVO word order, while others use SOV word order. Therefore, identifying
word boundaries and generating the correct word order is another critical task.
English and Sinhala have different word orders, and the order of prepositional
phrases also differs between the two languages.
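As a small illustration of the reordering task, the following Python sketch moves the verb of an SVO clause to the end, assuming that the target order is subject, object, other constituents, verb. The role labels and the function name svo_to_sov are hypothetical; the sketch also assumes that the constituents have already been identified and labelled by a parser.

```python
def svo_to_sov(constituents):
    """Reorder labelled constituents from SVO order to a verb-final (SOV) order.

    `constituents` is a list of (role, text) pairs. Roles other than
    'S', 'O' and 'V' are placed after the object and before the verb.
    """
    rank = {"S": 0, "O": 1, "V": 3}
    # sorted() is stable, so constituents with equal rank keep their source order.
    return sorted(constituents, key=lambda item: rank.get(item[0], 2))


clause = [("S", "the boy"), ("V", "eats"), ("O", "rice")]
print(svo_to_sov(clause))
```

Real reordering also has to deal with nested phrases and prepositional phrases, which this flat sketch does not attempt.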
3.13 Summary
In this chapter, the author made an in-depth study of the English and Sinhala
languages, examining their morphology, syntax and semantics together with the
existing language issues. The next chapter discusses our novel approach to
English to Sinhala machine translation.
Chapter 4
NOVEL APPROACH TO MACHINE TRANSLATION
4.1 Introduction
Chapter 3 reviewed features of the English and Sinhala languages with a view to
identifying issues pertaining to English to Sinhala machine translation. It was
pointed out that machine translation systems need a theoretical base for the analysis
of the source language and the creation of target language sentences. This chapter
presents a theoretical-based approach to machine translation from English to Sinhala.
4.2 A Theoretical-based Approach to Machine Translation
The concept of ‘Varanegeema’ (conjugation) in the Sinhala language has been
identified as the theoretical basis of the proposed approach to machine translation
from English to Sinhala. Conjugation in Sinhala describes “how we can derive
various word forms from a given base word”. Sinhala is an inflectional language and
is morphologically richer than English. For instance, a Sinhala noun has 27
conjugation forms, while a verb has more than 36 conjugation forms. A large number
of language constructions in Sinhala are handled by conjugation, including person,
preposition, tense, singular/plural and active/passive. The use of conjugation also
drastically reduces the number of words that must be stored in lexical databases.
In addition, conjugations can easily be implemented as a set of rules that follow a
very simple structure. As such, the concept of conjugation can drive many aspects of
language processing for machine translation. Therefore, the author has postulated
that Varanegeema could be used to develop a theoretical-based approach to machine
translation from English to Sinhala. This concept works at the morphological
generation level of the Sinhala language.
4.3 Computational Model of Grammar for Sinhala
Designing a computational model of grammar for a highly inflected language is a
complex task, because words in such a language take many forms, including three
gender forms and two number forms. This thesis presents a computational model of
grammar for the Sinhala language that covers its morphological and syntactic
analysis. Finite State Transducers (FST) and Context-Free Grammar (CFG) have been
used to describe the computational grammar for Sinhala.
4.3.1 Computational Model for Sinhala Morphology
Sinhala is a morphologically richer language than English. Prakurthi, Prathya,
Thaddhita and Upasarga are the morphological components of the Sinhala language.
In addition to these components, Sandhi rules are used to join two or more words in
Sinhala. For example, the Sinhala noun මිනිසා is generated from Prakurthi (මිනිස්) +
Nama prathya (ආ). The word මිනිසාට can be divided into Prakurthi (මිනිස්) + Nama
prathya (ආ) + Vibakthi prathya (ට). In addition, Nama Gana (නාම ගණ) and Kriya Gana
(ක්‍රියා ගණ) describe how nouns and verbs are derived from their base forms. They
also provide the theoretical basis for the concept of Varanegeema. To implement the
concept of Varanegeema, the author has implemented 85 grammar rules for Sinhala
nouns, taking the Nama Gana into account. To implement the Kriya Gana, 18 rules have
been considered. Figure 4.1 shows the finite state automaton for the Sinhala Kaputu
Gana. Table 4.1 shows the paradigm table for the Kaputu Ganaya.
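The rule format of Table 4.1 (remove a suffix from the base form, then add a new one) can be sketched in a few lines. The Python below is an illustration only and is not the thesis's Prolog implementation; the suffix pairs are transcribed from Table 4.1 into Unicode Sinhala, while the English form names and the function name conjugate are glosses and assumptions introduced here.

```python
# (suffix to remove, suffix to add) for each form, transcribed from Table 4.1.
# The English keys are glosses added for readability (an assumption of this sketch).
KAPUTU_GANA = {
    "definite_singular": ("ු", "ා"),
    "indefinite_direct": ("ටු", "ටෙක්"),
    "indefinite_indirect": ("ටු", "ටෙකු"),
    "plural_direct": ("ටු", "ටෝ"),
    "plural_indirect": ("ු", "න්"),
}


def conjugate(base, form, paradigm=KAPUTU_GANA):
    """Apply one remove/add rule of a paradigm to a base word."""
    remove, add = paradigm[form]
    if not base.endswith(remove):
        raise ValueError(f"{base!r} does not end with {remove!r}")
    return base[: len(base) - len(remove)] + add


for form in KAPUTU_GANA:
    print(form, conjugate("කපුටු", form))
```

Because each paradigm is just a small table of suffix pairs, only the base word and its Gana need to be stored in the lexicon, which is the storage saving the conjugation approach relies on.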
4.3.2 Context-Free Grammar for Sinhala language
Context-Free Grammar (CFG) is a particular method of describing the syntax of
languages. A context-free grammar has four components, namely a set of non-terminal
symbols, a set of terminal symbols, a set of productions and the start symbol S
[84]. Using these sets of symbols, each grammar module can be represented.
Considering the Sinhala language, a Sinhala sentence can be divided into eight
components, namely:
1. Attributive adjunct of Subject (උක්ත විශේෂණය)
2. Subject (උක්තය)
3. Attributive adjunct of Object (කර්ම විශේෂණය)
4. Object (කර්මය)
5. Attributive adjunct of Predicate (ආඛ්‍යාත විශේෂණය)
6. Attributive adjunct of the complement of predicate (ආඛ්‍යාත පූර්ණ විශේෂණය)
7. Complement of predicate (ආඛ්‍යාත පූර්ණය)
8. Predicate (ආඛ්‍යාතය)
Table 4.1: Paradigm table for Kaputu Ganaya
Gana: කපුටු ගණය
Base form: කපුටු
Form              Add      Remove   Example
නියත එක           ා        ු        කපුටා
අනියත උක්ත        ටෙක්     ටු       කපුටෙක්
අනියත අනුක්ත      ටෙකු     ටු       කපුටෙකු
බහු උක්ත          ටෝ       ටු       කපුටෝ
බහු අනුක්ත        න්       ු        කපුටන්
Figure 4.1: Finite State Automata for Kaputu Ganaya
These components are the building blocks of the Sinhala sentence. Some selected
context-free grammar rules for the Sinhala language are listed below. All the
implemented rules are listed in Appendix C.
SubP = Subject Phrase
VebP = Verb Phrase
Sub = Subject
Obj = Object
ObjP = Objective Phrase
AdjSub = Attributive adjunct of Subject
AdjObj = Attributive adjunct of Object
Pre = Predicate
AdjPre = Attributive adjunct of Predicate
AdjCmp = Attributive adjunct of Complement
CmpPre = Complement of predicate
CmpPreP = Complement of predicate phrase

S → SubP VebP
SubP → Sub
SubP → AdjSub Sub
VebP → ObjP PreP
VebP → PreP
ObjP → Obj
ObjP → AdjObj Obj
PreP → AdjPre CmpPreP
PreP → CmpPreP
CmpPreP → Pre
CmpPreP → Pre CmpPre
CmpPre → Cmp
CmpPre → AdjCmp Cmp
Sub → Noun
AdjSub → Noun
Obj → Noun
AdjObj → Noun
AdjPre → Adv
Cmp → Noun
AdjCmp → Noun
Pre → Verb
For example, consider the Sinhala sentence “දක්ෂ ගුරුවරයා තම ශිෂ්‍යයා ඉක්මණින් දැණුමැති
විශාරදයකු කළේය”. Figure 4.2 shows the parse tree for this sentence.
Figure 4.2: Parse tree for the sample sentence
Noun → [දක්ෂ]
Noun → [ගුරුවරයා]
Noun → [තම]
Noun → [ශිෂ්‍යයා]
Noun → [විශාරදයකු]
Noun → [දැණුමැති]
Verb → [කළේය]
Adv → [ඉක්මණින්]
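To show how such rules can be used to recognise a sentence, the following Python sketch implements a small top-down recogniser over part-of-speech tags. It is an illustration and not the thesis's Prolog parser. One assumption is made deliberately and should be noted: the rule for the complement-of-predicate phrase is written here with the complement before the predicate, following the order of the eight sentence components and the verb-final sample sentence, whereas the printed rule lists Pre first.

```python
# Grammar transcribed from the rules listed above, with one flagged assumption:
# CmpPreP is written as "CmpPre Pre" so that a verb-final sentence is accepted.
GRAMMAR = {
    "S": [["SubP", "VebP"]],
    "SubP": [["AdjSub", "Sub"], ["Sub"]],
    "VebP": [["ObjP", "PreP"], ["PreP"]],
    "ObjP": [["AdjObj", "Obj"], ["Obj"]],
    "PreP": [["AdjPre", "CmpPreP"], ["CmpPreP"]],
    "CmpPreP": [["CmpPre", "Pre"], ["Pre"]],
    "CmpPre": [["AdjCmp", "Cmp"], ["Cmp"]],
    "Sub": [["Noun"]], "AdjSub": [["Noun"]],
    "Obj": [["Noun"]], "AdjObj": [["Noun"]],
    "Cmp": [["Noun"]], "AdjCmp": [["Noun"]],
    "AdjPre": [["Adv"]],
    "Pre": [["Verb"]],
}


def expand(symbol, tags, pos):
    """Yield every position reachable after deriving `symbol` from tags[pos:]."""
    if symbol not in GRAMMAR:  # lexical category such as Noun, Verb or Adv
        if pos < len(tags) and tags[pos] == symbol:
            yield pos + 1
        return
    for production in GRAMMAR[symbol]:
        positions = [pos]
        for part in production:
            positions = [end for start in positions for end in expand(part, tags, start)]
        yield from positions


def accepts(tags):
    """True if the whole tag sequence can be derived from the start symbol S."""
    return any(end == len(tags) for end in expand("S", tags, 0))


# Tags of the sample sentence, word by word, taken from the lexical rules above.
sample = ["Noun", "Noun", "Noun", "Noun", "Adv", "Noun", "Noun", "Verb"]
print(accepts(sample))
```

The recogniser works on tags rather than words, so the lexical rules (Noun, Verb, Adv) are assumed to have been applied beforehand.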
4.4 Hypothesis
The hypothesis of this thesis is that the concept of “Varanegeema” (conjugation) in
the Sinhala language can be used to drive English to Sinhala machine translation.
4.5 Approach in a Nutshell
The proposed theoretical-based approach to machine translation has been named
BEES, an acronym for Bilingual Expert for English to Sinhala machine translation.
The features of BEES are described below, together with the input, output and
process of the translation system.
4.6 Features of BEES
BEES has the following key features as a machine translation system, with a range of
facilities to convert English text from various sources to Sinhala.
• BEES uses rule-based, context-based and human-assisted strategies for translation.
• BEES has been built on a theoretical basis.
• The lexical resources of BEES consume very limited memory space.
• BEES can be used as a standalone application as well as a web-based application.
• BEES can be used as a translation plug-in for any application.
• BEES has been implemented to run on both Windows and Linux.
• BEES is a Prolog-based system with Java support.
• BEES provides built-in tools for maintenance, evaluation and updating of the system.
4.7 Input for BEES
The input to BEES is one or more English sentences or an HTML document containing
English text. BEES can also accept selected English text from any source. BEES
assumes that the English sentences are grammatically correct.
4.8 Output of BEES
The output of the BEES is grammatically correct Sinhala sentence(s). The system
gives the translation output in the following forms.
• Normal Sinhala Wijesekara key layout [39]
• HTML format (With normal Sinhala fonts)
• Sinhala Unicode [138]
4.9 Process of BEES
To translate an English sentence into Sinhala, BEES uses 5 steps. First, the English
Morphological Analyzer reads the input English sentence word by word and provides
the morphological information for each word. Then the English Parser analyses the
input sentence using this morphological information. Next, the English to Sinhala
Base Word Translator translates the English base words into appropriate Sinhala
base words. This process is rather complex and uses two supporting dictionaries,
namely the English-Sinhala bilingual dictionary and the concept dictionary. The
Base Word Translator first consults the English-Sinhala bilingual dictionary and
reads the Sinhala base words available for the given English base word. If multiple
words are available in the bilingual dictionary, the system looks up the relevant
information in the concept dictionary, which stores concept information for each
Sinhala word, to identify the most suitable Sinhala base word. Otherwise, the Base
Word Translator gives the most commonly used Sinhala base word for the given English
base word. After successful base word translation, the Sinhala Parser (sentence
composer) generates an appropriate Sinhala sentence with the support of the Sinhala
Morphological Generator. The Sinhala Morphological Generator generates appropriate
Sinhala words from the translated Sinhala base words and the given grammatical
information. The Sinhala Parser then uses these generated words to compose a
grammatically correct Sinhala sentence.
4.10 Summary
This chapter described a novel approach with a theoretical basis for English to
Sinhala machine translation. The translation system was presented as a rule-based
system known as BEES. The chapter also discussed the theoretical basis of the
approach, the hypothesis, the input to the system, the output of the system and its
overall features. The next chapter describes the design of the software solution of
BEES.
Chapter 5
DESIGN OF BEES
5.1 Introduction
The previous chapter reported on a novel approach for English to Sinhala machine
translation. It pointed out theoretical basis, hypothesis, input, output and process of
the translation system, which is known as BEES. This chapter gives the design of
the English to Sinhala machine translation system, BEES. The system has been
designed as a rule-based machine translation system with 7 modules.
5.2 Design of BEES
The English to Sinhala machine translation system has been designed and developed
as a rule-based system. It contains seven modules, namely the English Morphological
Analyzer, English Parser, English to Sinhala Base Word Translator, Sinhala
Morphological Generator, Sinhala Parser, Transliteration module and Intermediate
Editor. In addition, the system uses four lexical dictionaries, namely the English
dictionary, Sinhala dictionary, English-Sinhala bilingual dictionary and concept
dictionary. Figure 5.1 shows the top-level design of the English to Sinhala machine
translation system. A brief description of each module in the architecture is given
below.
5.2.1 English Morphological Analyzer
The English Morphological Analyzer is a Prolog-based system that identifies the
morphology of a given word. It can analyze all English parts of speech, in both
regular and irregular word forms, through a set of grammatical rules.
English base words and their grammatical information are stored in the lexical
database. The analyzer can identify inflected and derived forms: for example, if
the English verb ‘play’ is in the lexical dictionary, the analyzer can identify its
inflected forms such as ‘play’, ‘plays’ and ‘playing’. However, irregular words
cannot be identified through these rules and need to be stored separately in the
lexical dictionary.
[Figure 5.1 (diagram): the English sentence passes through the English language
system (English Morphological Analyzer and English Parser, with the English
dictionary), the English to Sinhala translation system (English to Sinhala word
translator, with the English-Sinhala bilingual and concept dictionaries) and the
Sinhala language system (Sinhala Morphological Generator and Sinhala Parser, with
the Sinhala dictionary) to produce the Sinhala sentence.]
Figure 5.1: Design of the BEES
5.2.2 English Parser
The English Parser is one of the key modules in the English to Sinhala machine
translation system. It has been designed to identify simple and complex sentence
patterns through a set of English grammar rules. Many of these grammar rules are
available in online resources [15]. The English Parser takes the input English
sentence and the morphological information for each word. After syntax analysis,
it gives the syntax information of the input sentence.
5.2.3 English to Sinhala Base Word Translator
The English to Sinhala Base Word Translator is the key semantic handling module
in the machine translation system. It identifies the most suitable Sinhala base
word for each word in the English sentence. This module is designed with a set of
rules to handle semantics in the translation. These rules are listed below:
• Find the suitable Sinhala base word in the bilingual dictionary with full
grammatical mapping. (If two or more words are available in the bilingual
dictionary, the system uses the concept dictionary to find the suitable Sinhala
base word.)
• If the grammatical mapping is not satisfied, the system uses the Intermediate
Editor.
• If no suitable Sinhala base word is available in the bilingual dictionary, the
system uses the corresponding Sinhala transliteration.
The English to Sinhala Base Word Translator translates the English base word into
the Sinhala base word by using the concept dictionary and the English to Sinhala
bilingual dictionary.
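The three rules above form a fallback chain, which the following Python sketch makes explicit. It is an illustration under stated assumptions, not the BEES implementation: the function and parameter names are hypothetical, the concept dictionary is reduced to a scoring function, and “grammatical mapping is not satisfied” is interpreted here as the word being found only under a different part of speech.

```python
def translate_base_word(word, pos, bilingual, concept_score, ask_editor, transliterate):
    """Choose a Sinhala base word by following the three rules in order.

    bilingual:     dict mapping (english_word, pos) -> list of Sinhala candidates
    concept_score: callable(candidate) -> number, standing in for the concept dictionary
    ask_editor:    callable(word, candidates) -> str, standing in for the Intermediate Editor
    transliterate: callable(word) -> str, standing in for the transliteration module
    """
    candidates = bilingual.get((word, pos))
    if candidates:
        if len(candidates) == 1:
            return candidates[0]
        # Rule 1 with several candidates: disambiguate using concept information.
        return max(candidates, key=concept_score)
    # Rule 2 (assumed reading): the word exists, but not with this part of speech.
    other = [c for (w, _), cs in bilingual.items() if w == word for c in cs]
    if other:
        return ask_editor(word, other)
    # Rule 3: the word is out of vocabulary, so fall back to transliteration.
    return transliterate(word)
```

Passing the dictionaries and the editor in as arguments keeps the sketch self-contained; in BEES these would be the Prolog databases and modules described in this chapter.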
5.2.4 Sinhala Morphological Generator
The Sinhala Morphological Generator is the key module of the English to Sinhala
machine translation system. This generator works fundamentally through the concept
of Varanegeema in the Sinhala language, which is the theoretical basis of the
English to Sinhala machine translation [58]. The Sinhala Morphological Generator
accesses the Sinhala dictionary and generates appropriate Sinhala word forms. It has
been designed around the Sinhala grammar rules for generating Sinhala words [58].
To generate a Sinhala noun, it requires the appropriate Sinhala base word, number,
case and the form of the noun (direct or indirect). A Sinhala verb requires the
appropriate Sinhala base word, tense, person and number. A Sinhala adjective
requires the type and the base word; the adjective type may be positive, comparative
or superlative. Other types of words do not participate in conjugation. These words
are therefore stored in the Sinhala dictionary, and the Sinhala Morphological
Generator reads them directly from the dictionary.
5.2.5 Sinhala Parser
The Sinhala Parser works as a Sinhala sentence composer. It receives Sinhala words
from the Sinhala Morphological Generator and composes a grammatically correct
Sinhala sentence.
In general, a Sinhala sentence contains 8 components [88] [89], including Uktha
Vishashana, Uktha (subject), Karma Vishashana, Karma (object) and Akkyathya (verb).
These 8 components of a Sinhala sentence are the building blocks for the design and
implementation of the Sinhala Parser. The Sinhala Parser is specific to the Sinhala
language, and existing parsers for other languages cannot be used directly for
Sinhala. This parser handles only simple sentences, and all its grammar rules are
designed as a context-free grammar.
5.2.6 Transliteration module
The English to Sinhala machine translation system needs to solve out-of-vocabulary
problems and handle technical terms. Machine transliteration can be used as a
reasonable solution for this. Two transliteration models have been developed in
this research [68]. One model transliterates original English text into Sinhala
(model 1), and the other transliterates Sinhala words that are written in English
letters into Sinhala (model 2). Finite State Transducers (FST) are used to develop
these two models [87]. With these Prolog-based transliteration modules, if the
English to Sinhala Base Word Translator cannot translate a given English word, the
transliteration module transliterates it. Figure 5.2 shows the Finite State
Transducer for Sinhala vowels (model 1). The complete FSTs are given in Appendix D.
[State diagram: states A, B, V1, V2, V3 and V4, with transitions labelled by the
input letters a, e, i, o, u, r, w and y]
Figure 5.2: FST for Vowels in model 1 transliteration
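The mechanism of a finite state transducer can be illustrated with a toy example. The Python sketch below is not the transducer of Figure 5.2 and does not reproduce either transliteration model; its transition table is hypothetical and covers only three short and three long vowels (a/aa, i/ii, u/uu), with a doubled letter standing for the long vowel.

```python
# Hypothetical transition table: (state, input letter) -> (next state, output).
# This is NOT the transducer of Figure 5.2; it only demonstrates the mechanism.
TRANSITIONS = {
    ("start", "a"): ("seen_a", ""),
    ("seen_a", "a"): ("start", "ආ"),
    ("start", "i"): ("seen_i", ""),
    ("seen_i", "i"): ("start", "ඊ"),
    ("start", "u"): ("seen_u", ""),
    ("seen_u", "u"): ("start", "ඌ"),
}
# Output emitted when a state is left without a matching transition.
FLUSH = {"start": "", "seen_a": "අ", "seen_i": "ඉ", "seen_u": "උ"}


def transduce(text):
    """Run the toy transducer over `text`, one input letter at a time."""
    state, out = "start", []
    for ch in text:
        if (state, ch) not in TRANSITIONS:
            out.append(FLUSH[state])  # emit the pending short vowel, if any
            state = "start"
        if (state, ch) in TRANSITIONS:
            state, emitted = TRANSITIONS[(state, ch)]
            out.append(emitted)
        else:
            out.append(ch)  # letters outside the table pass through unchanged
    out.append(FLUSH[state])
    return "".join(out)


print(transduce("aa"), transduce("ai"))
```

Delaying the output until the next letter is seen is what lets a single left-to-right pass distinguish a short vowel from the first half of a long one.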
5.2.7 Intermediate Editor
The English to Sinhala machine translation system uses the Intermediate Editor to
handle ambiguities in semantics, pragmatics and multi-word expressions before
proceeding to the Sinhala linguistic modules. The intermediate editing facility is
provided as a human interface to the machine translation system [69]. This editor
provides facilities such as showing synonyms, antonyms, related words, etc.
The Intermediate Editor is linked to both the English and Sinhala dictionaries in
the machine translation system. Intermediate editing, carried out before a Sinhala
sentence is composed, drastically reduces the computational cost of running the
Sinhala morphological generator and the parser. In addition, the need for
post-editing can be reduced by intermediate editing. Intermediate editing can also
be used as a means of continuously capturing human expertise for machine
translation, and this knowledge can be reused for subsequent translations. As such,
the concept of intermediate editing can be introduced as an approach to automatic
knowledge management in the machine translation system. It should be noted that the
knowledge used for pre-editing and post-editing cannot be readily captured by the
machine translation system, as these processes can be carried out even outside the
system. In contrast, intermediate editing is an integral part of the machine
translation system, in which the human interacts directly with the system. If the
English to Sinhala Base Word Translator cannot identify the most suitable Sinhala
word (that is, the grammatical mapping is not satisfied), the Intermediate Editor
allows the user to select a suitable Sinhala word.
5.2.8 Lexical Resources
The translation system uses four dictionaries, namely the English word dictionary,
the English-Sinhala bilingual dictionary, the Sinhala dictionary and the concept
dictionary [67]. The English dictionary stores the base forms of English regular
and irregular words. To develop this English dictionary, the author considered the
structure of the standard WordNet [14] [164] lexical database, the EDR word
dictionary [114], the Cambridge Advanced Learner’s Dictionary [28] and the Oxford
English Dictionary [142]. The English dictionary is designed as a Prolog database.
The English to Sinhala bilingual dictionary is used to identify the appropriate
Sinhala base word for a given English word. This dictionary shows the relations
between English and Sinhala words. Several bilingual dictionaries are available for
English-Sinhala, including the Madhura online dictionary [94], the Malalasekara
English to Sinhala bilingual dictionary [104], Carter’s Sinhalese-English
dictionary [27] and the Godage English to Sinhala dictionary [155]. Most of these
dictionaries provide the related Sinhala terms for a given English word. However,
they do not provide information about word conjugation or other lexical resources.
Therefore, the author has designed a new structure for the English to Sinhala
bilingual dictionary [67]. The English-Sinhala bilingual dictionary is also
designed as a Prolog database.
The Sinhala dictionary stores Sinhala regular words, irregular words, lexical
information and the sets of rules required to generate Sinhala words [54][58].
All the rules are based on the fundamentals of the Sinhala language.
The concept dictionary [67] contains the context information for Sinhala words.
This dictionary is used to identify the semantics of words. Together, these four
dictionaries work as the ontology of the machine translation system.
5.3 Supporting modules
Three supporting modules have been developed to update the lexical resources
automatically, namely the dictionary updater, the Sinhala word generator and the
online search module. These are only supporting modules and do not participate
directly in the translation process. Therefore, they are not visible in the main
design. Figure 5.3 shows the design of the three supporting modules.
5.3.1 Dictionary Updater
The dictionary updater has been developed to update the lexical dictionaries by
using online resources [110][94]. This module updates the English dictionary,
Sinhala dictionary, concept dictionary and English-Sinhala bilingual dictionary as
needed. The dictionary updater first searches the English online resources and
finds the word type (noun, verb, adjective, etc.) for a given English word. It then
identifies the suitable Sinhala resources and updates the Sinhala dictionary. After
that, the dictionary updater uses the Sinhala word generator to validate the
availability of the word. Further, the dictionary updater can update the concept
dictionary with the support of the online search module. These supporting modules
run at the user's request.
[Figure 5.3 (diagram): the Online Searching module (connected to the Internet), the
Sinhala Word Generator (with the Sinhala corpus and the Sinhala Morphological
Generator) and the Dictionary Updater, together with the concept dictionary, the
English-Sinhala bilingual dictionary, the Sinhala dictionary and the English
dictionary.]
Figure 5.3: Design of the three supporting modules
5.3.2 Sinhala Word Generator
The Sinhala word generator is used to identify suitable word forms for Sinhala
word conjugation. This module can generate all the word forms for a given Sinhala
base word. The system then validates each word against the Sinhala corpus and
several online Sinhala resources with the help of the online search module. After
validating the availability of the words, this module updates the Sinhala resources.
Because of the complexity of the Sinhala language, the results of these updates need
to be rechecked by a human expert.
5.3.3 Online Search module
The online search module can access online (Internet) resources and the Sinhala
corpus [93] to check the availability of a given Sinhala word. This module also
identifies the usage and availability of a given set of words. Some Sinhala
adjectives have a restricted meaning, and some words have special usage. For
example, the English term “dangerous” has several Sinhala equivalents, including
“භයානක”, “විසඝෝර” and “නපුරු”. However, not every one of these terms suits every
noun: “භයානක කොටියා” and “විසඝෝර සර්පයා” are valid Sinhala phrases, whereas
“විසඝෝර කොටියා” has no meaning. The online search module has been designed to
identify this type of word usage through online Sinhala resources, and this
information is stored in the concept dictionary.
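The “dangerous” example suggests how stored usage information can guide the choice of adjective. The Python sketch below is only an illustration: the structure of the toy concept dictionary, the function name choose_adjective and the fallback to the first candidate are assumptions made here, and the two adjective and noun pairs are the ones given in the example above.

```python
# Toy concept dictionary: adjective -> nouns it is attested with.
# Only the two valid phrases from the example above are recorded.
CONCEPTS = {
    "භයානක": {"කොටියා"},
    "විසඝෝර": {"සර්පයා"},
}


def choose_adjective(candidates, noun, concepts=CONCEPTS):
    """Return the first candidate attested with `noun`, else the first candidate."""
    for adjective in candidates:
        if noun in concepts.get(adjective, set()):
            return adjective
    return candidates[0]  # assumed fallback when no usage information is stored


candidates = ["භයානක", "විසඝෝර", "නපුරු"]
print(choose_adjective(candidates, "සර්පයා"))
```

In a fuller system the attested pairs would be collected by the online search module rather than written by hand.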
5.4 Summary
This chapter described the design of the English to Sinhala machine translation
system, which contains 7 modules, three supporting modules and four dictionaries.
The process of each module was discussed. The next chapter describes the
implementation of the software solution of BEES.
Chapter 6
IMPLEMENTATION
6.1 Introduction
The previous chapter described the design of the English to Sinhala machine
translation system. This chapter gives implementation details for all the modules
identified in the design.
6.2 Development Stages
Implementing the English to Sinhala machine translation system is a complex task
that requires considerable time and knowledge of technical matters and of the source
and target language structures. At the beginning of the research, only a limited
number of Sinhala resources were available, such as several types of Sinhala fonts
and several bilingual dictionaries, including the Madhura electronic dictionary
[94], the Malalasekera English Sinhala dictionary [104] and the Godage English to
Sinhala dictionary [155]. Therefore, the project started at a basic level. After
developing several versions, the author implemented the final system. This section
gives a brief description of each stage of development and of the final system.
At present, BEES has gone through four stages of development. These stages are
listed below:
1. The English to Sinhala machine translation system (BEES) was first
implemented using SWI-Prolog [143] and Java. This first version of BEES
translated only simple present tense sentences. It handled only simple
subject and object forms with adjectives, adverbs and articles. Further, to
handle out-of-vocabulary issues, BEES transliterated English terms into
Sinhala. However, this version did not handle semantic issues.
2. Including an intermediate-editor [69], the human-assisted machine translation
system has been developed to solve Out-of-vocabulary and semantic issues in
3. The context-based machine translation system was developed by improving the
Intermediate Editor to capture human knowledge. This system uses the concept
dictionary to store this human knowledge [62]. A fully automated English to
Sinhala machine translation system was then developed by including the concept
dictionary. An online version of BEES has also been implemented using PSP
(Prolog Server Pages) [23] technology. This extension enables translation of
selected text while reading on the World Wide Web [65][66].
4. The final English to Sinhala machine translation system was implemented using
three approaches, namely rule-based, human-assisted and context-based
[63][60].
6.3 Implementation of the BEES
The final version of BEES has been implemented as a rule-based system comprising
7 modules, namely the English Morphological Analyzer, English Parser, English to
Sinhala Base Word Translator, Sinhala Morphological Generator, Sinhala Sentence
Composer, Transliteration module and Intermediate Editor. A brief description of
each module is given below.
6.3.1 English Morphological Analyzer
The English Morphological Analyzer (EMA) is a Prolog-based module that analyzes
given English words. EMA uses the English dictionary to identify each English word.
The following code is used to consult the English dictionaries; it shows how EMA
consults the Prolog file eng_reg_nouns.pl.
consult('eng_reg_nouns.pl'),
EMA uses the Prolog predicate loadEMA/2 to start the morphological analysis. This
predicate gives finish, unknown or error as the result of the analysis.
For example:
?- loadEMA('boy eats rice', X).
X = finish.
The result ‘finish’ means that the English morphological analysis completed
successfully, and ‘unknown’ means that the analysis completed but encountered
unknown words.
To analyze the given text, the English Morphological Analyzer uses the following
procedure:
1. Create a list of words for the given text
2. Initialize the variables and clear the output
3. Analyze the text word by word until the end of the list
The following code shows how Prolog creates the list:
createList(X, O) :-
    downcase_atom(X, LX),        % convert the text to lower case
    removedot(LX, LXO),          % helper defined elsewhere; by its name, removes the full stop
    concat_atom(O1, ' ', LXO),   % split the atom on spaces into a list of words
    delete(O1, '', O).           % drop empty atoms left by repeated spaces
Example: createList('A good boy eats red rice', C) gives the result
C = ['a', 'good', 'boy', 'eats', 'red', 'rice']
The English Morphological Analyzer writes all its output data to a file named
‘ema_out.pl’. Before analyzing a new data set, EMA clears the existing data so that
the file is ready for the new data set.
EMA analyzes the text word by word and gives all the grammatical information for
each word.
English morphological analysis can be divided into two categories, namely regular
word analysis and irregular word analysis. The irregular words are stored in the
dictionary: irregular nouns, irregular verbs, irregular adjectives, adverbs,
prepositions, conjunctions and determiners are all stored as irregular forms. The
following code shows how EMA analyzes an English adverb.
search_irr_word(EngWord) :-
    eiw(ID, av, Type, EngWord),
    write_output_advb(ID, Type, EngWord).
The predicate write_output_advb/3 is used to write the result to the output file.
The Prolog predicate eng_advb/3 is used to represent the irregular adverb. The
following sample shows the English adverb ‘slowly’ in the ema_out.pl file:
eng_advb([3000015], p, 'slowly').
In addition, the Prolog predicates eng_verb/3, eng_detm/3, eng_prep/3 and
eng_conj/3 are used to identify verbs, determiners, prepositions and conjunctions.
To analyze English regular nouns, a number of rules are available in the system.
The Prolog predicate analyze_english_noun/1 is used to analyze English regular
nouns:
analyze_english_noun(EngWord) :-
    get_eng_noun_info(EngWord, RootID, SP, Sex, Type),
    write_output_noun(RootID, td, SP, Sex, Type, EngWord).
The predicate get_eng_noun_info/5 is used to get the English regular noun
information, and write_output_noun/6 is used to store the result in the output
file.
The following code shows how the EMA analyzes the rule Noun + s = plural.
get_eng_noun(EWL, RootID, Sp, Sex, Type) :-
    append(Rest, [s], EWL), concat_atom(Rest, GRoot),
    erw(RootID, na, Sex, GRoot), Type = sb, Sp = pr.
The following code shows how the EMA analyzes the rule
Singular (base noun) - y + ies = plural noun.
get_eng_noun(EWL, RootID, Sp, Sex, Type) :-
    append(Rest1, [i,e,s], EWL),
    append(Rest1, [y], Rest),
    concat_atom(Rest, GRoot), erw(RootID, na, Sex, GRoot),
    Type = sb, Sp = pr.
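The two plural rules above can be tried out with a small self-contained sketch.
It works on atoms with atom_concat/3 instead of character lists, which is a
simplification and not the thesis implementation. The predicate name
plural_noun/3 and the dictionary entries for 'book' and 'city' are assumptions
made for illustration; only the 'boy' entry is taken from the dictionary samples
in this chapter.

```prolog
% Sample dictionary entries in the erw/4 format (ID, type, sex, word).
erw(1000001, na, ma, boy).
erw(1000004, na, no, book).     % assumed entry
erw(1000020, na, no, city).     % hypothetical entry, for illustration only

% plural_noun(+Word, -RootID, -Sex): Word is a regular plural whose
% singular root is found in the dictionary.
plural_noun(Word, RootID, Sex) :-          % rule: root - y + ies
    atom_concat(Stem, ies, Word),
    atom_concat(Stem, y, Root),
    erw(RootID, na, Sex, Root).
plural_noun(Word, RootID, Sex) :-          % rule: root + s
    atom_concat(Root, s, Word),
    erw(RootID, na, Sex, Root).
```

With these facts, plural_noun(boys, ID, Sex) gives ID = 1000001 and Sex = ma, and
plural_noun(cities, ID, Sex) finds the root 'city' through the first rule. A word
whose stripped root is not in the dictionary simply fails.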
To analyze English verbs, the EMA uses the same method. The following code
shows how the EMA analyzes an English verb in the simple present tense.
get_eng_verb(EWL, RootID, Tens) :-
    append(Rest, [s], EWL), concat_atom(Rest, GRoot),
    erw(RootID, vb, GRoot), Tens = sp.
The English morphological analyzer has been implemented with 14 rules for
regular nouns, 14 rules for adjectives, 11 rules for regular verbs and 7 rules for
irregular verbs.
The following output shows the result of the morphological analysis of the
English sentence “A good boy and his friend read the books everyday”.
eng_input_sen_list(['a', 'good', 'boy', 'and', 'his', 'friend',
'read', 'the', 'books', 'quickly', []]).
eng_detm([3000001], id, 'a').
eng_adjv([3000004], p, 'good').
eng_noun([1000001], td, sg, ma, sb, 'boy').
eng_noun([1000001], td, sg, ma, ob, 'boy').
eng_conj([3000027], 0, 'and').
eng_noun([4000004], td, sg, ma, po, 'his').
eng_noun([4000004], td, sg, ma, po, 'his').
eng_noun([1000011], td, sg, ma, sb, 'friend').
eng_noun([1000011], td, sg, ma, ob, 'friend').
eng_verb([5000008], if, 'read').
eng_verb([5000008], pt, 'read').
eng_verb([5000008], pp, 'read').
eng_detm([3000003], dr, 'the').
eng_noun([1000004], td, pr, no, sb, 'books').
eng_noun([1000004], td, pr, no, ob, 'books').
eng_advb([3000016], p, 'quickly').
The above example shows how the EMA analyzes the given words.
6.3.2 English Parser
The English parser has been implemented to analyze a given English sentence.
This parser has been implemented in SWI-Prolog. To analyze the sentence, it uses
the original English sentence and the result of the English morphological
analysis. All the results of the analysis are stored in a Prolog file named
epa_out.pl.
The Prolog predicate eng_sen_syntax_analysis/1 is used to analyze the input
sentence. It analyzes the input sentence, which is available in 'ema_out.pl' (the
file previously written by the English morphological analyzer), and returns the
result of the analysis. The result may be 'finish' or 'error'. The following code
shows the rule for eng_sen_syntax_analysis/1.
eng_sen_syntax_analysis(Result) :-
    (   eng_sentence_syntax_analysis(_)
    ->  Result = 'finish',
        add_eng_sen_results('sucess')
    ;   Result = 'error',
        add_eng_sen_results('error')
    ).
The Prolog predicate eng_sentence_syntax_analysis/1 is used to analyze the
sentence. Before the analysis, the English parser (EPA) consults the ema_out.pl
file by using the following code.
:- consult('c:/bees7/ema_out.pl').
Then the EPA clears all the variables and the previous data in the epa_out.pl file.
The following rules are used to analyze simple and compound sentences.
english_sentence(Out, NL, []) :- simple_sentence(Out, NL, []).
english_sentence(Out, NL, []) :- compound_sentence(Out, NL, []).
A compound sentence may be two simple sentences joined by a conjunction.
compound_sentence(Out, Sen, End) :-
    simple_sentence(O1, Sen, S1),
    conjunction_simple_sentence(O2, S1, End),
    append(O1, O2, Out),
    add_eng_sentnce_info('compscs', Out).
A simple sentence may be one of four types: declarative, interrogative,
imperative or conditional. The following code shows the implementation.
simple_sentence(Out, NL, End) :- declarative_sentence(Out, NL, End).
simple_sentence(Out, NL, End) :- interrogative_sentence(Out, NL, End).
simple_sentence(Out, NL, End) :- imperative_sentence(Out, NL, End).
simple_sentence(Out, NL, End) :- conditional_sentence(Out, NL, End).
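The rules above thread the word list through a pair of arguments (the input list
and the remaining list), which is the same pattern that Prolog's definite clause
grammar (DCG) notation generates automatically. The following toy grammar is a
sketch of that idea only: the lexicon, the phrase rules and the tree terms
(s/2, np/1, np/2, vp/1, vp/2, cs/3) are all hypothetical and much simpler than
the grammar of the English parser.

```prolog
% Toy lexicon (hypothetical): lex(Category, Word).
lex(dt, a).      lex(dt, the).
lex(na, boy).    lex(na, girl).   lex(na, books).
lex(vb, reads).  lex(vb, sings).
lex(cn, and).

noun_phrase(np(D, N)) --> [D], { lex(dt, D) }, [N], { lex(na, N) }.
noun_phrase(np(N))    --> [N], { lex(na, N) }.

verb_phrase(vp(V, NP)) --> [V], { lex(vb, V) }, noun_phrase(NP).
verb_phrase(vp(V))     --> [V], { lex(vb, V) }.

simple_sentence(s(NP, VP)) --> noun_phrase(NP), verb_phrase(VP).

% Two simple sentences joined by a conjunction.
compound_sentence(cs(S1, C, S2)) -->
    simple_sentence(S1), [C], { lex(cn, C) }, simple_sentence(S2).

english_sentence(T) --> simple_sentence(T).
english_sentence(T) --> compound_sentence(T).
```

The query phrase(english_sentence(T), [a, boy, reads, books]) succeeds through the
simple-sentence rule, while a word list such as
[a, boy, reads, books, and, the, girl, sings] is only accepted by the
compound-sentence rule.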
The English parser analyzes the English sentence and produces the following
information:
1. Type of the sentence
2. Tense of the sentence
3. Subject, complement, verb and predicate
The following results are given for the English sentence “A good boy and his
friend read the books everyday”.
eng_sen_verb([5000008]).
eng_sen_complement([3000003, 1000004, 3000016]).
eng_sen_subject([3000001, 3000004, 1000001, 3000027, 4000004, 1000011]).
eng_sen_predicate([5000008, 3000003, 1000004, 3000016]).
eng_sen_type(declarative).
eng_sen_ekeys([3000001, 3000004, 1000001, 3000027, 4000004,
    1000011, 5000008, 3000003, 1000004, 3000016]).
eng_sen_tence(simplepresent).
eng_sen_result(sucess).
The Prolog predicate eng_sen_verb/1 gives the verb of the sentence. This verb ID
is the same as the ID of the verb in the morphological analysis.
eng_verb([5000008], if, 'read').
The Prolog predicates eng_sen_complement/1, eng_sen_subject/1 and
eng_sen_predicate/1 give information about the complement, subject and
predicate of the input sentence.
eng_sen_type(declarative).
eng_sen_tence(simplepresent).
eng_sen_result(sucess).
The above three facts give the type, tense and result of the analysis. This
information is used to generate the corresponding Sinhala sentence.
6.3.3 English to Sinhala Bilingual Translator
The English to Sinhala bilingual translator is a Prolog-based module which is
used to get a suitable Sinhala base word for a given English base word. To find
the appropriate Sinhala base word, the bilingual translator (BA) uses the output
of the English morphological analysis, the output of the English syntax analysis,
the English-Sinhala bilingual dictionary, the context dictionary and the
transliteration module. The following code shows how the bilingual translator
consults these sources.
:- consult('c:/bees7/ema_out.pl').
:- consult('c:/bees7/epa_out.pl').
:- consult('c:/bees7/dic/eng_sin_word_dic.pl').
:- consult('c:/bees7/plsource/dtrSource.pl').
:- consult('c:/bees7/dic/eng_sin_cons_dic.pl').
:- consult('c:/bees7/dic/eng_sin_usage_dic.pl').
The bilingual translator stores all the results of the base-word translation in
a file named 'est_out.pl'.
To identify the corresponding Sinhala base word, the bilingual translator uses
the following three rules.
eng_to_sin_word_all(H, S, Type, EW, SW) :-
    eng_cons_word(H, S, SubID),
    subject_form_avlable(SubID),
    esw(_, H, S, Type, EW, SW).
eng_to_sin_word_all(H, S, Type, EW, SW) :-
    esu(EW, H, S, Type, SW, _).
eng_to_sin_word_all(H, S, Type, EW, SW) :-
    esw(_, H, S, Type, EW, SW).
The Prolog predicate eng_to_sin_word_all/5 is used to find the appropriate
Sinhala base word by searching the three dictionaries. The following result
shows the bilingual translator output for the English sentence “A good boy and
his friend read the books everyday”.
estrwords(1001, 3000001, 3000000, dt).
estrwords(1002, 3000004, 3000004, aj).
estrwords(1003, 1000001, 1000001, na).
estrwords(1004, 3000027, 3000027, cn).
estrwords(1005, 4000004, 4000004, na).
estrwords(1006, 1000011, 4000011, na).
estrwords(1007, 5000008, 5000008, vb).
estrwords(1008, 3000003, 3000000, dt).
estrwords(1009, 1000004, 1000004, na).
estrwords(1010, 3000016, 3000016, av).
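The three lookup rules above are tried in clause order: the concept dictionary
first, then the usage dictionary, then the plain bilingual dictionary. The sketch
below shows that priority order on word IDs only. The predicates
concept_entry/2, usage_entry/2, dictionary_entry/2 and sinhala_id/2, as well as
the sample facts, are hypothetical simplifications and not the dictionary
formats of the system. Unlike the original clauses, which leave the later
sources available on backtracking, this sketch commits to the first source that
has an entry.

```prolog
% Hypothetical sample data: English word ID -> Sinhala word ID.
concept_entry(1000011, 4000011).      % context-specific choice
usage_entry(1000004, 1000004).        % frequently used form
dictionary_entry(1000011, 1000011).   % default bilingual entries
dictionary_entry(1000004, 1000004).
dictionary_entry(1000001, 1000001).

% sinhala_id(+EngID, -SinID): try the three sources in priority order
% and commit to the first one that has an entry for EngID.
sinhala_id(EngID, SinID) :-
    (   concept_entry(EngID, S) -> SinID = S
    ;   usage_entry(EngID, S)   -> SinID = S
    ;   dictionary_entry(EngID, SinID)
    ).
```

Here sinhala_id(1000011, S) returns the concept-dictionary value 4000011 even
though a default entry also exists, and sinhala_id(1000001, S) falls through to
the bilingual dictionary.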
6.3.4 Sinhala Morphological Generator
The Sinhala morphological generator (SMG) is the key module of the system and it
is implemented in SWI-Prolog. The Sinhala morphological generator uses the
Sinhala dictionary and the result of the bilingual translator. The following code
shows how the Sinhala morphological generator consults the Sinhala dictionary.
:- consult('convertor.pl').
The Sinhala morphological generator generates the appropriate Sinhala words for
the given grammatical information. It generates Sinhala nouns, verbs, adjectives,
adverbs and prepositions.
To generate Sinhala nouns, the SMG uses the Prolog predicate get_sin_noun/8. This
predicate uses the Sinhala base word ID, person, number, sex, live code, DI-code
and case to generate a suitable Sinhala noun.
snoun([1000001], td, sg, ma, li, dr, v1,'පිරිමි ළමයා').
To generate a Sinhala noun, the generator uses Sinhala rules. The following code
shows how the Sinhala morphological generator generates a Sinhala noun.
Case 1: The Sinhala noun can be taken directly from the Sinhala dictionary (no
generation is needed).
get_sin_noun(WID, P, N, S, L, DI, VB, NW) :-
    sn([WID], P, N, S, L, DI, VB, NW).
Case 2: The Sinhala noun is generated through the noun generator.
To generate a noun, the generator uses word information from the Sinhala
dictionary, the word generation rules from the rule dictionary and the case rules
from the rule dictionary. The following code shows how it generates a noun.
get_sin_noun(WID, PE, sg, S, L, dr, v1, OUT) :-
    sn([WID], PE, S, L, NID, C1, _, _, WD),
    get_sin_noun_baseform(WD, L, NID, BASE),
    validate_sds(BASE, L, NID, NW),
    ensure_loaded('c:\\bees7\\dic\\sin_rules_dic.pl'),
    noun_vib_postfix(C1, v1, AV, RV),
    atom_chars(NW, WDL), atom_chars(AV, ABL),
    atom_chars(RV, RBL), append(WDL1, RBL, WDL),
    append(WDL1, ABL, NWDL), concat_atom(NWDL, OUT).
The noun generation is done in five steps:
1. Get the noun information from the Sinhala dictionary
sn([WID], PE, S, L, NID, C1, _, _, WD)
2. Generate the base form of the noun
get_sin_noun_baseform(WD, L, NID, BASE)
3. Generate the required word form
validate_sds(BASE, L, NID, NW)
4. Get the suitable case form for the generated noun
noun_vib_postfix(C1, v1, AV, RV)
5. Generate the Sinhala noun
All the rules are implemented using the basic language rules for noun generation
(Sinhala noun gana).
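The steps above all reduce to the same operation: remove one suffix and add
another. The following sketch chains the three suffix replacements (base form,
word form, case form) using atom_concat/3. The rule facts are a subset of the
sample rules listed in Section 6.3.8.2; the predicates replace_suffix/4 and
generate_noun/7 are illustrative names, and passing the case rule ID as an
explicit argument is an assumption of the sketch. The source file is assumed to
be read as UTF-8.

```prolog
% Word-form rules: noun_posfix(RuleID, LiveCode, Form, Add, Remove).
noun_posfix(s935001, li, bas, 'ටු', 'ටා').
noun_posfix(s935001, li, sds, 'ටා', 'ටු').
noun_posfix(s935001, li, pdo, 'ටන්', 'ටු').

% Case rules: noun_vib_postfix(RuleID, Case, Add, Remove).
noun_vib_postfix(s910001, v1, '', '').
noun_vib_postfix(s910001, v5, 'ට', '').

% replace_suffix(+Word, +Remove, +Add, -New)
replace_suffix(Word, Remove, Add, New) :-
    atom_concat(Stem, Remove, Word),
    atom_concat(Stem, Add, New).

% generate_noun(+Word, +RuleID, +Live, +Form, +CaseRule, +Case, -Out)
generate_noun(Word, RuleID, Live, Form, CaseRule, Case, Out) :-
    noun_posfix(RuleID, Live, bas, AddB, RemB),
    replace_suffix(Word, RemB, AddB, Base),        % dictionary word -> base form
    noun_posfix(RuleID, Live, Form, AddF, RemF),
    replace_suffix(Base, RemF, AddF, FormWord),    % base form -> word form
    noun_vib_postfix(CaseRule, Case, AddC, RemC),
    replace_suffix(FormWord, RemC, AddC, Out).     % word form -> case form
```

For example, generate_noun('කපුටා', s935001, li, pdo, s910001, v5, W) first derives
the base form 'කපුටු', then the plural form 'කපුටන්', and finally appends the case
suffix to give 'කපුටන්ට'.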
In addition to the noun generator, the Sinhala verb generator is used to generate
Sinhala verbs. Sinhala verbs may be irregular or regular. The irregular verbs are
taken directly from the Sinhala dictionary. The following code shows how Prolog
retrieves these words from the Sinhala dictionary.
get_sin_final_verb(Skey, Type, P, N, Tence, SW) :-
    sfv([Skey], P, N, Type, Tence, SW), write(SW).
As with Sinhala nouns, Sinhala regular verbs are generated through a set of
Sinhala rules. The following code shows a rule for generating Sinhala regular
verbs.
get_sin_final_verb(Skey, ps, P, N, fu, SW) :-
    sfv([Skey], _, _, _, _, _, APR, _, WD),
    verb_posfix(APR, ps, P, N, fu, ADD, REM),
    atom_chars(WD, WDL), atom_chars(ADD, ABL),
    atom_chars(REM, RBL), append(WDL1, RBL, WDL),
    append(WDL1, ABL, NWDL),
    concat_atom(NWDL, SW), write(SW).
The above code shows how Prolog generates the Sinhala verb. As a first step, the
Sinhala verb and its conjugation form are identified through the Sinhala
dictionary. After that, the conjugation rules are identified from the Sinhala
rule dictionary. Finally, using all this information, the final Sinhala word is
generated.
6.3.5 Sinhala Sentence Composer
The Sinhala sentence composer is used to generate a grammatically correct Sinhala
sentence. To generate a suitable sentence, the composer and the Sinhala word
generator work together. The Sinhala composer uses all the previous information
for sentence generation, including the Sinhala morphological generation, English
sentence analysis, English morphological analysis and the English to Sinhala
translation.
The Sinhala sentence composer works in the following stages:
1. Generate the subject and object separately
2. Use the structure of the original sentence to re-generate the corresponding
Sinhala sentence
To generate the subject, object and verb, the generator uses different
mechanisms. The Prolog predicates 'translateSubject',
'translate_simple_sentence_verb' and 'translateComplement' are used to translate
the English subject, verb and object into Sinhala.
The following sample code shows how the composer translates the English subject
into Sinhala.
translateSubject :-
    load_english_sentence_subject(Sub),
    clearsinsubject, clear_sinsubject_word,
    create_sample_code_out(F),
    close(F), set_di_code_default,
    set_sin_sub_pncode_default,
    set_case_code_default,
    set_previous_word_default,
    set_sin_complex_sub_pncode_default,
    loadSubwordByword(Sub),
    appendSinhalaSubject,
    appendSinhalaSubjectWD.
6.3.6 Transliteration Module
The transliteration module has been implemented in SWI-Prolog as a finite state
transducer (FST). The Prolog file dtrsource.pl is the source file of the
transliteration module. Using FSTs, the author has developed two modules for
Sinhala transliteration. The Prolog predicate eng_to_sin_dtr/2 is used for
transliteration. The following code shows the rule for the eng_to_sin_dtr/2
predicate.
eng_to_sin_dtr(In, Out) :-
    convert_word_list(In, ANL),
    printList(ANL, Out).
As the first step, the transliteration module converts the given set of words
into a list. After that, it transliterates the text word by word using the FST.
In addition, the module uses a character encoding system for the FST.
The following sample code shows some of the FST rules that represent the Sinhala
vowel letters.
initial(1).
final(99).
% ***************************************************************
% Finite State Automata for Sinhala Vowels
% ***************************************************************
arc(50, 62, a, []).
arc(62, 70, e, []).
arc(70, 99, e, [e]).
arc(62, 99, a, [c]).
arc(62, 99, e, [d]).
arc(62, 99, i, [p]).
arc(62, 99, u, [s]).
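To illustrate how such arcs can drive a transliteration, the sketch below walks
a small transducer from the initial state to the final state, consuming one
input character per arc and collecting the emitted symbols. It reads the facts
as arc(From, To, Input, Output), which is an interpretation of the sample above.
The states, arcs and output symbols in the sketch are hypothetical placeholders,
and the predicates transduce/2 and walk/3 are illustrative names.

```prolog
% Toy transducer in the arc(From, To, Input, Output) layout.
initial(1).
final(99).
arc(1, 2, a, []).          % first 'a': emit nothing, wait for the next character
arc(2, 99, a, [long_a]).   % "aa" -> one long-vowel symbol
arc(2, 99, i, [ai]).       % "ai" -> one diphthong symbol
arc(1, 99, u, [short_u]).  % "u"  -> one short-vowel symbol

% transduce(+Word, -Symbols): Word is accepted and maps to Symbols.
transduce(Word, Symbols) :-
    atom_chars(Word, Chars),
    initial(State),
    walk(State, Chars, Symbols).

% walk(+State, +Chars, -Symbols): follow arcs until the input is used up.
walk(State, [], []) :-
    final(State).
walk(State, [C|Cs], Symbols) :-
    arc(State, Next, C, Emitted),
    walk(Next, Cs, Rest),
    append(Emitted, Rest, Symbols).
```

The query transduce(aa, S) gives S = [long_a], while transduce(a, S) fails because
the input ends in a non-final state.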
6.3.7 Intermediate Editor
The intermediate editor has been implemented in Java. The intermediate editor is
used to improve the translation through human support.
The following class header is used to implement the intermediate editor.
public class IEETool extends JFrame implements Runnable {
In addition, the intermediate editor uses two XML files, namely “reldata.xml” and
“trasdata.xml”, to store relations and the translated data. Figure 6.1 shows the
user interface of the intermediate editor, including sample data.
Figure 6.1: The Intermediate Editor
6.3.8 Lexical Resources
The English to Sinhala machine translation system uses four dictionaries, namely
the English dictionary, Sinhala dictionary, English-Sinhala bilingual dictionary
and the concept dictionary. The implementation of each dictionary is given below.
6.3.8.1 English Dictionary
The English dictionary has been implemented through five Prolog data files,
namely eng_irr_noun.pl, eng_reg_noun.pl, eng_irr_verb.pl, eng_reg_verb.pl and
eng_irr_word.pl. The eng_irr_noun.pl file contains English irregular noun
information. To represent the irregular noun information, the author used the
Prolog predicate eiw/7. The predicate eiw/7 represents the word ID, word type,
person, number, sex, case and the English word. As an example, the following
Prolog fact shows the lexical information for the English word 'I'.
eiw(4000001, na, fs, sg, co, sb, 'i').
Table 6.1 shows the codes that are used to implement the grammatical notation in
the English dictionary.
Table 6.1: Grammatical notations for the English Dictionary
Criteria     Code  Meaning
Person       fs    1st person
             sc    2nd person
             td    3rd person
Number       sg    Singular
             pl    Plural
Sex          ma    Masculine gender
             fe    Feminine gender
             co    Common gender
             no    Neuter gender
Case         sb    Nominative case
             ob    Objective case
             po    Possessive case
             rf    Reflexive (pronoun)*
Verb type    if    Infinitive
             pa    Past
             pp    Past participle
             rp    Present participle
             sp    Simple present
Determiner   dr    Direct
             id    Indirect
Adjective    p     Positive
             c     Comparative
             s     Superlative
The following code shows sample data for the English irregular words
eiw(4000001, na, fs, sg, co, sb, 'i').
eiw(4000001, na, fs, sg, co, ob, 'me').
eiw(4000001, na, fs, sg, co, po, 'my').
eiw(4000001, na, fs, sg, co, po, 'mine').
eiw(4000001, na, fs, sg, co, rf, 'myself').
The English regular nouns are stored in a Prolog file named “eng_reg_noun.pl”
using the erw/4 Prolog predicate. The erw/4 predicate represents the word ID,
word type, sex and the English word. The following two samples are regular nouns
stored in eng_reg_noun.pl.
erw(1000001, na, ma, 'boy').
erw(1000002, na, fe, 'girl').
The English morphological analyzer reads these Prolog facts and uses them to
analyze English words.
The English irregular verbs are saved in a file named eng_irr_verb.pl. This file
contains the English irregular verbs as eiw/4 facts. Each fact represents the
word ID, word type, tense of the verb and the English irregular verb.
eiw(5000001, vb, if, 'eat').
eiw(5000001, vb, pt, 'ate').
eiw(5000001, vb, pp, 'eaten').
The English regular verbs are stored in a Prolog file named 'eng_reg_verb.pl'.
This file contains the English regular verbs in the erw/3 predicate format. The
following code shows how Prolog represents an English regular verb.
erw(2000001, vb, 'play').
The erw/3 Prolog predicate uses the word ID, word type and the word to store
regular verb information. The English morphological analyzer uses this
information to analyze English regular verbs.
In addition to the above, the other parts of speech such as adjectives, adverbs,
prepositions, conjunctions and interjections are stored in the Prolog file named
'eng_irr_word.pl'. The Prolog predicate eiw/4 is used to store all these words.
The following code shows how each word is stored in the eiw/4 format. A special
notation is used to identify each word type (na-noun, vb-verb, dt-determiner,
aj-adjective, av-adverb, pp-preposition, cn-conjunction and uv for auxiliary
verbs).
eiw(3000001, dt, id, 'a').
eiw(3000004, aj, p, 'good').
eiw(3000014, av, p, 'badly').
eiw(3000026, pp, v5, 'to').
eiw(3000027, cn, 0, 'and').
eiw(3000029, vb, uv, 'will').
By using the online update module, this English dictionary can be updated
automatically. This is the main purpose of separating the English dictionary into
several files.
6.3.8.2 Sinhala dictionary
The Sinhala dictionary is used to store all the Sinhala words, grammatical
information and rules which are used to generate Sinhala words. The Sinhala
dictionary comprises the Prolog files sin_reg_nouns.pl, sin_irr_nouns.pl,
sin_reg_verb.pl, sin_irr_verb.pl, sin_irr_words.pl, sin_case_rules.pl and
sin_rule_dic.pl.
The file sin_reg_nouns.pl contains the Sinhala regular noun information. The
Prolog predicate sn/9 is used to store all the information for a regular noun.
The following sn/9 fact shows the information for 'පිරිමි ළමයා': the word ID,
person, sex, live code, the conjugation rules for the singular direct, singular
indirect and plural forms, the case rule and the word itself. The relevant rules
are stored in the sin_rule_dic.pl file and the sin_case_rule.pl file.
sn([1000001],td,
ma,
li,
s900004,
s910000,
s910000,
s910000,
'පිරිමිළමයා').
The Sinhala irregular nouns are stored in the Prolog file named
'sin_irr_nouns.pl' using the sn/8 Prolog predicate. The sn/8 predicate gives the
word ID, person, number, sex, live code, direct/indirect form, case and the
Sinhala word. The Sinhala dictionary uses Sinhala Unicode to store all the
Sinhala words. The following code shows samples of the Sinhala irregular nouns.
sn([4000001], fs, sg, co, li, dr, v1, 'මම').
sn([4000001], fs, sg, co, li, dr, v2, 'මා').
sn([4000001], fs, sg, co, li, dr, v3, 'මාවිසින්').
Sinhala nouns have nine cases, and these cases are represented by the codes v1 to
v9.
The Sinhala regular verbs are stored in the Prolog file named 'sin_reg_verb.pl'
using the Prolog predicate sfv/9. It represents the word ID and the rule codes
for the active voice, passive voice and other verb (mood) forms.
sfv([5000001],
s910001,
s910002,
s910001,
s910001,
s910001,s910001,s910001, 'කනවා').
The Sinhala irregular verbs are stored in the Prolog file named sin_irr_verb.pl
using the Prolog predicate sfv/6. The sfv/6 predicate represents the word ID,
person, number, voice, tense and the Sinhala verb. The following code shows
samples of the Sinhala irregular verbs.
sfv([8000002], fs, sg, at, pr, 'සිටිමි').
sfv([8000002], fs, pr, at, pr, 'සිටිමු').
All other Sinhala words, namely Sinhala adjectives, adverbs and prepositions, are
stored in a Prolog file named 'sin_irr_word.pl' using the Prolog predicate siw/4.
The siw/4 predicate represents the Sinhala word ID, type, property and the
Sinhala word. The following sample code shows Sinhala words in the dictionary.
siw([3000034], aj, p, 'අලුත්').
siw([3000015], av, p, 'ෙහමින්').
siw([3000033], pp, v3, 'මගින්').
To generate a Sinhala noun, several rules are needed. These rules are stored in
'sin_rule_dic.pl' and are used to generate the appropriate Sinhala noun form from
its base form. The following sample rules are used to generate forms of the
Sinhala word 'කපුටා'. These rules represent the implementation of the Sinhala
kaputu ganaya (කපුටු ගණය). The sin_rule_dic.pl file contains more than 100 rules
for generating the appropriate Sinhala noun forms.
noun_posfix(s935001, li, bas,
'ටු', 'ටා').
noun_posfix(s935001, li, sds,
'ටා', 'ටු').
noun_posfix(s935001, li, sdo,
'ටා', 'ටු').
noun_posfix(s935001, li, sis,
'ෙටක්', 'ටු').
noun_posfix(s935001, li, sio,
'ෙටකු', 'ටු').
noun_posfix(s935001, li, pds,
'ෙටෝ', 'ටු').
noun_posfix(s935001, li, pdo,
'ටන්', 'ටු').
The noun_posfix/5 predicate is the rule format for Sinhala nouns; it represents
the rule ID, live code, noun form, the part to add and the part to remove. These
rules are the implementation of the Sinhala noun paradigm (conjugation table). In
addition to the above, the case rules are used to generate the complete Sinhala
noun with the case effect. The case rules are stored in a Prolog file named
sin_case_rule.pl. The following code shows sample case rules.
noun_vib_postfix(s910001, v1, '', '').
noun_vib_postfix(s910001, v2, '', '').
noun_vib_postfix(s910001, v3, ' විසින්', '').
noun_vib_postfix(s910001, v4, 'ෙයන්', '').
noun_vib_postfix(s910001, v5, 'ට', '').
noun_vib_postfix(s910001, v6, 'ෙයන්', '').
noun_vib_postfix(s910001, v7, 'ෙයන්', '').
noun_vib_postfix(s910001, v8, ' ෙකෙරහි', '').
noun_vib_postfix(s910001, v9, '','').
The Prolog predicate noun_vib_postfix/4 gives the rule ID, case, the part to add
and the part to remove. The Sinhala morphological generator uses all of these
rules to generate grammatically correct Sinhala terms.
The sin_rule_dic.pl file also stores the rules which are used to generate Sinhala
verbs. The Prolog predicate verb_posfix/7 is used to store the rule ID, voice,
person, number, tense, the part to add and the part to remove. The following
sample code shows sample rules for Sinhala verb generation.
verb_posfix(s910001, at, fs, sg, pr, 'මි', 'නවා').
verb_posfix(s910001, at, fs, pr, pr, 'මු', 'නවා').
verb_posfix(s910001, at, sc, sg, pr, 'හි', 'නවා').
verb_posfix(s910001, at, sc, pr, pr, 'හු', 'නවා').
verb_posfix(s910001, at, td, sg, pr, 'යි', 'නවා').
verb_posfix(s910001, at, td, pr, pr, 'ති', 'නවා').
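As a sketch of how such a rule can be applied, the following predicate looks up
the add and remove parts for the requested voice, person, number and tense, and
then swaps the verb ending. The two rule facts are copied from the sample above;
the predicate name conjugate/7 is an illustrative assumption, and the source
file is assumed to be read as UTF-8.

```prolog
% verb_posfix(RuleID, Voice, Person, Number, Tense, Add, Remove).
verb_posfix(s910001, at, fs, sg, pr, 'මි', 'නවා').
verb_posfix(s910001, at, td, sg, pr, 'යි', 'නවා').

% conjugate(+Verb, +RuleID, +Voice, +Person, +Number, +Tense, -Out):
% strip the Remove ending from the dictionary form and append Add.
conjugate(Verb, RuleID, Voice, Person, Number, Tense, Out) :-
    verb_posfix(RuleID, Voice, Person, Number, Tense, Add, Remove),
    atom_concat(Stem, Remove, Verb),
    atom_concat(Stem, Add, Out).
```

For the dictionary form 'කනවා', the query
conjugate('කනවා', s910001, at, fs, sg, pr, W) yields W = 'කමි'.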
6.3.8.3 English-Sinhala Bilingual dictionary
The English to Sinhala bilingual dictionary is used to identify the appropriate
Sinhala base word for a given English word. The following code shows the syntax
used for storing information in the English-Sinhala bilingual dictionary.
esw(6000006, 1000001, 1000001, na, 'boy', 'පිරිමිළමයා').
esw(6000006, 1000002, 1000002, na, 'girl', 'ගැහැණුළමයා').
The esw/6 Prolog predicate is used to store the appropriate Sinhala base word for
a given English base word. The esw/6 predicate gives the entry ID, English word
ID, Sinhala word ID, word type, English word and the Sinhala word. Using this
predicate, all the Sinhala and English words are linked through the
English-Sinhala bilingual dictionary.
6.3.8.4 Concept Dictionary
The concept dictionary is used to store context information and relevant semantic
information for each word. All the context information is stored in two Prolog
data files, namely eng_sin_cons_dic.pl and eng_sin_usage_dic.pl.
The eng_sin_cons_dic.pl file contains the context information that is used in the
intermediate editor. The Prolog predicate eng_cons_word/3 is used to store these
context details. The following sample code shows how data are stored in the
concept dictionary.
eng_cons_word(e1000000, s1000000, e1000000).
The eng_sin_usage_dic.pl file is used to store the most commonly used terms on
the web. This dictionary is automatically updated by the online update module.
As with eng_cons_word/3, the eng_usage_word/3 Prolog predicate is used to store
this usage information.
In addition to the above Sinhala resources, a Sinhala corpus is used as a
supporting resource to find available word forms. The Sinhala corpus information
is stored using a Prolog predicate named sc/1, and all the information is stored
in a Prolog file named 'sinhalacop.pl'.
sc('අද').
sc('අෙප්').
sc('ජාතික').
sc('ක්රීඩාව').
sc('ඒ').
At present, the corpus contains 18613180 words, and these resources were
collected from the UCSC Sinhala corpus (LTRL). The Sinhala word generator uses
these resources to identify the suitable Sinhala word forms directly.
6.4 Supporting modules
Three supporting modules have been developed to update the lexical resources,
namely the online updater, the Sinhala word generator and the online search
module. The implementation details of each module are given below.
6.4.1 Online Updater
The online updater module is used to update each lexical resource. This module
can update the English dictionary, Sinhala dictionary and English-Sinhala
bilingual dictionary automatically. This module also gets support from the
Sinhala word generator and the online search module to do the update task. As the
first step, the online updater loads all the dictionaries by using the following
predicate.
consult_eng_dic :-
    consult('c:\\bees3.2\\dic\\eng_reg_nouns.pl'),
    consult('c:\\bees3.2\\dic\\eng_reg_verbs.pl'),
    consult('c:\\bees3.2\\dic\\eng_irr_nouns.pl'),
    consult('c:\\bees3.2\\dic\\eng_irr_verbs.pl'),
    consult('c:\\bees3.2\\dic\\eng_irr_words.pl').
Then the updater uses the online search module to get the grammatical information
from a set of online resources. For example, the online search module uses the
Madhura online dictionary, the Cambridge dictionary, the Sensagent online
dictionary and the Yahoo search engine to get the relevant English grammatical
information. The online updater gets the relevant word information, such as the
word type (regular noun, irregular noun, regular verb, irregular verb, adjective,
etc.), and then the system updates each dictionary accordingly. The following
sample code is used to update an English regular noun.
update_eng_reg_noun(Word, ID) :-
    write('try to update regular noun'),
    consult('c:\\bees3.2\\dic\\eng_reg_nouns.pl'),
    (   erw(ID, na, _, Word)
    ->  write('English regular noun available ('),
        write(ID), write(')'), nl
    ;   consult('c:\\bees3.2\\updateinfo.pl'),
        (   new_noun(Word, re, Word, _)
        ->  get_new_eng_reg_noun_key(ID),
            open('c:\\bees3.2\\dic\\eng_reg_nouns.pl', append, File),
            write(File, 'erw('), write(File, ID),
            write(File, ', na, no, \''),
            write(File, Word),
            write(File, '\').'), nl(File), close(File)
        ;   update_eng_irr_noun(Word, ID)
        )
    ).
The Prolog predicate get_new_eng_reg_noun_key/1 is used to get a new key value
for the regular noun. In addition to the above, the following code shows how the
module uses a Java program to get online information.
search_cmb_dic(Word, Out) :-
    use_module(library(jpl)),
    write('Call : http://dictionary.cambridge.org ..... '),
    jpl_new('SearchCambDic', [], F),
    jpl_call(F, searchDic, [Word], Out), write(Out), nl.
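The update_eng_reg_noun/2 code above builds the new dictionary fact by writing
its pieces one at a time. A possible alternative, shown here only as a sketch, is
to write the whole term with the ~q directive of format/3, which quotes atoms
where needed so that the appended line can always be read back as a Prolog fact
(for example when the word contains a space or a quote). The predicate name
append_regular_noun/4 and its argument order are assumptions for illustration.

```prolog
% append_regular_noun(+File, +ID, +Sex, +Word):
% append one erw/4 fact to File, quoting the word where necessary.
append_regular_noun(File, ID, Sex, Word) :-
    setup_call_cleanup(
        open(File, append, Stream),
        format(Stream, '~q.~n', [erw(ID, na, Sex, Word)]),
        close(Stream)).
```

The setup_call_cleanup/3 wrapper closes the stream even if the write fails, so a
failed update does not leave the dictionary file open.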
6.4.2 Sinhala Word Generator
The Sinhala word generator is implemented to generate the appropriate Sinhala
word forms. The following sample code is used to generate the base form of a
given noun. The Sinhala word generator can generate all the word forms for a
given noun or verb. These word forms need to be validated against the required
rules.
validate_baseform(WD, P, NP, BASE) :-
    ensure_loaded('c:\\bees3.2\\dic\\sin_rules_dic.pl'),
    noun_posfix(NP, P, bas, AB, RB),
    atom_chars(WD, WDL),
    atom_chars(AB, ABL),
    atom_chars(RB, RBL),
    append(WDL1, RBL, WDL),
    append(WDL1, ABL, NWDL),
    concat_atom(NWDL, BASE),
    write(BASE), nl.
6.4.3 Online Search module
The online search module gets relevant information from web resources. The
following sample Java method is used to search for a Sinhala word on the Yahoo
search engine.
// Requires java.io.* (FileOutputStream, OutputStreamWriter, BufferedWriter).
// Returns "f" if results were found, "n" if not, and "e" on a connection error.
public static String searchWeb(String word)
{
    String outstr = "f";
    try {
        //System.setProperty("http.proxyHost", "10.32.193.254");
        //System.setProperty("http.proxyPort", "3128");
        System.out.println("Connecting to http://search.yahoo.com/");
        FileOutputStream fout = new FileOutputStream("tmp\\yahoo_search.html");
        BufferedWriter out =
            new BufferedWriter(new OutputStreamWriter(fout, "ISO-8859-1"));
        String uu = "http://search.yahoo.com/search?p=" + word;
        String resultString = new String(uu.getBytes("UTF-8"));
        String str = sendGetRequest(resultString, "");
        int index1 = -1;
        index1 = str.indexOf("We did not find results");
        if (index1 >= 10) {
            outstr = "n";
        }
        // Keep a copy of the result page and the searched word.
        out.write(str);
        out.write(word);
        out.close();
    } catch (Exception e) {
        System.out.println("Connection Error ........" + e);
        outstr = "e";
    }
    System.out.println("Result : " + outstr);
    return outstr;
}
6.5 Summary
This chapter reported the implementation of all the modules and dictionaries. To
implement the modules, the author used Java and Prolog technologies. The next
chapter discusses how BEES works in four environments, namely desktop
application, online translator, webpage translator and selected text translator.
Chapter 7
BEES IN ACTION
7.1 Introduction
The previous chapter described the implementation of all the modules and
dictionaries. BEES has been deployed through several online and standalone
applications. This chapter describes the various applications of BEES. The
English to Sinhala machine translation system has been implemented as four
applications, namely:
1. BEES as an online translator
2. BEES as a webpage translator
3. BEES as a selected text translator
4. BEES as a Desktop Application
7.2 BEES as an Online Translator
BEES has been developed as an online translator. This development is primarily
based on the use of Prolog Server Pages [23]. The architecture of the web-based
BEES (English to Sinhala machine translation system) is shown in Figure 7.1.
Figure 7.1: Web based architecture for the BEES
The web-based system contains four modules, namely the web client, the Apache web
server [17], the PSP (Prolog Server Pages) module and the Prolog-based core
translation system. Note that the Prolog-based core translation system is a
rule-based machine translation system which is developed using all the functional
modules of BEES.
The web browser is the user interface of the system. The Apache web server
handles all the web-based transactions of the system. PSP provides facilities to
run the Prolog-based system through the web. The Prolog-based system is the core
of the machine translation system. Through the PSP scripts, the core system reads
the input English sentence that comes from the web client. After the translation,
the core machine translation system returns the output Sinhala sentence to the
web client. Figure 7.2 shows the user interface of the online BEES [72].
Figure 7.2: User interface of the Online BEES
7.3 BEES as a Web Page Translator
BEES has been extended to work as a web page translator, which can be used to
translate a given web page [66]. This section describes how the system translates
a given English web page into Sinhala. Figure 7.3 shows the user interface of the
web page translator.
Figure 7.3: A web page translator
This system translates a given English web page into Sinhala and shows the output
of the translation in a web browser. Figure 7.4 shows the translated output of
the sample web page. The translation process is described below. As a first step,
the HTML parser [66] analyzes the document and identifies the tags and the text.
Consider the following part of a simple HTML document.
<tr><td>
The Rabbit
</td></tr>
<tr><td>
<img src="trabsl1.jpg">
The Rabbit is a small and herbivorous animal.
It lives in the jungle. Rabbit has long and powerful legs.
</td></tr>
This HTML source contains several HTML tags and text. “The Rabbit” is the first
text identified by the HTML parser, which sends it to the translation module. The
translation module reads the text and tries to translate it. In the sentence
analysis stage, the English parser rejects the input text because it is not a
sentence. The system therefore tries to identify it as a noun phrase, and the
English parser recognizes “The Rabbit” as a noun phrase. The translation module
then uses the English to Sinhala word translator, the Sinhala morphological
generator and the Sinhala parser to generate the Sinhala translation “හාවා”.
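The separation of tags and text described above can be illustrated with a short sketch. It is written in Python with the standard html.parser module and is not the HTML parser used by BEES [66]; the names TextCollector, extract_texts, split_sentences and is_sentence_candidate are hypothetical. Checking for a final full stop is only a crude stand-in for the English parser's decision between a sentence and a noun phrase.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text nodes of an HTML fragment and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        # Collapse line breaks and repeated spaces inside one text node.
        text = " ".join(data.split())
        if text:
            self.texts.append(text)


def extract_texts(html):
    """Return the non-empty text nodes in document order."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def split_sentences(text):
    """Split a text node after '.', '!' or '?' (a simplification)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def is_sentence_candidate(text):
    """Assumption: text without final punctuation is tried as a noun phrase."""
    return text.rstrip().endswith((".", "!", "?"))


SAMPLE = """<tr><td>
The Rabbit
</td></tr>
<tr><td>
<img src="trabsl1.jpg">
The Rabbit is a small and herbivorous animal.
It lives in the jungle. Rabbit has long and powerful legs.
</td></tr>"""

if __name__ == "__main__":
    for node in extract_texts(SAMPLE):
        for unit in split_sentences(node):
            kind = "sentence" if is_sentence_candidate(unit) else "noun phrase"
            print(kind, "->", unit)
```

In this sketch the heading “The Rabbit” is routed to noun phrase handling, while the three sentences of the second cell are passed on one at a time.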
The following shows how the translation module handles a complete sentence.
Assume that the translation module reads the sentence “The Rabbit is a small
and herbivorous animal” as input. The English morphological analyzer
reads the input sentence and returns the following.
eng_detm([e1000002], dr, 'the').
eng_noun([e1000077], td, sg, ma, sb, 'rabbit').
eng_verb([e1000057], if, 'is').
eng_detm([e1000001], id, 'a').
eng_adjv([e1000074], p, 'small').
eng_conj([e1000020], 0, 'and').
eng_adjv([e1000076], p, 'herbivorous').
eng_noun([e1000059], td, sg, co, sb, 'animal').
eng_detm/3, eng_noun/6, eng_verb/3, eng_adjv/3 and eng_conj/3 are the Prolog
predicates that represent English words. The English parser then receives this
information and analyzes the English sentence. The English parser returns the
following predicates.
eng_sentence_type(simple,if).
eng_sen_verb([e1000057]).
eng_sen_complement([e1000001, e1000074, …]).
eng_sen_subject([e1000002, e1000077]).
eng_sen_ekeys([e1000002, e1000077, …]).
The English parser identifies the subject, verb and complement of the sentence. It
stores this information in Prolog predicates such as eng_sen_verb/1,
eng_sen_complement/1 and eng_sen_subject/1. After successful syntax analysis, the
word translator finds the corresponding Sinhala root word for each input root word.
The word translator returns the following predicates.
estrwords(1001, e1000002, s1000000, dt).
estrwords(1002, e1000077, s1000078, na).
estrwords(1003, e1000057, s1000059, vb).
estrwords(1004, e1000001, s1000000, dt).
estrwords(1005, e1000074, s1000076, aj).
estrwords(1006, e1000020, s1000018, cn).
estrwords(1007, e1000076, s1000077, aj).
estrwords(1008, e1000059, s1000060, na).
The estrwords/4 Prolog predicates hold the bilingual information for each English
root word. Using all of this information, the Sinhala morphological generator
generates suitable Sinhala words for the corresponding English words.
snoun([s1000078], td, sg, ma, li, dr, v1,'හාවා').
sin_fverb([s1000059], td, sg, pr,'ය').
sin_adjv([s1000076],'කුඩා').
sin_conj([s1000018],'සහ').
sin_adjv([s1000077],'ශාක භක්ෂක').
snoun([s1000060], td, sg, co, li, id, v1,'සිවුපාවෙක්').
Using all of this information, the Sinhala parser generates the appropriate Sinhala
sentence “හාවා කුඩා සහ ශාක භක්ෂක සිවුපාවෙක් ය.”
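How the facts of this example fit together can be sketched as follows. This is a Python illustration and not the Prolog implementation of BEES. The identifiers are copied from the listings above and the Sinhala surface forms are written in Unicode. Two points are assumptions: the tail of the complement list is elided in the parser output, so the keys used here simply follow the input word order, and words without a generated surface form (here the determiners) are skipped. The function names are hypothetical.

```python
# Bilingual facts as printed: (entry id, English key, Sinhala key, category).
ESTRWORDS = [
    (1001, "e1000002", "s1000000", "dt"),
    (1002, "e1000077", "s1000078", "na"),
    (1003, "e1000057", "s1000059", "vb"),
    (1004, "e1000001", "s1000000", "dt"),
    (1005, "e1000074", "s1000076", "aj"),
    (1006, "e1000020", "s1000018", "cn"),
    (1007, "e1000076", "s1000077", "aj"),
    (1008, "e1000059", "s1000060", "na"),
]

# Parser output for "The Rabbit is a small and herbivorous animal".
SUBJECT = ["e1000002", "e1000077"]
VERB = ["e1000057"]
# Assumed: the elided tail of eng_sen_complement follows input word order.
COMPLEMENT = ["e1000001", "e1000074", "e1000020", "e1000076", "e1000059"]

# Surface forms from the Sinhala morphological generator, keyed by Sinhala key.
SURFACE = {
    "s1000078": "හාවා",
    "s1000059": "ය",
    "s1000076": "කුඩා",
    "s1000018": "සහ",
    "s1000077": "ශාක භක්ෂක",
    "s1000060": "සිවුපාවෙක්",
}


def to_sinhala_keys(english_keys):
    """Map English word keys to Sinhala word keys through the estrwords facts."""
    table = {english: sinhala for _, english, sinhala, _ in ESTRWORDS}
    return [table[key] for key in english_keys]


def compose(subject, complement, verb):
    """Order the words as subject, complement, verb and join their forms."""
    ordered = (
        to_sinhala_keys(subject)
        + to_sinhala_keys(complement)
        + to_sinhala_keys(verb)
    )
    # Keys without a surface form (the determiners here) are dropped.
    return " ".join(SURFACE[key] for key in ordered if key in SURFACE)


if __name__ == "__main__":
    print(compose(SUBJECT, COMPLEMENT, VERB))
```

The verb is placed last because the generated Sinhala sentence ends with the verb form, unlike the English input.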
After successful translation, the HTML parser reads the translated text and
composes the corresponding web page. Through this interface the user can view the
original English web page and the translated Sinhala web page separately. Figure 7.4
shows the output web interface of the web page translator.
Figure 7.4: BEES as a web page translator
7.4 BEES as a Selected Sentence Translator
As an improved version of the online BEES, the author has developed BEES as a
selected sentence translator. The selected sentence translator is a client
application that runs on the client machine, while the translation itself runs
through the online translator [61]. The client tool has been implemented as a VB
application, and the online connection is created through a Winsock client.
Figure 7.5 shows the user interface of the selected sentence translator. With this
tool, the user can translate a sentence simply by selecting it, which is useful for
readers who want to translate a sentence while reading. Figure 7.6 shows a desktop
screen on which the tool gives the translation of the selected sentence “we gave a
new book to your friend”.
Figure 7.5: Selected sentence translator
Figure 7.6: Desktop screen for selected sentence translation
7.5 BEES as a Desktop Application
BEES has also been designed as a desktop application. This section describes
how the system works for a given input sentence. The translation system uses seven
modules to process the translation. To start, the system reads the input sentence
from the GUI and begins the translation process. As the first step, the English
Morphological Analyzer reads the input English sentence word by word and provides
the morphological information for each word. The English parser then analyzes the
input English sentence using this morphological information. Next, the English to
Sinhala Base Word Translator translates the English base words into appropriate
Sinhala base words. This process is rather complex and uses two supporting
dictionaries, namely the English-Sinhala bilingual dictionary and the concept
dictionary. First, the English to Sinhala Base Word Translator looks up the
English-Sinhala bilingual dictionary and reads the available Sinhala base words for
the given English base word. If multiple words are available in the bilingual
dictionary, the system looks up the relevant information in the concept dictionary,
which stores concept information for each Sinhala word, to identify the most
suitable Sinhala base word. Otherwise, the English to Sinhala Base Word Translator
returns the most commonly used Sinhala base word for the given English base word.
After successful base word translation, the Sinhala parser (sentence composer)
generates the appropriate Sinhala sentence with the support of the Sinhala
Morphological Generator. The Sinhala Morphological Generator generates the
appropriate Sinhala words from the translated Sinhala base words and the given
grammatical information. The Sinhala parser then uses the generated Sinhala words
to produce a grammatically correct Sinhala sentence.
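The role of the concept dictionary in choosing between several candidate base words can be sketched as follows. The thesis does not spell out the matching procedure at this point, so scoring candidates by the overlap between their stored concepts and the concepts of the surrounding words is an assumption, as are all dictionary entries, concept labels and function names in this Python sketch.

```python
# Hypothetical bilingual dictionary: English base word -> candidate Sinhala
# base-word ids, listed with the most commonly used candidate first.
BILINGUAL = {
    "bank": ["s_bank_finance", "s_bank_river"],
    "book": ["s_book"],
}

# Hypothetical concept dictionary: Sinhala base-word id -> related concepts.
CONCEPTS = {
    "s_bank_finance": {"money", "institution"},
    "s_bank_river": {"river", "water"},
    "s_book": {"reading", "object"},
}


def choose_base_word(english_base, context_concepts):
    """Pick one Sinhala base word for an English base word.

    Returns None when the bilingual dictionary has no entry at all.
    """
    candidates = BILINGUAL.get(english_base, [])
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]

    def overlap(candidate):
        return len(CONCEPTS.get(candidate, set()) & context_concepts)

    best = max(candidates, key=overlap)  # ties keep the earlier candidate
    if overlap(best) == 0:
        # No concept matches: fall back to the most commonly used word.
        return candidates[0]
    return best


if __name__ == "__main__":
    print(choose_base_word("bank", {"river", "fish"}))
    print(choose_base_word("bank", set()))
```

Keeping the candidates ordered by usage lets the same list serve both the concept-based choice and the fallback to the most commonly used word.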
Figure 7.7 shows the user interface of BEES. The translation system works in
two modes, namely user mode and expert mode. When the system runs in expert
mode, it assumes that the user is an expert in both languages. The Intermediate
editor therefore automatically provides facilities to change the sentence through
intermediate editing, as needed for semantic handling. In addition, the system
also updates the lexical resources automatically.
Figure 7.7: User interface of the BEES
When the translation system runs in user mode, the Intermediate editor appears only
when the user asks to change the sentence. The following sample data shows how a
translation is processed.
Assume that the system reads “The good boy and his old mother are reading books” as
the input sentence. The English Morphological Analyzer then returns the following
output.
% Auto generated output
% **********************
eng_input_sen_list(['the', 'good', 'boy', 'and', 'his', 'old',
'mother', 'are', 'reading', 'books', []]).
eng_detm([3000003], dr, 'the').
eng_adjv([3000004], p, 'good').
eng_noun([1000001], td, sg, ma, sb, 'boy').
eng_noun([1000001], td, sg, ma, ob, 'boy').
eng_conj([3000027], 0, 'and').
eng_noun([4000004], td, sg, ma, po, 'his').
eng_noun([4000004], td, sg, ma, po, 'his').
eng_adjv([3000035], p, 'old').
eng_adjv([3000062], p, 'mother').
eng_noun([1000025], td, sg, no, sb, 'mother').
eng_noun([1000025], td, sg, no, ob, 'mother').
eng_verb([5000026], if, 'are').
eng_verb([3000030], uv, 'are').
eng_verb([5000008], rp, 'reading').
eng_noun([1000004], td, pr, no, sb, 'books').
eng_noun([1000004], td, pr, no, ob, 'books').
After syntax analysis, the English parser returns the following:
eng_sen_verb([3000030, 5000008]).
eng_sen_complement([1000004]).
eng_sen_subject([3000003, 3000004, 1000001, 3000027, 4000004, 3000027,
4000004, 3000035, 1000025]).
eng_sen_predicate([3000030, 5000008, 1000004]).
eng_sen_type(declarative).
eng_sen_ekeys([3000003, 3000004, 1000001, 3000035, 1000025, 3000030,
5000008, 1000004]).
eng_sen_tence(presentcontinus).
eng_sen_result(sucess).
Using all of this information, the English to Sinhala base word translator returns
the suitable Sinhala terms. The following listing shows the result of the English to
Sinhala base word translator.
estrwords(1001, 3000003, 3000000, dt).
estrwords(1002, 3000004, 3000004, aj).
estrwords(1003, 1000001, 1000001, na).
estrwords(1004, 3000027, 3000027, cn).
estrwords(1005, 4000004, 4000004, na).
estrwords(1006, 3000035, 3000035, aj).
estrwords(1007, 1000025, 1000045, na).
estrwords(1008, 3000030, 3000030, uv).
estrwords(1009, 5000008, 5000008, vb).
estrwords(1010, 1000004, 1000004, na).
The Sinhala Morphological Generator then generates suitable Sinhala words with full
grammatical information. The output of the Sinhala Morphological Generator is as
follows.
sin_adjv([3000004],'').
snoun([1000001], td, sg, ma, li, dr, v1,' ').
sin_conj([3000027],'').
snoun([4000004], td, sg, ma, li, dr, v7,'').
sin_adjv([3000035],'').
snoun([1000045], td, sg, no, nl, dr, v1,'').
sin_sub_info([3000004, 1000001, 3000027, 4000004, 3000035, 1000045]).
sin_sub_word([, , , , , , []]).
sin_fverb([5000008], td, pr, pr,' ').
sin_veb_info([5000008]).
sin_veb_word([ , []]).
snoun([1000004], td, pr, no, nl, dr, v2,'').
sin_cmp_info([1000004]).
sin_cmp_word([, []]).
Finally, the Sinhala parser generates the corresponding Sinhala sentence “හොද පිරිමි
ළමයා සහ ඔහුගේ වයසක මව පොත් කියවමින් සිටිති”.
7.6 Summary
This chapter described how BEES works in four environments, namely as an online
application, as a web page translator, as a selected sentence translator and as a
desktop application. The next chapter reports how the system was evaluated to
determine the accuracy of English to Sinhala machine translation.
Chapter 8
EVALUATION
8.1 Introduction
The approach and the implementation stages were discussed in the preceding
chapters. This chapter describes the evaluation of the approach, based on the
hypothesis formulated to test whether BEES is able to translate English text
into Sinhala. It also reports existing evaluation methodologies for machine
translation and our approach to evaluating English to Sinhala machine
translation.
8.2 Evaluation of MT systems
Evaluation of machine translation systems has received significant attention
in the past few years. In general, a machine translation system can be evaluated
in several ways, such as comparison with human translation or comparison of
multiple machine translation systems.
Several methods are used to evaluate machine translation systems. These
evaluation methods can be categorized into two groups, namely automated
evaluation and human-supported evaluation [98].
A number of standard evaluation metrics are available for automated machine
translation evaluation, such as BLEU [123], NIST [111] and METEOR [21]. These
metrics do not rely on human support for the evaluation process, and they are
much faster, easier and cheaper than human evaluation [2]. Most of these
techniques are based on n-gram metrics [90].
BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the
quality of text that has been machine-translated from one natural language to
another [123]. It is one of the most commonly used evaluation metrics for
statistical machine translation systems. However, it does not provide
sentence-level scores [169].
METEOR is another evaluation metric that automatically evaluates the output of
machine translation engines by comparing it to one or more reference translations
[21]. It has been designed to explicitly address weaknesses in the BLEU metric.
Round-trip translation [139], on the other hand, is a traditional approach to
evaluating machine translation systems. Round-trip translation is the process of
translating a word, phrase or text into another language and then translating the
result, one or more times and without reference to the original text, until it ends
up back in the language it started in [162].
Many researchers agree that these automated evaluation techniques are more
suitable for closely related language pairs such as Sinhala-Tamil [157] and
English-German. However, BLEU-type automated evaluation techniques are not
suitable for structurally different language pairs such as English-Hindi [12].
In addition, Goyal and others [52] have noted that languages such as Hindi need
more evaluation criteria than a single-question evaluation (“Is the translation
good?” Yes/No). They state that several questions must be answered to complete
the evaluation, such as whether gender and number are translated properly,
whether the tense of the translated sentence is correct, and whether the voice of
the sentence (i.e. active or passive) is translated properly. Further, the Sinhala
language is closely related to Hindi, and the two languages share the same
linguistic properties. Therefore, the evaluation methodology of BEES is based on
the above factors.
Traditionally, machine translation systems have been evaluated with human
support. This is a complex and time-consuming process. However, the results of
human evaluation are better than those of automatic evaluation. Therefore, many
machine translation system developers have used black-box and white-box
testing techniques to evaluate their systems with human support. Among others,
Goyal and Lehal [52] proposed a human-supported approach to evaluate their Hindi
to Punjabi machine translation system. For their evaluation, they selected more
than 100,000 sentences from newspaper articles, official language quest and blogs.
They used 50 people, and scoring was based on the degree of intelligibility and
comprehensibility. A four-point scale was used for their evaluation: the highest
point was assigned to a perfect translation and the lowest point to an
unintelligible sentence.
Error analysis is one of the important factors in the evaluation of machine
translation systems. Errors are analyzed through the Word Error Rate (WER) and the
Sentence Error Rate (SER). Word error rate is a common metric of the performance
of a speech recognition or machine translation system. Word error rate and sentence
error rate can then be computed as:
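The usual definitions are WER = (S + D + I) / N, where S, D and I are the numbers of substituted, deleted and inserted words and N is the number of words in the reference translation, and SER = (number of sentences containing at least one error) / (total number of sentences). The Python sketch below computes both under these standard definitions. Whether BEES was scored against reference translations in exactly this way is an assumption, and the function names are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, deletions and insertions
    needed to turn the hypothesis into the reference (edit distance)."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        current = [i]
        for j, hyp_word in enumerate(hyp, 1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (ref_word != hyp_word),  # substitution
                )
            )
        previous = current
    return previous[-1]


def word_error_rate(pairs):
    """WER over (reference, hypothesis) pairs: total errors / reference words."""
    errors = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return errors / words


def sentence_error_rate(pairs):
    """SER: share of sentences that differ from their reference."""
    wrong = sum(1 for ref, hyp in pairs if ref.split() != hyp.split())
    return wrong / len(pairs)


if __name__ == "__main__":
    sample = [("a b c d", "a x c d"), ("e f", "e f")]
    print(word_error_rate(sample), sentence_error_rate(sample))
```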
Considering the above facts, the author has developed an evaluation methodology for
the English to Sinhala machine translation system.
8.3 BEES Evaluation
The English to Sinhala machine translation system has been evaluated in the
following three stages:
1. Module testing: a white-box testing approach was used to test each module in
the machine translation system with the developed testing tools.
2. Performance testing: the system performance was evaluated and the error rate
was calculated using the evaluation test bed.
3. Accuracy testing: an intelligibility and accuracy test was conducted with human
support [61].
8.4 Stage 1: Module Testing
The English to Sinhala machine translation system contains six modules that
directly support translation, namely the English Morphological analyzer,
English parser, English to Sinhala bilingual base-word translator, Sinhala
morphological generator, Sinhala sentence generator and the transliteration module.
The author has designed and developed a test tool for each module and tested each
of them. These tools have been developed as online systems and are available on
the BEES web site [72].
8.4.1 English Morphological Analyzer
The English Morphological analyzer analyzes English words and gives the
morphological information for each word. To test it, the author has implemented an
online version, which gives the morphological information for the given English
word(s). Using this online English morphological analyzer, the author has tested
each type of word according to the test plan. The complete evaluation test plan is
attached in Appendix A. The English Morphological analyzer has been tested
successfully with more than 50 test cases. Table 8.1 shows a sample test plan for
English regular nouns. Figure 8.1 shows the user interface and the output of the
English morphological analyzer.
Table 8.1: Sample test plan for English Morphological analyzer
Test case: Morphological rules for English Noun
No | Grammar | Morphological structure | Base word | Example
1 | Singular noun | Base word | boy | boy
2 | Plural noun | Base + s | boy | Boys
3 | Plural noun | Base + es | class | Classes
4 | Plural noun | Plurals Base –y + ies | baby | Babies
5 | Plural noun | Plurals Base – f + ves | knife | Knives
6 | Singular Possessive | Base + ‘s | Home | Home’s
7 | Plural Possessive | Plural + ‘ | boy | Boys’
8 | Singular noun | Verb Base + er | play | player
9 | Plural noun | Verb Base + ers | play | players
10 | Singular noun | Verb Base + ment | Pay | payment
11 | Plural noun | Verb Base + ments | Pay | payments
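The noun rules of Table 8.1 can be illustrated with a small rule-based analyzer. This Python sketch is not the analyzer of BEES: the lexicon, the function names and the order in which the rules are tried are assumptions, and the derived nouns in rows 8 to 11 (player, payment) are left out for brevity.

```python
# Hypothetical base-word lexicon taken from the examples in Table 8.1.
LEXICON = {"boy", "class", "baby", "knife", "home"}


def plural_base(word):
    """Undo the plural rules: -ies -> y, -ves -> f/fe, -es, -s."""
    candidates = []
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")
    if word.endswith("ves"):
        candidates.extend([word[:-3] + "f", word[:-3] + "fe"])
    if word.endswith("es"):
        candidates.append(word[:-2])
    if word.endswith("s"):
        candidates.append(word[:-1])
    for candidate in candidates:
        if candidate in LEXICON:  # accept only known base words
            return candidate
    return None


def analyze_noun(word):
    """Return (base word, grammar) for a noun form, or None if unknown."""
    form = word.lower().replace("’", "'").replace("‘", "'")
    if form.endswith("'s") and form[:-2] in LEXICON:
        return form[:-2], "singular possessive"
    if form.endswith("s'"):
        base = plural_base(form[:-1])
        if base:
            return base, "plural possessive"
    if form in LEXICON:
        return form, "singular"
    base = plural_base(form)
    if base:
        return base, "plural"
    return None


if __name__ == "__main__":
    for example in ["boy", "Boys", "Classes", "Babies", "Knives", "Home’s", "Boys’"]:
        print(example, "->", analyze_noun(example))
```

Checking every candidate against the lexicon keeps a form such as “Classes” from being analyzed as the non-word “classe” plus s.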
8.4.2 English Parser
The English Parser analyzes the English sentence using the output of the
English Morphological analyzer. An online test tool has been developed to test all
the functionality of the English Parser, and the parser has been tested according
to the test plan. The parser handles simple as well as complex sentences of the
declarative, interrogative and imperative types, and it returns the syntax
information of the given English sentence. The author has successfully tested more
than 500 sentence patterns with the test tool. Some selected test cases are
shown in Table 8.2.
Figure 8.1: English Morphological analyzer with test results
Table 8.2: Sample test plan for English parser
No | Pattern | Example
1 | Simple Present | A boy reads a book
2 | Present Continuous | I am writing a new book
3 | Present Perfect | Good boys have read the books
4 | Present Perfect Continuous | I have been writing a book
1 | Simple Past | I gave a book
2 | Past Continuous | I was giving a book
3 | Past Perfect | I had written a book
4 | Past Perfect Continuous | The boy had been giving a book
8.4.3 English to Sinhala Base Word Translator
The English to Sinhala Base Word Translator provides a suitable Sinhala base word
for the given English base word. The translator uses the following rules to
generate the appropriate Sinhala base word:
• Find a suitable Sinhala base word in the bilingual dictionary with the full
grammatical mapping. (If two or more words are available in the bilingual
dictionary, the system uses the concept dictionary to find the suitable Sinhala
base word.)
• If the grammatical mapping is not satisfied, the system uses the Intermediate
editor.
• If there is no corresponding Sinhala word for the given English base word in the
bilingual dictionary, the system uses the corresponding Sinhala transliteration.
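The three rules above form a fallback chain, which can be sketched in Python as follows. The dictionary layout, the two callables standing in for the Intermediate editor and the transliteration module, and the function name are all assumptions made for illustration; only the sample keys are taken from the estrwords facts in Chapter 7.

```python
def translate_base_word(english_key, category, bilingual, ask_editor, transliterate):
    """Apply the three rules in order and report which rule produced the result.

    bilingual maps an English key to a list of (Sinhala key, category) pairs.
    ask_editor and transliterate are placeholder callables (assumptions).
    """
    entries = bilingual.get(english_key)
    if not entries:
        # Rule 3: no dictionary entry at all, so transliterate the word.
        return "transliteration", transliterate(english_key)
    matching = [sinhala for sinhala, cat in entries if cat == category]
    if matching:
        # Rule 1: an entry with the full grammatical mapping exists.
        return "dictionary", matching[0]
    # Rule 2: entries exist but the grammar does not match, so ask the user.
    options = [sinhala for sinhala, _ in entries]
    return "intermediate editor", ask_editor(english_key, options)


# Sample entry based on estrwords(1002, e1000077, s1000078, na).
SAMPLE_BILINGUAL = {"e1000077": [("s1000078", "na")]}

if __name__ == "__main__":
    editor = lambda key, options: options[0]  # stand-in for user input
    translit = lambda key: "translit:" + key  # stand-in for transliteration
    for key, cat in [("e1000077", "na"), ("e1000077", "vb"), ("e9999999", "na")]:
        print(translate_base_word(key, cat, SAMPLE_BILINGUAL, editor, translit))
```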
To evaluate the English to Sinhala base word translator, the author has implemented
a test tool that exercises the functionality of the bilingual translator. The
translator has been tested according to the test plan.
8.4.4 Sinhala Morphological Generator
The Sinhala Morphological Generator is the key module of the English to Sinhala
translation system. It generates the required word form for a given Sinhala base
word. Using this generator, a testing tool has been created that generates all the
forms of a given Sinhala base word. The Sinhala language has a large number of
conjugation forms for nouns and verbs. Our Sinhala Morphological Generator handles
85 grammar rules for Sinhala nouns and 36 grammar rules for Sinhala verbs. A sample
conjugation table for Sinhala nouns is attached in Appendix B. All these rules are
implemented using fundamentals of Sinhala grammar such as Prakurthi, Nama and
Kriyagana [41] [88]. Table 8.3 shows a sample paradigm table for the Sinhala noun
class “Ethganaya” (ඇත් ගණය).
Table 8.3: Sample Sinhala Morphological rules
ගණය | ලිංගය | ප්‍රකෘතිය (example) | නියත ඒක: Add | rem | example
ඇත් ගණය | පු | ඇත් | D | a | ඇතා
ඇත් ගණය | පු | කොක් | D | a | කොකා
ඇත් ගණය | පු | ගොන් | D | a | ගොනා
ඇත් ගණය | පු | නිකම් | D | a | නිකමා
ඇත් ගණය | පු | කිඹුල් | D | a | කිඹුලා
ඇත් ගණය | පු | මිනිස් | D | a | මිනිසා
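The rule in Table 8.3 can be read from its examples: the final hal mark (්) of the base is removed and the vowel sign ා is added, so that ඇත් becomes ඇතා. The Python sketch below applies this reading to Unicode strings. Interpreting the Add and rem columns in this way is an assumption based on the example columns, and the function name is hypothetical.

```python
AL_LAKUNA = "\u0dca"  # hal mark, written ් after a consonant
AELA_PILLA = "\u0dcf"  # vowel sign ා

# Base forms and definite singular forms of the sample noun class.
EXAMPLES = [
    ("ඇත්", "ඇතා"),
    ("කොක්", "කොකා"),
    ("ගොන්", "ගොනා"),
    ("නිකම්", "නිකමා"),
    ("කිඹුල්", "කිඹුලා"),
    ("මිනිස්", "මිනිසා"),
]


def definite_singular(base):
    """Remove the final hal mark and add the vowel sign ා."""
    if not base.endswith(AL_LAKUNA):
        raise ValueError("base word does not end with a hal mark")
    return base[: -len(AL_LAKUNA)] + AELA_PILLA


if __name__ == "__main__":
    for base, _ in EXAMPLES:
        print(base, "->", definite_singular(base))
```

One paradigm rule of this kind covers every base word in the noun class, which is what allows the generator to be driven by rule tables rather than by stored word forms.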
To test the Sinhala morphological generator, the author has implemented a “Sinhala
word conjugator”, which gives all the Sinhala word forms for a given Sinhala word.
Figure 8.2 shows the Sinhala word conjugator running in the SWI-Prolog [143]
interface. The complete set of rules used to implement Sinhala word generation is
attached at the end of the thesis.
Figure 8.2: Sinhala word conjugator
8.4.5 Sinhala Sentence Composer
The structures of Sinhala and English sentences differ from each other.
Therefore, an English sentence cannot always be mapped directly onto a Sinhala
sentence, especially for the passive voice and perfect forms. The Sinhala sentence
composer composes a grammatically correct Sinhala sentence for the given Sinhala
subject, object, verb phrase and tense pattern. The Sinhala sentence pattern
corresponding to each English sentence pattern is tested according to the test plan.
8.4.6 Transliteration Module
The transliteration module is used to transliterate English text into Sinhala.
An online tool has been implemented to test all the functionality of the
transliteration module, and the module has been tested using this tool.
8.5 Stage 2: Performance Testing
After each module in the English to Sinhala machine translation system was
evaluated, an evaluation test bed was implemented as an experimental setup [68].
The evaluation test bed contains a limited number of words (100 nouns, 50 verbs, 50
adjectives, 50 adverbs, determiners, and some auxiliary verbs for tenses). Using the
evaluation test bed, the performance of the translation system, the Word Error Rate
and the Sentence Error Rate have been calculated. Figure 8.3 shows the user
interface of the evaluation test bed.
Using the evaluation test bed, anyone can compose a sentence from the available
words. After generating the Sinhala sentence, the evaluation test bed shows the
evaluation form. The evaluation form contains the following questions to evaluate
the translation.
• Subject verb agreement (correct/incorrect)
• Tense of the sentence (correct/incorrect)
• Word conjugation (all correct/some are correct/all incorrect)
• Word order in the sentence (correct/incorrect)
• Meaning of the translated sentence
0 - Error
1 - Meaningless
2 - Basically OK
3 - Perfect
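One completed evaluation form can be represented as a small record, as in the Python sketch below. The field names, the validation and the tallying helper are assumptions made for illustration; only the questions and the 0 to 3 meaning scale come from the form itself.

```python
from collections import Counter
from dataclasses import dataclass

MEANING_SCALE = {0: "Error", 1: "Meaningless", 2: "Basically OK", 3: "Perfect"}
CONJUGATION_CHOICES = ("all correct", "some are correct", "all incorrect")


@dataclass(frozen=True)
class EvaluationForm:
    """Answers given by one evaluator for one translated sentence."""

    agreement_correct: bool
    tense_correct: bool
    conjugation: str
    word_order_correct: bool
    meaning: int

    def __post_init__(self):
        if self.conjugation not in CONJUGATION_CHOICES:
            raise ValueError("unknown word conjugation answer")
        if self.meaning not in MEANING_SCALE:
            raise ValueError("meaning score must be 0, 1, 2 or 3")


def meaning_counts(forms):
    """Count how many forms fall into each point of the meaning scale."""
    counts = Counter(MEANING_SCALE[form.meaning] for form in forms)
    return {label: counts.get(label, 0) for label in MEANING_SCALE.values()}


if __name__ == "__main__":
    forms = [
        EvaluationForm(True, True, "all correct", True, 3),
        EvaluationForm(True, False, "some are correct", True, 2),
    ]
    print(meaning_counts(forms))
```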
This evaluation form is used to evaluate the English to Sinhala machine translation
system. Figure 8.3 shows the online evaluation test bed and Figure 8.4 shows the
user interface of the online evaluation form.
Figure 8.3: User interface of the evaluation test bed
Figure 8.4: Online evaluation form
8.6 Stage 3: Accuracy Testing
The previous two stages check each module of the translation system and calculate
the performance of the system. To evaluate the accuracy and intelligibility of the
translation system, the following three steps are followed.
1. 200 sample sentences are collected and grouped into 20 sets (10 sentences in
each set).
2. Each sentence is translated using BEES.
3. Each set of sentences is given to a human translator, who scores each sentence
against the following criteria (the same as the evaluation form of the evaluation
test bed).
• Subject verb agreement (correct/incorrect)
• Tense of the sentence (correct/incorrect)
• Word conjugation (all correct/some are correct/all incorrect)
• Word order in the sentence (correct/incorrect)
• Meaning of the translated sentence
0 - Error
1 - Meaningless
2 - Basically OK
3 - Perfect
The accuracy and the performance of the system have been calculated from all of
the above results.
8.7 Result of the Experiments
To obtain the results, 200 sample sentences were used. The following list shows some
sample sentences and the Sinhala translation of each. A sample evaluation form and
sample evaluator’s comments are attached in Appendices C and D.
1. I write books
මම පොත් ලියමි
2. I am writing a new book
මම අලුත් පොතක් ලියමින් සිටිමි
3. I have written a new book
මම අලුත් පොතක් ලියා ඇත්තෙමි
4. We have written new books
අපි අලුත් පොත් ලියා ඇත්තෙමු
5. A good boy and his mother have been reading new books
දක්ෂ පිරිමි ළමයෙක් සහ ඔහුගේ මව අලුත් පොත් කියවමින් සිට ඇත්තෝය
6. The beautiful girl was singing a song
ලස්සන ගැහැණු ළමයා ගීතයක් ගායනා කරමින් සිටියේය
7. We had written new books
අපි අලුත් පොත් ලියා තිබුණෙමු
8. A good boy reads a good book
දක්ෂ පිරිමි ළමයෙක් හොද පොතක් කියවයි
9. A new book is written by me
මා විසින් අලුත් පොතක් ලියනු ලබයි
10. A new book is being written by my good friend
මගේ හොද මිතුරා විසින් අලුත් පොතක් ලියමින් ඇත
After the evaluation, the following experimental results were collected. Table 8.4
shows the results of the module tests, including English morphological analysis,
English syntax analysis and Sinhala morphological generation. It shows each test
case and the percentage of successful tests.
Table 8.4: Results for module testing
Test case | Percentage (%)
English Morphological Analysis | 96
English Syntax analysis | 90
English to Sinhala base-word translation | 92
Sinhala Morphological generation | 94
Sinhala sentence generation | 90
Sinhala transliteration | 80
Table 8.5 shows the results of the human evaluation, including correct subject-verb
agreement, correct tense translation and correct noun and verb generation. The
experimental results give the number of correct sentences out of the 200 sample
sentences. Each test case showed more than 80% correct results in the evaluation.
Table 8.5: Human evaluation results
Case | Results
Correct subject verb agreement | 186
Correct tense translation | 190
Correct Noun verb generation | 180
Correct word order | 185
Total number of sentences | 200
Table 8.6 shows the accuracy results for the 200 sample sentences. The
experimental results show that 71.5% of the sample (143 sentences) is translated
perfectly and 26% of the sample (52 sentences) is basically OK. Therefore, the
system gives 97.5% accuracy of translation on this sample. Figure 8.5 shows the
result of the system accuracy test.
Table 8.6: Accuracy results
Test case | Sentences
Perfect translation | 143
Basically OK | 52
Meaningless | 2
Error | 3
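The percentages that follow from the counts in Tables 8.5 and 8.6 can be checked with a few lines of arithmetic. The Python sketch below only restates the reported counts; the variable and function names are illustrative.

```python
TOTAL_SENTENCES = 200

# Counts reported in Table 8.5 (human evaluation results).
TABLE_8_5 = {
    "Correct subject verb agreement": 186,
    "Correct tense translation": 190,
    "Correct Noun verb generation": 180,
    "Correct word order": 185,
}

# Counts reported in Table 8.6 (accuracy results).
TABLE_8_6 = {
    "Perfect translation": 143,
    "Basically OK": 52,
    "Meaningless": 2,
    "Error": 3,
}


def percent(count, total=TOTAL_SENTENCES):
    """Share of the sample, as a percentage."""
    return 100.0 * count / total


if __name__ == "__main__":
    for table in (TABLE_8_5, TABLE_8_6):
        for case, count in table.items():
            print(f"{case}: {count} of {TOTAL_SENTENCES} = {percent(count):.1f}%")
    acceptable = TABLE_8_6["Perfect translation"] + TABLE_8_6["Basically OK"]
    print(f"Perfect or basically OK: {percent(acceptable):.1f}%")
```

The four categories of Table 8.6 add up to the 200 sample sentences, so the perfect and basically OK shares can be summed directly.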
Using all of the above results, the Word Error Rate (WER), the Sentence Error Rate
(SER) and the accuracy of the system are calculated. Table 8.7 shows the result of
the error calculation.
Table 8.7: Final evaluation results
Evaluation | Percentage
Word Error Rate (WER) | 7.2 %
Sentence Error Rate (SER) | 5.4 %
Intelligibility and Accuracy | 89.1 %
Figure 8.5: Translation accuracy
8.8 Summary
This chapter discussed the evaluation of the English to Sinhala Machine Translation
system (BEES). The evaluation was conducted in three steps. First, a white-box
testing approach was used to test each module in the machine translation system
with the developed testing tools. Then, the system performance was evaluated and
the error rate was calculated from the results of the evaluation test bed. Finally,
the intelligibility and accuracy test was conducted with human support. The
experimental results show 89% accuracy for the overall system, with a 7.2% word
error rate and a 5.4% sentence error rate.
Chapter 9
CONCLUSION AND FURTHER WORK
9.1 Introduction
Chapter 8 presented how BEES was evaluated to check the hypothesis that
“concepts of Varanegeema (Conjugation) can be used to drive English to Sinhala
Machine translation.” The hypothesis was tested by checking whether BEES is
able to translate English text into Sinhala. This chapter discusses the conclusions
drawn from the evaluation process. The evaluation found 89% overall accuracy for
BEES, with a 7.2% word error rate and a 5.4% sentence error rate. This chapter also
reports on some limitations and further work.
9.2 Revisited Objectives
In order to conclude the thesis, the objectives are recapitulated as follows. The
achievement of each objective is then discussed separately.
Objective 1
Critically review the existing systems, concepts and tools for machine
translation.
Objective 2
Develop a Computational grammar for Sinhala Language
Objective 3
Design and develop English to Sinhala Machine Translation system
Objective 4
Evaluate the system
The first objective is to “critically review the existing systems for machine
translation”. Machine translation is a subfield of Natural Language
Processing within Artificial Intelligence. During the last sixty years,
hundreds of machine translation systems have been developed all over the world.
Most of these systems have been developed using rule-based, statistical,
agent-based or human-assisted approaches. All of these approaches and 35 successful
systems were discussed in the second chapter. In addition, the available English
to Sinhala prototype machine translation systems were also discussed in the second
chapter. Further, the author has critically reviewed the existing concepts and
techniques for Natural Language Processing, with particular attention to machine
translation, and each concept or technique was discussed. Therefore, the
author has successfully achieved the first objective.
The second objective is to “Develop a Computational grammar for Sinhala
Language”. To achieve this objective, the syntax of the Sinhala language has been
implemented as a Context-Free Grammar. In addition, 85 grammar rules for Sinhala
nouns and 18 rules for Sinhala verbs have been implemented through the paradigm
approach. These paradigm tables are used to generate all the word forms through
the concept of Varanegeema. Therefore, the author has successfully achieved the
second objective.
The third objective is to “Design and develop English to Sinhala Machine
Translation system”. To achieve this objective, the English to Sinhala machine
translation system has been designed with seven modules, namely the English
Morphological Analyzer, English Parser, English to Sinhala Base Word Translator,
Transliteration module, Sinhala Morphological Generator, Intermediate Editor and
Sinhala Parser. Four dictionaries have been developed as the lexical resources of
the system. Therefore, the author has successfully achieved this objective.
The final objective is to “Evaluate the system”. The English to Sinhala machine
translation system has been evaluated in three stages. In the first stage, a
white-box testing approach was used to test each module in the machine translation
system with the developed testing tools. Then, the system performance was evaluated
and the error rate was calculated using the evaluation test bed. Finally, an
intelligibility and accuracy test was conducted with human support. The
experimental results show 89% accuracy for the system, with a 7.2% word error rate
and a 5.4% sentence error rate. According to the above facts, the author has
successfully achieved this objective too.
9.3 Limitations
The English to Sinhala machine translation system has been developed as a
rule-based system, and the translation process is carried out by the translation
modules, namely the English morphological analyzer, English parser, translator,
Sinhala morphological generator and the Sinhala sentence composer. The system has
several limitations.
The translation system works perfectly on simple sentences. Translation of small
complex sentences also shows reasonably accurate results. However, the translation
system does not successfully handle multi-word expressions, idioms and compound
sentences. At present, the lexical resources in the system are limited. For example,
the bilingual dictionary requires regular updating until the system gets away from
the out-of-vocabulary issue.
9.4 Further Works
This project has laid a foundation for various projects in machine translation.
Several major areas of further work can be identified as follows.
In particular, the concept of Varanegeema (conjugation) can be tried out for machine
translation systems that deal with languages close to Sinhala. For instance, Tamil
and many other Indian languages can use the theoretical basis of BEES for
the development of systems for machine translation from English to those languages.
It should be noted that all languages have some concept similar to conjugation.
The system can also be expanded with more lexical resources such as
dictionaries. In fact, BEES can be updated via the intermediate editor while it is
being used. It would be appropriate to encourage human-assisted translation until
the system matures with enough resources.
Handling compound sentences and expansion to the parser for handling more
grammatical structures would also be another direction of further work. In addition,
it would be worth considering the use of Agent technology for improving various
aspects of BEES including, Semantic handling and autonomous updating of lexical
resources.
Sinhala to English machine translation (reverse of BEES) would also be yet
another interesting further work.
9.5 Summary
This chapter presented the conclusions for each objective and the limitations of the
English to Sinhala machine translation system. In addition, it pointed out some
further work related to English to Sinhala machine translation. The conclusions
support the author’s aim of developing a machine translation system that works
through the concept of Varanegeema. With respect to the hypothesis formulated in
this thesis, the author’s evaluation showed that the English to Sinhala machine
translation system (BEES) achieves the aim and objectives of this thesis.
Appendix A:
English Morphological Analyzer - Test Plan
The following rules are used to analyze English regular words such as nouns, verbs
and adjectives. In addition to these rules, other available words such as irregular
nouns, irregular verbs, adjectives, adverbs, conjunctions and articles are directly
identified from the English dictionary.
No | Grammar | Morphological structure | Base word | Example
Test case: Morphological rules for English Noun
1 | Singular noun | Base word | boy | boy
2 | Plural noun | Base + s | boy | Boys
3 | Plural noun | Base + es | class | Classes
4 | Plural noun | Base – y + ies | baby | Babies
5 | Plural noun | Base – f + ves | knife | Knives
6 | Singular Possessive | Base + ’s | Home | Home’s
7 | Plural Possessive | Plural + ’ | boy | Boys’
8 | Singular noun | Verb Base + er | play | player
9 | Plural noun | Verb Base + ers | play | players
10 | Singular noun | Verb Base + ment | Pay | payment
11 | Plural noun | Verb Base + ments | Pay | payments
12 | Singular noun | Verb Base + ion | |
13 | Plural noun | Verb Base + ions | |
14 | Singular noun (female) | Base Noun + ess | |
Test case: Morphological rules for English Adjective
15 | Adjective (Positive) | Adjective Base | good | good
16 | Adjective (Positive) | Noun Base + ish | Boy | Boyish
17 | Adjective (Positive) | Noun Base + ful | Care | Careful
18 | Adjective (Positive) | Noun Base + less | shame | Shameless
19 | Adjective (Positive) | Noun Base + en | gold | Golden
20 | Adjective (Positive) | Verb Base + less | Tire | Tireless
21 | Adjective (Positive) | Verb Base + ative | Talk | Talkative
22 | Adjective (Positive) | Verb Base + able | Move | Moveable
23 | Adjective (Comparative) | Adjective + er | sweet | Sweeter
24 | Adjective (Comparative) | Adjective + r | fine | finer
25 | Adjective (Comparative) | Adjective – y + ier | happy | Happier
26 | Adjective (Superlative) | Adjective + est | sweet | Sweetest
27 | Adjective (Superlative) | Adjective + st | fine | finest
28 | Adjective (Superlative) | Adjective – y + iest | happy | Happiest
Test case: Morphological rules for English regular verbs
29 | Infinitive | Base | play | play
30 | Past | Base + ed | play | Played
31 | Present Participle | Base + ing | play | Playing
32 | Past Participle | Base + ed | play | Played
33 | Past | Base + d | |
34 | Past Participle | Base + d | |
35 | Past | Base + ped | |
36 | Past Participle | Base + ped | |
37 | Past | Base + ied | |
38 | Past Participle | Base + ied | |
Test case: irregular verbs
39 | Simple present tense | Base + s | play | Plays
40 | Present Participle | Base + ing | walk | Walking
41 | Simple present tense | Base + s | walk | Walks
42 | Simple present tense | Base + es | go | goes
43 | Present Participle | Base – e + ing | write | writing
44 | Present Participle | Base + t + ing | |
45 | Present Participle | Base + r + ing | |
46 | Present Participle | Base – e + ing | write | writing
Test case: Determination
47 | Direct/indirect | the / a, an | a | a
Test case: Adverb
48 | Adverb | Base | quickly |
Test case: unknown
49 | Unknown word | Base | Budditha | Budditha
Test case: Conjunction
50 | Conjunction | Base | and | and
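The noun rules in this test plan can be sketched as a small suffix-stripping
analyzer in Python. This is only an illustration under stated assumptions: the toy
lexicon, the function name and the rule order are hypothetical, and the thesis
analyzer itself is not reproduced here. Because the example "knife" ends in "fe",
the sketch also tries "– fe + ves" alongside the "Base – f + ves" rule.

```python
# Toy lexicon of base nouns (assumption, taken from the example column).
NOUNS = {"boy", "class", "baby", "knife", "home"}


def analyze_noun(word, lexicon=NOUNS):
    """Return (base word, grammar) for a noun form, or None if no rule fits."""
    w = word.lower().replace("\u2019", "'")  # accept typographic apostrophes

    # Base + 's  -> singular possessive
    if w.endswith("'s") and w[:-2] in lexicon:
        return (w[:-2], "singular possessive")
    # Plural + '  -> plural possessive
    if w.endswith("'"):
        inner = analyze_noun(w[:-1], lexicon)
        if inner and inner[1] == "plural noun":
            return (inner[0], "plural possessive")
        return None
    # Base word  -> singular noun
    if w in lexicon:
        return (w, "singular noun")

    # Plural rules, most specific suffix first.
    candidates = []
    if w.endswith("ies"):
        candidates.append(w[:-3] + "y")                 # Base - y + ies
    if w.endswith("ves"):
        candidates += [w[:-3] + "f", w[:-3] + "fe"]     # Base - f + ves
    if w.endswith("es"):
        candidates.append(w[:-2])                       # Base + es
    if w.endswith("s"):
        candidates.append(w[:-1])                       # Base + s
    for base in candidates:
        if base in lexicon:
            return (base, "plural noun")
    return None


for form in ["boy", "Boys", "Classes", "Babies", "Knives", "Home's", "Boys'"]:
    print(form, analyze_noun(form))
```

Words that match no rule return None, which corresponds to the "Unknown word" test
case; irregular forms would be looked up directly in the dictionary, as described
at the start of this appendix.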
Appendix B:
Conjugation Table for Sinhala Language
Sinhala Noun Conjugation (Singular forms)
wxl
h
,sx.h
m%lD;sh
mq
mq
mq
mq
mq
mq
b
b
b
b
b
b
b
b
k
k
k
k
k
k
k
k
k
mq
mq
mq
mq
mq
mq
mq
exampl
e
;reK
foaj;d
<ud
.srd
;dr
us;=re
w.k
,sh
Fj<U
wx.kd
hqj;s
.eyekq
l;a
uja
wdor
?
wl=re
.dia;=
ish,q
f.a
tla
nsla
.x
we;a
fldla
f.dka
kslua
lsUq,a
ñksia
Wmdil
mq
lmq
jd
33
mq
jiq
34
mq
bis
35
36
37
mq
mq
mq
bns
l,jeos
fn,s
38
39
40
mq
mq
mq
fmdvs
usgs
fld,q
41
mq
fnanoq
ai
d
ai
d
and
aod
A,
d
avd
agd
A,
d
aod
01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
20
21
22
23
25
26
27
28
29
30
31
32
.Kh
we;a
.Kh
w,s
.Kh
;drd
.Kh
jiq
.Kh
ksh; tal
a
hd
jd
hd
jd
jd
d
j
h
sh
r
example
a
;reKhd
foaj;djd
<uhd
.srjd
;drdjd
us;=rd
w.k
,sh
fj<U
wx.kdj
hqj;sh
.eyeksh
l;
uj
wdorh
/h
wl=r
.dia;=j
ish,a,
f.h
tl
nsl
..
we;d
fldld
f.dkd
kslud
lsUq,d
usksid
Wmdilh
d
lmqjd
fhla
fjla
fhla
fjla
fjla
frla
la
la
la
jla
hla
shla
la
la
hla
hla
la
jla
Q,la
hla
la
la
.la
f;la
flla
fkla
fula
f,la
fila
fhla
q
jiaid
afila
q
jiafila
s
biaid
afila
s
biafila
s
s
s
bnand
l,jeoaod
fn,a,d
afnla
afola
Af,la
s
s
s
bnafnla
l,jeoafola
fn,af,la
s
s
q
fmdvavd
usgagd
fld,a,d
afvla
afgla
Af,la
s
s
q
fmdvafvla
usgafgla
fld,af,la
q
fnanoaod
afola
q
fnanoafola
d
d
d
e
-
q
a
a
h
h
e
j
A,
h
.
d
d
d
d
d
d
hd
wksh; Wla;
q
a
a
a
x
a
a
a
a
a
a
r
d
d
d
re
q
a
a
e
q
a
a
a
x
;a
la
ka
ua
,a
ia
fjla
example
;reKfhla
foaj;dfjla
<ufhla
.srfjla
;drdfjla
ñ;=frla
w.kla
,shla
fj<Ula
Wx.kdjla
hqj;shla
.eyekshla
l;la
ujla
wdorhla
/hla
wl=rla
.dia;=jla
ish,a,la
f.hla
tlla
nslla
..la
wef;la
fldflla
f.dfkla
kslfula
lsUqf,la
usksfila
Wmdilfhla
lmqfjla
42
mq
Wl=iq
43
44
mq
mq
llal=gq
ldl=
45
mq
46
47
48
49
50
51
52
53
54
55
56
57
58
mq
mq
mq
mq
mq
mq
mq
mq
mq
mq
w
w
w
lerfm
d;=
lmqgq
weoqre
nuqKq
ljqvq
.=re,q
yQkq
nur
f.dak
fjo
f,v
fmd;a
wlaIr
NdId
w
w
w
w
w
w
w
w
w
w
w
w
ms,s
le;s
fros
bks
neus
mdmsis
Fldl=
w;=
wjqreoq
l=,q
fldiq
weos
w
w
w
w
w
w
w
w
w
w
w
w
w
w
loq
,oq
Wvq
fydUq
wl=re
wl=Kq
loq,q
f;gq
fljsgs
wK
iq,x
<sx
i<x
kqjr
w
fl<
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
lmqgq
.Kh
nur
.Kh
fmd;a
wlaIr
NdId
.Kh
ms,s
.Kh
wl=re
.Kh
fmdf,da
.Kh
kqjr
.Kh
uq;=
.Kh
ai
d
agd
al
d
A;
d
d
d
d
d
d
d
d
d
d
d
q
Wl=iaid
afila
q
Wl=iafila
q
q
llal=gagd
ldlald
afgla
aflla
q
q
Llal=gafgla
ldlaflla
q
lerfmd
;a;d
lmqgd
weoqrd
neuqKd
ljqvd
.=re,d
yQkd
nurd
f.dakd
fjod
f,vd
fmd;
wlaIrh
NdIdj
Af;la
q
fgla
frla
fKla
fvla
f,la
fkla
frla
fkla
fola
fvla
la
hla
jla
gqq
re
Kq
vq
,q
kq
r
k
o
v
a
lerfmd;af;
la
lmqfgla
weoqfrla
nuqfKla
ljqfvla
.=ref,la
yQfkla
nufrla
f.dafkla
fjfola
f,fvla
fmd;la
wlaIrhla
NdIdjla
q
e
q
q
q
q
a
h
j
A,
A;
ao
ak
au
ai
al
A;
ao
A,
ai
ka
o
.
o
U
s
s
s
s
s
s
=
=
q
q
q
os
ms,a,
le;a;
froao
bkak
neuau
mdmsiai
fldlal
w;a;
wjqreoao
l=,a,
fldiai
wekao
A,la
A;la
aola
akla
aula
aila
alla
A;la
aola
A,la
aila
s
s
s
s
s
s
=
=
q
q
q
ms,a,la
le;a;la
froaola
bkakla
neuaula
mdmsiaila
fldlalla
w;a;la
wjqreoaola
l=,a,la
fldiaila
e
q
q
q
s
wl=r
Wl=K
loq<
f;dg
fljsg
wK
iq,.
<o
i<U
kqjr
la
la
la
la
la
la
.la
ola
Ula
la
e
q
q
q
wl=rla
wl=Kla
loq<la
f;dgla
fljsgla
wKla
iq,.la
<sola
i<Ula
kqjrla
fld<
j,
x
x
x
x
x
x
fl<j,
Sinhala Noun Conjugation (Singular forms)
wxlh
01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
20
21
22
23
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
.K
h
we;a
.K
h
w,s
.K
h
;drd
.K
h
jiq
.K
h
,sx.h
mq
mq
mq
mq
mq
mq
b
b
b
b
b
b
b
b
k
k
k
k
k
k
k
k
k
mq
mq
mq
mq
mq
mq
mq
m%lD;sh
example
;reK
foaj;d
<ud
.srd
;dr
us;=re
w.k
,sh
Fj<U
wx.kd
hqj;s
.eyekq
l;a
uja
wdor
?
wl=re
.dia;=
ish,q
f.a
tla
nsla
.x
we;a
fldla
f.dka
kslua
lsUq,a
ñksia
Wmdil
a
fhl=
fjl=
fhl=
fjl=
fjl=
frl=
l
l
l
jl
hl
shl
l
l
hl
hl
l
jl
a,l
hl
l
l
.la
l=
fll=
fkl=
ful=
f,l=
fil=
hl=
mq
lmq
fjl=
mq
mq
mq
mq
mq
mq
mq
mq
mq
mq
mq
mq
jiq
bis
bns
l,jeos
fn,s
fmdvs
usgs
fld,q
fnanoq
Wl=iq
llal=gq
ldl=
Ail=
Ail=
Anl=
Aol=
A,l=
Avl=
Agl=
A,l=
Aol=
Ail=
Agl=
All=
wksh; wkqla;
r
example
;reKfhl=
foaj;dfjl=
d
<ufhl=
d
,srfjl=
d
;drdfjl=
re
ñ;=frl=
w.kl
,shl
fj<Ul
wx.kdjl
hqj;shl
q
.eyekshl
a
l;l
a
ujl
wdorhl
/hl
e
wl=rl
.dia;=jl
q
ish,a,l
a
f.hl
a
tll
a
nsll
x
..l
a
we;l=
a
fldfll=
a
f.dfkl=
ua
kslful=
,a
lsUqf,l=
ia
Usksfil=
Wmdilfhl=
lmqfjl=
q
s
s
s
s
s
s
q
q
q
q
q
jiail=
biail=
Bnanl=
l,jeoaol=
fn,a,l=
fmdvavl=
usgagl=
fld,a,l=
fnanoaol=
Wl=iail=
Llal=gagl=
ldlall=
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
mq
mq
mq
mq
lmqgq
.Kh mq
mq
mq
mq
nur mq
.Kh mq
mq
fmd; w
a
.Kh
wlaI w
r
.Kh
NdId w
.Kh
w
w
w
w
w
w
w
ms,s w
.Kh w
w
w
w
w
w
w
w
wl=re w
.Kh w
w
w
w
w
fmdf w
w
,da
.Kh w
kqjr w
.Kh
uq;=
w
.Kh
q
lerfmd;a;l=
lmqgl=
weoqrl=
nuqKl=
ljqvl=
.=re,l=
yQkl=
nurl=
f.dakl=
fjol=
f,vl=
fmd;la
lerfmd;=
lmqgq
weoqre
nuqKq
ljqvq
.=re,q
yQkq
nur
f.dak
fjo
f,v
fmd;a
A;l=
l=
l=
l=
l=
l=
l=
l=
l=
l=
l=
l
wlaIr
hl
wlaIrhl
NdId
jl
NdIdjl
ms,s
le;s
fros
bks
neus
mdmsis
Fldl=
w;=
wjqreoq
l=,q
fldiq
weos
loq
,oq
Wvq
fydUq
wl=re
wl=Kq
loq,q
f;gq
fljsgs
wK
iq,x
<sx
i<x
kqjr
A,l
A;l
aol
akl
aul
ail
all
A;l
aol
A,l
ail
s
s
s
s
s
s
=
=
q
q
q
ms,a,l
le;a;l
froaol
bkakl
neuaul
mdmsiail
fldlall
w;a;l
wjqreoaol
l=,a,l
fldiail
l
l
l
l
l
l
l
l
l
l
e
q
q
q
wl=rl
wl=Kl
q
q
q
q
q
q
a
x
x
x
f;dgl
fljsgl
wKl
iq,.l
,sol
i<Ul
kqjrl
fl<
Sinhala Noun Conjugation (Plural forms)
wxl
h
01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
20
21
22
23
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
.Kh
we;a
.Kh
w,s
.Kh
;drd
.Kh
jiq
.Kh
lmqgq
.Kh
nur
,sx
.
h
mq
mq
mq
mq
mq
mq
b
b
b
b
b
b
b
b
k
k
k
k
k
k
k
k
k
mq
mq
mq
mq
mq
mq
mq
m%lD;sh
nyqjpkWla;
;reK
foaj;d
<ud
.srd
;dr
us;=re
w.k
,sh
fj<U
wx.kd
hqj;s
.eyekq
l;a
uja
wdor
?
wl=re
.dia;=
ish,q
f.a
tla
nsla
.x
we;a
fldla
f.dka
kslua
lsUq,a
Usksia
Wmdil
fhda
fjda
hs
ja
fjda
frda
fkda
fhda
fUda
fjda
fhda
yq
jre
mq
example
a
r
-
example
a
hska
jka
hska
jqka
jka
ka
ka
ka
qka
jka
hka
ka
=ka
jreka
hka
j,a
-
-
;reKfhda
foaj;dfjda
<uhs
.srja
;drdfjdA
Us;=frda
w.fkda
,sfhda
fj<fUda
wx.kdfjda
hqj;sfhda
.eyekq
l;ayq
ujqjre
wdor
/hj,a
wl=re
.dia;=
j,a
a
f.j,a
-
a
d
d
d
re
k
U
-
nyqwkql;
a
r
d
d
d
e
-
a
-
example
;reKhska
foaj;djka
<uhska
.srjqka
;drdjka
Us;=rka
w.kka
,shka
fj<Uqka
wx.kdjka
hqj;shka
.eyekqka
l;=ka
ujqjreka
wdorhka
/hj,a
wl=re
.dia;=
-
-
-
-
nsla
;=
l=
kq
uq
,q
iq
fhda
nsla
.x
we;a;=
fldlal=
f.dkakq
ksluauq
lsUq,a,q
usksiaiq
Wmdilfhda
=ka
l=ka
kqka
uqka
,qka
iqka
hka
a
we;=ka
fldlalk
= a
f.dkakqka
ksluqka
lsUq,qka
usksiqka
Wmdilhka
lmq
fjda
lmqfjdA
jka
mq
mq
mq
mq
mq
mq
mq
mq
mq
mq
mq
mq
mq
jiq
bis
bns
l,jeos
fn,s
fmdvs
usgs
fld,q
fnanoq
Wl=iq
llal=gq
ldl=
lerfmd;=
afida
afida
afnda
afoda
Af,da
afvda
afgda
Af,da
afoda
Afida
afgda
aflda
Af;da
q
s
s
s
s
s
s
q
q
q
q
q
q
jiafida
biafida
bnafnda
l,jeoafoda
fn,af,da
fmdvafvda
usgafgda
fld,Af,da
fnanoafoda
Wl=iafida
llal=gafgda
aika
aika
anka
aoka
A,ka
avka
agka
A,ka
aoka
aika
agka
alka
A;ka
mq
mq
mq
mq
mq
mq
mq
lmqgq
weoqre
nuqKq
ljqvq
.=re,q
yQkq
nur
fgda
frda
fKda
fvda
f,da
fkda
e
gq
gq
Kq
vq
,q
kq
lerfmd;af
;da
lmqfgda
weoqfrda
nuqfKda
ljqfvda
.=ref,da
yQfkda
nure
ka
ka
ka
ka
ka
ka
eka
lmqjka
q
s
s
s
s
s
s
q
q
q
q
q
q
q
q
q
q
q
q
jiaika
biaika
bnanka
l,jeoaoka
fn,a,ka
fmdvavka
usgagka
fld,a,ka
fnanoaoka
Wl=iaika
llal=gagka
ldlalka
lerfmd;a;
ka
lmqgka
weoqrka
nuqKka
ljqvka
.=re,ka
yQkka
nureka
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
.Kh
fmd;a
.Kh
wlaIr
.Kh
NdId
.Kh
ms,s
.Kh
wl=re
.Kh
fmdf,da
.Kh
kqjr
.Kh
uq;=
.Kh
mq
mq
mq
w
f.dak
fjo
f,v
fmd;a
Akq
Aoq
Avq
f.dakakq
fjoaoq
f,vavq
fmd;
qka
qka
qka
f.dakqka
fjoqka
f,vqka
fmd;a
w
wlaIr
wlaIr
wlaIr
w
NdId
NdId
NdId
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
w
ms,s
le;s
fros
bks
neus
mdmsis
Fldl=
w;=
wjqreoq
l=,q
fldiq
weos
loq
,oq
Wvq
fydUq
wl=re
wl=Kq
loq,q
f;gq
fljsgs
wK
iq,x
<sx
i<x
kqjr
ms,s
Le;s
fros
bks
neus
mdmsis
fldl=
w;=
wjqreoq
l=,q
fldiq
ms,s
Le;s
fros
bks
neus
mdmsis
fldl=
w;=
wjqreoq
l=,q
fldiq
wl=re
wl=Kq
Loq,q
F;dgq
fljsgs
wK
iq,x
,sx
i<x
kqjr
wl=re
wl=Kq
Loq,q
f;dgq
fljsgs
wK
iq,x
,sx
i<x
kqjr
w
fl<
Appendix C:
Context-Free Grammar for Sinhala Language
Grammar notations
SubP = Subject Phrase
VebP = Verb Phrase
Sub = Subject
Obj = Object
ObjP = Objective Phrase
AdjSub = Attributive adjunct of Subject
AdjObj = Attributive adjunct of Object
Pre = Predicate
AdjPre = Attributive adjunct of Predicate
AdjCmp = Attributive adjunct of Complement
CmpPre = Complement of predicate
CmpPreP = Complement of predicate phrase
MisKri = Present Participle (Misra Kriya)
Noun = Sinhala Noun
Verb = Sinhala Verb (Indicative Mood)
Advb = Sinhala Adverb
Context-Free Grammar for Sinhala
S → SubP VebP
SubP → Sub
SubP → AdjSub Sub
VebP → ObjP PreP
VebP → PreP
ObjP → Obj
ObjP → AdjObj Obj
PreP → AdjPre CmpPrep
PreP → CmpPrep
CmpPrep → Pre
CmpPrep → Pre CmpPre
CmpPre → Cmp
CmpPre → AdjCmp Cmp
Sub → Noun
AdjSub → Noun
Obj → Noun
AdjObj → Noun
AdjPre → Adv
Cmp → Noun
AdjCmp → Noun
AdjObj → Noun
AdjObj → Noun Noun
AdjObj → Noun Preposition Noun
AdjSub → Noun
AdjSub → Noun Noun
AdjSub → Noun Preposition Noun
Adv → Advb
Adv → Advb Preposition Advb
Pre → Verb
Pre → MisKri Verb
Appendix D:
Finite State Transducer for Sinhala Transliteration
Model 1: For Original English text
[State diagram. States: A, B, V1, V2, V3, V4. Transition labels: a; e; i; o; r; "e, r"; "o, u"; "w, u"; "a, e, i, o, u, y".]
FST for Vowels in Model 1 transliteration
[State diagram. States: C, D, C1 to C8, D1, D2. Transition labels: c; d; e; g; h; k; l; n; t; v; "t, e, s, c, g"; q0.]
q0 = {b, c, d, f, g, h, j, k, l, m, n, p, q, r, s, t, v, w, x, y, z}
FST for Consonants in Model 1 transliteration
Model 2: For Sinhala words that are written in English
[State diagram. States: A, B, D1, V1 to V7. Transition labels: a; e; i; o; u; I; r; "o, u"; Q1; Q2.]
Q1 = { a, e, i, o, u, Ǐ, ŕ }, Q2 = { a, e, i }
FST for Vowels in Model 2 transliteration
[State diagram. States: C, D, C1 to C7, D1 to D4. Transition labels: b; d; h; j; l; n; s; t; "n, d, y"; "d, j"; Q1; Q2.]
Q1 = { k, g, c, j, t, d, b, m, y, r, f, v, s, h, l, n, p }
Q2 = { k, g, c, j, t, d, b, s, p }
FST for Consonants in Model 2 transliteration
Appendix E:
Sample Evaluation form
Appendix F:
Sample of Evaluators' Comments
The following sample shows some of the evaluators' comments collected during the evaluation.