Tamil Shallow Parser using Machine Learning Approach
Dhanalakshmi V¹, Anand Kumar M¹, Soman K P¹ and Rajendran S²
¹Computational Engineering and Networking,
Amrita Vishwa Vidyapeetham, Coimbatore, India
{m_anandkumar, v_dhanalakshmi, kp_soman}@cb.amrita.edu
²Tamil University, Thanjavur, India
________________________________________________________________________
Abstract
This paper presents a Shallow Parser for Tamil built using a machine learning
approach. A Tamil Shallow Parser is an important module in Machine Translation
from Tamil to any other language, and a key component in many NLP
applications. It helps machines understand natural language and is also
useful for second language learners. The Tamil Shallow Parser was developed
using state-of-the-art machine learning methods. A POS Tagger, Chunker,
Morphological Analyzer and Dependency Parser were built to implement the
Tamil Shallow Parser. These modules give encouraging results.
Introduction
Partial or Shallow Parsing is the task of recovering a limited amount of
syntactic information from a natural language sentence. A full parser often
provides more information than is needed, and sometimes less. For example,
in Information Retrieval, it may be enough to find simple NPs (Noun Phrases)
and VPs (Verb Phrases). In Information Extraction, Summary Generation, and
Question Answering Systems, information about specific syntactico-semantic
relations such as subject, object, location, time, etc., is needed rather
than elaborate configurational syntactic analyses. In full parsing, grammar
and search strategies are used to assign a complete syntactic structure to
sentences. The main problem here is to select the most plausible syntactic
analysis from the thousands of possible analyses that a typical parser with
a sophisticated grammar may return. This complexity makes machine learning
an attractive option in comparison to handcrafted rules.
Methodology
A machine learning approach is applied here to develop the shallow parser
for Tamil. The part-of-speech tagger for Tamil was built using the Support
Vector Machine approach [Dhanalakshmi V et al., 2009]. The morphological
analyzer for Tamil was developed using a novel machine learning approach
[Anand Kumar M et al., 2009]. The Tamil Chunker was developed using the CRF++
tool [Dhanalakshmi V et al., 2009]. Finally, the Tamil Dependency Parser, which
is used to find syntactico-semantic relations such as subject, object, location,
time, etc., was built using MALT Parser [Dhanalakshmi V et al., 2011].
General Framework and Modules

The general block diagram for Tamil Shallow parser is given in Figure 1.
[Figure 1. General Framework for Tamil Shallow Parser. Blocks shown: Input
Sentence, Tokenization, POS Tagging, Morphological Analyzer, Chunking, Format
Conversion, MALT Parser for Relation finding, Shallow Parsed Sentence.]

Tamil Part-of-Speech Tagger [Dhanalakshmi V et al., 2009]: Part of
Speech (POS) tagging is the process of assigning a part of speech or other
lexical class marker (noun, verb, adjective, etc.) to every word in a
sentence. The POS tagger for Tamil was developed using SVMTool
[Jesús Giménez and Lluís Màrquez, 2004].

Tamil Morphological Analyzer [Anand Kumar M et al., 2009]:
Morphological Analysis is the process of breaking down morphologically
complex words into their constituent morphemes. It is the primary step for
word formation analysis of any language. The Morphological Analyzer was
developed using a novel machine learning approach and was implemented
using SVMTool.

Tamil Chunker [Dhanalakshmi V et al., 2009]: Chunks are normally
taken to be non-recursive, correlated groups of words. The Chunker divides a
sentence into its major non-overlapping phrases (noun phrase, verb phrase,
etc.) and attaches a label to each chunk. The Chunker for Tamil was
developed using the CRF++ tool [Sha F and Pereira F, 2003].
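
The idea of non-overlapping, labeled chunks can be illustrated with a short
sketch. It assumes the chunker emits one label per token in a B-/I-/O scheme
(an assumption for illustration; the source does not state the label
encoding), and the tokens w1..w4 are hypothetical placeholders rather than
real data.

```python
def group_chunks(tagged):
    """Group (token, label) pairs into (chunk_type, [tokens]) spans.

    Labels are assumed to follow a B-/I-/O scheme: "B-NP" starts a noun
    phrase, "I-NP" continues it, and "O" marks a token outside any chunk.
    """
    chunks = []
    for token, label in tagged:
        prefix, _, chunk_type = label.partition("-")
        continues = (
            prefix == "I" and chunks and chunks[-1][0] == chunk_type
        )
        if continues:
            # Extend the chunk that is currently open.
            chunks[-1][1].append(token)
        elif prefix in ("B", "I"):
            # "B-" always opens a chunk; a stray "I-" is treated as a start.
            chunks.append((chunk_type, [token]))
        else:
            # Tokens outside any chunk are kept as single-token "O" spans.
            chunks.append(("O", [token]))
    return chunks


# Hypothetical chunker output for a four-token sentence.
example = [("w1", "B-NP"), ("w2", "I-NP"), ("w3", "B-VP"), ("w4", "O")]
spans = group_chunks(example)
```

Because every token lands in exactly one span, the result is non-overlapping
by construction, which matches the definition of chunks given above.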

Tamil Dependency Parser for Relation finding [Dhanalakshmi V et al.,
2011]: Given the POS tags, morphological information and chunks of a
sentence, this module decides which relation each word has with the main verb
(subject, object, location, etc.). The dependency parser for Tamil was
developed using the Malt Parser tool [Joakim Nivre and Johan Hall, 2005].
Dependency Parsing using Malt Parser
The MALT Parser tool, which uses a supervised machine learning algorithm, is
used for dependency parsing. Using this tool, the dependency relation and the
position of the head are obtained for each word of a Tamil sentence. Each
word in the training data is described by 10 fields, which can be user
defined. For Tamil dependency parsing, the following features are defined;
the others are set to NULL and written as ‘_’ in the training data format.
WordID: Position of each word in the input sentence.
Words: Each word in the input sentence.
CPos Tag and Pos Tag: The part of speech of each word.
Head: The position of the parent of each word.
Lemma: The lemma of the word.
Morph Features: The morphological features of the word.
Chunk: The chunk information of the word.
Dependency Relation: The label given to each parent-child relation.
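
A minimal sketch of the format conversion step follows. It assumes the ten
fields follow the CoNLL-X column order commonly used as MaltParser input (ID,
FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL); the
mapping of the feature names above onto these columns, the helper names, and
the placeholder word "w8" are assumptions for illustration, not the authors'
implementation.

```python
# Assumed column order (CoNLL-X); any field not supplied is written as "_".
FIELDS = [
    "ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
    "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL",
]


def format_row(token):
    """Render one token (a dict keyed by field name) as a tab-separated row."""
    return "\t".join(str(token.get(name, "_")) for name in FIELDS)


def format_sentence(tokens):
    """Render a sentence: one row per token, closed by a blank line."""
    return "\n".join(format_row(t) for t in tokens) + "\n\n"


# Hypothetical token: the finite verb that heads the sentence (HEAD = 0).
root_token = {
    "ID": 8, "FORM": "w8", "CPOSTAG": "<VF>", "POSTAG": "<VF>",
    "HEAD": 0, "DEPREL": "<ROOT>",
}
row = format_row(root_token)
```

Writing undefined fields as "_" keeps every row at the same width, so the
parser can rely on column positions alone.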
Sample Training Data
1 அவள் _ <PRP> <PRP> 8 <N.SUB> _ _
2 முட்டைகளை _ <NN> <NN> 3 <D.OBJ> _ _
3 வாங்கி _ <VNAV> <VNAV> 4 <ATT> _ _
4 சடைத்து _ <VNAV> <VNAV> 6 <VNAV.MOD> _ _
5 தட்டில் _ <NN> <NN> 6 <NST.MOD> _ _
6 போட்டு _ <VNAV> <VNAV> 8 <V.COMP> _ _
7 உனக்கு _ <PRP> <PRP> 8 <I.OBJ> _ _
8 கொடுக்கின்றான் _ <VF> <VF> 0 <ROOT> _ _
9 . _ <DOT> <DOT> 8 <SYM> _ _
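
The Head column is what turns these rows into a tree. The sketch below copies
only the word ID, POS tag, head position and relation label from the sample
(word forms are omitted) and shows how the root and the dependents of the main
verb can be read off. The function names are illustrative, not part of MALT
Parser.

```python
from collections import defaultdict

# (word ID, POS tag, head ID, dependency relation) from the sample rows.
ROWS = [
    (1, "<PRP>", 8, "<N.SUB>"),
    (2, "<NN>", 3, "<D.OBJ>"),
    (3, "<VNAV>", 4, "<ATT>"),
    (4, "<VNAV>", 6, "<VNAV.MOD>"),
    (5, "<NN>", 6, "<NST.MOD>"),
    (6, "<VNAV>", 8, "<V.COMP>"),
    (7, "<PRP>", 8, "<I.OBJ>"),
    (8, "<VF>", 0, "<ROOT>"),
    (9, "<DOT>", 8, "<SYM>"),
]


def find_root(rows):
    """Return the ID of the single word whose head is 0."""
    roots = [word_id for word_id, _pos, head, _rel in rows if head == 0]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")
    return roots[0]


def children_by_head(rows):
    """Map each head ID to its (dependent ID, relation) pairs."""
    tree = defaultdict(list)
    for word_id, _pos, head, rel in rows:
        tree[head].append((word_id, rel))
    return dict(tree)


def path_to_root(rows, word_id):
    """Follow head links from a word up to the root."""
    head_of = {wid: head for wid, _pos, head, _rel in rows}
    path = [word_id]
    while head_of[path[-1]] != 0:
        if len(path) > len(rows):
            raise ValueError("cycle detected in head links")
        path.append(head_of[path[-1]])
    return path


root = find_root(ROWS)
main_verb_dependents = children_by_head(ROWS)[root]
```

In this sample the finite verb (word 8) is the root, and its direct dependents
carry the subject, verb complement, indirect object and symbol relations.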
For Tamil, a corpus of three thousand sentences was annotated with
dependency relations and labels using the customized tag set (Table 1).
The MALT Parser tool is trained on this corpus to generate a model, and
this model is then used to parse new input sentences.
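
The train-then-parse cycle can be sketched as two MaltParser command lines
built from Python. The options used (-c for the model name, -i for input, -o
for output, -m for the mode "learn" or "parse") are MaltParser's standard
command-line options; the jar path, model name and file names are hypothetical
placeholders, and nothing here reflects the authors' actual settings.

```python
import subprocess


def malt_command(jar, model, mode, infile, outfile=None):
    """Build a MaltParser command line for training or parsing.

    jar     -- path to the MaltParser jar (placeholder, version not assumed)
    model   -- name of the model configuration to create or load
    mode    -- "learn" to train a model, "parse" to apply it
    infile  -- annotated data for "learn", unparsed data for "parse"
    outfile -- where parsed output is written (used in "parse" mode)
    """
    if mode not in ("learn", "parse"):
        raise ValueError(f"unsupported mode: {mode!r}")
    cmd = ["java", "-jar", jar, "-c", model, "-i", infile, "-m", mode]
    if outfile is not None:
        cmd += ["-o", outfile]
    return cmd


def run_malt(cmd):
    """Run a command built by malt_command; requires Java and the jar."""
    subprocess.run(cmd, check=True)


# Hypothetical file names for the two steps.
train_cmd = malt_command("maltparser.jar", "tamil_model", "learn", "train.conll")
parse_cmd = malt_command(
    "maltparser.jar", "tamil_model", "parse", "test.conll", "parsed.conll"
)
```

Keeping command construction separate from execution makes it possible to
inspect or log the exact command before any training run is started.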
S.No  Tags     Description
1     ROOT     Head word
2     N-SUB    Subject
3     D-OBJ    Direct Object
4     I-OBJ    Indirect Object
5     NST-MOD  Spatial Time Modifier
6     SYM      Symbols
7     X        Others
Table 1. Shallow Dependency Tagset
Application of Shallow Parser
Shallow parsers were used in the Verbmobil project [Wahlster W, 2000] to
add robustness to a large speech-to-speech translation system. Shallow parsers
are also typically used to reduce the search space for full-blown, 'deep' parsers
[Collins, 1999]. Yet another application of shallow parsing is question answering
on the World Wide Web, where large quantities of ill-formed documents must be
processed efficiently [Buchholz and Daelemans, 2001], and, more generally,
text mining applications, e.g. in biology [Sekimizu et al., 1998].
The developed Tamil Shallow Parser can be used to build the following
systems for Tamil.

Information extraction and retrieval system for Tamil.

Simple Tamil Machine Translation system.

Tamil Grammar checker.

Automatic Tamil Sentence Structure Analyzer.

Language based educational exercises for Tamil language learners.
Conclusion
Shallow Parsing has proved to be a useful technology for written and
spoken language domains. Full parsing is expensive and not very robust,
whereas partial parsing has proved to be much faster and more robust. A
dependency parser is better suited than a phrase structure parser for
languages with free or flexible word order, such as Tamil. The fully
functional Shallow Parser for Tamil gives reliable results. The Shallow
Parser system developed for Tamil is an important tool for Machine
Translation between Tamil and other languages.
References
1. Anand Kumar M, Dhanalakshmi V, Soman K P and Rajendran S (2009),
“A Novel Approach for Tamil Morphological Analyzer”, Proceedings of the
8th Tamil Internet Conference 2009, Cologne, Germany.
2. Buchholz Sabine and Daelemans Walter (2001), “Complex Answers: A
Case Study using a WWW Question Answering System”, Natural
Language Engineering.
3. Collins M (1999), “Head-Driven Statistical Models for Natural Language
Parsing”, Ph.D. Thesis, University of Pennsylvania.
4. Dhanalakshmi V, Anand Kumar M, Vijaya M S, Loganathan R, Soman K
P and Rajendran S (2008), “Tamil Part-of-Speech tagger based on SVMTool”,
Proceedings of the COLIPS International Conference on Natural Language
Processing (IALP), Chiang Mai, Thailand.
5. Dhanalakshmi V, Anand Kumar M, Soman K P and Rajendran S (2009),
“POS Tagger and Chunker for Tamil Language”, Proceedings of the 8th
Tamil Internet Conference, Cologne, Germany.
6. Dhanalakshmi V, Anand Kumar M, Rekha R U, Soman K P and Rajendran
S (2011), “Data driven Dependency Parser for Tamil and Malayalam”,
NCILC-2011, Cochin University of Science & Technology, India.
7. Jesús Giménez and Lluís Màrquez (2004), “SVMTool: A general POS
tagger generator based on support vector machines”, Proceedings of the
4th LREC Conference.
8. Joakim Nivre and Johan Hall (2005), “MaltParser: A language-independent
system for data-driven dependency parsing”, Proceedings of the Fourth
Workshop on Treebanks and Linguistic Theories (TLT).
9. Sekimizu T, Park H and Tsujii J (1998), “Identifying the interaction
between genes and gene products based on frequently seen verbs in
Medline abstracts”, Genome Informatics, Universal Academy Press.
10. Sha F and Pereira F (2003), “Shallow Parsing with Conditional Random
Fields”, Proceedings of Human Language Technology Conference 2003,
Canada.
11. Wahlster W (2000), “VERBMOBIL: Foundations of Speech-to-Speech
Translation”, Springer-Verlag.