Data Mining and Text Analytics
Quranic Arabic Corpus
By Saima Rahna & Anees Mohammad
Summary
● The Quranic Arabic Corpus enables further analysis of the Quran
● Provides linguistic resources for each word and verse in the Quran, e.g. morphology and syntax
● Automated algorithms were applied to the text of the Quran
Introduction
● Islam was born in Arabia about 1400 years ago
● The key sacred texts are in Arabic
● Only a minority of Muslims can speak and understand Arabic
● A larger percentage of Muslims know English as a second or even first language
● Web and book resources use English in parallel with Arabic
Data Mining
● Uses tools and techniques to extract patterns and knowledge from data
● Different aspects of a single topic in the Quran can reappear in many chapters
● Frequent patterns can therefore be used to construct a subject index in which all verses on a single topic can be found easily
Text Analytics
● Also referred to as information extraction
● The Quranic Arabic Corpus benefits those who do not understand Arabic
● It can give English readers better insight into the source text
● The translation is provided at a detailed text-analytic level
Resources & Techniques
Statistical techniques
● Implementing statistical techniques such as keyword extraction
● Can explore semiotic relationships between sound and meaning in the Quran
● Recognise recurring patterns with a high level of accuracy
Linguistic resource
● Arabic grammar and syntax are annotated for each word in the Quran
● A comment-based online system lets visitors discuss and correct the data
Algorithms
● The Quranic Arabic Corpus used Java to implement its algorithms
● Search feature (searching concepts and keywords in the Holy Quran)
● Finding multi-word repetitions
● Mining frequent patterns in a graph
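Finding multi-word repetitions can be illustrated with a small sketch. This is not the corpus's own Java code. It assumes verses are held in a dictionary keyed by a verse identifier, and the function name `find_repeated_phrases`, the parameter `n` and the placeholder sample text are all hypothetical.

```python
from collections import defaultdict


def find_repeated_phrases(verses, n=2):
    """Return every n-word phrase seen more than once, with its verse ids.

    `verses` maps a verse id to its text (an assumed layout for this sketch).
    """
    locations = defaultdict(list)
    for verse_id, text in verses.items():
        words = text.lower().split()
        # Slide a window of n words across the verse.
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            locations[phrase].append(verse_id)
    # Keep only phrases that occur at more than one position.
    return {p: ids for p, ids in locations.items() if len(ids) > 1}


# Placeholder text only, not taken from the corpus.
sample = {
    "v1": "alpha beta gamma delta",
    "v2": "beta gamma delta epsilon",
}
repeated = find_repeated_phrases(sample, n=2)
```

Here `repeated` holds the two phrases shared by both placeholder verses, "beta gamma" and "gamma delta", each mapped to the list of verses in which it appears.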
Algorithm for indexing the Quran
When a word is encountered for the first time, it is added to the index; if it already exists there, a new location is added to its list.
For each verse V
    parse word list -> list(W)
    For each word W in list(W)
        If INDEX does not contain W
            add W and W.location to INDEX
        Else
            fetch W in INDEX
            add new location to W
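The indexing steps above can be sketched in runnable form. This is a minimal illustration, assuming whitespace tokenisation and a location recorded as a (verse id, word position) pair. The name `build_index` and the placeholder verses are hypothetical, and the original implementation was in Java.

```python
def build_index(verses):
    """Build a word -> list of locations index from a verse-id -> text mapping."""
    index = {}
    for verse_id, text in verses.items():
        # "parse word list": simple whitespace split, an assumption of this sketch.
        for position, word in enumerate(text.split()):
            location = (verse_id, position)
            if word not in index:
                # First time the word is seen: add it with its location.
                index[word] = [location]
            else:
                # Word already indexed: add the new location to its list.
                index[word].append(location)
    return index


# Placeholder text only, not taken from the corpus.
index = build_index({"v1": "alpha beta", "v2": "beta gamma"})
```

Looking up `index["beta"]` then returns both places where the word occurs, which is what the search feature needs to find every verse containing a keyword.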
Filtering algorithm
● The Quranic 'quote filtering' algorithm
● The Quran is written with Arabic diacritics (symbols)
● The filtering algorithm passes the input text through 3 filtering stages
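The slides do not say what the three stages are. As one plausible normalisation step, and purely as an assumption, the sketch below strips Arabic diacritics so that quoted text with and without them can be compared. The function name `strip_diacritics` is hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks (Unicode category Mn), which include Arabic harakat.

    The text is deliberately not decomposed first, so letters such as
    alef-with-hamza keep their precomposed form.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Beh + kasra, seen + sukun, meem + kasra, written with escapes for clarity.
with_marks = "\u0628\u0650\u0633\u0652\u0645\u0650"
plain = strip_diacritics(with_marks)
```

After the call, `plain` contains only the three base letters, so a search for the unmarked form would match the marked one.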
Algorithm: Sub-path Mining
● This is used to generate frequent patterns within the Quran corpus
● The process starts by scanning the transaction database and calculating the count for each vertex in the graph
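Only the first step of sub-path mining is described, so the sketch below covers only that step: one scan of the transaction database to count each vertex. It assumes each transaction is a path given as a list of vertex labels. The names `count_vertices`, `frequent_vertices` and `min_support`, and the sample data, are hypothetical.

```python
from collections import Counter


def count_vertices(transactions):
    """Count, for each vertex, the number of transactions that contain it."""
    counts = Counter()
    for path in transactions:
        # set() ensures a vertex is counted once per transaction.
        counts.update(set(path))
    return counts


def frequent_vertices(transactions, min_support):
    """Return the vertices that appear in at least `min_support` transactions."""
    counts = count_vertices(transactions)
    return {v for v, c in counts.items() if c >= min_support}


# Placeholder transaction database: each entry is one path through the graph.
transactions = [["a", "b", "c"], ["b", "c", "d"], ["b", "e"]]
counts = count_vertices(transactions)
frequent = frequent_vertices(transactions, min_support=2)
```

With this data, vertices "b" and "c" meet the support threshold. In a typical frequent-pattern miner, later passes would extend only those frequent vertices into longer sub-paths.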
Conclusion
● Algorithms used
● Resources and techniques used for implementation of the Quranic Arabic Corpus
● How data mining is applied
● How text analytics has also been applied
Thank you
:-)