A Strategy to Compromise Handwritten Documents Processing and
Retrieving Using Association Rules Mining
Prof. Dr. Alaa H. AL-Hamami
Amman Arab University for Graduate Studies, Amman, Jordan, 2009, [email protected]
Dr. Mohammad A. AL-Hamami
Delmon University, Bahrain, 2009, [email protected]
Dr. Soukaena H. Hashem
University of Technology, Iraq, 2009, [email protected]
Abstract
A massive amount of new information is being created: the world's data doubles every 18 months, and 80-90% of all data is held in various unstructured formats. Useful information can be derived from this unstructured data. The aim of this research is to present a framework for handling handwritten documents in all their aspects. Since handwritten documents are unstructured data, the objectives of the proposed strategy are to:
• Convert the unstructured handwritten documents into structured form and store them in a convenient database.
• Customize the proposed database to contain three dimensions: the first for writer features, the second for data features, and the third for document features.
• Convert the multidimensional database into a transactional one, then encode the feature values of all attributes.
• Mine the proposed database; the resulting association rules will extract new patterns that serve many prediction purposes.

Keywords: handwritten documents, data mining, association rules.
1. Introduction
Data mining is also known as knowledge discovery, knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. Data mining work has two branches: descriptive, which aims at understanding underlying processes or behavior (patterns, trends, and clustering) through pattern and trend analysis, knowledge base creation, summarization, and visualization; and predictive, which predicts unseen or unmeasured values (future projections, missing values, and classification) through classification, question answering, and pattern and trend forecasting [1].

2. Text Mining
Text mining is a process that employs statistical natural language processing (NLP), a set of algorithms for converting unstructured text into structured data objects, plus data mining, the quantitative methods that analyze these data objects to discover knowledge. Text mining techniques include the following:
• Information retrieval: indexing and retrieval of textual documents.
• Information extraction: extraction of partial knowledge from the text.
• Web mining: indexing and retrieval of textual documents and extraction of partial knowledge using the web (ontology building).
• Clustering: generating collections of similar text documents.
The text mining process consists of the following sequenced steps [2, 3]; see Fig. 1.
Figure 1. The overall process of text mining.
1. Text preprocessing (syntactic/semantic text analysis): part-of-speech (POS) tagging (finding the corresponding POS for each word), word sense disambiguation (context based or proximity based), and parsing (generating a parse tree (graph) for each sentence, where each sentence is a stand-alone graph).
2. Feature generation (bag of words): a text document is represented by the words it contains (and their occurrences); the order of words is not important for certain applications (bag of words). Stemming identifies a word by its root and reduces dimensionality; stop words, the common words unlikely to help text mining, are removed (a short sketch of this step appears after the definitions below).
3. Feature selection (simple counting and statistics): reduce dimensionality, since learners have difficulty addressing tasks with high dimensionality, and keep only the information relevant to what is being analyzed; irrelevant features do not help.
4. Text/data mining (classification (supervised) / clustering (unsupervised)): In supervised learning (classification), the training data is labeled to indicate the class, and new data is classified based on the training set; classification is correct when the known label of a test sample matches the class produced by the classification model. In unsupervised learning (clustering), the class labels of the training data are unknown; the task is to establish the existence of classes or clusters in the data, and a good clustering method yields high intra-cluster similarity. The two tasks can be stated more precisely as follows:
A. Text mining (classification definition). Given: a collection of labeled records (the training set), where each record contains a set of features (attributes) and the true class (label). Find: a model for the class as a function of the feature values. Goal: previously unseen records should be assigned a class as accurately as possible.
B. Text mining (clustering definition). Given: a set of documents and a similarity measure among documents. Find: clusters such that documents in one cluster are more similar to one another and documents in separate clusters are less similar to one another. Goal: finding a correct set of document clusters.
C. Analyzing results: Are the results satisfactory? Does more mining need to be done? Does a different technique need to be used? Does another iteration of one or more steps in the process need to be done?
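To make the feature generation step concrete, the following minimal Python sketch builds a bag-of-words representation with stemming and stop word removal. It is illustrative only: the tiny stop word list and the crude suffix-stripping stemmer are our own assumptions, not part of the original system.

from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}  # assumed, tiny list

def stem(word):
    # Naive suffix-stripping stemmer (an assumption; a real system
    # would use, e.g., the Porter stemmer).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    # Tokenize, drop stop words, stem, and count occurrences.
    tokens = [t.strip(".,;:!?()").lower() for t in text.split()]
    return Counter(stem(t) for t in tokens if t and t.lower() not in STOP_WORDS)

print(bag_of_words("The patients were treated and the treatments helped."))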
3. The Proposed System
Some previous works have dealt with handwritten documents. Fig. 2 presents a method that uses an Artificial Neural Network (ANN) to classify documents according to the data features of the writing group. The authors find that the ANN does a good job but cannot clearly explain its output.
This is true as far as it goes: the classification result determines the group of writers, but what about classification according to the subject of the documents, and what about classification according to the documents' features?
Figure 2. An ANN classifying handwritten documents according to their writing group.
The proposed system for text mining of the
handwritten documents can be explained in the
following steps:
Step One: Determine the input and output.
Input: samples of handwritten documents (200 documents).
Output: association rules introducing predicted patterns that help determine and extract many more relationships among writers, features of data, and features of documents.
Step Two: Determine the attributes and their values:
• Determine the attributes of the first dimension, the features of the document's writer, which include the following: Age, Gender, Handedness, Ethnicity, Education, and Schooling. These attributes are obtained as prior knowledge associated with the documents (each standard document is naturally supplied with this information about its writer). The proposed encoding of the first-dimension attributes is:
1. Age: since the writers of the documents are all well over 20 years old, this attribute is represented by A: A appears if the age is less than 45; otherwise A does not appear.
2. Gender: if the gender is female, B appears; otherwise B does not appear.
3. Handedness, Ethnicity, Education, and Schooling: all of these attributes are represented by the same strategy.
• Determine the attributes of the second dimension, the features of the written data, which include the following: dark, blob, hole, slant, width, skew, height, slopehor, slopeneg, slopever, slopepos, and pixelfreq. These features are obtained by applying image processing procedures specified to extract them. The proposed encoding of the second-dimension attributes is:
1. Dark: the value is normalized, and a threshold is applied to the normalized value according to its different values in different cases, such that if dark is less than 0.5 then G appears; otherwise G does not appear.
2. Blob, Hole, Slant, Width, Skew, Height, Slopehor, Slopeneg, Slopever, Slopepos, and Pixelfreq: all of these attributes are represented by the same strategy.
• Determine the attributes of the third dimension, the features of the text written in the documents, which include the following: language, subject of document, and type of document. These features are obtained by using Optical Character Recognition software to digitize the documents and then extract all the recognized features. The proposed encoding of the third-dimension attributes is:
1. Language: if the language is English, S appears; otherwise S does not appear.
2. Subject of document: if the subject is medical, T appears; otherwise T does not appear.
3. Type of document: if the document is text only, U appears; otherwise U does not appear.
Step Three: In the first view of the proposal, we present a multidimensional database with three dimensions: features of the document's writer, features of the written data, and features of the text written in the documents; see Fig. 3.
Figure 3. The multidimensional database.
This multidimensional database is then converted into a simple transactional one; see Fig. 4.
Figure 4. The transactional database.
The data of the transactional database is then written using the proposed encoding of the feature values; see Fig. 5.
Tid     Attributes
Doc 1   ABCDE
Doc 2   CDEFGHIJ
Doc 3   …
Figure 5. The encoded transactional database.
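As an illustration of the proposed encoding, the minimal Python sketch below converts one document's feature values into an encoded itemset. The letter conditions follow the encodings given in Step Two; the concrete feature values and the blob threshold of 0.3 (taken from the analysis example in Step Six) are assumptions for the example.

def encode_document(doc):
    # Encode one document's features as a string of attribute letters;
    # a letter appears only when its condition from Step Two holds.
    items = []
    # First dimension: writer features.
    if doc["age"] < 45:           items.append("A")
    if doc["gender"] == "female": items.append("B")
    # ... C-F: Handedness, Ethnicity, Education, Schooling (same strategy).
    # Second dimension: written-data features (normalized to [0, 1]).
    if doc["dark"] < 0.5:         items.append("G")
    if doc["blob"] > 0.3:         items.append("H")  # assumed threshold
    # ... I-R: hole, slant, width, skew, height, slopes, pixelfreq.
    # Third dimension: document features.
    if doc["language"] == "english": items.append("S")
    if doc["subject"] == "medical":  items.append("T")
    if doc["type"] == "text":        items.append("U")
    return "".join(items)

doc1 = {"age": 30, "gender": "female", "dark": 0.4, "blob": 0.5,
        "language": "english", "subject": "medical", "type": "text"}
print(encode_document(doc1))  # -> "ABGHSTU"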
Step Four: Since the transactional database has very long itemsets, searching for the frequent itemsets needed to find the association rules will consume far more space and time if we use one of the two traditional methods for finding frequent itemsets. These methods are:
• Breadth search, which can be viewed as a bottom-up approach in which the algorithm visits patterns of size k+1 after finishing the patterns of size k.
• Depth search, which does the opposite: the algorithm visits patterns of size k before those of size k-1.
The proposed procedure finds the set of frequent itemsets in a transactional database that has long itemsets. It works as follows:
1. Use a traversing approach, consisting of depth and breadth search, to find the longest frequent itemset.
2. Find all its children; by that we obtain most of the frequent itemsets.
3. Detect the support of each frequent itemset. Some frequent itemsets do not appear among the children of the longest frequent itemset; these exceptional frequent itemsets, with their supports, are found using the traditional Apriori algorithm.
The proposed procedure consists of two phases; the first phase is always applied, while the second phase is applied only when necessary. The first phase of the traversal consists of a depth search: only the deepest node on the far left side is taken from this search, and the support of this node in the database is computed. If its support passes the minimum support threshold, the search terminates; otherwise the second phase is applied. The second phase consists of a breadth search: taking the node generated in the first phase as the root of a tree, that tree is traversed breadth-first, looking for the longest frequent itemset. The search terminates when the longest frequent itemset has been found.
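A minimal Python sketch of the two-phase procedure described above follows. It is a sketch under our own assumptions: the string encoding of transactions and the helper names are ours, and ties among equally long frequent subsets in the breadth search are broken arbitrarily.

from itertools import combinations

def support(itemset, transactions):
    # Fraction of transactions containing every item of the itemset.
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t)) / len(transactions)

def longest_frequent_itemset(candidate, transactions, min_sup):
    # Phase 1: test the deepest node (the full candidate itemset).
    if support(candidate, transactions) >= min_sup:
        return tuple(candidate)
    # Phase 2: breadth-search the candidate's subtree, level by level,
    # for the longest subset that passes the minimum support.
    for size in range(len(candidate) - 1, 0, -1):
        for subset in combinations(candidate, size):
            if support(subset, transactions) >= min_sup:
                return subset
    return ()

transactions = ["ABCDE", "CDEFGHIJ", "ABGHSTU", "ABCDEFG"]
print(longest_frequent_itemset("ABCDE", transactions, min_sup=0.75))  # -> ('A', 'B')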
After the extraction, we propose a procedure to be applied before the analysis stage. This procedure, called rule classification, classifies the rules into seven groups depending on the dimensions in which the itemsets of the right and left sides are found.
Rule classification:
Class 1: the itemsets on both the left and right sides are included in the first dimension.
Class 2: the itemsets on both the left and right sides are included in the second dimension.
Class 3: the itemsets on both the left and right sides are included in the third dimension.
Class 4: the itemsets on both the left and right sides are included in the first and second dimensions.
Class 5: the itemsets on both the left and right sides are included in the first and third dimensions.
Class 6: the itemsets on both the left and right sides are included in the second and third dimensions.
Class 7: the itemsets on both the left and right sides are included in the first, second, and third dimensions.
Examples of the classification of association rules are:
A → B (Class 1)
…
GHFRSTU → OABCD (Class 7)
…
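For illustration, a minimal Python sketch of the rule classification follows. The letter-to-dimension mapping (A-F for the first dimension, G-R for the second, S-U for the third) is inferred from the encodings in Step Two and from the analysis example below; it is not stated explicitly in the paper.

DIM1, DIM2, DIM3 = set("ABCDEF"), set("GHIJKLMNOPQR"), set("STU")

def classify_rule(left, right):
    # Which of the three dimensions do the rule's items touch?
    items = set(left) | set(right)
    dims = (bool(items & DIM1), bool(items & DIM2), bool(items & DIM3))
    classes = {
        (True, False, False): 1, (False, True, False): 2,
        (False, False, True): 3, (True, True, False): 4,
        (True, False, True): 5,  (False, True, True): 6,
        (True, True, True): 7,
    }
    return classes[dims]

print(classify_rule("A", "B"))            # -> 1
print(classify_rule("GHFRSTU", "OABCD"))  # -> 7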
Step Five: After finding all frequent itemsets, the traditional association rule procedure is applied. This procedure introduces the extracted association rules, for example:
A → B
…
GHFRSTU → OABCD
…
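For completeness, a minimal Python sketch of this traditional rule generation step follows. The minimum confidence value and the sample transactions are illustrative assumptions.

from itertools import combinations

def gen_rules(freq_itemset, transactions, min_conf):
    # Emit every rule L -> R (L, R a partition of the frequent itemset)
    # whose confidence support(L u R) / support(L) passes min_conf.
    def sup(s):
        return sum(1 for t in transactions if set(s) <= set(t))
    rules = []
    items = set(freq_itemset)
    for k in range(1, len(items)):
        for left in combinations(sorted(items), k):
            right = items - set(left)
            conf = sup(items) / sup(left)
            if conf >= min_conf:
                rules.append(("".join(left), "".join(sorted(right)), conf))
    return rules

transactions = ["ABCDE", "CDEFGHIJ", "ABGHSTU", "ABCDEFG"]
for l, r, c in gen_rules("AB", transactions, min_conf=0.6):
    print(f"{l} -> {r}  (conf {c:.2f})")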
Step Six: This step comprises the analysis stage, the most important step, because it produces a full report of predictions, relationships, and future trends to improve the performance of the mined database, which represents the encoding for the system. To explain this stage, we analyze the following rule:
GHFRSTU → OABCD (Class 7)
• This rule is classified as Class 7 since its left and right sides span all three dimensions.
• The left side holds the frequent itemset GHFRSTU, composed of F in the first dimension, GHR in the second dimension, and STU in the third dimension.
• The right side holds the frequent itemset OABCD, composed of ABCD in the first dimension and O in the second dimension.
• From the classification and composition analysis, and by translating the encoded letters back into their attributes, we can predict the following:
1. If dark is less than 0.5, blob is more than 0.3, schooling is high, pixel frequency passes its threshold, the language is English, the subject of the document is medical, and the type of the document is text, then
2. slopeneg will pass its threshold, the age of the writer will be less than 45, the gender will be female, and the handedness will be right.
So we can predict the age, gender, and handedness of the writer, and also predict the slopeneg of the document data, by knowing the dark, blob, and pixelfreq of the document data and the schooling of the writer, combined with knowing the language, subject, and type of the document.
4. Implementation
To explain the implementation of the proposed system, we follow these phases:
The first phase:
The implementation proceeds by taking each handwritten document and building the first proposed multidimensional database, as follows (a sketch of one feature computation appears after this list):
• Convert the document to an image, from which all the features of the second dimension (dark, blob, hole, slant, width, skew, height, slopehor, slopeneg, slopever, slopepos, pixelfreq) are extracted. The feature values are obtained by traditional image processing procedures.
• The writer metadata of the first dimension (Age, Gender, Handedness, Ethnicity, Education, and Schooling) is obtained as metadata appended to the document.
• The document metadata of the third dimension (language, subject of document, type of document) is obtained as metadata appended to the document.
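As an example of how one second-dimension feature might be computed, the sketch below takes the "dark" feature to be the normalized fraction of ink pixels in the scanned page. This is a sketch under our own assumptions (the paper does not specify the exact image processing procedures), using the Pillow and NumPy libraries; the file name and the ink threshold are hypothetical.

import numpy as np
from PIL import Image

def dark_feature(path, ink_threshold=128):
    # Fraction of pixels darker than ink_threshold, in [0, 1].
    # (Assumed definition; the paper only says the feature is
    # normalized and compared against a 0.5 threshold.)
    gray = np.asarray(Image.open(path).convert("L"))
    return float((gray < ink_threshold).mean())

# G appears in the encoded transaction when dark < 0.5 (Step Two).
if __name__ == "__main__":
    dark = dark_feature("document.png")  # hypothetical scanned page
    print("dark =", dark, "-> G" if dark < 0.5 else "-> no G")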
Fig. 6 shows how to build the proposed multidimensional database by filling the text boxes with the feature values and the scanned document and then clicking the insertion command, and how to convert it to a transactional database by clicking the convert command; if the conversion completes successfully, the message in Fig. 7 appears.
Figure 6. Form 1, for building the multidimensional database and converting it to transactional form.
Figure 7. Message indicating that the conversion completed successfully.
Fig. 8 shows how to extract the frequent itemsets from the transactional database using the proposed procedure, after entering the initial expected longest frequent itemset and clicking the extraction command, and how to obtain the rule classification by clicking the classification command.
Figure 8. Form 2, displaying the extraction of frequent itemsets and the classification of rules.
The second phase:
The second phase presents the implementation of an application of the proposed system: extracting the features of the author from the features of the written document and the features of its subject. This is done by taking the document as an image and extracting the feature values of the document (second dimension) together with the subject feature values supplied with the document (third dimension). To extract the writer (author) features, the system mines the existing multidimensional database, according to the extracted and supplied features, for the corresponding author features; see Fig. 9. Naturally, the proposed retrieval process is subject to many thresholds related to the feature values.
Representing all features in binary does not decrease system performance, since we use critical thresholds for these features, which represent the attributes of the user's handwritten documents.
The handwritten documents are initially represented by a multidimensional database, so as to provide a general form for much future work on these documents; in the proposed system we convert this multidimensional database into a transactional database, since our aim here is to apply the association rule data mining technique.
Figure 9. Attempting to obtain the author features by introducing the document and subject features.
5. Conclusions
From the proposed research we conclude the following:
1. Converting unstructured handwritten documents to a structured frame, by building the proposed multidimensional database and then converting it into a transactional one, enriches the mining process, since we include most of the features of the documents, writers, and data.
2. Building a transactional database with long itemsets enables us to include all the features we consider important for prediction and for extracting new patterns.
3. Using association rule techniques to deal with the features, instead of an ANN, makes the mining process much more powerful, since there is no limitation on the number of input features or on the number of resulting features used for classification and clustering.
4. The proposed procedure for finding frequent itemsets, used instead of the traditional procedure, makes the process of finding all frequent itemsets from long itemsets efficient and less time and space consuming.
5. The proposed procedure for classifying the extracted rules makes the analysis process much easier and faster.

6. Discussion
In the proposed system we assume that the number of attributes is less than 26, so we represent each attribute by one capital letter; if the number of attributes exceeds twenty-six, we will use both capital and small letters.

References
[1] M. S. Chen, J. Han, and P. S. Yu, "Data mining: An overview from a database perspective", IEEE Transactions on Knowledge and Data Engineering, Vol. 8, pp. 866-883, 1996.
[2] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, "Advances in Knowledge Discovery and Data Mining", AAAI/MIT Press, 1996.
[3] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2000.
[4] S. Mitra and T. Acharya, "Data Mining: Multimedia, Soft Computing, and Bioinformatics", John Wiley and Sons, Inc., 2003.
[5] Ala'a H. AL-Hamami, Mohammad Ala'a Al-Hamami, and Soukaena Hassan Hashem, "Applying data mining techniques in intrusion detection system on web and analysis of web usage", Asian Journal of Information Technology, Vol. 5, No. 1, pp. 57-63, 2006.
[6] Ala'a H. AL-Hamami and Soukaena Hassan Hashem, "Privacy Preserving for Data Mining Applications", Journal of Technology, University of Technology, Baghdad, Iraq, Vol. 26, No. 5, 2008.
[7] Mohammad A. Al-Hamami and Soukaena Hassan Hashem, "Applying Data Mining Techniques to Discover Methods that Used for Hiding Messages Inside Images", The IEEE First International Conference on Digital Information Management (ICDIM 2006), Bangalore, India, 2006.
[8] Ala'a H. AL-Hamami, Mohammad Ala'a Al-Hamami, and Soukaena Hassan Hashem, "A Proposed Technique for Medical Diagnosis Using Data Mining", Fourth International Conference on Intelligent Computing and Information Systems (ICICIS 2009), Cairo, Egypt, March 19-22, 2009.