College of Science & Technology
Dept. of Computer Science & IT
BSc of Information Technology
Data Mining
Chapter 2_1: Data Preparation and Preprocessing
Case Study
2013
Prepared by: Mahmoud Rafeek Al-Farra
www.cst.ps/staff/mfarra
Course Outline
Introduction
Data Preparation and Preprocessing
Association Rules
Classification Methods
Evaluation
Clustering Methods
Mid Exam
Knowledge Representation
Special case study: Document clustering
Discussion of Case studies by students
Consider the following instances
The documents before preprocessing are the following:
Document 1:
Palestine freedom requires all Muslims.
All Muslims must pray five times every day.
Palestinians and Muslims are persecuted by United Nations.
Document 2:
Freedom for Palestine.
Palestine is a holy land for all Muslims.
The legal right of Palestine for Muslims.
I am proud to be Muslim.
Document 3:
Support our legal rights to Palestine.
I am proud to be from Palestine.
After the preprocessing
After the documents pass through the preprocessing steps, many words are removed
 (e.g. our, to, am, the, five, and so on).
Other words are stemmed to their roots
 (e.g. Muslims is stemmed to Muslim, persecuted to persecut, and so on).
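The two steps described above (dropping common words, then stemming) can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's actual implementation: the names STOP_WORDS, SUFFIXES, stem and preprocess are hypothetical, the stop list extends the slide's examples (our, to, am, the, five) with the other words that disappear from the sample sentences, and the suffix rules are a toy stemmer chosen to match the stems used in this case study, not the Porter algorithm.

```python
import re

# Assumption: a tiny stop list covering only the sample sentences.
# The slide names "our, to, am, the, five"; the rest are inferred from
# the words that are missing in the preprocessed documents.
STOP_WORDS = {
    "our", "to", "am", "the", "five", "i", "be", "from", "for", "is",
    "a", "of", "and", "are", "by", "must", "times", "every", "day",
}

# Assumption: toy suffix rules, tried in this order (first match wins).
SUFFIXES = ("ians", "ed", "es", "e", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(sentence):
    """Lowercase, tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("Palestinians and Muslims are persecuted by United Nations."))
# prints ['palestin', 'muslim', 'persecut', 'unit', 'nation']
```

A real application would normally rely on a complete stop-word list and an established stemming algorithm instead of these hand-picked rules.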
Now, after the preprocessing steps, the three documents are as follows:
Document 1:
Palestin freedom requir all Muslim.
All Muslim pray.
Palestin Muslim persecut unit nation.
Document 2:
Freedom Palestin.
Palestin holy land all Muslim.
Legal right Palestin Muslim.
Proud Muslim.
Document 3:
Support legal right Palestin.
Proud Palestin.
Then … representation
One possible way:
        item1   item2   item3   item4
Doc1    0       1       1       1
Doc2    1       1       1       1
Doc3    1       1       0       0
Doc4    0       1       1       0
Our application then treats each document as a vector: one entry per item, 1 if the item occurs in the document and 0 otherwise.
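As a rough illustration of this representation, the Python sketch below turns the set of items found in each document into a 0/1 vector and reproduces the values of the table above. The function name to_binary_vectors and the variable names are hypothetical, and item1 to item4 are the slide's placeholder names, not real terms from the three sample documents.

```python
# Items present in each document, read off the table on the slide.
# item1..item4 are the slide's placeholders for terms.
docs = {
    "Doc1": {"item2", "item3", "item4"},
    "Doc2": {"item1", "item2", "item3", "item4"},
    "Doc3": {"item1", "item2"},
    "Doc4": {"item2", "item3"},
}


def to_binary_vectors(docs):
    """Return the sorted vocabulary and one 0/1 vector per document."""
    vocabulary = sorted(set().union(*docs.values()))
    vectors = {
        name: [1 if item in items else 0 for item in vocabulary]
        for name, items in docs.items()
    }
    return vocabulary, vectors


vocabulary, vectors = to_binary_vectors(docs)
print(vocabulary)
for name, vector in vectors.items():
    print(name, vector)
# Doc1 is printed as [0, 1, 1, 1] and Doc3 as [1, 1, 0, 0]
```

Binary entries only record whether an item occurs; other weightings, such as term counts, are possible, which is why the slide calls this one possible way.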
Thanks