Data Mining
Theory and Practice
Dr. Azuraliza Abu Bakar
http://www.ftsm.ukm.my/jabatan/ts/aab/index.htm
What is Pattern Recognition?
Pattern Recognition by Humans
– perceptual
– specialized – decision making
Pattern Recognition by Computers
– benefit of automated pattern recognition
– advantage in complex calculations
Pattern Recognition from Data (Data Mining)
Pattern Recognition from Data

Pattern recognition from data is a process of
learning or observing the past data by studying the
dependencies and extracting knowledge from data
What is Data?

No.   Studies    Education   Works   Income (D)
1     Poor       SPM         Poor    None
2     Poor       SPM         Good    Low
3     Moderate   SPM         Poor    Low
4     Moderate   Diploma     Poor    Low
5     Poor       SPM         Poor    None
6     Moderate   Diploma     Poor    Low
7     Good       MSC         Good    Medium
:
99    Poor       SPM         Good    Low
100   Moderate   Diploma     Poor    Low
What is Knowledge??
studies(Poor) AND work(Poor) => income(None)
studies(Poor) AND work(Good) => income(Low)
education(Diploma) => income(Low)
education(MSc) => income(Medium) OR income(High)
studies(Mod) => income(Low)
studies(Good) => income(Medium) OR income(High)
education(SPM) AND work(Good) => income(Low)
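One way to make this rule notation concrete is to store each rule as a set of conditions plus the set of income values it allows. The following is only a minimal sketch: the names `RULES` and `matching_decisions` are hypothetical, and the attribute values are spelled out ("Moderate" for "Mod") as an assumption.

```python
# Each rule pairs its conditions (attribute -> required value) with the set of
# income values it allows; a two-member set models "OR" on the right-hand side.
RULES = [
    ({"studies": "Poor", "work": "Poor"}, {"None"}),
    ({"studies": "Poor", "work": "Good"}, {"Low"}),
    ({"education": "Diploma"}, {"Low"}),
    ({"education": "MSc"}, {"Medium", "High"}),
    ({"studies": "Moderate"}, {"Low"}),
    ({"studies": "Good"}, {"Medium", "High"}),
    ({"education": "SPM", "work": "Good"}, {"Low"}),
]


def matching_decisions(record, rules=RULES):
    """Return the decision set of every rule whose conditions all hold."""
    return [
        decisions
        for conditions, decisions in rules
        if all(record.get(attr) == value for attr, value in conditions.items())
    ]


# Row 2 of the data table: poor studies, SPM education, good work.
row2 = {"studies": "Poor", "education": "SPM", "work": "Good"}
print(matching_decisions(row2))  # [{'Low'}, {'Low'}]
```

Two rules fire for row 2 and both agree on Low, which matches the income recorded for that row.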
What is Data Mining??
Extraction of knowledge from data:
the exploration and analysis of large quantities of
data to discover meaningful patterns.
Discover Knowledge
How data mining looks into data??
[Figure: three blocks, each labelled "Data"]
Data Mining: Motivation
Huge amounts of data
An important need to turn data into useful
information
The fast-growing amount of data, collected and stored in
large and numerous databases, has exceeded the human
ability to comprehend it without powerful tools
Questions??
What goods should be promoted to this customer?
What is the probability that a certain customer will respond
to a planned promotion?
Can one predict the most profitable securities to buy/sell
during the next trading session?
Will this customer default on a loan or pay back on
schedule?
What medical diagnosis should be assigned to this patient?
What kinds of cars should we sell this year??
Data Mining is simply...
Finding relationships
Making predictions
Data Mining: One Step of KDD
KDD
Data mining
Task
Techniques
Data Mining as a Step of KDD
[Diagram, read from bottom to top]
Databases and flat files
→ Cleaning and Integration
→ Data Warehouse
→ Selection and Transformation
→ Data Mining
→ Patterns
→ Evaluation & Presentation
→ Knowledge
Early Steps of Data Mining
Data preprocessing
– handling incomplete data, noisy data, uncertain data
Data discretization/representation
– transforms data into suitable values for the
mining algorithm to find patterns
Data selection
– selects the suitable data for mining purposes
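The three early steps can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the procedure used in the slides: the helper names, the toy records, the mode-based fill for missing values, and the cut points 240 and 280 are all hypothetical.

```python
from collections import Counter


def fill_missing(rows, attr):
    """Preprocessing: replace None in column attr with the most frequent value."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [{**r, attr: mode if r[attr] is None else r[attr]} for r in rows]


def discretize(value, cut_points, labels):
    """Discretization: map a number to the label of the interval it falls in.

    labels must have exactly one more entry than cut_points.
    """
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


def select(rows, attrs):
    """Selection: keep only the attributes needed for mining."""
    return [{a: r[a] for a in attrs} for r in rows]


# Toy records (hypothetical); the third one has an incomplete value.
rows = [
    {"id": 1, "chol": 233, "restecg": "LV hyper"},
    {"id": 2, "chol": 286, "restecg": "LV hyper"},
    {"id": 3, "chol": 250, "restecg": None},
    {"id": 4, "chol": 236, "restecg": "Normal"},
]

cleaned = fill_missing(rows, "restecg")
coded = [
    {**r, "chol": discretize(r["chol"], [240, 280], ["low", "mid", "high"])}
    for r in cleaned
]
selected = select(coded, ["chol", "restecg"])
print(selected)
```

Each helper returns new records rather than modifying its input, so the raw data stays available for comparison.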
Data Mining Techniques
Decision Trees
Neural Network
Genetic Algorithms
Fuzzy Set Theory
Rough Set Theory
Statistical Method (Regression Analysis)
Classification of Data Mining Systems
Kinds of DB
– Relational
– Data warehouse
– Transactional DB
– Advanced DB system
– Flat files
– WWW
Kinds of Knowledge
– Classification
– Association
– Clustering
– Prediction
– …
Classification of Data Mining Systems
Techniques used
– DB-oriented techniques
– Statistics
– Machine learning
– Pattern recognition
– Neural networks
– Rough sets, etc.
Applications adapted
– Finance
– Marketing
– Medical
– Stock
– Telecommunication, etc.
Data Mining: Confluence of Multiple Disciplines
[Diagram: DATA MINING at the centre, surrounded by]
– Database technology
– Statistics
– High-performance computing
– Visualization
– Pattern recognition
– Machine learning
– Spatial data analysis
– Information retrieval
– Information science
– Neural networks
Data Mining
What are we looking at??
What are we looking for??
Data Mining Tasks
– Prediction
– Classification
– Clustering
– Association Rules
– Sequential Analysis
– Deviation analysis
– Similarity analysis
– Trend analysis
Classification
Training data → Classification algorithm → Classification rules

Training data
No.   Studies    Education   Works   Income (D)
1     Poor       SPM         Poor    None
2     Poor       SPM         Good    Low
3     Moderate   SPM         Poor    Low
4     Moderate   Diploma     Poor    Low
5     Poor       SPM         Poor    None
6     Moderate   Diploma     Poor    Low
7     Good       MSC         Good    Medium
:
99    Poor       SPM         Good    Low
100   Moderate   Diploma     Poor    Low

Classification rules
If studies=“poor” and
work=“poor” then
income=“none”
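The step from training data to rules can be illustrated with a deliberately simple stand-in for a classification algorithm: group the training rows by (studies, works) and keep the most frequent income in each group. The slides do not name the algorithm actually used, so `induce_rules` is a hypothetical helper, and only the nine rows shown in the table are included.

```python
from collections import Counter, defaultdict

# (studies, education, works, income) for the nine rows shown in the table.
TRAINING = [
    ("Poor", "SPM", "Poor", "None"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "SPM", "Poor", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Poor", "SPM", "Poor", "None"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Good", "MSC", "Good", "Medium"),
    ("Poor", "SPM", "Good", "Low"),          # row 99
    ("Moderate", "Diploma", "Poor", "Low"),  # row 100
]


def induce_rules(rows):
    """Map each observed (studies, works) pair to its most frequent income."""
    counts = defaultdict(Counter)
    for studies, _education, works, income in rows:
        counts[(studies, works)][income] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


rules = induce_rules(TRAINING)
for (studies, works), income in rules.items():
    print(f'If studies="{studies}" and work="{works}" then income="{income}"')
```

On these rows every group is unanimous, so the induced rules reproduce the training labels exactly; with noisier data the majority vote would hide the minority cases.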
Classification
Test data → Classification rules → classify

Test data
Studies    Education   Works   Income (D)
Moderate   Diploma     Poor    ?
Poor       SPM         Poor    ?
Moderate   Diploma     Poor    ?
Good       MSC         Good    ?
:

New data
studies=“poor” and work=“poor” → classify → income=“none”
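Applying rules to unseen records is a lookup. The sketch below assumes a hypothetical rule table keyed on (studies, works), with income labels taken from the data table (None, Low, Medium); `classify` and the "Unknown" fallback for combinations no rule covers are illustrative choices, not part of the slides.

```python
# Hypothetical rule table: (studies, works) -> predicted income.
RULES = {
    ("Poor", "Poor"): "None",
    ("Poor", "Good"): "Low",
    ("Moderate", "Poor"): "Low",
    ("Good", "Good"): "Medium",
}


def classify(studies, works, rules=RULES, default="Unknown"):
    """Return the income predicted by the matching rule, or default if none matches."""
    return rules.get((studies, works), default)


# The four test records from the slide: (studies, education, works).
TEST_DATA = [
    ("Moderate", "Diploma", "Poor"),
    ("Poor", "SPM", "Poor"),
    ("Moderate", "Diploma", "Poor"),
    ("Good", "MSC", "Good"),
]

predictions = [classify(s, w) for s, _education, w in TEST_DATA]
print(predictions)  # ['Low', 'None', 'Low', 'Medium']
```

Returning an explicit fallback keeps "no rule applies" distinct from the income value None.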
Type of Classifiers
Neural Classifier
– Hopfield Network
– Multilayer Perceptron
– Radial Basis Function
– Kohonen Networks
Statistical Classifier
– Bayesian approach
– Multiple Regression
– K-nearest neighbour
– Naïve Bayes
– Causal Network
– Discriminant Analysis
Rough Classifier
DATASET
No.   Studies    Education   Works   Income (D)
1     Poor       SPM         Poor    None
2     Poor       SPM         Good    Low
3     Moderate   SPM         Poor    Low
4     Moderate   Diploma     Poor    Low
5     Poor       SPM         Poor    None
6     Moderate   Diploma     Poor    Low
7     Good       MSC         Good    Medium
:
99    Poor       SPM         Good    Low
100   Moderate   Diploma     Poor    Low
RULES
studies(Poor) AND work(Poor) => income(None)
studies(Poor) AND work(Good) => income(Low)
education(Diploma) => income(Low)
education(MSc) => income(Medium) OR income(High)
studies(Mod) => income(Low)
studies(Good) => income(Medium) OR income(High)
education(SPM) AND work(Good) => income(Low)
Comparing Classifiers
– Predictive Accuracy
– Speed
– Robustness
– Scalability
– Interpretability
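Two of these criteria, predictive accuracy and speed, are easy to measure directly. The sketch below compares two hypothetical classifiers (a constant baseline and a small rule table) on the nine rows shown in the income table. Because the same rows are used for building and for scoring, the accuracy figures are optimistic; a fair comparison would use held-out test data.

```python
import time

# (studies, works) and the recorded income for the nine rows shown in the table.
RECORDS = [
    ("Poor", "Poor"), ("Poor", "Good"), ("Moderate", "Poor"),
    ("Moderate", "Poor"), ("Poor", "Poor"), ("Moderate", "Poor"),
    ("Good", "Good"), ("Poor", "Good"), ("Moderate", "Poor"),
]
LABELS = ["None", "Low", "Low", "Low", "None", "Low", "Medium", "Low", "Low"]


def always_low(record):
    """Baseline: predict the most common income for every record."""
    return "Low"


def rule_based(record):
    """Hypothetical rule table keyed on (studies, works)."""
    table = {
        ("Poor", "Poor"): "None",
        ("Poor", "Good"): "Low",
        ("Moderate", "Poor"): "Low",
        ("Good", "Good"): "Medium",
    }
    return table.get(record, "Low")


def accuracy(classifier, records, labels):
    """Fraction of records whose prediction equals the recorded label."""
    correct = sum(classifier(r) == y for r, y in zip(records, labels))
    return correct / len(labels)


def elapsed_seconds(classifier, records, repeats=1000):
    """Wall-clock time to classify all records `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        for record in records:
            classifier(record)
    return time.perf_counter() - start


for clf in (always_low, rule_based):
    print(clf.__name__, accuracy(clf, RECORDS, LABELS), elapsed_seconds(clf, RECORDS))
```

The baseline gets 6 of 9 rows right and the rule table gets all 9; timings vary by machine, so no figures are quoted.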
Data Mining: Problems and Challenges
– Noisy data
– Large databases
– Dynamic databases
– Incomplete data
Performance Issues
– Time and memory constraints
– Predictive ability
Performance Issues
-number of examples necessary for training
-cost of assuring good accuracy
Performance Issues
Time and Memory Constraints
-time complexity of the learning phase
-time taken for evaluation
-time it takes to reach a certain level of accuracy
Performance Issues
Predictive Ability
-the ability to predict the correct decision
for test or unseen data
-involves the generation of rules
-involves measuring the quality or accuracy of rules
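Rule quality is commonly summarised by two numbers: coverage (the share of records the rule applies to) and accuracy (the share of covered records where the rule's decision is right). The sketch below computes both on the nine rows shown in the income table. `rule_quality` is a hypothetical helper, and the rule education(SPM) => income(Low) is invented here purely to show an imperfect rule.

```python
COLUMNS = ("studies", "education", "work", "income")
ROWS = [dict(zip(COLUMNS, values)) for values in [
    ("Poor", "SPM", "Poor", "None"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "SPM", "Poor", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Poor", "SPM", "Poor", "None"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Good", "MSC", "Good", "Medium"),
    ("Poor", "SPM", "Good", "Low"),          # row 99
    ("Moderate", "Diploma", "Poor", "Low"),  # row 100
]]


def rule_quality(conditions, decision, rows=ROWS):
    """Return (coverage, accuracy) of the rule `conditions => income(decision)`."""
    covered = [r for r in rows if all(r[a] == v for a, v in conditions.items())]
    if not covered:
        return 0.0, 0.0
    correct = sum(r["income"] == decision for r in covered)
    return len(covered) / len(rows), correct / len(covered)


# A rule from the slides: covers 3 of 9 rows and is right on all of them.
print(rule_quality({"education": "Diploma"}, "Low"))
# An invented rule: covers 5 of 9 rows but is right on only 3 of those 5.
print(rule_quality({"education": "SPM"}, "Low"))
```

A rule can score well on one measure and poorly on the other, so both are worth reporting together.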
Samples of the CLEV Dataset (before scaling)

DATA  AGE  SEX     CP              TRESTBPS  CHOL  FBS  RESTECG   THALACH  EXANG  OLDPEAK  SLOPE      CA  THAL          DISEASE
1     63   Male    Typical angina  145       233   T    LV hyper  150      No     2.3      Downslope  0   Fixed defect  No
2     67   Male    Asymp           160       286   F    LV hyper  108      Yes    1.5      Flat       3   Normal        Yes
3     67   Male    Asymp           120       229   F    LV hyper  129      Yes    2.6      Flat       2   Reversible    Yes
4     37   Male    Non-anginal     130       250   F    Normal    187      No     3.5      Downslope  0   Normal        No
5     41   Female  Atypical        130       204   F    LV hyper  172      No     1.4      Upsloping  0   Normal        No
6     56   Male    Atypical        120       236   F    Normal    178      No     0.8      Upsloping  0   Normal        No
7     62   Female  Asymp           140       268   F    LV hyper  160      No     3.6      Downslope  2   Normal        Yes
8     57   Female  Asymp           120       354   F    Normal    163      Yes    0.6      Upsloping  0   Normal        No
9     63   Male    Asymp           130       254   F    LV hyper  147      No     1.4      Flat       1   Reversible    Yes
10    53   Male    Asymp           140       203   T    LV hyper  155      Yes    3.1      Downslope  0   Reversible    Yes
11    57   Male    Asymp           140       192   F    Normal    148      No     0.4      Flat       0   Fixed defect  No
12    56   Female  Atypical        140       294   F    LV hyper  153      No     1.3      Flat       0   Normal        No
13    56   Male    Non-anginal     130       256   T    LV hyper  142      Yes    0.6      Flat       1   Fixed defect  Yes
14    44   Male    Atypical        120       263   F    Normal    173      No     0        Upsloping  0   Reversible    No
15    52   Male    Non-anginal     172       199   T    Normal    162      No     0.5      Upsloping  0   Reversible    No
16    57   Male    Non-anginal     150       168   F    Normal    174      No     1.6      Upsloping  0   Normal        No
17    48   Male    Atypical        110       229   F    Normal    168      No     1        Downslope  0   Reversible    Yes
18    54   Male    Asymp           140       239   F    Normal    160      No     1.2      Upsloping  0   Normal        No
19    48   Female  Non-anginal     130       275   F    Normal    139      No     0.2      Upsloping  0   Normal        No
20    49   Male    Atypical        130       266   F    Normal    171      No     0.6      Upsloping  0   Normal        No
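The slide title says these samples are shown before scaling, but it does not say which scaling was applied afterwards. Assuming min-max scaling to the range [0, 1], which is one common choice, the numeric columns could be rescaled as sketched below; `min_max_scale` is a hypothetical helper and only the first five AGE and CHOL values are used.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# AGE and CHOL of the first five sample rows.
age = [63, 67, 67, 37, 41]
chol = [233, 286, 229, 250, 204]

scaled_age = min_max_scale(age)
scaled_chol = min_max_scale(chol)
print([round(v, 3) for v in scaled_age])
print([round(v, 3) for v in scaled_chol])
```

Scaling puts attributes measured in different units, such as years and mg/dl, on a comparable footing, which matters for distance-based and neural classifiers.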
Rules generated from data mining process
oldpeak(0.7) => disease(No)
oldpeak(4.4) => disease(Yes)
chol(233) AND restecg(LV hypertrophy) => disease(No)
chol(204) AND restecg(LV hypertrophy) => disease(No)
chol(236) AND restecg(Normal) => disease(No)
chol(203) AND restecg(LV hypertrophy) => disease(Yes)
chol(294) AND restecg(LV hypertrophy) => disease(No)
chol(275) AND restecg(Normal) => disease(No)
chol(266) AND restecg(Normal) => disease(No)
chol(247) AND restecg(Normal) => disease(No)
chol(219) AND restecg(LV hypertrophy) => disease(No)
chol(266) AND restecg(LV hypertrophy) => disease(Yes)
chol(304) AND restecg(Normal) => disease(No)
chol(254) AND restecg(Normal) => disease(Yes)
chol(267) AND restecg(Normal) => disease(Yes)
chol(264) AND restecg(LV hypertrophy) => disease(No)
chol(234) AND restecg(LV hypertrophy) => disease(No)