Data Mining: A Closer Look
Data Mining: Typical Problems
• Classification
• Estimation
• Prediction
Classification & Estimation
• Classification deals with discrete outcomes: yes
or no; big or small; strange or not strange; sick
or healthy; yellow, green, or red; etc. It
determines the class membership of a given
object.
• Estimation is often used to perform a
classification task: estimating the number of
children in a family; estimating a family’s total
household income; etc.
• Neural networks and regression models are
among the most widely used tools for
classification and estimation.
Prediction
• Prediction is the same as classification or
estimation, except that the records are
classified according to some predicted
future behavior or estimated future value.
• Any of the techniques used for
classification and estimation can be used in
prediction.
Classification and Prediction:
Implementation
• Both classification and prediction are
implemented using training examples, i.e.,
instances for which the value of the variable to
be predicted, or the class membership of the
instance to be classified, is already known.
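The idea of learning from examples whose class is already known can be sketched with a minimal nearest-neighbor classifier. This is a generic illustration, not an algorithm from the slides; the attribute values and class labels are invented:

```python
# Minimal sketch: classification from training examples whose class
# membership is already known, via a 1-nearest-neighbor rule.
# The numeric attributes and labels below are invented for illustration.

def classify_1nn(training, instance):
    """Return the known class of the training example closest to the
    new instance (Euclidean distance over numeric input attributes)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    best = min(training, key=lambda ex: dist(ex[0], instance))
    return best[1]

# Training examples: (input attributes, known class membership).
training = [
    ((150, 2.5), "sick"),
    ((175, 0.2), "healthy"),
    ((160, 1.8), "sick"),
    ((180, 0.0), "healthy"),
]

print(classify_1nn(training, (172, 0.4)))  # closest example is "healthy"
```

The new instance is assigned the class of whichever known example it most resembles; any supervised technique follows the same train-on-known, apply-to-unknown pattern.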
Is Data Mining Appropriate for
My Problem?
Will Data Mining help me?
• Can we clearly define the problem?
• Do potentially meaningful data exist?
• Do the data contain hidden knowledge, or are
they useful for reporting purposes only?
• Will the cost of processing the data be less
than the likely increase in profit gained by
applying any knowledge discovered through
data mining?
Data Mining vs. Data Query
• Shallow Knowledge
• Multidimensional Knowledge
• Hidden Knowledge
• Deep Knowledge
Shallow Knowledge
Shallow knowledge is factual. It can
be easily stored and manipulated in a
database.
Multidimensional Knowledge
Multidimensional knowledge is also
factual. On-Line Analytical Processing
(OLAP) tools are used to manipulate
multidimensional knowledge.
Hidden Knowledge
Hidden knowledge represents patterns
or regularities in data that cannot easily
be found with a database query.
However, data mining algorithms can
find such patterns with ease.
Deep Knowledge
Deep knowledge is knowledge stored
in a database that can only be found if
we are given some direction about what
we are looking for.
Data Mining vs. Data Query
• Shallow Knowledge (can be extracted by a
database query language such as SQL)
• Multidimensional Knowledge (can be extracted
by On-Line Analytical Processing (OLAP)
tools)
• Hidden Knowledge (patterns and regularities
in data that cannot easily be found; data
mining tools can be used)
• Deep Knowledge (can be found only if we are
given some direction about what we are
looking for; data mining tools can be used)
Data Mining vs. Data Query
• Use data query if you already know, at
least approximately, what you are
looking for.
• Use data mining to find regularities in
data that are not obvious and/or that
are hidden.
A Simple Data Mining Process
Model
Data Mining: A KDD Process
• Data mining is the core of the knowledge
discovery process.
• The process moves from raw data to evaluated
patterns:
Databases → Data Cleaning → Data Integration →
Data Warehouse → Selection → Task-relevant
Data → Data Mining → Pattern Evaluation
The Data Warehouse
The data warehouse is a historical
database designed for decision
support.
Data Mining Strategies
A hierarchy of data mining strategies:
• Unsupervised Clustering
• Supervised Learning
  – Classification
  – Estimation
  – Prediction
• Market Basket Analysis
Supervised Data Mining
Algorithms:
• A single output attribute/multiple output
attributes
• Output attributes are also called dependent
variables because they depend on the values
of input attributes (variables):
y = f(x1, ..., xn)
(y1, ..., yk) = f(x1, ..., xn)
• Input attributes are also known as
independent variables
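A trained supervised model is exactly such a function from independent to dependent variables. As a minimal sketch (toy numbers, not from the slides), a one-input estimator y = f(x1) fit by ordinary least squares:

```python
# Minimal sketch of a supervised model as a function y = f(x1, ..., xn),
# here with n = 1, fit by ordinary least squares on invented toy data.

def fit_line(xs, ys):
    """Fit y = a*x + b by least squares and return f as a function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

# Training examples: independent variable (input attribute) paired with
# the known value of the dependent variable (output attribute).
f = fit_line([1, 2, 3, 4], [40, 70, 100, 130])
print(round(f(5)))  # estimated output for an unseen input: 160
```

The input attribute values determine the output value, which is why the outputs are called dependent variables.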
Data Mining Strategies:
Classification
• Learning is supervised.
• The dependent variable (output) is
categorical.
• Well-defined classes.
• Current rather than future behavior.
Classify a loan applicant as a good or poor credit risk
Develop a customer profile
Classify a patient as sick or healthy
Data Mining Strategies:
Estimation
• Learning is supervised.
• The dependent variable (output) is
numeric.
• Well-defined classes.
• Current rather than future behavior.
Estimate the number of minutes before a thunderstorm will
reach a given location
Estimate the amount of credit card purchases
Estimate the salary of an individual
Data Mining Strategies:
Prediction
• The emphasis is on predicting future
rather than current outcomes.
• The output attribute may be categorical
or numeric.
Predict next week’s (year’s) currency exchange rate
Predict next week’s (year’s) Dow Jones Industrial closing
value
Predict the level of power consumption over some period of
time
Classification, Estimation or
Prediction?
The nature of the data determines whether a
model is suitable for classification,
estimation, or prediction.
The Cardiology Patient Dataset
This dataset contains 303 instances. Each
instance holds information about a patient who
either has or does not have a heart condition.
The Cardiology Patient Dataset
• 138 instances represent patients with heart disease.
• 165 instances contain information about patients free of
heart disease.
Table 2.1 • Cardiology Patient Data

Attribute Name             Mixed Values                Numeric Values  Comments
Age                        Numeric                     Numeric         Age in years
Sex                        Male, Female                1, 0            Patient gender
Chest Pain Type            Angina, Abnormal Angina,    1–4             NoTang = Nonanginal pain
                           NoTang, Asymptomatic
Blood Pressure             Numeric                     Numeric         Resting blood pressure upon
                                                                      hospital admission
Cholesterol                Numeric                     Numeric         Serum cholesterol
Fasting Blood Sugar < 120  True, False                 1, 0            Is fasting blood sugar less
                                                                      than 120?
Resting ECG                Normal, Abnormal, Hyp       0, 1, 2         Hyp = Left ventricular
                                                                      hypertrophy
Maximum Heart Rate         Numeric                     Numeric         Maximum heart rate achieved
Induced Angina?            True, False                 1, 0            Does the patient experience
                                                                      angina as a result of exercise?
Old Peak                   Numeric                     Numeric         ST depression induced by
                                                                      exercise relative to rest
Slope                      Up, Flat, Down              1–3             Slope of the peak exercise
                                                                      ST segment
Number Colored Vessels     0, 1, 2, 3                  0, 1, 2, 3      Number of major vessels
                                                                      colored by fluoroscopy
Thal                       Normal, Fix, Rev            3, 6, 7         Normal, fixed defect,
                                                                      reversible defect
Concept Class              Healthy, Sick               1, 0            Angiographic disease status
• Most and Least Typical Instances from the Cardiology Domain

Attribute Name             Most Typical   Least Typical  Most Typical   Least Typical
                           Healthy Class  Healthy Class  Sick Class     Sick Class
Age                        52             63             60             62
Sex                        Male           Male           Male           Female
Chest Pain Type            NoTang         Angina         Asymptomatic   Asymptomatic
Blood Pressure             138            145            125            160
Cholesterol                223            233            258            164
Fasting Blood Sugar < 120  False          True           False          False
Resting ECG                Normal         Hyp            Hyp            Hyp
Maximum Heart Rate         169            150            141            145
Induced Angina?            False          False          True           False
Old Peak                   0              2.3            2.8            6.2
Slope                      Up             Down           Flat           Down
Number of Colored Vessels  0              0              1              3
Thal                       Normal         Fix            Rev            Rev
Classification, Estimation or
Prediction?
The next two slides each contain a rule
generated from this dataset. Is either of
these rules predictive?
A Healthy Class Rule for the
Cardiology Patient Dataset
IF 169 <= Maximum Heart Rate <= 202
THEN Concept Class = Healthy
Rule accuracy: 85.07%
Rule coverage: 34.55%
A Sick Class Rule for the
Cardiology Patient Dataset
IF Thal = Rev & Chest Pain Type = Asymptomatic
THEN Concept Class = Sick
Rule accuracy: 91.14%
Rule coverage: 52.17%
Is the rule appropriate for
classification or prediction?
• Prediction: if an individual’s regularly
checked maximum heart rate is low, he or
she may be at risk of having a heart attack.
• Classification: if one has had a heart attack,
expect his or her maximum heart rate to
have decreased.
Data Mining Strategies:
Unsupervised Clustering
Unsupervised Clustering can be used to:
• determine if relationships can be found in the data.
• evaluate the likely performance of a supervised model.
• find a best set of input attributes for supervised learning.
• detect outliers.
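The core of unsupervised clustering is grouping instances without any class labels. A minimal sketch, using a one-dimensional k-means with k = 2 on invented values (the slides do not prescribe a particular clustering algorithm):

```python
# Minimal sketch of unsupervised clustering: 1-D k-means with k = 2.
# No class labels are supplied; the groups emerge from the data alone.
# The input values are invented for illustration.

def kmeans_1d(points, iters=10):
    """Partition points into two clusters; return the sorted centers."""
    c1, c2 = min(points), max(points)  # simple initial centers
    for _ in range(iters):
        a = [p for p in points if abs(p - c1) <= abs(p - c2)]
        b = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = sum(a) / len(a), sum(b) / len(b)  # recompute centers
    return sorted([c1, c2])

# Two natural groups are discovered without being told they exist.
print(kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5]))  # centers near 1.0 and 9.0
```

Inspecting which instances land in which cluster is one way to check whether meaningful relationships exist in the data.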
Data Mining Strategies:
Market Basket Analysis
• Find interesting relationships among
retail products.
• Uses association rule algorithms.
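The basic quantities behind association rule algorithms are support and confidence. A minimal sketch over invented toy transactions (item names and data are assumptions, not from the slides):

```python
# Minimal sketch of the core market basket computation: the support of
# an itemset and the confidence of an association rule, over invented
# toy retail transactions.

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"bread", "milk"}))       # 0.5 (2 of 4 baskets)
print(confidence({"bread"}, {"milk"}))  # 2 of the 3 bread baskets have milk
```

An association rule such as "bread → milk" is considered interesting when both its support and its confidence exceed chosen thresholds.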
Supervised Data Mining
Techniques
Generation of Production Rules
A Hypothesis for the Credit Card
Promotion Database
A combination of one or more of the dataset attributes
differentiates those Acme Credit Card Company card
holders who have taken advantage of the life insurance
promotion from those card holders who have chosen not
to participate in the promotional offer.
• The Credit Card Promotion Database

Income     Magazine   Watch      Life Insurance  Credit Card
Range ($)  Promotion  Promotion  Promotion       Insurance    Sex     Age
40–50K     Yes        No         No              No           Male    45
30–40K     Yes        Yes        Yes             No           Female  40
40–50K     No         No         No              No           Male    42
30–40K     Yes        Yes        Yes             Yes          Male    43
50–60K     Yes        No         Yes             No           Female  38
20–30K     No         No         No              No           Female  55
30–40K     Yes        No         Yes             Yes          Male    35
20–30K     No         Yes        No              No           Male    27
30–40K     Yes        No         No              No           Male    43
30–40K     Yes        Yes        Yes             No           Female  41
40–50K     No         Yes        Yes             No           Female  43
20–30K     No         Yes        Yes             No           Male    29
50–60K     Yes        Yes        Yes             No           Female  39
40–50K     No         Yes        No              No           Male    55
20–30K     No         No         Yes             Yes          Female  19
Rule Accuracy and Rule Coverage
• Rule accuracy is the percentage of the
instances to which the rule applies that
actually belong to the class determined by the
rule. For example, if the rule holds for 9 of the
10 instances to which it is applicable, its
accuracy is 90%.
• Rule coverage is the percentage of the
instances of the class in question that the rule
covers. For example, if the rule covers 10 of
the 20 instances of the class to be classified,
its coverage is 50%.
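The two measures can be sketched directly from the worked numbers in the text (9 correct of 10 applicable instances; 10 covered of 20 class instances):

```python
# Minimal sketch of rule accuracy and rule coverage, using the worked
# numbers from the text above.

def rule_accuracy(applicable, correct):
    """Correct consequents / instances the rule applies to (between-class)."""
    return 100.0 * correct / applicable

def rule_coverage(covered, class_size):
    """Covered class instances / total class instances (within-class)."""
    return 100.0 * covered / class_size

print(rule_accuracy(10, 9))   # 90.0, as in the text's example
print(rule_coverage(10, 20))  # 50.0, as in the text's example
```

Note the different denominators: accuracy is measured against the instances the rule fires on, coverage against the size of the target class.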
Rule Accuracy and Rule Coverage
• Rule accuracy is a between-class measure.
• Rule coverage is a within-class measure.
Production Rules for the
Credit Card Promotion Database
• IF Sex = Female & 19 <= Age <= 43
THEN Life Insurance Promotion = Yes
Rule Accuracy: 100.00% Rule Coverage: 66.67%
• IF Sex = Male & 40K <= Income Range <= 50K
THEN Life Insurance Promotion = No
Rule Accuracy: 100.00% Rule Coverage: 50.00%
• IF Credit Card Insurance = Yes
THEN Life Insurance Promotion = Yes
Rule Accuracy: 100.00% Rule Coverage: 33.33%
• IF 30K <= Income Range <= 40K & Watch Promotion = Yes
THEN Life Insurance Promotion = Yes
Rule Accuracy: 100.00% Rule Coverage: 33.33%
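The first rule's figures can be checked against the fifteen rows of the Credit Card Promotion Database shown earlier. A minimal sketch, encoding only the three attributes the check needs:

```python
# Sketch: verify rule 1 against the Credit Card Promotion Database:
# IF Sex = Female & 19 <= Age <= 43 THEN Life Insurance Promotion = Yes.
# Each tuple holds (Sex, Age, Life Insurance Promotion) from the table.

rows = [
    ("Male", 45, "No"),    ("Female", 40, "Yes"), ("Male", 42, "No"),
    ("Male", 43, "Yes"),   ("Female", 38, "Yes"), ("Female", 55, "No"),
    ("Male", 35, "Yes"),   ("Male", 27, "No"),    ("Male", 43, "No"),
    ("Female", 41, "Yes"), ("Female", 43, "Yes"), ("Male", 29, "Yes"),
    ("Female", 39, "Yes"), ("Male", 55, "No"),    ("Female", 19, "Yes"),
]

applies = [r for r in rows if r[0] == "Female" and 19 <= r[1] <= 43]
correct = [r for r in applies if r[2] == "Yes"]          # accuracy numerator
yes_class = [r for r in rows if r[2] == "Yes"]           # coverage denominator

accuracy = 100.0 * len(correct) / len(applies)
coverage = 100.0 * len(correct) / len(yes_class)
print(round(accuracy, 2), round(coverage, 2))  # 100.0 66.67
```

All six female card holders aged 19–43 took the promotion (accuracy 100%), and they account for 6 of the 9 takers overall (coverage 66.67%), matching the figures on the slide.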
Production Rules for the
Credit Card Promotion Database
• Rules 1–3 are predictive for new card holders,
since sex, age, income range, and credit card
insurance are known at application time.
• Rule 4 can be used only to classify existing
card holders, because the Watch Promotion
attribute is unknown for new applicants.