9/19/2013
Knowledge Discovery and Data
Mining
Unit # 2
Sajjad Haider
Fall 2013
Structured vs. Non-Structured Data
• Most business databases contain structured data
consisting of well-defined fields with numeric or alphanumeric values.
• Examples of semi-structured data are electronic images
of business documents, medical reports, executive
summaries, etc. The majority of web documents also
fall in this category.
• An example of unstructured data is a video recorded by
a surveillance camera in a departmental store. This
form of data generally requires extensive processing to
extract and structure the information contained in it.
Structured vs. Non-Structured Data
(Cont’d)
• Structured data is often referred to as
traditional data, while the semi-structured
and unstructured data are lumped together as
non-traditional data.
• Most of the current data mining methods and
commercial tools are applied to traditional
data.
SQL vs. Data Mining
• SQL (Structured Query Language) is a standard
relational database language that is good for queries
that impose some kind of constraints on data in the
database in order to extract an answer.
• In contrast, data mining methods are good for queries
that are exploratory in nature, trying to extract hidden,
not so obvious information.
• SQL is useful when we know exactly what we are
looking for and we can describe it formally.
• We use data mining methods when we know only
vaguely what we are looking for.
OLAP vs. Data Mining
• OLAP tools make it very easy to look at dimensional
data from any angle or to slice-and-dice it.
• The derivation of answers from data in OLAP is
analogous to calculations in a spreadsheet: the
calculations are simple and specified in advance.
• OLAP tools do not learn from data, nor do they create
new knowledge.
• They are usually special-purpose visualization tools
that can help end-users draw their own conclusions
and decisions, based on graphically condensed data.
Statistics vs. Machine Learning
• Data mining has its origins in various disciplines, of
which the two most important are statistics and
machine learning.
• Statistics has its roots in mathematics, and therefore,
there has been an emphasis on mathematical rigor, a
desire to establish that something is sensible on
theoretical grounds before testing it in practice.
• In contrast, the machine learning community has its
origin very much in computer practice. This has led to a
practical orientation, a willingness to test something
out to see how well it performs, without waiting for a
formal proof of effectiveness.
Statistics vs. Machine Learning
(Cont’d)
• Modern statistics is entirely driven by the
notion of a model. This is a postulated
structure, or an approximation to a structure,
which could have led to the data.
• In place of the statistical emphasis on models,
machine learning tends to emphasize
algorithms.
Types of Attributes
• There are different types of attributes
– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height in {tall, medium,
short}
– Ratio
• Examples: temperature in Kelvin, length, time, counts
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Discretization
• Attribute Transformation
Aggregation
• Combining two or more attributes (or objects)
into a single attribute (or object)
• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc
– More “stable” data
• Aggregated data tends to have less variability
Data Normalization
• Some data mining methods, typically those that
are based on distance computation between
points in an n-dimensional space, may need
normalized data for best results.
• If the values are not normalized, the distance
measures will overweight those features that
have, on average, larger values.
Normalization Techniques
• Decimal Scaling
– v’(i) = v(i) / 10^k
– for the smallest k such that max |v’(i)| < 1
• Min-Max Normalization
– v’(i) = [v(i) – min(v(i))]/[max(v(i)) – min(v(i))]
• Standard Deviation Normalization
– v’(i) = [v(i) – mean(v)]/sd(v)
6
9/19/2013
Normalization Example
• Given one-dimensional data set X = {-5.0, 23.0,
17.6, 7.23, 1.11}, normalize the data set using
– Decimal scaling on interval [-1, 1].
– Min-max normalization on interval [0, 1].
– Standard deviation normalization.
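The three techniques above can be applied to the exercise data set directly. A minimal sketch (the function names are mine, not from the slides):

```python
# The three normalization techniques from the slides, applied to the
# one-dimensional exercise data set X = {-5.0, 23.0, 17.6, 7.23, 1.11}.
import math

X = [-5.0, 23.0, 17.6, 7.23, 1.11]

def decimal_scaling(v):
    # Divide by 10^k for the smallest k such that max |v'(i)| < 1.
    k = 0
    while max(abs(x) for x in v) / 10 ** k >= 1:
        k += 1
    return [x / 10 ** k for x in v]

def min_max(v):
    # v'(i) = [v(i) - min(v)] / [max(v) - min(v)], mapping onto [0, 1].
    lo, hi = min(v), max(v)
    return [(x - lo) / (hi - lo) for x in v]

def std_dev(v):
    # v'(i) = [v(i) - mean(v)] / sd(v), using the sample standard deviation.
    m = sum(v) / len(v)
    sd = math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))
    return [(x - m) / sd for x in v]

print(decimal_scaling(X))  # k = 2, so every value is divided by 100
print(min_max(X))          # -5.0 maps to 0.0, 23.0 maps to 1.0
print(std_dev(X))          # values expressed in standard deviations from the mean
```

For decimal scaling, max |v(i)| = 23.0, so k = 2 and the interval condition max |v’(i)| < 1 is met.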
Outlier Detection
• Statistics-based Methods (for one dimensional
data)
– Threshold = Mean + K x Standard Deviation
– Age = {3, 56, 23, 39, 156, 52, 41, 22, 9, 28, 139, 31, 55,
20, -67, 37, 11, 55, 45, 37}
• Distance-based Methods (for multidimensional
data)
– Distance-based outliers are those samples which do
not have enough neighbors, where neighbors are
defined through the multidimensional distance
between samples.
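The statistics-based method can be sketched on the age data from the slide. The choice K = 2 and the two-sided form of the threshold (mean ± K × standard deviation) are assumptions on my part; the slide writes only the upper threshold:

```python
import math

# One-dimensional age data from the slide.
age = [3, 56, 23, 39, 156, 52, 41, 22, 9, 28, 139, 31, 55,
       20, -67, 37, 11, 55, 45, 37]

m = sum(age) / len(age)
sd = math.sqrt(sum((x - m) ** 2 for x in age) / (len(age) - 1))

K = 2  # assumed choice of K
# Flag values farther than K standard deviations from the mean (two-sided).
outliers = [x for x in age if abs(x - m) > K * sd]

print(round(m, 1), round(sd, 1))  # 39.6 45.9
print(outliers)                   # [156, 139, -67]
```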
Outlier Detection (Distance-based)
• S = {s1, s2, s3, s4, s5, s6, s7} = {(2, 4), (3, 2), (1, 1), (4, 3), (1, 6), (5, 3), (4, 2)}
• Threshold values: p > 4, d > 3

Euclidean distances between samples:

      s2     s3     s4     s5     s6     s7
s1   2.236  3.162  2.236  2.236  3.162  2.828
s2          2.236  1.414  4.472  2.236  1.000
s3                 3.605  5.000  4.472  3.162
s4                        4.242  1.000  1.000
s5                               5.000  5.000
s6                                      1.414

Number of distances greater than d = 3 for each sample:

Sample  p
s1      2
s2      1
s3      5
s4      2
s5      5
s6      3
s7      2
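The distance table and the p-counts can be reproduced with a short script (variable names are mine):

```python
import math
from itertools import combinations

# Samples and thresholds from the slide: d > 3, p > 4.
S = [(2, 4), (3, 2), (1, 1), (4, 3), (1, 6), (5, 3), (4, 2)]
d_threshold, p_threshold = 3, 4

# For each sample, count how many other samples lie farther than d_threshold.
far_counts = [0] * len(S)
for i, j in combinations(range(len(S)), 2):
    if math.dist(S[i], S[j]) > d_threshold:
        far_counts[i] += 1
        far_counts[j] += 1

# A sample is a distance-based outlier if it has too few close neighbors,
# i.e. more than p_threshold samples farther away than d_threshold.
outliers = [f"s{i + 1}" for i, p in enumerate(far_counts) if p > p_threshold]
print(far_counts)  # [2, 1, 5, 2, 5, 3, 2]
print(outliers)    # ['s3', 's5']
```

With these thresholds, s3 and s5 are flagged: each has five non-neighbors, exceeding p = 4.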
Outlier Detection Example II
• The number of children for different patients
in a database is given with a vector C = {3, 1, 0,
2, 7, 3, 6, 4, -2, 0, 0, 10, 15, 6}.
– Find the outliers in the set C using standard
statistical parameters mean and variance.
– If the threshold value is changed from +3 standard
deviations to +2 standard deviations, what
additional outliers are found?
Outlier Detection Example III
• For a given data set X of three-dimensional samples, X
= [{1, 2, 0}, {3, 1, 4}, {2, 1, 5}, {0, 1, 6}, {2, 4, 3}, {4, 4, 2},
{5, 2, 1}, {7, 7, 7}, {0, 0, 0}, {3, 3, 3}].
• Find the outliers using the distance-based technique if
– The threshold distance is 4, and threshold fraction p for
non-neighbor samples is 3.
– The threshold distance is 6, and threshold fraction p for
non-neighbor samples is 2.
• Describe the procedure and interpret the results of
outlier detection based on mean values and variances
for each dimension separately.
Data Reduction
• The three basic operations in a data-reduction process are:
– Delete a row
– Delete a column (dimensionality reduction)
– Reduce the number of values in a column (smooth a feature)
• The main advantages of data reduction are
– Computing time – simpler data can hopefully lead to a
reduction in the time taken for data mining.
– Predictive/descriptive accuracy – we generally expect that, by
using only relevant features, a data mining algorithm can not
only learn faster but also achieve higher accuracy. Irrelevant
data may mislead a learning process.
– Representation of the data-mining model – The simplicity of
representation often implies that a model can be better
understood.
Sampling …
• The key principle for effective sampling is the
following:
– using a sample will work almost as well as using
the entire data set, if the sample is
representative
– A sample is representative if it has approximately
the same property (of interest) as the original set
of data
Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item
• Stratified sampling
– Split the data into several partitions; then draw random samples from
each partition
• Sampling without replacement
– As each item is selected, it is removed from the population
• Sampling with replacement
– Objects are not removed from the population as they are selected for the
sample.
• In sampling with replacement, the same object can be picked more than once
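The sampling variants above can be sketched with the standard library; the population, strata split, and sample sizes are arbitrary choices for illustration:

```python
import random

population = list(range(100))
random.seed(0)  # fixed seed so the sketch is reproducible

# Simple random sampling without replacement: each item appears at most once.
without = random.sample(population, 10)

# Sampling with replacement: the same item can be picked more than once.
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling: split into partitions, then draw from each partition.
strata = [population[:50], population[50:]]  # two equal partitions (assumed)
stratified = [x for s in strata for x in random.sample(s, 5)]

print(without)
print(with_repl)
print(stratified)  # five items drawn from each stratum
```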
Sample Size
(Figure: the same data set drawn with 8000, 2000, and 500 sample points.)
Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
– duplicate much or all of the information contained in
one or more other attributes
– Example: purchase price of a product and the amount
of sales tax paid
• Irrelevant features
– contain no information that is useful for the data
mining task at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA
Mean and Variance based Feature
Selection
• Suppose A and B are sets of feature values
measured for two different classes, and n1
and n2 are the corresponding number of
samples.
– SE(A – B) = Sqrt (var(A)/n1 + var(B)/n2)
– TEST: |mean(A) – mean(B)|/SE(A – B) > threshold
value
• It is assumed that the given feature is
independent of the others.
Mean-Variance Example
• Data set (features X and Y; class C):

X    Y    C
0.3  0.7  A
0.2  0.9  B
0.6  0.6  A
0.5  0.5  A
0.7  0.7  B
0.4  0.9  B

• SE(XA – XB) = 0.169
• SE(YA – YB) = 0.0875
• |mean(XA) – mean(XB)| / SE(XA – XB) = 0.197 < 0.5
• |mean(YA) – mean(YB)| / SE(YA – YB) = 2.667 > 0.5
• Feature X fails the test and can be discarded; feature Y passes the test and is retained.
Feature Ranking Exercise
• Given the data set X with three input features and
one output feature representing the classification
of samples
I1   I2   I3   O
2.5  1.6  5.9  0
7.2  4.3  2.1  1
3.4  5.8  1.6  1
5.6  3.6  6.8  0
4.8  7.2  3.1  1
8.1  4.9  8.3  0
6.3  4.8  2.4  1

• Rank the features using a comparison of means
and variances
Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the
attributes is the class.
• Find a model for class attribute as a function
of the values of other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into training
and test sets, with training set used to build the model
and test set used to validate it.
Classification: Motivation
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Decision/Classification Tree
age?
– <=30 → student?
  – no → no
  – yes → yes
– 31..40 → yes
– >40 → credit rating?
  – excellent → no
  – fair → yes
Illustrating Classification Task
Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

(Induction) Training Set → Learning algorithm → Learn Model → Model

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

(Deduction) Test Set → Apply Model (using the learned Model) → Class
Example of a Decision Tree
Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes Refund, MarSt, TaxInc):

Refund?
– Yes → NO
– No → MarSt?
  – Single, Divorced → TaxInc?
    – < 80K → NO
    – > 80K → YES
  – Married → NO
Another Example of Decision Tree
(The same training data as on the previous slide.)

MarSt?
– Married → NO
– Single, Divorced → Refund?
  – Yes → NO
  – No → TaxInc?
    – < 80K → NO
    – > 80K → YES

There could be more than one tree that fits
the same data!
Decision Tree Classification Task
(Same training and test sets as in the Illustrating Classification Task slide; the learning algorithm is now a tree induction algorithm and the model is a decision tree.)

(Induction) Training Set → Tree Induction algorithm → Learn Model → Decision Tree
(Deduction) Test Set → Apply Model (using the Decision Tree) → Class
Apply Model to Test Data

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branches that match the record:

Refund?
– Yes → NO
– No → MarSt?
  – Single, Divorced → TaxInc?
    – < 80K → NO
    – > 80K → YES
  – Married → NO

• Refund = No, so take the No branch to MarSt.
• Marital Status = Married, so take the Married branch, which is a leaf.
• Assign Cheat to “No”.
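The traversal can be written out as code. The function below is my hand-coded encoding of this particular tree, not output of an induction algorithm:

```python
def classify(record):
    """Apply the decision tree from the slides to one record."""
    # Root: Refund?
    if record["Refund"] == "Yes":
        return "No"
    # Refund = No: test marital status.
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: test taxable income (in thousands).
    return "Yes" if record["TaxInc"] > 80 else "No"

test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))  # prints "No": Cheat is assigned "No"
```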
Decision Tree Classification Task
(This slide repeats the training/test tables and the induction–deduction diagram shown earlier for the decision tree classification task.)
Tree Induction
• Greedy strategy.
– Split the records based on an attribute test that
optimizes a certain criterion.
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
How to Specify Test Condition?
• Depends on attribute types
– Nominal
– Ordinal
– Continuous
• Depends on number of ways to split
– 2-way split
– Multi-way split
How to determine the Best Split
• Greedy approach:
– Nodes with homogeneous class distribution are
preferred
• Need a measure of node impurity:
C0: 5, C1: 5 → Non-homogeneous, high degree of impurity
C0: 9, C1: 1 → Homogeneous, low degree of impurity
Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error
Measure of Impurity: GINI
• Gini Index for a given node t :
GINI(t) = 1 – Σj [p(j|t)]²

(NOTE: p(j|t) is the relative frequency of class j at node t.)
– Maximum (1 – 1/nc) when records are equally distributed
among all classes, implying least interesting information
– Minimum (0.0) when all records belong to one class,
implying most interesting information
C1: 0, C2: 6 → Gini = 0.000
C1: 1, C2: 5 → Gini = 0.278
C1: 2, C2: 4 → Gini = 0.444
C1: 3, C2: 3 → Gini = 0.500
Examples for computing GINI
GINI(t) = 1 – Σj [p(j|t)]²

C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Gini = 1 – P(C1)² – P(C2)² = 1 – 0 – 1 = 0

C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6
  Gini = 1 – (1/6)² – (5/6)² = 0.278

C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6
  Gini = 1 – (2/6)² – (4/6)² = 0.444
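The computations above follow from a one-line function (the name `gini` is mine):

```python
def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2, where p(j|t) is the relative
    # frequency of class j at node t, computed from the class counts.
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(gini(counts), 3))  # 0.0, 0.278, 0.444, 0.5
```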
Classification: Motivation
(The buys_computer training data shown earlier, repeated here for the GINI computations that follow.)
Binary Attributes: Computing GINI Index
• Splits into two partitions
• Effect of weighting partitions:
– Larger and purer partitions are sought
Student?
– Yes → Node N1
– No → Node N2

Gini(N1) = 1 – (6/7)² – (1/7)² = 0.24
Gini(N2) = 1 – (3/7)² – (4/7)² = 0.49

Gini(Student) = 7/14 × 0.24 + 7/14 × 0.49 = ??
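A sketch of the weighted split computation, using the class counts from the buys_computer data (6 yes / 1 no for students, 3 yes / 4 no for non-students); the exact result differs slightly from the slide's two-decimal arithmetic:

```python
def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2, from the class counts at node t.
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# Class counts (buys_computer = yes/no) in the two partitions of Student?.
n1 = [6, 1]  # student = yes: 6 yes, 1 no
n2 = [3, 4]  # student = no:  3 yes, 4 no

# Weight each partition's Gini by its share of the records.
total = sum(n1) + sum(n2)
gini_split = sum(n1) / total * gini(n1) + sum(n2) / total * gini(n2)
print(round(gini_split, 3))  # 0.367 (the slide's rounded 0.24/0.49 give 0.365)
```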
GINI Index for Buy Computer Example
• Gini (Income):
• Gini (Credit_Rating):
• Gini (Age):