Basic Data Mining Techniques
Chapter 3-B
Association Rules
K-Means Algorithm
Genetic Algorithm
3.2 Generating Association Rules
Typical applications: affinity analysis, market basket analysis
IF A
THEN B
Association rules are unlike traditional classification rules: any attribute may appear as a precondition or as a consequent, and a consequent may contain more than one attribute-value pair, e.g.
IF C
THEN A and B
Rule Confidence and Support
Grocery store products: Milk, Cheese, Bread, Eggs
- If a customer purchases milk, they also purchase bread.
- If a customer purchases bread, they also purchase milk.
Number of customer transactions:
Milk: 10,000    Bread: 20,000    Milk & Bread: 5,000
Rule Confidence
Given a rule of the form "If A then B", rule confidence is the conditional probability that B is true when A is known to be true (≈ rule accuracy):
P(B|A)
P(Bread|Milk) = ?
P(Milk|Bread) = ?
Rule Support
The minimum percentage of instances (transactions) in the database that contain all items listed in a given association rule.
Support = (number of transactions containing all items in the rule) ÷ (total number of transactions)
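To make the two definitions concrete, here is a minimal Python sketch computing the confidence and support of the milk/bread rules. The three counts come from the slide; the total number of transactions is not given, so the 100,000 below is an assumed placeholder.

```python
# Transaction counts from the slide
milk = 10_000            # transactions containing milk
bread = 20_000           # transactions containing bread
milk_and_bread = 5_000   # transactions containing both

total = 100_000          # ASSUMED: the slide does not give the total

# Confidence of "IF A THEN B" = P(B|A) = count(A and B) / count(A)
print(milk_and_bread / milk)    # P(Bread|Milk) = 0.50
print(milk_and_bread / bread)   # P(Milk|Bread) = 0.25

# Support = count(all items in the rule) / total transactions
print(milk_and_bread / total)   # 0.05 under the assumed total
```

Note the asymmetry: both candidate rules have the same support, but they differ in confidence, which is why "milk implies bread" is the stronger of the two rules here.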
Mining Association Rules:
An Example
Apriori Algorithm (Agrawal et al. 1993)
Step 1: Generate item sets.
Step 2: Create a set of association rules from the item sets.
Item set: a combination of attribute-value pairs that meets a specified coverage requirement (e.g., covered by at least 4 transactions).
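The following is a minimal sketch of the item set generation step, using an encoding of the Table 3.3 data shown below (attribute names abbreviated; the encoding is mine, not the book's). With the coverage requirement set to 4 it reproduces the single- and two-item sets of Tables 3.4 and 3.5; a full Apriori implementation would continue level by level to larger item sets.

```python
# Table 3.3: a subset of the credit card promotion database
# (mag = magazine promotion, watch = watch promotion,
#  life = life insurance promotion, cc = credit card insurance)
ROWS = [
    {"mag": "Yes", "watch": "No",  "life": "No",  "cc": "No",  "sex": "Male"},
    {"mag": "Yes", "watch": "Yes", "life": "Yes", "cc": "No",  "sex": "Female"},
    {"mag": "No",  "watch": "No",  "life": "No",  "cc": "No",  "sex": "Male"},
    {"mag": "Yes", "watch": "Yes", "life": "Yes", "cc": "Yes", "sex": "Male"},
    {"mag": "Yes", "watch": "No",  "life": "Yes", "cc": "No",  "sex": "Female"},
    {"mag": "No",  "watch": "No",  "life": "No",  "cc": "No",  "sex": "Female"},
    {"mag": "Yes", "watch": "No",  "life": "Yes", "cc": "Yes", "sex": "Male"},
    {"mag": "No",  "watch": "Yes", "life": "No",  "cc": "No",  "sex": "Male"},
    {"mag": "Yes", "watch": "No",  "life": "No",  "cc": "No",  "sex": "Male"},
    {"mag": "Yes", "watch": "Yes", "life": "Yes", "cc": "No",  "sex": "Female"},
]
COVERAGE = 4  # minimum number of covering transactions

def coverage(item_set):
    """Count rows matching every (attribute, value) pair in item_set."""
    return sum(all(row[a] == v for a, v in item_set) for row in ROWS)

# Candidate single items: every attribute-value pair that occurs in the data
items = sorted({(a, v) for row in ROWS for a, v in row.items()})
single = [((i,), coverage((i,))) for i in items]
single = [s for s in single if s[1] >= COVERAGE]          # Table 3.4

# Two-item sets: pair surviving single items on distinct attributes
two = []
for (i1,), _ in single:
    for (i2,), _ in single:
        if i1 < i2 and i1[0] != i2[0]:
            n = coverage((i1, i2))
            if n >= COVERAGE:
                two.append(((i1, i2), n))                 # Table 3.5

for item_set, n in single + two:
    print(" & ".join(f"{a} = {v}" for a, v in item_set), "->", n)
```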
Table 3.3 • A Subset of the Credit Card Promotion Database

Magazine Promotion | Watch Promotion | Life Insurance Promotion | Credit Card Insurance | Sex
Yes                | No              | No                       | No                    | Male
Yes                | Yes             | Yes                      | No                    | Female
No                 | No              | No                       | No                    | Male
Yes                | Yes             | Yes                      | Yes                   | Male
Yes                | No              | Yes                      | No                    | Female
No                 | No              | No                       | No                    | Female
Yes                | No              | Yes                      | Yes                   | Male
No                 | Yes             | No                       | No                    | Male
Yes                | No              | No                       | No                    | Male
Yes                | Yes             | Yes                      | No                    | Female
Table 3.4 • Single-Item Sets

Single-Item Sets                | Number of Items
Magazine Promotion = Yes        | 7
Watch Promotion = Yes           | 4
Watch Promotion = No            | 6
Life Insurance Promotion = Yes  | 5
Life Insurance Promotion = No   | 5
Credit Card Insurance = No      | 8
Sex = Male                      | 6
Sex = Female                    | 4
Candidate two-item sets are built from the single-item sets; those covering fewer than 4 transactions are discarded:

Magazine Promotion = Yes & Watch Promotion = Yes           | 3 (discarded)
Magazine Promotion = Yes & Watch Promotion = No            | 4
Magazine Promotion = Yes & Life Insurance Promotion = Yes  | 5
Table 3.5 • Two-Item Sets

Two-Item Sets                                               | Number of Items
Magazine Promotion = Yes & Watch Promotion = No             | 4
Magazine Promotion = Yes & Life Insurance Promotion = Yes   | 5
Magazine Promotion = Yes & Credit Card Insurance = No       | 5
Magazine Promotion = Yes & Sex = Male                       | 4
Watch Promotion = No & Life Insurance Promotion = No        | 4
Watch Promotion = No & Credit Card Insurance = No           | 5
Watch Promotion = No & Sex = Male                           | 4
Life Insurance Promotion = No & Credit Card Insurance = No  | 5
Life Insurance Promotion = No & Sex = Male                  | 4
Credit Card Insurance = No & Sex = Male                     | 4
Credit Card Insurance = No & Sex = Female                   | 4

Three-Item Sets                                                                    | Number of Items
Watch Promotion = No & Life Insurance Promotion = No & Credit Card Insurance = No  | 4
Two-Item Set Rules
IF Magazine Promotion = Yes
THEN Life Insurance Promotion = Yes (5/7)
IF Life Insurance Promotion = Yes
THEN Magazine Promotion = Yes (5/5)
Three-Item Set Rules
IF Watch Promotion = No & Life Insurance Promotion = No
THEN Credit Card Insurance = No (4/4)
IF Watch Promotion = No
THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6)
IF Credit Card Insurance = No
THEN Watch Promotion = No & Life Insurance Promotion = No (4/8)
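A minimal sketch of the second Apriori step: given an item set that meets the coverage requirement, enumerate every antecedent/consequent split and compute each rule's confidence. The tuple encoding of Table 3.3 is mine; the sketch prints all six candidate rules for the three-item set, three of which are the rules shown above.

```python
from itertools import combinations

# Table 3.3 rows as (magazine, watch, life insurance, credit card, sex)
ATTRS = ("Magazine Promotion", "Watch Promotion",
         "Life Insurance Promotion", "Credit Card Insurance", "Sex")
ROWS = [
    ("Yes", "No",  "No",  "No",  "Male"),
    ("Yes", "Yes", "Yes", "No",  "Female"),
    ("No",  "No",  "No",  "No",  "Male"),
    ("Yes", "Yes", "Yes", "Yes", "Male"),
    ("Yes", "No",  "Yes", "No",  "Female"),
    ("No",  "No",  "No",  "No",  "Female"),
    ("Yes", "No",  "Yes", "Yes", "Male"),
    ("No",  "Yes", "No",  "No",  "Male"),
    ("Yes", "No",  "No",  "No",  "Male"),
    ("Yes", "Yes", "Yes", "No",  "Female"),
]

def count(conditions):
    """Number of rows satisfying every (attribute_index, value) condition."""
    return sum(all(row[i] == v for i, v in conditions) for row in ROWS)

def rules_from(item_set):
    """Split an item set into every antecedent/consequent pair; a rule's
    confidence is count(whole item set) / count(antecedent)."""
    whole = count(item_set)
    for k in range(1, len(item_set)):
        for antecedent in combinations(item_set, k):
            consequent = [c for c in item_set if c not in antecedent]
            yield antecedent, consequent, whole, count(antecedent)

# The three-item set from the slides:
# Watch Promotion = No & Life Insurance Promotion = No & Credit Card Insurance = No
item_set = [(1, "No"), (2, "No"), (3, "No")]
for ante, cons, num, den in rules_from(item_set):
    fmt = lambda cs: " & ".join(f"{ATTRS[i]} = {v}" for i, v in cs)
    print(f"IF {fmt(ante)} THEN {fmt(cons)}  ({num}/{den})")
```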
General Considerations
• We are interested in association rules that show a
lift in product sales where the lift is the result
of the product’s association with one or more
other products.
• We are also interested in association rules that show a
lower than expected confidence for a particular
association.
3.3 The K-Means Algorithm
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3 and 4 until the cluster centers do not change.
An Example Using K-Means
Table 3.6 • K-Means Input Values

Instance | X   | Y
1        | 1.0 | 1.5
2        | 1.0 | 4.5
3        | 2.0 | 1.5
4        | 2.0 | 3.5
5        | 3.0 | 2.5
6        | 5.0 | 6.0
[Figure 3.6 A coordinate mapping of the data in Table 3.6: the six instances plotted as points, x from 0 to 6, f(x) from 0 to 7]
Euclidean distance between (x1, y1) and (x2, y2):
distance = sqrt((x1 - x2)^2 + (y1 - y2)^2)
First iteration: C1 = (1.0, 1.5), C2 = (2.0, 1.5)

Distance(C1 - 1) = 0.00    Distance(C2 - 1) = 1.00
Distance(C1 - 2) = 3.00    Distance(C2 - 2) = 3.16
Distance(C1 - 3) = 1.00    Distance(C2 - 3) = 0.00
Distance(C1 - 4) = 2.24    Distance(C2 - 4) = 2.00
Distance(C1 - 5) = 2.24    Distance(C2 - 5) = 1.41
Distance(C1 - 6) = 6.02    Distance(C2 - 6) = 5.41

C1 gets instances 1, 2:       x = (1.0 + 1.0) / 2 = 1.0    y = (1.5 + 4.5) / 2 = 3.0
C2 gets instances 3, 4, 5, 6: x = (2.0 + 2.0 + 3.0 + 5.0) / 4 = 3.0    y = (1.5 + 3.5 + 2.5 + 6.0) / 4 = 3.375
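Here is a minimal sketch of the full algorithm on the Table 3.6 data (the function and variable names are illustrative, not from the book). Seeded with instances 1 and 3 as above, its first update step reproduces the centers (1.0, 3.0) and (3.0, 3.375), and the run converges to outcome 3 of Table 3.7: centers (1.8, 2.7) and (5.0, 6.0) with squared error 9.60.

```python
from math import dist   # Euclidean distance (Python 3.8+)

# Table 3.6 input values
POINTS = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]

def k_means(points, centers):
    """Repeat the assignment and update steps until the centers stop
    changing. (No handling for empty clusters; none occur on this data.)"""
    while True:
        # Step 3: assign each instance to its closest cluster center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each center as the mean of its cluster
        new_centers = [tuple(sum(v) / len(c) for v in zip(*c)) for c in clusters]
        if new_centers == centers:          # Step 5: converged
            return centers, clusters
        centers = new_centers

# Seed with instances 1 and 3, as in the worked example above
centers, clusters = k_means(POINTS, [(1.0, 1.5), (2.0, 1.5)])
sq_error = sum(dist(p, c) ** 2 for c, cl in zip(centers, clusters) for p in cl)
print(centers, round(sq_error, 2))   # [(1.8, 2.7), (5.0, 6.0)] 9.6
```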
Table 3.7 • Several Applications of the K-Means Algorithm (K = 2)

Outcome | Cluster Centers | Cluster Points | Squared Error
1       | (2.67, 4.67)    | 2, 4, 6        | 14.50
        | (2.00, 1.83)    | 1, 3, 5        |
2       | (1.5, 1.5)      | 1, 3           | 15.94
        | (2.75, 4.125)   | 2, 4, 5, 6     |
3       | (1.8, 2.7)      | 1, 2, 3, 4, 5  | 9.60
        | (5, 6)          | 6              |
[Figure 3.7 A K-Means clustering of the data in Table 3.6 (K = 2)]
General Considerations
• Requires real-valued data.
• We must select the number of clusters present in the data.
• Works best when the clusters in the data are of approximately equal size.
• Attribute significance cannot be determined.
• Lacks explanation capabilities.
3.4 Genetic Learning
Algorithm:
1. Initialize a population P of n elements, called chromosomes; each is a potential solution.
2. Until a specified termination condition is satisfied:
   a. If an element passes the fitness function, it remains in P; the population P now holds m elements (m < n).
   b. Use the genetic operators to create (n - m) new elements, so that P again holds m + (n - m) = n elements.
Genetic Learning Operators
• Crossover
• Mutation
• Selection
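Before the supervised example, here is a minimal sketch of the loop described above; `fitness`, `crossover`, and `mutate` are placeholder callables (illustrative names, not from the slides) to be supplied per application. In the supervised example that follows, a threshold of 1.0 would separate the surviving elements 3 and 4 (scores 1.20 and 1.00) from elements 1 and 2 (0.80 and 0.86), which are replaced through crossover.

```python
import random

def genetic_search(population, fitness, crossover, mutate,
                   threshold, max_generations=100):
    """Keep elements whose fitness passes the threshold, then refill the
    population with crossover/mutation products until it has n elements."""
    n = len(population)
    for _ in range(max_generations):               # termination condition
        survivors = [e for e in population if fitness(e) >= threshold]
        if len(survivors) == n:                    # every element passes
            return survivors
        if len(survivors) < 2:                     # keep a workable gene pool
            survivors = list(population)
        children = []
        while len(survivors) + len(children) < n:  # create the (n - m) new elements
            a, b = random.sample(survivors, 2)     # selection
            for child in crossover(a, b):          # crossover yields two children
                if len(survivors) + len(children) < n:
                    children.append(mutate(child))
        population = survivors + children
    return population
```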
(1) Genetic Algorithms and Supervised Learning
[Figure 3.8 Supervised genetic learning: population elements are evaluated by a fitness function against the training data; elements that pass are kept in the population, while those thrown out are replaced by candidates produced through crossover and mutation]
Example: the Credit Card Promotion database
Goal: to create a model able to differentiate individuals who have accepted the life insurance promotion from those who have not.
Table 3.8 • An Initial Population for Supervised Genetic Learning

Population Element | Income Range | Life Insurance Promotion | Credit Card Insurance | Sex    | Age
1                  | 20–30K       | No                       | Yes                   | Male   | 30–39
2                  | 30–40K       | Yes                      | No                    | Female | 50–59
3                  | ?            | No                       | No                    | Male   | 40–49
4                  | 30–40K       | Yes                      | Yes                   | Male   | 40–49
Table 3.9 • Training Data for Genetic Learning

Training Instance | Income Range | Life Insurance Promotion | Credit Card Insurance | Sex    | Age
1                 | 30–40K       | Yes                      | Yes                   | Male   | 30–39
2                 | 30–40K       | Yes                      | No                    | Female | 40–49
3                 | 50–60K       | Yes                      | No                    | Female | 30–39
4                 | 20–30K       | No                       | No                    | Female | 50–59
5                 | 20–30K       | No                       | No                    | Male   | 20–29
6                 | 30–40K       | No                       | No                    | Male   | 40–49
Fitness Function: compare each population element to the training instances and calculate a fitness score.

For a single population element E (its class is its life insurance promotion value):
1. N = the number of matches of the input attribute values of E with training instances from its own class.
2. M = the number of matches of the input attribute values of E with training instances from the competing class.
3. M = M + 1
4. Fitness Score = N / M

F(1) = 4 / 5 = 0.80
F(2) = 6 / 7 = 0.86
F(3) = 6 / 5 = 1.20
F(4) = 5 / 5 = 1.00
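A minimal sketch of this fitness computation on the data of Tables 3.8 and 3.9. Two assumptions are baked in, both consistent with the worked numbers: life insurance promotion is the class attribute, and an unspecified value ("?") matches nothing. Under those assumptions the sketch reproduces F(1) through F(4) above.

```python
# Attribute order: income range, life insurance promotion (the class),
# credit card insurance, sex, age
POPULATION = [                                  # Table 3.8
    ("20-30K", "No",  "Yes", "Male",   "30-39"),
    ("30-40K", "Yes", "No",  "Female", "50-59"),
    ("?",      "No",  "No",  "Male",   "40-49"),
    ("30-40K", "Yes", "Yes", "Male",   "40-49"),
]
TRAINING = [                                    # Table 3.9
    ("30-40K", "Yes", "Yes", "Male",   "30-39"),
    ("30-40K", "Yes", "No",  "Female", "40-49"),
    ("50-60K", "Yes", "No",  "Female", "30-39"),
    ("20-30K", "No",  "No",  "Female", "50-59"),
    ("20-30K", "No",  "No",  "Male",   "20-29"),
    ("30-40K", "No",  "No",  "Male",   "40-49"),
]
CLASS = 1  # index of life insurance promotion, the output attribute

def matches(element, instance):
    """Count input-attribute agreements; '?' (unspecified) never matches."""
    return sum(e == t and e != "?"
               for i, (e, t) in enumerate(zip(element, instance)) if i != CLASS)

def fitness(element):
    own  = sum(matches(element, t) for t in TRAINING if t[CLASS] == element[CLASS])
    comp = sum(matches(element, t) for t in TRAINING if t[CLASS] != element[CLASS])
    return own / (comp + 1)

for k, e in enumerate(POPULATION, start=1):
    print(f"F({k}) = {fitness(e):.2f}")   # 0.80, 0.86, 1.20, 1.00
```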
Figure 3.9 A crossover operation

Before crossover:

Population Element | Income Range | Life Insurance Promotion | Credit Card Insurance | Sex    | Age
#1                 | 20–30K       | No                       | Yes                   | Male   | 30–39
#2                 | 30–40K       | Yes                      | No                    | Female | 50–59

After crossover (the values following the second attribute are swapped):

Population Element | Income Range | Life Insurance Promotion | Credit Card Insurance | Sex    | Age
#1                 | 20–30K       | No                       | No                    | Female | 50–59
#2                 | 30–40K       | Yes                      | Yes                   | Male   | 30–39
Table 3.10 • A Second-Generation Population (Crossover Operation)

Population Element | Income Range | Life Insurance Promotion | Credit Card Insurance | Sex    | Age
1                  | 20–30K       | No                       | No                    | Female | 50–59
2                  | 30–40K       | Yes                      | Yes                   | Male   | 30–39
3                  | ?            | No                       | No                    | Male   | 40–49
4                  | 30–40K       | Yes                      | Yes                   | Male   | 40–49
F(1) = 7 / 5 = 1.40
F(2) = 6 / 4 = 1.50
Example of a mutation operation: a single attribute value is randomly changed, e.g. Income Range 30–40K → 50–60K.
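A minimal sketch of the two operators on the tuple encoding used in the fitness sketch above. The fixed crossover point after the second attribute mirrors Figure 3.9; a real implementation would usually pick the crossover point and the mutated value at random. INCOME_VALUES is an assumed domain for the income attribute.

```python
import random

def crossover(a, b, point=2):
    """Swap everything after the crossover point, as in Figure 3.9."""
    return a[:point] + b[point:], b[:point] + a[point:]

INCOME_VALUES = ["20-30K", "30-40K", "40-50K", "50-60K"]  # assumed domain

def mutate(element, attr=0, values=INCOME_VALUES):
    """Replace one attribute value with a different randomly chosen one."""
    new = random.choice([v for v in values if v != element[attr]])
    return element[:attr] + (new,) + element[attr + 1:]

e1 = ("20-30K", "No",  "Yes", "Male",   "30-39")
e2 = ("30-40K", "Yes", "No",  "Female", "50-59")
print(crossover(e1, e2))
# (('20-30K', 'No', 'No', 'Female', '50-59'),
#  ('30-40K', 'Yes', 'Yes', 'Male', '30-39'))
```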
(2) Genetic Algorithms and Unsupervised Clustering
[Figure 3.10 Unsupervised genetic clustering: P instances I1, I2, ..., Ip, each described by attributes a1, a2, ..., an, are presented to K candidate solutions S1, S2, ..., SK; each solution Sk holds a set of elements (cluster centers) Ek1, Ek2, ...]
Table 3.11 • A First-Generation Population for Unsupervised Clustering

Initial population:
Solution | Solution Elements      | Fitness Score
S1       | (1.0, 1.0)  (5.0, 5.0) | 11.31
S2       | (3.0, 2.0)  (3.0, 5.0) |  9.78
S3       | (4.0, 3.0)  (5.0, 1.0) | 15.55

Second generation:
Solution | Solution Elements      | Fitness Score
S1       | (5.0, 1.0)  (5.0, 5.0) | 17.96
S2       | (3.0, 2.0)  (3.0, 5.0) |  9.78
S3       | (4.0, 3.0)  (1.0, 1.0) | 11.34

Third generation:
Solution | Solution Elements      | Fitness Score
S1       | (5.0, 5.0)  (1.0, 5.0) | 13.64
S2       | (3.0, 2.0)  (3.0, 5.0) |  9.78
S3       | (4.0, 3.0)  (1.0, 1.0) | 11.34
Fitness Score: the sum of the Euclidean distances from each instance to its closest solution element (lower is better).
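The scores in Table 3.11 can be reproduced with a short sketch, under the interpretation above: for each candidate solution, sum each Table 3.6 instance's Euclidean distance to its nearest solution element. Run as written, it prints the first-generation scores 11.31, 9.78, and 15.55.

```python
from math import dist

# Table 3.6 instances to be clustered
POINTS = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]

def fitness(solution):
    """Sum of each instance's Euclidean distance to its nearest element."""
    return sum(min(dist(p, e) for e in solution) for p in POINTS)

# Initial population from Table 3.11
solutions = {
    "S1": [(1.0, 1.0), (5.0, 5.0)],
    "S2": [(3.0, 2.0), (3.0, 5.0)],
    "S3": [(4.0, 3.0), (5.0, 1.0)],
}
for name, sol in solutions.items():
    print(name, round(fitness(sol), 2))   # S1 11.31, S2 9.78, S3 15.55
```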
General Considerations
• Global optimization is not guaranteed.
• The fitness function determines the computational complexity of the algorithm.
• Genetic algorithms can explain their results, provided the fitness function is understandable.
• Transforming the data into a form suitable for genetic learning can be a challenge.
3.5 Choosing a Data Mining Technique
Initial Considerations
• Is learning supervised or unsupervised?
• Is explanation required?
• What is the interaction between input and output attributes?
• What are the data types of the input and output attributes?
Further Considerations
• Do we know the distribution of the data (e.g., is it normally distributed)?
• Do we know which attributes best define the data? Decision trees and certain statistical approaches determine the most predictive attributes; neural network and clustering approaches treat all attributes as equally important.
• Does the data contain missing values? (An issue for techniques such as neural networks.)
• Is time an issue? Decision trees and production rules can be built faster than neural network or genetic learning models.
• Which technique is most likely to give the best test set accuracy?