Basic Data Mining Techniques
Chapter 3-B
Association Rules
K-Means Algorithm
Genetic Algorithm
3.2 Generating Association Rules
Typical applications: affinity analysis, market basket analysis
IF A
THEN B
Association rules are unlike traditional classification rules: any attribute may appear as a precondition or as a consequent, and a consequent may contain more than one attribute-value pair, e.g.
IF C
THEN A and B
Rule Confidence and Support
Grocery store products: Milk, Cheese, Bread, Eggs
- If a customer purchases milk, they also purchase bread.
- If a customer purchases bread, they also purchase milk.
Number of customer transactions:
Milk: 10,000    Bread: 20,000    Milk & Bread: 5,000
Rule Confidence
Given a rule of the form "If A then B", rule confidence is the conditional probability that B is true when A is known to be true (≈ rule accuracy):
P(B|A)
P(Bread|Milk) = ?
P(Milk|Bread) = ?
Rule Support
The minimum percentage of instances (transactions) in the database that contain all items listed in a given association rule.
Support = (number of transactions containing all items in the rule) ÷ (total number of transactions)
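To make the two definitions concrete, here is a minimal Python sketch computing the confidence and support of the milk/bread rules. The three counts come from the slide; the total number of transactions is not given, so the 100,000 below is an assumed placeholder.

```python
# Transaction counts from the slide
milk = 10_000            # transactions containing milk
bread = 20_000           # transactions containing bread
milk_and_bread = 5_000   # transactions containing both

total = 100_000          # ASSUMED: the slide does not give the total

# Confidence of "IF A THEN B" = P(B|A) = count(A and B) / count(A)
print(milk_and_bread / milk)    # P(Bread|Milk) = 0.50
print(milk_and_bread / bread)   # P(Milk|Bread) = 0.25

# Support = count(all items in the rule) / total transactions
print(milk_and_bread / total)   # 0.05 under the assumed total
```

Note the asymmetry: both candidate rules have the same support, but they differ in confidence, which is why "milk implies bread" is the stronger of the two rules here.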
Mining Association Rules:
An Example
Apriori Algorithm (Agrawal et al. 1993)
Step 1: Generate item sets.
Step 2: Create a set of association rules from the item sets.
Item set: a combination of attribute-value pairs that meets a specified coverage requirement (e.g., covered by at least 4 transactions).
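The following is a minimal sketch of the item set generation step, using an encoding of the Table 3.3 data shown below (attribute names abbreviated; the encoding is mine, not the book's). With the coverage requirement set to 4 it reproduces the single- and two-item sets of Tables 3.4 and 3.5; a full Apriori implementation would continue level by level to larger item sets.

```python
# Table 3.3: a subset of the credit card promotion database
# (mag = magazine promotion, watch = watch promotion,
#  life = life insurance promotion, cc = credit card insurance)
ROWS = [
    {"mag": "Yes", "watch": "No",  "life": "No",  "cc": "No",  "sex": "Male"},
    {"mag": "Yes", "watch": "Yes", "life": "Yes", "cc": "No",  "sex": "Female"},
    {"mag": "No",  "watch": "No",  "life": "No",  "cc": "No",  "sex": "Male"},
    {"mag": "Yes", "watch": "Yes", "life": "Yes", "cc": "Yes", "sex": "Male"},
    {"mag": "Yes", "watch": "No",  "life": "Yes", "cc": "No",  "sex": "Female"},
    {"mag": "No",  "watch": "No",  "life": "No",  "cc": "No",  "sex": "Female"},
    {"mag": "Yes", "watch": "No",  "life": "Yes", "cc": "Yes", "sex": "Male"},
    {"mag": "No",  "watch": "Yes", "life": "No",  "cc": "No",  "sex": "Male"},
    {"mag": "Yes", "watch": "No",  "life": "No",  "cc": "No",  "sex": "Male"},
    {"mag": "Yes", "watch": "Yes", "life": "Yes", "cc": "No",  "sex": "Female"},
]
COVERAGE = 4  # minimum number of covering transactions

def coverage(item_set):
    """Count rows matching every (attribute, value) pair in item_set."""
    return sum(all(row[a] == v for a, v in item_set) for row in ROWS)

# Candidate single items: every attribute-value pair that occurs in the data
items = sorted({(a, v) for row in ROWS for a, v in row.items()})
single = [((i,), coverage((i,))) for i in items]
single = [s for s in single if s[1] >= COVERAGE]          # Table 3.4

# Two-item sets: pair surviving single items on distinct attributes
two = []
for (i1,), _ in single:
    for (i2,), _ in single:
        if i1 < i2 and i1[0] != i2[0]:
            n = coverage((i1, i2))
            if n >= COVERAGE:
                two.append(((i1, i2), n))                 # Table 3.5

for item_set, n in single + two:
    print(" & ".join(f"{a} = {v}" for a, v in item_set), "->", n)
```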
Table 3.3 • A Subset of the Credit Card Promotion Database

Magazine Promotion | Watch Promotion | Life Insurance Promotion | Credit Card Insurance | Sex
Yes                | No              | No                       | No                    | Male
Yes                | Yes             | Yes                      | No                    | Female
No                 | No              | No                       | No                    | Male
Yes                | Yes             | Yes                      | Yes                   | Male
Yes                | No              | Yes                      | No                    | Female
No                 | No              | No                       | No                    | Female
Yes                | No              | Yes                      | Yes                   | Male
No                 | Yes             | No                       | No                    | Male
Yes                | No              | No                       | No                    | Male
Yes                | Yes             | Yes                      | No                    | Female
Table 3.4 • Single-Item Sets

Single-Item Sets                | Number of Items
Magazine Promotion = Yes        | 7
Watch Promotion = Yes           | 4
Watch Promotion = No            | 6
Life Insurance Promotion = Yes  | 5
Life Insurance Promotion = No   | 5
Credit Card Insurance = No      | 8
Sex = Male                      | 6
Sex = Female                    | 4
Candidate two-item sets are built from the single-item sets; those covering fewer than 4 transactions are discarded:

Magazine Promotion = Yes & Watch Promotion = Yes           | 3 (discarded)
Magazine Promotion = Yes & Watch Promotion = No            | 4
Magazine Promotion = Yes & Life Insurance Promotion = Yes  | 5
Table 3.5 • Two-Item Sets

Two-Item Sets                                               | Number of Items
Magazine Promotion = Yes & Watch Promotion = No             | 4
Magazine Promotion = Yes & Life Insurance Promotion = Yes   | 5
Magazine Promotion = Yes & Credit Card Insurance = No       | 5
Magazine Promotion = Yes & Sex = Male                       | 4
Watch Promotion = No & Life Insurance Promotion = No        | 4
Watch Promotion = No & Credit Card Insurance = No           | 5
Watch Promotion = No & Sex = Male                           | 4
Life Insurance Promotion = No & Credit Card Insurance = No  | 5
Life Insurance Promotion = No & Sex = Male                  | 4
Credit Card Insurance = No & Sex = Male                     | 4
Credit Card Insurance = No & Sex = Female                   | 4

Three-Item Sets                                                                    | Number of Items
Watch Promotion = No & Life Insurance Promotion = No & Credit Card Insurance = No  | 4
Two-Item Set Rules
IF Magazine Promotion = Yes
THEN Life Insurance Promotion = Yes (5/7)
IF Life Insurance Promotion = Yes
THEN Magazine Promotion = Yes (5/5)
Three-Item Set Rules
IF Watch Promotion = No & Life Insurance Promotion = No
THEN Credit Card Insurance = No (4/4)
IF Watch Promotion = No
THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6)
IF Credit Card Insurance = No
THEN Watch Promotion = No & Life Insurance Promotion = No (4/8)
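A minimal sketch of the second Apriori step: given an item set that meets the coverage requirement, enumerate every antecedent/consequent split and compute each rule's confidence. The tuple encoding of Table 3.3 is mine; the sketch prints all six candidate rules for the three-item set, three of which are the rules shown above.

```python
from itertools import combinations

# Table 3.3 rows as (magazine, watch, life insurance, credit card, sex)
ATTRS = ("Magazine Promotion", "Watch Promotion",
         "Life Insurance Promotion", "Credit Card Insurance", "Sex")
ROWS = [
    ("Yes", "No",  "No",  "No",  "Male"),
    ("Yes", "Yes", "Yes", "No",  "Female"),
    ("No",  "No",  "No",  "No",  "Male"),
    ("Yes", "Yes", "Yes", "Yes", "Male"),
    ("Yes", "No",  "Yes", "No",  "Female"),
    ("No",  "No",  "No",  "No",  "Female"),
    ("Yes", "No",  "Yes", "Yes", "Male"),
    ("No",  "Yes", "No",  "No",  "Male"),
    ("Yes", "No",  "No",  "No",  "Male"),
    ("Yes", "Yes", "Yes", "No",  "Female"),
]

def count(conditions):
    """Number of rows satisfying every (attribute_index, value) condition."""
    return sum(all(row[i] == v for i, v in conditions) for row in ROWS)

def rules_from(item_set):
    """Split an item set into every antecedent/consequent pair; a rule's
    confidence is count(whole item set) / count(antecedent)."""
    whole = count(item_set)
    for k in range(1, len(item_set)):
        for antecedent in combinations(item_set, k):
            consequent = [c for c in item_set if c not in antecedent]
            yield antecedent, consequent, whole, count(antecedent)

# The three-item set from the slides:
# Watch Promotion = No & Life Insurance Promotion = No & Credit Card Insurance = No
item_set = [(1, "No"), (2, "No"), (3, "No")]
for ante, cons, num, den in rules_from(item_set):
    fmt = lambda cs: " & ".join(f"{ATTRS[i]} = {v}" for i, v in cs)
    print(f"IF {fmt(ante)} THEN {fmt(cons)}  ({num}/{den})")
```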
General Considerations
• We are interested in association rules that show a
lift in product sales where the lift is the result
of the product’s association with one or more
other products.
• We are also interested in association rules that show a
lower than expected confidence for a particular
association.
3.3 The K-Means Algorithm
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3 and 4 until the cluster centers do not change.
An Example Using K-Means
Table 3.6 • K-Means Input Values

Instance | X   | Y
1        | 1.0 | 1.5
2        | 1.0 | 4.5
3        | 2.0 | 1.5
4        | 2.0 | 3.5
5        | 3.0 | 2.5
6        | 5.0 | 6.0
[Figure 3.6 A coordinate mapping of the data in Table 3.6: the six instances plotted as points, x from 0 to 6, f(x) from 0 to 7]
Euclidean distance between (x1, y1) and (x2, y2):
distance = sqrt((x1 - x2)^2 + (y1 - y2)^2)
First iteration: C1 = (1.0, 1.5), C2 = (2.0, 1.5)

Distance(C1 - 1) = 0.00    Distance(C2 - 1) = 1.00
Distance(C1 - 2) = 3.00    Distance(C2 - 2) = 3.16
Distance(C1 - 3) = 1.00    Distance(C2 - 3) = 0.00
Distance(C1 - 4) = 2.24    Distance(C2 - 4) = 2.00
Distance(C1 - 5) = 2.24    Distance(C2 - 5) = 1.41
Distance(C1 - 6) = 6.02    Distance(C2 - 6) = 5.41

C1 gets instances 1, 2:       x = (1.0 + 1.0) / 2 = 1.0    y = (1.5 + 4.5) / 2 = 3.0
C2 gets instances 3, 4, 5, 6: x = (2.0 + 2.0 + 3.0 + 5.0) / 4 = 3.0    y = (1.5 + 3.5 + 2.5 + 6.0) / 4 = 3.375
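Here is a minimal sketch of the full algorithm on the Table 3.6 data (the function and variable names are illustrative, not from the book). Seeded with instances 1 and 3 as above, its first update step reproduces the centers (1.0, 3.0) and (3.0, 3.375), and the run converges to outcome 3 of Table 3.7: centers (1.8, 2.7) and (5.0, 6.0) with squared error 9.60.

```python
from math import dist   # Euclidean distance (Python 3.8+)

# Table 3.6 input values
POINTS = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]

def k_means(points, centers):
    """Repeat the assignment and update steps until the centers stop
    changing. (No handling for empty clusters; none occur on this data.)"""
    while True:
        # Step 3: assign each instance to its closest cluster center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each center as the mean of its cluster
        new_centers = [tuple(sum(v) / len(c) for v in zip(*c)) for c in clusters]
        if new_centers == centers:          # Step 5: converged
            return centers, clusters
        centers = new_centers

# Seed with instances 1 and 3, as in the worked example above
centers, clusters = k_means(POINTS, [(1.0, 1.5), (2.0, 1.5)])
sq_error = sum(dist(p, c) ** 2 for c, cl in zip(centers, clusters) for p in cl)
print(centers, round(sq_error, 2))   # [(1.8, 2.7), (5.0, 6.0)] 9.6
```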
Table 3.7 • Several Applications of the K-Means Algorithm (K = 2)

Outcome | Cluster Centers | Cluster Points | Squared Error
1       | (2.67, 4.67)    | 2, 4, 6        | 14.50
        | (2.00, 1.83)    | 1, 3, 5        |
2       | (1.5, 1.5)      | 1, 3           | 15.94
        | (2.75, 4.125)   | 2, 4, 5, 6     |
3       | (1.8, 2.7)      | 1, 2, 3, 4, 5  | 9.60
        | (5, 6)          | 6              |
[Figure 3.7 A K-Means clustering of the data in Table 3.6 (K = 2)]
General Considerations
• Requires real-valued data.
• We must select the number of clusters present in the data.
• Works best when the clusters in the data are of approximately equal size.
• Attribute significance cannot be determined.
• Lacks explanation capabilities.
3.4 Genetic Learning
Algorithm:
1. Initialize a population P of n elements, called chromosomes; each is a potential solution.
2. Until a specified termination condition is satisfied:
   a. If an element passes the fitness function, it remains in P; the population P now holds m elements (m < n).
   b. Use the genetic operators to create (n - m) new elements, so that P again holds m + (n - m) = n elements.
Genetic Learning Operators
• Crossover
• Mutation
• Selection
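Before the supervised example, here is a minimal sketch of the loop described above; `fitness`, `crossover`, and `mutate` are placeholder callables (illustrative names, not from the slides) to be supplied per application. In the supervised example that follows, a threshold of 1.0 would separate the surviving elements 3 and 4 (scores 1.20 and 1.00) from elements 1 and 2 (0.80 and 0.86), which are replaced through crossover.

```python
import random

def genetic_search(population, fitness, crossover, mutate,
                   threshold, max_generations=100):
    """Keep elements whose fitness passes the threshold, then refill the
    population with crossover/mutation products until it has n elements."""
    n = len(population)
    for _ in range(max_generations):               # termination condition
        survivors = [e for e in population if fitness(e) >= threshold]
        if len(survivors) == n:                    # every element passes
            return survivors
        if len(survivors) < 2:                     # keep a workable gene pool
            survivors = list(population)
        children = []
        while len(survivors) + len(children) < n:  # create the (n - m) new elements
            a, b = random.sample(survivors, 2)     # selection
            for child in crossover(a, b):          # crossover yields two children
                if len(survivors) + len(children) < n:
                    children.append(mutate(child))
        population = survivors + children
    return population
```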
(1) Genetic Algorithms and Supervised Learning
[Figure 3.8 Supervised genetic learning: population elements are evaluated by a fitness function against the training data; elements that pass are kept in the population, while those thrown out are replaced by candidates produced through crossover and mutation]
Example: the Credit Card Promotion database
Goal: to create a model able to differentiate individuals who have accepted the life insurance promotion from those who have not.
Table 3.8 • An Initial Population for Supervised Genetic Learning

Population Element | Income Range | Life Insurance Promotion | Credit Card Insurance | Sex    | Age
1                  | 20–30K       | No                       | Yes                   | Male   | 30–39
2                  | 30–40K       | Yes                      | No                    | Female | 50–59
3                  | ?            | No                       | No                    | Male   | 40–49
4                  | 30–40K       | Yes                      | Yes                   | Male   | 40–49
Table 3.9 • Training Data for Genetic Learning

Training Instance | Income Range | Life Insurance Promotion | Credit Card Insurance | Sex    | Age
1                 | 30–40K       | Yes                      | Yes                   | Male   | 30–39
2                 | 30–40K       | Yes                      | No                    | Female | 40–49
3                 | 50–60K       | Yes                      | No                    | Female | 30–39
4                 | 20–30K       | No                       | No                    | Female | 50–59
5                 | 20–30K       | No                       | No                    | Male   | 20–29
6                 | 30–40K       | No                       | No                    | Male   | 40–49
Fitness Function: compare each population element to the training instances and calculate a fitness score.

For a single population element E (its class is its life insurance promotion value):
1. N = the number of matches of the input attribute values of E with training instances from its own class.
2. M = the number of matches of the input attribute values of E with training instances from the competing class.
3. M = M + 1
4. Fitness Score = N / M

F(1) = 4 / 5 = 0.80
F(2) = 6 / 7 = 0.86
F(3) = 6 / 5 = 1.20
F(4) = 5 / 5 = 1.00
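A minimal sketch of this fitness computation on the data of Tables 3.8 and 3.9. Two assumptions are baked in, both consistent with the worked numbers: life insurance promotion is the class attribute, and an unspecified value ("?") matches nothing. Under those assumptions the sketch reproduces F(1) through F(4) above.

```python
# Attribute order: income range, life insurance promotion (the class),
# credit card insurance, sex, age
POPULATION = [                                  # Table 3.8
    ("20-30K", "No",  "Yes", "Male",   "30-39"),
    ("30-40K", "Yes", "No",  "Female", "50-59"),
    ("?",      "No",  "No",  "Male",   "40-49"),
    ("30-40K", "Yes", "Yes", "Male",   "40-49"),
]
TRAINING = [                                    # Table 3.9
    ("30-40K", "Yes", "Yes", "Male",   "30-39"),
    ("30-40K", "Yes", "No",  "Female", "40-49"),
    ("50-60K", "Yes", "No",  "Female", "30-39"),
    ("20-30K", "No",  "No",  "Female", "50-59"),
    ("20-30K", "No",  "No",  "Male",   "20-29"),
    ("30-40K", "No",  "No",  "Male",   "40-49"),
]
CLASS = 1  # index of life insurance promotion, the output attribute

def matches(element, instance):
    """Count input-attribute agreements; '?' (unspecified) never matches."""
    return sum(e == t and e != "?"
               for i, (e, t) in enumerate(zip(element, instance)) if i != CLASS)

def fitness(element):
    own  = sum(matches(element, t) for t in TRAINING if t[CLASS] == element[CLASS])
    comp = sum(matches(element, t) for t in TRAINING if t[CLASS] != element[CLASS])
    return own / (comp + 1)

for k, e in enumerate(POPULATION, start=1):
    print(f"F({k}) = {fitness(e):.2f}")   # 0.80, 0.86, 1.20, 1.00
```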
Figure 3.9 A crossover operation

Before crossover:

Population Element | Income Range | Life Insurance Promotion | Credit Card Insurance | Sex    | Age
#1                 | 20–30K       | No                       | Yes                   | Male   | 30–39
#2                 | 30–40K       | Yes                      | No                    | Female | 50–59

After crossover (the values following the second attribute are swapped):

Population Element | Income Range | Life Insurance Promotion | Credit Card Insurance | Sex    | Age
#1                 | 20–30K       | No                       | No                    | Female | 50–59
#2                 | 30–40K       | Yes                      | Yes                   | Male   | 30–39
Table 3.10 • A Second-Generation Population (Crossover Operation)

Population Element | Income Range | Life Insurance Promotion | Credit Card Insurance | Sex    | Age
1                  | 20–30K       | No                       | No                    | Female | 50–59
2                  | 30–40K       | Yes                      | Yes                   | Male   | 30–39
3                  | ?            | No                       | No                    | Male   | 40–49
4                  | 30–40K       | Yes                      | Yes                   | Male   | 40–49
F(1) = 7 / 5 = 1.40
F(2) = 6 / 4 = 1.50
Example of a mutation operation: a single attribute value is randomly changed, e.g. Income Range 30–40K → 50–60K.
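A minimal sketch of the two operators on the tuple encoding used in the fitness sketch above. The fixed crossover point after the second attribute mirrors Figure 3.9; a real implementation would usually pick the crossover point and the mutated value at random. INCOME_VALUES is an assumed domain for the income attribute.

```python
import random

def crossover(a, b, point=2):
    """Swap everything after the crossover point, as in Figure 3.9."""
    return a[:point] + b[point:], b[:point] + a[point:]

INCOME_VALUES = ["20-30K", "30-40K", "40-50K", "50-60K"]  # assumed domain

def mutate(element, attr=0, values=INCOME_VALUES):
    """Replace one attribute value with a different randomly chosen one."""
    new = random.choice([v for v in values if v != element[attr]])
    return element[:attr] + (new,) + element[attr + 1:]

e1 = ("20-30K", "No",  "Yes", "Male",   "30-39")
e2 = ("30-40K", "Yes", "No",  "Female", "50-59")
print(crossover(e1, e2))
# (('20-30K', 'No', 'No', 'Female', '50-59'),
#  ('30-40K', 'Yes', 'Yes', 'Male', '30-39'))
```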
(2) Genetic Algorithms and Unsupervised Clustering
[Figure 3.10 Unsupervised genetic clustering: P instances I1, I2, ..., Ip, each described by attributes a1, a2, ..., an, are presented to K candidate solutions S1, S2, ..., SK; each solution Sk holds a set of elements (cluster centers) Ek1, Ek2, ...]
Table 3.11 • A First-Generation Population for Unsupervised Clustering

Initial population:
Solution | Solution Elements      | Fitness Score
S1       | (1.0, 1.0)  (5.0, 5.0) | 11.31
S2       | (3.0, 2.0)  (3.0, 5.0) |  9.78
S3       | (4.0, 3.0)  (5.0, 1.0) | 15.55

Second generation:
Solution | Solution Elements      | Fitness Score
S1       | (5.0, 1.0)  (5.0, 5.0) | 17.96
S2       | (3.0, 2.0)  (3.0, 5.0) |  9.78
S3       | (4.0, 3.0)  (1.0, 1.0) | 11.34

Third generation:
Solution | Solution Elements      | Fitness Score
S1       | (5.0, 5.0)  (1.0, 5.0) | 13.64
S2       | (3.0, 2.0)  (3.0, 5.0) |  9.78
S3       | (4.0, 3.0)  (1.0, 1.0) | 11.34
Fitness Score: the sum of the Euclidean distances from each instance to its closest solution element (lower is better).
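The scores in Table 3.11 can be reproduced with a short sketch, under the interpretation above: for each candidate solution, sum each Table 3.6 instance's Euclidean distance to its nearest solution element. Run as written, it prints the first-generation scores 11.31, 9.78, and 15.55.

```python
from math import dist

# Table 3.6 instances to be clustered
POINTS = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]

def fitness(solution):
    """Sum of each instance's Euclidean distance to its nearest element."""
    return sum(min(dist(p, e) for e in solution) for p in POINTS)

# Initial population from Table 3.11
solutions = {
    "S1": [(1.0, 1.0), (5.0, 5.0)],
    "S2": [(3.0, 2.0), (3.0, 5.0)],
    "S3": [(4.0, 3.0), (5.0, 1.0)],
}
for name, sol in solutions.items():
    print(name, round(fitness(sol), 2))   # S1 11.31, S2 9.78, S3 15.55
```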
General Considerations
• Global optimization is not guaranteed.
• The fitness function determines the computational complexity of the algorithm.
• Genetic algorithms can explain their results, provided the fitness function is understandable.
• Transforming the data into a form suitable for genetic learning can be a challenge.
3.5 Choosing a Data Mining Technique
Initial Considerations
• Is learning supervised or unsupervised?
• Is explanation required?
• What is the interaction between input and output attributes?
• What are the data types of the input and output attributes?
Further Considerations
• Do we know the distribution of the data (e.g., is it normally distributed)?
• Do we know which attributes best define the data? Decision trees and certain statistical approaches determine the most predictive attributes; neural network and clustering approaches treat all attributes as equally important.
• Does the data contain missing values? (An issue for techniques such as neural networks.)
• Is time an issue? Decision trees and production rules can be built faster than neural network or genetic learning models.
• Which technique is most likely to give the best test set accuracy?