Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
國立雲林科技大學 National Yunlin University of Science and Technology Applying Data Mining Technique to Direct Marketing Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department of Information Management National Yunlin University of Science and Technology Intelligent Database Systems Lab N.Y.U.S.T. I. M. Outline Motivation Objective Introduction Background The Generalized SOM Experiments Conclusions 2 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Motivation Firms with the huge amount of complex marketing data on hand, need to further analysis and expect to make more profits. Clustering, a technique of data mining, is especially suitable for segmenting data. However, firm’s database usually consist of mixed data (numeric and categorical data). 3 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Objective We utilize a new visualized clustering algorithm, the generalized self-organizing map (GSOM), to segment customer data for direct marketing. ─ Unlike conventional SOM, the GSOM can reasonably express the relatively distance of categorical values. Then, we apply GSOM to direct marketing would generate more profits. 4 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Introduction (1/5) Marketing practices have shifted to customeroriented from traditional mass marketing. Firms usually perform market segmentation and devise different marketing strategies for different segments. 5 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Introduction (2/5) Data mining means a process of nontrivial extraction of implicit, previously unknown and potentially useful information from a huge amount of data. Cluster analysis can assist marketers in identifying clusters of customers with similar characteristics. 6 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Introduction (3/5) The self-organizing map (SOM) network, proposed by Kohonen, is an useful visualized tool in data mining ─ ─ Dimensionality reduction & Information visualization Preserve the original topological relationship 7 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Introduction (4/5) The approach of the SOM in handling categorical data ─ It uses binary encoding that transforms categorical values to a set of binary values. 8 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Introduction (5/5) In this paper, we propose an extended SOM, named generalized SOM (GSOM), to overcome the drawback in handling categorical data ─ We construct the concept hierarchies for each categorical attributes. 9 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Background (1/2) Self-organizing map, SOM ─ ─ Find the winner (BMU) by (1) Update the winner and neighborhood by (2) v arg min || x(t ) wi (t ) || , i {1,..., M } (1) i wi (t 1) wi (t ) (t ) hvi (t ) [ x(t ) wi (t )] (2) || rv ri || 2 (3) hvi (t ) exp 2 2 (t ) 10 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Background (2/2) Problems of the conventional SOM D(Coke, Pepsi) = D(Coke, Mocca) = D(Pepsi, Mocca) 11 Intelligent Database Systems Lab N.Y.U.S.T. I. M. The Generalized SOM We use concept hierarchies to help calculate the distances of categorical values ─ ─ An input pattern and the GSOM vector are mapped to their associated concept hierarchies. The distance between the input pattern and the GSOM vector is calculated by measuring the aggregated distance of mapping points in the hierarchies. ID Drink 1 Coke 2 Pepsi 3 Mocca Input pattern Any Juice Coffee Carbonated mq Orange Apple Latte Mocca Coke Pepsi 12 x SOM network Intelligent Database Systems Lab N.Y.U.S.T. I. M. Concept hierarchies (1/3) General concepts 0 1 1 2 1 1 1 1 1 1 1 1 Specific concepts D(Coke, Pepsi) < D(Coke, Mocca) = D(Pepsi, Mocca) 13 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Concept hierarchies (2/3) ID Drink 1 Coke 2 Pepsi 3 Mocca Input pattern Any Juice Coffee mq=(Pepsi, 1.7) Carbonated mq Orange Apple Latte Mocca Coke Pepsi x SOM network A point X=(NX, dX) NX: an anchor (leaf node) of point X dX: a positive offset (distance) from X to root Example: x=(Coke, 2.0); mq=(Pepsi, 1.7) 14 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Concept hierarchies (3/3) Any 0 1 2 Juice Coffee duplication Carbonated mq red blue dx dmq Orange Apple Latte Mocca Coke Pepsi x | X Y | d X dY 2 d LA (4) d LA min( d X , dY , d LCA ( N X , NY ) ) (5) Example: x=(Coke, 2.0); mq=(Pepsi, 1.7) |x – mq | = 2 + 1.7 – 2×1 = 1.7 15 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experiments Experiment dataset ─ ─ Synthetic dataset consists of 6 groups of two categorical attributes, Department and Drink. Real dataset Adult from the UCI repository With 48,842 patterns of 15 attributes. 8 categorical attributes, 6 numerical attributes, and 1 class attribute Salary. 76% of the patterns have the value of ≤50K. 16 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experiments Parameters were set according to the suggestion in the software package SOM_PAK. ─ ─ Categorical values are transformed to binary values when we train the SOM. While mixed data are used directly when we train the GSOM. Each link weight of concept hierarchies is set to 1. 17 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Synthetic dataset (1/2) Group No Department Drink Data Count Common Pattern 1 EE Coke 20 2 CE Pepsi 10 Engineering College & Carbonated Drinks 3 MIS Latte 20 4 MBA Mocca 10 5 VC Orange 20 6 SD Apple 10 Management College & Coffee Drinks Design College & Juice Drinks Any Any Engineering EE Management CE MIS Juice Design MBA VC SD Department Orange Coffee Apple Latte Carbonated Mocca Coke Pepsi Drink 18 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Synthetic dataset (2/2) ─ An 8×8 SOM network is used for the training. After 900 training iterations, the trained maps of SOM and GSOM under the same parameters are shown in below. Binary SOM GSOM 19 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Real dataset (1/3) We randomly draw 10,000 patterns which have 75.76% of ≤50K, similar to the Salary distribution of the original Adult dataset ─ ─ Three categorical attributes, Marital-status, Relationship, and Education. Four numeric attributes, Capital-gain, Capital-loss, Age, and Hours-per-week. 20 Intelligent Database Systems Lab Concept hierarchies for the categorical attributes are constructed as shown in below. ─ Advanced College HighSchool Junior Real dataset (2/3) 21 Intelligent Database Systems Lab Doctorate Prof-school Masters Bachelors Assoc-acdm Assoc-voc Some-college HS-grad 12th 11th 10th 9th 7th-8th 5th-6th 1st-4th Preschool Married-civ-spouse Married-AF-spouse Married-spouse-absent Widowed Divorced Separated Never-married Husband Wife Own-child Other-relative Not-in-family Unmarried Little Education Marital-status Relationship Couple Single ANY ANY ANY N.Y.U.S.T. I. M. N.Y.U.S.T. I. M. Real dataset (3/3) ─ A 15×15 SOM network is used for the training. After 50,000 iterations, the trained maps of SOM and GSOM under the same parameters are shown in below. Binary SOM GSOM 22 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Distributions of Salary attribute in each cluster Group No 7 6 3 1 2 4 5 All Data Count 2,626 1,950 1,753 1,045 1,133 740 734 9,981 No. of >50K 1,287 794 216 59 48 12 8 2,424 No. of ≤50K 1,339 1,156 1,537 986 1,085 728 726 7,557 Ratio of >50K 49.01% 40.72% 12.32% 5.65% 4.24% 1.62% 1.09% 24.29% Ratio of ≤50K 50.99% 59.28% 87.68% 94.35% 95.76% 98.38% 98.91% 75.71% 23 Intelligent Database Systems Lab Application to Direct Marketing (1/2) After we utilize the GSOM to perform data clustering, this segmented dataset can be further applied to catalog marketing. Suppose that ─ ─ ─ The cost of mailing a catalog is $2. The customers whose salaries are over 50K, we make an average profit of $10 per person. Otherwise, we make an average profit of $1 per person. 24 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Application to Direct Marketing (2/2) Expected N(>50K)×10+N(≤50K)×1-[N(>50K)+N(≤50K)]×2 profits 17,500 $14,344 15,000 12,500 10,000 7,500 $7,505 5,000 Trained 2,500 Random 0 0 2,626 4,576 6,329 7,374 8,507 Number of patterns 9,247 9,981 25 Intelligent Database Systems Lab N.Y.U.S.T. I. M. N.Y.U.S.T. I. M. Conclusions In this paper, we propose a data clustering method ─ ─ ─ The GSOM extends the conventional SOM and overcomes its drawback in handling categorical data by utilizing concept hierarchies. The experimental results confirmed that the GSOM can better reveal the cluster structure of data than the conventional SOM does. We can make more profits by the marketing based on the segmentation results of the GSOM than by the marketing to the customers randomly drawn from the customer database. 26 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Q &A 27 Intelligent Database Systems Lab