Data Mining
(Chapter 34 in the textbook + Chapter 4 in Data Mining by P. Adriaans and D. Zantinge)

Data Mining: the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions. It involves the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data.

Examples:
- A customer with an income between 10,000 and 20,000 and an age between 20 and 25 who purchases milk and bread is likely to purchase diapers within 5 years.
- The amount of fish sold to people living in a certain area with an income between 20,000 and 35,000 is increasing.

- The most accurate and reliable results require large volumes of data.
- Data mining can provide huge paybacks for companies that have made a significant investment in a data warehouse (DW).
- It is a relatively new technology, but it is already used in many industries.

Examples of applications:
- Retail / marketing: identifying buying patterns of customers; predicting responses to mailing campaigns.
- Banking: detecting patterns of credit card fraud; identifying loyal customers.
- Insurance: claims analysis; predicting which customers will buy new policies.
- Medicine: characterizing patient behaviour to predict surgery visits; identifying successful medical therapies.

Data Mining and DW
- Challenge: identifying suitable data to mine. Data mining requires a single, separate, clean, integrated, and self-consistent source of data, and a DW is well equipped to provide one.
- Data quality and consistency are essential to ensure the accuracy of the predictive models; DWs are populated with clean, consistent data.
- It is advantageous to mine data from multiple sources to discover as many interrelationships as possible; DWs contain data from a number of sources.
- Selecting relevant subsets of records and fields for data mining requires the query capabilities of the DW.
- The results of a data mining study are useful only if the uncovered patterns can be investigated further; DWs provide the capability to go back to the data source.

The Knowledge Discovery Process
Six stages:
1. Data selection
2. Cleaning
3. Enrichment
4. Coding
5. Data mining
6. Reporting

[Figure: The KDD process]

1. Data Selection
We will illustrate the process using a magazine publisher's operational data, selecting data about people who subscribed to magazines. A copy of this operational data is made.

[Figure: Original selected data]

2. Cleaning
Some problems are detected before starting; others are detected during the coding or discovery stages. Elements of cleaning:
1. De-duplication.
2. Lack of domain consistency.

2.1 De-duplication
- Some clients are represented by several records; this is very common.
- Reasons:
  - Negligence: typing errors.
  - Client data changed without notifying the company, e.g. moving to a new address.
  - Deliberately giving wrong information, e.g. misspelling a name to avoid rejection.
- Solution: pattern analysis algorithms.

[Figure: Data before de-duplication]
[Figure: Data after de-duplication]

2.2 Lack of Domain Consistency
- Hard to trace, and it greatly influences the DM results.
- Solution: replace inconsistent values with NULL or with the correct values.

[Figure: Data before correcting lack of domain consistency]
[Figure: Data after correcting lack of domain consistency]

3. Enrichment
A company can purchase extra information about its clients.

4. Coding
1. Add the purchased data to the database.
2. Select records with enough information to be of value. E.g., we could not get extra information on client King, so we choose to remove him from the data.
3. Keep important columns only. E.g., we are not interested in clients' names, so we remove that column from the data.
4. Code the information. Coding changes the data in columns to ranges and enumerations, because the raw values are too detailed for pattern recognition algorithms. E.g., if we used the date of birth, the algorithm would put only people with the same date of birth in the same category; an age group works better.
5. Flattening: an n-cardinality attribute is replaced by n binary attributes.

Some examples of coding (sketched in code after this list):
1. Address → region.
2. Birth date → age.
3. Divide income by 1,000.
4. Divide credit by 1,000.
5. yes/no fields → 1/0 fields.
6. Purchase date → month number.
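As a minimal sketch of the coding step, the snippet below applies these rules to a couple of records. The record layout and field names (dob, income, credit, owns_car, magazine) and the reference year are made up for illustration; this is not the textbook's implementation.

```python
from datetime import date

REFERENCE_YEAR = 2000  # assumed "current" year for the age computation

# Hypothetical client records after cleaning and enrichment.
clients = [
    {"name": "Smith", "dob": date(1972, 5, 1), "income": 32000,
     "credit": 15000, "owns_car": "yes", "magazine": "car"},
    {"name": "Jones", "dob": date(1965, 9, 3), "income": 28000,
     "credit": 9000, "owns_car": "no", "magazine": "music"},
]

MAGAZINES = ["car", "music", "house", "comic"]

def code_record(rec):
    """Apply the coding rules from the notes to one record."""
    coded = {
        # birth date -> age (less detailed than a full date of birth)
        "age": REFERENCE_YEAR - rec["dob"].year,
        # divide income and credit by 1,000
        "income": rec["income"] / 1000,
        "credit": rec["credit"] / 1000,
        # yes/no field -> 1/0 field
        "owns_car": 1 if rec["owns_car"] == "yes" else 0,
    }
    # Flattening: replace the n-cardinality 'magazine' attribute
    # by n binary attributes, one per magazine.
    for mag in MAGAZINES:
        coded[f"mag_{mag}"] = 1 if rec["magazine"] == mag else 0
    # The 'name' column is simply dropped, as in the example above.
    return coded

print([code_record(r) for r in clients])
```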
[Figure: Data before removing insufficient records and columns]
[Figure: Data after removing insufficient records and columns, before coding]
[Figure: Data after coding, before flattening]
[Figure: Data after flattening]

5. Data Mining
Now that we have cleaned and prepared the data, we perform the actual discovery (DM). Techniques:
1. Query tools and statistical techniques.
2. Visualization.
3. Online analytical processing (OLAP).
4. Case-based learning (k-nearest neighbor).
5. Decision trees.
6. Association rules.
7. Neural networks.
8. Genetic algorithms.

5.1 Query Tools and Statistical Techniques
- Perform a preliminary analysis of the data; this should be done before any complex DM step.
- Uses simple SQL queries, so it finds no hidden patterns, but it discovers about 80% of the interesting information to be extracted; the remaining 20% is discovered by the more complex techniques.

[Figure: Data averages]
[Figure: Age distributions of sports magazine readers]

5.2 Visualization Techniques
- Useful at the beginning of DM: gives a feeling for where patterns may be hidden.
- Example: a scatter diagram, the projection of 2 attributes onto a Cartesian plane.
- Better example: 3D interactive diagrams, the projection of 3 attributes.

[Figure: Scatter diagram]
[Figure: 3D interactive diagram]

- The importance of visualizing points in multi-dimensional space lies in detecting likelihood and distance:
  - If the distance between 2 points is small, the records representing them are similar, and it is likely that they will behave in the same manner.
  - If the distance between 2 points is large, the records representing them have little in common.
- Example: age, credit, and income are 3 attributes/dimensions in our space. First, normalize them so they have the same effect: age runs over 1–100 while income and credit run over 0–100,000, so divide income and credit by 1,000.
- The Euclidean distance is used (see the sketch after Section 5.3):
  d = √((x₁ − x₂)² + (y₁ − y₂)² + (z₁ − z₂)²)
- A further benefit of viewing records as points in multi-dimensional space is finding clusters: groups of similar records that are likely to behave in the same manner and can be targeted for marketing campaigns.
- With low dimensionality it is easy to detect clusters; with higher dimensionality, special programs are needed to detect them.

[Figure: Finding clusters]

5.3 OLAP Tools
- OLAP: OnLine Analytical Processing. It expands the idea of dimensionality: a table with n attributes = a space with n dimensions.
- Managers usually ask multi-dimensional questions, which are not easy to answer in traditional databases: multi-dimensional relationships require multiple keys, while traditional DBs have one key per record.
- OLAP is useful for multi-dimensional queries; it stores data in a special multi-dimensional format kept in memory.
- DM vs. OLAP:
  - OLAP doesn't learn, so it is less powerful than DM.
  - OLAP gives you multi-dimensional knowledge, NOT new knowledge.
  - OLAP needs data in a special format, unlike DM.
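The distance calculation from Section 5.2 is small enough to show in full. The sketch below normalizes income and credit by 1,000 and computes the Euclidean distance between two records; the record layout is a hypothetical stand-in, not the course's data set.

```python
import math

def normalize(rec):
    """Bring age, income, and credit onto comparable scales:
    income and credit (0-100,000) are divided by 1,000 so they
    span roughly the same range as age (1-100)."""
    return (rec["age"], rec["income"] / 1000, rec["credit"] / 1000)

def distance(rec_a, rec_b):
    """Euclidean distance between two records in (age, income, credit) space."""
    a, b = normalize(rec_a), normalize(rec_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

r1 = {"age": 25, "income": 22000, "credit": 18000}
r2 = {"age": 27, "income": 25000, "credit": 15000}
print(distance(r1, r2))  # small distance -> similar records, likely same cluster
```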
5.4 k-Nearest Neighbor
- When records are points in the data space, records close to each other lie in the same neighborhood.
- Useful in prediction: records in the same neighborhood behave similarly, so if you know how some of them behave, you can assume that the rest will behave in the same way ("do as your neighbors do").
- To predict an individual's behavior (sketched in code at the end of these notes):
  1. Get the closest k neighbors by applying the k-nearest neighbor algorithm.
  2. See how they behave and average their behavior.
  3. Your target is likely to behave in the same way.
- It is a search algorithm, NOT a learning algorithm, and it is not efficient with large data sets.

[Figure: Predictions with k-nearest neighbor]

5.5 Decision Trees
- Useful in classification and prediction: they put records into classes, and we predict the behavior of an individual by observing the behavior of the individuals in his/her class.
- Advantages: good with large data sets; intuitive and simple, simulating how humans make decisions.
- Steps:
  1. Choose the most effective attribute. E.g., age could be the most effective in determining who would buy a car magazine.
  2. Split its range into two based on sales.
  3. Go on to the next attribute (or the same attribute again).
  4. Repeat from step 2 until we run out of attributes.

[Figure: Decision trees for the car magazine. First tree: Age > 44.5 → 99%; Age ≤ 44.5 → 38%. Four-level tree: the Age > 44.5 branch splits again at 48.5 (> 48.5 → 100%, ≤ 48.5 → 92%); the Age ≤ 44.5 branch splits on Income at 34.5 (> 34.5 → 100%; ≤ 34.5 splits at Age 31.5 → 46% / 0%).]

5.6 Association Rules
- Marketing managers like rules such as "90% of women with red sports cars and small dogs wear Chanel No. 5": customer profiles for marketing campaigns.
- A relationship between attributes is an association rule. Association rules work on binary attributes, so flattening tables is important.
- Algorithms for finding associations may find both good and bad associations, so we need to introduce some measures of accuracy to get rid of the bad (useless) ones.
- Example association rule: MUSIC_MAG, HOUSE_MAG => CAR_MAG, i.e. somebody who reads music and house magazines is very likely to read a car magazine.
- An interesting association rule is one that occurs in the DB with a high percentage, i.e. high support: records with music, house, and car form a big percentage of the total records in the DB.
- However, we may have lots of records with music and house but not car: high support, but not a good rule. We need another measure: confidence, the percentage of records with music-house-car relative to the records with music-house (both measures are computed in the sketch at the end of these notes).

[Figure: Binary associations between magazines]

5.7 Neural Networks
- Modeled after the human brain.
- Input nodes receive input signals; output nodes produce output signals; intermediate nodes connect input and output and are organized into layers (an unlimited number of them).
- Two phases:
  - Encoding: the NN is trained to perform a task.
  - Decoding: the NN classifies examples or makes predictions.

[Figure: Example NN: learning]
[Figure: Example NN: classifying]

5.8 Genetic Algorithms
- Based on evolution theory, Darwin's theories, and the structure of DNA.
- A genetic algorithm:
  1. Encode the problem into strings over a limited alphabet, like DNA's building blocks (a 4-letter alphabet).
  2. Invent an artificial environment and a measure of success/failure (a fitness function): survival of the fittest.
  3. Combine solutions to produce new ones from the combined material, like DNA inherited from mother and father.
  4. Provide an initial population and start generating solutions from it: remove the bad solutions from each generation and combine the good ones to produce the next generation, until you reach a family of successful solutions: evolution.

[Figure: Example genetic algorithm]
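Two of the techniques above reduce to very small computations. The sketch below shows a k-nearest-neighbor prediction that averages the behavior of the k closest records (Section 5.4), and the support and confidence measures for the rule MUSIC_MAG, HOUSE_MAG => CAR_MAG over a flattened binary table (Section 5.6). The field names and the tiny data sets are invented for illustration; treat this as a minimal sketch, not a production implementation.

```python
import math

def knn_predict(target, records, k=3):
    """Predict the target's behavior as the average behavior of
    its k nearest neighbors (Euclidean distance over coded fields)."""
    def dist(a, b):
        return math.sqrt(sum((a[f] - b[f]) ** 2
                             for f in ("age", "income", "credit")))
    neighbors = sorted(records, key=lambda r: dist(target, r))[:k]
    return sum(r["buys_car_mag"] for r in neighbors) / len(neighbors)

def support_and_confidence(rows):
    """Support and confidence of MUSIC_MAG, HOUSE_MAG => CAR_MAG
    over a flattened table with 0/1 magazine columns."""
    music_house = [r for r in rows if r["mag_music"] and r["mag_house"]]
    music_house_car = [r for r in music_house if r["mag_car"]]
    support = len(music_house_car) / len(rows)            # frequency in the whole DB
    confidence = len(music_house_car) / len(music_house)  # rule accuracy
    return support, confidence

# Made-up purchase history (income/credit already divided by 1,000).
history = [
    {"age": 50, "income": 40, "credit": 20, "buys_car_mag": 1},
    {"age": 48, "income": 35, "credit": 18, "buys_car_mag": 1},
    {"age": 25, "income": 15, "credit": 5,  "buys_car_mag": 0},
]
print(knn_predict({"age": 47, "income": 38, "credit": 19}, history, k=2))

# Made-up flattened magazine table.
table = [
    {"mag_music": 1, "mag_house": 1, "mag_car": 1},
    {"mag_music": 1, "mag_house": 1, "mag_car": 0},
    {"mag_music": 0, "mag_house": 1, "mag_car": 1},
    {"mag_music": 1, "mag_house": 0, "mag_car": 0},
]
print(support_and_confidence(table))  # support 0.25, confidence 0.5
```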