Data Mining & Business Intelligence

Outline
- Data Mining and KDD
- Why Data Mining?
- Applications of Data Mining
- Data Preprocessing
- Data Mining techniques
- Visualization of the results
- Summary

Data Mining and KDD

Looking for knowledge
- The explosive growth of data:
  • The World Wide Web
  • Business: e-commerce, transactions, stocks, ...
  • Science: remote sensing, bioinformatics, scientific simulation
  • Society and everyone: news, digital cameras, YouTube, forums, blogs, Google & Co.
- We are drowning in data, but starving for knowledge!
- Avoid data tombs
- "Necessity is the mother of invention": data mining, the automated analysis of massive data sets

What is Data Mining?
- Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
- Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Knowledge Discovery (KDD) Process
- Data sources → data cleaning and data integration → data warehouse → selection → task-relevant data → data mining → pattern evaluation

Data Mining and Business Intelligence
- Layers, from the largest quantity of data (bottom) to the greatest potential to support business decisions (top):
  • Decision making (end user)
  • Data presentation, visualization techniques (business analyst)
  • Data mining, information discovery (data analyst)
  • Data exploration: statistical summary, querying, and reporting (data analyst)
  • Data preprocessing/integration, data warehouses (DBA)
  • Data sources: paper, files, Web documents, scientific experiments, database systems

Data Mining: confluence of multiple disciplines
- Database technology, machine learning, pattern recognition, statistics, algorithms, visualization, and other disciplines

Why Data Mining?

Why is Data Mining so complex? A matter of data dimensions
- Tremendous amount of data
  • Walmart: customer buying patterns, a data warehouse 7.5 terabytes large in 1995
  • VISA: detecting credit card interoperability issues, 6,800 payment transactions per second
- High dimensionality of data
  • Many dimensions to be combined together; data cube example: time, location, product sales
- High complexity of data
  • Time-series data, temporal data, sequence data
  • Structured data, graphs, social networks and multi-linked data
  • Spatial, spatiotemporal, multimedia, text and Web data

What does Data Mining provide me with? (1)
- Multidimensional concept description: characterization and discrimination
  • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
  • Characterization describes objects in the same class; discrimination describes how to separate different classes
- Frequent patterns, association, correlation vs. causality
  • Wine → Spaghetti [0.3% of all baskets, 75% of the cases when tomato sauce is bought]
  • Is this a correlation or not?

What does Data Mining provide me with? (2)
- Classification and prediction
  • Construct models (functions) that describe and distinguish classes or concepts for future prediction; e.g., classify countries based on climate, or classify cars based on gas mileage
  • Predict some unknown or missing numerical values
- Cluster analysis
  • Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  • Maximize intra-class similarity and minimize inter-class similarity
What does Data Mining provide me with? (3)
- Outlier analysis
  • Outlier: a data object that does not comply with the general behavior of the data
  • Fraud detection is the main application area
  • Noise or exception?
- Trend and evolution analysis
  • Trend and deviation: e.g., regression analysis
  • Sequential pattern mining: e.g., digital camera → large SD memory card
  • Periodicity analysis
  • Similarity-based analysis

Applications of Data Mining: Market Analysis and Management (1)
- Data sources: credit card transactions, loyalty cards, smart cards, discount coupons, ...
- Target marketing: find clusters of "model" customers who share the same characteristics:
  • Geographics (lives in Rome, lives in Trentino)
  • Demographics (married, between 21 and 35, at least one child, family income above 40.000 €/year)
  • Psychographics (likes new products, consistently uses the Web)
  • Behaviors (searches for information on the Internet, always defends her decisions)
- Determine customer purchasing patterns over time

Applications of Data Mining: Market Analysis and Management (2)
- Cross-market analysis
  • Find associations between product sales, and predict based on such associations
  • Compare the sales in the US and in Italy, find associations in old products and predict whether new ones will be successful
- Customer profiling
  • What types of customers buy what products
  • E.g., customers aged between 20 and 30 with income > 20K€ will buy product A
- Customer requirement analysis
  • Identify the best products for different groups of customers
  • Predict what factors will attract new customers

Applications of Data Mining: Corporate Analysis
- Finance planning and asset evaluation
  • Cash flow prediction and analysis
  • Cross-sectional and time-series analysis (financial ratios, trend analysis)
- Resource planning
  • Summarize and compare the resources and spending
- Competition
  • Monitor competitors and market directions
  • Group customers into classes and set a class-based pricing procedure
  • Set a pricing strategy in a highly competitive market
- Other examples?

What's next?
- Data Preprocessing
  • Why is it needed?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy
- Data Mining techniques
  • Frequent patterns, association rules
  • Classification and prediction
  • Cluster analysis
- Are you sleeping?
- Visualization of the results
- Summary

Data Preprocessing

Why Data Preprocessing?
- Data in the real world is dirty
  • Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data; e.g., occupation = " ", birthdate = "31/12/2099"
  • Noisy: containing errors or outliers; e.g., Salary = "-10"
  • Inconsistent: containing discrepancies in codes or names; e.g., Age = "42" but Birthday = "03/07/1997" (we are in 2007!); was rating "1, 2, 3", now rating "A, B, C"; discrepancies between duplicate records (in one copy of the data customer A has to pay 200.000 €, in the second copy A does not have to pay anything)

Why is data dirty?
- Incomplete data may come from:
  • "Not applicable" data values at collection time
  • Different considerations between the time the data was collected and the time it is analyzed
  • Human/hardware/software problems
- Noisy data (incorrect values) may come from:
  • Faulty data collection instruments
  • Human or computer errors at data entry
  • Errors in data transmission
- Inconsistent data may come from:
  • Different data sources
  • Functional dependency violations (e.g., modifying some linked data)

Why Is Data Preprocessing Important?
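Before looking at the individual cleaning steps, a minimal sketch of how the three kinds of dirt above can be flagged programmatically. The records and field names are hypothetical illustrations, not from the slides:

```python
# Minimal checks for the three dirty-data classes above.
# Records and field names are hypothetical illustrations.
records = [
    {"name": "A", "occupation": "",       "salary": 1200, "age": 42, "birth_year": 1997},
    {"name": "B", "occupation": "lawyer", "salary": -10,  "age": 30, "birth_year": 1977},
]

CURRENT_YEAR = 2007  # the year used in the slide's example

for r in records:
    if r["occupation"] == "":                       # incomplete: missing attribute value
        print(r["name"], "incomplete: occupation is empty")
    if r["salary"] < 0:                             # noisy: impossible value
        print(r["name"], "noisy: negative salary")
    if CURRENT_YEAR - r["birth_year"] != r["age"]:  # inconsistent: age vs. birth year
        print(r["name"], "inconsistent: age does not match birth year")
```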
Data Preprocessing 1. Data cleaning – missing values
- "Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
- Fill in missing values, e.g., Name = "John", Occupation = "Lawyer", Age = "28", Salary = ""
  • Ignore the record (is it always feasible?)
  • Manually fill in the missing attributes
  • Automatically insert a constant
  • Automatically insert the mean value (relative to the record's class)
  • Most probable value: make some inference!

Data Preprocessing 1. Data cleaning – binning
- Methods to handle noisy data: binning, clustering, regression (no details here)
- Binning example (a runnable sketch follows the data-reduction slides):
  1. Sort data by price (€): 4, 8, 9, 15, 21, 21, 24, 25, 26
  2. Partition into equal-frequency (equi-depth) bins:
     • Bin 1: 4, 8, 9
     • Bin 2: 15, 21, 21
     • Bin 3: 24, 25, 26
  3. Smooth by bin means:
     • Bin 1: 7, 7, 7
     • Bin 2: 19, 19, 19
     • Bin 3: 25, 25, 25

Data Preprocessing 1. Data cleaning – clustering noise

Data Preprocessing 2. Integration and transformation
- Data integration: combine data from multiple sources (D1, D2, D3 → D1,2,3) into a coherent store
- Schema integration: integrate metadata from different sources, e.g., A.cust-id = B.cust-number
- Entity identification problem: identify real-world entities across multiple data sources, e.g., Bill Clinton = William Clinton
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources differ (e.g., cm vs. inch)

Data Preprocessing 2. Integration and transformation – redundancy
- Data integration can lead to redundant attributes
  • Same object: A.house = B.residence
  • Derivates: A.annualIncome = B.salary + C.rentalIncome
- Redundant attributes can be discovered via correlation analysis, a mathematical method for detecting the correlation between two attributes
  • Correlation coefficient (Pearson's product-moment coefficient): the higher it is, the stronger the correlation between the attributes
  • χ² (chi-square) test
  • No details on these methods here

Data Preprocessing 2. Integration and transformation – operations
- Aggregation: sum the sales of different branches (in different data sources) to compute the company sales
- Generalization: concept hierarchy climbing, e.g., from an integer age attribute to classes of age (children, adult, old)
- Normalization: scale values to fall within a small, specified range, e.g., from [-∞, +∞] to [-1, +1]:
  {-13, -6, -3, 10, 100} → {-0.13, -0.06, -0.03, 0.1, 1} (a sketch of this follows the data-reduction slides)

Data Preprocessing 3. Data reduction
- Obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
- Different reduction types: dimensionality, numerosity, discretization
- Dimensionality: attribute subset selection
  • Example with a decision tree (left branches true, right branches false): starting from the initial attribute set {A1, A2, A3, A4, A5, A6}, the tree tests only A4, A1 and A6 to reach its Class 1 / Class 2 leaves
  • Reduced attribute set: {A1, A4, A6}

Data Preprocessing 3. Data reduction (continued)
- Dimensionality: Principal Components Analysis (PCA)
  • Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
  • Works for numeric data only
  • Used when the number of dimensions is large
- Numerosity: clustering
  • Partition the data set into clusters based on similarity, and store only the cluster representations (e.g., centroid and diameter); figure: 2 clusters
  • Sparse data leads to many clusters – not effective
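A runnable sketch of the equal-frequency binning and bin-mean smoothing from the data-cleaning slide above, reproducing the slide's price data:

```python
# Equal-frequency (equi-depth) binning with smoothing by bin means,
# using the price data from the binning slide above.
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26])  # step 1: sort

n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]  # step 2: partition

# Step 3: replace every value in a bin by the bin mean.
smoothed = [[round(sum(b) / len(b))] * len(b) for b in bins]
print(bins)      # [[4, 8, 9], [15, 21, 21], [24, 25, 26]]
print(smoothed)  # [[7, 7, 7], [19, 19, 19], [25, 25, 25]]
```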
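The [-1, +1] normalization example from the transformation slide divides each value by the largest absolute value in the attribute (that reading reproduces the slide's numbers exactly; min-max scaling is the other common choice). A minimal sketch:

```python
# Normalization by maximum absolute value, scaling into [-1, 1],
# reproducing the example from the transformation slide above.
values = [-13, -6, -3, 10, 100]
max_abs = max(abs(v) for v in values)          # 100
normalized = [v / max_abs for v in values]
print(normalized)  # [-0.13, -0.06, -0.03, 0.1, 1.0]
```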
Data Preprocessing 3. Data reduction – sampling
- Numerosity: sampling – obtain a small sample s to represent the whole data set N
- Problem: how to select a representative sample set
  • Random sampling is not enough – representative subpopulations should be preserved (figure: random sampling leaves some regions with no samples)
  • Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database

Data Preprocessing 4. Discretization – concept hierarchy
- Three types of attributes:
  • Nominal: values from an unordered set (color, profession)
  • Ordinal: values from an ordered set (military or academic rank)
  • Continuous: numbers (integer or real)
- Discretization: divide the range of a continuous attribute into intervals
  • Reduces data size and complexity
  • Some data mining algorithms do not support continuous types; in those cases discretization is mandatory
  • Some useful methods: binning, clustering (already presented), entropy-based discretization (no details here)

Data Preprocessing 4. Discretization – concept hierarchy (continued)
- Concept hierarchy generation for categorical data
  • Specification of an ordering between attributes (schema level): street < city < state < country
  • Specification of a hierarchy of values (data level): {Urbana, Champaign, Chicago} < Illinois
  • Automatic generation using the number of distinct values: for the attribute set {street, city, state, country}, if |street| = 600.000, |city| = 3.000, |state| = 300 and |country| = 15, then street < city < state < country

Outline
- Data Mining techniques
  • Frequent patterns, association rules: support and confidence
  • Classification and prediction: decision trees, Bayesian classifiers, Support Vector Machines, lazy learning
  • Cluster analysis
- Visualization of the results
- Summary

Data Mining techniques

Frequent pattern analysis
- What is it?
  • Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
  • Frequent pattern analysis: searching for frequent patterns
- Motivation: finding inherent regularities in data
  • Which products are bought together? Yesterday's wine-and-spaghetti example
  • What are the subsequent purchases after buying a PC?
  • Can we automatically classify web documents?
- Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis

Basic Concepts: Frequent Patterns and Association Rules (1)

Transaction-id | Items bought
1 | Wine, Bread, Spaghetti
2 | Wine, Cocoa, Spaghetti
3 | Wine, Spaghetti, Cheese
4 | Bread, Cheese, Sugar
5 | Bread, Cocoa, Spaghetti, Cheese, Sugar

- Itemsets (= transactions in this example)
- Goal: find all rules of type X → Y between items in an itemset, with minimum:
  • Support s: probability that an itemset contains X ∪ Y
  • Confidence c: conditional probability that an itemset containing X also contains Y

Support and confidence
- That is, for a rule A → B:
  • Support s = P(A ∪ B): the probability that a transaction contains both A and B
  • Confidence c = P(B|A): the conditional probability that a transaction containing A also contains B
- Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.
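As a sanity check on these definitions, a minimal sketch computing support and confidence over the transaction table above; the numbers match the worked example on the next slide:

```python
# Support and confidence for X -> Y over the five transactions above.
transactions = [
    {"Wine", "Bread", "Spaghetti"},
    {"Wine", "Cocoa", "Spaghetti"},
    {"Wine", "Spaghetti", "Cheese"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Cocoa", "Spaghetti", "Cheese", "Sugar"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """P(Y | X): among transactions containing X, the fraction also containing Y."""
    return support(X | Y) / support(X)

print(support({"Wine", "Spaghetti"}))       # 0.6  (60%)
print(confidence({"Wine"}, {"Spaghetti"}))  # 1.0  (100%)
print(confidence({"Spaghetti"}, {"Wine"}))  # 0.75 (75%)
```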
Basic Concepts: Frequent Patterns and Association Rules (2)
- Same transaction table as before; suppose minimum support s = 50% and minimum confidence c = 50%
- Support is used to define frequent patterns (sets of products appearing in at least s% of the itemsets):
  • {Wine} in itemsets 1, 2, 3 (support = 60%)
  • {Bread} in itemsets 1, 4, 5 (support = 60%)
  • {Spaghetti} in itemsets 1, 2, 3, 5 (support = 80%)
  • {Cheese} in itemsets 3, 4, 5 (support = 60%)
  • {Wine, Spaghetti} in itemsets 1, 2, 3 (support = 60%)

Basic Concepts: Frequent Patterns and Association Rules (3)
- Same transaction table; suppose minimum support s = 50% and minimum confidence c = 50%
- Confidence defines the association rules: X → Y rules over frequent patterns whose confidence is higher than c
- Suggestion: {Wine, Spaghetti} is the only frequent pattern to be considered. Why?
- Association rules:
  • Wine → Spaghetti (support = 60%, confidence = 100%)
  • Spaghetti → Wine (support = 60%, confidence = 75%)

Advanced concepts in Association Rules discovery
- Algorithms must face scalability problems
  • Apriori: if an itemset is infrequent, its supersets need not be generated or tested!
- Advanced problems
  • Boolean vs. quantitative associations: age(x, "30..39") and income(x, "42..48K") → buys(x, "car") [s = 1%, c = 75%]
  • Single-level vs. multiple-level analysis: what brands of wine are associated with what brands of spaghetti?
- Are support and confidence clear?

Another example for association rules

Transaction-id | Items bought
1 | Margherita, Beer, Coke
2 | Margherita, Beer
3 | Quattro stagioni, Coke
4 | Margherita, Coke

- Minimum support s = 40%, minimum confidence c = 70%

Another example for association rules (solution)
- Frequent itemsets:
  • {Margherita} = 75%
  • {Beer} = 50%
  • {Coke} = 75%
  • {Margherita, Beer} = 50%
  • {Margherita, Coke} = 50%
- Association rules:
  • Beer → Margherita [s = 50%, c = 100%]
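The Apriori idea mentioned above can be sketched in a few lines: candidates of size k+1 are built only from frequent k-itemsets, so supersets of infrequent itemsets are never generated or tested. A minimal, non-optimized sketch on the pizza transactions (its output matches the frequent itemsets listed above):

```python
from itertools import combinations

# Apriori-style frequent-itemset mining on the pizza transactions above.
transactions = [
    {"Margherita", "Beer", "Coke"},
    {"Margherita", "Beer"},
    {"Quattro stagioni", "Coke"},
    {"Margherita", "Coke"},
]
min_support = 0.4  # s = 40%, as in the example

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Frequent 1-itemsets ({Quattro stagioni} is pruned: support 25% < 40%).
items = sorted({i for t in transactions for i in t})
frequent = [frozenset({i}) for i in items if support({i}) >= min_support]

level = frequent
while level:
    # Candidate (k+1)-itemsets are unions of frequent k-itemsets sharing k-1
    # items, so supersets of infrequent itemsets are never generated (Apriori).
    candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1}
    level = [c for c in candidates if support(c) >= min_support]
    frequent += level

for f in frequent:
    print(set(f), support(f))
```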
Classification vs. Prediction
- Classification
  • Characterizes (describes) a set of items belonging to a training set; these items are already classified according to a label attribute
  • The characterization is a model
  • The model can be applied to classify new data (predict the class they should belong to)
- Prediction
  • Models continuous-valued functions, i.e., predicts unknown or missing values
- Applications: credit approval, target marketing, fraud detection

Classification: the process
1. Model construction
  • The class label attribute defines the class each item should belong to
  • The set of items used for model construction is called the training set
  • The model is represented as classification rules, decision trees, or mathematical formulae
2. Model usage
  • Estimate the accuracy of the model: on the training set, or on a generalization of the training set
  • If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification: the process – model construction
- Training data:

Name | Rank | Years | Tenured
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no

- A classification algorithm produces the classifier (model):
  IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification: the process – model usage
- Testing data:

Name | Rank | Years | Tenured
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes

- Unseen data: (Jeff, Professor, 4) – tenured? (a code sketch of this step follows the decision-tree example below)

Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  • Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation
  • New data is classified based on the training set
- Unsupervised learning (clustering)
  • The class labels of the training data are unknown
  • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Evaluating generated models
- Accuracy
  • Classifier accuracy: predicting class labels
  • Predictor accuracy: guessing values of the predicted attribute
- Speed
  • Time to construct the model (training time)
  • Time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: understanding and insight provided by the model

Example of Classification
- Suppose we have a database of customers on the AllElectronics mailing list. The database describes attributes of the customers, such as their name, age, income, occupation, and credit rating. The customers can be classified as to whether or not they have purchased a computer at AllElectronics.
- Suppose new customers are added to the database and you would like to notify them of an upcoming computer sale. Sending promotional literature to every new customer in the database can be quite costly; a more cost-efficient method is to target only those new customers who are likely to purchase a new computer. A classification model can be constructed and used for this purpose.

Decision trees
- Each internal node represents a test on an attribute; each leaf node represents a class.
- Example: a decision tree for the concept buys_computer, indicating whether or not a customer at AllElectronics is likely to purchase a computer. (Slide credit: Assoc. Prof. Dr. D. T. Anh)

Classification techniques – Decision Trees (1)
- Investment type choice:

  Income > 20K€?
  ├─ no → Low risk
  └─ yes → Age > 60?
      ├─ yes → Mid risk
      └─ no → Married?
          ├─ yes → Mid risk
          └─ no → High risk
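Looping back to the tenured example, a minimal sketch of the model-usage step: apply the learned rule (IF rank = 'professor' OR years > 6 THEN tenured = 'yes') to the testing data to estimate accuracy, then to the unseen tuple:

```python
# Model usage: apply the learned rule to labeled testing data, then to unseen data.
def classify(rank, years):
    """Classifier learned from the training set on the slide above."""
    return "yes" if rank == "Professor" or years > 6 else "no"

testing = [  # (name, rank, years, actual tenured label)
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

correct = sum(classify(r, y) == label for _, r, y, label in testing)
print(f"accuracy on testing data: {correct}/{len(testing)}")  # 3/4: Merlisa is misclassified

print("Jeff:", classify("Professor", 4))  # the unseen tuple -> 'yes'
```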
Classification techniques – Decision Trees (2)
- How are the attributes in decision trees selected? Two well-known indexes are used:
  • Information gain: selects the most informative attribute for distinguishing the items between the classes; it is biased towards attributes with a large set of values
  • Gain ratio: addresses the limitations of information gain

Classification techniques – Bayesian classifiers
- Bayesian classification: a statistical classification technique
  • Predicts class membership probabilities
- Founded on the Bayes theorem:

  P(H|X) = P(X|H) · P(H) / P(X)

  • What if X = "red and rounded" and H = "apple"?
- Performance: the simplest implementation (Naïve Bayes) is comparable to decision trees and neural networks
- Incremental: each training example can increase or decrease the probability that a hypothesis is correct

Other Classification Methods
- k-nearest neighbor classifier
- Case-based reasoning
- Genetic algorithms
- Rough set approach
- Fuzzy set approaches

The k-Nearest Neighbor Algorithm
- All instances (samples) correspond to points in the n-dimensional space.
- Nearest neighbors are defined in terms of Euclidean distance. The Euclidean distance of two points X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn) is

  d(X, Y) = sqrt( Σ_{i=1..n} (x_i − y_i)² )

- Given an unknown sample x_q, the k-nearest-neighbor classifier searches the space for the k training samples that are closest to x_q.
- Once the k nearest neighbors are found, they vote: the unknown sample is assigned the most common class among them. When k = 1, the unknown sample is assigned the class of the single closest training sample.

Genetic Algorithms
- GA: based on an analogy to biological evolution
- Each rule is represented by a string of bits. Example: the rule "IF A1 AND NOT A2 THEN C2" can be encoded as the bit string "100", where the two left bits represent attributes A1 and A2, respectively, and the rightmost bit represents the class. Similarly, the rule "IF NOT A1 AND NOT A2 THEN C1" can be encoded as "001".
- An initial population is created, consisting of randomly generated rules
- Based on the notion of survival of the fittest, a new population is formed, consisting of the fittest rules and their offspring
- The fitness of a rule is measured by its classification accuracy on a set of training examples
- Offspring are generated by crossover and mutation

5 minutes break!

Classification techniques – Support Vector Machines
- One of the most advanced classification techniques
- Left figure: a small margin between the classes is found; right figure: the largest margin is found
- Support vector machines (SVMs) are able to identify the largest-margin separation shown in the right figure

Classification techniques – SVMs + Kernel Functions
- Is data always linearly separable? NO!!!
- Solution: SVMs + kernel functions
- How to split this? (figure: data that no straight line can separate)
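A sketch of the kernel answer to "how to split this?", assuming scikit-learn is available (the concentric-circles dataset is a stand-in for the slide's figure): a linear SVM fails, while an RBF-kernel SVM separates the classes.

```python
# SVM vs. SVM + kernel function on data that is not linearly separable.
# Assumes scikit-learn is installed; the dataset is synthetic concentric circles.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)  # kernel trick: implicit non-linear mapping

print("linear SVM accuracy:", linear.score(X, y))   # poor: no separating line exists
print("RBF-kernel SVM accuracy:", rbf.score(X, y))  # near perfect on this data
```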
(Figure: two panels, "SVM" vs. "SVM + Kernel Functions")

Classification techniques – Lazy learning
- Lazy learning: simply stores the training data (or does only minor processing) and waits until it is given a test tuple
  • Less time in training, but more time in predicting
  • Uses a richer hypothesis space (many local linear functions), and hence the accuracy can be higher
- Instance-based learning
  • A subcategory of lazy learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
  • An example: the k-nearest neighbor approach

Classification techniques – k-nearest neighbor
- All instances correspond to points in the n-dimensional space; x is the instance to be classified
- The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
- For discrete-valued target attributes, k-NN returns the most common value among the k training examples nearest to x
- Which class should the green circle belong to? It depends on k! With k = 3 it is Red; with k = 5 it is Blue (a code sketch of this flip follows the summary)

Prediction techniques – an overview
- Prediction is different from classification
  • Classification predicts a categorical class label
  • Prediction models continuous-valued functions
- The major method for prediction is regression
  • It models the relationship between one or more independent (predictor) variables and a dependent (response) variable
- Regression analysis
  • Linear and multiple regression
  • Non-linear regression
  • Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
  • No details here

What is cluster analysis?
- Cluster: a collection of data objects
  • Similar to one another within the same cluster
  • Dissimilar to the objects in other clusters
- Cluster analysis: finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
- It belongs to unsupervised learning
- Typical applications
  • As a stand-alone tool to get insight into the data distribution
  • As a preprocessing step for other algorithms (day 1 slides)

Examples of cluster analysis
- Marketing: help marketers discover distinct groups in their customer bases
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location

Good clustering
- A good clustering method will produce high-quality clusters with
  • high intra-class similarity
  • low inter-class similarity
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
  • The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
  • It is hard to define "similar enough" or "good enough"

A small example
- How to cluster this data? (figure)
- This process is not easy in practice. Why?

Visualization of the results
- Presentation of the results or knowledge obtained from data mining in visual forms
- Examples: scatter plots, association rules, decision trees, clusters

Summary
- Why Data Mining?
- Data Mining and KDD
- Data preprocessing
- Some scenarios
- Classification
- Clustering
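To close, a sketch of the k-NN voting step from the lazy-learning slides, showing how the predicted class can flip between k = 3 and k = 5 as in the green-circle example. The point coordinates are hypothetical, chosen only to reproduce that behavior:

```python
from collections import Counter
from math import dist  # Euclidean distance, as in the k-NN slides (Python >= 3.8)

# Labeled training points (hypothetical coordinates: the query's 3 nearest
# neighbors are mostly Red, but its 5 nearest are mostly Blue).
training = [
    ((1.0, 1.0), "Red"), ((1.2, 0.8), "Red"),
    ((2.0, 2.2), "Blue"), ((2.1, 1.9), "Blue"), ((2.2, 2.0), "Blue"),
    ((0.9, 1.1), "Blue"),
]
query = (1.1, 1.0)  # the "green circle" to classify

def knn(k):
    """Assign the query the most common class among its k nearest neighbors."""
    neighbors = sorted(training, key=lambda p: dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print("k=3:", knn(3))  # Red wins the vote (2 Red vs. 1 Blue)
print("k=5:", knn(5))  # Blue wins the vote (3 Blue vs. 2 Red)
```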