Data Mining -1-
Dr. Engin YILDIZTEPE

Course Web Page
http://kisi.deu.edu.tr/engin.yildiztepe/
- Lecture Slides
- Announcements
- Assignments

Reference Books
- Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann Publishers.
- Larose, Daniel T. (2005). Discovering Knowledge in Data: An Introduction to Data Mining. New Jersey: John Wiley and Sons Ltd.
- Alpaydın, E. (2010). Introduction to Machine Learning. Second edition. London: MIT Press.

Syllabus
- Data Mining: Introduction
- Databases
- Data Warehouses
- OLAP
- Data Mining Process
- Data Mining Tasks
- Clustering
- Classification
- Association Rules
- Evaluation

Grading
- Midterm examination (40%)
- Final examination (50%)
- Homework (10%)

Introduction
"We are drowning in information but starved for knowledge." (John Naisbitt)
- Moore's Law: in 1965, Intel Corporation cofounder Gordon Moore predicted that the density of transistors in an integrated circuit would double approximately every two years (often quoted as 18 months).
- Experts on ants estimate that there are 10^16 to 10^17 ants on earth. In the year 1997, there was one transistor per ant.

Computer History: What Is the Idea?
- Abacus (c. 2600 BC)
- Calculating Clock, Wilhelm Schickard (1623)
- Pascaline, Blaise Pascal (1642)
- Leibniz wheel, Gottfried Leibniz (1672)
- Analytical Engine, Charles Babbage (1837)
- Mark I, Howard H. Aiken (1944)
- ENIAC, John Mauchly and J. Presper Eckert (1947)
- EDVAC, J. Mauchly, J. P. Eckert, and John von Neumann (1951)
- MANIAC, John von Neumann (1952)
- First microprocessor, Intel 4004 (1971)
- APPLE, Steve Wozniak and Steve Jobs (1976)
- IBM Personal Computer (1981)

The aim of data mining is to extract hidden knowledge from data sets.
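Moore's prediction quoted above is simple compound doubling; as a small illustrative sketch (not from the slides, and the two-year doubling period is the quoted rule of thumb, not an exact law):

```python
def moore_factor(years, doubling_period=2.0):
    """Growth factor of transistor density after `years`,
    assuming density doubles once every `doubling_period` years."""
    return 2.0 ** (years / doubling_period)

print(moore_factor(2))   # 2.0 (one doubling)
print(moore_factor(20))  # 1024.0 (ten doublings)
```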
What Is Data Mining?
- "Data mining is the exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules." [1]
- "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner." [2]
- "Data mining is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large data bases." [3]

[1] Berry, Michael J.A. and Linoff, Gordon, Data Mining Techniques: For Marketing, Sales, and Customer Support, John Wiley & Sons, Inc., 1997.
[2] Hand, David, Mannila, Heikki, and Smyth, Padhraic, Principles of Data Mining, MIT Press, Cambridge, MA, 2001.
[3] Cabena, Peter, Hadjinian, Pablo, Stadler, Rolf, Verhees, Jaap, and Zanasi, Alessandro, Discovering Data Mining: From Concept to Implementation, Prentice Hall, Upper Saddle River, NJ, 1998.

KDD Process
Figure 1. An overview of the steps that compose the Knowledge Discovery in Databases (KDD) process. (Fayyad, U., et al., "From Data Mining to Knowledge Discovery in Databases", 1996.)

Why Data Mining?
- The explosive growth in data collection.
- The storing of data in data warehouses, so that the entire enterprise has access to a reliable, current database.
- The availability of increased access to data from Web navigation and intranets.
- The competitive pressure to increase market share in a globalized economy.
- The development of commercial data mining software.
- The tremendous growth in computing power and storage capacity.

Database
- A relational database is a collection of tables.
- Tables consist of a set of columns.
- Tables store a large set of tuples (records or rows).
- A database system consists of a collection of interrelated data.
- Relational data can be accessed by queries written in a relational query language (SQL).

Data Warehouse
- A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
- Data warehouses are constructed via a process of data cleansing, data transformation, data integration, data loading, and periodic data refreshing.
- Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions.

What exactly is a data warehouse?
- A data warehouse refers to a database that is maintained separately from an organization's operational databases.
- A data warehouse collects information about subjects that span an entire organization, and thus its scope is enterprise-wide.

Operational Database / Transactional Database
- An operational database consists of the data used to run the day-to-day operations of the business.
- An operational database contains enterprise data which are up to date and modifiable.
- The operational database is the source of the data warehouse.
- A transactional database consists of transactions. Each record in a transactional database captures a transaction, such as a customer's purchase.

OLAP
- By providing multidimensional data views and the precomputation of summarized data, data warehouse systems are well suited for On-Line Analytical Processing, or OLAP.
- OLAP operations make use of background knowledge regarding the domain of the data being studied in order to allow the presentation of data at different levels of abstraction. Such operations accommodate different user viewpoints.
Data Mart
- Data marts are a subset of data warehouse data.
- A data mart is a departmental subset of a data warehouse.
- It focuses on selected subjects, and thus its scope is department-wide.

OLAP Operations
- Examples of OLAP operations include drill-down and roll-up, which allow the user to view the data at differing degrees of summarization.
- Traditional query and report tools describe what is in a database.
- OLAP goes further; it is used to answer why certain things are true.
- The user forms a hypothesis about a relationship and verifies it with a series of queries against the data.
(Figure source: Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann Publishers.)

OLAP vs. Data Mining
- The OLAP analyst generates a series of hypothetical patterns and relationships and uses queries against the database to verify or disprove them. OLAP analysis is essentially a deductive process.
- But what happens when the number of variables being analyzed is in the dozens or even hundreds? It becomes much more difficult and time-consuming to find a good hypothesis.
- Data mining is different from OLAP because rather than verifying hypothetical patterns, it uses the data itself to uncover such patterns. It is essentially an inductive process.
- For example, suppose the analyst who wanted to identify the risk factors for loan default were to use a data mining tool.
- Data mining and OLAP can complement each other: OLAP is complementary in the early stages of the KDD process. It can help you explore your data, for instance by focusing attention on important variables, identifying exceptions, or finding interactions.

Data Mining Tasks
The most common data mining tasks are as follows:
- Description
- Estimation
- Prediction
- Classification
- Clustering
- Association

CROSS-INDUSTRY STANDARD PROCESS: CRISP-DM
The data mining process must be reliable and repeatable.
1. Business understanding phase
2. Data understanding phase
3. Data preparation phase
4. Modeling phase
5. Evaluation phase
6. Deployment phase

1. Business understanding phase
- Understand the project objectives and requirements.
- Define the data mining problem.
- Prepare a strategy for achieving these objectives.

2. Data understanding phase
- Initial data collection.
- Exploratory data analysis.
- Identification of data quality problems.

3. Data preparation phase
- Prepare the final data set.
- Select the records and variables you want to analyze.
- Perform transformations on certain variables.
- Clean the raw data.

4. Modeling phase
- Select and apply appropriate modeling techniques.
- Calibrate parameters to optimize the results.
- Several different techniques may be used for the same problem.
- If necessary, loop back to the data preparation phase.

5. Evaluation phase
- Evaluate the model or models for quality and effectiveness.
- Determine whether the model in fact achieves the objectives set.
- Come to a decision regarding use of the data mining results.

6. Deployment phase
- Make use of the models created.

Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Why Preprocess the Data?
- Much of the raw data contained in databases is unpreprocessed: incomplete, noisy, and inconsistent.
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- For example, databases may contain:
  - fields that are obsolete or redundant
  - missing values
  - outliers
  - data in a form not suitable for data mining models
  - values not consistent with policy or common sense
- Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record, and attribute selection; transformation; and cleaning of data.

Tasks in Data Preprocessing
- Clean data: fill in missing values; smooth out noise and identify or remove outliers; correct inconsistencies.
- Integrate data: combine data from multiple sources.
- Data transformation: production of derived attributes; format transformations; normalization (scaling to a specific range); aggregation.
- Data reduction: obtain a representation reduced in volume that produces the same or similar analytical results (data aggregation, dimensionality reduction, data compression, generalization).
- Data discretization: of particular importance, especially for numerical data.

Data Cleaning
- Real-world data tend to be incomplete, noisy, and inconsistent.
- Data cleaning routines attempt to:
  - fill in missing values
  - smooth out noise while identifying outliers
  - correct inconsistencies in the data
- Other data problems that require data cleaning: duplicate records, incomplete data.

Data Cleaning: Missing Data
- Data is not available in many cases.
- Many tuples have no recorded value for several attributes, such as customer income in sales data.
- How can you go about filling in the missing values for such an attribute? Consider the following methods:
  - Ignore the tuple: usually done when the class label is missing.
  - Fill in the missing value manually: tedious and often infeasible.
  - Use a global constant to fill in the missing value (in effect creating a new class?).
  - Use the attribute mean to fill in the missing value.
  - Use the attribute mean of all samples belonging to the same class as the given tuple.
  - Use the most probable value to fill in the missing value: inference-based methods such as regression or decision trees.

Exercise (kdnuggets.com)
- Datasets: UCI Machine Learning Repository.
- Adult data set: predict whether income exceeds $50K/yr based on census data.

How to Handle Noisy Data?
- Noise is a random error or variance in a measured variable.
- Data smoothing techniques:
  - Binning: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, or bin boundaries. Binning is also used for discretization.
  - Clustering: detect and remove outliers.
  - Combined computer and human inspection (semi-automated): detect suspicious values and check them manually.
  - Regression: smooth by fitting the data to regression functions.

Simple Discretization Methods: Binning
- Equal-width (distance) partitioning:
  - The most straightforward method: divide the range into N intervals of equal size.
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A) / N.
- Equal-depth (frequency) partitioning:
  - Divide the range into N intervals, each containing approximately the same number of samples (except possibly the last one).
  - Good data scaling; good handling of skewed data.
  - Managing categorical attributes can be tricky.
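As an illustrative sketch (not part of the original slides), both partitioning schemes and the smoothing rules can be written in a few lines of Python; the price list matches the worked example used in the lecture:

```python
def equi_depth_bins(values, depth):
    """Equal-depth (frequency) partitioning: bins with roughly equal counts."""
    values = sorted(values)
    return [values[i:i + depth] for i in range(0, len(values), depth)]

def equi_width_bins(values, lo, hi, n):
    """Equal-width (distance) partitioning: n intervals of width (hi - lo) / n."""
    width = (hi - lo) / n
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        i = min(int((v - lo) // width), n - 1)  # clamp the top edge into the last bin
        bins[i].append(v)
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins if b]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    out = []
    for b in bins:
        if b:
            lo, hi = b[0], b[-1]
            out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
depth_bins = equi_depth_bins(prices, 4)         # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
width_bins = equi_width_bins(prices, 0, 40, 4)  # [[4, 8, 9], [15], [21, 21, 24, 25, 26, 28, 29], [34]]
```

One design choice to note: boundary smoothing sends a tie to the lower boundary, which is why a value exactly midway between the two boundaries maps to the bin minimum.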
- Equal-width partitioning drawbacks: outliers may dominate the presentation, and skewed data is not handled well.

Example: Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Equi-depth partitioning (four values per bin):
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
- Smoothing by bin means: Bin 1: 9, 9, 9, 9; Bin 2: 23, 23, 23, 23; Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries: Bin 1: 4, 4, 4, 15; Bin 2: 21, 21, 25, 25; Bin 3: 26, 26, 26, 34

Equi-width partitioning (bin width = (40 - 0) / 4 = 10):
- Bin 1: {4, 8, 9}, boundaries [0, 10)
- Bin 2: {15}, boundaries [10, 20)
- Bin 3: {21, 21, 24, 25, 26, 28, 29}, boundaries [20, 30)
- Bin 4: {34}, boundaries [30, 40)
- Smoothing by bin means: Bin 1: 7, 7, 7; Bin 2: 15; Bin 3: 25, 25, 25, 25, 25, 25, 25; Bin 4: 34
- Smoothing by bin boundaries: Bin 1: 4, 9, 9; Bin 2: 15; Bin 3: 21, 21, 21, 21, 29, 29, 29; Bin 4: 34

Cluster Analysis
(Figure: values grouped into clusters; outliers lie outside every cluster.)

Regression
- Linear regression: the best line to fit two variables.
- Multiple linear regression: more than two variables; fit to a multidimensional surface.
(Figure: a fitted line y = x + 1 through points in the (x, y) plane.)

Data Integration
- Data integration combines data from multiple sources into a coherent store.
- Schema integration: integrate metadata from different sources.
- Entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-#.
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ. Possible reasons: different representations, different scales (e.g., metric vs. British units), different currencies.

Handling Redundant Data in Data Integration
- Redundant data occur often when integrating multiple databases:
  - The same attribute may have different names in different databases.
  - One attribute may be a "derived" attribute in another table, e.g., annual revenue.
- Redundant attributes may be detected by correlation analysis:

    r(A, B) = Σ (A - mean(A)) (B - mean(B)) / ((n - 1) σ_A σ_B)

- Careful integration can help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.

Data Transformation
- Smoothing: remove noise from the data (binning, clustering, regression).
- Aggregation: summarization; data cube construction.
- Generalization: low-level data are replaced by higher-level concepts through the use of concept hierarchies, e.g., a street attribute can be generalized to city; age can be generalized to child, young, middle-aged, senior.
- Normalization: values are scaled to fall within a small, specified range (min-max normalization, z-score normalization, normalization by decimal scaling).
- Attribute/feature construction: new attributes constructed from the given ones.

Data Transformation: Normalization
Normalization is particularly useful for classification (neural networks, distance measurements, nearest-neighbor classification, etc.).
- Min-max normalization maps a value x of attribute A to x' in the range [new_min_A, new_max_A], where min_A and max_A are the minimum and maximum values of the attribute:

    x' = (x - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A

- Z-score normalization: useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization:
    x' = (x - mean_A) / stand_dev_A

- Normalization by decimal scaling:

    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1

Data Reduction Strategies
- A data warehouse may store terabytes of data, so complex data analysis and mining may take a very long time to run on the complete data set.
- Data reduction obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results.
- Data reduction strategies:
  - Data cube aggregation
  - Dimensionality reduction
  - Numerosity reduction
  - Discretization and concept hierarchy generation

Data Cube Aggregation
- The lowest level of a data cube holds the aggregated data for an individual entity of interest.
- Multiple levels of aggregation in data cubes further reduce the size of the data to deal with.
- Reference appropriate levels: use the smallest representation which is enough to solve the task.

Dimensionality Reduction
- Problem: feature selection (i.e., attribute subset selection):
  - Select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features.
  - Irrelevant, weakly relevant, or redundant features are detected and removed.
  - Nice side effect: fewer attributes appear in the discovered patterns, which are then easier to understand.
- Solution: heuristic methods (due to the exponential number of choices), usually greedy:
  - step-wise forward selection
  - step-wise backward elimination
  - combining forward selection and backward elimination
  - decision-tree induction

Heuristic Feature Selection Methods
- There are 2^d possible sub-features of d features.
- Several heuristic feature selection methods exist:
  - Best single features under the feature independence assumption: choose by significance tests.
  - Step-wise feature selection: the best single feature is picked first; then the next best feature conditioned on the first; and so on.
  - Step-wise feature elimination: repeatedly eliminate the worst feature.
  - Combined feature selection and elimination.
  - Optimal branch and bound: use feature elimination and backtracking.

Numerosity Reduction
- Parametric methods: assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers). E.g., log-linear models obtain a value at a point in m-D space as a product over appropriate marginal subspaces.
- Non-parametric methods: do not assume models. Major families: histograms, clustering, sampling.

Regression Analysis and Log-Linear Models
- Linear regression: data are modeled to fit a straight line, Y = α + βX.
  - The two parameters α and β specify the line; they are estimated from the data at hand, often by applying the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ....
- Multiple regression: Y = b0 + b1 X1 + b2 X2.
  - Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector (predictor variables).
  - Many nonlinear functions can be transformed into the above.
- Log-linear models: approximate discrete multidimensional joint probability distributions.
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g., p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd.

Histograms
- Divide the data into buckets and store the average (or sum) for each bucket.
- A bucket represents an attribute-value/frequency pair.
- Can be constructed optimally in one dimension using dynamic programming.
- Related to quantization problems.
(Figure: a price histogram with buckets spanning 10,000 to 90,000.)

Clustering
- Partition the data set into clusters, and store the cluster representation only.
- Quality of clusters can be measured by their diameter (the maximum distance between any two objects in the cluster) or centroid distance (the average distance of each cluster object from its centroid).
- Can be very effective if the data is clustered, but not if the data is "smeared".
- Hierarchical clustering is possible, with clusters stored in multi-dimensional index tree structures (B+-tree, R-tree, quad-tree, etc.).
- There are many choices of clustering definitions and clustering algorithms (further details later).

Sampling
- Sampling allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data: choose a representative subset of the data.
- Simple random sampling may have very poor performance in the presence of skew, so adaptive sampling methods have been developed.
- Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data.
- The cost of sampling is proportional to the size of the sample and increases linearly with the number of dimensions.
- Sampling may not reduce database I/Os (a page is read at a time).
- Sampling is a natural choice for progressive refinement of a reduced data set.
(Figure: a simple random sample and a cluster/stratified sample drawn from the raw data.)
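The stratified sampling idea described above can be sketched in a few lines of Python. This is an illustrative sketch, not the lecture's code; the class labels, fraction, and toy data are made up:

```python
import random

def stratified_sample(records, label_of, fraction, seed=0):
    """Stratified sampling: draw the same fraction from every class so that
    rare classes stay represented even in skewed data."""
    rng = random.Random(seed)  # fixed seed for repeatability
    by_class = {}
    for r in records:
        by_class.setdefault(label_of(r), []).append(r)
    sample = []
    for rows in by_class.values():
        k = max(1, round(len(rows) * fraction))  # keep at least one per class
        sample.extend(rng.sample(rows, k))
    return sample

# Skewed toy data: 90 records of class "a", 10 of class "b".
data = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
sample = stratified_sample(data, lambda r: r[0], 0.10)
# A 10% stratified sample keeps 9 "a" records and 1 "b" record,
# whereas a simple random sample of 10 could easily miss class "b" entirely.
```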
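Returning to the least-squares line fit for Y = α + βX described earlier, the estimates have a closed form. An illustrative sketch (the data points are made up so that they fall exactly on y = 1 + 2x):

```python
def fit_line(xs, ys):
    """Least-squares estimates for Y = alpha + beta * X:
    beta  = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    alpha = mean_y - beta * mean_x
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)
    alpha = my - beta * mx
    return alpha, beta

alpha, beta = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(alpha, beta)  # 1.0 2.0
```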
y Binning (data smoothing) y Histogram analysis (numerosity reduction) country 15 distinct values y Clustering analysis (numerosity reduction) province_or_ state y Entropy-based discretization 65 distinct values city 3567 distinct values street 674,339 distinct values 18 Summary y Data preparation is a big issue for both warehousing and mining y Data preparation includes y Data cleaning and data integration y Data reduction and feature selection y Discretization y A lot of methods have been developed but still an active area of research 19