Introduction to Data Mining
Oct. 12/14, 2004
Xiangming Mu

What is Data Mining
• Definition: extraction of hidden predictive information from large databases
  – Predicts future trends and behaviors
  – Uses automated, prospective analysis technology from statistics, artificial intelligence, and machine learning
  – Supported by three technologies:
    • Massive data collection
    • Powerful multiprocessor computers
    • Data mining algorithms

The evolution of Data Mining
• Data Collection (1960s)
  – Retrospective, static data delivery (e.g., total revenue over the last five years)
• Data Access (1980s)
  – Retrospective, dynamic data delivery at the record level
  – e.g., DBMS, SQL, ODBC
• Data Warehousing and Decision Support (1990s)
  – Retrospective, dynamic data delivery at multiple levels
  – On-line analytic processing (OLAP), multidimensional databases, data warehouses
  – e.g., what were unit sales in New England last month? Drill down to Boston.
• Data Mining
  – Prospective, proactive information delivery
  – Massive databases, algorithms, and powerful computers
  – e.g., what's likely to happen to Boston unit sales next month? Why?

Data Warehouses
• Combine data from heterogeneous data sources into a single homogeneous structure.
• Organize data in simplified structures for efficient analytical queries rather than for transaction processing.
• Contain transformed data that is valid, consistent, consolidated, and formatted for analysis.
• Provide stable data that represents business history.
• Are updated periodically with additional data rather than with frequent transactions.
• Simplify security requirements.
• Provide a database organized for OLAP rather than OLTP (online transaction processing).
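The OLAP-style "roll up, then drill down" query above ("unit sales in New England last month? Drill down to Boston") can be sketched in a few lines. This is a minimal illustration only; the sales records and figures are invented for the example.

```python
from collections import defaultdict

# Hypothetical unit-sales records (region, city, units); the figures are
# invented purely to illustrate OLAP-style roll-up and drill-down.
sales = [
    ("New England", "Boston", 120),
    ("New England", "Providence", 45),
    ("New England", "Hartford", 60),
    ("Mid-Atlantic", "New York", 300),
]

# Roll up: total unit sales per region ("what were unit sales in New England?")
by_region = defaultdict(int)
for region, _city, units in sales:
    by_region[region] += units

# Drill down: unit sales per city within one region ("drill down to Boston")
by_city = defaultdict(int)
for region, city, units in sales:
    if region == "New England":
        by_city[city] += units

print(by_region["New England"])  # → 225
print(by_city["Boston"])         # → 120
```

A real data warehouse would answer both questions with aggregate queries over a fact table keyed by region, city, and month, but the roll-up/drill-down idea is the same.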
Data Warehouse (cont'd)
• Usually multiple granularities are supported:
  – Current operational data
  – Historical operational data
  – Aggregated data
  – Metadata
• Two kinds of tables
  – Fact tables
    • Many rows, primarily numeric data, static data
  – Dimension tables
    • Reference the data stored in the fact table
    • Primarily character data, fewer rows, updatable

Scope of Data Mining
• Data Mining is
  – Decision trees
  – Nearest neighbor classification
  – Neural networks
  – Rule induction
  – K-means clustering
• Data Mining is NOT
  – Data warehousing
  – SQL/queries/reporting
  – Software agents
  – Online Analytical Processing (OLAP)

Models
• A "black box" that makes predictions about the future based on information from the past and present
• A large number of inputs is usually available
• A good model offers both accuracy and understandability
  – Decision trees -> rule induction -> regression models -> neural networks
• [Diagram: inputs such as Age, Beer, Diapers, and Blood Pressure feed into a Model, which outputs a prediction]

Techniques for Data Mining
• Classical techniques
  – Statistics, neighborhoods, and clustering
• New techniques
  – Trees, networks, and rules

Statistics
• Helps answer questions like
  – What patterns are there in my database?
  – What is the chance that an event will occur?
  – Which patterns are significant?
  – What is a high-level summary of the data that gives me some idea of what is contained in the database?
• Description
  – Histograms, max, min, mean, median, variance

Statistics: regression
• Prediction = Regression (?)
• Maps values from predictors in such a way that the lowest error occurs when making a prediction
• Linear regression
  – Y = a + b*X
  – Multiple linear regression
• Non-linear regression

Nearest Neighbor
• Objects that are "near" to each other will have similar prediction values as well
• If you know the prediction value of one object, you can predict it for its nearest neighbors
• Confidence: the chance of being correct
• Examples:
  – Guess your income from your neighbors' incomes
  – The "find more like this" question

K Nearest Neighbors
• The prediction is a combination of the classes of the k records most similar to it in a historical dataset

Clustering
• A method by which like records are grouped together
• There is no absolute "best" way to cluster
  – It depends on the dimensions chosen for clustering (e.g., one dimension: clustering based on a single attribute of the database)

Clustering vs. Nearest Neighbor
• Nearest Neighbor: used for both data consolidation and prediction; the space is defined by the problem; generally uses only distance metrics
• Clustering: primarily for data grouping; the space is predefined by past experience; can use other metrics besides distance

Hierarchical clustering
• The number of clusters
  – Two extreme situations: one cluster, or as many clusters as records; both fail to help understand the data better
  – Allows users to choose the appropriate number of clusters
  – Usually viewed as a tree
• Advantages
  – Defined solely by the data
  – Easy to increase or decrease the number of clusters

Clustering approaches
• Single link approach
  – Clustering based on linked records (the minimum distance between clusters)
• Complete link approach
  – Clustering based on the maximum distance between records in the clusters

What is a Decision Tree
• Tree-shaped structures that represent sets of decisions
• These decisions generate rules for the classification of a dataset
• Specific decision tree methods include
  – Classification and Regression Trees (CART)
  – Chi-Square Automatic Interaction Detection (CHAID)
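The K Nearest Neighbors idea above (predict a record's class from the classes of the k most similar historical records) can be sketched directly. The dataset here is invented purely for illustration; real implementations would also normalize the attributes so no single one dominates the distance.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict a class label as the majority vote of the k nearest
    training records, using Euclidean distance."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))
    votes = Counter(label for _point, label in nearest[:k])
    return votes.most_common(1)[0][0]

# Hypothetical historical dataset: (age, income in $10k) -> outcome.
train = [
    ((25, 3.0), "renew"),
    ((30, 3.5), "renew"),
    ((28, 2.8), "renew"),
    ((55, 9.0), "non-renew"),
    ((60, 8.5), "non-renew"),
]

print(knn_predict(train, (27, 3.2), k=3))  # → renew
```

The fraction of the k neighbors that agree with the winning vote can serve as the "confidence" measure mentioned in the slides.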
[Example tree: 50 renewers and 50 non-renewers at the root]
• New technology phone?
  – Yes: 30 renewers, 50 non-renewers
    • Customer < 2.3 years?
      – Yes: 25 renewers, 10 non-renewers
        • Age < 55?
          – Yes: 20 renewers, 0 non-renewers
          – No: 5 renewers, 10 non-renewers
      – No: 5 renewers, 40 non-renewers
  – No: 20 renewers, 0 non-renewers

Characteristics of decision trees
• A collection of data is divided into two sub-collections at each branch.
• Each leaf of the tree represents a segment of the data; the records in a segment are somewhat similar to each other.
• The distribution of records among the leaves provides information for prediction.
• The condition for a particular leaf is the combination of nodes along the path from the root.

What is a good question?
• One that is good at distinguishing groups in the prediction variable
  – i.e., distinguishes between renewers and non-renewers
• Measured by the level of homogeneity in each segment after the question
  – CART (Classification and Regression Trees): tries all questions, then picks the best (based on entropy metrics)
  – CHAID (Chi-Square Automatic Interaction Detector): relies on the chi-square test used in contingency tables

When to stop splitting?
• Each segment contains only one record
• All the records in the segment have identical characteristics
• The improvement is not substantial enough
  – The entropy does not change substantially

Artificial Neural Networks
• What is an Artificial Neural Network?
  – Resembles biological neural networks in structure
  – Non-linear predictive models
  – Learn through training
  – Derive primarily from artificial intelligence rather than statistics
• Advantages
  – Highly accurate prediction
  – Can be applied across a large number of different types of problems
• Disadvantages
  – Difficult to understand
  – Difficult to deploy

Degree of automation
• Even though they share a high degree of automation, neural networks still need some manual arrangement:
  – How should the nodes in the network be connected?
  – How many neuron-like processing units should be used?
  – When should "training" be stopped in order to avoid overfitting? (A network can be trained forever and still not be 100% accurate on the training set.)
  – Numeric data usually needs to be normalized (between 0 and 1.0)

Feature Extraction
• Detect the most relevant and most important predictors
  – e.g., high-level features for images: points and lines

[Diagram: input nodes feed hidden nodes, which feed output nodes that produce the prediction]
• Complexity of a neural network
  – The number of links (weights)
  – The number and layers of hidden nodes
[Example: predicting default. Income ($70,000) and age (50) are normalized to 0.7 and 0.5; with link weights 0.2 and 0.3 the combined input is 0.7*0.2 + 0.5*0.3 = 0.29. The output is interpreted as 0 = good credit risk (no default), 1 = bad credit risk (likely default)]

Types of neural networks
• Back propagation
  – Errors from the output nodes are propagated backwards through the hidden nodes to the input nodes.
  – At each level, the link weights between nodes are updated to avoid similar mistakes.
  – Different algorithms have been developed:
    • Gradient descent
    • Newton's method
    • Genetic algorithms
• Other methods
  – Kohonen feature maps
    • Feedforward, with no hidden layer
  – Radial Basis Function networks

Rule Induction
• The most common form of knowledge discovery in unsupervised learning systems
• A bottom-up approach (decision trees are top-down)
• The extraction of useful if-then rules from data based on statistical significance
  – e.g., if paper plates, then plastic forks
  – But the rules do not imply causality
• Two aspects
  – Accuracy (confidence): how often is the rule correct?
  – Coverage (support): how often does the rule apply?

Rule                                                                    Accuracy   Coverage
If nails, then hammer                                                   85%        20%
If bread, then swiss cheese                                             6%         15%
If 42 years old, purchased pretzels and dry roasted peanuts, then beer  95%        0.01%

Interpreting rules
• Targeting the consequent
  – e.g., all the rules with a consequent related to "coffee"
• Targeting accuracy
  – Highly profitable activities
• Targeting coverage
  – Rules that apply very often
• Targeting interestingness
  – A balance between accuracy and coverage
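The two rule-induction measures above, accuracy (confidence) and coverage (support), are simple ratios and can be computed directly from a set of transactions. The market baskets below are invented purely to illustrate the calculation.

```python
def rule_stats(transactions, antecedent, consequent):
    """Accuracy (confidence) and coverage (support) of an if-then rule.

    coverage = fraction of transactions containing the antecedent
               ("how often does the rule apply?")
    accuracy = of those, the fraction also containing the consequent
               ("how often is the rule correct?")
    """
    has_antecedent = [t for t in transactions if antecedent <= t]
    correct = [t for t in has_antecedent if consequent <= t]
    coverage = len(has_antecedent) / len(transactions)
    accuracy = len(correct) / len(has_antecedent) if has_antecedent else 0.0
    return accuracy, coverage

# Hypothetical market baskets, invented for the example.
baskets = [
    {"nails", "hammer"},
    {"nails", "hammer", "bread"},
    {"nails", "saw"},
    {"bread", "milk"},
    {"milk"},
]

# Rule: if nails, then hammer. Nails appear in 3 of 5 baskets
# (coverage 0.6); hammer appears in 2 of those 3 (accuracy 2/3).
acc, cov = rule_stats(baskets, {"nails"}, {"hammer"})
print(round(acc, 2), cov)  # → 0.67 0.6
```

"Targeting interestingness" then amounts to ranking candidate rules by some function that rewards both measures at once rather than either alone.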