Introduction to Predictive Analytics
TIBER TRAINING, 6/12/2015

Short Bio
B.A. from Allegheny College
M.S. from Carnegie Mellon University
Internship at the Department of Veterans Affairs
July 2014 – Present: Senior Consultant, IBM Global Business Services, Advanced Analytics and Optimization

Today's Goals: Students Will Be Able To:
1. Identify the purpose of data mining
2. Discuss major paradigms in data mining
3. Apply those paradigms to business problems
4. Discuss major considerations in model performance
5. Evaluate model performance
6. Discuss the general functionality of some major classification algorithms
7. Use RapidMiner software to load data, train a classifier, and evaluate it
8. Use RapidMiner to create clusters from data

Agenda
1. Conceptual Material
   1. Data Mining Foundations
   2. Problem Types and Model Paradigms
   3. Model Performance Considerations
   4. Recognizing a Strong Model
2. Hands-On Activities
   1. K-Nearest Neighbors Classification
   2. Decision Tree Models
   3. K-Means Clustering

Data Mining Foundations – Data Mining Definition
BRAINSTORM: Based on exposure to literature, conversations with clients, and professional publications: what is data mining?

Data Mining Foundations – Several Definitions
• To extract insight or information from data
• To discover interesting relationships within data
• "Extracting useful knowledge from data to solve business problems" (Provost & Fawcett, "Data Science for Business")

CRISP-DM Framework
"Cross-Industry Standard Process for Data Mining"
1. Iterative process
2. The human stays in the loop throughout iterations
3. Understanding of both the business and the data is paramount

System Design
(Diagram summary:) SOURCE SYSTEMS feed a PROCESSED STATE (a data warehouse? an operational data store? some other form?), which feeds DATA MINING; results flow to RESULTS STORAGE and on to BI dashboards, results reporting, and a results dashboard. Data mining may be ad hoc/one-off or automatic. It is assumed that some action will occur on the results.

Data Format
Rows are also known as records or observations. Columns are also known as fields, attributes, or dimensions, and divide into input fields and label/output fields; labels can be binary, categorical, or numeric.
NOTE: This format expects inputs and outputs to be at the same level (i.e., for an individual, for a visit to the store, for a webpage view, etc.).

Name             Age  Occupation  Purchase?  Total Spent
Jane Smith       38   Designer    Yes        40,000
James Doe        19   Student     No         N/A
Sally Q. Public  50   CEO         Yes        9,000

Data Format in Different Disciplines
BRAINSTORM: Based on experience within business intelligence and data warehousing: how does this data format expectation differ from other applications? Does it pose any unique challenges?

Problem Types and Paradigms
SUPERVISED: classification and regression. Supervised methods assume labels/outputs.
UNSUPERVISED: clustering and visualization. Unsupervised methods lack labels/outputs.
The preceding slide had the $ amount of the purchase. Why not use it?!

Problem Types: Classification
Given a set of inputs, predict the most likely value for a categorical or binary output.
Examples (from the preceding slide, where every new customer must record their name, age, and occupation):
• A new customer arrives: Jimmy G. Generic, age 10, Student. Will this new customer make a purchase?
• Given the dollar amount, GPS location of the store, category of the retailer, date, and time of a purchase – is it fraudulent?
• Given the pulse, respiration, and blood pressure – will the patient have a complication?
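To make the row/column format and the classification task concrete, here is a minimal Python sketch. The three customers come from the slide's table; the pandas/scikit-learn encoding choices are my own illustrative assumptions, since the session's hands-on tool is RapidMiner.

```python
# A toy version of the slide's data format: each row is one record,
# with input fields and a binary label at the same level.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "name":        ["Jane Smith", "James Doe", "Sally Q. Public"],
    "age":         [38, 19, 50],
    "occupation":  ["Designer", "Student", "CEO"],
    "purchase":    ["Yes", "No", "Yes"],     # binary label/output field
    "total_spent": [40000, None, 9000],      # numeric output; N/A when no purchase
})

X = pd.get_dummies(data[["age", "occupation"]])  # inputs; categorical -> numeric
y = data["purchase"]                             # label the supervised model predicts

clf = DecisionTreeClassifier().fit(X, y)

# New customer from the slide: Jimmy G. Generic, age 10, Student.
jimmy = pd.get_dummies(pd.DataFrame({"age": [10], "occupation": ["Student"]}))
jimmy = jimmy.reindex(columns=X.columns, fill_value=0)
print(clf.predict(jimmy))  # predicted value of the binary "purchase" output
```

Note that "name" is left out of the inputs: it is an identifier, not a predictive field.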
Problem Types: Regression
Given a set of inputs, predict the most likely value for a real-valued output.
Example (from the preceding slide): For the new customer Jimmy G. Generic, age 10, Student – IF this customer makes a purchase, how much are they likely to spend?
Today's session has minimal coverage of regression. Why?

Problem Types: Clustering
Given a set of inputs, with or without outputs, assign a class label to each record based upon similarity to other records.
Example: (scatter plot of salespeople, with "Days as a Salesperson" on one axis and "Average $ Value of Sales/Day" on the other, showing natural groupings)

Problem Types: Visualization
(Social network visualization, courtesy of Wikipedia)
Not necessarily a problem type like the others, but visualization can sometimes create new insight for existing problems.
What can we say about this network? Would we necessarily glean the same from clustering?

Problem Types and Business
BRAINSTORM: Now that we've introduced the major problem types in data mining, form groups and create a list of at least five business use cases for each of the problem types.

Model Performance Considerations
PROBLEM SCENARIO: We have clickstream data on how long a customer has viewed an item (total) and how many times they've clicked it. We pull 10 records from the database and build a model (which type of model? which type of problem?).
(Scatter plot: X = purchase, O = nonpurchase, plotted by "total time spent viewing" against "# of times clicked." All the X's fall to the right of a threshold q on the viewing-time axis.)
SUCCESS! IF Total_Time >= q THEN Purchase ELSE Nonpurchase.
Time to cash the check! Right?

Model Performance Considerations
We deploy the system, and 10 new customers arrive. Suddenly, our classifier looks much worse: purchases and nonpurchases now appear on both sides of q. With more rules, using both dimensions (say, a second threshold p on the number of clicks), we can still have reasonable accuracy, but we can't have a perfect classifier using these two dimensions. Can we imagine a third? And another rule?

Model Performance Considerations
UNDERFIT vs. OVERFIT
http://datascience.stackexchange.com/questions/361/when-is-a-model-underfitted
http://scott.fortmann-roe.com/docs/BiasVariance.html
Which of these did we display in the original model for our clickstream customers? What if we kept adding rules, and kept adding dimensions?

Testing a Model
Cross-validation lets us choose random subsets of the data to hold out from the model-building stage. 10-fold cross-validation is the most popular.
https://chrisjmccormick.files.wordpress.com/2013/07/10_fold_cv.png

10-Fold Cross-Validation Utility
1. Choose between different modeling algorithms
2. Tune parameters of the chosen algorithm to reduce bias or variance
3. Create an estimate of model performance

Evaluating Classifiers
Refer to the handout – Confusion Matrix and Performance Metrics.
BRAINSTORM: Why not just use accuracy? Time for the big bucks! I can deliver 99% accuracy for some problems, guaranteed! How? (The sketch below makes the trick concrete.)
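Before turning to ROC curves, here is a minimal scikit-learn sketch of the two ideas just discussed: 10-fold cross-validation, and why accuracy alone can be gamed on an unbalanced label. The synthetic data and the choice of a tree model are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic clickstream-style data: two input columns (think total viewing
# time and click count) and a rare "purchase" label to mimic an unbalanced
# problem: exactly 20 purchasers out of 1,000 records.
X = rng.random((1000, 2))
y = np.zeros(1000, dtype=int)
y[:20] = 1

# 10-fold cross-validation: train on 9 folds, score on the held-out fold,
# and repeat so every record is held out exactly once.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=10, scoring="accuracy")
print("mean cross-validated accuracy:", scores.mean())

# The "99% accuracy, guaranteed" trick: since purchases are rare, a model
# that always predicts "nonpurchase" scores 98% accuracy while finding
# zero purchasers -- which is why accuracy alone is not enough.
print("accuracy of always predicting the majority class:", (y == 0).mean())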
Receiver Operating Characteristic (ROC) Curve
IMPORTANT: Points on the ROC curve each represent a specific threshold for a classifier.
Time for the big bucks! I can deliver a 0% false positive rate, guaranteed! How?
http://upload.wikimedia.org/wikipedia/commons/8/8c/Receiver_Operating_Characteristic.png

Using ROC Curves to Compare Models
http://www.actaneurocomms.org/content/figures/2051-5960-1-3-11-l.jpg
1. Generally, models closer to the top left are best, e.g. a 100% true positive rate and a 0% false positive rate
2. Occasionally, one model will tend toward lower false positives while another favors higher true positives
3. This leads to: cost-sensitive or unbalanced classifiers

K-Nearest Neighbors
1. Non-parametric approach (does not actually extract model parameters; relies on having all examples in memory)
2. Relies on a distance function between each point and the others
3. Each nearby point "votes" on the class label
http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/lguo/image/knn/knn.jpg

K-Nearest Neighbors, Continued
1. Parameters that can be tuned:
   ◦ K (the number of neighbors that vote)
   ◦ The distance metric
2. What about 1 nearest neighbor?
3. What about 100 nearest neighbors?

K-Nearest Neighbors, Strengths and Weaknesses
1. Fairly interpretable (can present example cases)
2. Can handle multiple output categories (not just binary)
3. Can only handle numeric inputs
4. Non-parametric – memory intensive and slower to score, but takes no time to train
5. Distance metrics begin to break down as dimensionality increases

Decision Tree Models
1. Recursive algorithm approach
2. Uses some measure of purity
   ◦ Given a certain split on an input, how "pure" is the output?
   ◦ Information gain, based on entropy, is a popular measure

Decision Trees, Continued
1. Test the information gain on the output variable given a split on each of the input variables
2. Make the first split where information gain is maximum
3. Recursively repeat for each of the resulting nodes, which correspond to split points of the original input values
4. Stop when the leaf nodes have reached an adequate level of "purity"
5. Handle numeric fields with cutpoints (beyond the scope of this session)

Decision Trees – Entropy and Information Gain
ENTROPY
H(y) = -Σᵢ p(yᵢ) * log2 p(yᵢ)
Example: 7/10 people purchased, 3/10 did not.
H(y) = -[0.7*log2(0.7) + 0.3*log2(0.3)]
H(y) = -[0.7*(-0.51) + 0.3*(-1.74)]
H(y) = 0.88
http://www.talkorigins.org/faqs/information/Hbinary.gif
How does 0.88 correspond to the chart at right?

Entropy, Another View
(Best regards to CMU's Andrew Moore, who used a similar image.)
Pillow stuffing sorted into tidy piles: LOW entropy. Pillow stuffing thoroughly mixed together: HIGH entropy.

Decision Trees – Entropy and Information Gain, Continued
INFORMATION GAIN
IG(T, a) = H(T) – H(T|a)
In other words, take the entropy of the outcome and subtract the entropy after you've split on some criterion. For example, if H(y) = 0.88 for purchases on the preceding slide, what is the entropy if you first break the records into income >= 50,000 and income < 50,000?

Information Gain – Worked Example

Shopped?  Income >= 50k?  Purchased?
Y         Y               Y
Y         Y               Y
Y         Y               Y
Y         Y               Y
Y         N               Y
Y         N               Y
Y         N               Y
Y         Y               N
Y         N               N
Y         N               N

IG, Shopped:
Entropy(Purchased) = 0.88
Entropy(Purchased|Shopped = Y) = 0.88
P(Shopped = Y) = 1
IG = 0.88 – (1 * 0.88) = 0

IG, Income:
Entropy(Purchased) = 0.88
Entropy(Purchased|Income >= 50k) = 0.72
Entropy(Purchased|Income < 50k) = 0.97
P(Income >= 50k) = 0.5
P(Income < 50k) = 0.5
IG = 0.88 – (0.5 * 0.72 + 0.5 * 0.97)
IG = 0.88 – (0.36 + 0.485)
IG = 0.035

Information Gain, Worked Example (Split View)
The same ten records, split on income:

Income >= 50k?  Purchased?    |  Income >= 50k?  Purchased?
Y               Y             |  N               Y
Y               Y             |  N               Y
Y               Y             |  N               Y
Y               Y             |  N               N
Y               N             |  N               N

Why Devote So Much Time to Decision Trees?
1. Many modern implementations, including ID3 and C4.5/C5.0
2. Intuitive, interpretable models
3. Fast and scalable
4. Can handle many dimensions and mixed categorical and numeric inputs
5. Can handle multiple-valued outputs, not just binary
6. Parametric – takes longer to train, but scores very quickly
7. Sometimes called CART – Classification and Regression Trees. Trees can also be used for regression, which is beyond the scope of this session.
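The entropy and information-gain arithmetic above can be checked directly. Here is a plain-Python sketch of the two formulas, run on the ten records from the worked example (the function names are mine, not from any particular library):

```python
from math import log2

def entropy(labels):
    """H(y) = -sum_i p(y_i) * log2 p(y_i), over the distinct label values."""
    n = len(labels)
    probs = [labels.count(v) / n for v in set(labels)]
    return -sum(p * log2(p) for p in probs)

# Ten records from the worked example: (shopped, income >= 50k, purchased).
records = [("Y","Y","Y")] * 4 + [("Y","N","Y")] * 3 + \
          [("Y","Y","N")] + [("Y","N","N")] * 2

purchased = [r[2] for r in records]
print(round(entropy(purchased), 2))     # 0.88

def info_gain(records, attr_index):
    """IG(T, a) = H(T) - H(T|a): entropy before the split, minus the
    probability-weighted entropy of each branch after the split."""
    total = entropy([r[2] for r in records])
    n = len(records)
    for value in set(r[attr_index] for r in records):
        branch = [r[2] for r in records if r[attr_index] == value]
        total -= (len(branch) / n) * entropy(branch)
    return total

print(round(info_gain(records, 0), 3))  # shopped: 0.0 (everyone shopped)
print(round(info_gain(records, 1), 3))  # income:  0.035
```

Running it reproduces the slide's numbers: 0.88 for the outcome entropy, zero gain for the useless "Shopped?" split, and 0.035 for the income split.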
K-Means Clustering
1. K is a given – the number of clusters
2. Place K cluster centers randomly throughout the space, assigning all points to one of the K clusters
3. Calculate the centroid of the points assigned to each cluster (in all dimensions) and move the cluster centers to these new points
4. Calculate the distance of all points to the new centroids and reassign them; repeat step 3
5. Repeat steps 3 and 4 until convergence (the cluster centroids move very little from one iteration to the next)
(A Python sketch of this procedure appears at the end of this document.)

K-Means Clustering Example
Notice that the clusters are extremely imbalanced in Iteration 1. Imagine the green points on the bottom left pulling that centroid down and to the left; likewise with the blue points.

Conclusion
Topics introduced:
1. Definition of data mining
2. Data format – inputs, outputs, dimensions
3. Problem types and paradigms
4. Model performance considerations; measuring and comparing models
5. K-Nearest Neighbors: non-parametric classification
6. Decision trees: parametric classification
7. K-Means: unsupervised clustering

Topics Not Covered
1. Much of classical statistics (p-values, ANOVA, statistical representativeness, sampling plans)
2. Preprocessing data (binning, normalizing, deriving new variables)
3. Segmentation for model performance
4. IT infrastructure concerns
5. Bayesian methods (Naïve Bayes, Bayes nets)
6. Geospatial analytics
7. Social network analysis and visualization
8. Recommender systems
9. Regression and time series analysis
10. Advanced techniques (two-stage modeling, etc.)
11. Many algorithms (neural nets, support vector machines, etc.)
12. Operationalizing data mining
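Finally, to complement the RapidMiner hands-on activity, here is a plain-Python sketch of the K-Means procedure described earlier. The toy points and helper name are my own for illustration; a real project would more likely use a library implementation.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-Means: pick k starting centers, assign each point to its
    nearest center, move each center to the centroid of its assigned
    points, and repeat until the assignments stop changing."""
    random.seed(seed)
    centers = random.sample(points, k)  # start from k of the points
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
            clusters[d.index(min(d))].append((x, y))
        # Update step: move each center to its cluster's centroid
        # (keep the old center if a cluster ends up empty).
        new_centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # convergence: centroids stopped moving
            break
        centers = new_centers
    return centers, clusters

# Two obvious groups of points; K-Means should find one center in each.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 9), (9, 8)]
centers, clusters = kmeans(points, k=2)
print(centers)
```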