Introduction to Data Mining
Massive quantities of data exist on computers
Data mining is a way to use these data to learn

Definition
• DATA MINING: exploration & analysis
  – by automatic means
  – of large quantities of data
  – to discover actionable patterns & rules
• Data mining is a way to use massive quantities of data that businesses generate
• GOAL - improve marketing, sales, customer support through better understanding of customers

McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved

Retail Outlets
• Bar coding & scanning generate masses of data
  – customer service
  – inventory control
  – MICROMARKETING
  – CUSTOMER PROFITABILITY ANALYSIS
  – MARKET-BASKET ANALYSIS

Political Data Mining
Grossman et al., 10/18/2004, Time, 38
• 2004 Election
  – Republicans: VoterVault
    • From Mid-1990s
    • About 165 million voters
    • Massive get-out-the-vote drive for those expected to vote Republican
  – Democrats: Demzilla
    • Also about 165 million voters
    • Names typically have 200 to 400 information items

Medical Diagnosis
J. Morris, Health Management Technology Nov 2004, 20, 22-24
• Electronic Medical Records
  – Associated Cardiovascular Consultants
    • 31 physicians
    • 40,000 patients per year, southern New Jersey
  – Data mined to identify efficient medical practice
  – Enhance patient outcomes
  – Reduced medical liability insurance

Mayo Clinic
Swartz, Information Management Journal Nov/Dec 2004, 8
• IBM developed EMR program
  – Complete records on almost 4.4 million patients
  – Doctors can ask for how last 100 Mayo patients with same gender, age, medical history responded to particular treatments

Business Uses of Data Mining
1. Customer profiling
   Identify profitability of customers
2. Targeting
   Determine characteristics of most profitable customers
3. Market-Basket Analysis
   Determine correlation of purchases by profile
Part of Customer Relationship Management

Reasons why Data Mining is now effective
• Data are there
• Data are warehoused (computerized)
  – Walmart: 35 thousand queries per week
• Computing economically available
• Competitive pressure
• Commercial products available

Trends
• Every business is service
  – hotel chains record your preferences
  – car rental companies the same
  – service versus price
    • credit card companies
    • long distance providers
    • airlines
    • computer retailers

Trends
• Mass Customization
  – produce tailored products from standardized components
    • Levi-Strauss - custom fit jeans
    • The Custom Foot
    • Andersen Windows
    • Individual, Inc.
      – electronic clipping
      – customer profiles of interests
      – send custom newsletter

Trends
• Information as Product
  – Custom Clothing Technology Corporation
    • fit jeans, other clothing
  – Lands End
  – J. Crew
• INFORMATION BROKERING
  – IMS - collects prescription data from pharmacies, sells to drug firms
  – AC Nielsen - TV

Trends
• Commercial Software Available
  – using statistical, artificial intelligence tools that have been developed
    • Enterprise Miner (SAS)
    • Intelligent Miner (IBM)
    • Clementine (SPSS)
    • PolyAnalyst (Megaputer)
    • Specialty products

How Data Mining Is Being Used
• U.S. Government
  – track down Oklahoma City bombers, Unabomber, many others
  – Treasury department - international funds transfers, money laundering
  – Internal Revenue Service

How Data Mining Is Used
• Safeway
  – offer Safeway Savings Club card
    • users given discounts
    • users must give personal information
    • every use, collect data
  – identify aggregate patterns (what sells well together; what should be sold together)
    • sell names for 5.5 cents per name to suppliers

How Data Mining Is Used
• Firefly
  – asks members to rate music and movies
  – subscribers clustered
  – clusters get custom-designed recommendations

Cross-selling
• USAA
  – insurance
  – doubled number of products held by average customer due to data mining
  – detailed records on customers
  – predict products they might need
• Fidelity Investments
  – regression - what makes customer loyal

Warranty Claims Routing
• Diesel engine manufacturer
  – stream of warranty claims
  – examine each by expert
    • determine whether charges are reasonable & appropriate
    • think of expert system to automate claims processing

Retaining Good Customers
• Customer loss:
  – Banks - Attrition
  – Cellular Phone Companies - Churn
    • study who might leave, why
• Southern California Gas
  – customer usage, credit information
  – direct mail contact - most likely best billing plan
  – who is price sensitive
• Who should get incentives, whom to keep
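
Market-basket analysis, named in the Retail Outlets and Business Uses slides and in the Safeway example ("what sells well together"), can be illustrated by counting how often two items share a basket. This is only a minimal sketch: the function name pair_counts and the sample baskets are hypothetical, and commercial tools use far more elaborate association-rule methods than a raw count.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical scanner data: one list of items per checkout.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["jam", "tea"],
    ["milk", "tea"],
]

counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

Pairs that appear together more often than their individual popularity would suggest are the candidates for "what should be sold together".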

Fairbank & Morris
• Credit card company’s most valuable asset:
  – INFORMATION ABOUT CUSTOMERS
• Signet Banking Corporation
  – obtained behavioral data from many sources
  – built predictive models
  – aggressively marketed balance transfer card
• First Union
  – who will move soon - improve retention

Methodology
Analyzing data
Given management goals and that management can translate knowledge into action

Basic Styles
• Top-Down: HYPOTHESIS TESTING
  – SUPERVISED
  – have a theory, experiment to prove or disprove
  – SCIENCE
• Bottom-Up: KNOWLEDGE DISCOVERY
  – UNSUPERVISED
  – start with data, see new patterns
  – CREATIVITY

Hypothesis Testing
• Generate theory
• Determine data needed
• Get data
• Prepare data
• Build computer model
• Evaluate model results
  – confirm or reject hypotheses

Generate Theory
• Study
• Systematically tie different input sources together (MENTAL MODEL)
  – What causes sales volume?
    • sales rep performance
    • economy, seasonality
    • product quality, price, promotion, location

Generate Theory
• Brainstorm:
  – diverse representatives for broad coverage of perspectives (electronic)
  – keep under control (keep positive)
  – generate testable hypotheses

Define Data Needed
• Determine data needed to test hypothesis
  – Lucky - query existing database
  – More often - gather
    • pull together from diverse databases, survey, buy

Locate Data
• Usually scattered or unavailable
• Sources:
  – warranty claims
  – point-of-sale data (cash register records)
  – medical insurance claims
  – telephone call detail records
  – direct mail response records
  – demographic data, economic data
• PROFILE: counts, summary statistics, cross-tabs, cleanup

Prepare Data for Analysis
• Summarize:
  – too much - no discriminant information
  – too little - swamped with useless detail
• Process for computer: EBCDIC, ASCII
• Data encoding:
  – how data are recorded can vary
  – may have been collected with specific purpose (CAL omitting LA)
• Textual data: avoid if possible (may need to code)
• Missing values: missing salary - use mean?

Build Computer Model
• Convert mental model into quantitative
  – roamers less sensitive to price than others
    • threshold defining roamer
    • average price per call, or number of calls above price level
  – families with children in high school most likely to respond to home equity loan offer
    • identify families with, without high school age
    • past data - responded or didn’t

Evaluate Model
• Determine if hypotheses supported
  – statistical practice
  – test rule-based systems for accuracy
• Requires both business and analytic knowledge

SUPERVISED
Dorn, National Underwriter Oct 18, 2004, 34,39
• Health care fraud
  – Use statistics to identify indicators of fraud or abuse
  – Can rapidly sort through large databases
    • Identify patterns different from norm
  – Moderately successful
    • But only effective on schemes already detected
    • To benefit firm, need to identify fraud before paying claim
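
The missing-value question raised under Prepare Data for Analysis ("missing salary - use mean?") can be made concrete with a short sketch. The function name impute_mean and the salary figures are hypothetical, and None is assumed to mark a missing entry.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical salary column with two missing entries.
salaries = [40000, None, 60000, 50000, None]
filled = impute_mean(salaries)
print(filled)
```

The question mark on the slide is a fair warning: filling with the mean leaves the average unchanged but understates the spread of the data, so it is one option to weigh rather than a default.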

Knowledge Discovery
• Machine learning?
  – Usually need intelligent analyst
• Directed: explain value of some variable
• Undirected: no dependent variable selected
  – identify patterns
• Use undirected to recognize relationships; use directed to explain once found

Directed
• Goal-oriented
• Examples:
  – If discount applies, impact on products - who is likely to purchase credit insurance?
  – Predicted profitability of new customer - what to bundle with a particular package
• Identify sources of preclassified data
• Prepare data for analysis
• Build & train computer model
• Evaluate

Identify Data Sources
• Best - existing corporate data warehouse
  – data clean, verified, consistent, aggregated
• Usually need to generate
  – most data in form most efficient for designed purpose
  – historical sales data often purged for dormant customers (but you need that information)

Prepare Data
• Put in needed format for computer
• Make consistent in meaning
• Need to recognize what data are missing
  – change in balance = new – old
  – add missing but known-to-be-important data
• Divide data into training, test, evaluation
• Decide how to treat outliers
  – statistically biasing, but may be most important

Build & Train Model
• Regression - human builds (selects IVs)
• Automatic systems train
  – give it data, let it hammer
• OVERFITTING:
  – fit the data
  – TEST SET a means to evaluate model against data not used in training
    • tune weights before using to evaluate
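
The step "Divide data into training, test, evaluation" can be sketched as a shuffled three-way split. The 60/20/20 proportions, the fixed seed, and the name split_records are assumptions made for illustration; the slides do not prescribe sizes.

```python
import random


def split_records(records, train_pct=60, test_pct=20, seed=0):
    """Shuffle records and split them into training, test, and evaluation lists."""
    if train_pct <= 0 or test_pct <= 0 or train_pct + test_pct >= 100:
        raise ValueError("percentages must leave room for all three sets")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = len(shuffled) * train_pct // 100
    n_test = len(shuffled) * test_pct // 100
    training = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]  # whatever remains
    return training, test, evaluation


# Hypothetical data set of 100 record ids.
training, test, evaluation = split_records(range(100))
print(len(training), len(test), len(evaluation))
```

Following the roles given in the slides, the model is fitted on the training set, tuned against the test set, and judged only on the evaluation set, which it has never seen.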

Evaluate Model
• ERROR RATE: proportion of classifications in evaluation set that were wrong
• too little training: poor fit on training data and poor error rate
• optimal training: good fit on both
• too much training: great fit on training data and poor error rate

Undirected Discovery
• What items sell together? Strawberries & cream
  – Directed: What items sell with tofu? tabasco
• Long distance caller market segmentation
  – Uniform usage - weekday & weekend, spikes on holidays
  – After segmentation:
    • high & uniform except for several months of nothing
    • high credit worthiness & profitability
    • college students

UNSUPERVISED
Dorn, National Underwriter Oct 18, 2004, 34,39
• Health care fraud
  – Look at historical claim submissions
    • Build ad hoc model to compare with current claims
  – Assign similarity score to fraudulent claims
  – Predict fraud potential

Undirected Process
• Identify data sources
• Prepare data
• Build & train computer model
• Evaluate model
• Apply model to new data
• Identify potential targets for undirected
• Generate new hypotheses to test

Identify potential targets
• Why
• Who
• When

Generate hypotheses
• Any commonalities in data?
• Are they useful?
  – Many adults watch children’s movies
    • chaperones are an important market segment
    • they probably make final decision
• When hypothesis is generated, that determines data needed
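
The ERROR RATE defined in the Evaluate Model slide, the proportion of classifications in the evaluation set that were wrong, reduces to a short calculation. The function name error_rate and the stay/leave labels below are hypothetical illustration data.

```python
def error_rate(predicted, actual):
    """Proportion of classifications that were wrong."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute an error rate on an empty set")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


# Hypothetical evaluation set: did each customer stay or leave?
actual = ["stay", "leave", "stay", "stay", "leave"]
predicted = ["stay", "stay", "stay", "leave", "leave"]

rate = error_rate(predicted, actual)  # 2 of the 5 predictions are wrong
print(rate)
```

Computing the same figure on the training data and on the evaluation data gives the comparison the slide describes: a low training error paired with a high evaluation error suggests too much training.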

Bank Case Study
• Directed knowledge discovery to recognize likely prospects for home equity loan
  – training set - current loan holders
  – developed model for propensity to borrow
  – got continuous scores, ranked customers
  – sent top 11% material
• Undirected: segmented market into clusters
  – in one, 39% had both business & personal accounts
  – cluster had 27% of the top 11%
• Hypothesis: people use home equity to start business
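
The two calculations in the bank case study, ranking customers by score to pick the top 11% and then checking what share of that group falls in one cluster, can be sketched as follows. All names, scores, and cluster labels here are hypothetical; the bank's actual model and data are not given in the slides.

```python
def top_percent(scores, percent):
    """Return the ids with the highest scores, covering `percent` of all customers."""
    n_keep = max(1, len(scores) * percent // 100)
    ranked = sorted(scores, key=scores.get, reverse=True)  # best score first
    return ranked[:n_keep]


def share_in_cluster(selected, cluster_of, cluster):
    """Fraction of the selected customers that belong to `cluster`."""
    hits = sum(1 for cid in selected if cluster_of[cid] == cluster)
    return hits / len(selected)


# Hypothetical propensity scores and cluster labels for 100 customers.
scores = {f"c{i}": i for i in range(100)}
cluster_of = {f"c{i}": ("business" if i % 4 == 0 else "other") for i in range(100)}

selected = top_percent(scores, 11)
share = share_in_cluster(selected, cluster_of, "business")
print(len(selected), share)
```

A cluster that holds a much larger share of the top-scored group than of the customer base as a whole is what prompts a new hypothesis, as with the business-account cluster in the case study.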