Final Project

Data sets
    Visit web site: http://www.kdnuggets.com/datasets/index.html
    This is an online repository of large data sets which encompasses a wide variety of data types, analysis tasks, and application areas.
    The primary role of this repository is to enable researchers in knowledge discovery and data mining to scale existing and future data analysis algorithms to very large and complex data sets.
    http://kdd.ics.uci.edu/

Data sets
    Data Sets
        by application area
        by name
        by date (reverse chronological)
    Machine Learning Repository
    Task Files
        by task type
        by application area
        by name
        by date (reverse chronological)
        by data type

Report & Presentation
    Written report (50%) + presentation (50%) ==> counts as the final exam grade
    Groups of four students
    Written report (8 pages at least, cover not included)
    Presentation: 15 minutes + questions (5 minutes). Presenting students do not ask questions; all other students must answer questions. Answers need not be immediate and may be given before the end of class.
    One class session is used for discussion and questions, and for fixing the selected data set in advance. (It may be changed within one week.)

Business Data Mining Applications
    Partial representative sample of applications
        Catalog sales
        CRM
        Credit scoring
        Banking (loans)
        Investment risk
        Insurance

Fingerhut
    Founded 1948
    Today sends out 130 different catalogs to over 65 million customers
    6 terabyte data warehouse
    3000 variables on the 12 million most active customers
    Over 300 predictive models
    Focused marketing

Fingerhut
    Purchased by Federated Department Stores for $1.7 billion in 1999 (for database)
    Fingerhut had $1.6 to $2 billion business per year, targeted at lower income households
    Can mail 400,000 packages per day
    Each product line has its own catalog

Fingerhut
    Uses segmentation, decision tree, regression, neural network tools from SAS and SPSS
    Segmentation - combines order & demographic data with product offerings
        can target mailings to greatest payoff
        customers who had recently moved tripled their purchasing in the 12 weeks after the move
        send furniture, telephone, decoration catalogs

Data for SEGMENTATION
    (grocery and dine out are indices)

    subj   age   income   marital   grocery   dine out   savings
    1001    53    80000   wife          180         90     30000
    1002    48   120000   husband       120        110     20000
    1003    32    90000   single         30        160      5000
    1004    26    40000   wife           80         40         0
    1005    51    90000   wife          110         90     20000
    1006    59   150000   wife          160        120     30000
    1007    43   120000   husband       140        110     10000
    1008    38   160000   wife           80        130     15000
    1009    35    70000   single         40        170      5000
    1010    27    50000   wife          130         80         0

Initial Look at Data
    Want to know features of those who spend a lot dining out
    INCLUDE AS MANY ACTIONABLE VARIABLES AS POSSIBLE
        things you can identify
    Manipulate data
        sort on most likely indicator (dine out)

Sorted by Dine Out
    (grocery and dine out are indices)

    subj   age   income   marital   grocery   dine out   savings
    1004    26    40000   wife           80         40         0
    1010    27    50000   wife          130         80         0
    1001    53    80000   wife          180         90     30000
    1005    51    90000   wife          110         90     20000
    1002    48   120000   husband       120        110     20000
    1007    43   120000   husband       140        110     10000
    1006    59   150000   wife          160        120     30000
    1008    38   160000   wife           80        130     15000
    1003    32    90000   single         30        160      5000
    1009    35    70000   single         40        170      5000

Analysis
    Best indicators
        marital status
        groceries
    Available
        marital status might be easier to get

Fingerhut
    Mailstream optimization
        which customers are most likely to respond to existing catalog mailings
        saves nearly $3 million per year
        reversed trend of catalog sales industry in 1998
        reduced mailings by 20% while increasing net earnings to over $37 million

LIFT
    LIFT = probability in class by sample divided by probability in class by population
        if population probability is 20% and sample probability is 30%, LIFT = 0.3/0.2 = 1.5
    Best lift not necessarily best
        need sufficient sample size
        as confidence increases, longer list but lower lift

Lift Example
    Product to be promoted
    Sampled over 10 identifiable segments of potential buying population
    Profit $50 per item sold
    Mailing cost $1
    Sorted by estimated response rates

Lift Data
    Seg    Rate     Rev    Cost   Profit
      1   0.042   $2.10     $1    $1.10
      2   0.035   $1.75     $1    $0.75
      3   0.025   $1.25     $1    $0.25
      4   0.017   $0.85     $1   -$0.15
      5   0.015   $0.75     $1   -$0.25
      6   0.013   $0.65     $1   -$0.35
      7   0.009   $0.45     $1   -$0.55
      8   0.005   $0.25     $1   -$0.75
      9   0.004   $0.20     $1   -$0.80
     10   0.001   $0.05     $1   -$0.95

Lift Chart
    [Figure: LIFT. Cumulative proportion (vertical axis, 0 to 1.2) by segment (horizontal axis, 0 to 10). Series: Cum Response, Random.]

Profit Impact
    [Figure: PROFIT. Dollars (vertical axis, -4 to 12) by segment (horizontal axis, 0 to 10). Series: Cum Revenue, Cum Cost, Cum Profit.]

RFM
    Recency, Frequency, Monetary
    Same purpose as lift
        Identify customers more likely to respond
    RFM tracks customer transactions by its 3 measures
        Code each customer
        Often 5 cells for each measure, or 125 combinations
        Identify positive response of each of the combinations

CUSTOMER RELATIONSHIP MANAGEMENT (CRM)
    Understanding the value a customer provides to the firm
    Kathleen Khirallah - The Tower Group
        Banks will spend $9 billion on CRM by end of 1999
    Deloitte
        only 31% of senior bank executives confident that their current distribution mix anticipated customer needs

Customer Value
    Middle age (41-55), 3-9 years on job, 3-9 years in town, savings account

    year   annual purchases   profit   discounted (1.3 rate)   net
       1               1000      200                     153   153
       2               1000      200                     118   272
       3               1000      200                      91   363
       4               1000      200                      70   433
       5               1000      200                      53   487
       6               1000      200                      41   528
       7               1000      200                      31   560
       8               1000      200                      24   584
       9               1000      200                      18   603
      10               1000      200                      14   618

Younger Customer
    Young (21-29), 0-2 years on job, 0-2 years in town, no savings account

    year   annual purchases   profit   discounted (1.3 rate)   net
       1                300       60                      46    46
       2                360       72                      43    89
       3                432       86                      39   128
       4                518      104                      36   164
       5                622      124                      34   198
       6                746      149                      31   229
       7                896      179                      29   257
       8               1075      215                      26   284
       9               1290      258                      24   308
      10               1548      310                      22   331

Lifetime Value Application
    Drew et al. (2001), Journal of Service Research 3:3
    Cellular telephone division, major US telecommunications firm
    Data on billing, usage, demographics
    Neural net model of churn proportion by month of tenure
        36 tenure classes
    Tested model on 21,500 subscribers, April 1998
        Trained on 15,000, tested on 6,500

Customer Tenure Segments
    1. Least likely to churn
        • Left alone
    2. Slight propensity to churn at end of tenure
        • Moderate pre-expiration marketing
    3. Large spike in churn at expiration
        • Concentrated marketing efforts before expiration
    4. Highest risk
        • Continued competitive offers

CREDIT SCORING
    Data warehouse including demand deposits, savings, loans, credit cards, insurance, annuities, retirement programs, securities underwriting, other
    Statistical & mathematical models (regression) to predict repayment

CREDIT SCORING
    Bank Loan Applications

    Age   Income   Assets    Debts    Want   On-time
     24    55557    27040    48191    1500         1
     20    17152    11090    20455     400         1
     20    85104        0    14361    4500         1
     33    40921    91111    90076    2900         1
     30    76183   101162   114601    1000         1
     55    80149   511937    21923    1000         1
     28    26169    47355    49341    3100         0
     20    34843        0    21031    2100         1
     20    52623        0    23054   15900         0
     39    59006   195759   161750     600         1

Credit Card Management
    Very profitable industry
    Card surfing - pay old balance with new card
    Promotions typically generate 1000 responses, about 1%
    In early 1990s, almost all mass marketing
    Data mining improves (lift)

British Credit Card Company
    Monthly credit data
    Didn't want those who paid in full (no profit)
    Application scoring
        Continued what had been done manually for over 50 years
    Behavioral scoring
        Monitor revolving credit accounts for early warning
    90,000 customers
        State variable: cumulative months of missed repayment
        Selected sample of 10,000 observations
        Initial state all 0 in selected data
        Over 70% of customers never left state 0

Analysis
    Clustering
        Unsupervised partitioning
        K-median to get more stable results
    Pattern search
        Sought patterns from object grouping
        Unexpectedly large number of similar objects
        Estimated probability of each case belonging to objects

Comparison
    Compared clustering partitions with pattern search groupings
    Pattern search identified those behaving in an anomalous manner

Banking
    Among first users of data mining
    Used to find out what motivates their customers (reduce churn)
    Loan applications
    Target marketing
        Norwest: 3% of customers provided 44% of profits
        Bank of America: program cultivating top 10% of customers

CHURN
    Customer turnover
    Critical to:
        telecommunications
        banks
        human resource management
        retailers

Characteristics of Not On-Time
    Age   Income   Assets   Debts    Want   On-time
     28    26169    47355   49341    3100         0
     20    52623        0   23054   15900         0

    Here, Debts exceed Assets
        Age: young
        Income: low
    BETTER: base on statistics, large sample
        supplement data with other relevant variables

Identify Characteristics of Those Who Leave
    Age (years)   Time-job (months)   Time-town (months)   min bal ($)   checking
             27                  12                   12           549   x
             41                  18                   41          3259   x
             28                   9                   15           286   x
             55                 301                    5          2854   x
             43                  18                   18          1112   x
             29                   6                    3             0   x
             38                  55                   20           321   x
             63                 185                    3          2175   x
             26                  15                   15           386   x
             46                  13                   12          1187   x
             37                  32                   25          1865   x

    [The slide also has savings, card, and loan columns. The extracted text keeps 12 x marks for savings and card together and 5 x marks for loan, but not the rows they belong to.]

Analysis
    What are the characteristics of those who leave?
        Correlation analysis
    Which customers do you want to keep?
        Customer value - net present value of customer to the firm

Correlation
    [Table: lower-triangular correlation matrix over Age, Time Job, Time Town, min-bal, check, saving, card, loan. The cell layout was lost in extraction; the 36 values are listed in extraction order.]
    1.0, 1.0, 0.6, 0.9, 0.4, -0.6, 1.0, -0.4, 0.1, -0.5, 1.0, 0.4, 0.9, 0.3, 0.3, 0.5, 1.0, 0.2, 0.3, -0.2, 0.5, 0.4, 0.6, -0.1, 0.2, 0.2, 0.9, 0.3, 1.0, 0.5, 1.0, 0.0, 0.6, -0.1, -0.2, 1.0

Bankruptcy Prediction
    Sung et al. (1999), Journal of MIS 16:1
    Late 20th-century, East Asian corporate bankruptcy critical
    Models built for normal & crisis conditions
        Used decision tree models for explanation
        Discriminant analysis applied to benchmark
    Korean corporations
        Data for all bankrupt corporations on Korean Stock Exchange, 2nd quarter 1997 to 1st quarter 1998
            75 such cases, full data on 30 of those
        Normal: 2nd Qtr 1991 to 1st Qtr 1995
            56 firms, full data on 26

Korean Bankruptcy Study
    Matched bankrupt firms with one or two nonbankrupt firms that had similar assets and size
    56 financial ratios used
        Eliminated 16 due to duplication

Financial Ratios
    Growth (5)
    Profitability (13)
    Leverage (9)
    Efficiency (6)
    Productivity (7)
    DV: 0/1 variable of bankruptcy or not

Multivariate Discriminant Analysis
    Used stepwise procedure
    NORMAL PERIOD
        Normal = 0.58 * cash flow/assets + 0.0623 * productivity of capital - 0.006 * average inventory turnover
    BANKRUPT PERIOD
        Bankrupt = 0.053 * cash flow/liabilities + 0.056 * productivity of capital + 0.014 * fixed assets/(equity+LT liab)

Decision Tree Models
    Used C4.5
    Applied boosting, which improved prediction success
    NORMAL RULES
        IF productivity of capital > 19.65 THEN OK
        IF cash flow/total assets > 5.64 THEN OK
        IF cash flow/total assets ≤ 5.64 AND productivity of capital ≤ 19.65 THEN bankrupt

CRISIS RULES
    IF productivity of capital > 20.61 THEN OK
    IF cash flow/liabilities > 2.64 THEN OK
    IF fixed assets/(equity+long-term invest) > 87.23 THEN OK
    IF cash flow/liabilities ≤ 2.64 AND productivity of capital ≤ 20.61 AND fixed assets/(equity+long-term invest) ≤ 87.23 THEN bankrupt

Comparison
    Model       Correct Bankrupt   Correct OK   Overall   Variables
    DA-normal               0.69         0.90      0.82           3
    DA-crisis               0.53         0.85      0.74           3
    DT-normal               0.72         0.90      0.83           8
    DT-crisis               0.67         0.89      0.81           6

Mortgage Market
    Early 1990s - massive refinancing
    Need to keep customers happy to retain them
    Contact current customers who have rates significantly higher than market
        a major change in practice
        data mining & telemarketing increased Crestar Mortgage's retention rate from 8% to over 20%

Country Investment Risk
    Outcome categories:
    1. Most safe
    2. Developed
    3. Mature emerging markets
    4. New emerging markets
    5. Frontier

Investment Risk Analysis
    Becerra-Fernandez et al. (2002), Computers and Industrial Engineering 43
    Risk by country
        Expert assessment available
    Decision tree (C5), neural network models
    Data:
        Economic indicators (4)
        Depth & liquidity (4)
        Performance & value (5)
        Economic & market risk (4)
        Regulation & efficiency (4)
    52 samples, so used bootstrapping

Models
    Decision trees
        Pruning rate 50%
        Pruning rate 75%
    Neural networks
        Backpropagation
        Fuzzy (ARTMAP)
        Learning vector quantization

Results
    Decision tree algorithms more accurate
        Lower pruning rate gave the lowest error rate
    Neural networks disadvantaged by small data set
    Decision tree algorithms consistently optimistic relative to expert ratings

Banking
    Fleet Financial Group
        $30 million data warehouse
        hired 60 database marketers, statistical/quantitative analysts & DSS specialists
        expected to add $100 million in profit by 2001

Banking
    First Union
        concentrated on contact point
        previously had very focused product groups, little coordination
        Developed offers for customers

INSURANCE
    Marketing, as in retailing & banking
    Special:
        Farmers Insurance Group - underwriting system generating $ millions in higher revenues, lower claims
        7 databases, 35 million records
        better understanding of market niches
        lower rates on sports cars, increasing business

Insurance Fraud
    Specialist criminals - multiple personas
    InfoGlide specializes in fraud detection products
    Similarity search engine
        link names, telephone numbers, streets, birthdays, variations
        identify 7 times more fraud than exact-match systems

Insurance Fraud - Link Analysis
    claim type   amount   physician   attorney
    back          50000   Welby       McBeal
    neck          80000   Frank       Jones
    arm           40000   Barnard     Fraser
    neck          80000   Frank       Jones
    leg           30000   Schmidt     Mason
    multiple     120000   Heinrich    Feiffer
    neck          80000   Frank       Jones
    back          60000   Schwartz    Nixon
    arm           30000   Templer     White
    internal     180000   Weiss       Richards

Insurance Fraud
    Analytics' NetMap for Claims
        uses industrywide database
        creates data mart of internal, external data
        unusual activity for specific chiropractors, attorneys
    HNC Insurance Solutions
        workers compensation fraud
        VeriComp - predictive software (neural nets)
        saved Utah over $2 million

Insurance Data Mining Examples
    Smith et al. (2000), Journal of the Operational Research Society 51:5
    Large data warehouse system
        Recorded every transaction & claim
    Data mining to predict average claim costs & frequency, impact on profitability
        Pricing

Customer Retention Analysis
    Over 20,000 motor vehicle policies due for renewal in one month
        About 7% didn't renew
    Expected reasons: price, service, value of vehicle

Customer Retention Results
    Data mining: Enterprise Miner
        Used data exploration to select variables (13)
        Used log transforms for highly skewed data
    Performed logistic regression, decision trees, neural networks
        Neural network fit test set best
        But low correct rate for termination

Claims Analysis
    Recent growth in policies
        Lower profitability
        Could improve by lowering frequency, reducing claim amounts
    Data over a three-year period
        Sample size well over 100,000 per quarter
    Descriptive statistics:
        High growth in young people, insurance over $40,000

Claims Models
    Clustering
        Predict group policy claims behavior
        Used 50 clusters
        K-means algorithm
    Identified several clusters with abnormal cost ratios or frequency size

TELECOMMUNICATIONS
    Deregulation - widespread competition
    churn
        1/3 poor call quality, 1/2 poor equipment
    wireless performance monitor tracking
        reduced churn about 61%, $580,000/year
    cellular fraud prevention
    spot problems when cell phones begin to go bad

Telecommunications
    Metapath's Communications Enterprise Operating System
        help identify telephone customer problems
            dropped calls, mobility patterns, demographics
        to target specific customers
        reduce subscription fraud
            $1.1 billion
        reduce cloning fraud
            cost $650 million in 1996

Telecommunications
    Churn Prophet, ChurnAlert
        data mining to predict subscribers who cancel
    Arbor/Mobile
        set of products, including churn analysis

TELEMARKETING
    MCI uses data marts to extract data on prospective customers
        typically a 2-month program
        20% improvement in sales leads
        multimillion investment in data marts & hardware
        staff of 45
        trend spotting (which approaches specific customers)

Telemarketing
    Australian Tourist Commission
        maintained database since 1992
        responses to travel inquiries on tours, hotels, airlines, travel agents, consumers
        data mine to identify travel agents & consumers responding to various media
        sales closure rate at 10% and up
        lead lists faxed weekly to productive travel agents

Telemarketing
    Segmentation
        Which customers respond to new promotions, to discounts, to new product offers
        Determine
            whom to offer new service to
            those most likely to commit fraud

Human Resource Management
    Identify individuals liable to leave company without additional compensation or benefits
    Firm may already know 20% use 80% of offered services
        don't know which 20%
        data mining (business intelligence) can identify
    Use most talented people in highest priority (or most profitable) business units

Human Resource Management
    Downsizing
        identify right people, treat them well
        track key performance indicators
        data on talents, company needs, competitor requirements
    State of Mississippi's MERLIN network
        30 databases (finance, payroll, personnel, capital projects)
        Cognos Impromptu system - 230 users

CASINOS
    Casino gaming is one of the richest data sets known
    Harrah's - incentive programs
        about 8 million customers hold Total Gold cards, used whenever the customer spends money in the casino
        comprehensive data collection
    Trump's Taj Card similar

Casinos
    Bellagio & Mandalay Bay
        strategy of luxury visits
        child entertainment
        change from old strategy - cheap food
    Identify high rollers - cultivate
    Identify those to discourage from play
    Estimate lifetime value of players

ARTS
    Computerized box offices lead to high volumes of data
    Identify potential consumers for shows
    Software to manage shows
        similar to airline seating chart software
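
Worked example: lift and profit by mailing depth
    The Lift Example and Lift Data slides give a response rate per segment, a $50 profit per sale, and a $1 mailing cost. The sketch below recomputes the quantities behind the Lift Chart and Profit Impact slides from those numbers. It is only an illustration: the function names are made up here, and it assumes all ten segments hold the same number of prospects, which the slides do not state.

```python
# Estimated response rate per segment, copied from the Lift Data slide
# (segments are already sorted from best to worst).
RATES = [0.042, 0.035, 0.025, 0.017, 0.015, 0.013, 0.009, 0.005, 0.004, 0.001]
PROFIT_PER_SALE = 50.0  # dollars per item sold (Lift Example slide)
MAILING_COST = 1.0      # dollars per piece mailed (Lift Example slide)


def running_total(values):
    """Return the cumulative sums of a list of numbers."""
    totals, total = [], 0.0
    for value in values:
        total += value
        totals.append(total)
    return totals


def cumulative_lift(rates):
    """Lift of mailing the top k segments versus mailing the whole population.

    Assumption (not on the slides): every segment has the same size.
    """
    population_rate = sum(rates) / len(rates)
    cum_rates = running_total(rates)
    return [(cum_rates[k] / (k + 1)) / population_rate for k in range(len(rates))]


# Expected profit per piece mailed in each segment: revenue minus mailing cost.
segment_profit = [rate * PROFIT_PER_SALE - MAILING_COST for rate in RATES]
cum_profit = running_total(segment_profit)
lift = cumulative_lift(RATES)

# Mailing depth (number of top segments) with the highest cumulative profit.
best_depth = max(range(len(cum_profit)), key=lambda k: cum_profit[k]) + 1

if __name__ == "__main__":
    for k, (l, p) in enumerate(zip(lift, cum_profit), start=1):
        print(f"top {k:2d} segments: lift {l:.2f}, cumulative profit ${p:.2f}")
    print("most profitable depth:", best_depth)
```

    With the slide's numbers, cumulative profit peaks after the third segment ($1.10 + $0.75 + $0.25 = $2.10 for one mailing in each segment), because every segment from the fourth onward costs more to mail than it returns. Lift falls to 1.0 once all ten segments are mailed.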
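
Worked example: customer value at the 1.3 rate
    The Customer Value and Younger Customer slides discount each year's profit and accumulate it in the "net" column. The sketch below reproduces that arithmetic. Two things are inferred from the tables rather than stated on the slides, so treat them as assumptions: profit is 20% of annual purchases, and the younger customer's purchases grow 20% per year from $300. The "1.3 rate" is read here as dividing year n's profit by 1.3 to the power n.

```python
DISCOUNT_FACTOR = 1.3  # the "1.3 rate" on the slides, applied once per year


def discounted(profits, factor=DISCOUNT_FACTOR):
    """Discount each year's profit back to the present (year 1 is divided once)."""
    return [profit / factor ** year for year, profit in enumerate(profits, start=1)]


def lifetime_value(profits, factor=DISCOUNT_FACTOR):
    """Net present value of a stream of yearly profits."""
    return sum(discounted(profits, factor))


# Middle-aged customer: flat $1000 purchases, $200 profit, for 10 years.
middle_aged_profits = [200.0] * 10

# Younger customer (inferred pattern): purchases start at $300 and grow 20%
# per year; profit is taken as 20% of purchases.
young_profits = [0.20 * 300.0 * 1.20 ** year for year in range(10)]

middle_aged_value = lifetime_value(middle_aged_profits)
young_value = lifetime_value(young_profits)

if __name__ == "__main__":
    print(f"middle-aged customer, 10-year value: ${middle_aged_value:.0f}")
    print(f"younger customer, 10-year value:     ${young_value:.0f}")
```

    Under these assumptions the ten-year values come out at about $618 and $331, in line with the last rows of the "net" columns; the per-year figures on the slides appear to be rounded or truncated to whole dollars.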