Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Knowledge Discovery with Data and Text Mining Dursun Delen, Ph.D. Associate Professor of MSIS William S. Spears School of Business Oklahoma State University - Tulsa Introduction 2 Why data mining? What is data mining? Data Mining Process Data Mining Techniques Text Mining Data & Text Mining Examples A Demonstration of Data Mining in Action © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 Data Deluge Why Now? Barcodes RFIDs Magnetic Strips Sensors… 3 © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 1 Motivation: “Necessity is the Mother of Invention” Competitive pressure to make better decisions Data explosion problem (or opportunity) Why do we have more data now, then we had before? Technology driven reasons: Automated data collection t l and tools d techniques… t h i Software driven reasons: Mature database technology… Cost driven reasons (both hardware and software) We are drowning in data, but starving for knowledge! Solution: Data and data mining 4 © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 What Is Data Mining? Data mining : Alternative names… 5 Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information (or patterns) from data © D. Delen Data mining: a misnomer? Knowledge discovery, knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. KPM Symposium – Tulsa, OK – October 3-4, 2007 Standardized DM Processes CRISP-DM Cross-Industry Standard Process for Data Mining www.crisp-dm.org SEMMA 9 Sample the data Explore the data Modify the data Model Assess http://www.sas.com/technologies/analytics/datamining/miner/semma.html 9 9 9 9 6 © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 2 Standardized Data Mining Processes Step 1: Business Understanding Cross-Industry Standard forthe Data Mining Process Determine business objectives CRISP DM CRISP-DM Assess the situation Established in 1996 Determine the data mining By Daimler-Benz, goals Integral Solutions Ltd. (ISL), NCR, Produce a project plan and OHRA Cross-Industry Standard Process for Data Mining CRISP-DM 7 © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 Standardized Data Mining Processes Step 2: Data Understanding Collect the initial data Describe the data Explore the data Verify the data Cross-Industry Standard Process for Data Mining CRISP-DM 8 © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 Standardized Data Mining Processes Step 3: Data Preparation Select data Clean data Construct data Integrate data Format data Cross-Industry Standard Process for Data Mining CRISP-DM 9 © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 3 Standardized Data Mining Processes Step 4: Modeling Select the modeling technique Generate test design Build the model Assess the model Cross-Industry Standard Process for Data Mining CRISP-DM 10 © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 Standardized Data Mining Processes Step 5: Evaluation Evaluate results Review process Determine next step Cross-Industry Standard Process for Data Mining CRISP-DM 11 © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 Standardized Data Mining Processes Step 6: Deployment Plan deployment Plan monitoring and maintenance Produce final report Review the project Cross-Industry Standard Process for Data Mining CRISP-DM 12 © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 4 Data Mining Techniques (1) Association (correlation and causality) age(X, “20..29”) ^ income(X, “20..29K”) Æ buys(X, “PC”) [support = 2%, confidence = 60%] contains(T, “computer”) Æ contains(x, “software”) [[support pp = 1%,, confidence = 75%]] Prediction (Classification & Regression) 13 © D. Delen Classification: developing models (functions) that describe and distinguish classes or concepts for future prediction Predicting whether it will rain tomorrow Classifying loan applicant as “good” or “bad” KPM Symposium – Tulsa, OK – October 3-4, 2007 Data Mining Techniques (2) Prediction (Classification & Regression) Cluster analysis 14 Regression: developing models (functions) that forecast the value of a continuous numerical variable Predicting what the temperature will be tomorrow Forecasting the future value of a stock a year from now © D. Delen Class label is unknown: Group the samples of objects based on their quantifiable characteristics Cluster the customers who share the same characteristics: interest, income level, spending habits, etc. Clustering is done by maximizing the intra-class similarity and minimizing the interclass similarity KPM Symposium – Tulsa, OK – October 3-4, 2007 Text Mining Text mining is a process that employs Statistical Natural Language Processing g Data Mining 15 a set of algorithms for converting unstructured text into structured data objects, and the quantitative methods that analyze these data objects to discover knowledge Text Mining = Statistical NLP + DM © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 5 Text Mining Process Map Unstructured Text Document Structuring - Extract - Transform - Load - Documents (.doc, .pdf, .ps, etc.) - Fragments (sentences, paragraphs, speech encodings, etc.) - Web pages (textual sections, XML files, etc.) - Etc. Problems / Opportunities - Environmental scanning Term Generation - Natural Language Processing - Term Filtering: use of domain knowledge (stemming, stop word list, synonym list, acronym list, etc.) Structured Text Feedback - Tabular representation - Collection of XML files - Etc. Term by Document Matrix T_1 T_2 Feedback Doc 1 1 Doc 2 1 ... ... Doc M 16 © D. Delen T_N 2 3 1 Information / Knowledge - Clustering - Classification - Association - Link analysis 1 KPM Symposium – Tulsa, OK – October 3-4, 2007 Data Mining Applications… Military Health System (1/3) KBSI’s Phase II SBIR research project Funded through SBIR program by the Offices of the Secretary of Defense SBIR: Phase I ⇒ Phase II ⇒ Phase III DM in Healthcare Managerial Clinical Reference 17 Delen, D. and S. Ramachandran (2003). A Hybrid Approach to Knowledge Discovery from Military Health Systems. Journal of Neural, Parallel and Scientific Computing, Volume 11, Number 1&2, pp. 161-183. © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 Data Mining Applications… Military Health System (2/3) One of the largest health systems in the US Purpose/mission is to 18 90+ hospitals 100s of outpatient clinics, treatment facilities 180,000 employees (doctors, nurses, other staff) 8 million beneficiaries $20 Billion/year Billi / budget b d t Provide healthcare to eligible veterans Provide education and training opportunities to health profession (residency, practical training, etc.) Conduct medical research, create innovation Provide public health service at the time of natural disasters Problem: ⇑ demand, ⇓ budget © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 6 Data Mining Applications… Military Health System (3/3) Classification/prediction Demand forecasting Resource allocation Load Forecast DMIS Region FACILITY LEVEL Load Forecast DMIS Facility 1 Load Forecast DMIS Facility 2 GRANULAR REGIONAL LEVEL AGGREGATE SERVICE LEVEL Load Forecast DMIS Facility 1 Diagnosis Type C, D Association rules 19 © D. Delen Patient diagnosis Load Forecast DMIS Facility 1 Diagnosis Type A, B Load Forecast DMIS Facility 2 Diagnosis Type X, Y Load Forecast DMIS Facility 2 Diagnosis Type P, Q, R Primary/Secondary Diagnosis Patient ID D1 D2 D3 ... 10001 10002 10003 . . 0 0 1 . . 0 1 1 . . 1 0 1 . . … … … . . Association Rules Disease 724, 653, 366 => 343, 453 Disease 741. 606, 42.01 => 42.05, 35.0, 23.0 Disease 724, 653, 366 => 343, 453 Disease 724, 653, 366 => 343, 453 ... Support 60%, Conf. 40%, Lift 79% Support 30%, Conf. 50%, Lift 49% Support 50%, Conf. 50%, Lift 45% Support 30%, Conf. 40%, Lift 49% ... KPM Symposium – Tulsa, OK – October 3-4, 2007 Data Mining Applications… Predicting Breast Cancer Survivability Objective: Predicting breast cancer survivability Methods/Materials: Used artificial neural networks (MLP), decision trees (C5) and logistic regression Used 10-fold cross validation and more than 200,000 cases Results: Decision tree (C5) is the best predictor with 93.6% accuracy, artificial neural networks (MLP) came out to be the second with 91.2% accuracy, and the logistic regression models came out to be the worst of the three with 89.2% accuracy Reference: Delen, D., G. Walker and A. Kadam (2004). Predicting Breast Cancer Survivability: A Comparison of Three Data Mining Methods, to appear in Artificial Intelligence in Medicine. 20 © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 A Brief List of Text Mining Applications Exploring trends in research literature Using text mining to detect deception 21 We did this for IS literature Identifying functional relationships between different genes using large collections of interdisciplinary research literature A l i records Analyzing d for f customer t contact t t centers t Both audio as well as text (SAS does this) One of our Ph.D. student working on this “Reading”, summarizing classifying emails Text mining the Web content Anti-terrorism initiatives (CIA is using text mining) © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 7 Forecasting Box Office Success of Hollywood Movies Ramesh Sharda and Dursun Delen Institute for Research in Information Systems Oklahoma State University Forecasting BoxBox-Office Receipts: A Tough Problem! “… No one can tell you how a movie is going to do in the marketplace… not until the film opens in darkened theatre and sparks fly up between the screen and the audience” Mr. Jack Valenti President and CEO of the Motion Picture Association of America 23 © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 Current work on prediction… A lot of research have been done Our approach 24 Behavioral models Analytical models ' Predict after the initial release © D. Delen Use a data mining approach Use as much historical data as possible Make it web-enabled & Predict before the initial release KPM Symposium – Tulsa, OK – October 3-4, 2007 8 Our Approach – Movie Forecast Guru 9 DATA – 849 Movies released between 1998-2006 9 Movie Decision Parameters: 9 Intensity of competition rating MPAA Rating Star power Genre Technical Effects Sequel ? Estimated screens at opening … Output: Box office gross receipts (flop → blockbuster) Class No. Range (in Millions) 25 © D. Delen 1 <1 (Flop) 2 >1 < 10 3 > 10 < 20 4 > 20 < 40 5 > 40 < 65 6 > 65 < 100 7 > 100 < 150 8 > 150 < 200 9 > 200 (Blockbuster) KPM Symposium – Tulsa, OK – October 3-4, 2007 Our Approach – Movie Forecast Guru PREDICTION MODELS 9 Statistical Models: 9 Discriminant Analysis Ordinal Multiple Logistic Regression Machine Learning Models: Artificial Neural Networks Decision Tree Induction 26 © D. Delen CART - Classification & Regression Trees C5 - Decision Tree Support Vector Machines Rough Sets … KPM Symposium – Tulsa, OK – October 3-4, 2007 Prediction Results… Models include plots via texttext-mining Text Mining Models - Tested on 2006 Data 100.00% 90.00% 80 00% 80.00% 70.00% 60.00% 50.00% 40.00% 30.00% 20.00% 10.00% 0.00% 27 No Category War Life Organization Character Bingo 49.86% 53.03% 52.16% 51.59% 53.60% 1-Away 88.18% 87.32% 88.18% 89.63% 89.63% © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 9 Sample Predictions… Movie Actual Spider-Man 3 9 9 9 9 9 9 300 9 5 7 5 5 5 War Organization Life Character The Simpsons Movie 8 9 9 9 9 9 The Bourne Ultimatum 8 6 7 9 9 7 Knocked Up 7 5 5 5 5 7 Live Free or Die Hard 7 9 7 9 9 9 Evan Almighty 6 6 7 7 7 7 1408 6 5 5 4 5 5 TMNT 28 No Category 5 4 3 4 5 5 Music and Lyrics 5 4 5 5 5 5 Freedom Writers 4 4 3 4 4 5 Reign Over Me 3 4 4 4 4 4 Black Snake Moan 2 3 3 3 3 3 Ta Ra Rum Pum 1 1 1 1 1 1 © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 Web--based DSS Web Prediction Models Remote Models Movie Forecast Guru (MFG) Local Models GUI (Internet Browser) HTML TCP/IP User (Manager) Remote Data Sources Web Services XML / SOAP MFG Engine (Web Server) ETL ODBC & ETL MFG Database XML Knowledge Base (Business Rules) 29 © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 Demonstration of MGF 30 © D. Delen KPM Symposium – Tulsa, OK – October 3-4, 2007 10