Lluis Belanche + Alfredo Vellido
Data Mining II
An Introduction to Mining (2)

On dates & evaluation
Lectures expected to end in the week of 14-18th Dec.
Likely essay deadline & presentation: 15th, 22nd Jan.

What's DATA MINING?: A historicist viewpoint

DATA MINING as a methodology

CRISP: a DM methodology
CRoss-Industry Standard Process for Data Mining: a methodology that is neutral with respect to industry, tool and application (free & non-proprietary).
Pete Chapman, Randy Kerber (NCR); Julian Clinton, Thomas Khabaza, Colin Shearer (SPSS); Thomas Reinartz, Rüdiger Wirth (DaimlerChrysler).
CRISP-DM was conceived in 1996.
DaimlerChrysler: leaders in industrial application. SPSS: leaders in product development (Clementine, 1994). NCR: owners of large (huge!) databases (Teradata).
Financed by the EU. Version 1.0 was officially released in 1999.

CRISP: Hierarchic structure of the methodology

CRISP: The virtuous loop of methodology phases

CRISP: Description of phases
Problem understanding: study of targets and requirements from the business/problem viewpoint, and definition of the task as a DM problem.
Data understanding: data collection; getting to know the data, trying to detect both quality problems and interesting features.
Data preparation: preparing the data set to be modelled, starting from raw data. This is an iterative and exploratory process: selection of files, tables, variables, record samples… plus data cleaning.
Modelling: data analysis using modelling techniques that are suitable for the problem at hand. Includes fiddling with the models, tuning their parameters, etc.
Evaluation: all previous steps must be evaluated as a whole (as a unitary process), and we must decide whether the deliverables so far meet the DM challenge.
Implementation: all the knowledge acquired up to this point must be organized and presented to the “client” in a usable form. We must define, together with this client, a protocol to reliably deploy the DM findings.
CRISP: The virtuous loop of methodology phases

Use of DM methodologies (2004 -> 2007)
[Chart; values not recoverable from the source]

Enterprise Miner(TM): SEMMA
The acronym SEMMA (Sample, Explore, Modify, Model, Assess) refers to the core process of conducting data mining. Beginning with a statistically representative sample of your data, SEMMA makes it easy to apply exploratory statistical and visualization techniques, select and transform the most significant predictive variables, model the variables to predict outcomes, and confirm a model's accuracy.

Use of DM methodologies (2004, 2007)
[Chart; values not recoverable from the source]

CRISP: Phases: Problem understanding
Task -> outputs
Determine problem goal -> background; problem goals; success criteria
Assess situation -> inventory of resources; requirements, assumptions and limitations; risks and contingencies; terminology; costs & benefits
Determine DM goals -> DM goals; DM success criteria
Produce project plan -> project plan; initial selection of tools

DM application areas ('06->'09)
[Chart; values not recoverable from the source]

CRISP: Phases: Data understanding
Task -> outputs
Obtain initial data -> initial data report
Data description -> data descriptive report
Data exploration -> data exploration report
Data quality verification -> data quality report

METROFANG: a real story about data understanding (1)

METROFANG: a real story about data understanding (2)
[Time-series plot: inlet flow ("caudal entrada"), y-axis 0 to 350, x-axis samples 1 to 17671. Annotations: Missing data, Seasonality, Outliers]
[Time-series plot: motor torque, dryer A ("par motor Secador A"). Annotations: Time series, Weekend?, FORUM???]
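The three issues annotated on the METROFANG plots (missing data, outliers, a possible weekend effect) are the kind of thing a first data-understanding pass tries to surface. The sketch below is only an illustration under stated assumptions: the readings are synthetic, the function name `quality_report` is hypothetical, and the 1.5 x IQR rule is a common rule of thumb swapped in here, not the method used in the METROFANG project.

```python
from datetime import datetime, timedelta
from statistics import quantiles


def quality_report(timestamps, values, k=1.5):
    """Return indices of missing readings, IQR outliers and weekend readings.

    Hypothetical helper: `None` marks a missing reading, and a reading is
    flagged as an outlier when it lies more than k * IQR outside the quartiles.
    """
    present = sorted(v for v in values if v is not None)
    q1, _median, q3 = quantiles(present, n=4)  # quartiles of observed values
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr

    missing = [i for i, v in enumerate(values) if v is None]
    outliers = [
        i for i, v in enumerate(values)
        if v is not None and not (low <= v <= high)
    ]
    # datetime.weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
    weekend = [i for i, t in enumerate(timestamps) if t.weekday() >= 5]
    return {"missing": missing, "outliers": outliers, "weekend": weekend}


# Synthetic daily readings (assumed data), starting on Monday 1 January 2024.
start = datetime(2024, 1, 1)
timestamps = [start + timedelta(days=d) for d in range(14)]
values = [200, 210, 205, 198, 202, 120, 115,
          207, None, 203, 199, 900, 118, 121]

report = quality_report(timestamps, values)
print(report)
```

In this toy series the weekend readings are visibly lower than the weekday ones, yet the IQR rule does not flag them. A regular pattern of this kind has to be found by looking at the calendar, which is why the "Weekend?" question is worth asking separately from outlier detection.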
Storing data ('07)
Poll: What did you use for data storage for significant data mining projects in the past year? [142 voters, 284 votes]
Percentages are relative to the 142 voters; each voter could choose more than one option.
Storage format (votes) share of voters
Text files, e.g. tab or comma delimited (75) 52.8%
Data mining system format: SAS, SPSS, arff (57) 40.1%
Excel (28) 19.7%
Oracle (25) 17.6%
SQL Server (15) 10.6%
mySQL (12) 8.5%
Other format (10) 7.0%
Other commercial DBMS (7) 4.9%
Other free DBMS (4) 2.8%

CRISP: Phases: Data preparation
Task -> outputs
Data selection -> arguments for selection
Data cleaning -> data cleaning report
Reconstruct data -> derived variables; generated observations
Integrate data -> integrated data
Data formatting -> data with new format

Is data preparation that important?
[Chart; values not recoverable from the source]

Common data types analyzed …('07)
Compared to the 2005 KDnuggets Poll on “Types of data you analyzed/mined in last 12 months”, the biggest increase was in anonymized data (perhaps an indicator of the increasing importance of privacy issues).

Common data types analyzed …('09)

How big is yours? … ('06 -> '09)
[Chart; values not recoverable from the source]

Data manipulation tools …('07)

CRISP: Phases: Modelling
Task -> outputs
Select modelling technique -> selected technique
Create test design -> test design
Build model -> parameter selection; model; model description
Validate model -> model validation

CRISP: Selection of techniques
Universe of techniques (defined by tools)
-> Techniques suited to a problem
-> Political requirements (business, executive)
-> Limitations (money, time, human resources; data types, knowledge)
-> Selected tool(s)

Commonly used models/techniques ('05)…
[Chart; values not recoverable from the source]
Commonly used models/techniques ('07)…

CRISP: Phases: Evaluation
Task -> outputs
Evaluate results -> evaluation of DM results; approved models
Revise processes -> revision of the process
Determine next steps -> list of possible actions; decisions

CRISP: Phases: Deployment
Task -> outputs
Plan implementation -> implementation plan
Plan monitoring & maintenance -> monitoring & maintenance plan
Generate final report -> final report; final presentation
Revise project -> documentation of experience

How do you deploy it? ('06 -> '09)
[Chart; values not recoverable from the source]
Cloud computing: computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet. An example: Google Apps.

Software popularity ('07)

Free vs. commercial: debate

Software popularity ('09)
[Chart; values not recoverable from the source]

Why?
Many changes have occurred in the business application of data mining since CRISP-DM 1.0 was published.
Emerging issues and requirements include:
The availability of new types of data (text, Web, and attitudinal data, for example), along with new techniques for pre-processing, analyzing, and combining them with related case data.
Integration and deployment of results with operational systems such as call centers and Web sites.
Far more demanding requirements for scalability and for deployment into real-time environments.
The need to package analytical tasks for non-analytical end users and integrate these tasks in business workflows.
The need to seamlessly integrate the deployment of results and closed-loop feedback with existing business processes.
The need to mine large-scale databases in situ, rather than exporting an analytical dataset.
Organizations' increasing reliance on teams, making it important to educate greater numbers of people on the processes and best practices associated with data mining and predictive analytics.

In July 2006 the consortium announced that it was going to start the process of working towards a second version of CRISP-DM. On 26 September 2006, the CRISP-DM SIG met to discuss potential enhancements for CRISP-DM 2.0 and the subsequent roadmap. However, these efforts appear to be stalled. The SIG has not met, updated the CRISP website, or communicated anything to members since early 2007.