Knowledge Discovery and Data Mining
An Introduction
Daniel L. Silver
Copyright (c), 2003 All Rights Reserved
CogNova Technologies

Agenda
Introduction to KDD & DM
Overview of the KDD Process
Benefits, Costs, Status and Trends

“We are drowning in information, but starving for knowledge.”
John Naisbitt, Megatrends, 1988
Data Analytics or KDD: Data Warehousing, Data Mining, Data Visualization

Introduction
Data Analytics is not a new field ...
Since the 1990s referred to as: Data Analysis, Data Mining, Data Warehousing
A multidisciplinary field:
• Database and data warehousing
• Data and model visualization methods
• On-line Analytical Processing
• Statistics and machine learning
• Knowledge management

Introduction
Why has Data Analytics become important?
Competitive focus - Knowledge Management
Abundance of data!!
Inexpensive, powerful computing engines
Strong theoretical/mathematical foundations
• machine learning & logical inference
• statistics and dynamical systems
• database management systems

Introduction
What is Data Analytics (KDD)?
A Process
The selection and processing of data for:
• the identification of novel, accurate, and useful patterns, and
• the modeling of real-world phenomena.
Data Warehousing, Data Mining, and Data Visualization are major components.

The KDD Process
[Figure: the KDD process pipeline: Data Sources → Data Consolidation → Consolidated Data (Data Warehouse) → Selection and Preprocessing → Prepared Data → Data Mining → Patterns & Models (e.g. p(x)=0.02) → Interpretation and Evaluation → Knowledge]

Introduction - KDD In Context
[Figure: “The Virtuous Cycle” (Berry & Linoff), with the KDD process diagram embedded between Problem and Knowledge. Labels: Identify Problem or Opportunity, Strategy, Act on Knowledge, Results, Measure Effect of Action]

Introduction - CRISP
Cross Industry Standard Process for Data Mining
Developed by employees at SPSS, NCR, DaimlerChrysler
Iterative process with 6 major steps:
• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment

Marketing Embraces KM, DW, DM
Why? …
[Figure labels: Traditional Marketing; Marketing MIS; Relationship Marketing, a.k.a. Customer Relationship Management; Data Warehousing; Data Mining]

What is Relationship Marketing?
Arbuckle’s Market, “The Corner Store”
Knowing your customers on an individual basis
Maximizing life-time value, not individual sales
Developing and maintaining a mutually beneficial relationship
Acquire, retain, win back desirable customers

Knowledge Discovery
What can KDD do for an organization?
Impact on Marketing
Target marketing at a credit card company
Consumer usage analysis at a telecom provider
Loyalty assessment at a service bureau
Quality of service analysis at an appliance chain

Application Areas
Private/Commercial Sector
Marketing: segmentation, product targeting, customer value and retention, ...
Finance: investment support, portfolio management
Banking & Insurance: credit and policy approval
Security: fraud detection, access control
Science and medicine: hypothesis discovery, prediction, classification, diagnosis
Manufacturing: process modeling, quality control, resource allocation
Engineering: pattern recognition, signal processing
Internet: smart search engines, web marketing

Application Areas
Public/Gov’t Sector
Finance: investment management, price forecasting
Taxation: adaptive monitoring, fraud detection
Health care: medical diagnosis, risk assessment, cost/quality control
Education: process and quality modeling, resource forecasting
Insurance: worker’s compensation analysis
Security: bomb, iceberg detection
Transportation: simulation and analysis
Statistics: demographic analysis, municipal planning

The Data Analytics (KDD) Process

The KDD Process
[Figure: KDD process diagram, repeated]

The KDD Process
Possible results for any one effort:
Confirmation of the obvious
New knowledge - the data mine “nugget”
No significant relations found (random data)

The KDD Process
Core Problems & Approaches
Problems:
• identification of relevant data
• representation of data
• search for valid pattern or model
Approaches:
• top-down deduction by expert
• interactive visualization of data/models
• * bottom-up induction from data *
[Figure: probability of sale plotted against Age and Income; side labels next to the approaches: OLAP, Data Mining]

The KDD Process
The Architecture of a KDD System
[Figure: a Graphical User Interface over the pipeline stages Data Sources, Data Consolidation, Warehouse, Selection and Preprocessing, Data Mining, Interpretation and Evaluation, Knowledge]

The KDD Process
[Figure: KDD process diagram, repeated]

Data Consolidation
Garbage in, garbage out
The quality of results relates directly to the quality of the data
50%-70% of KDD process effort will be spent on data consolidation, cleansing and preprocessing
Major justification for a corporate Data Warehouse

Data Consolidation & Warehousing
From data sources to consolidated data repository
[Figure: sources (RDBMS, Legacy DBMS, Flat Files, External) feed Data Consolidation and Cleansing into a Warehouse or Datamart, which supports Analysis and Info Sharing; flows labelled Inflow, Upflow, Downflow, Outflow, Metaflow]

Data Warehousing - A Process
Definition: The strategic collection, cleansing, and consolidation of organizational data to meet operational, analytical, and communication needs.
75% of early DW projects were not completed
Data warehousing is not a project
It is an on-going set of organizational activities
Must be business benefits driven

Relationship between DW and DM?
[Figure labels: Strategic, Tactical; Data Warehousing; Analysis: Query/Reporting, OLAP, Data Mining; arrows “Rationale for data consolidation” and “Source of consolidated data”]

The KDD Process
[Figure: KDD process diagram, repeated]

Selection and Preprocessing
Generate a set of examples
• choose sampling method
• consider sample complexity
• deal with volume bias issues
Reduce attribute dimensionality
• remove redundant and/or correlated attributes
• combine attributes (sum, multiply, difference)
Reduce attribute value ranges
• group symbolic discrete values
• quantize continuous numeric values
OLAP and visualization tools play key role (Han calls this descriptive data mining)

OLAP: On-Line Analytical Processing
OLAP Functionality
Dimension selection
• slice & dice
Rotation
• allows change in perspective
Filtration
• value range selection
Hierarchies
• drill-downs to lower levels
• roll-ups to higher levels
[Figure: OLAP cube of Profit Values with dimensions Sales Region, Year by Month, and Product Class by Product Name]

Selection and Preprocessing
Transform data
• decorrelate and normalize values
• map time-series data to static representation
Encode data
• representation must be appropriate for the Data Mining tool which will be used
• continue to reduce attribute dimensionality where possible without loss of information
Tools: OLAP and visualization tools as well as transformation and encoding software

The KDD Process
[Figure: KDD process diagram, repeated]

Overview of Data Mining Methods
Automated Exploration/Discovery
• e.g. discovering new market segments
• distance and probabilistic clustering algorithms
Prediction/Classification
• e.g. forecasting gross sales given current factors
• regression, neural networks, genetic algorithms
Explanation/Description
• e.g. characterizing customers by demographics and purchase history
• inductive decision trees, association rule systems
Focus is on induction of a model from specific examples
[Figures: clusters in the x1-x2 plane; a curve f(x) against x; a rule “if age > 35 and income < $35k then ...”]

Data Mining Methods
Automated Exploration and Discovery
Distance-based numerical clustering
• metric grouping of examples (KNN)
• graphical visualization can be used
Bayesian clustering
• search for the number of classes which result in best fit of a probability distribution to the data
Unsupervised Learning
[Figure: scatter plot of Income against Age]

Data Mining Methods
Prediction and Classification
Function approximation (curve fitting)
Classification (concept learning, pattern recognition)
Methods:
• Statistical regression
• Artificial neural networks
• Genetic algorithms
• Nearest neighbour algorithms
Supervised Learning
[Figures: a curve f(x) against x; classes A and B in the x1-x2 plane; a network with inputs I1-I4 and outputs O1-O2]

Data Mining Methods
Generalization
The objective of learning is to achieve good generalization to new cases; otherwise just use a look-up table.
Generalization can be defined as a mathematical interpolation or regression over a set of training points.
[Figure: a curve f(x) against x fitted through training points]

Data Mining Methods
Explanation and Description
Learn a generalized hypothesis (model) from selected data
Description/Interpretation of model provides new human knowledge
Methods:
• Inductive decision tree and rule systems
• Association rule systems
• Link Analysis
[Figure: decision tree with a root and test nodes A?, B?, C?, D? down to leaves]
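The idea of inducing a readable model from examples can be illustrated with a minimal sketch. It is not the method of any particular tool named in these slides: it induces the smallest possible decision tree, a single test of the form "value > threshold" (a decision stump), by trying every midpoint between observed values and keeping the test with the fewest training errors. The data set `EXAMPLES` and the helper names `induce_stump` and `describe` are hypothetical and exist only for this illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical training examples: (customer age, responded to offer?).
EXAMPLES: List[Tuple[float, bool]] = [
    (22, False), (25, False), (30, False),
    (38, True), (41, True), (52, True),
]


def induce_stump(
    examples: List[Tuple[float, bool]],
) -> Optional[Tuple[float, bool, int]]:
    """Induce the best single-test rule of the form 'value > threshold'.

    Returns (threshold, label_when_above, training_errors), or None when
    all attribute values are identical and no split is possible.
    """
    values = sorted({value for value, _ in examples})
    # Candidate thresholds are midpoints between consecutive distinct values.
    candidates = [(low + high) / 2 for low, high in zip(values, values[1:])]
    best = None
    for threshold in candidates:
        for label_when_above in (True, False):
            errors = 0
            for value, label in examples:
                # Predict label_when_above above the threshold, its negation below.
                predicted = (value > threshold) == label_when_above
                if predicted != label:
                    errors += 1
            # Keep the first rule that achieves the lowest error count.
            if best is None or errors < best[2]:
                best = (threshold, label_when_above, errors)
    return best


def describe(rule: Tuple[float, bool, int]) -> str:
    """Render the induced rule as text a person can read directly."""
    threshold, label_when_above, errors = rule
    return (f"if age > {threshold} then {label_when_above} "
            f"else {not label_when_above} ({errors} training errors)")


if __name__ == "__main__":
    print(describe(induce_stump(EXAMPLES)))
```

On the hypothetical data above, the induced rule splits at age 34.0 with no training errors. A full inductive decision tree learner would apply the same search recursively to each branch and across several attributes, which this sketch deliberately leaves out.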

Modeling & Data Mining
DEMO: WEKA - A Data Mining Environment

The KDD Process
[Figure: KDD process diagram, repeated]

Interpretation and Evaluation
Evaluation
• Statistical validation and significance testing
• Qualitative review by experts in the field
• Pilot surveys to evaluate model accuracy
Interpretation
• Inductive tree and rule models can be read directly
• Clustering results can be graphed and tabulated
• Code can be automatically generated by some systems (ANNs, IDTs, Regression models)

Interpretation and Evaluation
Visualization tools can be very helpful:
• sensitivity analysis (I/O relationship)
• histograms of value distributions
• time-series plots and animation
• requires training and practice
[Figure: Response plotted against Temp and Velocity]

Benefits, Costs, Status and Trends

Benefits of Data Analytics (KDD)
Maximum utility from corporate data
• discovery of new knowledge
• generation of predictive models
Important feedback to data warehousing effort
• identification and justification of essential data
Reduction of application dev’t backlog
• model development vs. software development
Effect on bottom line of organization
• cost reduction, increased productivity, risk avoidance … competitive advantage

Requirements and Costs of KDD
Hardware - computationally intensive
Software - micro < $20k, integrated suites $100k+
Data - internal collection, surveys, external sources
Human resources
• DB/DP/DC expertise to consolidate and preprocess data
• Machine learning and stats competence
• Application knowledge & project mgmt
70% of the effort is expended on the data consolidation and preprocessing activities

Current Status and Trends
Standards and methodologies are maturing
Many products:
• Open source (WEKA, RapidMiner)
• Micro DM packages (IBM Cognos)
• Macro integrated suites (IBM SPSS Modeler, SAS Enterprise Miner)
Software costs have stabilized
Major players have been determined
Internet - “the” sink and source of data
Legal and ethical issues on the horizon

Current Status and Trends
Methods used:
• http://www.kdnuggets.com/polls/2013/analytics-big-data-mining-data-science-software.html
Application areas:
• http://www.kdnuggets.com/polls/2012/whereapplied-analytics-data-mining.html
Other polls:
• http://www.kdnuggets.com/polls/index.html

Current Status and Trends
What has prevented the use of Data Mining?
Products:
• General in nature, not tailored for business
• Missing standard interfaces to organizational data
• Emphasis on sales and not training/consulting
Customers:
• Frightened by technical skill set required
• Uncertain of mining results and ROI
• Convinced warehouse must be completed first
• Lacking knowledge of external data sources

Key Technologies for KDD
Data warehousing and distributed databases
Parallel computing
AI and expert systems
Machine learning and statistical inference
Visualization (including Virtual Reality)
Internet - future sink and source of data
• adaptive filters, knowledge extractors
• smart web services

Current Management Issues
Ownership of data and knowledge
Security of customer data
Responsibility for accuracy of information
Ethical practices - fair use of data

A List of Major Vendors
Lots of Players
Approaching the market from hardware, database, statistical, machine learning, education, financial/marketing, and management consulting:
IBM, SAS, SPSS, SGI, Thinking Machines, Cognos, ZDM Scientific, Neuralware, Information Discovery, American Heuristics, Data Distilleries, SuperInduction

THE END
[email protected]