Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data Mining and Knowledge Discovery in Databases Outline • • • • • What is Data Mining and KDD? Characteristics Applications Methods Packages & Close Relatives What is Data Mining & KDD? • “The process of identifying hidden patterns and relationships within data” or • “Data mining helps end users extract useful business information from large databases” What’s the Appeal? • Hidden nuggets of valuable information buried deep within a mountain of otherwise unremarkable data • Pervasive data • Seek competitive advantage The Challenge 5102018890521200153945819900000000141988122944882199608162100000010100010000000 1100003111110000000001003130200000000000000202001000000000000000000000000000043 4388888888424243424333012202022200001010010000000441000000001100000000000000000 1000001000000000000000000000000000000000000000000000000019981027510201896060120 0212694096800000015901998090337981199809173100100000100010000000110000320002000 0001000000012399000000000000200222200313100312000000000000000042438888888888424 3424233212121222200000010110000002441000000000100200000000000000000000100000000 0000000000000000000000000000000000000000019981230510201897020320001862692920000 0047091998021356971199802273100000100100010000000001101100000020000100000000021 0110001000000000001000000000000100011000000011100338888222233113233433300000011 0000011101001100102000100000000100000000100000000000000000000000000000000000000 0000000000000000000000000019981221510201899093020052008986730000019410199901127 5981199901263100100010100010000000001111101111122010100000111230010010000001021 0002200000000002000000000000011133438888434242424342423300000011110000010110010 0002441000000000100200000001001010000000100000000000000000000000000000000000000 0000000000019990525510201899122720093540515830000014484199705271797119970610310 0000010110010000000100000311120120000100100101200011110010000110100120000000000 0100000000001010132438888888888224242433100000001002100001110010011230100000010 0000200010000000000110000100000000100000100000000000000001000000000000000001998 1117510201899122720093540515830000014484199705271797219980616310000001011001000 0000110100311111121000100000202210012220220020221222201000000000000000001010011 0032434343213242214242423300210021000011110110000011223100110000010000001000000 0000110000100000000100000100000000000000000000000000000000001998122351020190001 Process: Knowledge Discovery In Databases database database cleaning & integration data warehouse modify data selection modify data selection collect and transform data mining modify methods, parameters data mining engines, models discovered patterns user interface and expert knowledge domain evaluation & presentation Context • Where you stand on Data Mining depends on where you sit: • Business User • Researcher • Computer Scientist Data Mining Might Mean… • • • • • • • • • • • • Statistics Visualization Artificial intelligence Machine learning Database technology Neural networks Pattern recognition Knowledge-based systems Knowledge acquisition Information retrieval High performance computing And so on... What’s needed? • • • • Suitable data Computing power Data mining software Skilled operator who knows both the nature of the data and the software tools • Reason, theory, or hunch Typical Applications of Data Mining & KDD • Marketing • Market Basket Analysis • Customer Relationship Management • New Product Development Typical Applications of Data Mining & KDD • Financial Services • Credit Approval • Fraud Detection • Marketing Typical Applications of Data Mining & KDD • Health Care • Epidemiological Analysis - incidence and prevalence of disease in large populations and detection of the source and cause of epidemics of infectious disease • Knowledge for funding • Policy, programs Two Basic Approaches • Supervised • A dependent or target variable • Unsupervised • “Pure Data Mining” • Fewer assumptions • Typically used for clustering techniques Automation • The ability to aim a tool at some data and push a button • Some methods of KDD/Data mining are more suitable for automation than others Seven Basic Methods: 1. 2. 3. 4. 5. 6. 7. Decision Trees (Artificial) Neural Networks Cluster/Nearest Neighbour Genetic Algorithms/Evolutionary Computing Bayesian Networks Statistics Hybrids Decision Trees • Graphical representations of relationships with data • Excel at Classification & Prediction Models Sample of a Decision Tree gender male female age good health? <65 married? >=65 yes no urban? yes no - + yes pet owner? + yes no - + pet owner? no yes no - - + Decision Trees • Strengths • Easily understood and interpreted • Represent complexity in a compact form • Handle non-linear data well • Relatively well suited to automation. • Weaknesses • Large trees with large numbers of variables become difficult to understand • Missing data must be appropriately managed in construction and use of the models Neural Networks • Derived from Artificial Intelligence Research • Modelled on the Human Neuron Neural Networks Prediction Weights 0.4 0.8 Hidden Layer Weights 0.1 0.6 0.3 Input Variables Age 0.3 0.7 0.5 Gender 0.2 Income Neural Networks • Strengths • Accuracy of prediction • Robust performance with a wide variety of data types • Weaknesses • Prone to overfitting • Poor clarity of model Clustering/Nearest Neighbour • Aim to assign “like” records to a group • Groups assigned according to some target variable or criteria • Nearest neighbour used for prediction Clustering/Nearest Neighbour • Applications: • Text processing: search engines • Image processing: radiology/image processing • Fraud detection: outliers Clustering/ Nearest Neighbour • Strengths • Easily understood and interpreted • Easily implemented in basic situations • Weaknesses • complex data not well suited to automation (much preprocessing required) Genetic Algorithms/ Evolutionary Computing • Grounded in Darwin – applied using mathematics • Require • a way to represent a solution to a problem • a way to test the “fitness” of the solution • Solutions are mathematically “mutated” • Fittest solutions survive • Convergence Genetic Algorithms/ Evolutionary Computing • Strengths • Suited to novel problems that are poorly understood • Suitable where data is dirty or missing • May be useful where other methods cannot be applied • Weaknesses • Not easily automated • Require creativity in their application Bayesian Networks • Based on Bayes’ rule: • P(a|b) = P(b|a) * P(a) / P(b) • Can construct networks of linked events, each with prior probabilities Bayesian Network Example Bobby publicly threatened Suicid e Bobby shot him J. R. Treated for Depressio n J.R. Shot Mistress shot him Big fight between wife, mistress Wife shot him Just a dream sequence Producer s desperat e for ratings Bayesian Networks • Strengths • Clarity of the resulting models • Good precision in predicting • Easily adapt to new probabilities • Weaknesses • Time consuming to construct and maintain • Poor at predicting rare events Statistics • With an outcome or dependent variable: • Correlations • ANOVA • Regression • Used by themselves or to confirm findings of another method Statistics • Strengths • “Gold Standard” – valid and trusted in scientific circles • Weaknesses • Limits findings to those techniques that are applied and their associated limitations (normality, linearity, and so on) Hybrids • Techniques used in combination • Example: use of a genetic algorithm to identify target variables for inclusion in a neural network model Recap • Data Mining is the core activity or method within a process of Knowledge Discovery in Databases • Done in order to find useful information in large amounts of data not possible using “conventional” approaches • Variety of methods • Knowledge of data domain, methods, as well as creativity Data Mining Packages • Major vendors of database/data management products (IBM, SPSS, Oracle PeopleSoft, SAS, and so on) • Added as a component of turnkey packages • May incorporate several methods (SAS Enterprise Miner) • Single method (TreeAge Software Inc.: a dedicated decision tree product) How to implement? • Do it yourself (you know the data domain) • Put a team together (domain and method specialists) • Hire a consultant (who knows both your domain and the tools) • Vertical markets in data mining Close Relatives of Data Mining • On-Line Analytical Processing (OLAP) • Pivot tables in spreadsheets • General statistical packages • Intelligent Data Analysis – comprises the use of data mining methods in the analysis of “small” datasets