
Data Mining and Knowledge Discovery in Databases

Outline
• What is Data Mining and KDD?
• Characteristics
• Applications
• Methods
• Packages & Close Relatives

What is Data Mining & KDD?
• “The process of identifying hidden patterns and relationships within data”
or
• “Data mining helps end users extract useful business information from large databases”

What’s the Appeal?
• Hidden nuggets of valuable information buried deep within a mountain of otherwise unremarkable data
• Pervasive data
• Seek competitive advantage

The Challenge
[Slide shows a screen of raw, undelimited numeric records; the first row reads:]
5102018890521200153945819900000000141988122944882199608162100000010100010000000

Process: Knowledge Discovery In Databases
[Flow diagram; labels in reading order:]
• Database, database
• Cleaning & integration
• Data warehouse
• Data selection (feedback loop: modify data selection)
• Collect and transform
• Data mining: engines, models (feedback loop: modify methods, parameters)
• Discovered patterns
• Evaluation & presentation
• User interface
• Domain and expert knowledge

Context
• Where you stand on Data Mining depends on where you sit:
  • Business User
  • Researcher
  • Computer Scientist

Data Mining Might Mean…
• Statistics
• Visualization
• Artificial intelligence
• Machine learning
• Database technology
• Neural networks
• Pattern recognition
• Knowledge-based systems
• Knowledge acquisition
• Information retrieval
• High performance computing
• And so on...

What’s needed?
• Suitable data
• Computing power
• Data mining software
• Skilled operator who knows both the nature of the data and the software tools
• Reason, theory, or hunch

Typical Applications of Data Mining & KDD
• Marketing
  • Market Basket Analysis
  • Customer Relationship Management
  • New Product Development

Typical Applications of Data Mining & KDD
• Financial Services
  • Credit Approval
  • Fraud Detection
  • Marketing

Typical Applications of Data Mining & KDD
• Health Care
  • Epidemiological Analysis: incidence and prevalence of disease in large populations and detection of the source and cause of epidemics of infectious disease
  • Knowledge for funding
  • Policy, programs

Two Basic Approaches
• Supervised
  • A dependent or target variable
• Unsupervised
  • “Pure Data Mining”
  • Fewer assumptions
  • Typically used for clustering techniques

Automation
• The ability to aim a tool at some data and push a button
• Some methods of KDD/Data mining are more suitable for automation than others

Seven Basic Methods:
1. Decision Trees
2. (Artificial) Neural Networks
3. Cluster/Nearest Neighbour
4. Genetic Algorithms/Evolutionary Computing
5. Bayesian Networks
6. Statistics
7. Hybrids

Decision Trees
• Graphical representations of relationships with data
• Excel at Classification & Prediction Models

Sample of a Decision Tree
[Tree diagram: the root splits on gender (male/female); lower nodes split on age (<65, >=65), good health?, married?, urban? and pet owner? (each yes/no); leaves are labelled + or -.]

Decision Trees
• Strengths
  • Easily understood and interpreted
  • Represent complexity in a compact form
  • Handle non-linear data well
  • Relatively well suited to automation
• Weaknesses
  • Large trees with large numbers of variables become difficult to understand
  • Missing data must be appropriately managed in construction and use of the models

Neural Networks
• Derived from Artificial Intelligence Research
• Modelled on the Human Neuron

Neural Networks
[Network diagram: input variables Age, Gender and Income feed a hidden layer, which feeds a prediction. Weights shown between the hidden layer and the prediction: 0.4, 0.8. Weights shown between the input variables and the hidden layer: 0.1, 0.6, 0.3, 0.3, 0.7, 0.5, 0.2.]

Neural Networks
• Strengths
  • Accuracy of prediction
  • Robust performance with a wide variety of data types
• Weaknesses
  • Prone to overfitting
  • Poor clarity of model

Clustering/Nearest Neighbour
• Aim to assign “like” records to a group
• Groups assigned according to some target variable or criteria
• Nearest neighbour used for prediction

Clustering/Nearest Neighbour
• Applications:
  • Text processing: search engines
  • Image processing: radiology
  • Fraud detection: outliers

Clustering/Nearest Neighbour
• Strengths
  • Easily understood and interpreted
  • Easily implemented in basic situations
• Weaknesses
  • Complex data is not well suited to automation (much preprocessing required)

Genetic Algorithms/Evolutionary Computing
• Grounded in Darwin, applied using mathematics
• Require
  • a way to represent a solution to a problem
  • a way to test the “fitness” of the solution
• Solutions are mathematically “mutated”
• Fittest solutions survive
• Convergence

Genetic Algorithms/Evolutionary Computing
• Strengths
  • Suited to novel problems that are poorly understood
  • Suitable where data is dirty or missing
  • May be useful where other methods cannot be applied
• Weaknesses
  • Not easily automated
  • Require creativity in their application

Bayesian Networks
• Based on Bayes’ rule:
  • P(a|b) = P(b|a) * P(a) / P(b)
• Can construct networks of linked events, each with prior probabilities

Bayesian Network Example
[Network diagram; node labels:]
• Bobby publicly threatened
• Suicide
• Bobby shot him
• J.R. treated for depression
• J.R. shot
• Mistress shot him
• Big fight between wife, mistress
• Wife shot him
• Just a dream sequence
• Producers desperate for ratings

Bayesian Networks
• Strengths
  • Clarity of the resulting models
  • Good precision in predicting
  • Easily adapt to new probabilities
• Weaknesses
  • Time consuming to construct and maintain
  • Poor at predicting rare events

Statistics
• With an outcome or dependent variable:
  • Correlations
  • ANOVA
  • Regression
• Used by themselves or to confirm findings of another method

Statistics
• Strengths
  • “Gold Standard”: valid and trusted in scientific circles
• Weaknesses
  • Findings are limited to the techniques that are applied and to their associated limitations (normality, linearity, and so on)

Hybrids
• Techniques used in combination
• Example: use of a genetic algorithm to identify target variables for inclusion in a neural network model

Recap
• Data Mining is the core activity or method within a process of Knowledge Discovery in Databases
• Done to find useful information in large amounts of data that “conventional” approaches cannot uncover
• Variety of methods
• Requires knowledge of the data domain and the methods, as well as creativity

Data Mining Packages
• Major vendors of database/data management products (IBM, SPSS, Oracle PeopleSoft, SAS, and so on)
• Added as a component of turnkey packages
• May incorporate several methods (SAS Enterprise Miner)
• Single method (TreeAge Software Inc.: a dedicated decision tree product)

How to implement?
• Do it yourself (you know the data domain)
• Put a team together (domain and method specialists)
• Hire a consultant (who knows both your domain and the tools)
• Vertical markets in data mining

Close Relatives of Data Mining
• On-Line Analytical Processing (OLAP)
• Pivot tables in spreadsheets
• General statistical packages
• Intelligent Data Analysis: the use of data mining methods in the analysis of “small” datasets
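
Worked example: Bayes’ rule
Bayes’ rule from the Bayesian Networks slides, P(a|b) = P(b|a) * P(a) / P(b), can be worked numerically. The Python sketch below is only an illustration: the function name `posterior`, the fraud-flagging scenario and every probability in it are hypothetical values chosen for the example, not figures from the slides.

```python
def posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return P(a|b) = P(b|a) * P(a) / P(b)."""
    if p_b <= 0:
        raise ValueError("P(b) must be positive")
    return p_b_given_a * p_a / p_b


# Hypothetical inputs: a = "transaction is fraudulent", b = "transaction is flagged".
p_a = 0.01              # assumed prior P(a)
p_b_given_a = 0.90      # assumed likelihood P(b|a)
p_b_given_not_a = 0.05  # assumed false-alarm rate P(b|not a)

# The law of total probability gives P(b) from the two cases a and not-a.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = posterior(p_b_given_a, p_a, p_b)
print(round(p_a_given_b, 3))
```

Under these assumed inputs the posterior is roughly 0.15, so most flagged transactions are still legitimate when the prior is this low. Replacing the prior with a new estimate and recomputing is all it takes to update the result, which is one way to read the “easily adapt to new probabilities” strength listed above.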
