Data Mining
Allan Tucker
School of Information Systems, Computing and Mathematics
Brunel University, London. UB8 3PH. UK

The talk
• The Data Explosion
• Data Mining techniques & Application
• Data Mining in the Media
• Some of our work on Biomedical Data Mining
• Some Caveats

Data historically...
• Preserve of scientists: Darwin, 1800s; Newton, 1600s; Galton, 1800s; Pearson, 1900s

Database Technology Timeline
• 1960s: Data collection, database creation
• 1970s: Relational data model; relational DBMS implementation
• 1980s: Advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s-2000s: Data Warehousing; Multimedia and Web databases; Distributed DW: The Cloud

Data Generation examples
• Data collected from online forms
• Amazon purchases
• Google searches
• Loan requests
• Massively parallel sequencing of biological data (gene expression)
• Telescopes scanning the skies

The Data Explosion
“We are drowning in information, but starving for knowledge” John Naisbitt (Futurologist)
Due to the advance of IT and the Internet
• Massive increase in ability to:
  • Record: Electronic records and forms, the Internet
  • Store: Data Warehouses, the Cloud
• Risk of Information Overload

The Data Explosion
Need to Analyse: Data Mining, Machine Learning, Intelligent Data Analysis, Knowledge Discovery in Databases, Bioinformatics
Knowledge

Overlap with Statistics
“Statistics is the science of the collection, organization, and interpretation of data. It deals with all aspects of this, including the planning of data collection in terms of the design of surveys and experiments”, OED
“... statistics, that is, the mathematical treatment of reality ...” Hannah Arendt
“He uses statistics as a drunken man uses lampposts - for support rather than for illumination.” Andrew Lang
“There are lies, damned lies, and statistics.” Benjamin Disraeli

Overlap with Statistics
“DM is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management.”, Jason Frand, UCLA
• More explorative
• Not always a hypothesis
• Works with Historical Data
• Rarely any experimental design!
• Makes fewer assumptions about the data

Data Mining Process (or Knowledge Discovery)
Knowledge Discovery in Databases (KDD)
The Process (from Advances in KDD and Data mining):
Data → Target Data → Pre-processed Data → Transformed Data → Patterns → Knowledge

Typical Tasks
Descriptive:
• Clustering (customer profiling)
• Association Rule Mining (basket analysis)
Predictive:
• Classification (medical diagnosis)
• Forecasting (stock forecasts)
• Regression (interpolation / extrapolation)

Clustering (unsupervised learning)
• Looking for data points that are similar
• Depends on how you measure difference or similarity!
• Example: Customer Relationship Management
• Example: Patients with similar symptoms

Classification (supervised learning)
• Separate Classes with:
  • A simple model: Generalisable but biased, or ...
  • ... a complex model: Good fit but risks overfitting
[Figure: two scatter plots, one for the simple model and one for the complex model]

Classification (supervised learning)
• For example,
  • Purchases bought online
  • Patient data – healthy vs disease
  • Agreeing loans

Decision Trees
Credit Rating / Pregnancy Screening examples

Feature Selection
• How data clusters / classifies depends very much upon selected variables (features)
• What if we select customers based upon their purchases only? Or also include demographics?
• We get very different results.
• Feature selection involves automatically identifying the important variables

Feature Selection
• Clusters plotted with different features
[Figure: three scatter plots of Series1, Series2 and Series3, each plotted with different features]
• Filter methods score each variable independently (e.g. chi squared)
• Wrapper approaches model the interactions between variables

Association Rules
• Based upon Basket Analysis
• Supermarkets use this all the time
• Given a large amount of basket data, generate rules:
  <Set of items>
  <Set of items>
  <Set of items>
  →
  If <items> then <items> (confidence / support)

Association Rules
• Why do they find this knowledge useful?
• Loyalty Cards
• Shop layout
• Special offers
• Think of Amazon ...
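
Worked example (a minimal sketch, not from the talk): the "confidence / support" scores attached to a rule "If <items> then <items>" can be computed by simple counting. The baskets, thresholds and function names below are hypothetical and chosen only for illustration; real basket analysis would normally use an algorithm such as Apriori rather than this brute-force search over item pairs.

```python
from itertools import combinations

# Hypothetical toy baskets, invented for illustration only.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) = support(A and C) / support(A)."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), baskets) / base


def pair_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Brute-force search for rules 'if {a} then {b}' over single items.

    The thresholds are assumed values, not taken from the slides.
    """
    items = sorted(set().union(*baskets))
    rules = []
    for a, b in combinations(items, 2):
        # Try the rule in both directions: a -> b and b -> a.
        for ante, cons in ((a, b), (b, a)):
            s = support({ante, cons}, baskets)
            c = confidence({ante}, {cons}, baskets)
            if s >= min_support and c >= min_confidence:
                rules.append((ante, cons, s, c))
    return rules
```

On these toy baskets, `pair_rules(BASKETS)` keeps only two rules: "if bread then butter" (support 3/5, confidence 3/4) and "if butter then bread" (support 3/5, confidence 1). Support says how often the rule applies at all; confidence says how reliable it is when it does apply.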

Time-Series Models
• Statistical & “AI” Models (Neural Networks)
• Temporal Abstractions
• For example, EEG & ECG in ICUs, Stock markets

Bayesian Networks
• A probabilistic method to model data
• Easily interpreted by non-statisticians
• Can be used to combine existing knowledge with data
• Essentially use independence assumptions to model the joint distribution of a domain

Bayesian Networks
• Simple 2 variable Joint Distribution P(Gene, Disease)

              Gene    ¬Gene
  Disease     0.89    0.01
  ¬Disease    0.03    0.07

• Can use it to ask many useful questions
• But requires k^N probabilities (N variables with k states each)

Bayesian Network for Toy Domain
Gene A: P(A) = .001
Gene B: P(B) = .002
Gene C, given Gene A and Gene B:

  A  B  P(C)
  T  T  .95
  T  F  .94
  F  T  .29
  F  F  .001

Gene D, given Gene C:

  C  P(D)
  T  .70
  F  .01

Gene E, given Gene C:

  C  P(E)
  T  .90
  F  .05

Bayesian Networks for Classification & Feature Selection & forecasting
• Nodes can represent class labels or variables at “points in time” (t-1, t)
• Also latent variables via EM
[Figure: example networks over X1 ... XN: a general network with P(X1), P(X2), P(X3 | X1, X2), P(X4 | X3), P(X5 | X3); a classifier with class node C; a temporal network linking t-1 to t; a network with hidden node H]

Bayesian Networks for Classification & Feature Selection & forecasting
• Diagnosing Aircraft Failure

Data Mining - Successes
Some successful examples of its use:
• Search Engines – Bayesian networks
• Pharmaceutical companies – Drug Discovery
• Credit card companies – Fraud Detection
• Transportation companies – Routing
• Large consumer package goods companies (to improve the sales process to retailers)
• Hospital Organisation – Decision Analysis
• Online businesses – Market Research

On Business Intelligence & OLAP
• Application of DM & DW to Business Data
• On-Line Analytic Processing:
  • Overlap with Data Mining
  • More focussed on interactive ad-hoc analysis
  • Exploits multidimensional modelling
• Concepts of:
  • Drill-down
  • Consolidation
  • Slicing & Dicing
  • Visualisation & Dashboards

On Social Media / Market Research
• LinkedIn (professional contacts)
• Skype (voice / video)
• iPods (location)
• Flickr (images)
• Facebook (personal contacts)

Data Mining – In the media – part 1
Most positive news stories relate to other names for DM

Some of our Data Mining work in Biomedical / Eco Informatics
• Building Gene Regulatory Networks
• Building Trajectories of Disease from Medical Data
• Building Dynamic Models of Ecological Data

Microarray Data & Bioinformatics
• Major source of data for gene expression activity
• Technology takes measurements over 1000s of genes simultaneously
• Gene Regulatory Networks (GRNs) model how genes interact
• Eliciting reliable GRNs from data is key to understanding biological mechanisms
Yeast

The Importance of Independent Test Data
• Prediction – Train a network on one dataset
• Test it on the other sets (Independent Data)
• As opposed to Cross Validation (testing on the same dataset)

Models of Increasing Complexity (MIC)
• Extending the Consensus across platforms
• Select one dataset for training, others become test sets
• Score mean and var of SSE using CV and independent test sets
• Use these to rank genes (this is feature selection)
Anvar, S.Y., 't Hoen, P.A.C. and Tucker, A., The Identification of Informative Genes from Multiple Datasets with Increasing Complexity, BMC Bioinformatics 11 : 32, 2010

Mechanisms Between Species?
• Dandelion Algorithm – extension of MIC
(submitted) Anvar, Y., Tucker, A., Venema, A., van Ommen, G.J.B., van der Maarel, S.M., Raz, V. and 't Hoen, P.A.C., “Interspecies translation of gene disease networks increase robustness and predictive accuracy”, PLOS Computational Biology

Inter-species Mechanisms

Modelling Clinical Data
• Biomedical studies often involve data sampled from a cross-section of a population
• Collecting medical information on patients suffering from a particular disease and controls
• These studies show a “snapshot” of the disease process but disease is inherently temporal:
  • Previously healthy people can develop a disease over time, going through different stages of severity
• If we want to model the development of such processes, we usually require longitudinal data (expensive)

Models of Disease: Visual Field and Retinal Image Data
• Progressive loss of the field of vision is characteristic of many eye diseases
• Glaucoma is a leading cause of irreversible blindness in the world.
• VF Data: sensitivity of field of vision
• HRT Data: anatomical info of retina

Pseudo Time-Series for CS Data
Tucker, A. and Garway-Heath, D., The Pseudo Temporal Bootstrap for Predicting Glaucoma from Cross-Sectional Visual Field Data, IEEE Transactions on IT in Biomedicine 14 (1) : 79-85, 2010

Fisheries Population Modelling
Cod Collapse in G Bank, N Sea & ESS
[Figure: Catch and Biomass time series, 1970 to 2005, in three panels: George’s Bank (functional collapse in late '80s); North Sea (no functional collapse); East Scotian Shelf (functional collapse in early '90s)]

Dynamic Functional Models
• Predicting ESS event & Cod biomass from G Bank
[Figure: model diagram with labels ESS, G Bank, Th Skate, Cod, Cusk, Cod Catch]

Summary
• What is Data Mining
• Potential (& Successful) Applications
  • Business Intelligence
  • Medical Informatics
  • Bio Informatics
  • Ecological Data
  • Engineering
• What about some of the downsides…

Caveats to Data Mining
• Data Quality ✓
• Spurious Correlations ✓
• Over-fitting ✓
• “Black Box” Modelling ✓
• Over-reliance – slave to the data ?
• “Can’t see the wood for the trees” ?

Data Mining – in the media – part 2
• Data mining government / commercial data sets for national security or law enforcement purposes has raised privacy concerns
• EU – The “right to be forgotten” e.g. Facebook
• Patenting Genetic information

Data Mining – in the media – part 2
Data Mining Video 2
http://www.time.com/time/video/player/0,32068,821500876001_2058396,00.html

Data Mining in the Future …
• Maybe a rebranding is needed? Maybe different names?
• Medical Informatics & Business Intelligence
• Data to Knowledge
• Knowledge Discovery in Databases & KDnuggets
• In the cloud: “Cloud Analytics”

Thanks for listening
• Emma Steele, Yahya Anvar & Peter-Bram 't Hoen for their work on the microarray research
• Daniel Duplisea for the work on the fish biomass research
• Stefano Ceccon, Yuanxi Li & David Garway-Heath for work on Glaucoma research