Download Data Mining

General introduction Marco Saerens (UCL), with Christine Decaestecker (ULB) General introduction Slides references  Many slides and figures have been adapted from the slides associated to the following books: – Alpaydin (2004), Introduction to machine learning. The MIT Press. – Duda, Hart & Stork (2000), Pattern classification, John Wiley & Sons. – Han & Kamber (2006), Data mining: concepts and techniques, 2nd ed. Morgan Kaufmann. – Tan, Steinbach & Kumar (2006), Introduction to data mining. Addison Wesley. As well as from Wikipedia, the free online encyclopedia  I really thank these authors for their contribution to machine learning and data mining teaching  Machine learning and data mining  Interplay of these two related fields: – Data mining – Machine learning Why “Machine learning” ?  Machine learning is programming computers to optimize a performance criterion using example data or past experience – There is no need to “learn” to calculate payroll  Learning is used when: – Human expertise does not exist (navigating on Mars), – Humans are unable to explain their expertise accurately (speech recognition) – Solution changes in time (routing on a computer network) – Solution needs to be adapted to particular cases (user biometrics) What we talk about when we talk about “Learning”  Learning general models from a data of particular examples – Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce  Example in retail: Customer transactions to consumer behavior: People who bought “Da Vinci Code” also bought “The Five People You Meet in Heaven” (www.amazon.com)  Build a model that is a good and useful approximation to the data Data mining is related to machine learning  Related to knowledge extraction from data – Retail: Market basket analysis, Customer relationship management (CRM) – Finance: Credit scoring, fraud detection – Manufacturing: Optimization, troubleshooting – Medicine: Medical diagnosis – Telecommunications: Quality of service optimization – Bioinformatics: Motifs, alignment – Web mining: Search engines  In order to take well-founded decisions What is machine learning?  Optimize a performance criterion using example data or past experience. – Role of Statistics: Inference from a sample – Role of Engineering: • Efficient extraction of features from images, camera, measurement devices • Finding the best strategy – similar to control theory – Role of Computer science: Efficient algorithms to • Solve the optimization problem • Representing and evaluating the model for inference Brief sketch of some applications  Association rules  Supervised Learning – Classification – Regression  Clustering  Reinforcement Learning  Collaborative recommendation  Etc... Learning associations  Basket analysis: P(Y | X) probability that somebody who buys X also buys Y where X and Y are products/services. Example: P(chips | beer) = 0.7 Classification   Example: Credit scoring Differentiating between low-risk and high-risk customers from their income and savings Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk Classification: Applications     Aka Pattern recognition Face recognition: Pose, lighting, occlusion (glasses, beard), make-up, hair style Character recognition: Different handwriting styles Speech recognition: Temporal dependency – Use of a dictionary or the syntax of the language – Sensor fusion: Combine multiple modalities; eg, visual (lip image) and acoustic for speech  Medical diagnosis: From symptoms to illnesses Classification: Face recognition Training examples of a person Test images AT&T Laboratories, Cambridge UK http://www.uk.research.att.com/facedatabase.html Regression  Example: Price of a used car y = wx + w0  x: car features y: price y = g(x, w): model w: model parameters Unsupervised learning and clustering  Learning “what normally happens”  No output (= unsupervised)  Clustering: Grouping similar instances  Example applications – Customer segmentation in CRM – Image compression: Color quantization – Bioinformatics: Learning motifs Reinforcement learning  Learning a policy: A sequence of outputs – No supervised output but delayed reward  Credit assignment problem  Game playing  Robot in a maze  Multiple agents, partial observability, ... Introduction to data mining Why data mining?  The Explosive Growth of Data: from terabytes to petabytes – Data collection and data availability • Automated data collection tools, database systems, Web, computerized society – Major sources of abundant data • Business: Web, e-commerce, transactions, stocks, … • Science: Remote sensing, bioinformatics, scientific simulation, … • Society and everyone: news, digital cameras,  We are drowning in data, but starving for knowledge! Evolution of database technology  1960s: – Data collection, database creation, IMS and network DBMS  1970s: – Relational data model, relational DBMS implementation  1980s: – RDBMS, advanced data models (extended-relational, OO, deductive, etc.) – Application-oriented DBMS (spatial, scientific, engineering, etc.)  1990s: – Data mining, data warehousing, multimedia databases, and Web databases  2000s – Stream data management and mining – Data mining and its applications – Web technology (XML, data integration) and global information systems What Is Data Mining?  Data mining (knowledge discovery from data) – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from large amount of data – Data mining: a misnomer?  Alternative names – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.  Watch out: Is everything “data mining”? – No: Simple search and query processing – (Deductive) expert systems Why data mining?– potential applications  Data analysis and decision support – Market analysis and management • Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation – Risk analysis and management • Forecasting, customer retention, improved underwriting, quality control, competitive analysis – Fraud detection and detection of unusual patterns (outliers)  Other Applications – Text mining (news group, email, documents) – Web mining – Stream data mining – Bioinformatics and bio-data analysis Knowledge discovery process – Data mining—core of knowledge discovery process Performances Evaluation Data Mining Task-relevant Data Data Warehouse Data Cleaning Data Integration Databases Selection KDD methodology CRISP-DM (1999) Cross-Industry Standard Process Model KDD methodology Business Knowledge Specification Extraction Audit Transformation Modelling Evaluation Knowledge Deployment Presentation -Map issues to data, gap analysis -Model data design (Matched modelling techinques, Taregt definition -Sampling: data history, time period, volume that models can be built and used!) -Extraction/transformation specification -Data quality audit -Data exploration (understanding, first insight for variable selection) -Data Treatment (missing values, outliers, transformation spec…) Data consolidation, aggregations  suitable format for modelling Construct and evaluate as-simple-as-possible models that render the highest amount of explanatory power Evaluate the model and its impact on your business, an ROI analysis Data Preparation can take 60 – 80% of the time! -Model implementation -Real time repeatable applications How should we act to derive value? KDD process How do we evaluate what we find? How do we extract appropriate data? Interpretation/Evaluation Data Mining Knowledge Transformation Patterns Preprocessing How do we specify what we need? How ought we deliver discovered information? Selection Preprocess ed Data Target Data Data Transform ed Data KDD process: several key steps  Learning the application domain – relevant prior knowledge and goals of application  Creating a target data set: data selection  Data cleaning and preprocessing  Data reduction and transformation – Find useful features, dimensionality/variable reduction, invariant representation  Choosing functions of data mining – summarization, classification, regression, association, clustering  Choosing the mining algorithm(s)  Data mining: search for patterns of interest  Performance evaluation and knowledge presentation – visualization, transformation, removing redundant patterns, etc.  Use of discovered knowledge Data mining and business intelligence Increasing potential to support business decisions End User Decision Making Data Presentation Visualization Techniques Data Mining Information Discovery Data Exploration Statistical exploration, Querying, and Reporting Business Analyst Data Analyst Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems DBA Confluence of multiple disciplines Database technology Machine learning Pattern recognition Statistics Data Mining Algorithmics Visualization Other disciplines Data mining vs statistics  Roughly speaking, data mining is statistical analysis of data performed by computer science people  In order to be able to support a wide variety of fast-growing, innovative, computer science applications Why not traditional data analysis?  Tremendous amount of data – Algorithms must be highly scalable to handle such as terabytes of data  High-dimensionality of data – Micro-array may have tens of thousands of dimensions  High complexity of data – Data streams and sensor data – Time-series data, temporal data, sequence data – Structure data, graphs, social networks and multi-linked data – Heterogeneous databases and legacy databases – Spatial, spatiotemporal, multimedia, text and web data – Software programs, virtual worlds  New and sophisticated applications – Small gap between development of techniques and embedding in software applications Multi-dimensional view of data mining  Data to be mined – Relational, data warehouse, transactional, stream, objectoriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW  Knowledge to be mined – Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc. – Multiple/integrated functions and mining at multiple levels  Techniques utilized – Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc  Applications tackled – Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc Data mining: classification schemes  General functionality – Descriptive data mining – Predictive data mining  Different views lead to different classifications – Data view: kinds of data to be mined – Knowledge view: kinds of knowledge to be discovered – Method view: kinds of techniques utilized – Application view: kinds of applications adapted Data mining: what kinds of data?  Database-oriented data sets and applications – Relational database, data warehouse, transactional database  Advanced data sets and advanced applications – Data streams and sensor data – Time-series data, temporal data, sequence data (incl. biosequences) – Structured data, graphs, social networks and multi-linked data – Object-relational databases – Heterogeneous databases and legacy databases – Spatial data and spatiotemporal data – Multimedia database – Text databases – The World-Wide Web – Virtual worlds Data mining functionalities  Multidimensional concept description: Characterization and discrimination – Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions  Frequent patterns, association, correlation vs. causality – Diaper  Beer [0.5%, 75%] (Correlation or causality?)  Classification and prediction – Construct models (functions) that describe and distinguish classes or concepts for future prediction • E.g., classify countries based on (climate), or classify cars based on (gas mileage) – Predict some unknown or missing values Data Mining Functionalities (2)  Cluster analysis – Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns – Maximizing intra-class similarity & minimizing interclass similarity  Outlier analysis – Outlier: Data object that does not comply with the general behavior of the data – Noise or exception? Useful in fraud detection, rare events analysis  Trend and evolution analysis – Trend and deviation: e.g., regression analysis – Sequential pattern mining: e.g., digital camera  large SD memory – Periodicity analysis – Similarity-based analysis  Etc Are the “discovered” patterns interesting?  Data mining may generate thousands of patterns: Not all of them are interesting – Suggested approach: Human-centered, query-based, focused mining  Interestingness measures – A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm  Objective vs. subjective interestingness measures – Objective: based on statistics and structures of patterns, e.g., support, confidence, etc. – Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc. Some other pattern mining issues  Precise patterns vs. approximate patterns – Association and correlation mining: possible find sets of precise patterns • But approximate patterns can be more compact and sufficient • How to find high quality approximate patterns?? – Gene sequence mining: approximate patterns are inherent • How to derive efficient approximate pattern mining algorithms??  Constrained vs. non-constrained patterns – Why constraint-based mining? – What are the possible kinds of constraints? How to push constraints into the mining process? Architecture: typical data mining system Graphical User Interface Pattern Evaluation Knowle dgeBase Data Mining Engine Database or Data Warehouse Server Data cleaning, transformation, integration, and selection Database Data warehouse World-wide web Other info repositories Some issues in data mining  Mining methodology – Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web – Performance: efficiency, effectiveness, and scalability – Pattern evaluation: the interestingness problem – Incorporation of background knowledge – Handling noise and incomplete data – Parallel, distributed and incremental mining methods – Integration of the discovered knowledge with existing one: knowledge fusion  User interaction – Data mining query languages and ad-hoc mining – Expression and visualization of data mining results – Interactive mining of knowledge at multiple levels of abstraction  Applications and social impacts – Domain-specific data mining & invisible data mining – Protection of data security, integrity, and privacy Summary  Data mining: Discovering interesting patterns from large amounts of data  A natural evolution of database technology + statistics, in great demand, with wide applications  A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation  Mining can be performed in a variety of information repositories  Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.  Data mining systems and architectures  Major issues in data mining Some practical information Conferences International conferences  KDD, PKDD  Knowledge Discovery in Databases  KDD is organised by ACM International conferences  ICDM  IEEE International Conference on Data Mining  IEEE Computer Society International conferences  Closely related to « Machine Learning »  ICML, ECML  International Conference on Machine Learning International conferences  Closely related to « pattern recognition » (from a engineering perspective)  ICPR  International Conference on Pattern Recognition  Organised by IAPR International conferences  Closely related to « neural information processing »  NIPS  Neural Information Processing Systems Conference Some periodicals Periodicals  Researchers of the data mining community typically publish their work in – – – – – – – ACM Transactions on Knowledge Discovery form Data Machine Learning Journal of Machine Learning Research Neural Computation IEEE Transactions on Neural Networks Pattern Recognition IEEE Transactions on Pattern Analysis and Machine Intelligence – IEEE Transactions on Knowledge and Data Engineering – Etc… Some books Books  Hastie, Tibshirani & Friedman (2001) “The elements of statistical learning”.  Mitchell (1997) “Machine learning”.  Webb (2002) “Statistical pattern recognition, 2nd ed”. Books  Bishop (1995) “Neural networks for pattern recognition”.  Theodoridis & Koutroumbas (2005) “Pattern recognition, 3nd ed”  Cornuéjols & Miclet (2002) “Apprentissage artificiel”. Books  Duda, Hart & Stork (2001) “Pattern classification”.  Ripley (1996) “Pattern recognition and neural networks”.  Bishop (2006) “Pattern recognition and machine learning”. Books  Alpaydin (2004) “Introduction to machine learning”.  Han & Kamber (2006) “Data mining. Concepts and techniques, 2nd ed”.  Tan, Steinbach & Kumar (2005) “Introduction to data mining”. More books… Alpaydin (2004), “Introduction to machine learning”. MIT Press. Bardos (2001), "Analyse discriminante. Application au risque et scoring financier. Dunod. Bishop (1995), "Neural networks for pattern recognition". Clarendon Press. Bishop (2006), “Pattern recognition and machine learning”. Springer-Verlag. Bouroche & Saporta (1983), "L'analyse des données". Que Sais-je. Cornuéjols & Miclet (2002), "Apprentissage artificiel. Concepts et algorithmes". Eyrolles. Duda, Hart & Stork (2001), "Pattern classification, 2nd ed". John Wiley & Sons. Dunham (2003), "Data mining. Introductory and advanced topics". Prentice-Hall. Greenacre (1984), "Theory and applications of correspondence analysis". Academic Press. Han & Kamber (2006), "Data mining: Concepts and techniques, 2nd ed". Morgan Kaufmann. Hand (1981), "Discrimination and classification". John Wiley & Sons. Hardle & Simar (2003), "Applied multivariate statistical analysis". Springer-Verlag. Disponible à http://www.quantlet.com/mdstat/scripts/mva/htmlbook/mvahtml.html Hastie, Tibshirani & Friedman (2001), "The elements of statistical learning". Springer-Verlag. Johnson & Wichern (2002), "Applied multivariate statistical analysis, 5th ed". Prentice-Hall. Lebart, Morineau & Piron (1995), "Statistique exploratoire multidimensionnelle". Dunod. Mitchell (1997), "Machine learning". McGraw-Hill. Naim, Wuillemin, Leray, Pourret & Becker (2004), “Réseaux bayesiens”. Editions Eyrolles. Nilsson (1998), “Artificial intelligence: A new synthesis”. Morgan Kaufmann. Ripley (1996), "Pattern recognition and neural networks". Cambridge University Press. Rosner (1995), "Fundamentals of biostatistics, 4th ed".Wadsworth Publishing Company. Saporta (1990), "Probabilités, analyse des données et statistique". Editions Technip. Tan, Steinbach & Kumar (2005), “Introduction to data mining”. Addison Wesley. Theodoridis & Koutroumbas (2006), "Pattern recognition, 3nd ed". Academic Press. Therrien (1989), "Decision, estimation and classification". Wiley & Sons. Venables & Ripley (2002), "Modern applied statistics with S. Springer-Verlag. Wasserman (2004), “All of statistics”. Springer. Webb (2002), "Statistical pattern recognition, 2nd ed". John Wiley and Sons. Some softwares Softwares  SAS/Enterprise Miner  You benefit from the full power of SAS – SAS/Stat, Base, Graph, etc  Highly scalable  Quite complete, but expensive Softwares/SAS EM Softwares   SPSS/Clementine One of the first specific data mining software – Less scalable than SAS – Less expensive  Same look and feel as SAS/EM – Palette with icons – Drag & drop of models – Data flux Softwares/SPSS Softwares  StatSoft/DataMiner  Released recently  Relies on Statistica  Same look and feel Softwares/StatSoft Softwares    Insightful Miner Relies on S-Plus Same look and feel Softwares/Insightful Softwares  IBM/DB2 Intelligent Miner – One of the first on the market – Quite impressive for data visualisation  KXEN – A newcomer – Quite controversial – A bit “Black Box” Softwares   Of course, you do not need a specific data mining product in order to do data mining Statistical softwares integrate data mining techniques as well – – – – – S-Plus R SAS/JMP Stata Etc… Open source softwares  WEKA – www.cs.waikato.ac.nz/~ml/weka/ – Data mining techniques implemented in Java  Pentaho (integrates WEKA) – www.pentaho.com – BI open source suite  R – www.r-project.org – Statistical software including data mining techniques  Orange – http://www.ailab.si/orange – Data mining tool in Python

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Data Mining