Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data Mining Prof. CE, KMITL 1/2555 ©CS Jiawei Han of UIUC • http://www.cs.uiuc.edu/~hanj/ http://www.cs.uiuc.edu/ hanj/ © CS Motivation: Why data mining? What is data mining? Data Mining: On what kind of data? D t mining Data i i ffunctionality ti lit Classification of data mining systems Top-10 p most p popular p data mining g algorithms g Major issues in data mining © CS Data Mining, CE KMITL, 1/2555 3 Data Mining, CE KMITL, 1/2555 Before 1600, empirical science 1600-1950s, theoretical science • Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding. 1950s-1990s, computational science • Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.) • Computational Science traditionally meant simulation. It grew out of our inability to find closedform solutions for complex mathematical models models. 1990-now, data science • The flood of data from new scientific instruments and simulations • The Th ability bilit tto economically i ll store t and d manage petabytes t b t off data d t online li • The Internet and computing Grid that makes all these archives universally accessible • Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge! Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002 © CS Data Mining, CE KMITL, 1/2555 2 4 1960s : • Data collection, database creation, IMS and network DBMS • Data collection and data availability Automated data collection tools, database systems, Web, computerized p society y 1970s : • Relational data model, relational DBMS implementation 1980s : • Major sources of abundant data Business: Web, e-commerce, transactions, stocks, … Science: Remote sensing, bioinformatics, scientific simulation, … Society and everyone: news news, digital cameras cameras, YouTube • RDBMS, advanced data models (extended-relational, OO, deductive, etc.) • Application-oriented DBMS (spatial (spatial, scientific scientific, engineering, engineering etc etc.)) 1990s : • Data mining, g, data warehousing, g, multimedia databases,, and Web databases 2000s : • Stream data management and mining • Data mining and its applications • Web technology (XML, data integration) and global information systems © CS The Explosive Growth of Data: from terabytes to petabytes Data Mining, CE KMITL, 1/2555 5 © CS We are drowning in data, but starving for knowledge! Data Mining, CE KMITL, 1/2555 6 Data analysis and decision support • Market analysis y and management g Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation • Risk analysis and management Forecasting, customer retention, improved underwriting, quality control, competitive analysis • Fraud F dd detection t ti and d detection d t ti off unusuall patterns tt (outliers) ( tli ) Other Applications • Text mining (news group, email, documents) and Web mining • Stream data mining • Bioinformatics and bio-data analysis © CS Data Mining, CE KMITL, 1/2555 7 Data Mining, CE KMITL, 1/2555 © CS 8 Where does the data come from?—Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies Target marketing • Find clusters of “model” customers who share the same characteristics: interest, income level, spending p g habits, etc. • Determine customer purchasing patterns over time Cross-market analysis—Find associations/co-relations between product sales, & predict based on such association Customer profiling—What types of customers buy what products (clustering or classification)) Customer requirement analysis • Identify the best products for different groups of customers • Predict what factors will attract new customers Provision of summary information • Multidimensional summary reports • Statistical summary information (data central tendency and variation) © CS Data Mining, CE KMITL, 1/2555 9 Goal: Improved business efficiency • Improve marketing (advertise to the most likely buyers) • Inventory I t reduction d ti (stock ( t k only l needed d d quantities) titi ) © CS • Example: Supermarket sales records © CS Fish N Y Turkey Y N Cranberries Y N Wine N Y ... ... ... • Size ranges from 50K records (research studies) to terabytes (years of data from chains) • Data is already being warehoused Finance planning and asset evaluation Resource planning • summarize i and d compare th the resources and d spending di Competition • monitor competitors and market directions • group customers into classes and a class-based pricing procedure • set pricing strategy in a highly competitive market Sample question – what products are generally purchased together? The answers are in the data, if only we could see them Data Mining, CE KMITL, 1/2555 10 • cash flow analysis y and prediction p • contingent claim analysis to evaluate assets • cross-sectional and time series analysis (financial-ratio, trend analysis analysis, etc etc.)) Information source: Historical business data Date/Time/Register 12/6 13:15 2 12/6 13:16 3 Data Mining, CE KMITL, 1/2555 11 © CS Data Mining, CE KMITL, 1/2555 12 Approaches: Clustering & model construction for frauds, outlier analysis Applications: Health care, retail, credit card service, telecomm. Delivery • Debatable what data mining will do here; best match would be related to “quality analysis”: given lots of data about deliveries, try to find common threads in problem deliveries “problem” • Auto insurance: ring of collisions • Money laundering: suspicious monetary transactions • Medical insurance Predicting Professional patients, ring of doctors, and ring of references Unnecessary or correlated screening tests Looking for cycles, related to similarity search in time series data Look for similar cycles between products, even if not repeated Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm • Retail industry Analysts estimate that 38% of retail shrink is due to dishonest employees • Event-related • Anti-terrorism item needs • Seasonal S l • Telecommunications: phone-call fraud © CS delays S Sequential ti l association i ti b between t eventt and d product d t order d (probably weak) Data Mining, CE KMITL, 1/2555 13 © CS Data Mining, CE KMITL, 1/2555 Data mining (knowledge discovery from data) 6 • Extraction of interesting (non-trivial, non-trivial implicit, implicit previously unknown and potentially useful) patterns or knowledge from huge amount of data 7 5 Alternative names 3 • K Knowledge l d di discovery iin d databases t b (KDD) (KDD), kknowledge l d extraction, t ti data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. 4 Data Mining, CE KMITL, 1/2555 Data mining • Core of KDD • Search for k knowledge l d in i th the data 1 Watch out: Is everything “data mining”? 2 • Simple search and query processing • (Deductive) expert systems © CS 14 15 © CS Data Mining, CE KMITL, 1/2555 16 Data mining may generate thousands of patterns: Not g all of them are interesting • Can a data mining g system y find all the interesting gp patterns? Do we need to find all of the interesting patterns? • Heuristic vs. exhaustive search • Association vs. vs classification vs vs. clustering • Suggested approach: Human-centered, query-based, focused mining I t Interestingness ti measures • A pattern is interesting if it is easily understood by humans, valid g of certainty, y, potentially p y on new or test data with some degree useful, novel, or validates some hypothesis that a user seeks to confirm Objective vs vs. subjective interestingness measures Data Mining, CE KMITL, 1/2555 Search for only interesting patterns: An optimization problem • Can a data mining system find only the interesting patterns? • Approaches • Objective: based on statistics and structures of patterns, e.g., support, confidence, etc. • Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc. © CS Find all the interesting patterns: Completeness First general all the patterns and then filter out the uninteresting ones Generate only the interesting patterns—mining query optimization 17 © CS Precise patterns vs. approximate patterns Data Mining, CE KMITL, 1/2555 Graphical User Interface • Association and correlation mining: possible find sets of precise patterns Pattern Evaluation But approximate patterns can be more compact and sufficient How to find high quality approximate patterns?? Data Mining Engine • Gene sequence mining: approximate patterns are inherent How H to d derive i efficient ffi i approximate i pattern mining i i algorithms?? l i h ?? data cleaning, integration, and selection Database Data Mining, CE KMITL, 1/2555 Knowl edgeBase Database or Data Warehouse Server Constrained vs. non-constrained patterns • Why constraint-based mining? • What are the possible kinds of constraints? How to push constraints t i t into i t the th mining i i process? ? © CS 18 19 © CS Data World-Wide Other Info Repositories Warehouse W h W b Web Data Mining, CE KMITL, 1/2555 20 Increasing potential to support business decisions Database Technology End User Decision Making D t Presentation Data P t ti Visualization Techniques Business Analyst Data D t Mi Mining i Information Discovery Machine L Learning i Data D Analyst Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration Preprocessing/Integration, Data Warehouses Wareho ses Feature Selection, Dimensionality Reduction, Normalization Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems © CS Data Mining, CE KMITL, 1/2555 21 Tremendous amount of data © CS © CS Data Mining, CE KMITL, 1/2555 22 Data to be mined Knowledge to be mined • Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc. • Multiple/integrated functions and mining at multiple levels High complexity of data Data streams and sensor data Time-series data, temporal data, sequence data Structure data, graphs, social networks and multi-linked multi linked data Heterogeneous databases and legacy databases Spatial, spatiotemporal, multimedia, text and Web data Software programs, programs scientific simulations Techniques utilized • Database-oriented Database-oriented, data warehouse (OLAP), (OLAP) machine learning, learning statistics, statistics visualization, etc. Applications pp adapted p • Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc. New and sophisticated applications Data Mining, CE KMITL, 1/2555 Other Disciplines • Relational, data warehouse, transactional, stream, objectoriented/relational active, oriented/relational, active spatial spatial, time time-series, series text text, multi multi-media, media heterogeneous, legacy, WWW High-dimensionality of data • • • • • • Algorithm Visualization DBA • Micro-array may have tens of thousands of dimensions Data Mining Pattern Recognition • Algorithms must be highly scalable to handle such as tera-bytes of data Statistics 23 © CS Data Mining, CE KMITL, 1/2555 24 General functionality • Relational database, data warehouse, transactional database • Descriptive D i ti d data t mining i i • Predictive data mining Different • • • • © CS Data view: Kinds of data to be mined Knowledge view: Kinds of knowledge to be discovered Method view: Kinds of techniques utilized Application view: Kinds of applications adapted 25 Multidimensional concept description: Characterization and discrimination © CS Frequent patterns, association, correlation vs. causality Classification and prediction E E.g., g classify countries based on (climate) (climate), or classify cars based on (gas mileage) • Predict some unknown or missing numerical values Data Mining, CE KMITL, 1/2555 Cluster analysis Outlier analysis Trend and evolution analysis • • • • • Construct models (functions) that describe and distinguish classes or concepts for future prediction © CS 26 • Outlier: Data object that does not comply with the general behavior of the data detection, rare events analysis • Noise or exception? Useful in fraud detection • Diaper Di Beer B [0.5%, [0 5% 75%] (C (Correlation l i or causality?) li ?) Data Mining, CE KMITL, 1/2555 • Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns • Maximizing intra-class similarity & minimizing interclass similarity • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions Advanced data sets and advanced applications • Data streams and sensor data • Time-series data, temporal data, sequence data (incl. bioq ) sequences) • Structure data, graphs, social networks and multi-linked data • Object-relational databases • Heterogeneous databases and legacy databases • Spatial data and spatiotemporal data • Multimedia database • Text databases • The World-Wide Web views lead to different classifications Data Mining, CE KMITL, 1/2555 Database-oriented data sets and applications 27 © CS Trend and deviation: e.g., g , regression g analysis y Sequential pattern mining: e.g., digital camera large SD memory Periodicity analysis Similarity-based Similarity based analysis Other pattern-directed or statistical analyses Data Mining, CE KMITL, 1/2555 28 Find groups of similar data items Statistical techniques require some definition of “distance” (e.g. between travel profiles) while conceptual techniques use background concepts and logical descriptions Uses: Demographic analysis Clusters g Technologies: Self-Organizing Maps Probability Densities “G l with ith similar i il “Group people travel profiles” Conceptual Clustering Identify dependencies in the data: • X makes Y likely Indicate significance of each dependency B Bayesian i methods th d Uses: Targeted marketing “Find Find groups of items commonly purchased together” © CS Data Mining, CE KMITL, 1/2555 29 © CS Find ways to separate data items into pre-defined groups Requires q “training g data”: Data items where g group p is known Training Data Uses: tool produces Generate decision trees (results are human understandable) Neural Nets classifier classifie “Route documents to most likely interested parties” Data Mining, CE KMITL, 1/2555 Classification Statistical Learning • #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. SpringerVerlag. • #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York. Association Analysis • #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules Rules. In VLDB '94 94. • #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00. G Groups © CS Profiling Technologies: 30 • #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann 1993. Kaufmann., 1993 • #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984. g ((kNN): ) Hastie,, T. and Tibshirani,, R. 1996. Discriminant • #3. K Nearest Neighbours Adaptive Nearest Neighbor Classification. TPAMI. 18(6) • #4. Naive Bayes Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398. • We know X and Y belong together, find other things in same group Data Mining, CE KMITL, 1/2555 31 © CS Data Mining, CE KMITL, 1/2555 32 Link Mining Clustering • #11. K-Means: MacQueen, J. B., Some methods for classification and analysis y of multivariate observations,, in Proc. 5th Berkeleyy Symp. y p Mathematical Statistics and Probability, 1967. • #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96. '96 33 Mining methodology © CS Conferences: SIGIR, WWW, CIKM, etc. Journals: WWW: Internet and Web Information Systems, Statistics Conferences: Joint Stat. Meeting, etc. Journals: Annals of statistics, etc. Visualization • • 35 Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc. Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc. Web and IR • • Applications pp and social impacts p Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc. AI & Machine Learning • • Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc. Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM) • • 34 Data mining and KDD (SIGKDD: CDROM) • • User interaction Data Mining, CE KMITL, 1/2555 Data Mining, CE KMITL, 1/2555 • • • Domain-specific data mining & invisible data mining • Protection of data security, integrity, and privacy © CS © CS • Data mining query languages and ad-hoc mining • Expression E i and d visualization i li ti off d data t mining i i results lt • Interactive mining of knowledge at multiple levels of abstraction Graph Mining • # #18. gSpan: S Yan, X. and Han, J. 2002. gSpan: S G Graph-Based Substructure S Pattern Mining. In ICDM '02. • Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web • Performance: efficiency, effectiveness, and scalability • Pattern evaluation: the interestingness problem • Incorporation of background knowledge • Handling noise and incomplete data • Parallel, distributed and incremental mining methods teg at o o of tthe ed discovered sco e ed knowledge o edge with t e existing st g o one: e knowledge o edge fusion us o • Integration Rough Sets • #17 #17. Finding reduct: Zdzislaw Pawlak, Pawlak Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Norwell, MA, 1992 • #13 #13. AdaBoost: Freund Freund, Y. Y and Schapire, Schapire R R. E E. 1997 1997. A decision decision-theoretic theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139. Data Mining, CE KMITL, 1/2555 Integrated Mining • #16. CBA: Liu,, B.,, Hsu,, W. and Ma,, Y. M. Integrating g g classification and association rule mining. KDD-98. Bagging and Boosting © CS Sequential Patterns • #14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. Improvements In Proceedings of the 5th International Conference on Extending Database Technology, 1996. • #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected P tt Pattern Growth. G th IIn ICDE '01. '01 • #9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine engine. In WWW WWW-7, 7 1998 1998. • #10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. SODA, 1998. Conference proceedings: CHI, ACM-SIGGraph, etc. Journals: IEEE Trans. visualization and computer graphics, etc. Data Mining, CE KMITL, 1/2555 36 S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan Kaufmann, 2002 R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000 T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003 U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996 U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001 J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed., 2006 D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001 T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, Springer Verlag, 2001 B. Liu, Web Data Mining, Springer 2006. T. M. Mitchell, Machine Learning, McGraw Hill, 1997 G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991 P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005 S M. S. M Weiss W i and dN N. Indurkhya, I d kh P di ti D Predictive Data t Mi Mining, i M Morgan K Kaufmann, f 1998 I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, © CS Morgan Kaufmann, 2nd ed. 2005 Data Mining, CE KMITL, 1/2555 37 Data mining: Discovering interesting patterns from large amounts of data A natural evolution of database technology, in great demand, with wide applications pp A KDD process includes data cleaning, data integration, data selection, transformation, data mining, g p pattern evaluation, and knowledge presentation Mining g can be p performed in a variety y of information repositories p Data mining functionalities: characterization, discrimination, association, classification, clustering, g outlier and trend analysis, y etc. Data mining systems and architectures Major issues in data mining © CS Data Mining, CE KMITL, 1/2555 38