Data Mining: Prof. Jiawei Han of UIUC

Data Mining
CE, KMITL, 1/2555
Prof. Jiawei Han of UIUC • http://www.cs.uiuc.edu/~hanj/
© CS

▪ Motivation: Why data mining?
▪ What is data mining?
▪ Data Mining: On what kind of data?
▪ Data mining functionality
▪ Classification of data mining systems
▪ Top-10 most popular data mining algorithms
▪ Major issues in data mining

▪ Before 1600, empirical science
▪ 1600-1950s, theoretical science
  • Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
▪ 1950s-1990s, computational science
  • Over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, or physics, or linguistics).
  • Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
▪ 1990-now, data science
  • The flood of data from new scientific instruments and simulations
  • The ability to economically store and manage petabytes of data online
  • The Internet and computing Grid that makes all these archives universally accessible
  • Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
▪ Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002

▪ 1960s:
  • Data collection, database creation, IMS and network DBMS
▪ 1970s:
  • Relational data model, relational DBMS implementation
▪ 1980s:
  • RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  • Application-oriented DBMS (spatial, scientific, engineering, etc.)
▪ 1990s:
  • Data mining, data warehousing, multimedia databases, and Web databases
▪ 2000s:
  • Stream data management and mining
  • Data mining and its applications
  • Web technology (XML, data integration) and global information systems

▪ The Explosive Growth of Data: from terabytes to petabytes
  • Data collection and data availability
    ▪ Automated data collection tools, database systems, Web, computerized society
  • Major sources of abundant data
    ▪ Business: Web, e-commerce, transactions, stocks, …
    ▪ Science: Remote sensing, bioinformatics, scientific simulation, …
    ▪ Society and everyone: news, digital cameras, YouTube
▪ We are drowning in data, but starving for knowledge!
▪ Data analysis and decision support
  • Market analysis and management
    ▪ Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  • Risk analysis and management
    ▪ Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  • Fraud detection and detection of unusual patterns (outliers)
▪ Other Applications
  • Text mining (news group, email, documents) and Web mining
  • Stream data mining
  • Bioinformatics and bio-data analysis

▪ Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
▪ Target marketing
  • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
  • Determine customer purchasing patterns over time
▪ Cross-market analysis: Find associations/co-relations between product sales, and predict based on such associations
▪ Customer profiling: What types of customers buy what products (clustering or classification)
▪ Customer requirement analysis
  • Identify the best products for different groups of customers
  • Predict what factors will attract new customers
▪ Provision of summary information
  • Multidimensional summary reports
  • Statistical summary information (data central tendency and variation)

▪ Goal: Improved business efficiency
  • Improve marketing (advertise to the most likely buyers)
  • Inventory reduction (stock only needed quantities)
▪ Information source: Historical business data
  • Example: Supermarket sales records

    Date/Time/Register   Fish   Turkey   Cranberries   Wine   ...
    12/6 13:15 2         N      Y        Y             N      ...
    12/6 13:16 3         Y      N        N             Y      ...

  • Size ranges from 50K records (research studies) to terabytes (years of data from chains)
  • Data is already being warehoused
▪ Sample question: what products are generally purchased together?
▪ The answers are in the data, if only we could see them

▪ Finance planning and asset evaluation
  • cash flow analysis and prediction
  • contingent claim analysis to evaluate assets
  • cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
▪ Resource planning
  • summarize and compare the resources and spending
▪ Competition
  • monitor competitors and market directions
  • group customers into classes and a class-based pricing procedure
  • set pricing strategy in a highly competitive market

▪ Predicting item needs
  • Seasonal
    ▪ Looking for cycles, related to similarity search in time series data
    ▪ Look for similar cycles between products, even if not repeated
  • Event-related
    ▪ Sequential association between event and product order (probably weak)
▪ Delivery delays
  • Debatable what data mining will do here; best match would be related to “quality analysis”: given lots of data about deliveries, try to find common threads in “problem” deliveries

▪ Approaches: Clustering & model construction for frauds, outlier analysis
▪ Applications: Health care, retail, credit card service, telecomm.
  • Auto insurance: ring of collisions
  • Money laundering: suspicious monetary transactions
  • Medical insurance
    ▪ Professional patients, ring of doctors, and ring of references
    ▪ Unnecessary or correlated screening tests
  • Telecommunications: phone-call fraud
    ▪ Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
  • Retail industry
    ▪ Analysts estimate that 38% of retail shrink is due to dishonest employees
  • Anti-terrorism

▪ Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data
▪ Alternative names
  • Knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
▪ Watch out: Is everything “data mining”?
  • Simple search and query processing
  • (Deductive) expert systems

▪ Data mining
  • Core of KDD
  • Search for knowledge in the data

▪ Data mining may generate thousands of patterns: Not all of them are interesting
  • Suggested approach: Human-centered, query-based, focused mining
▪ Interestingness measures
  • A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
▪ Objective vs. subjective interestingness measures
  • Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  • Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.

▪ Find all the interesting patterns: Completeness
  • Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  • Heuristic vs. exhaustive search
  • Association vs. classification vs. clustering
▪ Search for only interesting patterns: An optimization problem
  • Can a data mining system find only the interesting patterns?
  • Approaches
    ▪ First generate all the patterns and then filter out the uninteresting ones
    ▪ Generate only the interesting patterns: mining query optimization

▪ Precise patterns vs. approximate patterns
  • Association and correlation mining: possible to find sets of precise patterns
    ▪ But approximate patterns can be more compact and sufficient
    ▪ How to find high-quality approximate patterns?
  • Gene sequence mining: approximate patterns are inherent
    ▪ How to derive efficient approximate pattern mining algorithms?
▪ Constrained vs. non-constrained patterns
  • Why constraint-based mining?
  • What are the possible kinds of constraints? How to push constraints into the mining process?

[Figure: data mining system architecture. Labels: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Knowledge Base; Database or Data Warehouse Server; data cleaning, integration, and selection; Database]
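The objective interestingness measures named above, support and confidence, can be illustrated with a small sketch. This is not taken from the slides: the function name `rule_measures` and the basket data are hypothetical, with the first two baskets mirroring the two supermarket records shown earlier and the other two invented for illustration. Support is taken here as the fraction of all transactions containing both sides of a rule, and confidence as the fraction of transactions containing the left-hand side that also contain the right-hand side.

```python
def rule_measures(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = fraction of transactions containing antecedent and consequent
    confidence = fraction of antecedent-containing transactions that also
                 contain the consequent
    """
    antecedent = frozenset(antecedent)
    consequent = frozenset(consequent)
    total = len(transactions)
    with_antecedent = sum(1 for t in transactions if antecedent <= t)
    with_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = with_both / total if total else 0.0
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Hypothetical baskets: the first two follow the supermarket table,
# the last two are made up for this example.
transactions = [
    frozenset({"turkey", "cranberries"}),
    frozenset({"fish", "wine"}),
    frozenset({"turkey", "cranberries", "wine"}),
    frozenset({"turkey"}),
]

support, confidence = rule_measures(transactions, {"turkey"}, {"cranberries"})
print(f"turkey -> cranberries: support={support:.2f}, confidence={confidence:.2f}")
```

For these four baskets the rule holds in 2 of 4 transactions (support 0.5) and in 2 of the 3 transactions that contain turkey (confidence about 0.67). A rule annotated as [0.5%, 75%] would be read the same way: 0.5% support, 75% confidence.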
[Figure labels: data repositories: Data Warehouse, World-Wide Web, Other Info Repositories]

[Figure: pyramid of increasing potential to support business decisions. Levels, bottom to top: Data Sources (paper, files, Web documents, scientific experiments, database systems); Data Preprocessing/Integration, Data Warehouses; Data Exploration (statistical summary, querying, and reporting); Data Mining (information discovery); Data Presentation (visualization techniques); Decision Making. Roles shown alongside: DBA, Data Analyst, Business Analyst, End User. Also labeled: Feature Selection, Dimensionality Reduction, Normalization]

[Figure: Data Mining surrounded by contributing disciplines: Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, Other Disciplines]

▪ Tremendous amount of data
  • Algorithms must be highly scalable to handle data volumes such as terabytes
▪ High-dimensionality of data
  • Micro-array may have tens of thousands of dimensions
▪ High complexity of data
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data
  • Structure data, graphs, social networks and multi-linked data
  • Heterogeneous databases and legacy databases
  • Spatial, spatiotemporal, multimedia, text and Web data
  • Software programs, scientific simulations
▪ New and sophisticated applications

▪ Data to be mined
  • Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
▪ Knowledge to be mined
  • Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  • Multiple/integrated functions and mining at multiple levels
▪ Techniques utilized
  • Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
▪ Applications adapted
  • Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

▪ General functionality
  • Descriptive data mining
  • Predictive data mining
▪ Different views lead to different classifications
  • Data view: Kinds of data to be mined
  • Knowledge view: Kinds of knowledge to be discovered
  • Method view: Kinds of techniques utilized
  • Application view: Kinds of applications adapted

▪ Database-oriented data sets and applications
  • Relational database, data warehouse, transactional database
▪ Advanced data sets and advanced applications
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data (incl. bio-sequences)
  • Structure data, graphs, social networks and multi-linked data
  • Object-relational databases
  • Heterogeneous databases and legacy databases
  • Spatial data and spatiotemporal data
  • Multimedia database
  • Text databases
  • The World-Wide Web

▪ Multidimensional concept description: Characterization and discrimination
  • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
▪ Frequent patterns, association, correlation vs. causality
  • Diaper → Beer [0.5%, 75%] (Correlation or causality?)
▪ Classification and prediction
  • Construct models (functions) that describe and distinguish classes or concepts for future prediction
    ▪ E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  • Predict some unknown or missing numerical values
▪ Cluster analysis
  • Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
  • Maximizing intra-class similarity & minimizing interclass similarity
▪ Outlier analysis
  • Outlier: Data object that does not comply with the general behavior of the data
  • Noise or exception? Useful in fraud detection, rare events analysis
▪ Trend and evolution analysis
  • Trend and deviation: e.g., regression analysis
  • Sequential pattern mining: e.g., digital camera → large SD memory
  • Periodicity analysis
  • Similarity-based analysis
▪ Other pattern-directed or statistical analyses

Clusters
▪ Find groups of similar data items
▪ Statistical techniques require some definition of “distance” (e.g. between travel profiles) while conceptual techniques use background concepts and logical descriptions
▪ Uses:
  • Demographic analysis
▪ Technologies:
  • Self-Organizing Maps
  • Probability Densities
  • Conceptual Clustering
▪ “Group people with similar travel profiles”

▪ Identify dependencies in the data:
  • X makes Y likely
▪ Indicate significance of each dependency
▪ Bayesian methods
▪ Uses:
  • Targeted marketing
▪ “Find groups of items commonly purchased together”

▪ Find ways to separate data items into pre-defined groups
  • We know X and Y belong together, find other things in same group
▪ Requires “training data”: Data items where group is known
▪ Uses:
  • Profiling
▪ Technologies:
  • Generate decision trees (results are human understandable)
  • Neural Nets
▪ “Route documents to most likely interested parties”
[Figure: training data is fed to a tool, which produces a classifier that assigns items to groups]

▪ Classification
  • #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
  • #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.
  • #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. TPAMI. 18(6)
  • #4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398.
▪ Statistical Learning
  • #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag.
  • #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.
▪ Association Analysis
  • #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
  • #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.
▪ Link Mining
  • #9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In WWW-7, 1998.
  • #10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. SODA, 1998.
▪ Clustering
  • #11. K-Means: MacQueen, J. B., Some methods for classification and analysis of multivariate observations, in Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967.
  • #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96.
▪ Bagging and Boosting
  • #13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.
▪ Sequential Patterns
  • #14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology, 1996.
  • #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE '01.
▪ Integrated Mining
  • #16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and association rule mining. KDD-98.
▪ Rough Sets
  • #17. Finding reduct: Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Norwell, MA, 1992
▪ Graph Mining
  • #18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In ICDM '02.

▪ Mining methodology
  • Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  • Performance: efficiency, effectiveness, and scalability
  • Pattern evaluation: the interestingness problem
  • Incorporation of background knowledge
  • Handling noise and incomplete data
  • Parallel, distributed and incremental mining methods
  • Integration of the discovered knowledge with the existing one: knowledge fusion
▪ User interaction
  • Data mining query languages and ad-hoc mining
  • Expression and visualization of data mining results
  • Interactive mining of knowledge at multiple levels of abstraction
▪ Applications and social impacts
  • Domain-specific data mining & invisible data mining
  • Protection of data security, integrity, and privacy

▪ Data mining and KDD (SIGKDD: CDROM)
  • Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  • Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
▪ Database systems (SIGMOD: ACM SIGMOD Anthology, CD ROM)
  • Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  • Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
▪ AI & Machine Learning
  • Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
  • Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
▪ Web and IR
  • Conferences: SIGIR, WWW, CIKM, etc.
  • Journals: WWW: Internet and Web Information Systems
▪ Statistics
  • Conferences: Joint Stat. Meeting, etc.
  • Journals: Annals of statistics, etc.
▪ Visualization
  • Conference proceedings: CHI, ACM-SIGGraph, etc.
  • Journals: IEEE Trans. visualization and computer graphics, etc.

▪ S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
▪ R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000
▪ T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
▪ U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
▪ U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
▪ J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann, 2006
▪ D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001
▪ T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001
▪ B. Liu. Web Data Mining. Springer, 2006
▪ T. M. Mitchell. Machine Learning. McGraw Hill, 1997
▪ G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
▪ P.-N. Tan, M. Steinbach and V. Kumar. Introduction to Data Mining. Wiley, 2005
▪ S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
▪ I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005

▪ Data mining: Discovering interesting patterns from large amounts of data
▪ A natural evolution of database technology, in great demand, with wide applications
▪ A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
▪ Mining can be performed in a variety of information repositories
▪ Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
▪ Data mining systems and architectures
▪ Major issues in data mining

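As an illustration of the cluster analysis functionality described in these slides (group data with unknown class labels so that items in the same group are close under some definition of “distance”), here is a minimal k-means sketch. It is not the slides' own code: the function name `kmeans`, the Euclidean distance, the choice of the first `k` points as initial centroids, and the sample points are all assumptions made to keep the example small and deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster points into k groups; return (centroids, assignments).

    Assumption: the first k points serve as initial centroids so the
    result is reproducible; practical implementations usually pick
    initial centroids at random or with a smarter seeding scheme.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, new_assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
        # Stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
    return centroids, assignments


# Hypothetical 2-D data with two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```

Each pass alternates between assigning points to the nearest centroid and recomputing centroids, which is one simple way to pursue the "maximize intra-class similarity, minimize interclass similarity" goal. The result depends on the initial centroids, so different seeding can yield different groupings.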