Data Mining (資料探勘)
Instructor: Hsiao-Ping Tsai (蔡曉萍)
Electrical Engineering Department, National Chung Hsing University
Taichung, Taiwan, ROC
2010/02/26

Why Mine Data? Commercial Viewpoint
- Lots of data is being collected and warehoused
  - Web data, e-commerce
  - purchases at department/grocery stores
  - Bank/Credit Card transactions
- Computers have become cheaper and more powerful
- Competitive Pressure is Strong
  - Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint
- Data collected and stored at enormous speeds (GB/hour)
  - remote sensors on a satellite
  - telescopes scanning the skies
  - microarrays generating gene expression data
  - scientific simulations generating terabytes of data
- Traditional techniques infeasible for raw data
- Data mining may help scientists
  - in classifying and segmenting data
  - in Hypothesis Formation

Motivation
- We are data rich but information poor

Data Mining
- We are buried in data, but looking for knowledge
- Data mining: Knowledge discovery in databases
  - Extraction of interesting knowledge (rules, regularities, patterns) from data in large databases

Course Staff
- Instructor: Hsiao-Ping Tsai (蔡曉萍)
- Time: Fri. 9:00-12:00
- Location: EE-102
- Grading: Midterm exam 25%, final exam 25%, homework 40%, and Paper Studying Presentation 10%.
- Textbooks
  - "Introduction to Data Mining," Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Addison-Wesley
  - "Data Mining: Concepts and Techniques," Jiawei Han and Micheline Kamber

Course Staff
- Email: [email protected]
- Office: EE4-711
- Tel: (04) 22851549 ext. 711
- Course Web Site: EE Department home page -> course (course regulations) -> course details -> Data Mining

Outline of Course
- Introduction
- Association Rules
- Sequential Patterns
- Classification and Prediction
- Cluster Analysis
- Mining Stream, Time-Series, and Sequence Data
- Web Mining
- Social Network Mining
- Cloud Mining

Course Requirements
- Recommended background in:
  - Databases
  - Statistics
  - AI Fundamentals
  - Web Technology
  - Algorithms
  - Programming in C/C++, Java

What Is Data Mining?
- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
- Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: Is everything "data mining"?
  - Simple search and query processing
  - (Deductive) expert systems

What is (not) Data Mining?
- What is not Data Mining?
  - Look up a phone number in a phone directory
  - Query a Web search engine for information about "Amazon"
- What is Data Mining?
  - Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly… in Boston area)
  - Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

KDD Process: Several Key Steps
- Learning the application domain
- Creating a target data set
- Data cleaning and preprocessing (may take 60% of effort!)
- Data reduction and transformation
- Choosing functions of data mining
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation
- Use of discovered knowledge

Techniques to Be Utilized
- Database-oriented
- Machine learning
- Neural network
- Fuzzy set
- Statistics
- Visualization
- …
[Figure: diagram relating Data Mining to Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, Graph Theory, Neural Network, and Other Disciplines]

Data Mining Tasks
- Prediction Methods
  - Use some variables to predict unknown or future values of other variables.
- Description Methods
  - Find human-interpretable patterns that describe the data.
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996

Knowledge to Be Mined
- Association rules
- Classification
- Clustering
- Trend and deviation analysis
- Outlier

Association Rules
- Buy(bread) ^ Buy(milk) => Buy(butter)
- Age(20~29) ^ Income(20~30k) => Buy(CD player)

Classification Example

Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: Training Set -> Learn Classifier -> Model; Test Set -> Model]

Classification
- Supervised classification
- Organizes data into given classes based on attribute values
[Figure: decision tree with tests X<10 and Y<5, whose Yes/No branches lead to group 1, group 2, and group 3]

What is a natural grouping among these objects?
- Clustering is subjective
[Figure: groupings labeled Simpson's Family, School Employees, Females, Males]

Clustering
- Unsupervised classification
- Organizes data into classes based on attribute values
[Figure: two scatter plots with axes x and y]

Sequential Patterns
- Given a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
- (A B) (C) (D E)

Time Series Analysis
- Trend analysis
- Regression
- Sequential patterns
- Similar sequences

The similarity matching problem can come in two flavors I
- 1: Whole Matching
- Given a Query Q, a reference database C and a distance measure, find the Ci that best matches Q.
[Figure: Query Q (template) compared against Database C, with sequences numbered 1 to 10; C6 is the best match.]

Regression
- Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
- Extensively studied in the statistics and neural network fields.
- Examples:
  - Predicting sales amounts of a new product based on advertising expenditure.
  - Time series prediction of stock market indices.
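The advertising example on the Regression slide can be sketched with ordinary least squares, assuming a linear model of dependency. This is a minimal illustration, not the course's method: the function names fit_line and predict and the data values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x-y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new x value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical data: advertising expenditure vs. sales, made up for illustration.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

model = fit_line(ad_spend, sales)
forecast = predict(model, 6.0)
print("slope, intercept:", model)
print("predicted sales at spend 6.0:", forecast)
```

A nonlinear model of dependency would replace the straight line with another function family; the fitting step changes, but the split between fitting on known values and predicting an unknown value stays the same.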
Deviation/Anomaly Detection
- Detect significant deviations from normal behavior
- Applications:
  - Credit Card Fraud Detection
  - Network Intrusion Detection
- Typical network traffic at University level may reach over 100 million connections per day

Features & Challenges of KDD
- Handling of different types of data
- Efficiency & scalability of data mining algorithms
- Usefulness, certainty & expressiveness of results
- Interactive mining at multiple abstraction levels
- Parallel & distributed data mining
- Protection of privacy & data security

Summary
- Data mining: Discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed in a variety of information repositories
- Data mining functionalities: association, classification, sequential pattern, clustering, outlier detection, ranking, and trend analysis, etc.

Related Conferences and Journals
- KDD Conferences
  - ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  - SIAM Data Mining Conf. (SDM)
  - (IEEE) Int. Conf. on Data Mining (ICDM)
  - Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
  - Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
- Other related conferences
  - ACM SIGMOD
  - VLDB
  - (IEEE) ICDE
  - WWW, SIGIR
  - ICML, CVPR, NIPS
- Journals
  - Data Mining and Knowledge Discovery (DAMI or DMKD)
  - IEEE Trans. on Knowledge and Data Eng. (TKDE)
  - KDD Explorations
  - ACM Trans. on KDD

Where to Find References? DBLP, CiteSeer, Google
- Data mining and KDD (SIGKDD: CDROM)
  - Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  - Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
- Database systems (SIGMOD: ACM SIGMOD Anthology, CD ROM)
  - Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  - Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
- AI & Machine Learning
  - Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
  - Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
- Web and IR
  - Conferences: SIGIR, WWW, CIKM, etc.
  - Journals: WWW: Internet and Web Information Systems
- Statistics
  - Conferences: Joint Stat. Meeting, etc.
  - Journals: Annals of Statistics, etc.
- Visualization
  - Conference proceedings: CHI, ACM-SIGGraph, etc.
  - Journals: IEEE Trans. Visualization and Computer Graphics, etc.

Recommended Reference Books
- S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
- R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000
- T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
- U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
- J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann, 2006
- D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001
- T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001
- T. M. Mitchell. Machine Learning. McGraw Hill, 1997
- G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
- P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley, 2005
- S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
- I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005

Examples of Data Mining Systems (1)
- Microsoft SQLServer 2005
  - Integrate DB and OLAP with mining
  - Support OLEDB for DM standard
- SAS Enterprise Miner
  - A variety of statistical analysis tools
  - Data warehouse tools and multiple data mining algorithms
- IBM Intelligent Miner
  - A wide range of data mining algorithms
  - Scalable mining algorithms
  - Toolkits: neural network algorithms, statistical methods, data preparation, and data visualization tools
  - Tight integration with IBM's DB2 relational database system

Examples of Data Mining Systems (2)
- SGI MineSet
  - Multiple data mining algorithms and advanced statistics
  - Advanced visualization tools
- SPSS
  - An integrated data mining development environment for end-users and developers
  - Multiple data mining algorithms and visualization tools