Ch. Eick: Introduction Data Mining and Course Information

Introduction --- Part 2
1. Another Introduction to Data Mining
2. Course Information

Knowledge Discovery in Data [and Data Mining] (KDD)
Let us find something interesting!
• Definition := “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)
• Frequently, the term data mining is used to refer to KDD.
• Many commercial and experimental tools and tool suites are available (see http://www.kdnuggets.com/siftware.html)
• The field is dominated more by industry than by research institutions

Motivation: “Necessity is the Mother of Invention”
• Data explosion problem
  • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
• We are drowning in data, but starving for knowledge!
• Solution: Data warehousing and data mining
  • Data warehousing and on-line analytical processing (“analyzing and mining the raw data rarely works”); idea: mine summarized, aggregated data
  • Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data collections

YAHOO!’s View of Data Mining
[Figure: “ACME CORP ULTIMATE DATA MINING BROWSER” offering “What’s New?”, “What’s Interesting?”, and “Predict for me”. Source: http://www.sigkdd.org/kdd2008/]

Data Mining: A KDD Process
• Data mining: the core of the knowledge discovery process.
[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
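The stages of the KDD process named on the slide above (integration, cleaning, selection, mining, pattern evaluation) can be sketched as a chain of small functions. This is only a toy illustration: the two "databases", their attribute names, and all function names are made up for this example, and the "mining" step is just a frequency count standing in for a real algorithm.

```python
from collections import Counter
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]

# Hypothetical source databases (invented for illustration only).
STORE_A: List[Record] = [
    {"customer": "c1", "item": "milk", "city": "Houston"},
    {"customer": "c2", "item": None, "city": "Houston"},  # missing value
]
STORE_B: List[Record] = [
    {"customer": "c3", "item": "milk", "city": "Austin"},
    {"customer": "c4", "item": "bread", "city": "Austin"},
]


def integrate(*sources: List[Record]) -> List[Record]:
    """Data integration: merge records from several sources."""
    return [dict(record) for source in sources for record in source]


def clean(records: List[Record]) -> List[Record]:
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def select(records: List[Record], attributes: List[str]) -> List[Record]:
    """Selection: keep only the task-relevant attributes."""
    return [{a: r[a] for a in attributes} for r in records]


def mine(records: List[Record], attribute: str) -> Counter:
    """Data mining (toy stand-in): count how often each value occurs."""
    return Counter(r[attribute] for r in records)


def evaluate(patterns: Counter, min_count: int) -> Dict[str, int]:
    """Pattern evaluation: keep only patterns seen at least min_count times."""
    return {value: n for value, n in patterns.items() if n >= min_count}


def run_pipeline(min_count: int = 2) -> Dict[str, int]:
    data = integrate(STORE_A, STORE_B)
    data = clean(data)
    data = select(data, ["item"])
    return evaluate(mine(data, "item"), min_count)


if __name__ == "__main__":
    print(run_pipeline())
```

Keeping each stage as its own function mirrors the slide's point that data mining proper is only one step of the process; the surrounding steps can be changed or re-run independently.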
Steps of a KDD Process
• Learning the application domain: relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing
• Data reduction and transformation (the first 4 steps may take 75% of effort!): find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining: summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge

Data Mining and Business Intelligence
[Figure: pyramid with increasing potential to support business decisions from bottom to top. Layers, top to bottom:]
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown beside the pyramid, top to bottom: End User, Business Analyst, Data Analyst, DBA

Are All the “Discovered” Patterns Interesting?
• A data mining system/query may generate thousands of patterns; not all of them are interesting.
  • Suggested approach: human-centered, query-based, focused mining
• Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures:
  • Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  • Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.
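To make the objective interestingness measures named above more concrete, here is a minimal Python sketch that computes support and confidence for a rule X → Y using the standard definitions from association analysis. The toy transactions, item names, and function names are invented for illustration and do not come from the course material.

```python
from typing import List, Set

# Hypothetical market-basket transactions (invented for illustration only).
TRANSACTIONS: List[Set[str]] = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def count(itemset: Set[str], transactions: List[Set[str]]) -> int:
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset: Set[str], transactions: List[Set[str]]) -> float:
    """Fraction of all transactions that contain itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(x: Set[str], y: Set[str], transactions: List[Set[str]]) -> float:
    """Among transactions containing X, the fraction that also contain Y."""
    x_count = count(x, transactions)
    if x_count == 0:
        return 0.0  # rule never applies; report 0 rather than divide by zero
    return count(x | y, transactions) / x_count


if __name__ == "__main__":
    rule_x, rule_y = {"bread"}, {"milk"}
    print("support:", support(rule_x | rule_y, TRANSACTIONS))
    print("confidence:", confidence(rule_x, rule_y, TRANSACTIONS))
```

In this toy data, bread and milk occur together in 3 of 5 transactions and bread occurs in 4, so the rule bread → milk has support 3/5 and confidence 3/4. Subjective measures such as unexpectedness depend on the user's prior beliefs and cannot be computed from the data alone in this way.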
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, and High-Performance Computing]

KDD Process: A Typical View from ML and Statistics
• Input Data
• Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
• Data Mining: association analysis, classification, clustering, outlier analysis, summary generation, …
• Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
This is a view from typical machine learning and statistics communities.

Data Mining Competitions
• Netflix Prize: http://www.netflixprize.com//index
• KDD Cup 2009: http://www.kddcuporange.com/
• KDD Cup 2011: http://www.kdd.org/kdd2011/kddcup.shtml

Summary
• Data mining: discovering interesting patterns from large amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
• Mining can be performed in a variety of information repositories
• Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
• Classification of data mining systems

COSC 6335 in a Nutshell
[Figure: pipeline Preprocessing → Data Mining → Post Processing, labeled with Association Analysis, Clustering, Classification & Prediction, Pattern Evaluation, Visualization, and Summarization]
Prerequisites
The course is basically self-contained; however, the following skills are important to be successful in taking this course:
• Basic knowledge of programming
  • Java or a language of your own choice, together with data mining tools, will be used in the programming projects; basic knowledge of Java is sufficient!
• Basic knowledge of statistics
• Basic knowledge of data structures

Course Objectives
Students
• will know what the goals and objectives of data mining are
• will have a basic understanding of how to conduct a data mining project
• will obtain practical experience in data analysis and making sense out of data
• will have sound knowledge of popular classification techniques, such as decision trees, support vector machines and nearest-neighbor approaches
• will know the most important association analysis techniques
• will have detailed knowledge of popular clustering algorithms, such as K-means, DBSCAN, grid-based, hierarchical and supervised clustering
• will have some knowledge of R, an open source statistics/data mining environment
• will obtain practical experience in designing data mining algorithms and in applying data mining techniques to real world data sets
• will have some exposure to more advanced topics, such as sequence mining, spatial data mining, and web page ranking algorithms

Data Mining Course Organization
I Introduction to Data Mining and Data Mining Basics (Chapter 1 and 2.1)
II Exploratory Data Analysis (Chapter 3) moved!
III Introduction to Classification --- Basic Concepts and Decision Trees (Chapter 4)
IV Introduction to Similarity Assessment and Clustering (Other material 2.3 and Chapter 8 in part)
V Introduction to Data Cubes (Section 3.4) moved!
VI Association Analysis (Chapter 6)
VII Spatial Data Mining
VIII More on Classification: Regression, Instance-based Learning and Support Vector Machines (Chapter 5)
IX Data Preprocessing, Data Cubes, and Data Warehouses (Chapter 2 and …)
X More on Clustering (Chapter 8 and Chapter 9 in part)
XI Sequence and Graph Mining (Chapter 7 in part)
XII PageRank and other Top 10 Data Mining Algorithms (Journal Paper)
XIII Final Words

Order of Coverage
Introduction → Exploratory Data Analysis → Similarity Assessment → Clustering → Association Analysis → Classification → Spatial Data Mining → More on Classification → OLAP and Data Warehousing → Preprocessing → More on Clustering → Sequence and Graph Mining → Top 10 Data Mining Algorithms → Summary
Also: an introductory tutorial on R (2-3 classes)

R will be used for most course projects; the exception is the third project, which will use spatial clustering algorithms that are part of Cougar^2. The bad news is that it is more challenging to get started with R compared to Weka (but Weka is a "dead" language), although you should be okay after you have used R for some weeks. On the other hand, the good news about R is that it continues to grow quickly in popularity. A recent poll at KDnuggets found that 34% of respondents do at least half of their data mining in R. Although it's a domain-specific language, it's versatile. As we have not used R in the course before, we expect some startup problems and ask you for your patience. On the positive side, knowing R will be a plus when conducting research projects and when looking for jobs after you graduate, due to R's completeness and R's rising popularity.

Where to Find References?
• Data mining and KDD
  • Conference proceedings: ICDM, KDD, PKDD, PAKDD, SDM, ADMA, etc.
  • Journal: Data Mining and Knowledge Discovery
• Database field (SIGMOD member CD ROM)
  • Conference proceedings: VLDB, ICDE, ACM-SIGMOD, CIKM
  • Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, etc.
• AI and Machine Learning
  • Conference proceedings: ICML, AAAI, IJCAI, ECML, etc.
  • Journals: Machine Learning, Artificial Intelligence, etc.
• Statistics
  • Conference proceedings: Joint Stat. Meeting, etc.
  • Journals: Annals of Statistics, etc.
• Visualization
  • Conference proceedings: CHI, etc.
  • Journals: IEEE Trans. Visualization and Computer Graphics, etc.

Textbooks
• Required Text: P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining, Addison Wesley
• Mildly Recommended Text: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, second edition

Tentative Schedule for the First Half of the Fall 2011 Semester
• Exams: October 25, December 6
• Reviews:
• Aug. 23+25: Introduction to DM
• August 30: Exploratory Data Analysis (Dr. Chen)
• September 1+22: Lab (Zechun Cao)
• September 6+8+15+20: Clustering I
• September 27+29+Oct. 4: Association Analysis
• October 6+11+13: Classification and Prediction
• October 18+20: Spatial Data Mining
• October 25: Midterm Exam
• October 27+Nov. 1: More on Classification and Prediction

2011 Course Projects
• Project 1: Exploratory Data Analysis
• Project 2: Traditional Clustering with K-means and DBSCAN
• Project 3: Spatial Clustering with CLEVER
• Project 4: Group Project (different topics, no programming)
• Project 5: TBDL (something with SVMs and/or regression)

TA/Students of my Research Group
Duties:
1. Grading of programming projects, homework, and exams (in part)
2. Run 2/3 labs
3. Help students with homework, programming projects and problems with the course material
4. Teach a class (two to three times)
Office:
Office Hours:
E-mail:
Meet our TA: Thursday

Web
• Course Webpage (http://www2.cs.uh.edu/~ceick/DM/DM11.html)
• UH-DMML Webpage (http://www2.cs.uh.edu/~UH-DMML/index.html)

Where to Find References? DBLP, CiteSeer, Google
• Data mining and KDD (SIGKDD: CDROM)
  • Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  • Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
• Database systems (SIGMOD: ACM SIGMOD Anthology, CD ROM)
  • Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  • Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
• AI & Machine Learning
  • Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
  • Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
• Web and IR
  • Conferences: SIGIR, WWW, CIKM, etc.
  • Journals: WWW: Internet and Web Information Systems
• Statistics
  • Conferences: Joint Stat. Meeting, etc.
  • Journals: Annals of Statistics, etc.
• Visualization
  • Conference proceedings: CHI, ACM-SIGGraph, etc.
  • Journals: IEEE Trans. Visualization and Computer Graphics, etc.

Teaching Philosophy and Advice
• The first 8 weeks will give a basic introduction to data mining and follow the textbook somewhat closely.
• Read the sections of the textbook before you come to the lecture; if you work continuously for the class you will do better and lectures will be more enjoyable. Starting to review the material that is covered in this class 1 week before the next exam is not a good idea.
• Do not be afraid to ask questions!
I really like interactions with students in the lectures… If you do not understand something at all, send me an e-mail before the next lecture!
• If you have a serious problem, talk to me before the problem gets out of hand.

Course Planning for Research in Data Mining
• This course, “Data Mining”
• I also suggest taking at least one, preferably two, of the following courses: Pattern Classification (COSC 6343), Artificial Intelligence (COSC 6368), and Machine Learning (COSC 6342).
• Moreover, having basic knowledge in data structures, software design, and databases is important when conducting data mining projects; therefore, taking COSC 6320, COSC 6318 and COSC 6340 is a good choice.
• Moreover, taking a course that teaches high performance computing is also a good choice, because data mining algorithms are very time consuming.
• Because a lot of data mining projects have to deal with images, I suggest taking at least one of the many biomedical image processing courses that are offered in our curriculum.
• Finally, having knowledge in evolutionary computing, data visualization, statistics, solving optimization problems, and GIS (geographical information systems) is a plus!