
Special Topics in Database Systems
Martin Ester
Simon Fraser University, School of Computing Science
CMPT 884, Spring 2009

CMPT 884, SFU, Martin Ester, 1-09

Introduction

[Fayyad, Piatetsky-Shapiro & Smyth 1996]
Knowledge discovery in databases (KDD) is the process of (semi-)automatic extraction of knowledge from databases which is
• valid
• previously unknown
• and potentially useful.

Remarks
• (semi-)automatic: distinction from manual analysis / OLAP. Typically, some user interaction is necessary.
• valid: in the statistical sense.
• previously unknown: not explicit, no "common sense knowledge".
• potentially useful: for some given application.

Introduction

Statistics [Hand, Mannila & Smyth 2001]
• representation of uncertainty
• model-based inferences
• focus on numeric data

Machine Learning [Mitchell 1997]
• knowledge representation
• search strategies
• focus on symbolic data

Database Systems [Han & Kamber 2000]
• data management
• integration of data mining with DBS
• scalability for large databases

Introduction

KDD Process [Han & Kamber 2000]
[Figure: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]

KDD Process [Fayyad, Piatetsky-Shapiro & Smyth 1996]
[Figure: Database → Focussing → Preprocessing → Transformation → Data Mining → Pattern → Evaluation → Knowledge]

Data Mining

Definition [Fayyad, Piatetsky-Shapiro & Smyth 1996]
• Data Mining is the application of efficient algorithms to determine the patterns contained in some database.

Data-Mining Tasks
• clustering
• classification
• association rules (e.g., A and B → C)
• generalisation
• other tasks: regression, outlier detection, ...
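
The association rule "A and B → C" listed among the tasks above can be made concrete with a small sketch. The slides do not define how such a rule is scored; the example assumes the commonly used support and confidence measures, and the function names and the toy transactions are hypothetical, chosen only for illustration.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Estimate of P(consequent | antecedent) from the transactions."""
    left = frozenset(antecedent)
    both = left | frozenset(consequent)
    support_left = support(transactions, left)
    if support_left == 0:
        return 0.0  # rule never applies, so report zero confidence
    return support(transactions, both) / support_left


# Toy data (assumed): each transaction is a set of item labels.
transactions = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B"}),
    frozenset({"A", "B", "C"}),
    frozenset({"B", "C"}),
    frozenset({"A", "C"}),
]

# Rule "A and B -> C": 2 of 5 transactions contain A, B and C;
# 3 of 5 contain A and B, so the confidence is 2/3.
rule_support = support(transactions, {"A", "B", "C"})
rule_confidence = confidence(transactions, {"A", "B"}, {"C"})
```

A mining algorithm would typically enumerate candidate rules and keep those above user-chosen support and confidence thresholds; this sketch only scores one given rule.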
Trends in KDD Research

KDD 2000 Conference
• New Data Mining Algorithms
• Efficiency and Scalability of Data Mining Algorithms
• Interactive Data Exploration
• Visualization
• Constraints and Evaluation in the KDD Process

Trends in KDD Research

KDD 2002 Conference
• Statistical Methods
• Frequent Patterns
• Streams and Time Series
• Visualization
• Web Search and Navigation
• Text and Web Page Classification
• Intrusion and Privacy
• Applications

Trends in KDD Research

KDD 2004 Conference
• Frequent Patterns / Association Rules
• Clustering
• Mining Spatio-Temporal Data
• Mining Data Streams
• Dimensionality Reduction
• Privacy-Preserving Data Mining
• Mining Biological Data
• Applications (Web, biological data, security, ...)

Trends in KDD Research

KDD 2006 Conference
• Clustering
• Classification / supervised ML
• Privacy
• Web / Graph Mining
• Web / Text Mining
• Frequent Pattern Mining
• Structured Data

Trends in KDD Research

KDD 2008 Conference
• Text Mining
• Data Integration
• Social Networks
• Graph Mining
• Distance Functions and Metric Learning
• Active and Semi-supervised Learning
• Pattern Mining
• Collaborative Filtering

Trends in KDD Research

Some Hot Topics
• Social Networks
  THE hot topic of KDD 08 → topic of the only panel
• Graph mining
• Text mining and information extraction / integration
• Collaborative Filtering
  more generally, recommender systems → $1M Netflix Prize

Overview of this Course

Prerequisites
Foundations of database systems and statistics
Introductory graduate data mining course or equivalent

Objectives
• Introduction to some hot topics of data mining research
• Training in research methodology
• Presentation skills
start thesis work after this class!
Overview of this Course

Topics
• Graph mining
  social network analysis and analysis of biological networks as driving applications
• Recommender systems
  in particular trust-based recommendation
• Information extraction and integration
  integration with existing databases

Overview of this Course

Format
• Tutorial surveys by instructor
• Written research paper reviews by students
• Research paper presentations by students
  discussions in class
• Course research projects by students
  on a topic of their choice

Overview of this Course

Tentative Grading Scheme
• Paper review (20 %)
• Paper presentation (20 %)
• Course project report (40 %)
  two steps: project proposal, final project report
• Course project presentation (20 %)
→ marking criteria: originality, technical quality, presentation

Overview of this Course

Types of Course Projects
• Literature survey
  summarize the state-of-the-art and identify open research problems
• New problem
  introduce and analyze a new problem
• New algorithm for known problem
  implement and evaluate algorithm
• Improvement of existing algorithm
  implement and compare algorithm
• Comparison of existing algorithms on a new, interesting dataset
  identify criteria for choice of algorithms / open research problems

Graph Mining

Motivating Applications
• Social network analysis
  o What communities exist?
  o How does information about a new product spread?
  o What customers should be targeted to maximize the profit of a marketing campaign?
• Analysis of biological networks
  o What are the functional modules of an organism?
  o How do biological networks evolve in the course of time?
  o What protein should be targeted to inhibit some virulent bacteria?
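
To make the question "What communities exist?" concrete, here is a minimal sketch that scores a candidate split of a small graph into two groups. It assumes the usual normalized-cut criterion, Ncut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B), where cut counts the edges between the groups and vol sums the node degrees of a group. The toy graph, the function name, and both candidate splits are hypothetical.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[int, int]


def normalized_cut(edges: Iterable[Edge], group_a: Set[int], group_b: Set[int]) -> float:
    """Normalized cut of an undirected, unweighted graph split into two groups."""
    cut = 0    # edges with one endpoint in each group
    vol_a = 0  # sum of degrees of the nodes in group_a
    vol_b = 0  # sum of degrees of the nodes in group_b
    for u, v in edges:
        # Each edge adds one to the degree of both of its endpoints.
        for node in (u, v):
            if node in group_a:
                vol_a += 1
            elif node in group_b:
                vol_b += 1
        if (u in group_a and v in group_b) or (u in group_b and v in group_a):
            cut += 1
    if vol_a == 0 or vol_b == 0:
        raise ValueError("each group must contain at least one edge endpoint")
    return cut / vol_a + cut / vol_b


# Toy "social network" (assumed): two triangles joined by the single edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]

# Splitting along the bridge cuts one edge; mixing the triangles cuts five.
good_split = normalized_cut(edges, {1, 2, 3}, {4, 5, 6})
bad_split = normalized_cut(edges, {1, 4}, {2, 3, 5, 6})
```

A lower score indicates a better separation, so the split along the bridge edge is preferred here. The sketch only evaluates given splits; searching for the best split is the hard part and is not attempted.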
Graph Mining

Methods
• Frequent subgraph mining
  frequent pattern mining approach
• Graph clustering
  e.g., normalized cut, i.e., minimize the number of edges between graph components / clusters
• Graph generative models
  probabilistic models that generate graphs similar to real graphs / networks

Graph Mining

Challenges
• Complexity of graph algorithms
  o Many graph mining problems are NP-hard.
  o Real graphs tend to be extremely large.
  → need efficient algorithms
• Attribute data
  o Many graphs have attributes associated with the nodes.
  o Transformation into a weighted graph loses a lot of information.
  → need new models / algorithms considering relationship and attribute data

Recommender Systems

Motivating Applications
• Motivation
  o The internet provides a flood of information on all kinds of items.
  o There is a great need for personalized recommendations.
  o The internet also provides a wealth of item ratings / reviews.
• Typical applications
  o Movie recommendation
  o Product recommendation
  o Keyword recommendation

Recommender Systems

Methods
• Collaborative filtering
  o Uses only a database of user-item ratings.
  o Recommendation based on ratings by users with similar rating patterns.
• Content-based recommender systems
  o Use information about the content of items and / or the properties of users.
  o Recommend items that have content similar to items liked by the user.
• Trust-based recommender systems
  o Assume a social network / trust network. Trust can be defined explicitly or implicitly.
  o Recommendation based on ratings by trusted neighbors.

Recommender Systems

Challenges
• High dimensionality and sparsity of data
  o The overwhelming majority (> 99%) of user-item ratings is unknown.
  o Recommendation is especially hard for cold start users and controversial items.
  → dimensionality reduction, model-based methods, trust-based approach
• Fraud
  o Memory-based collaborative filtering can be easily manipulated by adding fraudulent ratings.
  → trust-based approach more robust to fraud
• Privacy issues with trust network data
  o Only very few trust networks are public domain.

Information Extraction and Integration

Motivating Applications
• Importance of unstructured text data
  o The overwhelming majority (>= 80%) of human generated information is not in structured form, but in unstructured text.
• Biomedical literature
  o Contains a wealth of valuable information that cannot be processed / searched automatically.
  o Extraction of entities and relationships such as proteins and their localizations.
• Online product reviews
  o A lot of product "reviews" are available online in community databases or blogs.
  o Companies want to know what customers think of their products.

Information Extraction and Integration

Methods
• Basic NLP methods
  o Part-of-speech tagging
  o Lexica, ontologies, ...
• Machine learning methods
  o Typically, supervised classification.
  o CRFs and similar methods are state-of-the-art.
• Bootstrapping approach
  o Using a small labeled training dataset, find textual extraction patterns.
  o Using these patterns, extract further entities / relationships and continue.

Information Extraction and Integration

Challenges
• Text data is hard to understand
  o Many of the NLP problems are still essentially unsolved.
  → relatively simple NLP methods often sufficient for information extraction
• Portability across domains
  o Extraction methods need to be portable from one domain to another.
  o Knowledge engineering approach (domain expert defines rules) is labor-intensive and expensive.
  → machine learning methods
• Entity mentions need to be resolved
  o Information extraction produces strings referencing an entity of a given type.
  o Without mapping to known real world entities, extracted information is of limited usefulness.
  → need to integrate extracted information with existing databases

References

Graph mining
- X. Yan & Karsten Borgwardt, "Graph Mining and Graph Kernels", Tutorial KDD 08
- Jure Leskovec & Christos Faloutsos, "Mining Large Graphs: Models, Diffusion and Case Studies", Tutorial ECML/PKDD 2007

Recommender systems
- Joseph Konstan, "Introduction to Recommender Systems", Tutorial SIGMOD 2008

Information extraction and integration
- Eugene Agichtein & Sunita Sarawagi, "Scalable Information Extraction and Integration", Tutorial KDD 06
- AnHai Doan & Raghu Ramakrishnan & Shiv Vaithyanathan, "Managing Information Extraction", Tutorial SIGMOD 2006
