Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data Mining MTAT.03.183 (4AP = 6EAP) (4AP 6EAP) Introduction Jaak Vilo Jaak Vilo 2009 Fall Lecturer • • • • • 1986‐1991 U Tartu 1991‐1999 1991 1999 U Helsinki U Helsinki (sequence pattern discovery) 1999‐2002 EMBL‐EBI, UK (bioinformatics) 2002‐ EGeen ‐> Quretec (Biobank and Data Mgmnt) U Tartu professor (Bioinformatics) 2007 U Tartu, professor (Bioinformatics) 2007 – EXCS – Center of Excellence – STACC – Software Technologies and Applications Competence Center (Tarkvara TAK) – research projects Jaak Vilo and other authors UT: Data Mining 2009 2 Students • • • • • • >80 registered Estonian vs Foreign Estonian vs MSc 1st y / 2nd y ? BSc , PhD ? Non IT/CS ? IT/CS ? Why this class? Expectations? (ESSCaSS’08,09…) Jaak Vilo and other authors UT: Data Mining 2009 3 Course • • • • • http://courses.cs.ut.ee/2009/dm/ List: [email protected] List: [email protected] Lectures: 10:15, Liivi 2‐403 Seminars: 12:15, Liivi 2‐403 Prof Jaak Vilo vilo@ut ee Prof. Jaak Vilo [email protected] • http://www.quicktopic.com/43/H/eWqhydvFpUN • Other? Skype O h ? Sk ? ? Jaak Vilo and other authors UT: Data Mining 2009 4 Seminars • Three types: 1. Homework: presentations/discussions p / 2. Guest lectures, visitors 3 Practical labs/training (no concrete 3. (no concrete plans yet) • Participation is obligatory (>75%) Jaak Vilo and other authors UT: Data Mining 2009 5 Grading requirements • • • • • • Participation! >75% of Participation! >75% of seminars Homeworks (30%) (min 50% of assignments) Projects/essays (30%) Exam (40%) Total: 100% + thresholds All deadlines are stringent. Jaak Vilo and other authors UT: Data Mining 2009 6 Homework • Tasks/assignments – 5 tasks/week + possibly bonuses – About in every 2 weeks (irregular) • Report/mark p / all completed p tasks – written reports on tasks – ready to present fully present fully to class – there will be some uploading system – and/or paper sheets in class • Deadline always before class start (Thu, 12:15) start (Thu, 12:15) Jaak Vilo and other authors UT: Data Mining 2009 7 4AP = 6EAP 4AP = 6EAP • 4 weeks (4x40h=160h) of intensive work – assumingg basic knowledge g of BSc material • 1/3 in 1/3 in class • 1/3 reading, homeworks • 1/3 projects, writing, … Jaak Vilo and other authors UT: Data Mining 2009 8 What is Data Mining? • Data ‐> Information, Knowledge, Insight – new, interesting, nontrivial, useful , g, , … • Data size ‐> Algorithmic challenge • Predictive, useful Predictive, useful ‐> theoretical theoretical and and economical challenge • Why? By y yp practical demand and need… Jaak Vilo and other authors UT: Data Mining 2009 9 Textbooks • • • • • Han, Kamber: Data Mining: Concepts and Techniques, Second Edition (The b d h d d ( h Morgan Kaufmann Series in Data Management Systems) Google Books web Chakrabarti et al. Data Mining: know it all. Morgan Kaufmann 2008 @ELsevier @AMazon @Google Bramer: Principles of Data Mining (Springer 2007) @Amazon @Springer Bramer: Principles of Data Mining (Springer, 2007) @Amazon @Google David J. Hand, Heikki Mannila and Padhraic Smyth: Principles of Data Mining (MIT Press, 2001) @MIT Press @Google Trevor Hastie, Robert Tibshirani, Jerome Friedman: The Elements of Statistical Learning: Data Mining Inference and Prediction (Springer Statistical Learning: Data Mining, Inference, and Prediction. (Springer 2009) @Tibshirani @Amazon Jaak Vilo and other authors UT: Data Mining 2009 10 • Han, Kamber: Data Mining: Concepts and T h i Techniques, Second S d Edition Edi i (The (Th Morgan M Kaufmann Series in Data Management Systems) • TOC: TOC: http://www.cs.uiuc.edu/homes/hanj/bk2/toc.pdf http://www cs uiuc edu/homes/hanj/bk2/toc pdf Jaak Vilo and other authors UT: Data Mining 2009 11 Jaak Vilo and other authors UT: Data Mining 2009 12 What’ss it all about? What it all about? Data DB Jaak Vilo and other authors UT: Data Mining 2009 13 • • • • • • • • Statistics Patterns in data Learning Classification Knowledge / Information / Information / / Algorithms Prediction … Jaak Vilo and other authors UT: Data Mining 2009 14 Sources of data (growth) Sources of data (growth) • • • • • • • • • devices net/web logs transactional db consumer multimedia(!) science cheaper storage, compute power … Jaak Vilo and other authors UT: Data Mining 2009 15 Why y Data Mining? g The Explosive Growth of Data: from terabytes to petabytes Data collection and data availability Automated data collection tools,, database systems, y , Web,, computerized society Major j sources of abundant data Business: Web, e-commerce, transactions, stocks, … Science: Remote sensing, bioinformatics, scientific simulation, … Society and everyone: news, digital cameras, YouTube We are drowning in data, data but starving for knowledge! “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets Jiawei Han, Micheline Kamber, and Jian Pei Data Mining: Concepts and Techniques 16 Evolution of Sciences Before 1600, empirical science 1600-1950s theoretical science 1600-1950s, 1950s 1990s computational 1950s-1990s, comp tational science Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.) Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models. 1990-now, data science The flood of data from new scientific instruments and simulations The ability to economically store and manage petabytes of data online The Internet and computing p g Grid that makes all these archives universallyy accessible Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding. Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge! Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002 Jiawei Han, Micheline Kamber, and Jian Pei Data Mining: Concepts and Techniques 17 Evolution of Database Technology gy 1960s: 1970s: Relational data model, model relational DBMS implementation 1980s: RDBMS, advanced data models ((extended-relational, OO, deductive, etc.)) Application-oriented DBMS (spatial, scientific, engineering, etc.) 1990s: Data collection, ll i database d b creation, i IMS S and d networkk DBMS S Data mining, data warehousing, multimedia databases, and Web databases 2000s Stream data management and mining Data mining and its applications Web technology (XML, data integration) and global information systems Jiawei Han, Micheline Kamber, and Jian Pei Data Mining: Concepts and Techniques 18 examples from Machine Learning • 1950’ies – checkers (Arthur Samuels 1959) • 1960 1960’ies ies – NN NN – perceptron and it and it’ss limitations • 1970’ies – expert systems, decision trees (ID3), … (ID3) • 1980’ies – Neural Networks, PAC learning, … , g, • 1990’ies – Data mining, ILP, Ensembles • 2000’ – SVM, Kernels, Graphical Models, … Jaak Vilo and other authors UT: Data Mining 2009 19 Chapter 1. Introduction Why Data Mining? What Is Data Mining? A Multi-Dimensional View of Data Mining Data Mining Functionalities: What Kinds of Patterns Can Be Mined? Data Mining: On What Kind of data? Time and Ti d Ordering: Od i Sequential S ti l Pattern, P tt Trend T d and d Evolution E l ti Analysis Structure and Network Analysis Evaluation of Knowledge Applications pp of Data Mining g Major Challenges in Data Mining A Brief History of Data Mining and Data Mining Society Summary Jiawei Han, Micheline Kamber, and Jian Pei Data Mining: Concepts and Techniques 20 What Is Data Mining? Data mining (knowledge discovery from data) Extraction of interesting (non-trivial, implicit, previously u unknown o and a d po potentially a y useful) u u ) pa patterns or o knowledge o dg from o huge amount of data Alternative names Data mining: a misnomer? Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. W t h out: Watch t IIs everything thi “d “data t mining”? i i ”? Simple search and query processing (D d ti ) expertt systems (Deductive) t Jiawei Han, Micheline Kamber, and Jian Pei Data Mining: Concepts and Techniques 21 Knowledge g Discovery y ((KDD)) Process This is a view from typical database systems and data Pattern Evaluation warehousing communities Data mining plays an essential l in i the th knowledge k l d discovery di role Data Mining process T k l Task-relevant t Data D t Selection D t W h Data Warehouse Data Cleaning Data Integration Databases Jiawei Han, Micheline Kamber, and Jian Pei Data Mining: Concepts and Techniques 22 Example: A Web Mining Framework Web mining usually involves Data cleaning Data integration from multiple sources Warehousing the data Data cube construction Data selection for data mining g Data mining Presentation of the mining results Patterns and knowledge to be used or stored into knowledge-base knowledge base Jiawei Han, Micheline Kamber, and Jian Pei Data Mining: Concepts and Techniques 23 Data Mining g in Business Intelligence g Increasing potential to support business decisions Decision M ki Making Data Presentation Visualization Techniques End User Business Analyst Data Mining Information Discovery Data Analyst y Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems Jiawei Han, Micheline Kamber, and Jian Pei Data Mining: Concepts and Techniques DBA 24 Collaborative filtering Collaborative filtering – Amazon, Netflicks • Collaborative filtering systems usually take two steps: – Look for users who share the same rating patterns with the active user (the user whom the prediction is for). – Use the ratings from those like‐minded users found in step 1 to calculate a prediction for the active user Jaak Vilo and other authors UT: Data Mining 2009 25 Jaak Vilo and other authors UT: Data Mining 2009 26 Netflix prize http://www.netflixprize.com/ • http://en.wikipedia.org/wiki/Netflix_Prize 18K movies i 480K customers ~ 100M ratings ???? Jaak Vilo and other authors UT: Data Mining 2009 Test on 2.8M witheld ratings 27 Social network Social network • Graph of connections • Social network mining Jaak Vilo and other authors UT: Data Mining 2009 28 Web • Interlinked web sites and pages • Directed Graph of links • Information Retrieval, PageRank Retrieval PageRank • Web mining Jaak Vilo and other authors UT: Data Mining 2009 29 Web usage mining Web usage mining • Software and web usage logs • Typical use patterns • User groups, their groups their preferences, behavior preferences behavior • Can you predict their goals and help to achieve them? – distributed online transactions, queries, … (Google, etc) Jaak Vilo and other authors UT: Data Mining 2009 30 Biomedical data mining Biomedical data mining • Analyse: – DNA, , – Genotype information – disease histories – find associated genes – predict and classify diseases and outcomes – discover “how biology gy works” –… Jaak Vilo and other authors UT: Data Mining 2009 31 Combinatorial Data Mining Algorithms Combinatorial Data Mining Algorithms (research seminar, Sven Laur, PhD) Basics ideas and techniques – How to find frequent sets in databases H t fi d f t t i d t b – How to find frequent motifs in sequences Algorithmic problems – Depth‐first vs breath first search – How to avoid combinatorial explosion p Interpretation of results – Which patterns are important enough? Combinatorial Data Mining Algorithms (research seminar, Sven Laur, PhD.) Other important aspects – How to handle noisy data – Random sampling vs linear scan Applications and extensions Applications and extensions – Association rules in practice – Log analysis. Episode rules and usability Log analysis Episode rules and usability – Graph mining and biochemistry Combinatorial Data Mining Algorithms (research seminar, Sven Laur, PhD) Administrative details • • • • • • • Combinatorial Data Mining Algorithms C bi t i l D t Mi i Al ith Gives 3 EAP (2 old AP) Takes place on Wednesdays in L122 First seminar is on 16th of September Each participant has to give a presentation Project work is combined with DM course http://courses.cs.ut.ee/2009/fast‐counting/ Research at U Tartu Research at U Tartu • BIIT – http://biit.cs.ut.ee/ • STACC – Software Technologies and A li i Applications Competence Center C C – companies and universities – Skype, Regio, Delfi, Quretec, … – Research problems, topics, scholarships Jaak Vilo and other authors UT: Data Mining 2009 35 Research topics Research topics • Publications =>> Projects, funding Projects, funding • Relevant to STACC, companies • Can lead to job offers Jaak Vilo and other authors UT: Data Mining 2009 36 UT CS department UT CS department • Job offers: • courses.cs.ut.ee ‐ web site development – UT CS department courses web development – Other sysdamin y and Department p development p tasks –… Jaak Vilo and other authors UT: Data Mining 2009 37