Download What Is Data Mining?

Data Mining MTAT.03.183 (4AP = 6EAP) (4AP 6EAP) Introduction Jaak Vilo Jaak Vilo 2009 Fall Lecturer • • • • • 1986‐1991 U Tartu 1991‐1999 1991 1999 U Helsinki U Helsinki (sequence pattern discovery) 1999‐2002 EMBL‐EBI, UK (bioinformatics) 2002‐ EGeen ‐> Quretec (Biobank and Data Mgmnt) U Tartu professor (Bioinformatics) 2007 U Tartu, professor (Bioinformatics) 2007 – EXCS – Center of Excellence – STACC – Software Technologies and Applications Competence Center (Tarkvara TAK) – research projects Jaak Vilo and other authors UT: Data Mining 2009 2 Students • • • • • • >80 registered Estonian vs Foreign Estonian vs MSc 1st y / 2nd y ? BSc , PhD ? Non IT/CS ? IT/CS ? Why this class? Expectations? (ESSCaSS’08,09…) Jaak Vilo and other authors UT: Data Mining 2009 3 Course • • • • • http://courses.cs.ut.ee/2009/dm/ List: [email protected] List: [email protected] Lectures: 10:15, Liivi 2‐403 Seminars: 12:15, Liivi 2‐403 Prof Jaak Vilo vilo@ut ee Prof. Jaak Vilo [email protected] • http://www.quicktopic.com/43/H/eWqhydvFpUN • Other? Skype O h ? Sk ? ? Jaak Vilo and other authors UT: Data Mining 2009 4 Seminars • Three types: 1. Homework: presentations/discussions p / 2. Guest lectures, visitors 3 Practical labs/training (no concrete 3. (no concrete plans yet) • Participation is obligatory (>75%) Jaak Vilo and other authors UT: Data Mining 2009 5 Grading requirements • • • • • • Participation! >75% of Participation! >75% of seminars Homeworks (30%) (min 50% of assignments) Projects/essays (30%) Exam (40%) Total: 100% + thresholds All deadlines are stringent. Jaak Vilo and other authors UT: Data Mining 2009 6 Homework • Tasks/assignments – 5 tasks/week + possibly bonuses – About in every 2 weeks (irregular) • Report/mark p / all completed p tasks – written reports on tasks – ready to present fully present fully to class – there will be some uploading system – and/or paper sheets in class • Deadline always before class start (Thu, 12:15) start (Thu, 12:15) Jaak Vilo and other authors UT: Data Mining 2009 7 4AP = 6EAP 4AP = 6EAP • 4 weeks (4x40h=160h) of intensive work – assumingg basic knowledge g of BSc material • 1/3 in 1/3 in class • 1/3 reading, homeworks • 1/3 projects, writing, … Jaak Vilo and other authors UT: Data Mining 2009 8 What is Data Mining? • Data ‐> Information, Knowledge, Insight – new, interesting, nontrivial, useful , g, , … • Data size ‐> Algorithmic challenge • Predictive, useful Predictive, useful ‐> theoretical theoretical and and economical challenge • Why? By y yp practical demand and need… Jaak Vilo and other authors UT: Data Mining 2009 9 Textbooks • • • • • Han, Kamber: Data Mining: Concepts and Techniques, Second Edition (The b d h d d ( h Morgan Kaufmann Series in Data Management Systems) Google Books web Chakrabarti et al. Data Mining: know it all. Morgan Kaufmann 2008 @ELsevier @AMazon @Google Bramer: Principles of Data Mining (Springer 2007) @Amazon @Springer Bramer: Principles of Data Mining (Springer, 2007) @Amazon @Google David J. Hand, Heikki Mannila and Padhraic Smyth: Principles of Data Mining (MIT Press, 2001) @MIT Press @Google Trevor Hastie, Robert Tibshirani, Jerome Friedman: The Elements of Statistical Learning: Data Mining Inference and Prediction (Springer Statistical Learning: Data Mining, Inference, and Prediction. (Springer 2009) @Tibshirani @Amazon Jaak Vilo and other authors UT: Data Mining 2009 10 • Han, Kamber: Data Mining: Concepts and T h i Techniques, Second S d Edition Edi i (The (Th Morgan M Kaufmann Series in Data Management Systems) • TOC: TOC: http://www.cs.uiuc.edu/homes/hanj/bk2/toc.pdf http://www cs uiuc edu/homes/hanj/bk2/toc pdf Jaak Vilo and other authors UT: Data Mining 2009 11 Jaak Vilo and other authors UT: Data Mining 2009 12 What’ss it all about? What it all about? Data DB Jaak Vilo and other authors UT: Data Mining 2009 13 • • • • • • • • Statistics Patterns in data Learning Classification Knowledge / Information / Information / / Algorithms Prediction … Jaak Vilo and other authors UT: Data Mining 2009 14 Sources of data (growth) Sources of data (growth) • • • • • • • • • devices net/web logs transactional db consumer multimedia(!) science cheaper storage, compute power … Jaak Vilo and other authors UT: Data Mining 2009 15 Why y Data Mining? g  The Explosive Growth of Data: from terabytes to petabytes  Data collection and data availability  Automated data collection tools,, database systems, y , Web,, computerized society  Major j sources of abundant data  Business: Web, e-commerce, transactions, stocks, …  Science: Remote sensing, bioinformatics, scientific simulation, …  Society and everyone: news, digital cameras, YouTube  We are drowning in data, data but starving for knowledge!  “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets Jiawei Han, Micheline Kamber, and Jian Pei Data Mining: Concepts and Techniques 16 Evolution of Sciences  Before 1600, empirical science  1600-1950s theoretical science 1600-1950s,   1950s 1990s computational 1950s-1990s, comp tational science    Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.) Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models. 1990-now, data science  The flood of data from new scientific instruments and simulations  The ability to economically store and manage petabytes of data online  The Internet and computing p g Grid that makes all these archives universallyy accessible   Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding. Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge! Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002 Jiawei Han, Micheline Kamber, and Jian Pei Data Mining: Concepts and Techniques 17 Evolution of Database Technology gy  1960s:   1970s:    Relational data model, model relational DBMS implementation 1980s:  RDBMS, advanced data models ((extended-relational, OO, deductive, etc.))  Application-oriented DBMS (spatial, scientific, engineering, etc.) 1990s:   Data collection, ll i database d b creation, i IMS S and d networkk DBMS S Data mining, data warehousing, multimedia databases, and Web databases 2000s  Stream data management and mining  Data mining and its applications  Web technology (XML, data integration) and global information systems Jiawei Han, Micheline Kamber, and Jian Pei Data Mining: Concepts and Techniques 18 examples from Machine Learning • 1950’ies – checkers (Arthur Samuels 1959) • 1960 1960’ies ies – NN NN – perceptron and it and it’ss limitations • 1970’ies – expert systems, decision trees (ID3), … (ID3) • 1980’ies – Neural Networks, PAC learning, … , g, • 1990’ies – Data mining, ILP, Ensembles • 2000’ – SVM, Kernels, Graphical Models, … Jaak Vilo and other authors UT: Data Mining 2009 19 Chapter 1. Introduction  Why Data Mining?  What Is Data Mining?  A Multi-Dimensional View of Data Mining  Data Mining Functionalities: What Kinds of Patterns Can Be Mined?  Data Mining: On What Kind of data?  Time and Ti d Ordering: Od i Sequential S ti l Pattern, P tt Trend T d and d Evolution E l ti Analysis  Structure and Network Analysis  Evaluation of Knowledge  Applications pp of Data Mining g  Major Challenges in Data Mining  A Brief History of Data Mining and Data Mining Society  Summary Jiawei Han, Micheline Kamber, and Jian Pei Data Mining: Concepts and Techniques 20 What Is Data Mining?  Data mining (knowledge discovery from data)  Extraction of interesting (non-trivial, implicit, previously u unknown o and a d po potentially a y useful) u u ) pa patterns or o knowledge o dg from o huge amount of data   Alternative names   Data mining: a misnomer? Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. W t h out: Watch t IIs everything thi “d “data t mining”? i i ”?  Simple search and query processing  (D d ti ) expertt systems (Deductive) t Jiawei Han, Micheline Kamber, and Jian Pei Data Mining: Concepts and Techniques 21 Knowledge g Discovery y ((KDD)) Process   This is a view from typical database systems and data Pattern Evaluation warehousing communities Data mining plays an essential l in i the th knowledge k l d discovery di role Data Mining process T k l Task-relevant t Data D t Selection D t W h Data Warehouse Data Cleaning Data Integration Databases Jiawei Han, Micheline Kamber, and Jian Pei Data Mining: Concepts and Techniques 22 Example: A Web Mining Framework  Web mining usually involves  Data cleaning  Data integration from multiple sources  Warehousing the data  Data cube construction  Data selection for data mining g  Data mining  Presentation of the mining results  Patterns and knowledge to be used or stored into knowledge-base knowledge base Jiawei Han, Micheline Kamber, and Jian Pei Data Mining: Concepts and Techniques 23 Data Mining g in Business Intelligence g Increasing potential to support business decisions Decision M ki Making Data Presentation Visualization Techniques End User Business Analyst Data Mining Information Discovery Data Analyst y Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems Jiawei Han, Micheline Kamber, and Jian Pei Data Mining: Concepts and Techniques DBA 24 Collaborative filtering Collaborative filtering – Amazon, Netflicks • Collaborative filtering systems usually take two steps: – Look for users who share the same rating patterns with the active user (the user whom the prediction is for). – Use the ratings from those like‐minded users found in step 1 to calculate a prediction for the active user Jaak Vilo and other authors UT: Data Mining 2009 25 Jaak Vilo and other authors UT: Data Mining 2009 26 Netflix prize http://www.netflixprize.com/ • http://en.wikipedia.org/wiki/Netflix_Prize 18K movies i 480K customers ~ 100M ratings ???? Jaak Vilo and other authors UT: Data Mining 2009 Test on 2.8M witheld ratings 27 Social network Social network • Graph of connections • Social network mining Jaak Vilo and other authors UT: Data Mining 2009 28 Web • Interlinked web sites and pages • Directed Graph of links • Information Retrieval, PageRank Retrieval PageRank • Web mining Jaak Vilo and other authors UT: Data Mining 2009 29 Web usage mining Web usage mining • Software and web usage logs • Typical use patterns • User groups, their groups their preferences, behavior preferences behavior • Can you predict their goals and help to achieve them? – distributed online transactions, queries, … (Google, etc) Jaak Vilo and other authors UT: Data Mining 2009 30 Biomedical data mining Biomedical data mining • Analyse: – DNA, , – Genotype information – disease histories – find associated genes – predict and classify diseases and outcomes – discover “how biology gy works” –… Jaak Vilo and other authors UT: Data Mining 2009 31 Combinatorial Data Mining Algorithms Combinatorial Data Mining Algorithms (research seminar, Sven Laur, PhD) Basics ideas and techniques – How to find frequent sets in databases H t fi d f t t i d t b – How to find frequent motifs in sequences Algorithmic problems – Depth‐first vs breath first search – How to avoid combinatorial explosion p Interpretation of results – Which patterns are important enough? Combinatorial Data Mining Algorithms (research seminar, Sven Laur, PhD.) Other important aspects – How to handle noisy data – Random sampling vs linear scan Applications and extensions Applications and extensions – Association rules in practice – Log analysis. Episode rules and usability Log analysis Episode rules and usability – Graph mining and biochemistry Combinatorial Data Mining Algorithms (research seminar, Sven Laur, PhD) Administrative details • • • • • • • Combinatorial Data Mining Algorithms C bi t i l D t Mi i Al ith Gives 3 EAP (2 old AP) Takes place on Wednesdays in L122 First seminar is on 16th of September Each participant has to give a presentation Project work is combined with DM course http://courses.cs.ut.ee/2009/fast‐counting/ Research at U Tartu Research at U Tartu • BIIT – http://biit.cs.ut.ee/ • STACC – Software Technologies and A li i Applications Competence Center C C – companies and universities – Skype, Regio, Delfi, Quretec, … – Research problems, topics, scholarships Jaak Vilo and other authors UT: Data Mining 2009 35 Research topics Research topics • Publications =>> Projects, funding Projects, funding • Relevant to STACC, companies • Can lead to job offers  Jaak Vilo and other authors UT: Data Mining 2009 36 UT CS department UT CS department • Job offers: • courses.cs.ut.ee ‐ web site development – UT CS department courses web development – Other sysdamin y and Department p development p tasks –… Jaak Vilo and other authors UT: Data Mining 2009 37

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download What Is Data Mining?