DATA MINING II - 1DL460
Spring 2015
A second course in data mining
http://www.it.uu.se/edu/course/homepage/infoutv2/vt15
Kjell Orsborn
Uppsala Database Laboratory
Department of Information Technology, Uppsala University, Uppsala, Sweden
Kjell Orsborn - UDBL - IT - UU 20/04/15

Personnel
• Kjell Orsborn, lecturer, examiner
  – email: [email protected], phone: 471 5154, room: 116, ITC building 19
• Tore Risch, lecturer
  – email: [email protected], phone: 471 6342, room: 137, ITC building 19
• Emil Jansson, course assistant
  – email: [email protected], room: 138, ITC building 19
• Michelle Brundin, course assistant
  – email: [email protected], room: 138, ITC building 19

Preliminary course contents
• Lecture topics:
  – Course intro: overview of topics in data mining 2
  – Web mining
  – Search engines
  – Sequential association analysis
  – Alt. association analysis
  – Visual data exploration
  – Cluster validation
  – Advanced clustering methods:
    • Chameleon
    • CURE
    • BIRCH
    • (SNN, ROCK, Jarvis-Patrick)
    • Large-scale clustering methods
  – Stream data mining
  – Privacy-preserving data mining
  – Outlier detection
  – Additional topics if time permits:
    • Spatial data mining
    • More on large-scale data mining
• Invited guest lectures

Course contents continued …
• Assignments:
  – Assignment 1: Web mining (HITS / PageRank)
  – Assignment 2: Implementation of association rule mining
  – Assignment 3: Implementation of scalable K-means

Examination
• Written examination: grades 3, 4 and 5
• Assignments: all 3 assignments must be passed

Introduction to Data Mining II (Tan, Steinbach, Kumar ch. 1)
Kjell Orsborn
Department of Information Technology
Uppsala University, Uppsala, Sweden

Data Mining
• The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions (Simoudis, 1996).
  – Involves analyzing data and using software techniques to find hidden and unexpected patterns and relationships in sets of data, in contrast to information and knowledge that are already intuitive.
  – Patterns and relationships are identified by examining the underlying rules and features in the data.
  – Tends to work from the data up; reliable conclusions normally require large volumes of data.
  – Data mining can provide huge paybacks for companies that have made a significant investment in data warehousing.
  – A relatively new technology, but already used in a number of industries.

Historic view of data mining
[Figure: historic view of data mining. Source: Han et al., 2006.]

The data mining process
• Data cleaning (to remove noise and inconsistent data)
• Data integration (where multiple data sources may be combined)
• Data selection (where data relevant to the analysis task are retrieved from the database)
• Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations)
• Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
• Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
• Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
[Figure: the data mining process. Figure labels: Database, File, Cleaning & Integration, Data Warehouse, Selection & Transformation, Data Mining, Patterns, Evaluation & Presentation, Knowledge.]

Why data mining?
• "There was 5 exabytes of information created between the dawn of civilization through 2003," Schmidt said, "but that much information is now created every 2 days, and the pace is increasing... People aren't ready for the technology revolution that's going to happen to them...." (Eric Schmidt, Google)

The information explosion
• The world's information is doubling every two years.
• In 2011 the world will create a staggering 1.8 zettabytes.
• By 2020 the world will generate 50 times the amount of information and 75 times the number of "information containers" (files), while the IT staff to manage it will grow less than 1.5 times. [ref. IDC/EMC 2011]

Why data mining?
• The explosive growth of data: from terabytes, through petabytes, to exabytes
  – Data collection from automated data collection tools, database systems, web, e-commerce, transactions, stocks, remote sensing, bioinformatics, scientific simulation, computerized society, news, digital cameras, …
  – Human analysts may take weeks to discover useful information
  – Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999. From: R. Grossman, C. Kamath, V. Kumar, "Data Mining for Scientific and Engineering Applications".]

Why mine data (commercial viewpoint)?
• Lots of data is being collected and warehoused
  – Web data, e-commerce
  – Purchases at department / grocery stores
  – Bank & credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
  – Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why mine data (scientific viewpoint)?
• Data collected and stored at enormous speeds (GB/hour)
  – Remote sensors on a satellite
  – Telescopes scanning the skies
  – Microarrays generating gene expression data
  – Scientific simulations generating terabytes of data
• Traditional techniques are infeasible for raw data
• Data mining may help scientists
  – in classifying and segmenting data
  – in hypothesis formation

Why not traditional data analysis?
• Tremendous amount of data
  – Algorithms must be highly scalable to handle data volumes such as terabytes
• High dimensionality of data
  – A microarray may have tens of thousands of dimensions
• High complexity of data
  – Data streams and sensor data
  – Time-series data, temporal data, sequence data
  – Structure data, graphs, social networks and multi-linked data
  – Heterogeneous databases and legacy databases
  – Spatial, spatiotemporal, multimedia, text and Web data
  – Software programs, scientific simulations
• New and sophisticated applications

Data mining tasks
• Prediction methods
  – Use some variables to predict unknown or future values of other variables.
• Description methods
  – Find human-interpretable patterns that describe the data.
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996

Classification - definition
• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Clustering - definition
• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  – Data points in one cluster are more similar to one another.
  – Data points in separate clusters are less similar to one another.
• Similarity measures:
  – Euclidean distance if attributes are continuous.
  – Other problem-specific measures.
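
The clustering definition above can be illustrated with a small sketch. It uses Euclidean distance as the similarity measure and a basic K-means loop, which is only one of many clustering algorithms and is not prescribed by the slide. The function names `euclidean` and `kmeans`, the sample points, and the choice of initial centroids are all assumptions made for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, centroids, max_iter=100):
    """Basic K-means: alternate assignment and centroid update until stable.

    `centroids` holds the initial cluster centers (an assumption of this
    sketch; real implementations usually pick them randomly or heuristically).
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its member points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[k])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical data: two visually separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

With these sample points, the first three points end up in one cluster and the last three in the other, so points within a cluster are closer to each other than to points in the other cluster.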

Association rule discovery - definition
• Given a set of records, each of which contains some number of items from a given collection:
  – Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Sequential pattern discovery - definition
• Given a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
  (A B) (C) (D E)
• Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
  (A B) (C) (D E), with timing constraints annotated as: <= xg, > ng, <= ws, <= ms

Deviation or anomaly detection
• Detect significant deviations from normal behavior
• Applications:
  – Credit card fraud detection
  – Network intrusion detection
Typical network traffic at university level may reach over 100 million connections per day.

Challenges of data mining
• Scalability
• Dimensionality
• Complex and heterogeneous data
• Data quality
• Data ownership and distribution
• Privacy preservation
• Streaming data
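
Returning to the market-basket table in the association rule discovery slide: one common way to judge how strong a rule is uses support and confidence. The slide does not define these measures, so the sketch below assumes the standard definitions: support is the fraction of transactions containing all items of the rule, and confidence is the fraction of transactions containing the antecedent that also contain the consequent. The function names `support` and `confidence` are illustrative.

```python
# Transactions copied from the TID / Items table (TIDs 1 to 5).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing antecedent and consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


# The two rules listed under "Rules discovered".
rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    s = support(lhs, rhs, transactions)
    c = confidence(lhs, rhs, transactions)
    print(sorted(lhs), "-->", sorted(rhs), f"support={s:.2f}", f"confidence={c:.2f}")
```

For {Milk} --> {Coke}, Milk appears in 4 of the 5 transactions and Milk together with Coke in 3, giving support 3/5 and confidence 3/4. For {Diaper, Milk} --> {Beer}, the antecedent appears in 3 transactions and all three items in 2, giving support 2/5 and confidence 2/3.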