Download Data Mining - Computer Science Unplugged

Data Mining
LECTURE # 01
Introduction to Data Mining

Motivation: “Necessity is the Mother of Invention”
• Data Explosion Problem
  1. Automated data collection tools (e.g. web, sensor networks) and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
  2. Enterprises are currently facing a data explosion problem.
• Electronic Information: an Important Asset for Business Decisions
  1. With the growth of electronic information, enterprises began to realize that the accumulated information can be an important asset in their business decisions.
  2. There is potential business intelligence hidden in the large volume of data.
  3. This intelligence can be the secret weapon on which the success of a business may depend.

Extracting Business Intelligence (Solution)
  1. It is not a simple matter to discover business intelligence from a mountain of accumulated data.
  2. What is required are techniques that allow the enterprise to extract the most valuable information.
  3. The field of data mining provides such techniques.
  4. These techniques can find novel (previously unknown) patterns that may assist an enterprise in understanding the business better and in forecasting.

Data Mining vs SQL, EIS, and OLAP
• SQL is a query language that is difficult for business people to use.
• EIS = Executive Information Systems. EIS systems provide graphical interfaces that give executives a preprogrammed (and therefore limited) selection of reports, automatically generating the necessary SQL for each.
• OLAP allows views along multiple dimensions, and drill-down, therefore giving access to a vast array of analyses. However, it requires manual navigation through scores of reports, and the user has to notice interesting patterns themselves.
• Data Mining picks out interesting patterns. The user can then use visualization tools to investigate further.
An Example of OLAP Analysis and its Limits
• What is driving sales of walking sticks?
• Step 1: View some OLAP graphs, e.g. walking stick sales by city.
• Step 2: Noticing that Islamabad has high sales, you decide to investigate further.
• (Before OLAP, you would have had to write a very complex SQL query instead of simply clicking to drill down.)
• It seems that old people are responsible for most walking stick sales. You confirm this by viewing a chart of age distributions by city.
• But imagine if you had to do this manual investigation for all of the 10,000 products in your range! Here, OLAP gives way to Data Mining.
[Figure, Step 1: Walking Sticks Sales by City (Karachi, Lahore, Islamabad)]
[Figure, Step 2: Walking Sticks Sales in Islamabad by Age (Less than 20, 20 to 60, Older than 60)]
[Figure: Age Distribution by City (Karachi, Lahore, Islamabad; Younger than 20, 20 to 60, Older than 60)]

Data Mining vs Expert Systems
• Expert Systems = Rule-Driven Deduction
  Top-down: from known rules (expertise) and data to decisions.
  Rules + Data → Expert System → Decisions
• Data Mining = Data-Driven Induction
  Bottom-up: from data about past decisions to discovered rules (general rules induced from the data).
  Data (including past decisions) → Data Mining → Rules

Difference between Machine Learning and Data Mining
• Machine learning techniques are designed to deal with a limited amount of artificial intelligence data, whereas data mining techniques deal with large amounts of database data.
• Data Mining (Knowledge Discovery in Databases)
  – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
• What is not Data Mining?
  – (Deductive) query processing.
  – Expert systems or small ML/statistical programs.

Data Mining (Example)
• Random Guessing vs. Potential Knowledge
  – Suppose we have to forecast the probability of rain in Islamabad city for any particular day.
  – Without any prior knowledge, the probability of rain would be taken as 50% (pure random guess).
  – If we had a lot of weather data, we could extract potential rules using data mining, which could then forecast the chance of rain better than random guessing.
• Example: The rule if [Temperature = ‘hot’ and Humidity = ‘high’] then there is about a 66.7% chance of rain (2 of the 3 hot, high-humidity days in the table below had rain).

  Temperature  Humidity  Windy  Rain
  hot          high      false  No
  hot          high      true   Yes
  hot          high      false  Yes
  mild         high      false  No
  cool         normal    false  No
  cool         normal    true   Yes

The Data Mining Process
• Step 0: Determine Business Objective
  - e.g. forecasting the probability of rain.
  - Must have relevant prior knowledge and goals of the application.
• Step 1: Prepare Data
  - Noisy and missing values handling (Data Cleaning).
  - Data Transformation (Normalization/Discretization).
  - Attribute/Feature Selection.
• Step 2: Choosing the Function of Data Mining
  - Classification, Clustering, Association Rules.
• Step 3: Choosing the Mining Algorithm
  - Selection of the correct algorithm depending upon the quality of the data.
  - Selection of the correct algorithm depending upon the density of the data.
• Step 4: Data Mining
  - Search for patterns of interest: a typical data mining algorithm can mine millions of patterns.
• Step 5: Visualization/Knowledge Representation
  - Visualization/representation of interesting patterns, etc.

Data Mining: A KDD Process
  – Data mining: the core of the knowledge discovery process.
[Figure: KDD process, bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Task-relevant Data → Data Mining → Pattern Evaluation]

Data Mining: On What Kind of Data?
  1. Relational databases
  2. Data warehouses
  3. Transactional databases
  4. Advanced DB and information repositories
     – Time-series data and temporal data
     – Text databases
     – Multimedia databases
     – Data Stream (Sensor Networks Data)
     – WWW

Data Mining Functionalities (1)
• Data Preprocessing
  – Handling Missing and Noisy Data (Data Cleaning).
  – Techniques we will cover:
    • Missing values imputation using Mean, Median and Mode.
    • Missing values imputation using K-Nearest Neighbor.
    • Missing values imputation using Association Rules Mining.
    • Data Binning for Noisy Data.

  Example records with missing values (? = missing value):
  TID  Refund  Country    Taxable Income  Cheat
  1    Yes     USA        125K            No
  2    ?       UK         100K            No
  3    No      Australia  70K             No
  4    ?       ?          120K            No
  5    No      NZL        95K             Yes

Data Mining Functionalities (1)
• Data Preprocessing
  – Data Transformation (Discretization and Normalization).
  – With the help of data transformation, rules become more general and compact.
  – General and compact rules increase the accuracy of classification.

  Discretization bins: Child = (0 to 20), Young = (21 to 47), Old = (48 to 120)
  Age  Age (discretized)
  15   Child
  18   Child
  40   Young
  33   Young
  55   Old
  48   Old
  12   Child
  23   Young

  Rules before discretization:
  1. If attribute 1 = value1 & attribute 2 = value2 and Age = 08 then Buy_Computer = No.
  2. If attribute 1 = value1 & attribute 2 = value2 and Age = 09 then Buy_Computer = No.
  3. If attribute 1 = value1 & attribute 2 = value2 and Age = 10 then Buy_Computer = No.
  Rule after discretization:
  1. If attribute 1 = value1 & attribute 2 = value2 and Age = Child then Buy_Computer = No.

Data Mining Functionalities (1)
• Data Preprocessing
  – Attribute Selection/Feature Selection
    • Selection of those attributes which are most relevant to the data mining task.
    • Advantage 1: Decreases the processing time of the mining task.
    • Advantage 2: Generalizes the rules.
  Example
    • Suppose our mining goal is to find which countries have more Cheat cases, and at which Taxable Income levels.
    • Then obviously the Date attribute will not be an important factor in our mining task.
  Date        Refund  Country    Taxable Income  Cheat
  11/02/2002  Yes     USA        125K            No
  13/02/2002  Yes     UK         100K            No
  16/02/2002  No      Australia  120K            Yes
  21/03/2002  No      Australia  120K            Yes
  26/02/2002  No      NZL        95K             Yes

Data Mining Functionalities (1)
• Data Preprocessing
  – Principal Component Analysis
  – Wrapper Based
  – Filter Based

Data Mining Functionalities (2)
• Association Rule Mining
  – In the association rule mining framework, we have to find all the rules in a transactional/relational dataset whose support (frequency) is greater than or equal to a minimum support (min_sup) threshold provided by the user.
  – For example, with min_sup = 50%:

  Transaction ID  Items Bought
  2000            Bread, Butter, Egg
  1000            Bread, Butter, Egg
  4000            Bread, Butter, Tea
  5000            Butter, Ice cream, Cake

  Itemset               Support
  {Butter}              4
  {Bread}               3
  {Egg}                 2
  {Bread, Butter}       3
  {Bread, Butter, Egg}  2

Data Mining Functionalities (2)
• Association Rule Mining
• Topics we will cover
  – Frequent Itemset Mining Algorithms (Apriori, FP-Growth, Bitvector).
  – Fault-Tolerant/Approximate Frequent Itemset Mining.
  – N-Most Interesting Frequent Itemset Mining.
  – Closed and Maximal Frequent Itemset Mining.
  – Incremental Frequent Itemset Mining.
  – Sequential Patterns.

Data Mining Functionalities (2)
• Classification and Prediction
  – Finding models (functions) that describe and distinguish classes or concepts for future prediction.
  – Example: Classify rainy/non-rainy cities based on the Temperature, Humidity and Windy attributes.
  – The previous business decisions must be known (Supervised Learning).

  City        Temperature  Humidity  Windy  Rain
  Lahore      hot          low       false  No
  Islamabad   hot          high      true   Yes
  Islamabad   hot          high      false  Yes
  Multan      mild         low       false  No
  Karachi     cool         normal    false  No
  Rawalpindi  hot          high      true   Yes

  Rule
  • If Temperature = Hot & Humidity = High then Rain = Yes.

  Prediction of unknown record
  City   Temperature  Humidity  Windy  Rain
  Muree  hot          high      false  ?
  Sibi   mild         low       true   ?
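
The classification example above can be sketched in a few lines of Python. This is only an illustration, not the lecture's own method: the function names predict_rain and rule_coverage are hypothetical, and returning None when the rule does not fire is an assumption, since the slide states a rule only for the Rain = Yes case.

```python
from typing import Optional

# Labelled records copied from the classification table:
# (city, temperature, humidity, windy, rain)
TRAINING = [
    ("Lahore", "hot", "low", False, "No"),
    ("Islamabad", "hot", "high", True, "Yes"),
    ("Islamabad", "hot", "high", False, "Yes"),
    ("Multan", "mild", "low", False, "No"),
    ("Karachi", "cool", "normal", False, "No"),
    ("Rawalpindi", "hot", "high", True, "Yes"),
]

# Unknown records from the same slide (Rain is what we want to predict).
UNKNOWN = [
    ("Muree", "hot", "high", False),
    ("Sibi", "mild", "low", True),
]


def predict_rain(temperature: str, humidity: str) -> Optional[str]:
    """Apply the slide's single rule: hot and high humidity means rain."""
    if temperature == "hot" and humidity == "high":
        return "Yes"
    # Assumption: no rule is given for other cases, so make no prediction.
    return None


def rule_coverage(records):
    """Return (fired, correct): how many labelled records the rule fires on,
    and how many of those predictions match the recorded Rain value."""
    fired = correct = 0
    for _city, temp, hum, _windy, rain in records:
        prediction = predict_rain(temp, hum)
        if prediction is not None:
            fired += 1
            correct += prediction == rain
    return fired, correct


if __name__ == "__main__":
    print("rule fired / correct on labelled data:", rule_coverage(TRAINING))
    for city, temp, hum, _windy in UNKNOWN:
        print(city, "->", predict_rain(temp, hum))
```

On the six labelled records the rule fires three times and agrees with the recorded label each time. For the unknown records it predicts Rain = Yes for Muree and makes no prediction for Sibi; a fuller classifier would need additional rules, or a default class, to cover such records.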
Data Mining Functionalities (2)
• Cluster Analysis
  – Group data to form new classes from data that has no class labels.
  – Business decisions are unknown (also called Unsupervised Learning).
  – Example: Group rainy/non-rainy cities based on the Temperature, Humidity and Windy attributes.

  City        Temperature  Humidity  Windy  Rain
  Lahore      hot          low       false  ?
  Islamabad   hot          high      true   ?
  Islamabad   hot          high      false  ?
  Multan      mild         low       false  ?
  Karachi     cool         normal    false  ?
  Rawalpindi  hot          high      true   ?
  (3 clusters)

Data Mining Functionalities (3)
• Outlier Analysis
  – Outlier: a data object that does not comply with the general behavior of the data.
  – It can be considered as noise or an exception, but is quite useful in fraud detection and rare-event analysis.

  City        Temperature  Humidity  Windy  Rain
  Lahore      hot          low       false  ?
  Islamabad   hot          high      true   ?
  Islamabad   hot          high      false  ?
  Multan      mild         low       false  ?
  Karachi     cool         normal    false  ?
  Rawalpindi  hot          high      true   ?
  (2 outliers)

Are All the “Discovered” Patterns Interesting?
• A data mining system/query may generate thousands of patterns; not all of them are interesting.
  – Suggested approach: query-based, constraint mining.
• Interestingness Measures: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.

Can We Find All and Only Interesting Patterns?
• Find all the interesting patterns: Completeness
  – Can a data mining system find all the interesting patterns?
  – Remember that most of the problems in Data Mining are NP-Complete.
  – There is no global best solution for any single problem.
• Search for only interesting patterns: Optimization
  – Can a data mining system find only the interesting patterns?
  – Approaches
    • First generate all the patterns and then filter out the uninteresting ones.
    • Generate only the interesting patterns: constraint-based mining (give threshold factors in mining).

Reading Assignment
• Book Chapter
  – Chapter 1 of Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”.

Data Mining: Where?
• Some Nice Resources
  – ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) http://www.acm.org/sigs/sigkdd/.
  – Knowledge Discovery Nuggets www.kdnuggests.com.
  – IEEE Transactions on Knowledge and Data Engineering http://www.computer.org/tkde/.
  – IEEE Transactions on Pattern Analysis and Machine Intelligence http://www.computer.org/tpami/.
  – Data Mining and Knowledge Discovery. Publisher: Springer Science+Business Media B.V., formerly Kluwer Academic Publishers B.V. http://www.kluweronline.com/issn/13845810/.
  – Current and previous offerings of the Data Mining course at Stanford, CMU, MIT and Helsinki.

Text and Reference Material
• The course will be mainly based on research literature; the following texts may however be consulted:
  – Jiawei Han and Micheline Kamber. “Data Mining: Concepts and Techniques”.
  1. David Hand, Heikki Mannila and Padhraic Smyth. “Principles of Data Mining”. Pub. Prentice Hall of India, 2004.
  2. Sushmita Mitra and Tinku Acharya. “Data Mining: Multimedia, Soft Computing and Bioinformatics”. Pub. Wiley and Sons Inc., 2003.
  3. Usama M. Fayyad et al. “Advances in Knowledge Discovery and Data Mining”. The MIT Press, 1996.
