Module 1: Introduction

Topics
– Definitions
  – Business intelligence
  – DW & OLAP
  – Data mining
– Data Warehousing and Data Mining Motivation
– Data mining tasks
  – Classification, clustering, association, etc.

Definitions

What is business intelligence?
– The new technology for understanding the past and predicting the future
– A broad category of technologies that allows for
  – Gathering, storing, accessing and analyzing data to help business users make better decisions
  – Analyzing business performance through data-driven insight
– A broad category of applications, which includes the activities of
  – Decision support systems
  – Query and reporting
  – OLAP
  – Statistical analysis, forecasting and data mining

What is a data warehouse?
– Barry Devlin, IBM Consultant

What is a data warehouse?
– W. H. Inmon, Building the Data Warehouse

Data in OLTP and OLAP

What is data mining?
– Many definitions
  – Search for valuable information (knowledge) in large volumes of data
  – Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns & rules
– Alternative terms: data analysis, pattern analysis, data dredging, data exploration, data understanding, data summarization
– Data mining: a misnomer?

Knowledge Discovery Process

KDD process
– Data cleaning: remove noise and inconsistent data
– Data integration: from multiple sources -> data warehouse
– Data selection and transformation: select relevant data and transform it into forms appropriate for data mining
– Data mining: extract patterns
– Pattern evaluation/interpretation: using interestingness measures
– Knowledge presentation: visualization and knowledge representation are used to present mined knowledge to the user

What is (not) Data Mining?

What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”

What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

Origins of Data Mining
– Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
– Traditional techniques may be unsuitable due to
  – Enormity of data
  – High dimensionality of data
  – Heterogeneous, distributed nature of data
[Figure: diagram placing Data Mining among Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]

Data mining in the BI context

The complete DSS from BI perspective

Data Warehousing and Data Mining Motivations

Motivation:
– Data explosion problem: automated data collection tools and mature database technology lead to large amounts of data stored in databases and data warehouses
– We are drowning in data, but starving for knowledge!
– Do not believe it? See the following for proof!

Why Mine Data? Commercial Viewpoint
– Lots of data is being collected and warehoused
  – Web data, e-commerce
  – Purchases at department/grocery stores
  – Bank/credit card transactions
– Computers have become cheaper and more powerful
– Competitive pressure is strong
  – Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint
– Data collected and stored at enormous speeds (GB/hour)
  – Remote sensors on a satellite
  – Telescopes scanning the skies
  – Microarrays generating gene expression data
  – Scientific simulations generating terabytes of data

Big Data Examples

Largest Databases in 2003

What tools do we have?
– Query processing
– Reporting tool
– Spreadsheet
– Statistics
– OLAP (On Line Analytical Processing)

Are there enough data analysts?
– Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 versus number of analysts, 1995 to 1999; vertical axis from 0 to 4,000,000]
From: R. Grossman, C. Kamath, V.
Kumar, “Data Mining for Scientific and Engineering Applications”

What we need is
– New technology that can intelligently and automatically assist humans in analyzing and transforming the rapidly growing volume of digital data into useful information
– Data mining

Largest Database Data Mined (Jun’06)

Data Mining Tasks
– Prediction Methods
  – Use some variables to predict unknown or future values of other variables.
– Description Methods
  – Find human-interpretable patterns that describe the data.
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Tasks...
– Classification [Predictive]
– Clustering [Descriptive]
– Association Rule Discovery [Descriptive]
– Sequential Pattern Discovery [Descriptive]
– Regression [Predictive]
– Deviation Detection [Predictive]

Classification: Definition
– Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
– Find a model for the class attribute as a function of the values of the other attributes.
– Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model.

Illustrating Classification Task

Training Set
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Flow: Training Set -> Learn Classifier -> Model; Test Set -> Model

Example of a Decision Tree
Training Data: the ten records of the training set above.
Splitting attributes: Refund, MarSt, TaxInc
Model: Decision Tree
Refund
  Yes -> NO
  No  -> MarSt
           Married          -> NO
           Single, Divorced -> TaxInc
                                 < 80K -> NO
                                 > 80K -> YES

Apply Model to Test Data
– Start from the root of the tree (the decision tree above).
Test Data
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Application: Credit card application
– Institution: a credit card company typically receives thousands of applications for new cards. The application contains information such as annual salary, any outstanding debts, age, etc.
– The problem: a decision has to be taken whether to accept or reject each application.
– Data mining task: to categorize applications into those that have good credit, have bad credit, or fall into a gray area (thus requiring further human analysis).

Application: Satellite image classification

Application: General image

Application: Biological image
– Protein classes: nucleus, cytoplasm, and mitochondria.
– RBC classes: discocyte, stomatocyte, and echinocyte

Clustering
– Groups data into meaningful classes/clusters
– Unsupervised learning
– Motivation: we do not know what to look for
  – The first step in identifying useful patterns is to group data by their similarity
  – Once data are grouped (clustered), properties of each cluster can be analyzed
– High quality clusters:
  – the intra-class similarity is high
  – the inter-class similarity is low

Clustering: Basic concept
– Given points in some space, group the points into a small number of clusters

What is a natural grouping among these objects?
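The basic concept of clustering stated above (group points in some space into a small number of clusters) can be illustrated with a minimal k-means sketch. The slides do not name a specific algorithm, so k-means is an assumption chosen here for illustration; the function name `kmeans`, the sample points, and the choice of seeding the centroids with the first k points are all hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points; the first k points seed the centroids (an assumption)."""
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids

# Hypothetical data: two visibly separated groups of points.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

Points that share a label are close to each other (high intra-class similarity) and far from the other group (low inter-class similarity). The result depends on the seeding and on the distance measure, which is one sense in which a grouping is not unique.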
Clustering is subjective
– Simpson's Family, School Employees, Females, Males

Application: web clustering

Association Rule Discovery: Definition
– Given a set of records, each of which contains some number of items from a given collection:
  – Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Association Rule (Plane Form)

Sequential Pattern Discovery: Definition
– Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

Sequence Data

Sequence Database:
Object  Timestamp  Events
A       10         2, 3, 5
A       20         6, 1
A       23         1
B       11         4, 5, 6
B       17         2
B       21         7, 8, 1, 2
B       28         1, 6
C       14         1, 8, 7
[Figure: Timeline (axis from 10 to 35) showing the events of Objects A, B and C at their timestamps]

Examples of Sequence Data
Sequence Database | Sequence | Element (Transaction) | Event (Item)
Customer | Purchase history of a given customer | A set of items bought by a customer at time t | Books, dairy products, CDs, etc
Web Data | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc
Genome sequences | DNA sequence of a particular species | An element of the DNA sequence | Bases A, T, G, C

Sequential Pattern Discovery: Examples
– Stock market: (IBM_UP SUN_UP) --> (Microsoft_UP)
– In point-of-sale transaction sequences,
  – Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies,Tcl_Tk)
  – Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)
– Medical field: if a patient underwent cardiac bypass surgery for blocked arteries (blood vessels) and later developed high blood urea within a year of surgery, he or she is likely to suffer from kidney failure within the next 18 months.

Deviation/Anomaly Detection
– Detect significant deviations from normal behavior
– Applications:
  – Credit Card Fraud Detection
  – Network Intrusion Detection
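
As a rough illustration of detecting significant deviations from normal behavior, the sketch below flags values whose z-score exceeds a threshold. The slides do not prescribe a detection method, so the z-score rule, the function name `flag_deviations`, the threshold of 2.0, and the transaction amounts are all assumptions made for this example.

```python
from statistics import mean, stdev

def flag_deviations(values, threshold=2.0):
    """Return the values lying more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:  # all values identical: nothing deviates
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical credit card transaction amounts with one unusual purchase.
amounts = [42.0, 38.5, 45.0, 40.0, 41.5, 39.0, 44.0, 43.0, 950.0, 40.5]
outliers = flag_deviations(amounts)
print(outliers)
```

A single extreme value inflates the standard deviation it is measured against, so this simple rule can miss anomalies in small samples; practical fraud or intrusion detection typically relies on more robust statistics or learned models.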