
CS157A Spring 05
Data Mining
Professor Sin-Min Lee

Today's Presentation covers:
1. What is Data Mining?
2. Data Mining Objectives
3. Data Mining Operations
4. Knowledge Discovery
5. Application of Data Mining
6. Summary
7. References

Overview of Data Mining
[Diagram: Data Mining shown together with Statistics, Databases, Visualization, and Artificial Intelligence]

1. What is Data Mining?
➔ We usually use Data Mining to:
  – Discover useful, previously unknown knowledge by analyzing large and complex databases
  – Carry out knowledge discovery, exploratory data analysis, applied statistics, and machine learning
  – Search for valuable information in large databases

2. Data Mining Objectives
➔ Find rules and patterns in large-volume databases
➔ Discovery
  – Finding human-understandable patterns describing the data
➔ Prediction
  – Using some variables or fields in the database to predict unknown or future values of other variables of interest

Data Mining Objectives
➔ Knowledge Discovery
  – A stage somewhat prior to prediction, where the available information is insufficient for prediction
  – It is close to decision support

3. Data Mining Operations
➔ Associations
➔ Sequential Patterns
➔ Time-Series Clustering
➔ Classification
➔ Segmentation
➔ And many more!

Association
● Used to find all rules in basket data
● Basket data is also called transaction data
● Analyzes how items are purchased by customers in a shop

Association...
● A formal definition:
  – Let $I = \{i_1, i_2, \ldots, i_m\}$ be the total set of items
  – $D$ is a set of transactions
  – $d$ is one transaction, consisting of a set of items, $d \subseteq I$
● Association rule: $X \Rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$, and $X \cap Y = \emptyset$
● Support = (# of transactions containing $X \cup Y$) / $|D|$
  – Support measures the instances the rule predicts correctly, here as a fraction of all transactions
● Confidence = (# of transactions containing $X \cup Y$) / (# of transactions containing $X$)
  – Confidence measures the correct predictions as a proportion of all instances the rule applies to
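The support and confidence definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `support` and `confidence` are hypothetical, and the four transactions are the ones used in the deck's CD player example. Exact fractions are used so the results can be compared directly with the hand calculation.

```python
from fractions import Fraction


def support(transactions, x, y):
    """Fraction of all transactions that contain every item of X ∪ Y."""
    both = set(x) | set(y)
    hits = sum(1 for t in transactions if both <= set(t))
    return Fraction(hits, len(transactions))


def confidence(transactions, x, y):
    """Among transactions containing X, the fraction that also contain Y."""
    x = set(x)
    both = x | set(y)
    with_x = sum(1 for t in transactions if x <= set(t))
    if with_x == 0:
        raise ValueError("no transaction contains X; confidence is undefined")
    hits = sum(1 for t in transactions if both <= set(t))
    return Fraction(hits, with_x)


# The four transactions from the deck's CD player example.
transactions = [
    {"CD player", "music CD", "music book"},
    {"CD player", "music CD"},
    {"music CD", "music book"},
    {"CD player"},
]

# Rule: CD player => music CD
print(support(transactions, {"CD player"}, {"music CD"}))     # 1/2 (i.e. 2/4)
print(confidence(transactions, {"CD player"}, {"music CD"}))  # 2/3
```

Representing each transaction as a set makes the "contains X ∪ Y" test a plain subset check, which mirrors the formal definition term by term.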
● Example of transaction data:
  – Transaction 1: CD player, music CD, music book
  – Transaction 2: CD player, music CD
  – Transaction 3: music CD, music book
  – Transaction 4: CD player
● $I = \{\text{CD player}, \text{music CD}, \text{music book}\}$
● $|D| = 4$
● # of transactions containing both CD player and music CD = 2
● # of transactions containing CD player = 3
● For the rule CD player $\Rightarrow$ music CD: Support = 2/4, Confidence = 2/3

Applying Association Rule...
● Example: Books that tend to be bought together. If a customer buys a book, an online bookstore (e.g. Amazon.com) may suggest other associated books.
● Example: If a person buys a laptop, the salesperson may suggest accessories that tend to be bought along with a laptop.

Time Series Clustering
● Given:
  – A database of time series
● Find:
  – Groups of similar time series
● Sample Applications:
  – Determine products with similar selling patterns
  – Identify companies with similar patterns of growth
  – Find stocks with similar price movements

Classification
● Classification
  – Problem: Given that items belong to one of several classes, and given past instances (aka training instances) of items along with the classes to which they belong, the problem is to PREDICT the class to which a new item belongs
  – The class of the new instance is not known, so other attributes of the instance must be used to predict the class
  – It can be done by finding rules that partition the given data into disjoint groups

Classification...
● The dataset is usually in the form of a relational table.
● The data has a set of distinct attributes.
● Each data record is also labeled with a class.
● Goal: To build a model or learn rules that can be used to predict the classes of new cases.
● Training data are used to build this model.

Classification...
● For example
  – Suppose that a credit card company wants to decide whether or not to give a credit card to an applicant
● The company has a variety of information about the person, such as their age, educational background, income, etc.
● They then rank the applicants (categorize them into classes):
● Forall person P, P.degree = masters AND P.income > 75,000 ==> P.credit = excellent
● Forall person P, P.degree = bachelors OR (P.income >= 25,000 AND P.income <= 75,000) ==> P.credit = good

Classification...
● Table:

  Age   Smoke   Risk
  ------------------
  20    No      Low
  25    Yes     High
  44    Yes     High
  18    No      Low
  55    No      High
  35    No      Low

● To identify the risk (we have two groups):
  – Risk = Low and Risk = High

Classification...
● The following techniques could be used for classification:
  – Decision Tree
  – Predictive Modeling
  – Using association rules
  – Neural networks
  – etc.

Decision Trees
● The "divide-and-conquer" approach produces a tree
● Nodes involve testing a particular attribute
● Usually, the attribute value is compared to a constant
● Other possibilities:
  – Comparing values of two attributes
  – Using a function of one or more attributes
● Leaves assign a classification, a set of classifications, or a probability distribution to instances
● An unknown instance is routed down the tree

Decision Tree
● In short, a decision tree is just a series of nested if/then rules.
● Our previous example:

  Smoke = Yes -> Risk = High
  Smoke = No  -> test Age
      Age 0-35   -> Risk = Low
      Age 36-100 -> Risk = High

Predictive Modeling
● Predict values based on similar groups of data
● Pattern Recognition
  – Association of an observation to past experience or knowledge
  – Interchangeable with classification
● Estimation
  – Assign one of an infinite number of numeric labels to an observation

4. Knowledge Discovery
● Find patterns in a database
  – For example, if someone buys one thing, what else will they buy next?
● Interesting + Certain = Knowledge
  – Usually the output is called "Discovered Knowledge"

KDD – Knowledge Discovery in Database
● A non-trivial process of identifying valid, potentially useful, and understandable patterns in data

KDD – Knowledge Discovery in Database...
● Advances in traditional tasks in data analysis
  – Classification, Clustering
  – New Data Mining operations
    ● Association rules
    ● Sequential patterns
    ● Deviation / Exceptions
● New Application areas
  – Spatial, Text, Web, Image, ....

KDD – Knowledge Discovery in Database
● Applications
  – Most large companies have data warehouses: platforms for Data Mining projects
  – Trend towards integrated vertical solutions, such as in the financial and telecom areas
    ● Back-end: integration with databases
    ● Front-end: Campaign Management or CRM (Customer Relationship Management)

KDD – Knowledge Discovery in Database
● Next Generation Knowledge Discovery Systems:
  – Have integrated front-end access to knowledge delivery tools
  – Have integrated back-end access to enterprise and external databases
  – Have a knowledge discovery engine as an embedded part of the overall solution
  – Be oriented to solving a business problem, not a data analysis problem

5. Application of Data Mining
● Medical
● Control Theory
● Engineering
● Marketing and Finance
● Data Mining on the web
● Scientific Data Base
● Fraud Detection
● And many more!

6. Summary
● Data Mining IS....
  – Decision Trees, Nearest Neighbor Classification, Neural networks, Rule Induction, K-means Clustering
  – A decision support process in which we search for patterns of information in data
● Data Mining is NOT...
  – Retrieving data (e.g. Google)
    ● "Information retrieval" or "Database querying"
    ● Data Mining infers "the right query" from data
  – Merging many small databases into a large one

Summary
● Data Mining is not...
  – Data warehousing
  – SQL / Ad Hoc Queries / Reporting
  – Software Agents
  – Online Analytical Processing (OLAP)
  – Data Visualization

References
● Dr. Lee's Presentation
  – http://www.cs.sjsu.edu/~lee/cs157b/cs157b.html
    ● Data Mining Section
● Dr. Kurt Thearling's website
  – http://www.thearling.com/dmintro/dmintro_frame.htm
    ● An Introduction to Data Mining
