Data Mining: Confluence of Multiple Disciplines
(Figure: Data Mining at the center, drawing on Database Systems, Machine Learning, Algorithm, Statistics, Visualization, and Other Disciplines)

Data Mining Outline
Introduction
Classification
Clustering
Association Rules

Introduction
Data is growing at a phenomenal rate
Users expect more sophisticated information
How? UNCOVER HIDDEN INFORMATION: DATA MINING

Data Mining Definition
Finding hidden information in a database
Fit data to a model: descriptive or predictive
Similar terms
– Exploratory data analysis
– Data driven discovery
– Deductive learning

But it isn’t Magic
You must know what you are looking for
You must know how to look for it
Suppose you knew that a specific cave had gold:
• What would you look for?
• How would you look for it?
• Might need an expert miner

“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”
(Figure labels: Description, Behavior, Classification, Clustering, Associations, Link Analysis)

Query Examples
Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)

KDD Process (© Prentice Hall)
Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert to common format. Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results to user in meaningful manner.
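
The five KDD steps listed above can be sketched as a chain of small functions. This is only an illustrative sketch: the purchase records, the two source formats, and every function name are invented here and do not come from the slides. The mining step is a simple co-occurrence count for the "items frequently purchased with milk" query, standing in for a real mining algorithm.

```python
from collections import Counter

# Hypothetical purchase records from two sources (invented for illustration).
store_rows = ["T1,Milk;Bread", "T2,milk;Cereal;Bread", "T3,"]  # "id,item;item" strings
web_rows = [
    {"id": "T4", "items": ["Milk", "Cereal"]},
    {"id": "T5", "items": ["Beer"]},
]


def select(store, web):
    """Selection: obtain data from various sources."""
    return list(store) + list(web)


def preprocess(rows):
    """Preprocessing: cleanse data (here, drop records that list no items)."""
    kept = []
    for row in rows:
        items = row.split(",", 1)[1] if isinstance(row, str) else row["items"]
        if items:
            kept.append(row)
    return kept


def transform(rows):
    """Transformation: convert every record to one format, a set of lowercase items."""
    baskets = []
    for row in rows:
        if isinstance(row, str):
            items = row.split(",", 1)[1].split(";")
        else:
            items = row["items"]
        baskets.append({item.strip().lower() for item in items})
    return baskets


def mine(baskets, target="milk"):
    """Data mining: count how often each item shares a basket with the target."""
    counts = Counter()
    for basket in baskets:
        if target in basket:
            counts.update(basket - {target})
    return counts


def interpret(counts):
    """Interpretation/Evaluation: present results in a readable form."""
    return [f"{item}: bought with milk in {n} basket(s)"
            for item, n in counts.most_common()]


counts = mine(transform(preprocess(select(store_rows, web_rows))))
report = interpret(counts)
for line in report:
    print(line)
```

Keeping each step as its own function mirrors the slide: the mining function can be swapped for a classifier or a clustering routine without touching selection, cleansing, or transformation.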
Data Mining Outline
Introduction
Classification
– Assign data to a predefined class
– Decision Trees
– Neural Networks
– Distance Based
Clustering
Association Rules

The classification problem can now be expressed as:
Given a training database, predict the class label of a previously unseen instance.

Insect ID   Abdomen Length   Antennae Length   Insect Class
1           2.7              5.5               Grasshopper
2           8.0              9.1               Katydid
3           0.9              4.7               Grasshopper
4           1.1              3.1               Grasshopper
5           5.4              8.5               Katydid
6           2.9              1.9               Grasshopper
7           6.1              6.6               Katydid
8           0.5              1.0               Grasshopper
9           8.3              6.6               Katydid
10          8.1              4.7               Katydid
11          5.1              7.0               ???????
(Insect 11 is the previously unseen instance.)

Classification Process (1): Model Construction
Training Data

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Training Data -> Classification Algorithms -> Classifier (Model)
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Classification Process (2): Use the Model in Prediction
Testing Data

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data: (Jeff, Professor, 4)
Tenured?

Training Dataset
This follows an example from Quinlan’s ID3

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no

Output: A Decision Tree for “buys_computer”
age?
– <=30: student?
    – no: no
    – yes: yes
– 31…40: yes
– >40: credit rating?
    – excellent: no
    – fair: yes

Neural Network Example
(Figure: a network mapping Tuple Input to Output)

Data Mining Outline
Introduction
Classification
Clustering
– Place data into groups
– Hierarchical
– K-Means
– Partitional
Association Rules

Clustering Examples
Segment customer database based on similar buying patterns.
Group houses in a town into neighborhoods based on similar features.
Identify new plant species.
Identify similar Web usage patterns.

Clustering vs. Classification
No prior knowledge
– Number of clusters
– Meaning of clusters
Unsupervised learning

Data Mining Outline
Introduction
Classification
Clustering
Association Rules
– Find relationships between data
– Apriori

Association Rules Example
I = {Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread, PeanutButter} is 60%

Association Rules Ex (cont’d)

AR & Market Baskets
Determine items often purchased together (market basket data)
Determine optimal placement of items on store floor
Determine items for sales and/or specials
Increase sales of items
www.amazon.com

Summary
Data Mining is a fast growing area with many applications.
Data Mining algorithms are usually computationally expensive.
Data Mining tools may be difficult to use effectively.
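
The support figure in the Association Rules Example can be made concrete with a short sketch. The transaction table behind that slide did not survive, so the five baskets below are hypothetical, chosen only so that {Bread, PeanutButter} appears in three of five baskets (60%). The function names and the 0.4 threshold are likewise assumptions, and the pair search is plain brute-force enumeration, not Apriori.

```python
from itertools import combinations

# Hypothetical market baskets over I = {Beer, Bread, Jelly, Milk, PeanutButter}.
# These are invented for illustration and are NOT taken from the slides.
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    wanted = set(itemset)
    hits = sum(1 for basket in baskets if wanted <= basket)
    return hits / len(baskets)


def frequent_pairs(baskets, min_support):
    """Brute-force list of item pairs whose support meets the threshold."""
    items = sorted(set().union(*baskets))
    return [pair for pair in combinations(items, 2)
            if support(pair, baskets) >= min_support]


print(support({"Bread", "PeanutButter"}, transactions))
print(frequent_pairs(transactions, min_support=0.4))
```

Enumerating every pair is fine for five items but grows quickly with the size of I, which is one reason the summary calls these algorithms computationally expensive.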