
Data Mining
Brandon Leonardo
CS157B (Spring 2006)

What Is Data Mining?
• A way to discover knowledge
• “Semiautomatically analyzing large databases to find useful patterns”
• Notable characteristics
  • Large amounts of data
  • Data stored on disk

What Are We Looking For?
• Rules
  • Use sets of rules to predict or classify objects
  • Ex. “Students with an annual income of less than $20,000 a year are the most likely to get a student loan”
• Patterns
  • Different kinds of patterns
  • Multiple patterns in one data set

What Can Data Mining Do?
• Applications
  • Prediction
    • Which class an item will belong to, or what value it will have, based on its attributes
    • What kind of animal will this be, considering that it has stripes, 4 legs, and talks?
    • Which customers are likely to switch to a competitor?

What Can Data Mining Do?
• Applications
  • Association
    • Data that goes together in a class
    • Amazon: books that are bought together
  • Causality
    • Whether riding a motorcycle increases your chances of dying in an accident
  • Descriptive patterns
    • Clusters

Classification
• Taking a new item and, given past instances (the training instances), figuring out which class the new item belongs in
• How?
  • Rules
  • Decision trees
  • Bayesian classifiers

Rule Classifiers
• Decide which class an item belongs in based on rules
• Ex.
  • If a new customer signs up for a credit card and makes less than $30,000 a year, then place them in a high-risk category

Decision Tree Classifiers
• Traverse the tree based on attributes, making a decision at each node until a leaf is reached
• Ex.
  Being Hired At Google
    Degree = Bachelors
      School = Not Stanford: Not Hired
      School = Stanford: Hired
    Degree = PhD
      School = Not Stanford: Not Hired
      School = Stanford: Hired

Bayesian Classifiers
• Bayesian
  • For every class, predict the probability that the item belongs to that class
  • The class with the largest probability “wins”
  • P(cj|d) = P(d|cj) P(cj) / P(d)
    • P(d|cj): probability of generating instance d given class cj
    • P(cj): probability of getting class cj
    • P(d): probability of d occurring
  • If an attribute value is missing for an instance, it is left out of the probability computation

Regression
• Linear regression / curve fitting
  • Y = a0 + a1*X1 + a2*X2 + … + an*Xn
  • You find the coefficients a0, a1, a2, …, an
  • Find the best fit
  • Not always exact
    • Noise in the data
    • The relationship isn’t exactly a polynomial

Association Rules
• Rules denoted by ‘=>’
• Support
  • What fraction of the population satisfies both the antecedent and the consequent of the rule
• Confidence
  • How often the consequent is true when the antecedent is true
• Ex. Owning a car => Buying gas
  • Support: 99.9%
  • Confidence: 99.9%
  • Probably true

Association Rules
• Shortcomings
  • Sometimes two things are correlated without one actually causing the other
  • Ex. Haircuts and grocery shopping
    • 99% of the population gets haircuts
    • 100% of the population goes grocery shopping
    • Everybody who gets a haircut goes grocery shopping, but does that mean the two are actually related?
• Deviation from existing patterns
• Correlation (positive and negative)

Clustering
• Clusters of points in a data set
• Break the set down into subsets
• Types
  • Hierarchical clustering
    • Based on different levels; break things down as you go deeper
  • Agglomerative clustering
    • Start small, then create higher levels
  • Divisive clustering
    • Start big, then create lower levels

Other Types of Mining
• Text mining
  • Mining text documents
• Data visualization
  • Maps, charts, and other graphics
  • Don’t analyze the data, just present it to users (humans are good at seeing patterns)

References
• Database System Concepts
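
Sketch: Bayesian classifier

The formula on the Bayesian Classifiers slide can be tried out on a toy data set. The following is a minimal sketch, not the method from the slides: it assumes the attributes are independent given the class (the "naive" Bayes simplification, which the slides do not state), and the function names `train` and `posteriors` and the hiring data are hypothetical. Missing attribute values (`None`) are left out of the product, as the slide describes.

```python
from collections import Counter, defaultdict


def train(instances):
    """Count classes and per-class attribute values.

    instances: list of (attributes_dict, class_label) pairs.
    """
    class_counts = Counter(label for _, label in instances)
    # value_counts[label][attribute][value] = number of training instances
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, label in instances:
        for name, value in attrs.items():
            if value is not None:
                value_counts[label][name][value] += 1
    return class_counts, value_counts


def posteriors(model, attrs):
    """Return P(cj | d) for every class cj, assuming independent attributes."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # P(cj)
        for name, value in attrs.items():
            if value is None:
                continue  # a missing attribute is left out of the product
            seen = sum(value_counts[label][name].values())
            if seen:
                score *= value_counts[label][name][value] / seen  # P(value | cj)
        scores[label] = score  # P(d | cj) * P(cj)
    evidence = sum(scores.values())  # P(d), by the law of total probability
    if evidence == 0:
        return scores
    return {label: s / evidence for label, s in scores.items()}


# Hypothetical training data, loosely modeled on the hiring example.
data = [
    ({"degree": "PhD", "school": "Stanford"}, "Hired"),
    ({"degree": "Bachelors", "school": "Stanford"}, "Hired"),
    ({"degree": "PhD", "school": "Other"}, "Hired"),
    ({"degree": "Bachelors", "school": "Other"}, "Not Hired"),
    ({"degree": "Bachelors", "school": "Stanford"}, "Not Hired"),
    ({"degree": "PhD", "school": "Other"}, "Not Hired"),
]
model = train(data)
full = posteriors(model, {"degree": "PhD", "school": "Stanford"})
partial = posteriors(model, {"degree": "PhD", "school": None})
print(max(full, key=full.get), full)
print(max(partial, key=partial.get), partial)
```

The class that "wins" is simply the key with the largest posterior. A real classifier would also smooth the counts so that an unseen value does not force a probability of zero.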
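
Sketch: fitting a line

As an illustration of "find the best fit" on the Regression slide, here is a minimal sketch for the one-variable case Y = a0 + a1*X1, using the standard least-squares closed form. The function name `fit_line` and the data points are hypothetical; the general n-variable case would normally be handed to a linear algebra library instead.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a0, a1) for Y = a0 + a1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of X, and of co-deviations of X and Y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a1 = sxy / sxx               # slope
    a0 = mean_y - a1 * mean_x    # intercept
    return a0, a1


# Hypothetical noisy measurements of roughly Y = 1 + 2 * X.
xs = [1, 2, 3, 4]
ys = [3.1, 4.9, 7.2, 8.8]
a0, a1 = fit_line(xs, ys)
print(a0, a1)
```

The fitted coefficients come out near, but not exactly at, 1 and 2, which is the "not always exact" point from the slide: the noise in the data keeps the line from passing through every point.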
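
Sketch: support and confidence

The definitions on the Association Rules slides can be checked by counting. This is a minimal sketch under the assumption that each member of the population is represented as a set of items; the function name `support_and_confidence` and the baskets are hypothetical.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    # Transactions that satisfy the antecedent.
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    # Transactions that satisfy both the antecedent and the consequent.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical population of four people.
baskets = [
    {"car", "gas"},
    {"car", "gas", "coffee"},
    {"car"},
    {"bike"},
]
support, confidence = support_and_confidence(baskets, {"car"}, {"gas"})
print(support, confidence)
```

Here two of the four people have both a car and gas (support 0.5), and two of the three car owners buy gas (confidence 2/3). As the haircut example warns, high values of both can arise simply because each item is common on its own.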
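
Sketch: agglomerative clustering

The "start small, then create higher levels" idea on the Clustering slide can be sketched for points on a line. This is a minimal sketch assuming one-dimensional data and single-linkage distance (the distance between the closest members of two clusters), neither of which the slides specify; the function name `agglomerative` is hypothetical.

```python
def agglomerative(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    k = max(k, 1)
    # Start small: every point is its own cluster.
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > k:
        # In sorted 1-D data the closest pair of clusters is always adjacent,
        # so only the gaps between neighbouring clusters need checking.
        gaps = [clusters[i + 1][0] - clusters[i][-1]
                for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        # Create a higher level by merging the two neighbours.
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


result = agglomerative([10, 1, 25, 2, 11, 3], 3)
print(result)  # [[1, 2, 3], [10, 11], [25]]
```

Divisive clustering runs the other way: it starts from one cluster holding every point and repeatedly splits it.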
