Download Data Mining

Chase Repp

Data mining
• Also known as knowledge discovery
• Searching, analyzing, and sifting through large data sets to find new patterns, trends, and relationships contained within
• Data mining differs from database querying in the following manner: a database query asks “what company purchased $100,000 worth of widgets last year?” while data mining asks “what company is likely to purchase over $100,000 of widgets next year and why?”
• Term coined in the 1960s
• Data mining was used to find basic information from collections of data, such as total revenue over the last three years
• Draws on classic statistics, artificial intelligence, and machine learning

Predictive data mining
• Target value
• Future trends
• Focuses on discovering a relationship between independent variables and a relationship between dependent and independent variables
• Used to forecast specific things

Descriptive data mining
• No target value
• Focuses on relations
• Describes a data set in a brief but comprehensive way and gives interesting characteristics of the data without having any predefined target

Association
• Focus on relations
• Patterns are discovered based on the relationship of a specific item with other items in the same transaction
• Descriptive
• Example: groceries

Classification
• Classifies each item in a set of data into one of a predefined set of classes or groups
• Often used with machine learning
• Predictive
• Example: cat or dog person?

Clustering
• Unlike classification, the clustering technique also defines the classes and puts objects in them
• Descriptive
• Example: a library

Regression
• Used to predict numbers from data sets that have known target values
• Predictive
• Example: sales, distance, temperature, value, etc.

Sequential pattern mining
• Discovers frequent sequences or subsequences as patterns in a sequence database
• Descriptive
• Derived from association mining
• The main sequential pattern mining techniques fall into three categories:
• Apriori-based
• Pattern-growth
• Early-pruning

Apriori-based
• Follow the apriori property: all nonempty subsets of a frequent itemset must also be frequent
• If {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
• AprioriAll, GSP, PSP, and SPAM

Transaction data
Assume:
  t1: Beef, Chicken, Milk
  t2: Beef, Cheese
  t3: Cheese, Boots
  t4: Beef, Chicken, Cheese
  t5: Beef, Chicken, Clothes, Cheese, Milk
  t6: Chicken, Clothes, Milk
  t7: Chicken, Milk, Clothes
  minsup = 30%
  minconf = 80%
• An example frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7], about 43%
• Association rules from the itemset:
  Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
  …
  Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]

Two steps:
• Find all itemsets that have minimum support (frequent itemsets).
• Use frequent itemsets to generate rules.
E.g., a frequent itemset {Chicken, Clothes, Milk} [sup = 3/7] and one rule from the frequent itemset: Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]

Dataset T (minsup = 50%; counts are written itemset:count)
  TID    Items
  T100   1, 3, 4
  T200   2, 3, 5
  T300   1, 2, 3, 5
  T400   2, 5
1. Scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   → F1: {1}:2, {2}:3, {3}:3, {5}:3
   → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. Scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   → C3: {2,3,5}
3. Scan T → C3: {2,3,5}:2
   → F3: {2,3,5}

Pattern-growth
• Divide-and-conquer strategy
• Focus the search on a restricted portion of the initial database and generate as few candidate sequences as possible
• FreeSpan, PrefixSpan, WAP-mine, and FS-Miner

Early-pruning
• Utilize a sort of position induction to prune candidate sequences very early in the mining process and to avoid support counting as much as possible
• LAPIN, HVSM, and DISC-all

Web mining
• Searching for patterns in data through:
  content mining • Search engines
  structure mining • Hyperlinks (HITS / PageRank)
  usage mining • User’s browser data and forms submitted
• One use of sequential pattern mining is finding user navigational patterns on the World Wide Web by extracting knowledge from web logs
• An example of applying sequential pattern mining:
  S = {a, b, c, d, e, f}
  [P1, <abdac>] [P2, <eaebcac>] [P3, <babfaec>] [P4, <abfac>]
  Frequent pattern: abac

Visual data mining
• Combines traditional mining methods and information visualization techniques
  • User is directly involved
• VDMS: simplicity, reliability, reusability, availability, and security

http://www.youtube.com/user/quiterian
http://www.youtube.com/watch?v=MtJ4Xa4-J8g
http://www.youtube.com/watch?v=_8HzwQCFFfw
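
The Apriori walk-through on dataset T and the grocery rule example earlier in these slides can be reproduced with a short Python sketch. This is only an illustration under stated assumptions, not the implementation behind the slides: the helper names apriori and confidence are hypothetical, minsup is taken as a fraction of the number of transactions, and candidates are generated with a simple join-and-prune step.

```python
from itertools import combinations

# Dataset T from the slides: transaction ID -> set of item numbers.
T = {
    "T100": {1, 3, 4},
    "T200": {2, 3, 5},
    "T300": {1, 2, 3, 5},
    "T400": {2, 5},
}

# Grocery transactions t1..t7 from the slides.
groceries = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]


def apriori(transactions, minsup):
    """Return {frozenset(itemset): count} for every frequent itemset.

    Hypothetical helper; minsup is a fraction (0.5 means 50%).
    """
    rows = [frozenset(t) for t in transactions]
    min_count = minsup * len(rows)

    # C1: every single item is a candidate.
    candidates = {frozenset([item]) for row in rows for item in row}
    frequent = {}
    k = 1
    while candidates:
        # One scan of the data: count transactions containing each candidate.
        counts = {c: sum(1 for row in rows if c <= row) for c in candidates}
        # F_k: candidates that reach the minimum support.
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (apriori property): every (k-1)-subset must be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def confidence(transactions, lhs, rhs):
    """Confidence of the rule lhs -> rhs: sup(lhs and rhs) / sup(lhs).

    Hypothetical helper; assumes lhs occurs in at least one transaction.
    """
    rows = [set(t) for t in transactions]
    lhs, rhs = set(lhs), set(rhs)
    both = sum(1 for row in rows if (lhs | rhs) <= row)
    return both / sum(1 for row in rows if lhs <= row)


if __name__ == "__main__":
    frequent = apriori(T.values(), minsup=0.5)
    for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
        print(sorted(itemset), frequent[itemset])
    print(confidence(groceries, {"Clothes"}, {"Milk", "Chicken"}))
```

With minsup = 50% on dataset T, the sketch reports {2, 3, 5} with count 2 as the only frequent 3-itemset, matching F3 in the walk-through, and the rule Clothes → Milk, Chicken has confidence 3/3 on the grocery data.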
