Download Chapter26 - members.iinet.com.au

Chapter # 26 – Data Mining

Data mining
• Finding interesting trends or patterns in large datasets to guide decisions about future activities
• The expectation is that data mining tools can identify these patterns in the data with minimal user input
• The patterns identified give the data analyst useful and unexpected insights that can subsequently be investigated more carefully
• Related to a subarea of statistics called exploratory data analysis
• The distinguishing characteristic of data mining is that the volume of data is very large

Knowledge discovery process
1. Data selection: identify the subset of data and attributes of interest by examining the raw data set
2. Data cleaning: noise and outliers are removed, field values are transformed into common units, some new fields are created by combining existing fields, and the data is put into relational format
3. Data mining: apply data mining algorithms to extract interesting patterns
4. Evaluation: patterns are presented to end-users in an understandable form (through visualisation)

Mining For Rules
• Algorithms have been proposed for discovering various forms of rules that describe the data
  o Association rules
    • Occurrence of A implies occurrence of B
  o Classification and Clustering
    • Finding clusters of similar objects and how to partition these objects into classes
  o Decision Trees
    • Graphical representation of a collection of classification rules
    • Given an object, the decision tree gives a quick way of directing it from the most general class at the root to the most specific class at a leaf
  o Sequence of Events
    • There may be causal effects, such that an event is the result of an earlier event
  o Pattern in time series
    • Similarity may be detected between different positions within a time series

Association rules
• E.g. {pen} => {ink}
  o If a pen is purchased, it is likely that ink is also purchased in that transaction
• 2 important measures for an association rule
  o Support:
    • The support of a set of items is the % of transactions that contain all of these items
    • The support of a rule LHS => RHS is the support of the itemset LHS union RHS
  o Confidence:
    • Consider the transactions that contain all items in LHS. The confidence of a rule LHS => RHS is the % of such transactions that also contain all items in RHS
    • Confidence = sup(LHS union RHS) / sup(LHS)

Algorithm to find association rules
• The user can ask for all association rules that have a specified minimum support (minsup) and minimum confidence (minconf)
1. All frequent itemsets, i.e. itemsets whose support is at least the user-specified minimum support, are computed
2. Rules are generated using the frequent itemsets as input
   a. Consider a frequent itemset X with support sX identified in the 1st step
   b. To generate a rule from X, we divide X into 2 itemsets, LHS and RHS
   c. The confidence of the rule LHS => RHS is sX/sLHS, where sLHS is the support of LHS
   d. Because LHS is a subset of the frequent itemset X, the support of LHS is at least minsup, so it was already computed during the first step of the algorithm
   e. We can compute the confidence of the rule by calculating the ratio support(X)/support(LHS)
      i. Compare the ratio to minconf: the rule is kept only if its confidence is at least minconf
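
The two-step procedure above can be sketched in Python as follows. This is only an illustrative sketch, not the algorithm as given in the chapter: the sample transactions and the function names (support, frequent_itemsets, association_rules) are assumptions made for the example, and step 1 uses a brute-force enumeration of every candidate itemset, which is practical only for a handful of items.

```python
from itertools import combinations

# Hypothetical sample data: each transaction is the set of items bought together.
transactions = [
    {"pen", "ink", "milk", "juice"},
    {"pen", "ink", "milk"},
    {"pen", "milk"},
    {"pen", "ink", "juice"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, minsup):
    """Step 1: map every itemset with support >= minsup to its support.

    Brute force over all candidate itemsets (an assumption of this sketch).
    """
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= minsup:
                result[frozenset(candidate)] = s
    return result


def association_rules(transactions, minsup, minconf):
    """Step 2: split each frequent itemset X into LHS => RHS, keep confident rules."""
    frequent = frequent_itemsets(transactions, minsup)
    rules = []
    for x, s_x in frequent.items():
        for size in range(1, len(x)):
            for lhs_items in combinations(sorted(x), size):
                lhs = frozenset(lhs_items)
                rhs = x - lhs
                # LHS is a subset of the frequent itemset X, so its support
                # was already computed in step 1 and can simply be looked up.
                confidence = s_x / frequent[lhs]
                if confidence >= minconf:
                    rules.append((lhs, rhs, s_x, confidence))
    return rules
```

On the sample data, association_rules(transactions, minsup=0.7, minconf=0.9) keeps {ink} => {pen} and {milk} => {pen}, and drops {pen} => {ink}, whose confidence is 0.75 / 1.0 = 0.75.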
