Chapter # 26 – Data Mining
Data mining
 finding interesting trends or patterns in large datasets to guide decisions
about future activities
 expectation that data mining tools can identify these patterns in the data
with minimal user input
 patterns identified give data analyst useful and unexpected insight that
can be more carefully investigated subsequently
 related to subarea of statistics called exploratory data analysis
 distinguishing characteristic of data mining is that the volume of data is
very large
Knowledge discovery process
1. Data selection : identify subset of data and attributes of interest by
examining raw data set
2. Data cleaning : noise and outliers are removed, field values are
transformed into common units, some new fields are created by combining
existing fields, data put into relational format
3. Data mining : apply data mining algorithms to extract interesting patterns
4. Evaluation : patterns are presented to end-users in an understandable
form (through visualisation)
Mining For Rules
 Algorithms proposed for discovering various forms of rules that describe
the data
o Association rules
 Occurrence of A implies occurrence of B
o Classification and Clustering
 Finding clusters of similar objects and deciding how to partition
these objects into classes
o Decision Trees
 Graphical representation of a collection of classification rules
 Given an object, the decision tree gives a quick way of
directing it from the most general class at the root to the most
specific class at a leaf.
o Sequence of Events
 There may be causal effects, such that an event is the
result of an earlier event
o Pattern in time series
 Similarity may be detected within portions of a time series
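The decision-tree item in the list above can be made concrete with a minimal Python sketch. The tree, its attributes (age, car_type) and the class labels are invented for illustration and do not come from the chapter; the point is only that each root-to-leaf path acts as one classification rule.

```python
# Hypothetical tree: internal nodes test one attribute, leaves carry a class label.
tree = {
    "test": lambda obj: obj["age"] <= 25,
    "yes": {"label": "high_risk"},
    "no": {
        "test": lambda obj: obj["car_type"] == "sports",
        "yes": {"label": "high_risk"},
        "no": {"label": "low_risk"},
    },
}

def classify(node, obj):
    """Walk from the root to a leaf, following the outcome of each test."""
    while "label" not in node:
        node = node["yes"] if node["test"](obj) else node["no"]
    return node["label"]

print(classify(tree, {"age": 40, "car_type": "sedan"}))  # low_risk
```

Read as rules, this toy tree says: age <= 25 => high_risk; age > 25 and car_type = sports => high_risk; otherwise low_risk.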
Association rules
 E.g. {pen} => {ink}
o If a pen is purchased, it is likely that ink is also purchased in that
transaction

2 important measures for an association rule
o Support :
 Support of a set of items is the % of transactions that contain
all these items
o Confidence :
 Confidence for a rule LHS => RHS is the % of transactions
containing all items in LHS that also contain all items in RHS
 Confidence = sup(LHS union RHS) / sup(LHS)
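A minimal Python sketch of the two measures, assuming a small made-up set of transactions. The data, the item names other than pen and ink, and the helper names support and confidence are illustrative and not part of the chapter.

```python
# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"pen", "ink", "milk", "juice"},
    {"pen", "ink", "milk"},
    {"pen", "milk"},
    {"pen", "ink", "juice"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """sup(LHS union RHS) / sup(LHS) for the rule LHS => RHS.

    Assumes LHS occurs in at least one transaction (otherwise sup(LHS) is 0).
    """
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"pen", "ink"}, transactions))       # 0.75
print(confidence({"pen"}, {"ink"}, transactions))  # 0.75
```

Here {pen, ink} appears in 3 of the 4 transactions, and pen appears in all 4, so the rule {pen} => {ink} has support 0.75 and confidence 0.75 / 1.0 = 0.75.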
Algorithm to find association rules

User can ask for all association rules that have a specified minimum
support (minsup) and minimum confidence (minconf)
1. All frequent itemsets with the user-specified minimum support are
computed
2. Rules are generated using the frequent itemsets as input
a. consider a frequent itemset X with support sX identified in 1st step
b. To generate a rule from X we divide X into 2 itemsets LHS and
RHS
c. The confidence of the rule LHS=>RHS is sX/sLHS
d. Because LHS is a subset of X, the support of LHS is at least sX and
therefore at least minsup, so LHS is itself frequent and its support
was already computed during the first step of the algorithm
e. We can compute the confidence values for the rule by calculating
the ratio support(X)/support(LHS)
i. Check the ratio against minconf; the rule qualifies only if the
ratio is at least minconf
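The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the chapter's algorithm: step 1 is done by brute-force enumeration of every possible itemset, which is only workable for a handful of items, and the function names, the sample transactions and the thresholds are all made up.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Step 1: map every itemset with support >= minsup to its support.

    Brute-force enumeration (an assumption of this sketch), exponential in
    the number of distinct items.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    supports = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = sum(1 for t in transactions if itemset <= t) / n
            if s >= minsup:
                supports[itemset] = s
    return supports

def generate_rules(supports, minconf):
    """Step 2: split each frequent itemset X into LHS and RHS = X - LHS."""
    rules = []
    for x, s_x in supports.items():
        for k in range(1, len(x)):
            for combo in combinations(sorted(x), k):
                lhs = frozenset(combo)
                # LHS is a subset of X, so it is frequent and already in supports.
                conf = s_x / supports[lhs]
                if conf >= minconf:
                    rules.append((set(lhs), set(x - lhs), s_x, conf))
    return rules

# Hypothetical data and thresholds.
transactions = [
    {"pen", "ink", "milk", "juice"},
    {"pen", "ink", "milk"},
    {"pen", "milk"},
    {"pen", "ink", "juice"},
]
supports = frequent_itemsets(transactions, minsup=0.7)
for lhs, rhs, sup, conf in generate_rules(supports, minconf=0.9):
    print(sorted(lhs), "=>", sorted(rhs), f"support={sup:.2f} confidence={conf:.2f}")
```

Note that the lookup supports[lhs] never fails: it relies on exactly the observation in step d, that every subset of a frequent itemset is itself frequent.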