CONCEPT DESCRIPTION
Data mining can be classified into two categories:
 Descriptive data mining: describes the data set in a concise and summative manner and
presents interesting general properties of the data
 Predictive data mining: analyzes the data in order to construct one or a set of models, and
attempts to predict the behavior of new data sets.
What is concept description?




 The simplest kind of descriptive data mining is concept description.
 As a data mining task, concept description is not a simple enumeration of the data. Instead, concept description generates descriptions for characterization and comparison of the data. It is sometimes called class description.
 Characterization provides a concise and succinct summarization of the given collection of data.
 Concept or class comparison (also known as discrimination) provides descriptions comparing two or more collections of data.
DATA GENERALIZATION
Data Generalization is the process of creating successive layers of summary data in an
evaluational database. It is a process of zooming out to get a broader view of a problem, trend or
situation. It is also known as rolling-up data.
Or
Data generalization is a form of descriptive data mining that abstracts a large set of task-relevant data in a database from a relatively low conceptual level to higher conceptual levels. For example, mobile number and landline number can be replaced by the higher-level concept telephone number (a small sketch follows the list below).
 It gives the user ease and flexibility in examining the general behavior of the data.
 For example, if a manager wants to view the data generalized to a higher level rather than examine individual employees, he or she can easily do so when generalization is used. This leads to the notion of concept description.
 Data generalization can provide great help in Online Analytical Processing (OLAP) technology. OLAP is used to provide quick answers to analytical queries, which are by nature multidimensional. It is commonly used as part of the broader category of business intelligence.
 Since OLAP is used mostly for business reporting such as those for sales, marketing,
management reporting, business process management and other related areas, having a better
view of trends and patterns greatly speeds up these reports.
 Data generalization is also especially beneficial in the implementation of online transaction processing (OLTP). OLAP was created later than OLTP and was derived from it with slight modifications.
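To make the generalization step concrete, here is a minimal sketch of climbing a concept hierarchy; the hierarchy mapping, attribute name, and records are invented purely for illustration:

```python
# Toy concept hierarchy: each low-level value maps to its higher-level concept.
CONCEPT_HIERARCHY = {
    "mobile number": "telephone number",
    "landline number": "telephone number",
}

def generalize(records, attribute):
    """Replace low-level values of `attribute` with their higher-level concept."""
    for rec in records:
        rec[attribute] = CONCEPT_HIERARCHY.get(rec[attribute], rec[attribute])
    return records

employees = [
    {"name": "A", "contact_type": "mobile number"},
    {"name": "B", "contact_type": "landline number"},
]
print(generalize(employees, "contact_type"))
# Both records now carry the generalized value "telephone number".
```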
Figure 1 Data generalization
What are the differences between concept description in large databases and on-line
analytical processing?
The fundamental differences between the two include the following:
1. Many current OLAP systems confine dimensions to nonnumeric data. Similarly,
measures (such as count (), sum (), average ()) in current OLAP systems apply only to
numeric data. In contrast, for concept formation, the database attributes can be of various
data types, including numeric, nonnumeric, spatial, text or image.
2. Concept description in databases can handle complex data types of the attributes and their aggregations, as necessary, whereas OLAP cannot.
ANALYTICAL CHARACTERIZATION: ANALYSIS OF
ATTRIBUTE RELEVANCE
 Measures of attribute relevance analysis can be used to help identify irrelevant or weakly
relevant attributes that can be excluded from the concept description process.
 The incorporation of this preprocessing step into class characterization or comparison is
referred to as analytical characterization or analytical comparison, respectively.
 An attribute or dimension is considered highly relevant with respect to a given class if it is
likely that the values of the attribute or dimension may be used to distinguish the class from
others.
 For example, it is unlikely that the color of an automobile can be used to distinguish
expensive from cheap cars, but the model, make, style, and number of cylinders are likely to
be more relevant attributes.
Methods of Attribute Relevance Analysis
The general idea behind attribute relevance analysis is to compute some measure which is used
to quantify the relevance of an attribute with respect to a given class or concept. Such measures
include the information gain, Gini index, uncertainty, and correlation coefficients.
We first examine the information-theoretic approach applied to the analysis of attribute
relevance. Let's take ID3 as an example.
 ID3 constructs a decision tree based on a given set of data tuples, or training objects, where
the class label of each tuple is known.
 The decision tree can then be used to classify objects for which the class label is not known.
 To build the tree, ID3 uses a measure known as information gain to rank each attribute.
 How does the information gain calculation work?
 Let S be a set of training objects where the class label of each object is known.
 Suppose that there are m classes. Let S contain si objects of class Ci, for i = 1 …m.
 An arbitrary object belongs to class Ci with probability si/s, where s is the total number of
objects in set S.
 When a decision tree is used to classify an object, it returns a class.
 Thus, the expected information needed to classify a given object is:
I(s1, s2, …, sm) = −∑ (si/s) log2(si/s), where the sum runs over i = 1 … m.
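As a quick check of this formula, a minimal sketch (the class counts here are hypothetical):

```python
import math

def expected_info(class_counts):
    """I(s1, ..., sm) = -sum((si/s) * log2(si/s)) over all classes."""
    s = sum(class_counts)
    return -sum((si / s) * math.log2(si / s) for si in class_counts if si > 0)

# Example: 9 training objects of class C1 and 5 of class C2.
print(round(expected_info([9, 5]), 3))  # 0.94 bits
```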
Attribute relevance analysis for concept description is performed as follows:
1. Data collection.
 Collect data for both the target class and the contrasting class by query processing.
 For class comparison, both the target class and the contrasting class are provided by the
user in the data mining query.
 For class characterization, the target class is the class to be characterized, whereas the
contrasting class is the set of comparable data which are not in the target class.
2. Preliminary relevance analysis using conservative AOI.
 This step identifies a set of dimensions and attributes on which the selected relevance
measure is to be applied.
 Since different levels of a dimension may have dramatically different relevance with
respect to a given class, each attribute defining the conceptual levels of the dimension
should be included in the relevance analysis in principle.
 Attribute-oriented induction (AOI) can be used to perform some preliminary relevance
analysis on the data by removing or generalizing attributes having a very large number of
distinct values (such as name and phone#).
3. Evaluate each remaining attribute using the selected relevance analysis measure:
 Evaluate each attribute in the candidate relation using the selected relevance analysis
measure.
 The relevance measure used in this step may be built into the data mining system, or
provided by the user (depending on whether the system is flexible enough to allow users
to define their own relevance measurements).
 For example, the information gain measure described above may be used.
4. Remove irrelevant and weakly relevant attributes.
 Remove from the candidate relation the attributes which are not relevant or are weakly
relevant to the concept description task, based on the relevance analysis measure used
above.
 A threshold may be set to define “weakly relevant”.
 This step results in an initial target class working relation and an initial contrasting class
working relation.
5. Generate the concept description using AOI.
 Perform AOI using a less conservative set of attribute generalization thresholds.
 If the descriptive mining task is class characterization, only the initial target class
working relation is included here.
 If the descriptive mining task is class comparison, both the initial target class working
relation and the initial contrasting class working relation are included.
MINING CLASS COMPARISONS: DISCRIMINATING BETWEEN
DIFFERENT CLASSES



 Class discrimination or comparison (hereafter referred to as class comparison) mines descriptions that distinguish a target class from its contrasting classes.
 Note: the target and contrasting classes must be comparable in the sense that they share similar dimensions and attributes.
 For example, the three classes person, address, and item are not comparable. However, the sales in the last three years are comparable classes.
“How is class comparison performed?” In general, the procedure is as follows:
1. Data collection:
 The set of relevant data in the database is collected by query processing and is
partitioned respectively into a target class and one or a set of contrasting classes.
2. Dimension relevance analysis:
 If there are many dimensions, then dimension relevance analysis should be
performed on these classes to select only the highly relevant dimensions for
further analysis.
 Correlation or entropy-based measures can be used for this step.
3. Synchronous generalization:
 Generalization is performed on the target class to the level controlled by a user- or
expert-specified dimension threshold, which results in a prime target class
relation.
 The concepts in the contrasting class are generalized to the same level as those in
the prime target class relation, forming the prime contrasting class relation.
4. Presentation of the derived comparison:
 The resulting class comparison description can be visualized in the form of tables,
graphs, and rules.
 This presentation usually includes a “contrasting” measure such as count% (percentage count) that reflects the comparison between the target and contrasting classes (a small count% sketch follows this list).
 The user can adjust the comparison description by applying drill-down, roll-up,
and other OLAP operations to the target and contrasting classes, as desired.
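A minimal sketch of the count% measure mentioned in step 4, assuming the prime relations are simply lists of generalized tuples; the tuple values are invented:

```python
from collections import Counter

def count_percent(generalized_tuples):
    """Percentage count of each generalized tuple within its own class."""
    counts = Counter(generalized_tuples)
    total = sum(counts.values())
    return {t: round(100.0 * c / total, 1) for t, c in counts.items()}

# Hypothetical prime relations after synchronous generalization.
target      = [("M.Sc", "20-25"), ("M.Sc", "20-25"), ("B.Sc", "26-30")]
contrasting = [("B.Sc", "26-30"), ("B.Sc", "26-30"), ("M.Sc", "20-25")]

print("target     :", count_percent(target))
print("contrasting:", count_percent(contrasting))
# Comparing the two count% distributions highlights discriminating tuples.
```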
Example 4.27 (Han and Kamber, third edition)
MINING DESCRIPTIVE STATISTICAL MEASURES IN LARGE
DATABASES
Descriptive data summarization techniques can be used to identify the typical properties of your
data and highlight which data values should be treated as noise or outliers. Thus, we first
introduce the basic concepts of descriptive data summarization before getting into the concrete
workings of data preprocessing techniques. For many data preprocessing tasks, users would like
to learn about data characteristics regarding both central tendency and dispersion of the data.
 Measures of central tendency include mean, median, mode, and midrange, while measures of
data dispersion include quartiles, interquartile range (IQR), and variance.
 These descriptive statistics are of great help in understanding the distribution of the data.
From the data mining point of view, we need to examine how they can be computed
efficiently in large databases. Knowing what kind of measure we are dealing with can help us
choose an efficient implementation for it.
MEASURING THE CENTRAL TENDENCY
1. Mean: The most common and most effective numerical measure of the “center” of a set of
data is the (arithmetic) mean. Let x1, x2……xN be a set of N values or observations, such as
for some attribute, like salary. The mean of this set of values is
x̄ = (x1 + x2 + ··· + xN) / N

 A distributive measure is a measure (i.e., function) that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results in order to arrive at the measure’s value for the original (entire) data set.
 Both sum() and count() are distributive measures because they can be computed in this manner.
 An algebraic measure is a measure that can be computed by applying an algebraic function to one or more distributive measures.
 Hence, average (or mean()) is an algebraic measure because it can be computed by sum()/count(), as sketched below.
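A minimal sketch of this distributive computation, merging per-partition sum() and count() results; the salary values and the partitioning are invented:

```python
def partial_aggregates(partition):
    """Compute the distributive measures sum() and count() for one subset."""
    return sum(partition), len(partition)

def mean_from_partitions(partitions):
    """Merge partial results; mean = sum()/count() is an algebraic measure."""
    totals = [partial_aggregates(p) for p in partitions]
    grand_sum = sum(s for s, _ in totals)
    grand_count = sum(c for _, c in totals)
    return grand_sum / grand_count

salaries = [[30, 36, 47], [50, 52], [52, 56, 60, 63, 70, 70, 110]]
print(mean_from_partitions(salaries))  # 58.0, same as the mean of the full set
```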
2. Weighted arithmetic mean or the weighted average: Sometimes, each value xi in a set may be associated with a weight wi, for i = 1, …, N. The weights reflect the significance, importance, or occurrence frequency attached to their respective values. In this case, we can compute

x̄ = (w1x1 + w2x2 + ··· + wNxN) / (w1 + w2 + ··· + wN)
3. Median: For skewed (asymmetric) data, a better measure of the center of data is the median.
Suppose that a given data set of N distinct values is sorted in numerical order.
 If N is odd, then the median is the middle value of the ordered set.
 If N is even, the median is the average of the middle two values.
 A holistic measure is a measure that must be computed on the entire data set as a whole. It
cannot be computed by partitioning the given data into subsets and merging the values
obtained for the measure in each subset. The median is an example of a holistic measure.
 Assume that data are grouped in intervals according to their xi data values and that the
frequency (i.e., number of data values) of each interval is known.
 For example, people may be grouped according to their annual salary in intervals such as
10–20K, 20–30K, and so on. Let the interval that contains the median frequency be the
median interval.
Median = L1 + ((N/2 − (∑freq)l) / freqmedian) × width

where
L1 is the lower boundary of the median interval,
N is the number of values in the entire data set,
(∑freq)l is the sum of the frequencies of all of the intervals that are lower than the median interval,
freqmedian is the frequency of the median interval, and
width is the width of the median interval.

Figure 2 Mean, median, and mode of symmetric versus positively and negatively skewed data.
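A sketch of this interpolation, assuming the data come pre-grouped as (lower boundary, width, frequency) triples; the intervals and frequencies are invented:

```python
def grouped_median(intervals):
    """intervals: (lower_boundary, width, frequency) triples, sorted ascending.
    Applies: median = L1 + ((N/2 - (sum freq)_l) / freq_median) * width."""
    n = sum(freq for _, _, freq in intervals)
    cum_below = 0
    for lower, width, freq in intervals:
        if cum_below + freq >= n / 2:  # this is the median interval
            return lower + ((n / 2 - cum_below) / freq) * width
        cum_below += freq

# Hypothetical salary intervals (in K): 10-20K, 20-30K, 30-40K.
print(grouped_median([(10, 10, 200), (20, 10, 450), (30, 10, 300)]))  # ~26.11
```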
4. Mode: The mode for a set of data is the value that occurs most frequently in the set.
 It is possible for the greatest frequency to correspond to several different values, which
results in more than one mode.
 Data sets with one, two, or three modes are respectively called unimodal, bimodal, and
trimodal.
 In general, a data set with two or more modes is multimodal.
 At the other extreme, if each data value occurs only once, then there is no mode.
For unimodal frequency curves that are moderately skewed, we have the empirical relation: mean − mode ≈ 3 × (mean − median).
 The midrange can also be used to assess the central tendency of a data set. It is the average of
the largest and smallest values in the set.
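For completeness, a small sketch of mode and midrange using Python's statistics.multimode (Python 3.8+); the data values are invented:

```python
from statistics import multimode

data = [1, 2, 2, 3, 3, 4]

modes = multimode(data)                 # [2, 3] -> a bimodal data set
midrange = (max(data) + min(data)) / 2  # average of largest and smallest values
print(modes, midrange)
```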
MEASURING THE DISPERSION OF DATA
The degree to which numerical data tend to spread is called the dispersion, or variance of the
data. The most common measures of data dispersion are
 Range
 the five-number summary (based on quartiles)
 interquartile range
 Standard deviation.
Boxplots can be plotted based on the five-number summary and are a useful tool for identifying
outliers.
Range, Quartiles, Outliers, and Boxplots
Range: The range of the set is the difference between the largest (max()) and smallest (min())
values.
Kth percentile: Let x1, x2, …, xN be a set of observations for some attribute. The kth percentile is the value xi having the property that k percent of the data entries lie at or below xi. The median (discussed in the previous subsection) is the 50th percentile. The most commonly used percentiles other than the median are quartiles.
First quartile: denoted by Q1, is the 25th percentile
Third quartile: denoted by Q3, is the 75th percentile.
Interquartile range: The distance between the first and third quartiles is a simple measure of
spread that gives the range covered by the middle half of the data. This distance is called the
interquartile range (IQR) and is defined as
IQR = Q3-Q1.
Outliers: A common rule of thumb is to single out values falling at least 1.5 × IQR above the third quartile or below the first quartile.
Five-number summary: Because Q1, the median, and Q3 together contain no information about the endpoints (e.g., tails) of the data, a fuller summary of the shape of a distribution can be obtained by providing the lowest and highest data values as well. This is known as the five-number summary. It consists of the median, the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order
Minimum, Q1, Median, Q3, Maximum.
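A sketch of the five-number summary and the 1.5 × IQR fences with NumPy; the price values are invented, and np.percentile's default linear interpolation is just one of several quartile conventions:

```python
import numpy as np

prices = np.array([40, 45, 60, 62, 75, 80, 83, 95, 100, 120, 175])

q1, median, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("five-number summary:", prices.min(), q1, median, q3, prices.max())
print("suspected outliers :", prices[(prices < lower_fence) | (prices > upper_fence)])
```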
A boxplot incorporates the five-number summary as follows:
 Typically, the ends of the box are at the quartiles, so that the box length is the
interquartile range, IQR.
 The median is marked by a line within the box.
 Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest
(Maximum) observations.
Figure 3 Boxplot for the unit price data for items sold at four branches of AllElectronics during a given time
period.
Here, for branch 1, we see that the median price of items sold is $80, Q1 is $60, and Q3 is $100.
Variance and Standard Deviation
The variance of N observations, x1, x2………xN, is
σ² = (1/N) ∑ (xi − x̄)² = (1/N) [∑ xi² − (1/N)(∑ xi)²]
where x̄ is the mean value of the observations.
The standard deviation, σ, is the square root of the variance, σ².
The basic properties of the standard deviation, σ, as a measure of spread are:
 σ measures spread about the mean and should be used only when the mean is chosen as the measure of center.
 σ = 0 only when there is no spread, that is, when all observations have the same value. Otherwise σ > 0.
Note that ∑ xi and ∑ xi² are distributive measures, so the right-hand form of the equation gives a one-pass computation, as sketched below.
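A minimal sketch of that one-pass computation; the salary values are the same invented ones as in the mean example:

```python
import math

def variance_and_std(values):
    """Population variance via (1/N) * [sum(x^2) - (1/N) * (sum(x))^2]."""
    n = len(values)
    s = sum(values)
    sq = sum(x * x for x in values)
    var = (sq - s * s / n) / n
    return var, math.sqrt(var)

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
var, sigma = variance_and_std(salaries)
print(round(var, 2), round(sigma, 2))  # ~379.17 and ~19.47
```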
GRAPH DISPLAYS OF BASIC STATISTICAL CLASS DESCRIPTION
Aside from the bar charts, pie charts, and line graphs used in most statistical or graphical data
presentation software packages, there are other popular types of graphs for the display of data
summaries and distributions. These include
 Histograms
 quantile plots
 q-q plots
 scatter plots, and
 Loess curves.
Histogram:
 A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets.
 Width of each bucket is uniform.
 Each bucket is represented by a rectangle whose height is equal to the count or relative
frequency of the values at the bucket.
 If A is categoric, such as automobile model or item type, then one rectangle is drawn for
each known value of A, and the resulting graph is more commonly referred to as a bar
chart.
 If A is numeric, the term histogram is preferred.
The figure above shows a histogram, where the buckets are defined by equal-width ranges representing $20 increments and the frequency is the count of items sold.
Quantile plot:
 A quantile plot is a simple and effective way to have a first look at a univariate data
distribution.
 First, it displays all of the data for the given attribute.
 Second, it plots quantile information.
 Let xi, for i = 1 to N, be the data sorted in increasing order so that x1 is the smallest
observation and xN is the largest. Each observation, xi, is paired with a percentage, fi,
which indicates that approximately 100 fi% of the data are below or equal to the value, xi.
We say “approximately” because there may not be a value with exactly a fraction, fi, of
the data below or equal to xi. Note that the 0.25 quantile corresponds to quartile Q1, the
0.50 quantile is the median, and the 0.75 quantile is Q3.
Let fi = (i − 0.5) / N.
These numbers increase in equal steps of 1/N, ranging from 1/2N (which is slightly above zero) to 1 − 1/2N (which is slightly below one). On a quantile plot, xi is graphed against fi. This allows us to compare different distributions based on their quantiles. For example, given the quantile plots of sales data for two different time periods, we can compare their Q1, median, Q3, and other fi values at a glance.
Figure 4 A quantile plot for the unit price data
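A minimal sketch of computing the (fi, xi) pairs to plot; the unit price values are invented:

```python
def quantile_plot_points(values):
    """Pair each sorted observation xi with fi = (i - 0.5) / N (1-based i)."""
    xs = sorted(values)
    n = len(xs)
    return [((i + 0.5) / n, x) for i, x in enumerate(xs)]

unit_prices = [40, 56, 64, 75, 80, 93, 100, 117]
for f, x in quantile_plot_points(unit_prices):
    print(f"f = {f:.3f}  ->  x = {x}")
# Graphing x against f (e.g., with matplotlib) yields the quantile plot.
```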
Quantile-quantile plot (q-q plot)
A q-q plot graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
 Suppose that we have two sets of observations for the variable unit price, taken from two different branch locations.
 Let x1, …, xN be the data from the first branch, and y1, …, yM be the data from the second, where each data set is sorted in increasing order.
 If M = N (i.e., the number of points in each set is the same), then we simply plot yi against xi, where yi and xi are both (i − 0.5)/N quantiles of their respective data sets.
 If M < N (i.e., the second branch has fewer observations than the first), there can be only M points on the q-q plot. Here, yi is the (i − 0.5)/M quantile of the y data, which is plotted against the (i − 0.5)/M quantile of the x data.
Figure 5 A quantile-quantile plot for unit price data from two different branches.
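A hedged sketch of the M < N case, using numpy.quantile to interpolate the quantiles of the larger data set; the branch data are invented:

```python
import numpy as np

branch1 = np.sort(np.array([42, 44, 50, 55, 61, 70, 72, 85, 93, 103]))  # N = 10
branch2 = np.sort(np.array([40, 48, 58, 66, 90]))                       # M = 5

m = len(branch2)
f = (np.arange(1, m + 1) - 0.5) / m      # the (i - 0.5)/M quantile levels
x_quantiles = np.quantile(branch1, f)    # interpolated quantiles of branch 1

for xq, y in zip(x_quantiles, branch2):
    print(f"x-quantile = {xq:6.2f}   y-quantile = {y}")
# Points near the line y = x indicate that the two distributions are similar.
```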
Scatter plot
 A scatter plot is one of the most effective graphical methods for determining if there appears to be a relationship, pattern, or trend between two numerical attributes.
 To construct a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as points in the plane.
Figure 6 Scatter plots can be used to find (a) positive or (b) negative correlations between attributes.
MINING VARIOUS KINDS OF ASSOCIATION RULES
Association rule mining finds interesting association or correlation relationships among a large set of data items. With massive amounts of data continuously being collected and stored, many industries are becoming interested in mining association rules from their databases. The discovery of interesting associations among huge amounts of business transaction records can help in many business decision-making processes, such as catalog design, cross-marketing, and loss-leader analysis.
A typical example of association rule mining is market basket analysis.



 This process analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets”.
 The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers.
 For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket? Such information can lead to increased sales by helping retailers do selective marketing and plan their shelf space.
How can we find association rules from large amounts of data, where the data are either transactional or relational? Which association rules are the most interesting? How can we help or guide the mining procedure to discover interesting associations? What language constructs are useful in defining a data mining query language for association rule mining?
ASSOCIATION RULE MINING
Association rule mining searches for interesting relationships among items in a given data set.
This section provides an introduction to association rule mining. We begin by presenting an
example of market basket analysis, the earliest form of association rule mining. The basic
concepts of mining associations are given, and we present a road map to the different kinds of association rules that can be mined.
MARKET BASKET ANALYSIS: A MOTIVATING EXAMPLE FOR ASSOCIATION
RULE MINING
Suppose, as manager of an ABCompany branch, you would like to learn more about the buying
habits of your customers. Specifically, you wonder, “Which groups or sets of items are customers likely to purchase on a given trip to the store?”
To answer this question, market basket analysis may be performed on the retail data of customer transactions at your store. The results may be used to plan marketing or advertising strategies, as well as catalog design. For instance, market basket analysis may help managers design different store layouts. In one strategy, items that are frequently purchased together can be placed in close proximity in order to further encourage the sale of such items together. If customers
who purchase computers also tend to buy financial management software at the same time, then placing the hardware display close to the software display may help to increase the sales of both of these items. In an alternative strategy, placing hardware and software at opposite ends of the store may entice customers who purchase such items to pick up other items along the way. Market basket analysis can also help retailers plan which items to put on sale at reduced prices. If customers tend to purchase computers and printers together, then having a sale on printers may encourage the sale of printers as well as computers.

Figure 7 Market basket analysis
An example association rule is shown below:
computer ⇒ financial_management_software [support = 2%, confidence = 60%]
Rule support and confidence are two measures of rule interestingness that were described earlier. They respectively reflect the usefulness and certainty of discovered rules. A support of 2% for the association rule means that 2% of all the transactions under analysis show that computer and financial management software are purchased together. A confidence of 60% means that 60% of the customers who purchased a computer also bought the software. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Users or domain experts can set such thresholds.
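A minimal sketch of how support and confidence would be computed from raw transactions; the transactions are invented:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """P(consequent | antecedent) = support(A u B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

T = [{"computer", "software"}, {"computer"}, {"computer", "software", "printer"},
     {"printer"}, {"computer", "software"}]

print(support(T, {"computer", "software"}))       # 0.6
print(confidence(T, {"computer"}, {"software"}))  # 0.75
```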
"How are association rules mined from large databases?" Association rule mining is a two-step
process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count.
2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence. Additional interestingness measures can be applied, if desired. The second step is the easier of the two. The overall performance of mining association rules is determined by the first step.
ASSOCIATION RULE MINING: A ROAD MAP
Market basket analysis is just one form of association rule mining; in fact, there are many kinds
of association rules. Association rules can be classified in various ways, based on the following
criteria:
Based on the types of values handled in the rule:
(i) If a rule concerns associations between the presence or absence of items, it is a Boolean association rule.
(ii) If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. In these rules, quantitative values for items or attributes are partitioned into intervals. The following rule is an example of a quantitative association rule, where X is a variable representing a customer:
age(X, “30...39”) ∧ income(X, “42K...48K”) ⇒ buys(X, “high resolution TV”)
(iii) Based on the dimensions of data involved in the rule: If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule. Note that the rule buys(X, “computer”) ⇒ buys(X, “financial_management_software”) is a single-dimensional association rule since it refers to only one dimension, buys. If a rule references two or more dimensions, such as the dimensions buys, time_of_transaction, and customer_category, then it is a multidimensional association rule.
(iv) Based on the levels of abstraction in the rule set: Some methods for association rule mining can find rules at differing levels of abstraction. For example, suppose that a set of association rules mined includes the following rules:
age(X, “30...39”) ⇒ buys(X, “laptop computer”)
age(X, “30...39”) ⇒ buys(X, “computer”)
In the above rules, the items bought are referenced at different levels of abstraction (e.g., “computer” is a higher-level abstraction of “laptop computer”). We refer to the rule set mined as consisting of multilevel association rules. If, instead, the rules within a given set do not reference items or attributes at different levels of abstraction, then the set contains single-level association rules.
(v) Based on various extensions to association mining: Association mining can be extended to correlation analysis, where the absence or presence of correlated items can be identified. It can also be extended to mining max patterns (i.e., maximal frequent patterns) and frequent closed itemsets. A max pattern is a frequent pattern, p, such that any proper superpattern of p is not frequent. A frequent closed itemset is a frequent itemset that is closed, where an itemset c is closed if there exists no proper superset c′ of c such that every transaction containing c also contains c′.
MINING SINGLE-DIMENSIONAL ASSOCIATION RULES FROM TRANSACTIONAL
DATABASES:
The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation
Apriori algorithm
 Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.
 Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k + 1)-itemsets.
 First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy minimum support.
 The resulting set is denoted by L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database.
 To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent. If an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent. If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur more frequently than I.
“How is the Apriori property used in the algorithm?”
To understand this, let us look at how Lk−1 is used to find Lk for k ≥ 2. A two-step process is
followed, consisting of join and prune actions.
The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk−1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk−1. The notation li[j] refers to the jth item in li (e.g., l1[k − 2] refers to the second to the last item in l1). For efficient implementation, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. For the (k − 1)-itemset, li, this means that the items are sorted such that li[1] < li[2] < ··· < li[k − 1]. The join, Lk−1 ✶ Lk−1, is performed, where members of Lk−1 are joinable if their first (k − 2) items are in common. That is, members l1 and l2 of Lk−1 are joined if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ··· ∧ (l1[k − 2] = l2[k − 2]) ∧ (l1[k − 1] < l2[k − 1]). The condition l1[k − 1] < l2[k − 1] simply ensures that no duplicates are generated. The resulting itemset formed by joining l1 and l2 is {l1[1], l1[2], ..., l1[k − 2], l1[k − 1], l2[k − 1]}.
The prune step: Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A database scan to determine the count of each candidate in Ck would result in the determination of Lk (i.e., all candidates having a count no less than the minimum support count are frequent by definition, and therefore belong to Lk). Ck, however, can be huge, and so this could involve heavy computation. To reduce the size of Ck, the Apriori property is used as follows. Any (k − 1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k − 1)-subset of a candidate k-itemset is not in Lk−1, then the candidate cannot be frequent either and so can be removed from Ck. This subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.
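The join and prune steps can be sketched as follows; this simplified apriori_gen keeps itemsets as sorted tuples and checks subsets directly rather than through a hash tree:

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets (Ck) from the frequent (k-1)-itemsets.
    L_prev is a set of sorted (k-1)-tuples."""
    candidates = set()
    for l1 in L_prev:
        for l2 in L_prev:
            # Join step: first k-2 items agree and l1's last item < l2's last.
            if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                c = l1 + (l2[k - 2],)
                # Prune step: every (k-1)-subset of c must itself be frequent.
                if all(s in L_prev for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

L2 = {("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")}
print(sorted(apriori_gen(L2, 3)))  # [('I1','I2','I3'), ('I1','I2','I5')]
```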
Example 6.3
Apriori: Let’s look at a concrete example, based on the AllElectronics transaction database, D, of Table 1 below. There are nine transactions in this database, that is, |D| = 9. We use Figure 1 to illustrate the Apriori algorithm for finding frequent itemsets in D.
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets,
C1. The algorithm simply scans all of the transactions to count the number of occurrences of
each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. (Here, we are
referring to absolute support because we are using a support count. The corresponding relative
support is 2/9 = 22 %.) The set of frequent 1-itemsets, L1, can then be determined. It consists of
the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in C1
satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ✶ L1 to generate
a candidate set of 2-itemsets, C2. Note that no candidates are removed from C2 during the prune
step because each subset of the candidates is also frequent.
Table 1: Transactional Data for an AllElectronics Branch

TID     List of item_IDs
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3
Solution:
Figure 1
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated, as shown in the middle table of the second row in Figure 1.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets
in C2 having minimum support.
6. The generation of the set of candidate 3-itemsets, C3, proceeds as follows. From the join step, we first get C3 = L2 ✶ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. We therefore remove them from C3, thereby saving the effort of unnecessarily obtaining their counts during the subsequent scan of D to determine L3. Note that when given a candidate k-itemset, we only need to check if its (k − 1)-subsets are frequent since the Apriori algorithm uses a level-wise search strategy. The resulting pruned version of C3 is shown in the first table of the bottom row of Figure 1.
7. The transactions in D are scanned to determine L3, consisting of those candidate 3-itemsets in
C3 having minimum support.
8. The algorithm uses L3 ✶ L3 to generate a candidate set of 4-itemsets, C4. Although the join
results in {{I1, I2, I3, I5}}, item set {I1, I2, I3, I5} is pruned because its subset {I2, I3, I5} is not
frequent. Thus, C4 = φ, and the algorithm terminates, having found all of the frequent item sets.
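Under the same assumptions, the full level-wise loop can be run on the Table 1 transactions to reproduce L1, L2, and L3 from this example; apriori_gen from the earlier sketch is assumed to be in scope:

```python
from collections import Counter

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
MIN_SUP = 2  # absolute (count-based) minimum support

def frequent(candidates):
    """One scan of D: keep the candidates whose count reaches MIN_SUP."""
    counts = Counter()
    for t in D:
        for c in candidates:
            if set(c) <= t:
                counts[c] += 1
    return {c for c, n in counts.items() if n >= MIN_SUP}

L = frequent({(item,) for t in D for item in t})  # L1: frequent 1-itemsets
k = 2
while L:
    print(f"L{k - 1}: {sorted(L)}")
    L = frequent(apriori_gen(L, k))  # join + prune, then count support
    k += 1
```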
MINING MULTILEVEL ASSOCIATION RULES FROM TRANSACTION
DATABASES
Multilevel Association Rules
For many applications, it is difficult to find strong associations among data items at low or
primitive levels of abstraction due to the sparsity of data in multidimensional space. Strong associations discovered at high concept levels may represent commonsense knowledge.
However, what may represent common sense to one user may seem novel to another. Therefore,
data mining systems should provide capabilities to mine association rules at multiple levels of
abstraction and traverse easily among different abstraction spaces.
Approaches to mining multilevel Association Rules
"How can we mine multilevel association rules efficiently using concept hierarchies?" Let's look
at some approaches based on a support-confidence framework.
In general, a top-down strategy is employed, where counts are accumulated for the calculation of frequent itemsets at each concept level, starting at concept level 1 and working toward the lower, more specific concept levels, until no more frequent itemsets can be found. That is, once all frequent itemsets at concept level 1 are found, the frequent itemsets at level 2 are found, and so on. For each level, any algorithm for discovering frequent itemsets may be used, such as Apriori or its variations.
 Using uniform minimum support for all levels (referred to as uniform support). The same
minimum support threshold is used when mining at each level of abstraction. The uniform
support approach, however, has some difficulties. It is unlikely that items at lower levels of
abstraction will occur as frequently as those at higher levels of abstraction. If the minimum
support threshold is set too high, it could miss several meaningful associations occurring at
low abstraction levels. If the threshold is set too low, it may generate many uninteresting
associations occurring at high abstraction levels. This provides the motivation for the
following approach.
 Using reduced minimum support at lower levels (referred to as reduced support): each level of abstraction has its own minimum support threshold. The lower the abstraction level, the smaller the corresponding threshold.
For mining multiple-level associations with reduced support, there are a number of alternative
search strategies:
Level-by-level independent: This is a full-breadth search, where no background knowledge of frequent itemsets is used for pruning. Each node is examined, regardless of whether or not its parent node is found to be frequent.
Level-cross-filtering by single item: An item at the ith level is examined if and only if its parent node at the (i − 1)th level is frequent. In other words, we investigate a more specific association from a more general one. If a node is frequent, its children will be examined; otherwise, its descendants are pruned from the search. A small sketch of this strategy follows.
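A hedged sketch of level-cross-filtering by single item combined with reduced, per-level support thresholds; the toy hierarchy, transactions, and thresholds are all invented:

```python
from collections import Counter

# Toy concept hierarchy: level-1 concepts -> their level-2 children.
HIERARCHY = {"computer": ["laptop computer", "desktop computer"],
             "printer": ["inkjet printer", "laser printer"]}

# Each transaction lists the most specific (level-2) items purchased.
D = [{"laptop computer", "inkjet printer"}, {"laptop computer"},
     {"desktop computer", "laser printer"}, {"laptop computer"}]

MIN_SUP = {1: 3, 2: 2}  # reduced support: a smaller threshold at the lower level

# Level 1: count a parent whenever any of its children appears.
level1 = Counter()
for t in D:
    for parent, children in HIERARCHY.items():
        if any(c in t for c in children):
            level1[parent] += 1
frequent1 = {p for p, n in level1.items() if n >= MIN_SUP[1]}

# Level 2: examine an item only if its parent is frequent (level-cross filtering).
level2 = Counter()
for t in D:
    for parent in frequent1:
        for child in HIERARCHY[parent]:
            if child in t:
                level2[child] += 1
frequent2 = {c for c, n in level2.items() if n >= MIN_SUP[2]}

print(frequent1)  # {'computer'}: 'printer' appears in only 2 of 4 transactions
print(frequent2)  # {'laptop computer'}: the descendants of 'printer' are pruned
```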
MINING MULTIDIMENSIONAL ASSOCIATION RULES FROM
RELATIONAL DATABASES AND DATA WAREHOUSES
In this section, you will learn methods for mining multidimensional association rules, that is, rules involving more than one dimension or predicate (e.g., rules relating what a customer buys as well as the customer's age). These methods can be organized according to their treatment of quantitative attributes.
Multidimensional Association Rules
So far we have studied association rules that imply a single predicate, that is, the predicate buys. For instance, in mining our ABCompany database, we may discover the Boolean association rule “IBM desktop computer” ⇒ “Sony b/w printer”, which can also be written as:
buys(X, “IBM desktop computer”) ⇒ buys(X, “Sony b/w printer”)
where X is a variable representing customers who purchased items in ABCompany transactions. Following the terminology used in multidimensional databases, we refer to each distinct predicate in a rule as a dimension. Hence, we can refer to the above rule as a single-dimensional or intradimensional association rule since it contains a single distinct predicate (e.g., buys) with multiple occurrences (i.e., the predicate occurs more than once within the rule). Such rules are commonly mined from transactional data.
Suppose, however, that rather than using a transactional database, sales and related information
are stored in a relational database or data warehouse. Such data stores are multidimensional, by
definition. For instance, in addition to keeping track of the items purchased in sales transactions,
a relational database may record other attributes associated with the items, such as the quantity
purchased or the price, or the branch location of the sale.
Considering each database attribute or warehouse dimension as a predicate, it can therefore be interesting to mine association rules containing multiple predicates, such as:
age(X, “20...29”) ∧ occupation(X, “student”) ⇒ buys(X, “laptop”)
Association rules that involve two or more dimensions or predicates can be referred to as
multidimensional association rules. The above rule contains three predicates (age, occupation,
and buys), each of which occurs only once in the rule.
Note that database attributes can be categorical or quantitative.
 Categorical attributes have a finite number of possible values, with no ordering among the
values (e.g., occupation, brand, color). Categorical attributes are also called nominal
attributes, since their values are “names of things”.
 Quantitative attributes are numeric and have an implicit ordering among values (e.g., age,
income, and price). Techniques for mining multidimensional association rules can be
categorized according to three basic approaches regarding the treatment of quantitative
(continuous-valued) attributes.
1. In the first approach:
 Quantitative attributes are discretized using predefined concept hierarchies.
 This discretization occurs prior to mining.
 For instance, a concept hierarchy for income may be used to replace the original numeric values of this attribute by ranges, such as “0–20K”, “21–30K”, “31–40K”, and so on.
 Here, discretization is static and predetermined.
 The discretized numeric attributes, with their range values, can then be treated as categorical attributes (where each range is considered a category).
 We refer to this as mining multidimensional association rules using static discretization of quantitative attributes.
2. In the second approach:
 Quantitative attributes are discretized into “bins” based on the distribution of the data.
 These bins may be further combined during the mining process.
 The discretization process is dynamic and established so as to satisfy some mining
criteria, such as maximizing the confidence of the rules mined.
 Because this strategy treats the numeric attribute values as quantities rather than as
predefined ranges or categories, association rules mined from this approach are also
referred to as quantitative association rules.
3. In the third approach:
 Quantitative attributes are discretized so as to capture the semantic meaning of such
interval data.
 This dynamic discretization procedure considers the distance between data points. Hence,
such quantitative association rules are also referred to as distance-based association rules.
Mining quantitative association rules
Quantitative association rules are multidimensional association rules in which the numeric
attributes are dynamically discretized during the mining process so as to satisfy some mining
criteria, such as maximizing the confidence or compactness of the rules mined. In this section,
we will focus specifically on how to mine quantitative association rules having two quantitative
attributes on the left-hand side of the rule, and one categorical attribute on the right-hand side of
the rule, e.g.,
Aquan1 ∧ Aquan2 ⇒ Acat
where Aquan1 and Aquan2 are tests on quantitative attribute ranges (where the ranges are dynamically determined), and Acat tests a categorical attribute from the task-relevant data. Such rules have been referred to as two-dimensional quantitative association rules, since they contain two quantitative dimensions. For instance, suppose you are curious about the association relationship between pairs of quantitative attributes, like customer age and income, and the type of television that customers like to buy. An example of such a 2-D quantitative association rule is:
age(X, “30...34”) ∧ income(X, “42K...48K”) ⇒ buys(X, “high resolution TV”)
Binning:
Quantitative attributes can have a very wide range of values defining their domain. Just think
about how big a 2-D grid would be if we plotted age and income as axes, where each possible
value of age was assigned a unique position on one axis, and similarly, each possible value of
income was assigned a unique position on the other axis! To keep grids down to a manageable
size, we instead partition the ranges of quantitative attributes into intervals. These intervals are
dynamic in that they may later be further combined during the mining process. The partitioning process is referred to as binning, i.e., where the intervals are considered “bins”. Three common binning strategies are:
1. Equi-width binning, where the interval size of each bin is the same;
2. Equi-depth binning, where each bin has approximately the same number of tuples assigned to it; and
3. Homogeneity-based binning, where the bin size is determined so that the tuples in each bin are uniformly distributed.

Figure 1 A 2-D grid for tuples representing customers who purchase high-resolution TVs
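Equi-width and equi-depth binning can be sketched with NumPy as follows (homogeneity-based binning requires a uniformity criterion and is omitted); the income values are invented:

```python
import numpy as np

incomes = np.array([22, 25, 31, 33, 34, 41, 44, 52, 58, 63, 70, 96])

# Equi-width binning: three equal-sized intervals over the attribute's range.
width_edges = np.linspace(incomes.min(), incomes.max(), num=4)

# Equi-depth binning: edges at quantiles, so bins hold ~equal numbers of tuples.
depth_edges = np.quantile(incomes, [0, 1 / 3, 2 / 3, 1])

print("equi-width edges:", width_edges)
print("equi-depth edges:", depth_edges)
print("equi-width bin of each tuple:", np.digitize(incomes, width_edges[1:-1]))
print("equi-depth bin of each tuple:", np.digitize(incomes, depth_edges[1:-1]))
```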