Association Rules
Association rules are used to show relationships between data items
Frequent pattern: a pattern (a set of items,
subsequences, substructures, etc.) that occurs
frequently in a data set
Mining Frequent patterns
Motivation: Finding inherent regularities in data
What products were often purchased together?
-Milk and Bread
What products are purchased one after the other?
-PC followed by digital camera
-TV set followed by VCD player
Is there a structure defining relationships in the
items purchased?
-tree structure defining dependencies
-Lattices defining some order in the items bought
Applications
- Market basket analysis
- Cross-market analysis
- Catalog design
- Sales campaign analysis
- Web log (click stream) analysis
- DNA sequence analysis
Forms the foundation for many essential data mining tasks
-Association, correlation, and causality analysis
-Classification: associative classification
-Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
-Cluster analysis: frequent pattern-based clustering
Market Basket Analysis
-It analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets
-It helps retailers in
-developing marketing strategies
-advertising strategies
-planning their shelf space
-preparing store layouts (proximity of related items)
-planning sales of non-moving items
-planning discounts, offers, etc.
-Each basket can be represented by a boolean vector
-These vectors can be analyzed to obtain buying patterns
-Buying patterns can be represented by association rules
-Support and confidence are two measures of a rule's interestingness; they reflect the usefulness and certainty of discovered rules
The support of an item (or set of items) is the percentage of transactions in which that item occurs.
Computing the support of every possible set of items is problematic, as the number of subsets increases exponentially with the number of items.
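For illustration, here is a minimal Python sketch (the item names and helper function are invented for this example) of encoding baskets as boolean vectors and computing the support of an itemset from them:

# Minimal sketch: baskets as boolean vectors over the item universe.
items = ["milk", "bread", "butter", "beer"]

baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"beer"},
    {"milk", "bread", "beer"},
]

# One boolean vector per basket: position i is True if items[i] is in the basket.
vectors = [[item in basket for item in items] for basket in baskets]

def support(itemset, vectors, items):
    """Fraction of transactions whose vector has True for every item in itemset."""
    idx = [items.index(i) for i in itemset]
    hits = sum(1 for v in vectors if all(v[j] for j in idx))
    return hits / len(vectors)

print(support({"milk", "bread"}, vectors, items))   # 0.75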
Association Rule – It is an implication of the form X => Y, where X and Y are sets of items called itemsets and X ∩ Y is empty.
The support (s) of an association rule X => Y is the percentage of transactions in the database that contain X ∪ Y.
The confidence or strength of an association rule X => Y is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X.
The support count is the absolute support, while the probability corresponding to the support count is the relative support.
Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

(Venn diagram: customer buys X, customer buys Y, customer buys both)

Itemset X = {x1, …, xk}
Association rule X => Y
Support, s: probability that a transaction contains X ∪ Y
  Support(X => Y) = P(X ∪ Y)
Confidence, c: conditional probability that a transaction having X also contains Y
  Confidence(X => Y) = P(Y|X) = P(X ∪ Y) / P(X)

Association rules:
A => D (support 60%, confidence 100%)
D => A (support 60%, confidence 75%)
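The figures above can be checked directly; the following Python sketch (helper names chosen for this example) reproduces the 60%/100% and 60%/75% values from the five transactions:

# Sketch: computing support and confidence for the rules above.
transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support(itemset):
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"A", "D"}), confidence({"A"}, {"D"}))  # 0.6 1.0  -> A => D (60%, 100%)
print(support({"A", "D"}), confidence({"D"}, {"A"}))  # 0.6 0.75 -> D => A (60%, 75%)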
An itemset that contains k items is a k-itemset.
The occurrence frequency of an itemset, or its support count, is the number of transactions that contain the itemset.
A minimum support threshold is specified.
A large (frequent) itemset is one whose support count is above the threshold.
Every subset of a large itemset is also large.
Generating association rules is a two-step process:
-Find all large (frequent) itemsets
-Generate strong association rules from the frequent itemsets, i.e. rules satisfying minimum support and minimum confidence
Confidence(A => B) = P(B|A) = support(A ∪ B) / support(A)
The support counts of A, B and A ∪ B are required to determine the association rules A => B and B => A.
For each frequent itemset l, generate all nonempty subsets of l.
For every nonempty subset s of l, output the association rule s => (l - s)
if support_count(l) / support_count(s) ≥ min_conf
(min_conf = minimum confidence threshold)
Consider l = {A,B,E}, min_sup = 2, over the following transactions:

Tid   Items
10    A, B, E
20    B, E
30    B, C
40    A, B, D
50    A, C
60    B, C
70    A, C
80    A, B, C, E
90    A, B, C

Its nonempty proper subsets are {A,B}, {A,E}, {B,E}, {A}, {B}, {E}

{A,B} => E     confidence = 2/4 = 50%
{A,E} => B     confidence = 2/2 = 100%
{B,E} => A     confidence = 2/3 = 66%
{A} => {B,E}   confidence = 2/6 = 33%
{B} => {A,E}   confidence = 2/7 = 28%
{E} => {A,B}   confidence = 2/3 = 66%

If the minimum confidence threshold is 70%, then {A,E} => B is the only association rule.
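The rule-generation loop above can be written out directly; this Python sketch (helper names are illustrative) reproduces the example, keeping only rules that meet a 70% confidence threshold:

from itertools import combinations

# Sketch of the rule-generation step for one frequent itemset l.
transactions = [
    {"A", "B", "E"}, {"B", "E"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]

def support_count(itemset):
    return sum(itemset <= t for t in transactions)

def rules_from_itemset(l, min_conf):
    l = frozenset(l)
    rules = []
    # enumerate all nonempty proper subsets s of l
    for r in range(1, len(l)):
        for s in combinations(l, r):
            s = frozenset(s)
            conf = support_count(l) / support_count(s)
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

print(rules_from_itemset({"A", "B", "E"}, 0.7))
# only {A,E} => B has confidence >= 70%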
Basic Algorithms
Apriori Algorithm – It is based on the large itemset, or Apriori, property.
Apriori property – all nonempty subsets of a frequent itemset must also be frequent; large itemsets are downward closed.
If we know that an itemset is small, we need not consider its supersets as candidates, because they will also be small.
Apriori employs an iterative approach known as level-wise search, where k-itemsets are used to explore (k+1)-itemsets.
•Initially, scan the DB once to get the frequent 1-itemsets
•Generate length-(k+1) candidate itemsets from length-k frequent itemsets
•Test the candidates against the DB
•Terminate when no frequent or candidate set can be generated
Apriori pruning principle: if any itemset is infrequent, its supersets need not be generated or tested.
Method:
Lk denotes the set of frequent k-itemsets (large itemsets)
Ck is a superset of Lk – the candidate large itemsets
Supmin = 2

Tid   Items
10    A, B, E
20    B, E
30    B, C
40    A, B, D
50    A, C
60    B, C
70    A, C
80    A, B, C, E
90    A, B, C

C1 (1st scan):
Itemset   sup
{A}       6
{B}       7
{C}       6
{D}       1
{E}       3

L1:
Itemset   sup
{A}       6
{B}       7
{C}       6
{E}       3
A two-step process consisting of join and prune actions is followed to generate Lk from Lk-1.
Join step – Apriori assumes that items within a transaction or itemset are sorted in lexicographic order.
The candidate set Ck is generated by taking the join Lk-1 x Lk-1, where members of Lk-1 are joinable if they have their first k-2 items in common. This ensures that no duplicates are generated.
Prune step – To reduce the size of Ck, the Apriori property is used as follows:
any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, the candidate cannot be frequent and can be removed from Ck.
The count of each candidate in Ck is then used to determine Lk (candidates meeting the minimum support count).
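A compact Python sketch of the join and prune steps (itemsets are kept as sorted tuples so the lexicographic-order assumption holds; the function name is illustrative), applied to the L2 of the running example:

from itertools import combinations

def apriori_generate(prev_frequent):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets (join + prune)."""
    prev = sorted(tuple(sorted(x)) for x in prev_frequent)
    k_minus_1 = len(prev[0])
    prev_set = set(prev)
    candidates = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            # join step: first k-2 items equal, so the union has exactly k items
            if a[:-1] == b[:-1]:
                c = a + (b[-1],)
                # prune step: every (k-1)-subset of c must be frequent
                if all(tuple(sorted(s)) in prev_set
                       for s in combinations(c, k_minus_1)):
                    candidates.append(c)
    return candidates

L2 = [("A", "B"), ("A", "C"), ("A", "E"), ("B", "C"), ("B", "E")]
print(apriori_generate(L2))
# [('A', 'B', 'C'), ('A', 'B', 'E')] -- {A,C,E} and {B,C,E} are pruned ({C,E} not in L2)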
Supmin = 2

L1:
Itemset   sup
{A}       6
{B}       7
{C}       6
{E}       3

C2 = L1 x L1, counted in the 2nd scan:
Itemset   sup
{A, B}    4
{A, C}    4
{A, E}    2
{B, C}    4
{B, E}    3
{C, E}    1

L2:
Itemset   sup
{A, B}    4
{A, C}    4
{A, E}    2
{B, C}    4
{B, E}    3
Supmin = 2

L2:
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}

L2 x L2 gives C3 = {{A,B,C}, {A,B,E}, {A,C,E}, {B,C,E}}

The 2-item subsets of {A,B,C} are {A,B}, {B,C}, {A,C}, which are all in L2.
The 2-item subsets of {A,B,E} are {A,B}, {B,E}, {A,E}, which are all in L2.
The 2-item subsets of {A,C,E} are {A,C}, {C,E} and {A,E}; {C,E} is not in L2, so remove {A,C,E}.
The 2-item subsets of {B,C,E} are {B,C}, {C,E} and {B,E}; {C,E} is not in L2, so remove {B,C,E}.

After pruning, C3 = {{A,B,C}, {A,B,E}}

L3 (3rd scan):
Itemset      sup
{A, B, C}    2
{A, B, E}    2
Supmin = 2

L3:
Itemset
{A, B, C}
{A, B, E}

L3 x L3 gives C4 = {{A,B,C,E}}
The 3-item subsets of {A,B,C,E} are {A,B,C}, {B,C,E}, {A,C,E} and {A,B,E};
{B,C,E} and {A,C,E} are not in L3, so remove {A,B,C,E}.
Thus C4 is empty and the algorithm terminates, having found all the frequent itemsets.
The Apriori Algorithm
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
Algorithm Apriori
L1 = {frequent 1-itemsets};
for (k = 1; Lk ≠ ∅; k++) do begin
  Ck+1 = Apriori_generate(Lk)   // candidates generated from Lk
  for each transaction t in database do
    increment the count of all candidates in Ck+1 that are contained in t
  Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
Algorithm Apriori_generate(Lk)
for each itemset l1 in Lk
  for each itemset l2 in Lk
    if the first k-1 elements of l1 and l2 are equal and l1[k] < l2[k]
    // i.e. l1[1]=l2[1] and l1[2]=l2[2] and … l1[k-1]=l2[k-1] and l1[k]<l2[k]
      c = l1 join l2            // the (k+1)-itemset l1 ∪ l2
      add c to Ck+1
      for each k-subset s of c
        if s does not belong to Lk then
          delete c; break
The Apriori algorithm assumes that the dataset is memory-resident. The maximum number of database scans is one more than the cardinality of the largest frequent itemset.
This large number of data scans is a weakness of Apriori.
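Putting the steps together, a minimal in-memory Apriori sketch in Python (the transactions and min_sup value repeat the running example; the candidate-generation shortcut via set unions is an implementation convenience, not the textbook join):

from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise mining of all frequent itemsets with absolute support >= min_sup."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    counts = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    Lk = {s for s, c in counts.items() if c >= min_sup}
    frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:
        # generate (k+1)-candidates: union of two k-itemsets differing in one item,
        # then prune candidates that have an infrequent k-subset
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # one pass over the database counts all surviving candidates
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        Lk = {c for c, n in counts.items() if n >= min_sup}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

db = [{"A","B","E"}, {"B","E"}, {"B","C"}, {"A","B","D"}, {"A","C"},
      {"B","C"}, {"A","C"}, {"A","B","C","E"}, {"A","B","C"}]
for itemset, count in sorted(apriori(db, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)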
Sampling Algorithm
It is an improvement that reduces the number of database scans to one in the best case and two in the worst case.
A database sample is drawn such that it can be memory-resident. An algorithm such as Apriori is used to find the large itemsets for the sample.
These are viewed as potentially large (PL) itemsets.
Additional candidates are determined by applying the negative border function BD~ to the large itemsets from the sample.
The negative border is defined as the minimal set of itemsets that are not in PL but whose subsets are all in PL.
Full database D:
Tid   Items
10    A, B, E
20    B, E
30    B, C
40    A, B, D
50    A, C
60    B, C
70    A, C
80    A, B, C, E
90    A, B, C

Sample Ds:
Tid   Items
10    A, B, E
50    A, C
70    A, C
90    A, B, C

C1 (on the sample):
Itemset   sup
{A}       4
{B}       2
{C}       3
{E}       1

L1 = {{A}, {B}, {C}}

C2:
Itemset   sup
{A, B}    2
{A, C}    3
{B, C}    1

L2 = {{A, B}, {A, C}}

C3:
Itemset      sup
{A, B, C}    1

PL = {{A}, {B}, {C}, {A,B}, {A,C}}
BD~(PL) = {{B,C}, {D}, {E}}; {B,C} is added because both its subsets {B} and {C} are in PL, and {D}, {E} are added because all their (empty) subsets are vacuously in PL.
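The negative border can be computed mechanically; this brute-force Python sketch (function name and the explicit item universe are assumptions of the example) reproduces BD~(PL) for the sample above:

from itertools import combinations

# Sketch of the negative border function BD~ used by the sampling algorithm.
def negative_border(PL, items):
    """Itemsets over `items` not in PL whose proper nonempty subsets are all in PL
    (singletons qualify vacuously)."""
    PL = {frozenset(x) for x in PL}
    border = set()
    for k in range(1, len(items) + 1):
        for cand in map(frozenset, combinations(items, k)):
            if cand in PL:
                continue
            if all(frozenset(s) in PL
                   for r in range(1, k)
                   for s in combinations(cand, r)):
                border.add(cand)
    return border

PL = [{"A"}, {"B"}, {"C"}, {"A", "B"}, {"A", "C"}]
print(negative_border(PL, ["A", "B", "C", "D", "E"]))
# contains exactly {B, C}, {D} and {E}, matching BD~(PL) above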
The total set of candidates considered is C = PL ∪ BD~(PL)
  = {{A}, {B}, {C}, {A,B}, {A,C}, {B,C}, {D}, {E}}
During the first scan of the database, support counts are computed for all of them.
If all candidates that turn out to be large are in PL, then all large itemsets have been found.
A second scan is required if any large candidates lie in the negative border area.
The negative border is the buffer area between the itemsets known to be large and the others.
It is the smallest possible set of itemsets that could potentially be large.
During the second scan, additional candidates are generated and counted to ensure that all large itemsets are found.
ML, the missing large itemsets, are those that are in L but not in PL.
To find all the remaining large itemsets in the second scan, the sampling algorithm repeatedly applies the negative border function until the set of possible candidates does not grow further.
In the example, ML = {{B,C}, {E}} while PL = {{A}, {B}, {C}, {A,B}, {A,C}}.
Starting from C = L, repeated expansion with BD~ adds
  {A,E}, {B,E}, {C,E}, {A,B,C}
then {A,B,E}, {A,C,E}, {B,C,E}
and finally {A,B,C,E}.
Although this creates a potentially large set of candidates, many of which are not large, it does guarantee that only one more database scan is required.
Apriori is run on the sample using a support threshold called small(s), a minimum support value less than s. It is reduced because the sample is smaller, which lowers the chance of missing itemsets that are large in the full database.
Sampling algorithm
Input: I = itemsets, D = database, s = minimum support
Ds = sample drawn from D
PL = Apriori(I, Ds, small(s))
C = PL ∪ BD~(PL)
Scan the database, compute the support count of each itemset in C and test whether it is large
L = itemsets that test large
ML = { X | X ∈ BD~(PL) ∧ X ∈ L }
if ML ≠ ∅ then
  C = L
  repeat
    C = C ∪ BD~(C)
  until no new itemsets are added to C
  scan the database a second time, compute the support count of each itemset in C and test whether it is large
Partitioning
Dataset D is divided into p partitions D1, D2, …, Dp.
All frequent itemsets within a partition, called local frequent itemsets, are computed (with min_sup scaled appropriately). These form the global candidate itemsets, which are used to obtain the frequent itemsets for the entire database.
Partitioning may improve performance in several ways:
•By the large itemset property, a large itemset must be large in at least one of the partitions. Each partition can be created such that it fits in main memory, and the number of itemsets to be counted per partition is smaller.
•Parallel or distributed algorithms can be used.
•Incremental generation of association rules is possible, by treating newly added data as a new partition.
•The number of database scans is reduced to two.
In the first scan, the partitions are brought into memory and the large itemsets of each partition are found.
During the second scan, only those itemsets that are large in at least one partition are used as candidates and counted to determine whether they are large across the entire database.
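A rough two-scan sketch of the partitioned approach in Python (the brute-force local miner, partition sizes and the scaled threshold are illustrative choices, not a tuned implementation):

from itertools import combinations

def local_frequent(partition, min_sup_fraction, max_size=3):
    """First scan: brute-force local frequent itemsets within one partition."""
    items = sorted({i for t in partition for i in t})
    frequent = set()
    for k in range(1, max_size + 1):
        for cand in map(frozenset, combinations(items, k)):
            if sum(cand <= t for t in partition) >= min_sup_fraction * len(partition):
                frequent.add(cand)
    return frequent

def partition_mine(db, p, min_sup_fraction):
    size = -(-len(db) // p)                      # ceiling division
    partitions = [db[i:i + size] for i in range(0, len(db), size)]
    # global candidates = union of all local frequent itemsets (first scan)
    candidates = set()
    for part in partitions:
        candidates |= local_frequent(part, min_sup_fraction)
    # second scan: count each candidate over the whole database
    return {c for c in candidates
            if sum(c <= t for t in db) >= min_sup_fraction * len(db)}

db = [{"A","B","E"}, {"B","E"}, {"B","C"}, {"A","B","D"}, {"A","C"},
      {"B","C"}, {"A","C"}, {"A","B","C","E"}, {"A","B","C"}]
print(sorted(sorted(c) for c in partition_mine(db, p=3, min_sup_fraction=2/9)))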
Parallel and distributed algorithms
Data parallelism – the data is partitioned, but this requires that the memory at each processor is large enough to store all candidates at each scan.
Task parallelism – the candidate sets are partitioned and counted separately at each processor. The candidate set allotted to a processor should fit in its memory.
CDA (Count Distribution Algorithm)
It uses data parallelism.
The database is divided into p partitions, one for each processor.
Each processor counts the candidates on its own data and then broadcasts its counts to all other processors.
Each processor then determines the global counts.
These are used to generate the large itemsets and the candidate sets for the next scan.
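A toy sketch of count distribution using Python's multiprocessing in place of real message passing (the candidate list, partitioning and pool size are invented for the example; an actual CDA implementation would broadcast counts between processors rather than combine them in a parent process):

from multiprocessing import Pool
from functools import partial

# Toy sketch of CDA: each worker counts all candidates on its own data partition,
# and the local counts are then combined into global counts.
CANDIDATES = [frozenset(c) for c in ({"A"}, {"B"}, {"A", "B"}, {"B", "E"})]

def count_partition(partition, candidates):
    return {c: sum(c <= t for t in partition) for c in candidates}

if __name__ == "__main__":
    db = [{"A","B","E"}, {"B","E"}, {"B","C"}, {"A","B","D"}, {"A","C"},
          {"B","C"}, {"A","C"}, {"A","B","C","E"}, {"A","B","C"}]
    partitions = [db[0:3], db[3:6], db[6:9]]     # one partition per "processor"
    with Pool(processes=3) as pool:
        local = pool.map(partial(count_partition, candidates=CANDIDATES), partitions)
    # combining the local counts stands in for the broadcast of counts in CDA
    global_counts = {c: sum(lc[c] for lc in local) for c in CANDIDATES}
    print(global_counts)   # {A}: 6, {B}: 7, {A,B}: 4, {B,E}: 3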
DDA (Data Distribution Algorithm)
It uses task parallelism.
The candidates as well as the data are partitioned among the processors.
Each processor counts the candidates assigned to it using its local database partition.
Then each processor broadcasts its database partition to all other processors.
This data is used by each processor to compute the global counts for its candidates, and it broadcasts these counts to all.
Each processor determines the globally large itemsets and the candidate sets.
These candidate sets are divided among the processors for the next scan.
This algorithm suffers from high message traffic.
Comparing the Algorithms
Algorithms can be classified along the following dimensions:
Target – the algorithms generate rules that satisfy a given support and confidence
Type – can generate regular or advanced association rules
Data type – data can be categorical or textual
Data source – market basket data (items present in a transaction)
Technique – based on large (frequent) itemsets
Itemset strategy – usually a bottom-up approach is used, pruning with the Apriori property; a top-down technique can also be used
Transaction strategy – the whole database is scanned, or a sample or partition is used
Itemset data structure – a hash tree can be used; it is an effective structure to store, access and count itemsets
Transaction data structure – usually we have a table of transactions with the itemset present in each transaction (horizontal data format). Alternatively, the data can be represented in a table with an item name and the set of transactions containing that item (vertical data format)
Optimization – techniques used to improve the performance of the algorithm for a given data distribution
Architecture – sequential, parallel and distributed algorithms are used
Parallelism strategy – data parallelism and/or task parallelism is used
Incremental Rules
In the case of dynamic databases, the database state keeps changing. Generating association rules for a new database state requires a complete rerun of the algorithm.
Incremental approaches address the issue of modifying association rules as new transactions are inserted into the database.
If D is the database state with large itemsets L and db is the set of updates, the incremental approach finds the itemsets for D ∪ db using L.
FUP (Fast Update) is based on the Apriori algorithm. In each iteration, when db is scanned, the candidate sets generated are pruned using L.
This is because an itemset needs to be large in at least one partition, D or db, in order to be large in D ∪ db.
Association rules
There are various kinds of association rules:
•Multilevel association
•Multidimensional association
•Quantitative association
•Correlation rules
Multilevel association rules
When a concept hierarchy exists between the items, association rules can be generated at various levels of the concept hierarchy.
Items at lower levels are expected to have lower support.
Strong associations discovered at higher levels may represent common-sense knowledge.
Strong association rules at lower levels are difficult to find due to the scarcity of data at those levels.
(Concept hierarchy example: Food at the top level; below it vegetables, grain, fruit, meat, dairy, …; grain splits into wheat and rice; dairy splits into milk, yoghurt and cheese; milk splits into whole and skim.)
Multilevel association rules can be mined using concept hierarchies within the support-confidence framework.
A top-down strategy can be applied. There are several variations:
Using uniform minimum support at all levels
- the search procedure is simplified, as it avoids examining itemsets whose ancestors do not have minimum support
- users are required to specify only one min_sup value
Using reduced minimum support at lower levels
- each level has its own min_sup
- the deeper the level, the smaller the threshold value
Using item- or group-based minimum support
- the minimum support threshold can be set by grouping items based on price or other attributes
- a low support threshold can be set for items of interest
Redundant rules are generated due to ancestor relationships.
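As a small illustration of mining at a higher level (the hierarchy dictionary and item names below are made up, loosely following the Food example), transactions can be re-encoded at an ancestor level before running an ordinary frequent-itemset algorithm with that level's min_sup:

# Sketch: encoding transactions at a higher level of the concept hierarchy before
# mining. The hierarchy dictionary is invented, loosely following the Food example.
parent = {
    "skim milk": "milk", "whole milk": "milk",
    "milk": "dairy", "yoghurt": "dairy", "cheese": "dairy",
    "wheat": "grain", "rice": "grain",
    "dairy": "food", "grain": "food",
}

def top_category(item):
    """Climb the hierarchy until reaching a direct child of 'food'."""
    while parent.get(item, "food") != "food":
        item = parent[item]
    return item

transactions = [
    {"skim milk", "wheat"},
    {"whole milk", "rice"},
    {"cheese", "wheat"},
    {"rice"},
]

level1 = [{top_category(i) for i in t} for t in transactions]
print(level1)
# [{'dairy', 'grain'}, {'dairy', 'grain'}, {'dairy', 'grain'}, {'grain'}]
# At this level {dairy, grain} has support 3/4; lower levels would use a smaller min_sup.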
Multidimensional association rules – association rules that involve two or more predicates.
Single-dimensional rule:
buys(X, “milk”) => buys(X, “bread”)
Multidimensional rules:
Inter-dimension association rules (no repeated predicates):
age(X, “19-25”) ∧ occupation(X, “student”) => buys(X, “coke”)
Hybrid-dimension association rules (repeated predicates):
age(X, “19-25”) ∧ buys(X, “popcorn”) => buys(X, “coke”)
Quantitative association rules
Categorical attributes: finite number of possible values, no ordering among values.
Quantitative attributes: numeric, with an implicit ordering among values.
Discretization of quantitative attributes can be predefined (static), converting them to categorical attributes.
Discretization can also be dynamic, retaining the quantitative nature, so as to maximize confidence.
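A small sketch of predefined (static) discretization, turning a numeric attribute into interval labels that can then appear in predicates such as age(X, “19-25”); the bin boundaries and record fields are illustrative:

# Sketch: static discretization of a quantitative attribute into interval labels.
import bisect

AGE_BOUNDARIES = [19, 26, 36, 51]               # illustrative cut points
AGE_LABELS = ["<19", "19-25", "26-35", "36-50", "51+"]

def age_bucket(age):
    """Map a numeric age to a categorical interval label."""
    return AGE_LABELS[bisect.bisect_right(AGE_BOUNDARIES, age)]

customers = [{"age": 22, "occupation": "student", "buys": {"coke", "popcorn"}},
             {"age": 45, "occupation": "teacher", "buys": {"bread"}}]

# Each record becomes a set of categorical predicates usable in multidimensional rules.
records = [{f"age={age_bucket(c['age'])}", f"occupation={c['occupation']}"}
           | {f"buys={item}" for item in c["buys"]}
           for c in customers]
print(records)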
Correlation rules
Strong association rules may not necessarily be interesting.
Correlation analysis can additionally be used to augment the support and confidence measures.
There are several correlation measures: lift, chi-square, etc.
Measuring the quality of rules
Several measures can be used:
Support – s(A => B) = P(A ∪ B)
Confidence – confidence(A => B) = P(B|A)
Lift or interest – measures the relationship between the items; it is symmetric:
interest(A => B) = P(A ∪ B) / (P(A) P(B))
Conviction – a measure based on independence and negation; it inverts the confidence ratio:
conviction(A => B) = P(A) P(¬B) / P(A ∪ ¬B)
Chi-square (χ²) – tests statistical independence between items.
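These measures can be computed directly from transaction counts; the sketch below (function names chosen for this example) uses the five-transaction table from earlier:

# Sketch: computing lift (interest) and conviction from transaction data.
transactions = [{"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
                {"B", "E", "F"}, {"B", "C", "D", "E", "F"}]
N = len(transactions)

def p(itemset):
    """P(itemset): fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / N

def lift(A, B):
    return p(A | B) / (p(A) * p(B))

def conviction(A, B):
    p_A_not_B = sum(A <= t and not (B <= t) for t in transactions) / N
    not_B = 1 - p(B)
    return float("inf") if p_A_not_B == 0 else p(A) * not_B / p_A_not_B

print(lift({"A"}, {"D"}))        # 1.25: A and D occur together more than expected
print(conviction({"A"}, {"D"}))  # inf: every transaction with A also contains D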