Survey

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Transcript

The Goal • Find “patterns”: local regularities that occur more often than you would expect. Examples: If a person buys wine at a supermarket, they also buy cheese. (confidence: 20%) If a person likes Lord of the Rings and Star Wars, they like Star Trek (confidence: 90%) Look like they could be used for classification, but There is not a single class label in mind. They can predict any attribute or a set of attributes. They are unsupervised Not intended to be used together as a set Often mined from very large data sets • • Association Rules Charles Sutton Data Mining and Exploration Spring 2012 • • • • Based on slides by Chris Williams and Amos Storkey Thursday, 8 March 12 Thursday, 8 March 12 Other Examples Example Data Market basket analysis, e.g., supermarket Item Chicken Onion 1 Transactions trip to market 1 1 Rocket Caviar Haggis 1 1 1 1 1 1 1 1 1 1 Collaborative-filtering type data: e.g., Films a person has watched • • Rows: patients, columns: medical tests (Cabena et al, 1998) Survey data (Impact Resources, Inc., Columbus OH, 1987) 494 14. Unsupervised Learning TABLE 14.1. Inputs for the demographic data. Feature 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 .... These are databases that companies have already. Thursday, 8 March 12 • Thursday, 8 March 12 Demographic # Values Sex Marital status Age Education Occupation Income Years in Bay Area Dual incomes Number in household Number of children Householder status Type of home Ethnic classification Language in home ! number in household number of children ⇓ 2 5 7 6 9 9 5 3 9 9 3 5 8 3 Type Categorical Categorical Ordinal Ordinal Categorical Ordinal Ordinal Categorical Ordinal Ordinal Categorical Categorical Categorical Categorical = 1 = 0 language in home = English " Itemsets, Coverage, etc Toy Example Itemsets, Frequency, Accuracy Itemsets, Frequency, Itemsets, Frequency, Accuracy PlayAccuracy Tennis Example A1 , A2 , . . . Am andefined attribute itemsetcolumn is a pattern by • CallAneach Play Tennis Example Itemsets, Frequency, Accuracy Play Tenn Play Play TennisTen Ex Day • An itemset is a pattern defined D1 Day Outlook Temperature Humidity Wind PlayTennis (A aby ) of(Aattribute . . value . (Aik = a AnaItemsets, item set is pairs i1a=set j1by i2 = aj2 ) jk ) An itemset is pattern defined Frequency, Accuracy PlayOutloo Tenn Day itemset is aHot pattern defined D2 D1 An Sunny High by False No Day Day (A Outlook Temperature Wind PlayTennis Itemsets, Frequency, Accuracy P D1 Sunny (Afrequency = aj2 ) of. .an . (A ) simplyHumidity i1 = aj1 ) (or isupport) ik = ajkX D3 2 itemset P(X ) D2 Sunny Hot High True No D1 (AThe aj2 ) . . . (A ajk is ) i1 = aj1 ) D1(Ai2 =Sunny ik = D2 No D4 Sunny Hot High False (A = a ) (A = a ) . . . (A = a ) i j i j i j An itemset is a pattern defined by 1 1 2 2 False k Yes k D3 Overcast Hot High D2 Example: the “Play data D3 NoOverca Day D5 The frequency (or support) anTennis” itemset Xdata is simply Example: IninD2 the of Play Tennis Sunny Hot P(X ) High True D4 Rain Mild High False Yes An of itemset is defined D1 D4 Rain (Ai1itemset =a aj1pattern ) (A = asimply . by . . (A D6 i2 is j2 ) ik = a jk ) The frequency (or support) an X P(X ) Overcast Hot = False) =High False Yes D3 P(Humidity = Normal 4/14 frequencyCool (or support) of an itemset simply P(X ) D5 The Rain Normal False X is Yes Example: in the “PlayD3 Tennis” dataPlay = Yes Windy D2 D5 Rain D7 D4 Itemsets, Frequency, Accuracy PlayExample: Tennis Example D4 Rain High Yes D8 (Ai1 = aof (A . . . (A = )ajk ) False D6 Rain Cool Normal True No D3 j1 )an Mild i2 = aX j2 )is simply iP(X The frequency (or support) itemset k in the “Play Tennis” data D6 Rain D5 Example: in the “Play Tennis” data True P(Humidity = NormalD5Play = Rain Yes Windy = False) = 4/14 Normal D4 Cool Falseset D7 YesOverca D7 Overcast Cool Normal Yes D9 The support of an item set is its frequency in the data Example: in the “Play Tennis” data The accuracy (orfrequency confidence) an association rule if Y=y D6 D5 The (orofsupport) of an itemset X is simply P(X ) D8 An itemset Sunnyis a patternMild High False No D6 Rain Windy Normal True Play = Yes = Cool False) = 4/14 D8 NoD10 Sunny defined by Play = Yes Windy = False) = 4/14 P(Humidity = Normal then Z=z is D6 P(Humidity = Normal D7 P(Humidity = Normal Play = Yes Windy = False) = 4/14 Day OutlookExample: Temperature Humidity Wind PlayTennis D9 Sunny Cool Normal False Yes D9 YesD11 Sunny D7 Overcast True Example: the “Play D7 P(Z = z|YTennis” =Cool yrule ) data The accuracy (or confidence) of in an association if Y=y Normal D8 D1 Sunny Hot High False No ) (Ai2 = aj2 ) Normal . . . (Aik = aFalse D10 Rain (Ai1 = aj1Mild Yes D10 NoD12 Rain jk ) D8 D8 Sunny Mild High False then Z=z is D2 Sunny Example Hot P(Humidity High = Normal True No= Yes Windy = False) = Play 4/14 support ( ) = 4 D11 D13 Sunny D9 D11 Sunny Mild Normal True Yes D9 P(Z = z|Y = y) TheD9 accuracy (or an association rule if Y=y False Normal YesD14 ofSunny anconfidence) association rule if Y=y D3accuracy Overcast(or confidence) Hot High False of Cool Yes frequency (or support) of an itemset Xofisan simply P(X ) D10 The accuracy (or association rule if Y=y The D12 Overca D12TheOvercast Mildconfidence) High True Yes D10 then Z=z is P(Windy = False Play =association Yes|Humidity =rule Normal) = 4/7then Z=z D4 Z=z Rain Mild High False Yes D10 Rain Mild Normal Falseis D13YesOverca then is The confidence of an if Y=y D11 thenin Z=z is Tennis” Example D13Example: Overcast Hot data Normal False Yes P(Z = z|Y = y ) the “Play D5 Rain CoolD11 Normal False Yes of an association D12 The accuracy (or confidence) rule if Y=y D14YesD11 Sunny Mild Normal True P(Z = z|Y = y ) Rain D14 Rain Mild High No P(Z = z|Y True = y) D6 Rain Example: Cool Play No = 4/7 D12 D13 P(Windy = False =Normal Yes|Humidity then Z=z is True= Normal) • • • • • P(Humidity = Normal Play = Yes Windy = False) = 4/14 Example 5/1 The accuracy (or confidence) of an association rule if Y=y = False Play = Yes|Humidity = Normal) then P(Windy Z=z is P(Z = z|Y = y ) Thursday, 8 March 12 Finding Frequent Itemsets Item sets to rules Generating rules from itemsets Example P(Windy = False Play = Yes|Humidity = Normal) = 4/7 • Example Overcast 5/1 Mild True Yes CoolD12 Normal True Yes= z|Y = y ) High D14 P(Z False Play = Yes|Humidity = Normal) = 4/7 D13 =Overcast Hot Normal False YesD13 MildP(Windy High False No D9 Generating Sunny Normal False Yes Finding rules itemsets D14 Rain Mild True No D14Fr Example = False Cool Play = from Yes|Humidity = Normal) = 4/7 High5 / 1 Thursday, 8 March 12 D10 Rain Mild Normal False Yes 4/76 / 1 P(Windy 5/1 P(Windy Yes|Humidity = Normal) = 4/7 D11 Sunny Mild Normal= False True Play =Yes Finding Freque Generating rules from itemsets D12 Overcast Mild High True Finding F Generating rules fromFalse itemsetsYes 5/1 D13 5Overcast Hot Normal Yes /1 5/1 D14 Rain An itemset Mild of size kHigh can giveTrue rise to 2k No1 rules D7 Overcast Example D8 Sunny = Finding Frequent Itemsets Itemset Generating rulesExample. from itemsets Generating rules from itemsets Finding Frequent Itemsets k An itemset of sizeAn k can giveofrise 2 give 1 rules itemset sizeto k can rise to 2k 1 rules 6/1 Windy=False, Play=Yes, Humidity=Normal Example. Itemset Example. Itemset 5/1 Task Finding FF Key ifTas alla Task: find prop Key observ Key Finding Frequentgives Itemsets Generating rules from itemsets rise to 7 rules including So if alfi Key observation: a set X of variables can be frequent only if all subse Windy=False, Play=Yes, Humidity=Normal k Windy=False,An Play=Yes, Humidity=Normal itemset of size k can give rise to 2 1 rules Then: We convert them to rules on pro. k IF Windy=False and Humidity=Normal THEN Play=Yes (4/4) if all subsets of variables are frequent (monotonicity property), Task: Find all rise item sets with support An itemsetIFofPlay=Yes size kTHENcan give toWindy=False 2 1 rules Humidity=Normal and (4/9) An itemset of size k can give rise to 2k k 1 rules gives rise to 7 rules including Example. Itemset IF True THEN Windy=False and Play=Yes and Humidity=Normal (4/14) So gives rise to 7 rules including e Ta property), i.e. of P(A, B) k can P(A)give andrise P(A,to B)2 -1 P(B) An itemset size rules So findAn freq Task: find allTHEN itemsets with(4/4) frequency s Example. Itemset on . IF Windy=False and Humidity=Normal Play=Yes k item Insight: A large set can be no more frequent than Example. Itemset An itemset of size k can give rise to 2 1 rules on ... IF Play=Yes THEN Humidity=Normal and Windy=False (4/9) IF Windy=False and Humidity=Normal THEN Play=Yes (4/4) So find frequent singleton sets, then sets of size 2, and so Windy=False, Humidity=Normal Ke True THEN Windy=False and Play=Yes Humidity=Normal (4/14) itemsets with frequency s and(4/9) Key observation: aPlay=Yes, set X of variables can be frequent only IF Task: Play=Yesfind THENall Humidity=Normal and Windy=False An Example: itemset (199 itsIF subsets, e.g., Select association rules that have (4/14) accuracy greater than some Example. Itemset IF True THEN Windy=False and Play=Yes and Humidity=Normal on ... An efficien if a item Humidity=Normal KeyWindy=False, observation: a setaXPlay=Yes, variables be frequent onlyare frequent (monotonicity ifof all subsets ofincluding variables threshold gives rise to 7can rules Windy=False, Play=Yes, Humidity=Normal itemsets(19 is An efficient algorithmPlay=Yes, using thisHumidity=Normal idea for finding frequent support(Wind = False) support(Wind = False, Outlook = Sunny) pro if all subsets of variables frequent (monotonicity Selectare association rules that have accuracy greater than some Windy=False, property), i.e. P(A, B) P(A) and P(A, B) P(B) IF Windy=False and Humidity=Normal THEN Play=Yes (4/4) (1994), Ma threshold aand property), i.e. P(A, B) P(A) P(A, B) P(B) Select rules that have accuracy greater than some itemsets is the APRIORI algorithm (Agrawal and Srikant gives rise toassociation 7 rules including • • • • Task: itemsets with frequency s First:find Weallwill find frequent item sets • • Results 7 to rules including: gives in rise 7 rules including • (1994), Mannila et al (1994)) gives rise to 7 rules including IF IF Windy=False and Humidity=Normal THEN Play=YesTHEN Play=Yes (4/4) Windy=False and Humidity=Normal IF IF Play=Yes THEN Humidity=Normal and Windy=False (4/9) Play=Yes THEN Humidity=Normal and Windy=False IF True THEN Windy=False and Play=Yes and Humidity=Normal (4/14) IF True THEN Windy=False and Play=Yes and Humidity=Normal • 8/1 IF IF IF (4/4) (4/9) (4/14) • IF Play=Yes THEN Humidity=Normal and Windy=False • Selectkeep association that whose have accuracy greater thanissome We rulesrules only confidence greater andrules Srikant, 1994; Mannilaetgreater et 1994) (1994), Mannila al al, (1994)) Select association that have accuracy than some threshold a association rules that have accuracy greater than some Select 9/1 than a threshold threshold a threshold a Thursday, 8 March 12 Thursday, 8 March 12 8/1 So (4/9) IF True THEN find Windy=False and Play=Yes and Humidity=Normal (4/14) sets of size 2, and so singleton then aSosingleton Sothreshold find frequent sets, thenfrequent sets of size 2,in and so sets, searchSo through itemsets order of number of on Windy=False and Humidity=Normal THEN Play=Yes (4/4) 8/1 on ... on ... Play=Yes THEN Humidity=Normal and Windy=False (4/9) items True THEN and Play=Yes and Humidity=Normal (4/14) Select association rules that have accuracy greater than some AnWindy=False efficient algorithm using idea for finding frequent 8/1 Anthis efficient algorithm using this idea for finding frequentAn threshold a (Agrawal and Srikant itemsets is theAn APRIORI algorithm ite efficient algorithm this is APRIORI itemsets is thefor APRIORI algorithm(Agarwal and Srikant 8 / (Agrawal 1 (1994), Mannila et al (1994)) 9/1 (19 8/1 lgorithm APRIORI Algorithm APRIORI algorithm ariables) Single database pass is linear in |Ci |n, make a pass for each i Single database pass is linear in |Ci |n, make a pass for each i until Ci is empty until Ci is empty Candidate formation Candidate formation (for binary variables) Find all pairs of all sets {U, V } {U, from thatthat U U VVhas Find pairs of sets V }Lfrom Li such has i such size i + 1 and test if this union is really a potential size i + 1 and test if this union is really a potential candidate. O(|Li |3 ) 3 candidate. O(|L i| ) i =1 A is a variable} Ci = {{A}|A is a variable} while Ci is not empty not empty database pass: se pass: for each set in Ci test if it is frequent or each set in Ci test if it isletfrequent Li be collection of frequent sets from Ci candidate et Li be collection of frequent formation: sets from Ci let Ci+1 be those sets of size i + 1 ate formation: all of whose subsets are frequent et Ci+1 be those sets of size i + 1 end while Example: 5 three-item sets Example: 5 three-item (ABC), (ABD),sets (ACD), (ACE), (BCD) Candidate (ABC), (ABD), (ACD),four-item (ACE),sets (BCD) (ABCD) ok Candidate four-item sets (ACDE) not ok because (CDE) is not present above (ABCD) ok Data structure techniques can be used forabove speedups (ACDE) not ok because (CDE) is not present Other algorithms possible for finding frequent itemsets, e.g. Data structure techniques Han’s FP-growth can be used for speedups ll of whose subsets are frequent Other algorithms possible for finding frequent itemsets, e.g. Han’s FP-growth Thursday, 8 March 12 Thursday, 8 March 12 10 / 1 APRIORI and Algorithm Components Comments Comments on Association Rules 10 / 1 association rules will be trivial, some nd Algorithm Components • Some interesting. Need to sort through them • • 11 / 1 Finding Association Rules is just the beginning in a datamining effort. Some will be trivial, others interesting. Challenge is to select potentially interesting rules Comments on Association Rules Finding Association rules as Exploratory Data Analysis Trivial ruleRules example: Finding Association is just the beginning in a datamining effort. Some will be trivial, others interesting. Challenge is to pregnant female select potentially interesting rules with accuracy 1! Task: Rule Pattern Discovery Example: pregnant => female (confidence: 1) Structure: Association Rules AlsoScore can miss “interesting but rare” rules Function: Support • Search: Breadth Firstcaviar with Pruning Example: vodka --> (low support) Finding Association Exploratory Data Analysis For rule Arules B, itas can be useful to compare P(B|A) to P(B) Data Management Technique: Linear Scans • re: Association• Rules For rule A -->B, can be useful to compare P(B|A) APRIORI algorithm can be generalized to frequent structure Trivial rule example: Rule Pattern Discovery Really this is a type of exploratory data analysis mining, e.g. finding episodes from sequences or frequently-occurring trees pregnant • : Breadth First with Pruning APRIORI can be generalised to structures like subsequences subtrees anagement Technique: Linearand Scans female Example application: Health Insurance Commission (HIC) in with accuracy 1! Australia detected patterns of ordering of medical tests that suggested that some of the tests ordered were unnecessary For rule A (Cabe B, itña can be1998) useful to compare P(B|A) to P(B) et al, Function: Supportto P(B) Thursday, 8 March 12 11 / 1 12 / 1 APRIORI algorithm can be generalized to frequent structure mining, e.g. finding episodes from sequences or frequently-occurring trees 13 / 1