COMP527:
Data Mining
COMP527: Data Mining
M. Sulaiman Khan
([email protected])
Dept. of Computer Science
University of Liverpool
2009
Association Rule Mining
March 5, 2009
Slide 1
COMP527: Data Mining
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam
Today's Topics
Introduction to Association Rule Mining (ARM)
General Issues
Support
Confidence
Lift
Conviction
Complexity!
Frequent Itemsets
Introduction
We've spent a long time looking at various classification methods,
but there's more to data mining than classification.
Given a data set with no classes, just attributes, what might we
want to do with it?
Association Rule Mining: Find patterns in the attribute values
between instances.
Instead of predicting an unknown value, we want to find interesting
facts about the relationships between the known values.
Introduction
In ARM, these patterns take the form of rules about the co-occurrence of attributes. The easiest example to use is market
basket analysis -- finding patterns of things that are bought
together in a supermarket.
Shopping at a supermarket, you typically buy many things together
(as opposed to shopping for a television, say). Perhaps 30
different items. Under 10 items is pretty rare.
By comparing your shopping habits over time, the supermarket can
learn about you and how best to make you spend more money,
increasing their profits. They can also compare all shoppers'
habits to find general rules, hopefully for how to increase profits.
Introduction
Basket1: bread, butter, jam
Basket2: bread, butter
Basket3: bread, butter, milk
Basket4: beer, bread
Basket5: beer, milk
What can we find from this?
Some simple statistics: bread occurs in 80% of baskets; butter
appears in 60% of baskets.
Less simple: 100% of baskets containing butter also contain bread.
100% of baskets containing butter and jam also contain bread.
Finding Rules
Basket1: bread, butter, jam
Basket2: bread, butter
Basket3: bread, butter, milk
Basket4: beer, bread
Basket5: beer, milk
if (butter jam) then bread
if butter then bread
if bread then butter
To find rules we find sets of items which occur together. The more
frequently they occur, the better our rule is. There are some
particular factors involved in determining the 'goodness' of a
rule...
Support
Basket1: bread, butter, jam
Basket2: bread, butter
Basket3: bread, butter, milk
Basket4: beer, bread
Basket5: beer, milk
Support: Percentage of baskets in which the item(s) occur.
bread: 80%, butter: 60%, (bread butter): 60% ...
So the support for a rule X => Y is the percentage of instances
which contain both X and Y.
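The support figures above can be checked with a few lines of Python. This is only a minimal sketch: the names `BASKETS` and `support` are illustrative choices, not part of the course material, and baskets are assumed to be stored as Python sets.

```python
# The five example baskets from the slide, stored as sets of item names.
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for basket in baskets if items <= basket) / len(baskets)

print(support({"bread"}, BASKETS))            # 0.8
print(support({"butter"}, BASKETS))           # 0.6
print(support({"bread", "butter"}, BASKETS))  # 0.6
```

The support of the rule butter => bread is then simply `support({"butter", "bread"}, BASKETS)`.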
Confidence
Basket1: bread, butter, jam
Basket2: bread, butter
Basket3: bread, butter, milk
Basket4: beer, bread
Basket5: beer, milk
We also need a confidence for each rule -- how strongly we believe
that rule to be true.
Here, butter => bread is true 100% of the time, but bread => butter
holds for only 3 of the 4 baskets that contain bread, so it is true
75% of the time.
Confidence for X => Y is number of instances that contain X and Y
divided by the number of instances that contain X.
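As a rough illustration of the definition, confidence can be computed directly from basket counts. The helper names `count` and `confidence` are hypothetical, and the sketch assumes the antecedent X occurs in at least one basket.

```python
# The five example baskets from the slide.
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for basket in baskets if items <= basket)

def confidence(x, y, baskets):
    """Confidence of X => Y: baskets with X and Y divided by baskets with X."""
    x_count = count(x, baskets)
    if x_count == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    return count(set(x) | set(y), baskets) / x_count

print(confidence({"butter"}, {"bread"}, BASKETS))  # 1.0
print(confidence({"bread"}, {"butter"}, BASKETS))  # 0.75
```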
Rule Mining
Basket1: bread, butter, jam
Basket2: bread, butter
Basket3: bread, butter, milk
Basket4: beer, bread
Basket5: beer, milk
ARM algorithms have a minimum threshold for both support and
confidence and discard any rules below those thresholds.
For example, jam => (butter bread) has 100% confidence, but only
20% support, because jam, butter and bread occur together only once.
On the other hand butter => bread has 60% support and 100%
confidence, a much more interesting rule to us.
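The thresholding step can be sketched as follows. The threshold values (50% support, 80% confidence) and the function name `keep_rules` are assumptions made for illustration; the slides do not fix particular values.

```python
# The five example baskets from the slide.
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if set(itemset) <= basket)

def keep_rules(candidates, baskets, min_support, min_confidence):
    """Keep only the rules X => Y that meet both thresholds."""
    kept = []
    for x, y in candidates:
        both = count(x | y, baskets)
        rule_support = both / len(baskets)
        # Assumes every antecedent occurs at least once in the data.
        rule_confidence = both / count(x, baskets)
        if rule_support >= min_support and rule_confidence >= min_confidence:
            kept.append((x, y, rule_support, rule_confidence))
    return kept

CANDIDATES = [
    ({"jam"}, {"butter", "bread"}),  # 100% confidence but only 20% support
    ({"butter"}, {"bread"}),         # 60% support, 100% confidence
    ({"bread"}, {"butter"}),         # 60% support, 75% confidence
]

# Hypothetical thresholds chosen for this example only.
KEPT = keep_rules(CANDIDATES, BASKETS, min_support=0.5, min_confidence=0.8)
for x, y, s, c in KEPT:
    print(sorted(x), "=>", sorted(y), f"support={s:.2f} confidence={c:.2f}")
```

With these thresholds only butter => bread survives: the jam rule fails on support and bread => butter fails on confidence.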
Lift
Confidence and Support are necessary but not sufficient to find
interesting rules.
Suppose that X => Y has a confidence of 60%: support(X+Y) / support(X) = 0.6
Sure, that looks interesting... there's a correlation between buying X
and buying Y.
But what if the probability of Y was 70% overall? Then if you buy X,
you're less likely than normal to buy Y... certainly not what the
rule is implying!
Lift
Lift is measured in terms of support: s(X+Y) / (s(X) * s(Y))
This would then take into account the likelihood of Y.
This penalises 'obvious' rules where both X and Y are common.
For example, bread => milk ... if 90% of baskets contain bread
and 85% of baskets contain milk, then the lowest support that
bread => milk could have is 75%.
(10% of baskets don't contain bread but do contain milk, 15% don't
contain milk but do contain bread, therefore at least 75% must
contain both. The maximum is 85%, where all baskets with milk
have bread, 5% have just bread and 10% have neither)
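The two bounds in this argument follow a general pattern, sketched below with integer percentages to keep the arithmetic exact. The function name `joint_support_bounds` is an illustrative choice.

```python
def joint_support_bounds(pct_x, pct_y):
    """Lowest and highest possible percentage of baskets containing both X and Y.

    pct_x and pct_y are the percentages of baskets containing X and Y.
    """
    # At most (100 - pct_x) baskets lack X and (100 - pct_y) lack Y,
    # so at least pct_x + pct_y - 100 must contain both (never below 0).
    lowest = max(0, pct_x + pct_y - 100)
    # The overlap can never exceed the rarer of the two items.
    highest = min(pct_x, pct_y)
    return lowest, highest

print(joint_support_bounds(90, 85))  # (75, 85)
```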
Lift
Lift: s(X+Y) / (s(X) * s(Y))
if the support for X is 0.25, Y is 0.7, and X+Y is 0.15 then we have:
0.15 / (0.25 * 0.7) = 0.857
Because this is less than 1, there is a negative correlation.
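For the bread => milk example (s(bread) = 0.90, s(milk) = 0.85):
0.75 / (0.85 * 0.90) = 0.980 --> lift below 1 (negative correlation)
0.85 / (0.85 * 0.90) = 1.111 --> lift above 1 (positive correlation)
The break-even point (lift = 1) is a joint support of 0.85 * 0.90 = 0.765.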
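The lift arithmetic on this slide can be reproduced with a one-line function. This is a small sketch; the name `lift` and its argument names are illustrative, and all three supports are assumed to be fractions between 0 and 1 with non-zero s(X) and s(Y).

```python
def lift(s_xy, s_x, s_y):
    """Lift of X => Y from the supports of X+Y, X and Y."""
    return s_xy / (s_x * s_y)

# The first worked example: s(X) = 0.25, s(Y) = 0.7, s(X+Y) = 0.15.
print(f"{lift(0.15, 0.25, 0.7):.3f}")   # 0.857

# bread => milk at its lowest and highest possible joint support.
print(f"{lift(0.75, 0.90, 0.85):.3f}")  # 0.980
print(f"{lift(0.85, 0.90, 0.85):.3f}")  # 1.111
```

A lift of exactly 1 means X and Y occur together just as often as they would if they were independent.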
Conviction
We can also judge a rule purely in terms of the baskets that contain
A but not B.
“if A then B” implies “not (A and not B)”
So the formula for conviction is:
s(A) s(not B) / s(A and not B)
If B occurs in every basket that contains A, the denominator will be 0.
Splat. (Treat the conviction as infinite.)
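A minimal sketch of the conviction formula, applied to the five example baskets used earlier in the lecture. The function name `conviction` is illustrative, and returning `math.inf` for a zero denominator is one way of implementing "treat as infinite".

```python
import math

# The five example baskets from the earlier slides.
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]

def conviction(a, b, baskets):
    """Conviction of A => B: s(A) * s(not B) / s(A and not B)."""
    n = len(baskets)
    s_a = sum(1 for basket in baskets if a in basket) / n
    s_not_b = sum(1 for basket in baskets if b not in basket) / n
    s_a_not_b = sum(1 for basket in baskets if a in basket and b not in basket) / n
    if s_a_not_b == 0:
        # No basket contains A without B: the rule is never violated.
        return math.inf
    return s_a * s_not_b / s_a_not_b

print(conviction("butter", "bread", BASKETS))  # inf
print(round(conviction("bread", "butter", BASKETS), 3))  # 1.6
```

Here single items are used for A and B to keep the sketch short; the same idea extends to itemsets.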
Other Evaluation Metrics
Back to Rule Mining
The most common approach to finding rules is:
1. Find sets of 2 or more attributes that occur together in more
instances than a minimum support threshold.
2. Generate rules from those sets.
The most important thing to note is that any subset of a frequent
item set is also frequent.
If (bread, milk, butter, beer) is frequent, then (bread, butter, beer) is
also frequent because it must occur at least as often as the full
set.
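The subset property can be checked empirically on the five example baskets from earlier in the lecture. This is only a sanity-check sketch with illustrative names; the property itself holds for any data, because every basket containing the full set also contains each of its subsets.

```python
from itertools import combinations

# The five example baskets from the earlier slides.
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if set(itemset) <= basket)

def subsets_at_least_as_frequent(itemset, baskets):
    """True if every non-empty proper subset occurs at least as often as `itemset`."""
    full = count(itemset, baskets)
    items = sorted(itemset)
    return all(
        count(subset, baskets) >= full
        for size in range(1, len(items))
        for subset in combinations(items, size)
    )

print(count({"bread", "butter", "jam"}, BASKETS))  # 1
print(count({"bread", "butter"}, BASKETS))         # 3
print(subsets_at_least_as_frequent({"bread", "butter", "jam"}, BASKETS))  # True
```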
Naïve Approach
No problem. Algorithm is obvious:
Count the occurrences of every possible itemset across all transactions.
If our transactions are: BC, BD, AC, BCD, ABD, ABCD
We count: AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, ABCD
Uhh... And when you have as many different items as a
supermarket?? Say 100,000 different products? Ignoring the empty
set and the single-item sets, that's 2^100000 - 100000 - 1...
You want to know how many that is?
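The naive approach, and the size of its search space, can be sketched in a few lines. The names `naive_counts` and `candidate_count` are illustrative. The digit count is obtained with a logarithm rather than by printing the number.

```python
import math
from itertools import combinations

# The six example transactions from the slide.
TRANSACTIONS = ["BC", "BD", "AC", "BCD", "ABD", "ABCD"]

def naive_counts(transactions):
    """Count every itemset of size 2 or more, whether or not it is frequent."""
    items = sorted(set("".join(transactions)))
    counts = {}
    for size in range(2, len(items) + 1):
        for candidate in combinations(items, size):
            counts["".join(candidate)] = sum(
                1 for t in transactions if set(candidate) <= set(t)
            )
    return counts

def candidate_count(n_items):
    """Number of itemsets, ignoring the empty set and the single-item sets."""
    return 2 ** n_items - n_items - 1

COUNTS = naive_counts(TRANSACTIONS)
print(len(COUNTS))         # 11
print(COUNTS["BD"])        # 4
print(candidate_count(4))  # 11

# Number of decimal digits in the candidate count for 100,000 products.
DIGITS = math.floor(math.log10(candidate_count(100_000))) + 1
print(DIGITS)              # 30103
```

Four items already give 11 candidate itemsets; 100,000 items give a candidate count that is itself more than 30,000 digits long.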
Naïve Approach: BAD!!!
2^100000 - 100000 - 1 is a number with 30,103 digits. It begins:
9990020930143845079440327643300335909804...
(The slide fills the rest of the page with further digits of this number.)
Frequent Itemsets
Let's not try to work out the support for all possible combinations.
Subsets of frequent itemsets are frequent. All subsets of a set that
meets the minimum support will also necessarily meet the
minimum support.
So if we know a subset is infrequent, any superset must also be infrequent.
So, instead of trying all combinations, we'll generate itemsets of a
particular size and scan the database to see if any of them meet
the support threshold. We know that any subsets of frequent
sets are also frequent and supersets of infrequent sets are also
infrequent, so we don't need to check them.
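The level-by-level idea can be sketched as below, using the six transactions from the naive-approach slide. This is an illustrative sketch of the pruning principle only, not necessarily the exact algorithm presented in the course. The function name and the minimum count of 3 are assumptions made for the example.

```python
from itertools import combinations

# The six example transactions from the naive-approach slide.
TRANSACTIONS = ["BC", "BD", "AC", "BCD", "ABD", "ABCD"]

def frequent_itemsets(transactions, min_count):
    """Level-wise search: only build larger candidates from frequent smaller sets."""
    baskets = [frozenset(t) for t in transactions]
    items = sorted({item for basket in baskets for item in basket})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        # Scan the data once for this level and keep the frequent candidates.
        level = {}
        for candidate in candidates:
            n = sum(1 for basket in baskets if candidate <= basket)
            if n >= min_count:
                level[candidate] = n
        frequent.update(level)
        # Build candidates one item larger by joining frequent sets of this level.
        size += 1
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune any candidate that has an infrequent subset of the previous size.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return frequent

# Hypothetical threshold: an itemset must appear in at least 3 transactions.
FREQUENT = frequent_itemsets(TRANSACTIONS, min_count=3)
for itemset, n in sorted(FREQUENT.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), n)
```

With this threshold BC and BD are frequent but CD is not, so BCD is pruned without ever being counted.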
Itemset Lattice
[Figure: lattice of all itemsets over the items A, B, C, D and E, running from
the empty set (null) through the 1-, 2-, 3- and 4-item sets down to ABCDE. One
itemset is marked as infrequent and all of its supersets are marked as pruned.
Lattice borrowed from CSE980 @ MSU.]
A Priori
The algorithm that does this is called A Priori and most other ARM
techniques are based on it.
Will look at it in more detail next week.
Issues with WEKA and ARM
ARFF is a horrible horrible format for ARM.
Most datasets are very sparse with the attributes being present or
not present. Bread 0/1, Milk 0/1, etc. We want to record this as
{bread, milk, cheese}, not a huge table of 1s and 0s.
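The difference between the two representations can be illustrated with a tiny conversion. The column names and rows below are made-up sample data, and `rows_to_transactions` is a hypothetical helper, not a WEKA function.

```python
# Hypothetical 0/1 table: one column per product, one row per basket.
COLUMNS = ["bread", "milk", "cheese", "beer"]
ROWS = [
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]

def rows_to_transactions(columns, rows):
    """Turn each 0/1 row into the set of items that are present."""
    return [
        {name for name, flag in zip(columns, row) if flag}
        for row in rows
    ]

TRANSACTIONS = rows_to_transactions(COLUMNS, ROWS)
print(TRANSACTIONS[0] == {"bread", "milk", "cheese"})  # True
print(TRANSACTIONS[1] == {"beer"})                     # True
```

With thousands of products and around 30 items per basket, the set form stores only the items that are present, while the table stores a 0 for every absent product.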
Weka doesn't include many ARM algorithms...
In fact it has three; thankfully, one is A Priori. The book doesn't
include much information, but Dunham has good coverage.
We'll also look at some other ARM applications built by Frans
Coenen and Paul Leng here at Liverpool.
Further Reading
Witten 4.5
Dunham 6.1, 6.2
Han 5.1
Berry and Browne 14.1-14.3
Berry and Linoff Chapter 9
Zhang, Association Rule Mining, Chapter 1, 2.1, 2.2
Pal and Mitra, 8.3