Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Discovery of Indirect Association from Web Usage Data Vipin Kumar Army High Performance Computing Research Center Department of Computer Science & Engineering University of Minnesota [email protected] Homepage: http://www.cs.umn.edu/~kumar © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› Overview Application/ Web Server Web User Internet Knowledge © Vipin Kumar, PAKDD (May 2002) Web Usage Mining University of Minnesota / Army High Performance Research Center Web usage data ‹#› Overview… Web Usage Mining: – the automatic extraction of non-trivial patterns from Web usage data Main techniques: – Clustering: to find groups of users who share similar browsing behavior – Classification: to categorize Web users according to their past access history – Association: to determine what are the set of page views often accessed together in the same server session © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› Application of Web Usage Mining Web Usage Mining Personalization Site H elper Letizia Web Watcher M obasher Analog Krishnapuram System Improvement R exford Schecter Ag g arwal Site M odification Adaptive Sites WebSIFT WU M SpeedTracer WebLog M iner Shahabi Business Intellig ence SurfAid Buchner Tuzhilin U sag e C haracterization Pitkow Arlitt M anley Almeida Source: J. Srivastava, R. Cooley, M. Deshpande, PN Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, SIGKDD Explorations (2000) © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› Taxonomy of Web Usage Mining Pro ject Ap p licatio n Data Fo cu s Serv er W eb SIFT (C TS9 9 ) Gen eral x Sp eed Tracer (W YB 9 8 ,C PY9 6 ) Gen eral x W UM (SF9 8 ) Gen eral x Sh ah ab i (SZAS9 7 ,ZASS9 7 ) Gen eral Site Help er (NW 9 7 ) Perso n alizatio n x Letizia (Lie9 5 ) Perso n alizatio n W eb W atch er (JFM9 7 ) Perso n alizatio n Krish n ap u ram(NKJ9 9 ) Perso n alizatio n x An alo g (YJGD9 6 ) Perso n alizatio n x Mo b ash er (MC S9 9 ) Perso n alizatio n x Tu zh ilin (PT9 8 ) B u sin ess x Su rfAid B u sin ess x B u ch n er(B M9 8 ) B u sin ess x W eb Tren d s,Hitlist,Accru e,etc. B u sin ess x W eb Lo g Min er (ZXH9 8 ) B u sin ess x Pag eGath er,SC ML (PE9 8 ,PE9 9 )Site Mo d ificatio n x Man ley (Man 9 7 ) C h aracterizatio n x Arlitt(AW 9 6 ) C h aracterizatio n x Pitk o w(PIT9 7 ,PIT9 8 ) C h aracterizatio n x Almeid a(AB C 9 6 ) C h aracterizatio n x R ex fo rd (C KR 9 8 ) Sy stem Imp ro v e. x Sch ech ter(SKS9 8 ) Sy stem Imp ro v e. x Ag g arwal(AY9 7 ) Sy stem Imp ro v e. So u rce Pro x y C lien t x x x x x x Data Ty p e User Site Stru ctu re C o n ten t Usag e Pro file Sin g le Mu lti Sin g le Mu lti x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x Source: J. Srivastava, R. Cooley, M. Deshpande, PN Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, SIGKDD Explorations (2000) © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› Mining Association Patterns in Web Data Input Output /home /product /product/electronics Frequent Itemsets /home /shipping /shipping/status Association Rules /home /product /product/offer /product/electronics/tv /home /product /product/electronics /product/electronics/tv Click-streams © Vipin Kumar, PAKDD (May 2002) Association Mining University of Minnesota / Army High Performance Research Center Sequential Patterns Association Patterns ‹#› Mining Association Patterns in Web Data… Provide the following information: – What are the set of pages frequently accessed together by Web users? (frequent itemsets) – What page will be fetched next? (association rules) – What are the paths frequently traversed by Web users? (sequential patterns) Web association patterns are useful: – – – – To improve Web site design To develop prefetching and Web caching policies. To recommend related pages To collect business intelligence about behavior of Web users © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› Association Mining is Support-based Frequent itemset: – combination of pages that have support greater than a user-specified minimum threshold Association Rule: s,c Support, s A B Number of sessions that contain A and B Total number of sessions Confidence , c Number of sessions that contain A and B Number of sessions that contain A Higher support implies greater statistical significance Support-based pruning constraints the exponential complexity and makes the association rule computation tractable © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› Can Infrequent Patterns be Interesting? If support between A and B is too small, then there may be a negative correlation between A and B – P(A,B) < P(A) P(B) – If (A B) has a high support, then support(A,B) will tend to be low because: P(A B) = P(B) – P(A,B) – Example: Coffee Tea © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› Approach 1: Using Negative Items Session Id A A B B C C D D 1 1 0 0 1 1 0 0 1 2 1 0 0 1 0 1 0 1 3 1 0 0 1 1 0 0 1 4 1 0 1 0 0 1 0 1 5 1 0 0 1 0 1 1 0 Computationally expensive Tends to produce too many negative associations © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› Approach 2: Negative Itemsets Savasere et al [1998]: – A negative itemset is a set of items whose actual support is significantly lower than its expected support – Expected support can be computed using item taxonomy Suppose C and G are frequent: A F C B D G E © Vipin Kumar, PAKDD (May 2002) J H K sup( CG) sup( E ) sup( J ) sup( C ) sup( G ) sup( CG) sup( J ) Exp(sup( CJ )) sup( G ) sup( CG) sup( H ) Exp(sup( CH )) sup( G ) Exp(sup( EJ )) University of Minnesota / Army High Performance Research Center ‹#› Indirect Association a IF: THEN: M b a and M are frequent and highly dependent on each other b and M are frequent and highly dependent on each other a and b are expected to be frequent If a and b are infrequent, it has an interesting negative association a and b are indirectly associated via mediator M © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› Non-Sequential Indirect Association: Formulation M a b A pair of Web pages, a and b, are indirectly associated via mediator set M, if : – Sup({a,b}) < ts – Sup({a} M) tf, Sup({b} M) tf – Dep(a,M) td, Dep(b,M) td where Dep(x,M) can be any “reasonable” objective measures © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› Finding Interesting Negative Associations For all pairs of items: With Mediator Frequent FM No Mediator FN Minimum itempair support Infrequent IM IN If IM/FM = IN/FN mediator thresholds © Vipin Kumar, PAKDD (May 2002) then Indirect Association is not surprising University of Minnesota / Army High Performance Research Center ‹#› Finding Interesting Negative Associations With Mediator No Mediator FM FN Frequent • IM/FM is small • IM/IN is small Indirect Association Infrequent IN is interesting IM © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› Finding Interesting Negative Association 1 10 Indirect Association is interesting when minimum itempair support threshold is small. IM/FM IM/IN 0 10 But, if threshold is too low, very few indirect associations are obtained. -1 10 -2 10 1 3 5 7 9 11 13 15 17 19 21 Minimum itempair support (ts ) © Vipin Kumar, PAKDD (May 2002) 23 25 27 29 University of Minnesota / Army High Performance Research Center ‹#› Application: Market-Basket Analysis Substitute items: up-sell – Pavilion PC 17” Monitor Pavilion multimedia PC Competing items: competitive analysis – Coke Ruffles Pepsi Complementary items: cross-sell – Tekken 3 Playstation Memory Card Tomb Raider 2 © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› Other Applications: Information Retrieval Identify synonyms and antonyms Identify the different contexts of a queried word – Useful to group together query results Data Mining Web Trade Worker © Vipin Kumar, PAKDD (May 2002) Union Gold Soviet University of Minnesota / Army High Performance Research Center ‹#› Other Applications: Stock Market Dow Jones (DJI) % Change 5 0 Let v(t): value of time series on day t % change (t) = (v(t) - v(t-1))/v(t-1)*100 -5 -10 11/8/99 Microsoft Corp (MSFT) 3/28/02 Event definition: Up: % change > 2% Down: % change < -2% % Change 20 0 -20 11/8/99 RedHat Inc (RHAT) 3/28/02 % Change 20 Support counts: {DJI-Down,MSFT-Down} : 26 {RHAT-Up,MSFT-Down} : 24 {DJI-Down,RHAT-Up} :4 0 -20 11/8/99 3/28/02 Indirect Association can partition events that are associated with the movement of a stock price. MSFTDown DJI-Down © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center RHAT-Up ‹#› Sequential Indirect Association: Formulation A pair of Web pages, a and b, forms a sequential indirect association via a mediator sequence, w, if the following conditions are satisfied: – Support({a,b}) < ts – Support(s1) tf, Support(s2) tf – Dependence(a,w) td, Dependence(b,w) td – w does not contain a and b (s1=aw or wa, s1=bw or wb) discover groups of users who have different interests but share a common traversal path © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› Types of Sequential Indirect Association Different entry points a Different entry and user interests w b w a a b w Different user interests Type C (convergence) © Vipin Kumar, PAKDD (May 2002) Type D (divergence) b Type T (transitive) University of Minnesota / Army High Performance Research Center ‹#› Clustering vs Indirect Association Clustering is another way to discover different groups of users – A. Banerjee and J.Ghosh, Clickstream Clustering using Weighted Longest Common Subsequences, Workshop on Web Mining (2001) – TW Yan, M Jacobsen, H Garcia-Molina and U Dayal, From User Access Patterns to Dynamic Hypertext Linking, Proc of the 5th International World Wide Web Conference (1996). – YJ Fu, K Sandhu, MY Shih, A Generalization-Based Approach to Clustering of Web Usage Sessions, Web Usage Analysis and User Profiling, LNAI (2001) Clustering cannot find distinct groups of users who share a similar traversal path since the support of the mediator is large (mediator often contains navigation pages) It is more likely that several indirect associations are contained within a single cluster, rather than each indirect association connects between two separate clusters © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› Indirect Association for Web Data Site Structure A C Large 3-itemset (minsup = 40%) Pattern {A,B,D} {A,B,E} {B,C,D} B Support 2 2 2 {A,B} D E D E Web sessions Session Id 1 2 3 4 5 Non-sequential Indirect Association Sequence <A,B,C,B,D> <A,B,E> <B,C> <A,B,E,B,D> <B,D,B,C> © Vipin Kumar, PAKDD (May 2002) Sequential Indirect Association Large 3-sequence (minsup = 40%) Pattern A->B->D A->B->E A Support 2 2 B D University of Minnesota / Army High Performance Research Center E ‹#› Impact of Site Structure on Associations Navigation pages (e.g. home pages and hub pages) tend to have higher support than content pages 5 5 10 10 4 4 10 10 3 3 10 Support Support 10 2 10 1 1 10 10 0 10 2 10 0 1 2 3 4 5 6 7 8 Number of hops away from the home page 9 10 10 1 2 3 4 5 6 7 Number of outgoing links 8 9 10 Source: U of Minnesota Computer Science department Web logs (Jan 1-31, 2001) If threshold is too high, most of the patterns contain only navigation pages If threshold is too low, too many patterns! © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› How Indirect Association helps Indirect association groups together patterns that have similar substructures – The common substructures (mediators) often contain the navigation pages Mediator A Navigation pages B D © Vipin Kumar, PAKDD (May 2002) E University of Minnesota / Army High Performance Research Center ‹#› Grouping Indirect Associations Indirect associations can also be grouped together into more compact structures if they share a common mediator A,B A,B D E D F E A,B Check degree of association D F © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› Viewer for Non-Sequential Indirect Association Indirect Pairs (dashed line) List of Mediators Frequent Pairs (solid line) Currently Selected Mediator © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› Experimental Setup Data sources: – UMN: University of Minnesota Web log contains 91,443 page views and 34,526 sessions – EC: E-commerce Web log Contains 6664 page views and 143,604 sessions Steps: 1. Preprocessing – identify sessions, convert sessions into “transactions” 2. Extract frequent itemsets or sequences: – apply Apriori algorithm to find frequent itemsets – Apply GSP algorithm to find frequent sequences 3. Apply indirect association algorithm 4. Merge indirect associations that have common mediator © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› Experimental Results (UMN) Prospective graduate students Contact information Students taking CS as minor subject CS Graduate student association Students planning to take PhD exam Experimental Results (UMN) Regular Seminar/ Colloquium attendees Contact information Students taking CS as minor subject Experimental Results (EC) Experimental Results (EC) Experimental Results (EC) Experimental Results (EC) Experimental Results (EC) Experimental Results (EC) # a Mediator, w b 1 Stereo HomeElectronicsAuto Radar Detector 2 Hunting HomeSporting Goods Aerobics 3 Women’s gown set HomeApparel Men’s boots 4 Camcorders HomeElectronicsVideo 13” TV and VCR combo 5 Scanner HomeElectronicsComputer Multimedia Computer 6 Shower curtains DomesticsBath Shop Towel set 7 Bedroom furniture frame Home & Accessories Furniture Oak nightstand 8 Speakers ElectronicsStereo CD boombox 9 Men’s HomeJewelry Ladies 10 Cordless phone HomeTelephones Answering device © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› Conclusion Indirect Association provides an alternative approach to capture interesting infrequent patterns – For Web data, indirect association represents the distinct interests of Web users who share similar traversal path – Such patterns cannot be easily found using standard association and clustering techniques Indirect Association can be used to group together patterns into more compact structures – Navigation pages form the mediators © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#› References 1. PN Tan, V Kumar, J Srivastava, Indirect Association : Mining Higher Order Dependencies in Data, In Proc of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, Lyon, France, Sept 13-16, 632-637 (2000) 2. PN Tan, V Kumar, H Kuno, Using SAS for Mining Indirect Associations in Data, In Proc of the Western Users of SAS Software Conference (2001) 3. PN Tan, V Kumar, Mining Indirect Associations in Web Data, In Proc of WebKDD 2001: Mining Log Data Across All Customer TouchPoints, August (2001) 4. KDD Web Mining workshops: WebKDD 1999, WebKDD 2000, WebKDD 2001 5. SIAM Workshop on Web Mining 2001 and 2002 © Vipin Kumar, PAKDD (May 2002) University of Minnesota / Army High Performance Research Center ‹#›