Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Web Mining: Introduction Alípio Jorge, DCC-FC, Universidade do Porto [email protected] 1 Data Mining 2 Alípio Jorge Data Mining Problems Techniques Domains Data formats Challenges 2 Data Mining 2 Alípio Jorge Overview Web Mining Usage Mining Structure Mining Links Social networks Content Mining 3 Recommender systems Data collection and processing Deployment Text mining Information extraction Information retrieval Opinion mining / sentiment analysis Data Mining 2 Alípio Jorge Knowledge (sort of) tree data mining web mining web usage mining text mining web content mining web structure mining recommender systems authority identification information extraction 4 Opinion mining information retrieval ... mining scientific data Data Mining 2 Alípio Jorge community identification Web Mining Web Usage Mining discovery from user access patterns from logs or alike applications: user segmentation, recommendation, personalization, adaptation, usability improvement Web Structure Mining discovery of useful knowledge from hyperlinks applications: discover important pages (information retrieval) discover communities Web Content Mining extracts information from Web pages applications 5 information extraction, summarization, topic extraction, opinion mining, sentiment analysis, information retrieval Data Mining 2 Alípio Jorge Web Usage Mining: Problems User Segmentation Content Bundling Item Recommendation Menu Customization User Action Prediction ... 6 Data Mining 2 Alípio Jorge Web Usage Mining: Techniques Clustering methods segmentation content bundling Association rule discovery methods recommendation and personalization Collaborative filtering Markov chains Classification predicting if a user is leaving the site or what is doing next ... 7 Data Mining 2 Alípio Jorge Web Usage Mining: Problem User segmentation we want to find user segments according to their activity what is a good user segment? Examples Web site targeting Newsletter targeting Study the evolution of usage styles Data 8 What is necessary? Data Mining 2 Alípio Jorge Web Usage Mining: Tec: Clustering Example task we want to have different entry pages for different user groups Strategy 9 users who tend to visit the same pages are regarded as a group In DM, this is automatically done by Clustering Algorithms Data Mining 2 Alípio Jorge Web Usage Mining: Clustering USER PAGE 1 A 1 B 1 C 2 A 2 C 3 B 3 G 3 F 3 I 4 B 4 C 5 G 5 F 5 I 5 J 6 A 6 C 10 Activity look for two groups of users Two users are similar if they tend to view the same pages similarity of two users X and Y can be: #pages seen by both / # pages seen by any of them We can also use Euclidean distance each user is described as a point in the space <A,B,C,D,E,F,G,I,J> then we calculate the distance between the two points ( A1 − A2)2 + (B1 − B2)2 + ...+ ( J1 − J 2)2 Data Mining 2 Alípio Jorge Web Usage Mining: Clustering USER PAGE 1 A 1 B 1 C 2 A 2 C 3 B 3 G 3 F 3 I 4 B 4 C 5 G 5 F 5 I 5 J 6 A 6 C 11 Bottom up Hierarchical Clustering look for the two most similar users join them in one cluster look for the two most similar users/clusters etc. continue until all users are in one cluster Distance between two clusters U and V minimum: the minimum distance between one element of U and one element of V maximum: adapt from above a.k.a. single link a.k.a. complete link average: adapt from above Data Mining 2 Alípio Jorge Web Usage Mining: Clustering USER PAGE 1 A 1 B 1 C 2 A 2 C 3 B 3 G 3 F 3 I 4 B 4 C 5 G 5 F 5 I 5 J 6 A 6 C 12 Bottom up Hierarchical Clustering aggregation tree obtained with RapidMiner we can obtain a predefined number of clusters operator AgglomerativeClustering Euclidean distance complete linkage operator AglomerativeFlatClustering Kmeans clustering another popular clustering method Data Mining 2 Alípio Jorge Web Usage Mining: Clustering USER PAGE 1 A 1 B 1 C 2 A 2 C 3 B 3 G 3 F 3 I 4 B 4 C 5 G 5 F 5 I 5 J 6 A 6 C 13 Applications entry page customization with known users with unknown users Newsletter customization Segmented usability study Dynamics Other variables to consider Aggregate number of page views size of average session number of sessions pageview duration Data Mining 2 Alípio Jorge Activity(R) USER PAGE 1 A 1 B 1 C 2 A 2 C 3 B 3 G 3 F 3 I 4 B 4 C 5 G 5 F 5 I 5 J 6 A 6 C 14 # read data (copy and…) #d<-read.table(file("clipboard"),header=TRUE) # macs: read.table(pipe(“pbpaste”),…) # or paste into a file and read.table/ read.csv d<-read.csv("toy-session-data.csv") # transform data into a matrix dm<-table(d$USER,d$PAGE) # cluster and view dendrogram plot(hclust(dist(dm))) # check parameters of dist and hclust Data Mining 2 Alípio Jorge Activity (Rapid Miner) Reproduce the clustering results with RapidMiner first we have to convert the 2 column table to a table with one user per line and one page per column excel: pivot tables + function IF open Rapid Miner and nuild a new process with two operators ExcelExample Source Agglomerative Cluster 15 define the right parameters define the right parameters press run (play) User A B C F G I J 1 1 1 1 0 0 00 2 1 0 1 0 0 00 3 0 1 0 1 1 10 4 0 1 1 0 0 00 5 0 0 0 1 1 11 6 1 0 1 0 0 00 Data Mining 2 Alípio Jorge Case summary We want to differentiate users We use access data Then clustering A known user can be assigned to an appropriate group And be shown a specific version of the site We could also 16 Cluster pages Data Mining 2 Alípio Jorge Web content mining: tag bundles Problem Social sites have tagged items Users provide tags (social tagging) We can bundle tags for helping users to provide tags How? Process [Kammergruber et al. 2010] 17 Get tags provided by users A transaction is the set of tags given by a single user Run Association Rule discovery Tag bundle_C = {t | AC is a rule, t is a tag in A} Data Mining 2 Alípio Jorge Resources Book Web Data Mining, Bing Liu Papers 18 Walter Kammergruber, M.Viermetz, K. Ehms “Using association rules for discovering tag bundles in social tagging data”, CISIM 2010. Data Mining 2 Alípio Jorge