Download Web Mining: Introduction

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Transcript
Web Mining: Introduction
Alípio Jorge, DCC-FC, Universidade do Porto
[email protected]
1
Data Mining 2
Alípio Jorge
Data Mining
Problems
Techniques
Domains
Data formats
Challenges
2
Data Mining 2
Alípio Jorge
Overview
Web Mining
Usage Mining
Structure Mining
Links
Social networks
Content Mining
3
Recommender systems
Data collection and processing
Deployment
Text mining
Information extraction
Information retrieval
Opinion mining / sentiment analysis
Data Mining 2
Alípio Jorge
Knowledge (sort of) tree
data mining
web mining
web usage mining
text mining
web content mining
web structure mining
recommender
systems
authority
identification
information
extraction
4
Opinion
mining
information
retrieval
...
mining scientific data
Data Mining 2
Alípio Jorge
community
identification
Web Mining
Web Usage Mining
discovery from user access patterns from logs or alike
applications:
user segmentation, recommendation, personalization, adaptation, usability
improvement
Web Structure Mining
discovery of useful knowledge from hyperlinks
applications:
discover important pages (information retrieval)
discover communities
Web Content Mining
extracts information from Web pages
applications
5
information extraction, summarization, topic extraction, opinion mining, sentiment
analysis, information retrieval
Data Mining 2
Alípio Jorge
Web Usage Mining: Problems
User Segmentation
Content Bundling
Item Recommendation
Menu Customization
User Action Prediction
...
6
Data Mining 2
Alípio Jorge
Web Usage Mining: Techniques
Clustering methods
segmentation
content bundling
Association rule discovery methods
recommendation and personalization
Collaborative filtering
Markov chains
Classification
predicting if a user is leaving the site or what is doing next
...
7
Data Mining 2
Alípio Jorge
Web Usage Mining: Problem
User segmentation
we want to find user segments according to their activity
what is a good user segment?
Examples
Web site targeting
Newsletter targeting
Study the evolution of usage styles
Data
8
What is necessary?
Data Mining 2
Alípio Jorge
Web Usage Mining: Tec: Clustering
Example task
we want to have different entry pages for different user groups
Strategy
9
users who tend to visit the same pages are regarded as a group
In DM, this is automatically done by Clustering Algorithms
Data Mining 2
Alípio Jorge
Web Usage Mining: Clustering
USER PAGE
1
A
1
B
1
C
2
A
2
C
3
B
3
G
3
F
3
I
4
B
4
C
5
G
5
F
5
I
5
J
6
A
6
C
10
Activity
look for two groups of users
Two users are similar if they tend to view the
same pages
similarity of two users X and Y can be:
#pages seen by both / # pages seen by any of them
We can also use Euclidean distance
each user is described as a point in the space
<A,B,C,D,E,F,G,I,J>
then we calculate the distance between the two points
( A1 − A2)2 + (B1 − B2)2 + ...+ ( J1 − J 2)2
Data Mining 2
Alípio Jorge
Web Usage Mining: Clustering
USER PAGE
1
A
1
B
1
C
2
A
2
C
3
B
3
G
3
F
3
I
4
B
4
C
5
G
5
F
5
I
5
J
6
A
6
C
11
Bottom up Hierarchical Clustering
look for the two most similar users
join them in one cluster
look for the two most similar users/clusters
etc.
continue until all users are in one cluster
Distance between two clusters U and V
minimum: the minimum distance between one
element of U and one element of V
maximum: adapt from above
a.k.a. single link
a.k.a. complete link
average: adapt from above
Data Mining 2
Alípio Jorge
Web Usage Mining: Clustering
USER PAGE
1
A
1
B
1
C
2
A
2
C
3
B
3
G
3
F
3
I
4
B
4
C
5
G
5
F
5
I
5
J
6
A
6
C
12
Bottom up Hierarchical
Clustering
aggregation tree obtained with
RapidMiner
we can obtain a predefined
number of clusters
operator AgglomerativeClustering
Euclidean distance
complete linkage
operator
AglomerativeFlatClustering
Kmeans clustering
another popular clustering
method
Data Mining 2
Alípio Jorge
Web Usage Mining: Clustering
USER PAGE
1
A
1
B
1
C
2
A
2
C
3
B
3
G
3
F
3
I
4
B
4
C
5
G
5
F
5
I
5
J
6
A
6
C
13
Applications
entry page customization
with known users
with unknown users
Newsletter customization
Segmented usability study
Dynamics
Other variables to consider
Aggregate
number of page views
size of average session
number of sessions
pageview duration
Data Mining 2
Alípio Jorge
Activity(R)
USER PAGE
1
A
1
B
1
C
2
A
2
C
3
B
3
G
3
F
3
I
4
B
4
C
5
G
5
F
5
I
5
J
6
A
6
C
14
# read data (copy and…)
#d<-read.table(file("clipboard"),header=TRUE)
# macs: read.table(pipe(“pbpaste”),…)
# or paste into a file and read.table/ read.csv
d<-read.csv("toy-session-data.csv")
# transform data into a matrix
dm<-table(d$USER,d$PAGE)
# cluster and view dendrogram
plot(hclust(dist(dm)))
# check parameters of dist and hclust
Data Mining 2
Alípio Jorge
Activity (Rapid Miner)
Reproduce the clustering results with RapidMiner
first we have to convert the 2 column table to a table with one
user per line and one page per column
excel: pivot tables + function IF
open Rapid Miner and nuild a new process with two operators
ExcelExample Source
Agglomerative Cluster
15
define the right parameters
define the right parameters
press run (play)
User A B C F G I J
1
1 1 1 0 0 00
2
1 0 1 0 0 00
3
0 1 0 1 1 10
4
0 1 1 0 0 00
5
0 0 0 1 1 11
6
1 0 1 0 0 00
Data Mining 2
Alípio Jorge
Case summary
We want to differentiate users
We use access data
Then clustering
A known user can be assigned to an appropriate group
And be shown a specific version of the site
We could also
16
Cluster pages
Data Mining 2
Alípio Jorge
Web content mining: tag bundles
Problem
Social sites have tagged items
Users provide tags (social tagging)
We can bundle tags for helping users to provide tags
How?
Process [Kammergruber et al. 2010]
17
Get tags provided by users
A transaction is the set of tags given by a single user
Run Association Rule discovery
Tag bundle_C = {t | AC is a rule, t is a tag in A}
Data Mining 2
Alípio Jorge
Resources
Book
Web Data Mining, Bing Liu
Papers
18
Walter Kammergruber, M.Viermetz, K. Ehms “Using
association rules for discovering tag bundles in social tagging
data”, CISIM 2010.
Data Mining 2
Alípio Jorge