Web Mining: An Overview
CSC 575
Intelligent Information Retrieval
Web Mining
 Today
 Overview of Web Data Mining
 Web Content Mining / Text Mining
 Web Usage Mining
 Web Personalization
What Is Data Mining
 Data Mining: A Definition
The non-trivial extraction of implicit, previously unknown and
potentially useful knowledge from data in large data repositories
 Non-trivial: obvious knowledge is not useful
 implicit: hidden, difficult-to-observe knowledge
 previously unknown
 potentially useful: actionable; easy to understand
The Knowledge Discovery in Data (KDD)
Viewed as a Process
What Can Data Mining Do
 Many Data Mining Tasks
 often inter-related
 often need to try different techniques for each task
 each task may require different types of knowledge discovery
 Typical data mining tasks
 Classification
 Prediction
 Clustering
 Association Discovery
 Sequence Analysis
 Characterization
 Discrimination
What is Web Mining
 From its very beginning, the potential of extracting valuable
knowledge from the Web has been quite evident
 Web mining is the collection of technologies to fulfill this potential.
Web Mining Definition
The application of data mining and machine learning techniques to extract useful knowledge from the content, structure, and usage of Web resources.
Types of Web Mining
Web mining is commonly divided into three types: Web Content Mining, Web Usage Mining, and Web Structure Mining.

Web Content Mining
 Extracting useful knowledge from the contents of Web documents or other semantic information about Web resources
 Content data may consist of text, images, audio, video, structured records from lists and tables, or item attributes from backend databases
 Applications:
• document clustering or categorization
• topic identification / tracking
• concept discovery
• focused crawling
• content-based personalization
• intelligent search tools

Web Usage Mining
 Extracting interesting patterns from user interactions with resources on one or more Web sites
 Applications:
• user and customer behavior modeling
• Web site optimization
• e-customer relationship management
• Web marketing
• targeted advertising
• recommender systems

Web Structure Mining
 Discovering useful patterns from the hyperlink structure connecting Web sites or Web resources
 Data sources include explicit hyperlinks between documents, or implicit links among objects (e.g., two objects being “tagged” with the same keyword)
 Applications:
• document retrieval and ranking (e.g., Google)
• discovery of “hubs” and “authorities”
• discovery of Web communities
• social network analysis
Web Content Mining
:: data preparation
 Typical steps in content preprocessing
 Extract text and meta data from Web documents (generally performed
automatically using a Web crawler)
 Recognize special entities (e.g., dates) and pre-defined keywords
 Remove stop words and non-relevant terms
 Perform stemming and morphological analysis
 Compute statistics based on term occurrences (see the sketch at the end of this slide)
 document frequency (DF): the number of documents in which the term occurs
 term frequency (TF): the frequency of occurrence of the term within a specific document
 Additional considerations
 For entities such as products, movies, songs, etc., may need to extract or obtain
structured information such as item attributes from databases or available domain
ontologies
 It may be desirable to identify or discover phrases, facets, collocations, etc. (in order to
treat commonly occurring groups of features as a single term).
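A minimal Python sketch of the TF and DF statistics above, assuming the documents have already been tokenized, stop-worded, and stemmed (the toy documents are illustrative):

from collections import Counter

docs = {
    "d1": ["web", "mining", "data", "web"],
    "d2": ["data", "mining", "retrieval"],
}

# term frequency (TF): occurrences of each term within a specific document
tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

# document frequency (DF): number of documents in which each term occurs
df = Counter(term for tokens in docs.values() for term in set(tokens))

print(tf["d1"]["web"])   # 2: "web" occurs twice in d1
print(df["mining"])      # 2: "mining" occurs in both documents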
Web Content Mining
:: data representation
 Vector Representation
 Typically, each document is represented as a multi-dimensional vector over all terms extracted in the preprocessing step
 dimension values represent weights associated with terms in the document
[Example term-document matrix: rows are documents A–E (objects); columns are the terms/attributes web, data, mining, business, intelligence, marketing, information, and retrieval; each cell holds the weight of that term in that document.]
 Term weights may be binary or may be derived as a function of term frequency (TF) and document frequency (DF) – see the TF-IDF sketch below
 In some applications, only a limited number of terms (a controlled vocabulary) is maintained, and the weights may be assigned manually or based on external criteria
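One common way to derive weights from TF and DF is TF-IDF; the slides do not fix a specific formula, so the tf * log(N/df) form below is an assumption:

import math

def tfidf_weight(tf: int, df: int, n_docs: int) -> float:
    # term frequency scaled by rarity: frequent in the document, rare in the collection
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(n_docs / df)

# a term occurring 3 times in a document and appearing in 2 of 5 documents overall
print(tfidf_weight(3, 2, 5))  # ~2.75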
Web Content Mining
:: common approaches and applications
 Basic notion: document similarity
 Most Web content mining and information retrieval applications involve measuring similarity between two or more documents
 Vector representation facilitates similarity computations using vector-space operations, such as the cosine of the angle between two vectors (a minimal computation is sketched after the examples below)
 Examples
 Search engines: measure the similarity between a query (represented as a
vector) and the indexed document vectors to return a ranked list of relevant
documents
 Document clustering: group documents based on similarity or dissimilarity
(distance) among them
 Document categorization: measure the similarity of a new document to be
classified with representations of existing categories (such as the mean vector
representing a group of document vectors)
 Personalization: recommend documents or items based on their similarity to a representation of the user’s profile (which may be a term vector representing concepts or terms of interest to the user)
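A minimal sketch of the cosine measure used in these examples, with documents represented as term-to-weight dicts (the toy vectors are illustrative):

import math

def cosine(v1: dict, v2: dict) -> float:
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)

query = {"web": 1.0, "mining": 1.0}
doc = {"web": 3.0, "data": 4.0, "mining": 2.0}
print(cosine(query, doc))  # higher values indicate greater similarity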
Web Content Mining
:: example – clustered search results
Can drill down within clusters to view sub-topics or to view the relevant subset of results.
Web Content Mining
:: example – personalized content delivery
Google's personalized news is an example of a content-based recommender system, which recommends items (in part) based on the similarity of their content to a user’s profile (gathered from search and click history).
Web Structure Mining
:: graph structures on the Web
 The structure of a typical Web graph
 Web pages as nodes
 hyperlinks as edges connecting two related pages
 Hyperlink Analysis
 Hyperlinks can serve as a tool for pure navigation
 But, often they are used to point to pages with authority on the same topic as the
source page (similar to a citation in a publication)
 Some interesting Web structures *
Web Structure Mining
:: example – Google’s PageRank algorithm
 Basic idea:
[Figure: illustration of PageRank propagation]
 Rank of a page depends on the ranks of the pages pointing to it
 Out-degree of a page is the number of edges pointing away from it – used to compute the contribution of the page to those to which it points
 The final PageRank value represents the probability that a random surfer will reach the page
 d is the probability that a random surfer chooses the page directly rather than getting there via navigation
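A minimal power-iteration sketch of the idea above; the toy graph, damping value, and iteration count are illustrative assumptions, not the production algorithm:

def pagerank(graph: dict, d: float = 0.15, iterations: int = 50) -> dict:
    # graph maps each page to the list of pages it links to;
    # d is the probability of jumping to a page directly
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_ranks = {page: d / n for page in graph}
        for page, out_links in graph.items():
            if not out_links:
                continue
            share = ranks[page] / len(out_links)  # contribution split by out-degree
            for target in out_links:
                new_ranks[target] += (1 - d) * share
        ranks = new_ranks
    return ranks

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))  # values approximate the random surfer's arrival probabilities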
Web Structure Mining
:: example – Hubs and Authorities
 Basic idea
 Authority comes from in-edges
 Being a hub comes from out-edges
 Mutually reinforcing relationship
 A good authority is a page that is pointed to by
many good hubs.
 A good hub is a page that points to many good
authorities.
 Together they tend to form a bipartite graph
 This idea can be used to discover
authoritative pages related to a topic
[Figure: bipartite graph of hubs pointing to authorities]
 HITS algorithm – Hypertext Induced Topic Search
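A minimal sketch of the mutually reinforcing hub/authority updates at the heart of HITS; the toy graph and iteration count are illustrative assumptions:

import math

def hits(graph: dict, iterations: int = 20):
    # graph maps each page to the list of pages it links to
    pages = list(graph)
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a good authority is pointed to by many good hubs
        auths = {p: sum(hubs[q] for q in pages if p in graph[q]) for p in pages}
        # a good hub points to many good authorities
        hubs = {p: sum(auths[t] for t in graph[p]) for p in pages}
        # normalize so the scores stay bounded
        a = math.sqrt(sum(v * v for v in auths.values())) or 1.0
        h = math.sqrt(sum(v * v for v in hubs.values())) or 1.0
        auths = {p: v / a for p, v in auths.items()}
        hubs = {p: v / h for p, v in hubs.items()}
    return hubs, auths

graph = {"A": ["B", "C"], "B": ["C"], "C": []}
hubs, auths = hits(graph)
print(auths)  # C has the highest authority score; A is the best hub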
Web Structure Mining
:: example – Web communities
 Basic idea
 Web communities are collections of Web pages such that each member node has more hyperlinks (in either direction) within the community than outside the community.
 Typical approach: maximal-flow model *
 Ex: separate the two subgraphs with any choice of source node (left subgraph) and sink node (right subgraph), removing the three dashed links
[Figure: two communities (subgraphs) separated by removing the dashed links between a source node and a sink node]
* Source: G. Flake, et al., “Self-Organization and Identification of Web Communities”, IEEE Computer, Vol. 35, No. 3, pp. 66-71, March 2002.
Web Usage Mining
The Problem: analyze Web navigational data to
 Find how the Web site is used by Web users
 Understand the behavior of different user segments
 Predict how users will behave in the future
 Target relevant or interesting information to individual or groups of users
 Increase sales, profit, loyalty, etc.
Challenge
 Quantitatively capture Web users’ common interests and characterize
their underlying tasks
Applications of Web Usage Mining
 Electronic Commerce
 design cross marketing strategies across products
 evaluate promotional campaigns
 target electronic ads and coupons at user groups based on their access patterns
 predict user behavior based on previously learned rules and users’ profiles
 present dynamic information to users based on their interests and profiles:
“Web personalization”
 Effective and Efficient Web Presence
 determine the best way to structure the Web site
 identify “weak links” for elimination or enhancement
 prefetch files that are most likely to be accessed
 enhance workgroup management & communication
 Search Engines
 Behavior-based ranking
Behavior-based ranking
 For each query Q, keep track of which docs in the results are
clicked on
 On subsequent requests for Q, re-order docs in results based on
click-throughs.
 Relevance assessment based on
 Behavior/usage
 vs. content
Query-doc popularity matrix B
 Rows correspond to queries q; columns correspond to docs j
 Bqj = number of times doc j was clicked through on query q
 When query q is issued again, order docs by their Bqj values.
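A minimal sketch of this matrix kept as a sparse dictionary; the function and variable names are illustrative assumptions:

from collections import defaultdict

B = defaultdict(int)  # (query, doc) -> click-through count Bqj

def record_click(query: str, doc: str) -> None:
    B[(query, doc)] += 1

def rerank(query: str, docs: list) -> list:
    # on a repeated query, order docs by their Bqj values
    return sorted(docs, key=lambda doc: B[(query, doc)], reverse=True)

record_click("web mining", "doc7")
record_click("web mining", "doc7")
record_click("web mining", "doc3")
print(rerank("web mining", ["doc3", "doc7", "doc9"]))  # ['doc7', 'doc3', 'doc9']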
Vector space implementation
 Maintain a term-doc popularity matrix C
 as opposed to query-doc popularity
 initialized to all zeros
 Each column represents a doc j
 If doc j is clicked on for query q, update Cj ← Cj + q (here q is viewed as a term vector).
 On a query q’, compute its cosine proximity to Cj for all j.
 Combine this with the regular text score.
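A minimal sketch of the term-doc popularity matrix C, with queries and columns Cj as term-to-weight dicts; the cosine helper is repeated from the earlier sketch so this example stands alone:

import math
from collections import defaultdict

def cosine(v1: dict, v2: dict) -> float:
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

C = defaultdict(dict)  # doc j -> accumulated query vector Cj, initially empty (all zeros)

def record_click(query_vec: dict, doc: str) -> None:
    # Cj <- Cj + q
    for term, w in query_vec.items():
        C[doc][term] = C[doc].get(term, 0.0) + w

def popularity_score(query_vec: dict, doc: str) -> float:
    # cosine proximity of a new query q' to Cj; combine with the regular text score
    return cosine(query_vec, C[doc])

record_click({"white": 1.0, "house": 1.0}, "doc1")
print(popularity_score({"white": 1.0}, "doc1"))  # ~0.71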
Issues
 Normalization of Cj after updating
 Assumption of query compositionality
 “white house” document popularity derived from “white” and “house”
 Updating - live or batch?
 Basic assumption:
 Relevance can be directly measured by number of click throughs
 Valid?
Web Usage Mining
:: data sources
 Typical Sources of Data:
 automatically generated Web/application server access logs
 e-commerce and product-oriented user events (e.g., shopping cart changes,
product clickthroughs, etc.)
 user profiles and/or user ratings
 meta-data, page content, site structure
 User Transactions
 sets or sequences of pageviews possibly with associated weights
 a pageview is a set of page files and associated objects that contribute to a
single display in a Web Browser
What’s in a Typical Server Log?
1 2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html - 200 9221 HTTP/1.1
maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727)
http://dataminingresources.blogspot.com/
2 2006-02-01 00:08:46 1.2.3.4 - GET /classes/cs589/papers/cms-tai.pdf - 200 4096 HTTP/1.1
maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727)
http://maya.cs.depaul.edu/~classes/cs589/papers.html
3 2006-02-01 08:01:28 2.3.4.5 - GET /classes/ds575/papers/hyperlink.pdf - 200 318814
HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)
http://www.google.com/search?hl=en&lr=&q=hyperlink+analysis+for+the+web+survey
4 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/announce.html - 200 3794 HTTP/1.1
maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)
http://maya.cs.depaul.edu/~classes/cs480/
5 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/styles2.css - 200 1636 HTTP/1.1
maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)
http://maya.cs.depaul.edu/~classes/cs480/announce.html
6 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/header.gif - 200 6027 HTTP/1.1
maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)
http://maya.cs.depaul.edu/~classes/cs480/announce.html
Typical Fields in a Log File Entry
client IP address:   1.2.3.4
base url:            maya.cs.depaul.edu
date/time:           2006-02-01 00:08:43
http method:         GET
file accessed:       /classes/cs589/papers.html
protocol version:    HTTP/1.1
status code:         200 (successful access)
bytes transferred:   9221
referrer page:       http://dataminingresources.blogspot.com/
user agent:          Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727)
In addition, there may be fields corresponding to
• login information
• client-side cookies (unique keys, issued to clients in order to identify
a repeat visitor)
• session ids issued by the Web or application servers
Usage Data Preparation Tasks
 Data cleaning
 remove irrelevant references and fields in server logs
 remove references due to spider navigation
 add missing references due to caching
 Data integration
 synchronize data from multiple server logs
 integrate e-commerce and application server data
 integrate meta-data
 Data Transformation
 pageview identification
 identification of unique users
 sessionization – partitioning each user’s record into multiple sessions or transactions, usually representing different visits (a time-gap heuristic is sketched after this list)
 mapping between user sessions and topics or categories
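A minimal sketch of time-gap sessionization; the 30-minute timeout is a common but illustrative assumption:

from datetime import datetime, timedelta

def sessionize(requests: list, timeout_minutes: int = 30) -> list:
    # requests: one user's (timestamp, pageview) pairs, sorted by time;
    # returns a list of sessions, each a list of pageviews (one per visit)
    sessions, current, last_time = [], [], None
    for ts, page in requests:
        if last_time is not None and ts - last_time > timedelta(minutes=timeout_minutes):
            sessions.append(current)  # gap too large: start a new session
            current = []
        current.append(page)
        last_time = ts
    if current:
        sessions.append(current)
    return sessions

requests = [
    (datetime(2006, 2, 1, 0, 8), "/papers.html"),
    (datetime(2006, 2, 1, 0, 9), "/cms-tai.pdf"),
    (datetime(2006, 2, 1, 8, 1), "/hyperlink.pdf"),  # hours later: a new visit
]
print(sessionize(requests))  # [['/papers.html', '/cms-tai.pdf'], ['/hyperlink.pdf']]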
Conceptual Representation of User
Transactions or Sessions
Sessions/user transactions (rows) over pageviews/objects (columns):

         A     B     C     D     E     F
user0   15     5     0     0     0   185
user1    0     0    32     4     0     0
user2   12     0     0    56   236     0
user3    9    47     0     0     0   134
user4    0     0    23    15     0     0
user5   17     0     0   157    69     0
user6   24    89     0     0     0   354
user7    0     0    78    27     0     0
user8    7     0    45    20   127     0
user9    0    38    57     0     0    15
Raw weights may be binary or based on time spent on a page; in practice, this data needs to be normalized or standardized (see the sketch below).
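A minimal sketch of one possible standardization, min-max scaling a session's raw weights into [0, 1]; the choice of scaling is an assumption, not prescribed by the slides:

def normalize_session(weights: dict) -> dict:
    # weights: pageview -> raw weight (e.g., time spent) for one session
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:
        return {p: 1.0 for p in weights}
    return {p: (w - lo) / (hi - lo) for p, w in weights.items()}

print(normalize_session({"A": 15, "B": 5, "F": 185}))
# {'A': 0.055..., 'B': 0.0, 'F': 1.0} -- note the least-weighted visited page maps to 0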
Web Usage Mining as a Process
Common Web Usage Mining Tasks
Clustering (unsupervised):
 Automatically group together users with similar purchasing or navigational patterns
 User / customer segments
 Automatically group together items based on co-occurrence in user sessions
 Automatic creation of concept or functional hierarchies for the site
Classification / Prediction (supervised)
 categorize pages or items into topics in a concept hierarchy
 classify users into behavioral groups based on their navigation or purchase histories
(e.g., browser, likely to purchase, loyal customer, etc.)
 predict a user’s interest in an item based on that user’s profiles and those of other
similar users
 Predict the lifetime value of a customer based on transaction history and navigation behavior
Common Web Usage Mining Tasks
Association Rules
 Associating presence of a set of items with other sets of items
 X → Y, where X and Y are sets of items
 Support of the itemset X ∪ Y: Pr(X,Y); Confidence of the rule: Pr(Y|X)
 Examples:
 30% of users who accessed the special-offers page also placed an online order in /products/software/
 Customers who bought The Da Vinci Code and Holy Blood, Holy Grail were 65% likely to also purchase the Harry Potter and the Goblet of Fire DVD
Sequential Patterns / Path Analysis
 Finding common sequences of events/items appearing frequently in transactions
 General form: “x% of the time, when A and B appear in a transaction together, C appears
within z transactions (alternatively, within t time units)”
 15% of visitors had the following common pattern in their navigation path during a session: home → * → software → * → shopping cart → checkout
Example: Association Analysis for Ecommerce

Product: Fully Reversible Mats
Association: Egyptian Cotton Towels
Confidence: 41%
Lift: 456
 Confidence:
 41% of customers who purchased Fully Reversible Mats also purchased Egyptian Cotton Towels
 Lift:
 People who purchased Fully Reversible Mats were 456 times more likely to purchase the Egyptian Cotton Towels compared to the general population (see the sketch below)
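A minimal sketch of computing support, confidence, and lift from raw transactions, using the Pr-based definitions given earlier; the toy transactions are illustrative:

def rule_stats(transactions: list, x: str, y: str):
    n = len(transactions)
    n_x = sum(1 for t in transactions if x in t)
    n_y = sum(1 for t in transactions if y in t)
    n_xy = sum(1 for t in transactions if x in t and y in t)
    support = n_xy / n             # Pr(X,Y)
    confidence = n_xy / n_x        # Pr(Y|X)
    lift = confidence / (n_y / n)  # Pr(Y|X) / Pr(Y): >1 means positive association
    return support, confidence, lift

transactions = [{"mats", "towels"}, {"mats"}, {"towels", "soap"}, {"soap"}]
print(rule_stats(transactions, "mats", "towels"))  # (0.25, 0.5, 1.0)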
Example: Association Rules for
Personalized Recommendations
Profile Aggregation Based on
Clustering Transactions (PACT)
 Input
 set of relevant pageviews in the preprocessed log: P = {p1, p2, …, pn}
 set of user transactions: T = {t1, t2, …, tm}
 each transaction is a pageview vector: t = ⟨w(p1,t), w(p2,t), …, w(pn,t)⟩
 Cluster transactions (e.g., using k-means)
 each cluster contains a set of transaction vectors
 for each cluster, compute the centroid as the cluster representative: c = ⟨u1^c, u2^c, …, un^c⟩
 Aggregate Usage Profiles
 a set of pageview-weight pairs: for transaction cluster C, select each pageview pi such that ui^c (in the cluster centroid) is greater than a pre-specified threshold (a sketch follows)
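A minimal PACT sketch using scikit-learn's k-means; the transaction data, number of clusters, and 0.5 threshold are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans

pageviews = ["A", "B", "C", "D", "E", "F"]
# rows are transaction vectors t = <w(p1,t), ..., w(pn,t)> (binary weights here)
T = np.array([
    [1, 1, 0, 0, 0, 1],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 1, 1, 0, 0],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(T)
threshold = 0.5
for c, centroid in enumerate(km.cluster_centers_):
    # aggregate usage profile: pageview-weight pairs whose centroid weight ui^c
    # exceeds the pre-specified threshold
    profile = {p: round(w, 2) for p, w in zip(pageviews, centroid) if w > threshold}
    print(f"Profile {c}: {profile}")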
Characterizing User Segments via Clustering

Original session/user data (binary visit indicators):

        A.html  B.html  C.html  D.html  E.html  F.html
user0     1       1       0       0       0       1
user1     0       0       1       1       0       0
user2     1       0       0       1       1       0
user3     1       1       0       0       0       1
user4     0       0       1       1       0       0
user5     1       0       0       1       1       0
user6     1       1       0       0       0       1
user7     0       0       1       1       0       0
user8     1       0       1       1       1       0
user9     0       1       1       0       0       1

Result of clustering (sessions regrouped by cluster):

Cluster 0: user1, user4, user7
Cluster 1: user0, user3, user6, user9
Cluster 2: user2, user5, user8

Cluster centroids (pageviews with their mean weight in each cluster):

Cluster 0 (Cluster Size = 3)
-------------------------------------
1.00  C.html
1.00  D.html

Cluster 1 (Cluster Size = 4)
-------------------------------------
1.00  B.html
1.00  F.html
0.75  A.html
0.25  C.html

Cluster 2 (Cluster Size = 3)
-------------------------------------
1.00  A.html
1.00  D.html
1.00  E.html
0.33  C.html

Given an active session A → B, the best matching cluster is Profile 1. This may result in a recommendation for page F.html, since it appears with high weight in that cluster.
Example: Clustering User Transactions
 Transaction Clusters:
 Clustering similar user transactions and using the centroid of each cluster as a usage profile (a representative for a user segment)
Sample cluster centroid from the CS dept. Web site (cluster size = 330):

Support  URL                                                        Pageview Description
1.00     /courses/syllabus.asp?course=45096-303&q=3&y=2002&id=290   SE 450 Object-Oriented Development class syllabus
0.97     /people/facultyinfo.asp?id=290                             Web page of the lecturer who taught the above course
0.88     /programs/                                                 Current Degree Descriptions 2002
0.85     /programs/courses.asp?depcode=96&deptmne=se&courseid=450   SE 450 course description in SE program
0.82     /programs/2002/gradds2002.asp                              M.S. in Distributed Systems program description
Example: Collaborative Filtering
 Popular Recommendation Technology
 Recommend items to users by finding other users with similar tastes or
interests
 Compare a target user’s profile (typically ratings on various items) to the
profiles of other users in the database with ratings for some common items
 Use these “nearest neighbors” to predict a rating by the target user on an
unseen item
 Collaborative recommendation is one example of using data mining for
automatic personalization
Source: J. Riedl, “Why Does KDD Care About Personalization?”
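A minimal nearest-neighbor sketch of the idea; the ratings, similarity measure (simple inverse distance), and k are illustrative assumptions rather than a specific published algorithm:

def predict_rating(target: dict, others: dict, item: str, k: int = 2) -> float:
    # target: the target user's ratings; others: user -> ratings dict
    neighbors = []
    for ratings in others.values():
        if item not in ratings:
            continue
        common = set(target) & set(ratings)  # commonly rated items
        if not common:
            continue
        dist = sum((target[i] - ratings[i]) ** 2 for i in common) ** 0.5
        neighbors.append((1.0 / (1.0 + dist), ratings[item]))
    top = sorted(neighbors, reverse=True)[:k]  # the "nearest neighbors"
    if not top:
        return 0.0
    # similarity-weighted average of the neighbors' ratings on the unseen item
    return sum(sim * r for sim, r in top) / sum(sim for sim, _ in top)

others = {"u1": {"A": 5, "B": 4, "C": 5}, "u2": {"A": 1, "B": 2, "C": 1}}
print(predict_rating({"A": 5, "B": 5}, others, "C"))  # ~4.0, pulled toward similar user u1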
Web Mining Approach to
Personalization
 Basic Idea
 generate aggregate user models (usage profiles) by discovering user access patterns
through Web usage mining (offline process)
 Clustering user transactions
 Clustering items / pageviews
 Association rule mining
 Sequential pattern discovery
 match a user’s active session against the discovered models to provide dynamic
content (online process)
 Advantages
 no explicit user ratings or interaction with users
 helps preserve user privacy, by making effective use of anonymous data
 enhances the effectiveness and scalability of collaborative filtering
 more accurate and broader recommendations than content-only approaches
Web Personalization
 The General Problem
 dynamically serve customized content (pages, products, etc.) to users based on
their profiles, preferences, or expected interests
 as we have seen, many of the data mining approaches that allow us to learn aggregate user models can be used for personalization or recommendation
Real-Time Recommendation Engine
 Keep track of users’ navigational history through the site
 a fixed-size sliding window over the active session to capture the current
user’s “short-term” history depth
 Match current user’s activity against the discovered profiles
 profiles either can be based on aggregate usage profiles, or are obtained
directly from association rules or sequential patterns
 Dynamically generated recommendations are added to the
returned page
 each pageview can be assigned a recommendation score based on
 matching score to user profiles (e.g., aggregate usage profiles)
 “information value” of the pageview based on domain knowledge (e.g., link
distance of the candidate recommendation to the active session)
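A minimal sketch of matching a sliding session window against aggregate usage profiles; the window size, overlap-based matching score, and profile data are illustrative assumptions:

from collections import deque

WINDOW = 3                             # "short-term" history depth
active_session = deque(maxlen=WINDOW)  # fixed-size sliding window over the session

def visit(page: str) -> None:
    active_session.append(page)

def recommend(profiles: dict, top_n: int = 3) -> list:
    # profiles: name -> {pageview: weight} aggregate usage profiles
    seen = set(active_session)
    scores = {}
    for profile in profiles.values():
        # matching score: fraction of the window covered by this profile
        match = len(seen & set(profile)) / len(seen) if seen else 0.0
        for page, weight in profile.items():
            if page not in seen:  # only recommend pages not just visited
                scores[page] = max(scores.get(page, 0.0), match * weight)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

visit("A.html"); visit("B.html")
profiles = {"Profile 1": {"A.html": 0.75, "B.html": 1.0, "F.html": 1.0}}
print(recommend(profiles))  # ['F.html'], as in the earlier clustering example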
Problems with Web Usage Mining
 New item problem
 Patterns will not capture new items recently added
 Bad for dynamic Web sites
 Poor machine interpretability
 Hard to generalize and reason about patterns
 No domain knowledge used to enhance results
 E.g., knowing a user is interested in a program, we could recommend the prerequisites, core courses, or popular courses in that program to the user
 Poor insight into the patterns themselves
 The nature of the relationships among items or users in a pattern is not directly
available
Solution: Integrate Semantic Knowledge
with Web Usage Mining
 Information Retrieval/Extraction Approach
 Represent semantic knowledge in pageviews as keyword vectors
 Keywords extracted from text or meta-data
 Text mining can be used to capture higher-level concepts or associations
among concepts
 Cannot capture deeper relationships among objects based on their inherent
properties or attributes
 Ontology-based approach
 Represent domain knowledge using relational model or ontology
representation languages
 Process Web usage data with the structured domain knowledge
 Requires the extraction of ontology instances from Web pages
 Challenge: performing underlying mining operations on structured objects
(e.g., computing similarities or performing aggregations)