Download Web Mining

Document related concepts
no text concepts found
Transcript
Web Mining : A Bird’s Eye View
Sanjay Kumar Madria
Department of Computer Science
University of Missouri-Rolla, MO 65401
[email protected]
May 22, 2017
Web Mining
1
Web Mining
• Web mining - data mining techniques to
automatically discover and extract information
from Web documents/services (Etzioni, 1996).
• Web mining research integrates work from
several research communities (Kosala and
Blockeel, July 2000), such as:
• Database (DB)
• Information retrieval (IR)
• The sub-areas of machine learning (ML)
• Natural language processing (NLP)
Mining the World-Wide Web
• WWW is huge, widely distributed, global
information source for
– Information services: news, advertisements,
consumer information, financial management,
education, government, e-commerce, etc.
– Hyper-link information
– Access and usage information
– Web Site contents and Organization
Mining the World-Wide Web
• Growing and changing very rapidly
– Broad diversity of user communities
• Only a small portion of the information on the
Web is truly relevant or useful to Web users
– How to find high-quality Web pages on a
specified topic?
• WWW provides rich sources for data mining
Challenges on WWW Interactions
• Finding Relevant Information
• Creating knowledge from Information
available
• Personalization of the information
• Learning about customers / individual users
Web Mining can play an important
Role!
Web Mining: more challenging
• Searches for
– Web access patterns
– Web structures
– Regularity and dynamics of Web contents
• Problems
– The “abundance” problem
– Limited coverage of the Web: hidden Web sources,
majority of data in DBMS
– Limited query interface based on keyword-oriented
search
– Limited customization to individual users
– Dynamic and semistructured
Web Mining : Subtasks
• Resource Finding
– Task of retrieving intended web-documents
• Information Selection & Pre-processing
– Automatic selection and pre-processing of specific
information from retrieved web resources
• Generalization
– Automatic Discovery of patterns in web sites
• Analysis
– Validation and / or interpretation of mined
patterns
Web Mining Taxonomy
Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
Web Content Mining
• Discovery of useful information from web
contents / data / documents
– Web data contents: text, image, audio, video,
metadata and hyperlinks.
• Information Retrieval View ( Structured +
Semi-Structured)
– Assist / Improve information finding
– Filtering Information to users on user profiles
• Database View
– Model Data on the web
– Integrate them for more sophisticated queries
Issues in Web Content Mining
• Developing intelligent tools for IR
- Finding keywords and key phrases
- Discovering grammatical rules and
collocations
- Hypertext classification/categorization
- Extracting key phrases from text documents
- Learning extraction models/rules
- Hierarchical clustering
- Predicting (words) relationship
Cont….
• Developing Web query systems
– WebOQL, XML-QL
• Mining multimedia data
- Mining image from satellite (Fayyad, et al.
1996)
- Mining image to identify small volcanoes on
Venus (Smyth, et al 1996) .
Web Structure Mining
• Discover the link structure of the hyperlinks at the
inter-document level to generate a structural
summary of the Website and its Web pages.
– Direction 1: based on the hyperlinks,
categorizing the Web pages and generated
information.
– Direction 2: discovering the structure of Web
document itself.
– Direction 3: discovering the nature of the
hierarchy or network of hyperlinks in the
Website of a particular domain.
Web Structure Mining
• Finding authoritative Web pages
– Retrieving pages that are not only relevant,
but also of high quality, or authoritative on
the topic
• Hyperlinks can infer the notion of authority
– The Web consists not only of pages, but also
of hyperlinks pointing from one page to
another
– These hyperlinks contain an enormous
amount of latent human annotation
– A hyperlink pointing to another Web page can be
considered as the author's endorsement of the
other page
Web Structure Mining
• Web page categorization (Chakrabarti, et al., 1998)
• Discovering micro communities on the web
- Example: Clever system (Chakrabarti, et al., 1999),
Google (Brin and Page, 1998)
• Schema discovery in semistructured environments
Web Usage Mining
• Web usage mining also known as
Web log mining
– mining techniques to discover interesting
usage patterns from the secondary data
derived from the interactions of the users
while surfing the web
Web Usage Mining
• Applications
– Target potential customers for electronic
commerce
– Enhance the quality and delivery of Internet
information services to the end user
– Improve Web server system performance
– Identify potential prime advertisement
locations
– Facilitates personalization/adaptive sites
– Improve site design
– Fraud/intrusion detection
– Predict user’s actions (allows prefetching)
Problems with Web Logs
• Identifying users
– Clients may have multiple streams
– Clients may access web from multiple hosts
– Proxy servers: many clients/one address
– Proxy servers: one client/many addresses
• Data not in log
– POST data (i.e., CGI request) not recorded
– Cookie data stored elsewhere
Cont…
• Missing data
– Pages may be cached
– Referring page requires client cooperation
– When does a session end?
– Use of forward and backward pointers
• Typically a 30-minute timeout is used
• Web content may be dynamic
– May not be able to reconstruct what the user saw
• Use of spiders and automated agents – automatically
request web pages
Cont…
• Like most data mining tasks, web log
mining requires preprocessing
– To identify users
– To match sessions to other data
– To fill in missing data
– Essentially, to reconstruct the click stream
Log Data - Simple Analysis
• Statistical analysis of users
– Length of path
– Viewing time
– Number of page views
• Statistical analysis of site
– Most common pages viewed
– Most common invalid URL
Web Log – Data Mining Applications
• Association rules
– Find pages that are often viewed together
• Clustering
– Cluster users based on browsing patterns
– Cluster pages based on content
• Classification
– Relate user attributes to patterns
Web Logs
• Web servers have the ability to log all
requests
• Web server log formats:
– Most use the Common Log Format (CLF)
– New, Extended Log Format allows
configuration of log file
• Generate vast amounts of data
Common Log Format
• Remotehost: browser hostname or IP #
• Remote log name of user (almost always "-", meaning "unknown")
• Authuser: authenticated username
• Date: date and time of the request
• "Request": exact request lines from the client
• Status: the HTTP status code returned
• Bytes: the content-length of the response
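The fields above can be pulled out of a raw log line with a small parser. The following is an illustrative sketch, not part of the slides; the regular expression and the sample entry are assumptions for the example.

```python
import re

# Hedged sketch: parse one Common Log Format line into named fields.
# The pattern and the sample entry below are assumptions for illustration.
CLF = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_clf(line):
    """Return a dict of CLF fields, or None if the line does not match."""
    m = CLF.match(line)
    return m.groupdict() if m else None

entry = parse_clf(
    '128.101.228.20 - - [10/Nov/1999:10:16:39 -0600] "GET / HTTP/1.0" 200 3042'
)
print(entry)
```

Real server logs vary (the Extended/combined format adds referrer and agent fields), so a production parser would need to handle those variants as well.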
Server Logs
Fields
• Client IP: 128.101.228.20
• Authenticated User ID: -
• Time/Date: [10/Nov/1999:10:16:39 -0600]
• Request: "GET / HTTP/1.0"
• Status: 200
• Bytes:
• Referrer: "-"
• Agent: "Mozilla/4.61 [en] (WinNT; I)"
Web Usage Mining
• Commonly used approaches (Borges and
Levene, 1999)
- Maps the log data into relational tables before
an adapted data mining technique is performed.
- Uses the log data directly by utilizing special
pre-processing techniques.
• Typical problems
- Distinguishing among unique users, server
sessions, episodes, etc. in the presence of
caching and proxy servers (McCallum, et al.,
2000; Srivastava, et al., 2000).
Request
• Method: GET
– Other common methods are POST and HEAD
• URI: /
– This is the file that is being accessed. When a
directory is specified, it is up to the server to
decide what to return. Usually, it will be the file
named "index.html" or "home.html"
• Protocol: HTTP/1.0
Status
• Status codes are defined by the HTTP
protocol.
• Common codes include:
– 200: OK
– 3xx: Some sort of Redirection
– 4xx: Some sort of Client Error
– 5xx: Some sort of Server Error
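During log cleaning, these classes are often used to filter requests. A minimal sketch (the function name is an assumption, not an established API):

```python
# Hypothetical helper: map an HTTP status code to the class listed above.
def status_class(code):
    families = {2: "Success", 3: "Redirection",
                4: "Client Error", 5: "Server Error"}
    # Integer division by 100 yields the class digit (e.g., 404 -> 4).
    return families.get(code // 100, "Other")

print(status_class(200), status_class(302), status_class(404), status_class(503))
```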
Web Mining Taxonomy
Web Mining
• Web Content Mining
– Web Page Content Mining
– Search Result Mining
• Web Structure Mining
• Web Usage Mining
– General Access Pattern Tracking
– Customized Usage Tracking
Mining the World Wide Web
Web Content Mining – Web Page Content Mining
Web Page Summarization
• WebOQL (Mendelzon et al., 1998): Web structuring query
languages; can identify information within given web pages
• (Etzioni et al., 1997): uses heuristics to distinguish
personal home pages from other web pages
• ShopBot (Etzioni et al., 1997): looks for product prices
within web pages
Mining the World Wide Web
Web Content Mining – Search Result Mining
Search Engine Result Summarization
• Clustering search results (Leouski and Croft, 1996;
Zamir and Etzioni, 1997): categorizes documents using
phrases in titles and snippets
Mining the World Wide Web
Web Structure Mining
Using links
• PageRank (Brin et al., 1998)
• CLEVER (Chakrabarti et al., 1998)
• Use interconnections between web pages to give weight to pages
Using generalization
• MLDB (1994): uses a multi-level database representation of
the Web; counters (popularity) and link lists are used for
capturing structure
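The link-weighting idea behind PageRank can be illustrated with a few power-iteration steps. This is a toy sketch, not the published algorithm's exact formulation; the graph, damping factor, and iteration count are assumptions.

```python
# Toy power-iteration sketch of PageRank-style link weighting.
# links maps each page to the pages it points to (assumed example graph).
def pagerank(links, damping=0.85, iters=50):
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        # Every page keeps a baseline share, plus weight from in-links.
        new = {p: (1 - damping) / len(pages) for p in pages}
        for q, outs in links.items():
            for p in outs:
                new[p] += damping * rank[q] / len(outs)
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(ranks)
```

Here C accumulates the most weight because both A and B link to it, which is exactly the "latent human annotation" intuition from the earlier slide.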
Mining the World Wide Web
Web Usage Mining – General Access Pattern Tracking
• Web Log Mining (Zaïane, Xin and Han, 1998): uses KDD
techniques to understand general access patterns and trends;
can shed light on better structure and grouping of resource
providers
Mining the World Wide Web
Web Usage Mining – Customized Usage Tracking
• Adaptive Sites (Perkowitz and Etzioni, 1997): analyzes
access patterns of each user at a time; the web site
restructures itself automatically by learning from user
access patterns
Web Content Mining
• Agent-based Approaches:
– Intelligent Search Agents
– Information Filtering/Categorization
– Personalized Web Agents
• Database Approaches:
– Multilevel Databases
– Web Query Systems
Intelligent Search Agents
• Locating documents and services on the Web:
– WebCrawler, Alta Vista
(http://www.altavista.com): scan millions of Web
documents and create index of words (too many
irrelevant, outdated responses)
– MetaCrawler: mines robot-created indices
• Retrieve product information from a variety of
vendor sites using only general information about
the product domain:
– ShopBot
Intelligent Search Agents (Cont’d)
• Rely either on pre-specified domain information
about particular types of documents, or on hard
coded models of the information sources to
retrieve and interpret documents:
– Harvest
– FAQ-Finder
– Information Manifold
– OCCAM
– Parasite
• Learn models of various information sources and
translates these into its own concept hierarchy:
– ILA (Internet Learning Agent)
Information Filtering/Categorization
• Using various information retrieval techniques
and characteristics of open hypertext Web
documents to automatically retrieve, filter,
and categorize them.
– HyPursuit: uses semantic information embedded
in link structures and document content to create
cluster hierarchies of hypertext documents, and
structure an information space
– BO (Bookmark Organizer): combines
hierarchical clustering techniques and user
interaction to organize a collection of Web
documents based on conceptual information
Personalized Web Agents
• This category of Web agents learns user
preferences and discovers Web information
sources based on these preferences, and
those of other individuals with similar
interests (using collaborative filtering)
– WebWatcher
– PAINT
– Syskill&Webert
– GroupLens
– Firefly
– others
Multiple Layered Web Architecture
Layer_n: more generalized descriptions
...
Layer_1: generalized descriptions
Layer_0: the primitive Web information base
Multilevel Databases
• At the higher levels, meta data or
generalizations are
– extracted from lower levels
– organized in structured collections, i.e. relational
or object-oriented database.
• At the lowest level, semi-structured
information is
– stored in various Web repositories, such as
hypertext documents
Multilevel Databases (Cont’d)
• (Han, et. al.):
– use a multi-layered database where each layer is
obtained via generalization and transformation
operations performed on the lower layers
• (Kholsa, et al.):
– propose the creation and maintenance of meta-databases
at each information-providing domain and the use of a
global schema for the meta-database
Multilevel Databases (Cont’d)
• (King, et. al.):
– propose the incremental integration of a portion of the
schema from each information source, rather than
relying on a global heterogeneous database schema
• The ARANEUS system:
– extracts relevant information from hypertext documents
and integrates these into higher-level derived Web
Hypertexts which are generalizations of the notion of
database views
Multi-Layered Database (MLDB)
• A multiple layered database model
– based on semi-structured data hypothesis
– queried by NetQL using a syntax similar to the relational
language SQL
• Layer-0:
– An unstructured, massive, primitive, diverse global
information base.
• Layer-1:
– A relatively structured, descriptor-like, massive, distributed
database by data analysis, transformation and generalization
techniques.
– Tools to be developed for descriptor extraction.
• Higher-layers:
– Further generalization to form progressively smaller, better
structured, and less remote databases for efficient browsing,
retrieval, and information discovery.
Three major components in MLDB
• S (a database schema):
– outlines the overall database structure of the global MLDB
– presents a route map for data and meta-data (i.e., schema)
browsing
– describes how the generalization is performed
• H (a set of concept hierarchies):
– provides a set of concept hierarchies that assist the
system to generalize lower-layer information to higher
layers and map queries to appropriate concept layers for
processing
• D (a set of database relations):
– the whole global information base at the primitive
information level (i.e., layer-0)
– the generalized database relations at the nonprimitive
layers
The General Architecture of WebLogMiner (a Global MLDB)
[Diagram: Sites 1-3 feed Resource Discovery (MLDB); concept
hierarchies and generalized data form the higher layers;
Knowledge Discovery (WLM) yields characteristic rules,
discriminant rules, and association rules.]
Techniques for Web usage mining
• Construct multidimensional view on the Weblog
database
– Perform multidimensional OLAP analysis to find the top
N users, top N accessed Web pages, most frequently
accessed time periods, etc.
• Perform data mining on Weblog records
– Find association patterns, sequential patterns, and
trends of Web accessing
– May need additional information, e.g., user browsing
sequences of the Web pages in the Web server buffer
• Conduct studies to
– Analyze system performance, improve system design by
Web caching, Web page prefetching, and Web page swapping
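The "top N" style of OLAP analysis above can be sketched with plain Python counters rather than a multidimensional cube; the log records here are invented for the example.

```python
from collections import Counter

# Minimal sketch of "top N users / top N pages" over (host, page) records.
# The records are assumptions standing in for parsed Web log entries.
records = [
    ("128.101.228.20", "/index.html"),
    ("128.101.228.20", "/products.html"),
    ("10.0.0.7", "/index.html"),
    ("128.101.228.20", "/index.html"),
]

top_users = Counter(host for host, _ in records).most_common(2)
top_pages = Counter(page for _, page in records).most_common(2)
print(top_users)
print(top_pages)
```

A real deployment would group along further dimensions (time period, referrer, status) before aggregating, which is where a data cube pays off.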
Web Usage Mining - Phases
• Three distinctive phases: preprocessing, pattern
discovery, and pattern analysis
• Preprocessing - the process of converting the raw data
into the data abstraction necessary for applying the
data mining algorithm
• Resources: server-side, client-side, proxy
servers, or database.
• Raw data: Web usage logs, Web page
descriptions, Web site topology, user registries,
and questionnaire.
• Conversion: Content converting, Structure
converting, Usage converting
• User: The principal using a client to
interactively retrieve and render resources or
resource manifestations.
• Page view: Visual rendering of a Web page in
a specific client environment at a specific point
of time
• Click stream: a sequential series of page view
requests
• User session: a delimited set of user clicks
(click stream) across one or more Web
servers.
• Server session (visit): a collection of user clicks
to a single Web server during a user session.
• Episode: a subset of related user clicks that
occur within a user session.
• Content Preprocessing - the process of
converting text, image, scripts and other files
into the forms that can be used by the usage
mining.
• Structure Preprocessing - the structure of a
Website is formed by the hyperlinks between
page views; structure preprocessing can be
done by parsing and reformatting the
information.
• Usage Preprocessing - the most difficult task in
the usage mining process; data cleaning
techniques are applied to eliminate the impact
of irrelevant items on the analysis result.
Pattern Discovery
• Pattern discovery is the key component of Web usage
mining; it brings together algorithms and techniques
from data mining, machine learning, statistics, and
pattern recognition.
• Separate subsections: statistical analysis,
association rules, clustering, classification,
sequential patterns, dependency modeling.
• Statistical Analysis - analysts may perform
different kinds of descriptive statistical
analyses based on different variables when
analyzing the session file; a powerful tool
for extracting knowledge about visitors to a
Web site.
• Association Rules - refer to sets of pages
that are accessed together with a support
value exceeding some specified threshold.
• Clustering: a technique to group together
users or data items (pages) with similar
characteristics.
– It can facilitate the development and
execution of future marketing strategies.
• Classification: a technique to map a data
item into one of several predefined classes,
which helps to establish a profile of users
belonging to a particular class or category.
Pattern Analysis
• Pattern analysis is the final stage of Web
usage mining.
• It eliminates the irrelevant rules or patterns
and extracts the interesting rules or patterns
from the output of the pattern discovery
process.
• Analysis methodologies and tools: query
mechanisms like SQL, OLAP, visualization, etc.
WUM – Pre-Processing
– Data Cleaning: removes log entries that are not needed
for the mining process
– Data Integration: synchronizes data from multiple server
logs, metadata
– User Identification: associates page references with
different users
– Session/Episode Identification: groups a user's page
references into user sessions
– Page View Identification
– Path Completion: fills in page references missing due to
browser and proxy caching
WUM – Issues in User Session Identification
– A single IP address is used by many users
(different users → proxy server → Web server)
– Different IP addresses appear in a single session
(single user → ISP server → Web server)
– Missing cache hits in the server logs
User and Session Identification Issues
• Distinguish among different users to a site
• Reconstruct the activities of the users within
the site
• Proxy servers and anonymizers
• Rotating IP addresses connections through
ISPs
• Missing references due to caching
• Inability of servers to distinguish among
different visits
WUM – Solutions
Remote Agent
– A remote agent is implemented as a Java applet
– It is loaded into the client only once, when the first
page is accessed
– Subsequent requests are captured and sent back to the server
Modified Browser
– The source code of an existing browser can be modified
to gain user-specific data at the client side
Dynamic page rewriting
– When the user first submits a request, the server returns
the requested page rewritten to include a session-specific ID
– Each subsequent request supplies this ID to the server
Heuristics
– Use a set of assumptions to identify user sessions and
find the missing cache hits in the server log
WUM – Heuristics
The session identification heuristics:
– Timeout: if the time between page requests exceeds a
certain limit, it is assumed that the user is starting a
new session
– IP/Agent: each different agent type for an IP address
represents a different session
– Referring page: if the referring page for a request is
not part of an open session, it is assumed that the
request is coming from a different session
– Same IP-Agent/different sessions (Closest): assigns the
request to the session that is closest to the referring
page at the time of the request
– Same IP-Agent/different sessions (Recent): when multiple
sessions are the same distance from a page request,
assigns the request to the session with the most recent
referrer access in terms of time
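The timeout and IP/Agent heuristics can be combined in a small sessionizer. This is a sketch under the 30-minute timeout assumption; the request tuples are invented for the example.

```python
from datetime import datetime, timedelta

# Sketch of the timeout + IP/Agent heuristics: requests from the same
# (IP, agent) pair belong to one session until a 30-minute gap appears.
TIMEOUT = timedelta(minutes=30)

def sessionize(requests):
    """requests: list of (ip, agent, timestamp, url), sorted by timestamp."""
    open_session = {}   # (ip, agent) -> list of pages in the open session
    last_seen = {}      # (ip, agent) -> timestamp of the latest request
    result = []
    for ip, agent, ts, url in requests:
        key = (ip, agent)
        if key not in last_seen or ts - last_seen[key] > TIMEOUT:
            result.append([])            # start a new session
            open_session[key] = result[-1]
        open_session[key].append(url)
        last_seen[key] = ts
    return result

t0 = datetime(1999, 11, 10, 10, 0)
reqs = [
    ("1.2.3.4", "Mozilla", t0, "/a"),
    ("1.2.3.4", "Mozilla", t0 + timedelta(minutes=5), "/b"),
    ("1.2.3.4", "Mozilla", t0 + timedelta(minutes=50), "/c"),  # exceeds timeout
]
print(sessionize(reqs))
```

The referring-page, Closest, and Recent heuristics would refine this further when the same IP/agent pair carries several interleaved sessions.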
Cont.
The path completion heuristics:
– If the referring page of a request is not part of the
previous page of that session, the user must have
accessed a cached page
– The "back" button method is used to reach a cached page
– A constant view time is assigned to each cached page
WUM – Association Rule Generation
Discovers the correlations between pages that are most
often referenced together in a single server session
• Provides information such as:
– What sets of pages are frequently accessed together by Web users?
– What page will be fetched next?
– What paths are frequently accessed by Web users?
Association rule: A → B [Support = 60%, Confidence = 80%]
Example: "50% of visitors who accessed URLs /infor-f.html and
labo/infos.html also visited situation.html"
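Support and confidence for such a rule can be computed directly from session data. A toy sketch with invented sessions (the URLs echo the example above but the counts are made up):

```python
# Toy support/confidence computation for a page-set rule A -> B.
# The sessions (sets of pages visited) are assumptions for illustration.
sessions = [
    {"/infor-f.html", "/labo/infos.html", "/situation.html"},
    {"/infor-f.html", "/labo/infos.html"},
    {"/infor-f.html", "/situation.html"},
    {"/situation.html"},
]

def support(itemset):
    """Fraction of sessions containing every page in itemset."""
    return sum(1 for s in sessions if itemset <= s) / len(sessions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) estimated over the sessions."""
    return support(antecedent | consequent) / support(antecedent)

a = {"/infor-f.html", "/labo/infos.html"}
b = {"/situation.html"}
print(support(a | b), confidence(a, b))
```

Algorithms like Apriori (used by WEBMINER later in the deck) find all itemsets whose support exceeds a threshold without enumerating every candidate set.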
Associations & Correlations
• Page associations from usage data
– User sessions
– User transactions
• Page associations from content data
– similarity based on content analysis
• Page associations based on structure
– link connectivity between pages
• ==> Obtain frequent itemsets
Examples:
– 60% of clients who accessed /products/, also accessed
/products/software/webminer.htm.
– 30% of clients who accessed /specialoffer.html, placed an
online order in /products/software/.
– (Example from IBM official Olympics Site)
{Badminton, Diving} ===> {Table Tennis} (a = 69.7%, s = 0.35%)
WUM – Clustering
• Groups together a set of items having similar characteristics
• User clusters
– Discover groups of users exhibiting similar browsing patterns
– Page recommendation: a user's partial session is classified
into a single cluster, and the links contained in this
cluster are recommended
Cont..
• Page clusters
– Discover groups of pages having related content
– Usage-based frequent pages
– Page recommendation: links are presented based on how
often URL references occur together across user sessions
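One very rough way to realize usage-based clustering is to group sessions by page-set overlap. This sketch uses Jaccard similarity with a single-pass assignment; the threshold and sessions are assumptions, and real systems use stronger methods (k-means, hierarchical clustering).

```python
# Rough sketch: cluster sessions whose page sets overlap enough (Jaccard).
def jaccard(a, b):
    return len(a & b) / len(a | b)

def cluster(sessions, threshold=0.5):
    clusters = []
    for s in sessions:
        for c in clusters:
            # Join the first cluster with a sufficiently similar member.
            if any(jaccard(s, member) >= threshold for member in c):
                c.append(s)
                break
        else:
            clusters.append([s])       # no match: start a new cluster
    return clusters

sessions = [
    {"/a", "/b", "/c"},
    {"/a", "/b"},       # similar browsing pattern to the first
    {"/x", "/y"},       # a different browsing pattern
]
print(len(cluster(sessions)))
```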
Website Usage Analysis
• Why develop a Website usage/utilization analysis tool?
• Knowledge about how visitors use a Website could
– Prevent disorientation and help designers place important
information/functions exactly where visitors look for
them, and in the way users need
– Help build an adaptive Website server
Clustering and Classification
– Clients who often access /products/software/webminer.html
tend to be from educational institutions.
– Clients who placed an online order for software tend to be
students in the 20-25 age group and live in the United States.
– 75% of clients who download software from
/products/software/demos/ visit between 7:00 and 11:00 pm
on weekends.
Website Usage Analysis
• Discover user navigation patterns in using
Website
- Establish an aggregated log structure as a
preprocessor to reduce the search space before
the actual log mining phase
- Introduce a model for Website usage pattern
discovery by extending the classical mining
model, and establish the processing framework
of this model
Sequential Patterns & Clusters
30% of clients who visited
/products/software/, had done a
search in Yahoo using the keyword
“software” before their visit
60% of clients who placed an online order
for WEBMINER, placed another online
order for software within 15 days
Website Usage Analysis
• The Website client-server architecture facilitates recording
user behavior at every step by
- submitting client-side log files to the server when users
use clear functions or exit windows/modules
• The special design for local and universal back/forward/clear
functions makes the user's navigation pattern clearer for
the designer by
- analyzing local back/forward history and incorporating it
with universal back/forward history
Website Usage Analysis
• What will be included in SUA
1. Identify and collect log data
2. Transfer the data to server-side and save them
in a structure desired for analysis
3. Prepare mined data by establishing a customized
aggregated log tree/frame
4. Use modifications of the typical data mining
methods, particularly an extension of a traditional
sequence discovery algorithm, to mine user
navigation patterns
Website Usage Analysis
• Problems to be considered:
- How to identify the log data when a user goes through an
uninteresting function/module
- What marks the end of a user session?
- Clients connect to the Website through proxy servers
• Differences between Website usage analysis and common Web
usage mining:
- Client-side log files are available
- Log file format (Web log files follow the Common Log
Format specified as a part of the HTTP protocol)
- Log file cleaning/filtering (usually performed in the
preprocessing of Web log mining) is not necessary
Web Usage Mining - Patterns
Discovery Algorithms
• (Chen et. al.) Design algorithms for Path
Traversal Patterns, finding maximal forward
references and large reference sequences.
Path Traversal Patterns
• Procedure for mining traversal patterns:
– (Step 1) Determine maximal forward references from the
original log data (Algorithm MF)
– (Step 2) Determine large reference sequences (i.e., Lk,
k ≥ 1) from the set of maximal forward references
(Algorithms FS and SS)
– (Step 3) Determine maximal reference sequences from large
reference sequences
• Focus on Steps 1 and 2, and devise algorithms for the
efficient determination of large reference sequences
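Step 1 can be illustrated with a short function in the spirit of Algorithm MF: each backward move (revisiting a page already on the current path) ends the current forward reference and rewinds the path. This is a reconstruction for illustration, not the paper's exact pseudocode, and the sample traversal is an assumption.

```python
# Sketch in the spirit of Algorithm MF: extract maximal forward references
# from one traversal sequence. A revisit ends the current forward reference.
def maximal_forward_references(traversal):
    path, result = [], []
    extending = False          # True while the path is growing forward
    for page in traversal:
        if page in path:       # backward reference: rewind to that page
            if extending:
                result.append(list(path))
            path = path[: path.index(page) + 1]
            extending = False
        else:
            path.append(page)
            extending = True
    if extending:              # emit the final forward path, if any
        result.append(path)
    return result

trav = ["A", "B", "C", "D", "C", "B", "E", "G", "H",
        "G", "W", "A", "O", "U", "O", "V"]
print(maximal_forward_references(trav))
```

On this traversal the function yields the forward paths ABCD, ABEGH, ABEGW, AOU, and AOV, which are the inputs Algorithms FS/SS then mine for large reference sequences.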
Determine large reference sequences
• Algorithm FS:
– Utilizes the key ideas of algorithm DHP:
– employs hashing and pruning techniques
– DHP is very efficient for the generation of candidate
itemsets, in particular for the large two-itemsets, thus
greatly improving the performance bottleneck of the
whole process
• Algorithm SS:
– employs hashing and pruning techniques to reduce both
CPU and I/O costs
– by properly utilizing the information in candidate
references in prior passes, is able to avoid database
scans in some passes, thus further reducing the disk I/O
cost
Patterns Analysis Tools
• WebViz [pitkwa94] --- provides appropriate
tools and techniques to understand, visualize,
and interpret access patterns.
• Proposes OLAP techniques such as data cubes
for the purpose of simplifying the analysis of
usage statistics from server access logs.
[dyreua et al]
Patterns Discovery and Analysis Tools
• The emerging tools for user pattern discovery use
sophisticated techniques from AI, data mining,
psychology, and information theory, to mine for
knowledge from collected data:
– (Pirolli et. al.) use information foraging theory to
combine path traversal patterns, Web page typing,
and site topology information to categorize pages
for easier access by users.
(Cont’d)
• WEBMINER :
– introduces a general architecture for Web usage
mining, automatically discovering association rules
and sequential patterns from server access logs.
– proposes an SQL-like query mechanism for
querying the discovered knowledge in the form of
association rules and sequential patterns.
• WebLogMiner
– Web log is filtered to generate a relational
database
– Data mining on web log data cube and web log
database
WEBMINER
• SQL-like Query
• A framework for Web mining, the applications
of data mining and knowledge discovery
techniques, association rules and sequential
patterns, to Web data:
– Association rules: using apriori algorithm
• 40% of clients who accessed the Web page with URL
/company/products/product1.html, also accessed
/company/products/product2.html
– Sequential patterns: using modified apriori
algorithm
• 60% of clients who placed an online order in
/company/products/product1.html, also placed an
online order in /company/products/product4.html
within 15 days
WebLogMiner
• Database construction from server log
file:
– data cleaning
– data transformation
• Multi-dimensional web log data cube
construction and manipulation
• Data mining on web log data cube and
web log database
Mining the World-Wide Web
• Design of a Web Log Miner
– Web log is filtered to generate a relational database
– A data cube is generated from the database
– OLAP is used to drill-down and roll-up in the cube
– OLAM is used for mining interesting knowledge
Pipeline: Web log → (1) Data Cleaning → Database →
(2) Data Cube Creation → Data Cube → (3) OLAP →
Sliced and diced cube → (4) Data Mining → Knowledge
R(p) = ε/n + (1 − ε) · Σ_{(q,p)∈G} R(q) / outdegree(q)
Construction of Data Cubes
(http://db.cs.sfu.ca/sections/publication/slides/slides.html)
[Diagram: a data cube with dimensions Amount (0-20K, 20-40K,
40-60K, 60K-, sum), Province (B.C., Prairies, Ontario, sum),
and Discipline (Comp_Method, Database, ..., sum).]
• Each dimension contains a hierarchy of values for one attribute
• A cube cell stores aggregate values, e.g., count, sum, max, etc.
• A "sum" cell stores dimension summation values
• Sparse-cube technology and MOLAP/ROLAP integration
• "Chunk"-based multi-way aggregation and single-pass computation
WebLogMiner Architecture
• Web log is filtered to generate a relational database
• A data cube is generated from the database
• OLAP is used to drill-down and roll-up in the cube
• OLAM is used for mining interesting knowledge
Pipeline: Web log → (1) Data Cleaning → Database →
(2) Data Cube Creation → Data Cube → (3) OLAP →
Sliced and diced cube → (4) Data Mining → Knowledge
WEBSIFT
What is WebSIFT?
• a Web Usage Mining framework that
– performs preprocessing
– performs knowledge discovery
– uses the structure and content information
about a Web site to automatically define a
belief set.
Overview of WebSIFT
• Based on WEBMINER prototype
• Divides the Web Usage Mining process
into three main parts
Overview of WebSIFT
• Input:
– Access
– Referrer and agent
– HTML files
– Optional data (e.g., registration data or
remote agent logs)
Overview of WebSIFT
• Preprocessing:
– uses input data to construct a user session
file
– site files are used to classify pages of a site
• Knowledge discovery phase
– uses existing data mining techniques to
generate rules and patterns.
– generation of general usage stats
Information Filtering
• Links between pages provide evidence for
supporting the belief that those pages are
related.
• Strength of evidence for a set of pages being related is proportional to the strength of the topological connection between the set of pages.
• Based on site content, can also look at content similarity by calculating a “distance” between pages.
Information Filtering
• Uses two different methods to identify
interesting results from a list of
discovered frequent itemsets
Information Filtering
• Method 1:
– declare itemsets that contain pages not directly connected to be interesting
– corresponds to a situation where a belief that a set of pages is related has no domain or existing evidence, but there is mined evidence; called the Beliefs with Mined Evidence algo (BME)
Information Filtering
• Method 2:
– Absence of itemsets is evidence against a belief that pages are related.
– Pages that have individual support above a threshold but are not present together in larger frequent itemsets provide evidence against the pages being related.
– When domain evidence suggests that pages are related, the absence of the frequent itemset can be considered interesting. This is handled by the Beliefs with Contradicting Evidence algo (BCE)
Experimental Evaluation
• Performed on web server of U of MN Dept of Comp
Sci & Eng’g web site
• Log spanned eight days in Feb 1999
• Physical size of log: 19.3 MB
• 102,838 entries
• After preprocessing: 43,158 page views (divided
among 10,609 user sessions)
• Threshold of 0.1% for support used to generate 693
frequent itemsets with maximum set size of six
pages.
• 178 unique pages represented in all the rules.
• BCE and BME algos run on frequent itemsets.
Future work
• Filtering frequent itemsets, sequential patterns and
clusters
• Incorporate probabilities and fuzzy logic into
information filter
• Future works include path completion verification,
page usage determination, application of the pattern
analysis results, etc.
Link Analysis
Link Analysis
• Finding patterns in graphs
– Bibliometrics – finding patterns in citation
graphs
– Sociometry – finding patterns in social
networks
– Collaborative Filtering – finding patterns in
rank(person, item) graph
– Webometrics – finding patterns in web
page links
Web Link Analysis
• Used for
– ordering documents matching a user
query: ranking
– deciding what pages to add to a collection:
crawling
– page categorization
– finding related pages
– finding duplicated web sites
Web as Graph
• Link graph:
– node for each page
– directed edge (u,v) if page u contains a hyperlink
to page v
• Co-citation graph
– node for each page
– undirected edge (u,v) iff there exists a third page w
linking to both u and v
• Assumption:
– link from page A to page B is a recommendation
of page B by A
– If A and B are connected by a link, there is a
higher probability that they are on the same topic
Web structure mining
• HITS (Topic distillation)
• PageRank (Ranking web pages used
by Google)
• Algorithm in Cyber-community
HITS Algorithm
--Topic Distillation on WWW
HITS Method
• Hyperlink Induced Topic Search
• Kleinberg, 1998
• A simple approach by finding hubs and
authorities
• View web as a directed graph
• Assumption: if document A has hyperlink to
document B, then the author of document A
thinks that document B contains valuable
information
Main Ideas
• Concerned with the identification of the
most authoritative, or definitive, Web
pages on a broad-topic
• Focused on only one topic
• Viewing the Web as a graph
• A purely link structure-based
computation, ignoring the textual
content
HITS: Hubs and Authority
• Hub: web page links to a collection of
prominent sites on a common topic
• Authority: a prominent page on a broad topic;
a web page pointed to by many hubs
• Mutual Reinforcing Relationship: a good
authority is a page that is pointed to by many
good hubs, while a good hub is a page that
points to many good authorities
Hub-Authority Relations
[Figure: hubs on the left pointing to authorities on the right; an unrelated page of large in-degree shown apart]
HITS: Two Main Steps
• A sampling component, which constructs a
focused collection of several thousand web
pages likely to be rich in relevant authorities
• A weight-propagation component, which
determines numerical estimates of hub and
authority weights by an iterative procedure
• As a result, pages with the highest weights are
returned as hubs and authorities for the
HITS: Root Set and Base Set
• Using query term to collect a root set (S) of
pages from index-based search engine
(AltaVista)
• Expand root set to base set (T) by including
all pages linked to by pages in root set and all
pages that link to a page in root set (up to a
designated size cut-off)
• Typical base set contains roughly 1000-5000
pages
Step 1: Constructing Subgraph
1.1 Creating a root set (S)
- Given a query string on a broad topic
- Collect the t highest-ranked pages for the query
from a text-based search engine
1.2 Expanding to a base set (T)
- Add the page pointing to a page in root set
- Add the page pointed to by a page in root set
Root Set and Base Set (Cont’d)
[Figure: root set S contained within the expanded base set T]
Step 2:
Computing Hubs and Authorities
2.1 Associating weights
- Authority weight xp
- Hub weight yp
- Set all values to a uniform constant initially
2.2 Updating weights
Updating Authority Weight
x_p = Σ_{q : q→p} y_q
Example: x_p = y_q1 + y_q2 + y_q3
[Figure: pages q1, q2, q3 each linking to page p]
Updating Hub Weight
y_p = Σ_{q : p→q} x_q
Example: y_p = x_q1 + x_q2 + x_q3
[Figure: page p linking to pages q1, q2, q3]
Flowchart
• Initialization: set all x- and y-values to a uniform constant c, e.g., c = 1
• 1st time: update all x-values, then update all y-values
• 2nd time: update all x-values, then update all y-values, and so on
Results
• All x- and y-values converge rapidly so
that termination of the iteration is
guaranteed
• This can be proved mathematically
• Pages with the highest x-values are
viewed as the best authorities, while
pages with the highest y-values are
regarded as the best hubs
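The iteration described above can be sketched in a few lines. This is a minimal, illustrative HITS implementation; the tiny link graph below is hypothetical, and normalization uses the Euclidean norm (one common choice) so the x- and y-values converge:

```python
def hits(graph, iterations=20):
    """HITS weight propagation. graph maps a page to the set of pages it links to."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = {n: 1.0 for n in nodes}    # y-values, initialized to a uniform constant
    auth = {n: 1.0 for n in nodes}   # x-values
    for _ in range(iterations):
        # x_p = sum of y_q over all pages q that link to p
        auth = {p: sum(hub[q] for q in nodes if p in graph.get(q, ())) for p in nodes}
        # y_p = sum of x_q over all pages q that p links to
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in nodes}
        # normalize so the iteration converges to the principal weights
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth

# hypothetical base set: h1-h3 are hub-like pages, a1-a2 are authority-like
graph = {"h1": {"a1", "a2"}, "h2": {"a1", "a2"}, "h3": {"a2"}}
hub, auth = hits(graph)
```

Pages with the highest authority values (here a2, pointed to by all three hubs) are returned as authorities, and pages with the highest hub values as hubs.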
Implementation
• Search engine: AltaVista
• Root set: 200 pages
• Base set: 1000-5000 pages
• Converging speed: very rapid, fewer than 20 iterations
• Running time: about 30 minutes
HITS: Advantages
• Weight computation is an intrinsic feature
from collection of linked pages
• Provides a densely linked community of
related authorities and hubs
• Pure link-based computation once the root
set has been assembled, with no further
regard to query terms
• Provides surprisingly good search result for a
wide range of queries
Drawbacks
• Limit On Narrow Topics
– Not enough authoritative pages
– Frequently returns resources for a
more general topic
– adding a few edges can potentially
change scores considerably
• Topic Drifting
- Appear when hubs discuss multiple
topics
Improved Work
• To improve precision:
- Combining content with link information
- Breaking large hub pages into smaller units
- Computing relevance weights for pages
• To improve speed:
- Building a
Connectivity Server that provides
linkage information for all pages
Web Structure Mining
– Page-Rank Method
– CLEVER Method
– Connectivity-Server Method
1. Page-Rank Method
• Introduced by Brin and Page (1998)
• Mine hyperlink structure of web to produce ‘global’
importance ranking of every web page
• Used in Google Search Engine
• Web search result is returned in the rank order
• Treats links like academic citations
• Assumption: Highly linked pages are more ‘important’
than pages with a few links
• A page has a high rank if the sum of the ranks of its
back-links is high
Page Rank: Computation
• Assume:
– R(u): rank of a web page u
– F_u: set of pages which u points to
– B_u: set of pages that point to u
– N_u: number of links from u
– c: normalization factor
– E(u): vector of web pages as a source of rank
• Page Rank Computation:
R(u) = c · Σ_{v ∈ B_u} R(v)/N_v + c · E(u)
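The computation above can be sketched as a simple power iteration. This is an illustrative implementation, not Google's: the link graph is hypothetical, and it uses the common parameterization where the rank source E(u) is uniform (e/n) and c = 1 − e, with e a small constant:

```python
def page_rank(links, e=0.15, iterations=50):
    """Iterative PageRank. links maps page u to the list of pages u points to.

    Implements R(u) = c * sum_{v in B_u} R(v)/N_v + c*E(u), with a uniform
    rank source E and c = 1 - e (an assumed, common parameterization)."""
    nodes = set(links) | {v for out in links.values() for v in out}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new = {}
        for u in nodes:
            # sum the ranks of the back-links B_u, each divided by out-degree N_v
            incoming = sum(rank[v] / len(out)
                           for v, out in links.items() if u in out)
            new[u] = (1 - e) * incoming + e / n   # e/n plays the role of c*E(u)
        rank = new
    return rank

# hypothetical three-page web: a -> b, c; b -> c; c -> a
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = page_rank(links)
```

On this graph the ranks converge with c ranked highest (it has the most incoming rank), then a, then b; since there are no dangling pages, the ranks sum to 1.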
Page Rank: Implementation
• Stanford WebBase project: complete crawling and indexing system with a current repository of 24 million web pages (old data)
• Store each URL as unique integer and each
hyperlink as integer IDs
• Remove dangling links by iterative procedures
• Make initial assignment of the ranks
• Propagate page ranks in iterative manner
• Upon convergence, add the dangling links
back and recompute the rankings
Page Rank: Results
• Google utilizes a number of factors to rank
the search results:
– proximity, anchor text, page rank
• The benefits of Page Rank are the greatest
for underspecified queries, example: ‘Stanford
University’ query using Page Rank lists the
university home page the first
Page Rank: Advantages
• Global ranking of all web pages – regardless
of their content, based solely on their location
in web graph structure
• Higher quality search results – central,
important, and authoritative web pages are
given preference
• Help find representative pages to display for
a cluster center
• Other applications: traffic estimation, backlink predictor, user navigation, personalized
page rank
• Mining the structure of the web graph is very useful for various information retrieval tasks
CLEVER Method
• CLient–side EigenVector-Enhanced Retrieval
• Developed by a team of IBM researchers at
IBM Almaden Research Centre
• Continued refinements of HITS
• Ranks pages primarily by measuring links
between them
• Basic Principles – Authorities, Hubs
– Good hubs points to good authorities
– Good authorities are referenced by good hubs
Problems Prior to CLEVER
• Textual content that is ignored leads to
problems caused by some features of web:
– HITS returns good resources for more general
topic when query topics are narrowly-focused
– HITS occasionally drifts when hubs discuss
multiple topics
– Usually pages from single Web site take over a
topic and often use same html template therefore
pointing to a single popular site irrelevant to query
topic
CLEVER: Solution
• Replacing the sums of Equation (1) and (2) of
HITS with weighted sums
• Assign to each link a non-negative weight
• Weight depends on the query term and end
point
• Extension 1: Anchor Text
– using text that surrounds hyperlink
definitions (href’s) in Web pages, often
referred as ‘anchor text’
– boost weight enhancements of links that
occur near instances of query terms
CLEVER: Solution (Cont’d)
• Extension 2: Mini Hub Pagelets
– breaking large hubs into smaller units
– treat contiguous subsets of links as mini-hubs or ‘pagelets’
– contiguous sets of links on a hub page are
more focused on single topic than the
entire page
CLEVER: The Process
• Starts by collecting a set of pages
• Gathers all pages of the initial link, plus any pages linking to them
• Ranks results by counting links
• Links have noise; it is not clear which pages are best
• Recalculates scores: pages with the most links are established as most important, and their links transmit more weight
• Repeats the calculation a number of times until the scores are refined
CLEVER: Advantages
Used to populate categories of different
subjects with minimal human assistance
Able to leverage links to fill category with
best pages on web
Can be used to compile large taxonomies of
topics automatically
Emerging new directions: Hypertext
classification, focused crawling, mining
communities
Connectivity Server Method
Server that provides linkage information for
all pages indexed by a search engine
In its base operation, the server accepts a query
consisting of a set of one or more URLs and
returns a list of all pages that point to pages in
the set (parents) and a list of all pages that are
pointed to from pages in the set (children)
In its base operation, it also provides
neighbourhood graph for query set
Acts as underlying infrastructure, supports
search engine applications
What’s Connectivity Server (Cont’d)
[Figure: neighborhood graph]
CONSERV: Web Structure Mining
Finding Authoritative Pages (Search by
topic)
(pages that are high in quality and relevant to
the topic)
Finding Related Pages (Search by URL)
(pages that address same topic as the
original page, not necessarily semantically
identical)
Algorithms include Companion, Cocitation
CONSERV: Finding Related Page
CONSERV: Companion Algorithm
An extension to HITS algorithm
Features:
Exploit not only links but also their order on a
page
Use link weights to reduce the influence of
pages that all reside on one host
Merge nodes that have a large number of
duplicate links
The base graph is structured to exclude
grandparent nodes but include nodes that
share child
Companion Algorithm (Cont’d)
Four steps
1. Build a vicinity graph for u
2. Remove duplicates and near-duplicates
in graph.
3. Compute link weights based on host to
host connection
4. Compute a hub score and an authority
score for each node in the graph; return
the top-ranked authority nodes.
Companion Algorithm (Cont’d)
Building the Vicinity Graph
Set up parameters: B : no of parents of u, BF : no
of children per parent, F : no of children of u, FB :
no of parents per child
Stoplist (pages that are unrelated to most queries
and have a very high in-degree)
Procedure
Go Back (B) : choose parents (randomly)
Back-Forward(BF) : choose siblings (nearest)
Go Forward (F) : choose children (first)
Forward-Back(FB) : choose siblings (highest in-degree)
Companion Algorithm (Cont’d)
Remove duplicate
Near-duplicate, if two nodes, each has more
than 10 links and they have at least 95% of
their links in common
Replace two nodes with a node whose links are
the union of the links of the two nodes
(mirror sites, aliases)
Companion Algorithm (Cont’d)
Assign edge (link) weights
• A link on the same host has weight 0
• If there are k links from documents on a host to a single document on a different host, each link has an authority weight of 1/k
• If there are k links from a single document on a host to a set of documents on a different host, give each link a hub weight of 1/k
(this prevents a single host from having too much influence on the computation)
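The host-based weighting scheme above can be sketched as follows. This is an illustrative interpretation of the rule, with a hypothetical edge list; `urlparse` is used only to extract the host of each URL:

```python
from collections import defaultdict
from urllib.parse import urlparse

def link_weights(edges):
    """Assign Companion-style edge weights (a sketch of the host-based rule).

    edges: iterable of (src_url, dst_url) hyperlinks."""
    host = lambda url: urlparse(url).netloc
    # k for authority weight: links from documents on one host to a single document
    to_doc = defaultdict(int)
    # k for hub weight: links from a single document to documents on one host
    from_doc = defaultdict(int)
    for s, d in edges:
        if host(s) != host(d):
            to_doc[(host(s), d)] += 1
            from_doc[(s, host(d))] += 1
    auth_w, hub_w = {}, {}
    for s, d in edges:
        if host(s) == host(d):
            auth_w[(s, d)] = hub_w[(s, d)] = 0.0   # same-host links get weight 0
        else:
            auth_w[(s, d)] = 1.0 / to_doc[(host(s), d)]
            hub_w[(s, d)] = 1.0 / from_doc[(s, host(d))]
    return auth_w, hub_w

# hypothetical link data: two a.com pages point at the same b.com page
edges = [
    ("http://a.com/1", "http://b.com/x"),
    ("http://a.com/2", "http://b.com/x"),
    ("http://a.com/1", "http://a.com/3"),   # same-host link
]
aw, hw = link_weights(edges)
```

Here each of the two a.com links into b.com/x gets authority weight 1/2, while the same-host link gets weight 0, so no single host can dominate the weighted hub/authority computation.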
Companion Algorithm (Cont’d)
Compute hub and authority scores
• Extension of the HITS algorithm with edge weights
• Initialize all elements of the hub vector H to 1
• Initialize all elements of the authority vector A to 1
• While the vectors H and A have not converged:
– For all nodes n in the vicinity graph N:
A[n] := Σ_{(n',n) ∈ edges(N)} H[n'] × authority_weight(n',n)
– For all nodes n in N:
H[n] := Σ_{(n,n') ∈ edges(N)} A[n'] × hub_weight(n,n')
– Normalize the H and A vectors.
CONSERV: Cocitation Algorithm
Two nodes are co-cited if they have a
common parent
The number of common parents of two nodes
is their degree of co-citation
Determine the related pages by looking for
sibling nodes with the highest degree of co-citation
In some cases there is an insufficient level of
co-citation to provide meaningful results; chop
off elements of the URL and restart the algorithm,
e.g. A.com/X/Y/Z → A.com/X/Y
Comparative Study
• Page Rank (Google)
– Assigns initial rankings and retains them independently of queries (fast)
– Works in the forward direction, from link to link
– Qualitative result
• Hub/Authority (CLEVER, C-Server)
– Assembles a different root set and prioritizes pages in the context of the query
– Looks in the forward and backward directions
– Qualitative result
Connectivity-Based Ranking
• Query-independent: gives an intrinsic quality
score to a page
• Approach #1: larger number of hyperlinks
pointing to a page, the better the page
– drawback?
– each link is equally important
• Approach #2: weight each hyperlink
proportionally to the quality of the page
containing the hyperlink
Query-dependent Connectivity-Based
Ranking
• Carrière and Kazman
• For each query, build a subgraph of the link
graph G limited to pages on query topic
• Build the neighborhood graph
1. A start set S of documents matching query given by
search engine (~200)
2. Set augmented by its neighborhood, the set of
documents that either point to or are pointed to by
documents in S (limit to ~50)
3. Then rank based on indegree
Idea
• We desire pages that are relevant (in the
neighborhood graph) and authoritative
• As in page rank, not only the in-degree of a
page p, but the quality of the pages that
point to p. If more important pages point to
p, that means p is more authoritative
• Key idea: Good hub pages have links to good
authority pages
• given user query, compute a hub score and
an authority score for each document
• high authority score → relevant content
• high hub score → links to documents with relevant content
Improvements to Basic Algorithm
• Put weights on edges to reflect
importance of links, e.g., put higher
weight if anchor text associated with
the link is relevant to query
• Normalize weights outgoing from a
single source or coming into a single
sink. This alleviates spamming of query
results
• Eliminate edges between same domain
Discovering Web communities
on the web
Introduction
• Introduction of the cyber-community
• Methods to measure the similarity of
web pages on the web graph
• Methods to extract the meaningful
communities through the link
structure
What is cyber-community
• A community on the web is a group of web pages sharing
a common interest
– Eg. A group of web pages talking about POP Music
– Eg. A group of web pages interested in data-mining
• Main properties:
– Pages in the same community should be similar to
each other in contents
– The pages in one community should differ from the
pages in another community
– Similar to a cluster
Two different types of communities
• Explicitly-defined communities
– They are well known ones, such as the resources listed by Yahoo!
eg. Arts → Music, Painting; Music → Classic, Pop
• Implicitly-defined communities
– They are communities unexpected or invisible to most users
eg. The group of web pages interested in a particular singer
Two different types of communities
• The explicit communities are easy to identify
– Eg. Yahoo!, InfoSeek, Clever System
• In order to extract the implicit communities,
we need to analyze the web graph objectively
• In research, people are more interested in
the implicit communities
Similarity of web pages
• Discovering web communities is similar to clustering. For clustering, we must define the similarity of two nodes
• Method I:
– For page A and page B, A is related to B if there is a hyper-link from A to B, or from B to A
– Not so good. Consider the home pages of IBM and Microsoft.
Similarity of web pages
• Method II (from Bibliometrics)
– Co-citation: the similarity of A and B is measured by the number of pages that cite both A and B
– Bibliographic coupling: the similarity of A and B is measured by the number of pages cited by both A and B
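The two bibliometric similarity measures above are easy to compute directly from a link graph. This is a minimal sketch over a hypothetical graph, where `links` maps each page to the set of pages it cites:

```python
def cocitation(links, a, b):
    """Co-citation: number of pages that cite (link to) both a and b."""
    return sum(1 for out in links.values() if a in out and b in out)

def bibliographic_coupling(links, a, b):
    """Bibliographic coupling: number of pages cited by both a and b."""
    return len(set(links.get(a, ())) & set(links.get(b, ())))

# hypothetical link graph: p and q both cite A and B; A and B share citation y
links = {
    "p": {"A", "B"},
    "q": {"A", "B", "C"},
    "A": {"x", "y"},
    "B": {"y", "z"},
}
cc = cocitation(links, "A", "B")
bc = bibliographic_coupling(links, "A", "B")
```

Here A and B have a co-citation degree of 2 (cited together by p and q) and a bibliographic-coupling degree of 1 (both cite y), illustrating how the two measures differ.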
Methods of clustering
• Clustering methods based on co-citation analysis
• Methods derived from HITS (Kleinberg)
– Using the co-citation matrix
• All of them can discover meaningful communities
• But these methods are very expensive for the whole World Wide Web with billions of web pages
A cheaper method
• The method from Ravi Kumar, Prabhakar
Raghavan, Sridhar Rajagopalan, Andrew
Tomkins
– IBM Almaden Research Center
• They call their method communities
trawling (CT)
• They implemented it on a graph of 200
million pages, and it worked very well
Basic idea of CT
• Definition of communities: dense directed bipartite subgraphs
– Bipartite graph: nodes are partitioned into two sets, F (fans) and C (centers)
– Every directed edge in the graph is directed from a node u in F to a node v in C
– Dense if many of the possible edges between F and C are present
Basic idea of CT
• Bipartite cores
– a complete bipartite
subgraph with at least i
nodes from F and at least j
nodes from C
– i and j are tunable
parameters
– A (i, j) Bipartite core
• Every community has such a core with a certain i and j.
[Figure: a (i=3, j=3) bipartite core]
168
Basic idea of CT
• A bipartite core is the identity of a
community
• To extract all the communities is to
enumerate all the bipartite cores on the web.
• The authors invented an efficient algorithm to
enumerate the bipartite cores. Its main idea
is iterative pruning: elimination-generation pruning
• Complete bipartite graph: there is an edge between
each node in F and each node in C
• (i,j)-Core: a complete bipartite graph with at least i
nodes in F and j nodes in C
• (i,j)-Core is a good signature for finding online
communities
•“Trawling”: finding cores
• Find all (i,j)-cores in the Web graph.
– In particular: find “fans” (or “hubs”) in the graph
– “centers” = “authorities”
– Challenge: Web is huge. How to find cores
efficiently?
Main idea: pruning
• Step 1: using out-degrees
– Rule: each fan must point to at least 6
different websites
– Pruning results: 12% of all pages (= 24M
pages) are potential fans
– Retain only links, and ignore page contents
Step 2: Eliminate mirroring pages
• Many pages are mirrors (exactly the same page)
• They can produce many spurious fans
• Use a “shingling” method to identify and eliminate duplicates
• Results:
– 60% of 24M potential-fan pages are removed
– # of potential centers is 30 times the # of potential fans
Step 4: Iterative pruning
• To find (i,j)-cores
– Remove all pages whose # of out-links is < i
– Remove all pages whose # of in-links is < j
– Do it iteratively
• Step 5: inclusion-exclusion pruning
– Idea: in each step, we either “include” a community or “exclude” a page from further contention
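The iterative pruning step can be sketched as follows. This is an illustrative implementation with a hypothetical edge list, using the convention that an (i,j)-core has i fans and j centers, so a surviving fan must point to at least j surviving centers and a surviving center must be pointed to by at least i surviving fans:

```python
def iterative_prune(edges, i, j):
    """Iteratively shrink the set of (i,j)-core candidates (a sketch).

    edges: list of (fan, center) links. Returns the surviving candidates."""
    fans = {u for u, v in edges}
    centers = {v for u, v in edges}
    changed = True
    while changed:
        # keep only edges between surviving candidates
        live = [(u, v) for u, v in edges if u in fans and v in centers]
        out_deg, in_deg = {}, {}
        for u, v in live:
            out_deg[u] = out_deg.get(u, 0) + 1
            in_deg[v] = in_deg.get(v, 0) + 1
        # a fan needs out-degree >= j; a center needs in-degree >= i
        new_fans = {u for u in fans if out_deg.get(u, 0) >= j}
        new_centers = {v for v in centers if in_deg.get(v, 0) >= i}
        changed = (new_fans != fans) or (new_centers != centers)
        fans, centers = new_fans, new_centers
    return fans, centers

# hypothetical graph: a complete (2,2)-core {f1,f2} x {c1,c2} plus a stray fan f3
edges = [("f1", "c1"), ("f1", "c2"), ("f2", "c1"), ("f2", "c2"), ("f3", "c1")]
fans, centers = iterative_prune(edges, i=2, j=2)
```

The stray fan f3 is pruned in the first pass (out-degree 1 < 2), after which the remaining graph is stable; only the genuine core candidates survive for the final enumeration step.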
• Check a page x with out-degree j. x is a fan of an (i,j)-core if:
– there are i-1 other fans pointing to all the forward neighbors of x
– this step can be checked easily using the index on fans and centers
• Result: for (3,3)-cores, 5M pages remained
• Final step: since the graph is much smaller, we can afford to “enumerate” the remaining cores
• Step 3: using in-degrees of pages
• Delete highly referenced pages, e.g., Yahoo, AltaVista
• Reason: they are referenced for many reasons, and are not likely to be forming an emerging community
• Formally: remove all pages with more than k in-links (k = 50, for instance)
• Results:
– 60M pages pointing to 20M pages
– 2M potential fans
Weakness of CT
• The bipartite graph cannot suit all
kinds of communities
• The density of the community is hard
to adjust
Experiment on CT
• 200 million web pages
• IBM PC with an Intel 300MHz Pentium II
processor, with 512M of memory, running
Linux
• i from 3 to 10 and j from 3 to 20
• 200K potential communities were discovered;
29% of them cannot be found in Yahoo!.
Summary
• Conclusion: The methods to discover
communities from the web depend on
how we define the communities
through the link structure
• Future works:
– How to relate the contents to link structure
Web communities based on dense bipartite
graph patterns (WISE’01)
By Krishna Reddy and Masaru Kitsuregawa
Aim/Motivation
To find all the communities within a large
collection of web pages.
Proposed solution:
•Analyze linkage patterns
•Find DBG in the given collection of webpages
Definitions
Bipartite graph
A BG is a graph which can be partitioned into two non-empty sets T and I. Every directed edge of the BG joins a node in T to a node in I.
Dense Bipartite graph
A DBG is a BG where each node of T establishes an edge with at least alpha nodes of I, and each node of I has at least beta nodes as parents.
Community
The set T contains the members of the community if there exists a DBG(T,I,alpha,beta) where alpha >= alpha_t and beta >= beta_t, with alpha_t, beta_t > 0.
DBG(T,I,p,q)
[Figure: a dense bipartite graph with nodes a, b, c, d in T linking to nodes s, t, u in I]
Definitions
Cocite: association among pages based on the existence of common children (URLs).
Relax Cocite: we allow u, v, w to group if cocite(u,v) and cocite(v,w) are true.
[Figure: two examples of pages a-g grouped through common children p, q, r]
Algorithm
1. For a given URL, find T (a set of URLs). The relax-cocite factor is 1.
a) While num_iterations <= n:
• At a fixed relax-cocite factor value, find all w’s such that relax-cocite(w,y) = true
• T = w ∪ T
2. Community extraction
– Input contains page_set; output DBG(T,I,alpha,beta)
– The edge file has <p,q> where p is the parent of q.
Algorithm (contd…)
• For each p belonging to T, insert the edge <p,q> in edge_file if q belongs to child(p).
• Sort the edge file based on source. Prepare T1 with <source, freq>. Remove <p,q> from edge_file if freq < alpha.
• Sort the edge_file based on destination. Prepare I1 with <q, freq>. Remove <p,q> from edge_file if freq < beta.
• The result is a DBG(T,I,alpha,beta).
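The counting passes described above can be sketched with in-memory counters instead of sorted files. This is an illustrative simplification of the paper's edge-file procedure, repeated until the edge set is stable, on a hypothetical edge list:

```python
from collections import Counter

def extract_dbg(edge_file, alpha, beta):
    """DBG extraction by repeated alpha/beta pruning passes (a sketch).

    edge_file: list of (p, q) pairs, where p is the parent of q."""
    edges = list(edge_file)
    while True:
        # pass 1: out-degree of each source p; drop edges of weak parents
        src = Counter(p for p, q in edges)
        edges2 = [(p, q) for p, q in edges if src[p] >= alpha]
        # pass 2: in-degree of each destination q; drop edges of weak children
        dst = Counter(q for p, q in edges2)
        edges3 = [(p, q) for p, q in edges2 if dst[q] >= beta]
        if edges3 == edges:     # stable: no edge was removed in this round
            break
        edges = edges3
    T = sorted({p for p, q in edges})
    I = sorted({q for p, q in edges})
    return T, I

# hypothetical edge file: parents a, b form a dense core; c is too sparse
edge_file = [("a", "s"), ("a", "t"), ("b", "s"), ("b", "t"), ("c", "s")]
T, I = extract_dbg(edge_file, alpha=2, beta=2)
```

With alpha = beta = 2, parent c (only one child) is eliminated in the first pass, leaving the DBG with T = {a, b} and I = {s, t}.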
Advantages/Disadvantages
• Extracts all DBG’s in a pageset.
• Community extracted is significantly large.
DISADV:
• Need a URL to start with.
• Community members need links to be a part
of the community
Efficient Identification of
Web Communities
Gary William Flake, Steve Lawrence & C. Lee Giles
Presentation Structure
• Introduction
or why they did it!
– Motivation
– Background
• Theory
or how they did it!
– Definition
– Algorithm
• Experimentation
or how did they do!
– Results
– Conclusions
Motivation
• Exploding Web ~ 1,000,000,000
documents
• Search Engine Limitations
– Crawling the web
– Updating the web
– Precision vs Recall
16% Maximum Coverage!
• Web Communities
– Balanced Min Cut
– Identification is NP-hard
Background
• Bibliometrics, Citation analysis, Social
Networks
• Classical Clustering
– eg. CORA
• HITS
– hubs and authorities
s-t Max Flow & Min Cut
• Capacity weights
• Source & Sink
• Water In, Water Out!
• G(V,E)
• Ford & Fulkerson’s Max Flow = Min Cut Theorem
• Incremental Shortest Augmentation algorithm in poly-time
The Idea
• The Ideal Community C ⊆ V
Theorem1: A community C can be identified by
calculating the s-t minimum cut using
appropriately chosen source and sink nodes.
• Proof by Contradiction
The Algorithm
1. Choose Source(s) and Sink(s)
2. Generate G(V,E) using a crawler
3. Find the s-t Min Cut
• Virtual Sources & Sinks
• Choosing the Source
• Choosing the Sink
[Figure: source layers and sink layers of the crawled graph]
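The three steps above can be sketched end-to-end on a small graph. This is an illustrative implementation, assuming unit edge capacities and a pre-crawled edge list: it runs Edmonds-Karp (BFS augmenting paths, one common shortest-augmentation variant) and then returns the community as the source side of the residual graph, i.e., the s-side of the minimum cut:

```python
from collections import deque, defaultdict

def min_cut_community(edges, source, sink, capacity=1):
    """Identify a community as the source side of an s-t minimum cut (a sketch)."""
    cap = defaultdict(int)
    adj = defaultdict(set)
    for u, v in edges:
        cap[(u, v)] += capacity
        adj[u].add(v)
        adj[v].add(u)                      # keep residual arcs reachable
    def bfs():
        parent = {source: None}
        q = deque([source])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    if v == sink:
                        return parent
                    q.append(v)
        return None
    while True:                            # augment until no s-t path remains
        parent = bfs()
        if parent is None:
            break
        path, v = [], sink
        while v != source:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[e] for e in path)   # bottleneck capacity on the path
        for u, v in path:
            cap[(u, v)] -= flow
            cap[(v, u)] += flow
    # community = nodes still reachable from the source in the residual graph
    community, q = {source}, deque([source])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in community and cap[(u, v)] > 0:
                community.add(v)
                q.append(v)
    return community

# hypothetical crawl: a densely linked triangle {s, a, b} with one bridge to x, t
edges = [
    ("s", "a"), ("a", "s"), ("a", "b"), ("b", "a"), ("b", "s"), ("s", "b"),
    ("b", "x"),                           # the single cut edge
    ("x", "t"), ("t", "x"),
]
community = min_cut_community(edges, "s", "t")
```

On this graph the maximum flow is 1, the minimum cut is the single bridge (b, x), and the returned community is the densely connected source side {s, a, b}.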
Expectation Maximization
• Implementation Issues
– Small size G(V,E) = low recall
– Dependent on choice of source set
• Recurse over Algorithm
– Community obtained in one iteration used
as input to next iteration
• Termination not guaranteed
Experimental Results
• Testing neighborhoods …
– Support Vector Machine (SVM)
– The Internet Archive
– Ronald Rivest
• Criterion
– Precision & Recall
– Seed set size
– Running time
SVM Community
• Characterization
– Recent: Not listed in any portal
– Relatively small research community
• Seed Set
– svm.first.gmd.de, svm.research.bell-labs.com,
www.clrc.rhbnc.ac.uk/research/SVM, www.supportvector.net
• Performance
– 4 iterations of EM
– 11,000 URLs in the graph, 252 member web
pages
Internet Archive Community
• Characterization
Large, internal communities
• Seed Set : 11 URLs
• Performance
– 2 iterations of EM
– 7,000 URLs, 289 web pages
Ronald Rivest Community
• Characterization
– Community around an individual
• Seed set
• http://theory.lcs.mit.edu/~rivest
• Performance
– 4 iterations of EM
– 38,000 URLs, 150 pages
– Cormen’s pages as 1st and 3rd result
Summary
• Actual running time
– 1 sec on a 500 MHz Intel machine
• Max Flow Framework
• EM Approach
• Relevancy test
Applications
• Focused crawlers
• Increased Precision & Coverage
• Automated population of portal
categories
• Recall Addressed
• Improved filtering
• Keyword Spamming
• Topical Pruning – eg. Pornography
Future Work
• Generalize the notion of Community
– Parameterize with coupling factor
• Low value, weakly connected communities
• High value, highly connected communities
• Ideal community
• Co-learning and Co-boosting
References
• L. Page, S. Brin, "PageRank: Bringing order to the Web," Stanford Digital Libraries working paper 1997-0072.
• Chakrabarti, Dom, Kumar, "Mining the link structure of the World Wide Web," IEEE Computer, 32(8), August 1999.
• K. Bharat, A. Broder, "The Connectivity Server: Fast access to linkage information on the Web." In Ashman and Thistlewaite [2], pages 469-477. Brisbane, Australia, 1998.
• B. Allan, "Finding Authorities and Hubs from Link Structures on the World Wide Web", ACM, May 2001.
• Jeffrey Dean, "Finding Related Pages in the World Wide Web", http://citeseer.nj.nec.com/dean99finding.html
• A. Z. Broder et al., "Graph structure in the Web: experiments and models," Proc. 9th WWW Conf., 2000.
• S. R. Kumar et al., "Trawling emerging cyber-communities automatically," Proc. 8th WWW Conf., 1999.
References
• Principles of Data Mining, Hand, Mannila, Smyth. MIT Press,
2001.
• Notes from Dr. M.V. Ramakrishna
http://goanna.cs.rmit.edu.au/~rama/cs442/info.html
• Notes from CS 395T: Large-Scale Data Mining, Inderjit Dhillon
http://www.cs.utexas.edu/users/inderjit/courses/dm2000.html
• Link Analysis in Web Information Retrieval, Monika Henzinger.
Bulletin of the IEEE Computer Society Technical Committee on
Data Engineering, 2000.
research.microsoft.com/research/db/debull/A00sept/henzinge.ps
• Slides from Data Mining: Concepts and Techniques, Han and
Kamber, Morgan Kaufmann, 2001.
1. J. Srivastava, R. Cooley, M. Deshpande, P.-N. Tan, Web Usage
Mining: Discovery and Applications of Usage Patterns from Web
Data, SIGKDD Explorations, Vol. 1, Issue 2, 2000.
2. B. Mobasher, R. Cooley and J. Srivastava, Web Mining:
Information and Pattern Discovery on the World Wide Web,
Proceedings of the 9th IEEE International Conference on Tools
with Artificial Intelligence (ICTAI'97), November 1997.
3. B. Mobasher, N. Jain, E.-H. Han, J. Srivastava, Web Mining:
Pattern Discovery from World Wide Web Transactions. Technical
Report TR 96-060, University of Minnesota, Dept. of Computer
Science, Minneapolis, 1996.
4. R. Cooley, P. N. Tan, and J. Srivastava. WebSIFT: the Web site
information filter system. In Proceedings of the 1999 KDD
Workshop on Web Mining, San Diego, CA. Springer-Verlag, 1999.
5. R. W. Cooley, Web Usage Mining: Discovery and Application of
Interesting Patterns from Web Data. PhD Thesis, Dept. of
Computer Science, University of Minnesota, May 2000.
6. R. Cooley, B. Mobasher, and J. Srivastava. Web Mining:
Information and Pattern Discovery on the World Wide Web.
IEEE Computer, pages 558-566, 1997.
7. O. Etzioni. The World Wide Web: Quagmire or gold mine?
Communications of the ACM, 39(11):65-68, 1996.
8. R. Kosala and H. Blockeel. Web Mining Research: A Survey.
SIGKDD Explorations, 2(1):1-15, 2000.
• Fayyad, U., Djorgovski, S., and Weir, N. Automating the analysis
and cataloging of sky surveys. In Advances in Knowledge
Discovery and Data Mining, pages 471-493. AAAI Press, 1996.
• Langley, P. User modeling in adaptive interfaces. In Proceedings
of the Seventh International Conference on User Modeling, pages
357-370, 1999.
• Madria, S.K., Bhowmick, S.S., Ng, W.K., and Lim, E.-P. Research
issues in web data mining. In Proceedings of Data Warehousing
and Knowledge Discovery, First International Conference, DaWaK
‘99, pages 303-312, 1999.
• Masand, B. and Spiliopoulou, M. Webkdd-99: Work-shop on web
usage analysis and user profiling. SIGKDD Explorations, 1(2),
2000.
• Smyth, P., Fayyad, U.M., Burl, M.C., and Perona, P.
Modeling subjective uncertainty in image annotation. In
Advances in Knowledge Discovery and Data Mining,
pages 517-539, 1996.
• Spiliopoulou, M. Data mining for the web. In Principles
of Data Mining and Knowledge Discovery, Second
European Symposium, PKDD ‘99, pages 588-589, 1999.
• Srivastava, J., Cooley, R., Deshpande, M., and Tan, P.-N.
Web usage mining: Discovery and applications of usage
patterns from web data. SIGKDD Explorations, 1(2),
2000.
• Zaiane, O.R., Xin, M., and Han, J. Discovering Web
access patterns and trends by applying OLAP and data
mining technology on Web logs. IEEE, pages 19-29,
1998.
Page Ranking
• The PageRank Citation Ranking: Bringing Order to the Web
(1998), Larry Page, Sergey Brin, R. Motwani, T. Winograd,
Stanford Digital Library Technologies Project.
• Authoritative Sources in a Hyperlinked Environment (1998), Jon
Kleinberg, Journal of the ACM.
• The Anatomy of a Large-Scale Hypertextual Web Search Engine
(1998), Sergey Brin, Lawrence Page, Computer Networks and
ISDN Systems.
• Web Search Via Hub Synthesis (2001), Dimitris Achlioptas, Amos
Fiat, Anna R. Karlin, Frank McSherry.
• What is this Page Known for? Computing Web Page Reputations
(2000), Davood Rafiei, Alberto O. Mendelzon.
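The PageRank computation cited above is, at heart, a power iteration over the link graph: each page repeatedly splits its current rank among the pages it links to, damped by a random-jump factor. A minimal sketch, where the three-page graph and the damping factor are illustrative choices:

```python
# Minimal power-iteration sketch of PageRank (Page & Brin); the three-page
# link graph and damping factor d = 0.85 below are illustrative choices.
def pagerank(links, d=0.85, iters=50):
    nodes = sorted(set(links) | {v for vs in links.values() for v in vs})
    n = len(nodes)
    pr = {p: 1.0 / n for p in nodes}           # start from the uniform vector
    for _ in range(iters):
        new = {p: (1 - d) / n for p in nodes}  # random-jump component
        for u in nodes:
            outs = links.get(u, [])
            if outs:                           # split u's rank over its links
                for v in outs:
                    new[v] += d * pr[u] / len(outs)
            else:                              # dangling page: spread uniformly
                for v in nodes:
                    new[v] += d * pr[u] / n
        pr = new
    return pr

# a links to b and c, b links to c, c links back to a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
# c ends up ranked highest: it is endorsed by both a and b.
```

The scores sum to 1 at every iteration, so they can be read as the stationary distribution of a random surfer who follows links with probability d and jumps to a random page otherwise.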
• Link Analysis in Web Information Retrieval, Monika Henzinger.
Bulletin of the IEEE Computer Society Technical Committee on
Data Engineering, 2000.
• Finding Authorities and Hubs From Link Structures on the World
Wide Web, Allan Borodin, Gareth O. Roberts, Jeffrey S.
Rosenthal, Panayiotis Tsaparas, 2002.
Web Communities and Classification
• Enhanced Hypertext Categorization Using Hyperlinks
(1998), Soumen Chakrabarti, Byron Dom, and Piotr Indyk,
Proceedings of SIGMOD-98, ACM International Conference on
Management of Data.
• Automatic Resource List Compilation by Analyzing Hyperlink
Structure and Associated Text (1998), S. Chakrabarti, B. Dom, D.
Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan,
Proceedings of the 7th International World Wide Web
Conference.
• Inferring Web Communities from Link Topology (1998), David
Gibson, Jon Kleinberg, Prabhakar Raghavan, UK Conference on
Hypertext.
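Kleinberg's hubs-and-authorities method cited above alternates two mutually recursive updates: a page's authority score sums the hub scores of pages linking to it, and its hub score sums the authority scores of the pages it links to, with normalization each round. A minimal sketch on a hypothetical four-page graph:

```python
# Sketch of Kleinberg's HITS iteration: authority(p) sums the hub scores of
# pages linking to p; hub(p) sums the authority scores of pages p links to.
# The four-page link graph below is hypothetical.
import math

def hits(links, iters=50):
    nodes = sorted(set(links) | {v for vs in links.values() for v in vs})
    hub = {p: 1.0 for p in nodes}
    auth = {p: 1.0 for p in nodes}

    def normalize(scores):                     # keep scores on the unit sphere
        norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
        return {p: x / norm for p, x in scores.items()}

    for _ in range(iters):
        auth = normalize({p: sum(hub[u] for u in nodes if p in links.get(u, ()))
                          for p in nodes})
        hub = normalize({p: sum(auth[v] for v in links.get(p, ()))
                         for p in nodes})
    return hub, auth

# h1 and h2 both point at p1 and p2, so the h* pages converge to pure hubs
# and the p* pages to pure authorities.
hub, auth = hits({"h1": ["p1", "p2"], "h2": ["p1", "p2"]})
```

The iteration converges to the principal eigenvectors of AᵀA and AAᵀ (A being the adjacency matrix), which is why a handful of iterations usually suffices.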
• Trawling the Web for Emerging Cyber-communities (1999), Ravi
Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew
Tomkins, WWW8 / Computer Networks.
• Finding Related Pages in the World Wide Web (1999), Jeffrey
Dean, Monika R. Henzinger, WWW8 / Computer Networks.
• A System for Collaborative Web Resource Categorization and
Ranking, Maxim Lifantsev.
• A Study of Approaches to Hypertext Categorization
(2002), Yiming Yang, Sean Slattery, Rayid Ghani, Journal of
Intelligent Information Systems.
• Hypertext Categorization Using Hyperlink Patterns and Meta
Data (2001), Rayid Ghani, Sean Slattery, Yiming Yang.