Download Web Mining

Document related concepts
no text concepts found
Transcript
Web Mining : A Bird’s Eye
View
Sanjay Kumar Madria
Department of Computer Science
University of Missouri-Rolla, MO 65401
[email protected]
Web Mining
 Web mining - data mining techniques to automatically discover
and extract information from Web documents/services (Etzioni,
1996).
 Web mining research – integrate research from several research
communities (Kosala and Blockeel, July 2000) such as:
 Database (DB)
 Information retrieval (IR)
 The sub-areas of machine learning (ML)
 Natural language processing (NLP)
Mining the World-Wide Web
 WWW is huge, widely distributed, global information source for
 Information services: news, advertisements, consumer information, financi
al management, education, government, e-commerce, etc.
 Hyper-link information
 Access and usage information
 Web Site contents and Organization
Mining the World-Wide Web
Growing and changing very rapidly
 Broad diversity of user communities
 Only a small portion of the information on the Web is truly relevant or use
ful to Web users
 How to find high-quality Web pages on a specified topic?
 WWW provides rich sources for data mining

Challenges on WWW Interactions




Finding Relevant Information
Creating knowledge from Information available
Personalization of the information
Learning about customers / individual users
Web Mining can play an important Role!
Web Mining: more challenging
 Searches for



Web access patterns
Web structures
Regularity and dynamics of Web contents
 Problems





The “abundance” problem
Limited coverage of the Web: hidden Web sources, majority of data in DBMS
Limited query interface based on keyword-oriented search
Limited customization to individual users
Dynamic and semistructured
Web Mining : Subtasks
 Resource Finding

Task of retrieving intended web-documents
 Information Selection & Pre-processing

Automatic selection and pre-processing specific information from retrieved web
resources
 Generalization

Automatic Discovery of patterns in web sites
 Analysis

Validation and / or interpretation of mined patterns
Web Mining Taxonomy
Web Mining
Web Content
Mining
Web Structure
Mining
Web Usage
Mining
Web Content Mining
 Discovery of useful information from web contents / data / documents
 Web data contents: text, image, audio, video,
metadata and hyperlinks.
 Information Retrieval View ( Structured + Semi-Structured)


Assist / Improve information finding
Filtering Information to users on user profiles
 Database View


Model Data on the web
Integrate them for more sophisticated queries
Issues in Web Content Mining
 Developing intelligent tools for IR
- Finding keywords and key phrases
- Discovering grammatical rules and collocations
- Hypertext classification/categorization
- Extracting key phrases from text documents
- Learning extraction models/rules
- Hierarchical clustering
- Predicting (words) relationship
Cont….
 Developing Web query systems

WebOQL, XML-QL
 Mining multimedia data
- Mining image from satellite (Fayyad, et al. 1996)
- Mining image to identify small volcanoes on Venus (Smyth, et al 1996) .
Web Structure Mining
 To discover the link structure of the hyperlinks at the inter-document level
to generate structural summary about the Website and Web page.
 Direction 1: based on the hyperlinks, categorizing the Web pages and
generated information.
 Direction 2: discovering the structure of Web
document itself.
 Direction 3: discovering the nature of the hierarchy or network of
hyperlinks in the Website of a particular domain.
Web Structure Mining
 Finding authoritative Web pages
 Retrieving pages that are not only relevant, but also of high quality,
or authoritative on the topic
 Hyperlinks can infer the notion of authority
 The Web consists not only of pages, but also of hyperlinks pointing
from one page to another
 These hyperlinks contain an enormous amount of latent human
annotation
 A hyperlink pointing to another Web page, this can be considered as
the author's endorsement of the other page
Web Structure Mining
 Web pages categorization (Chakrabarti, et al., 1998)
 Discovering micro communities on the web
- Example: Clever system (Chakrabarti, et al., 1999), Google (Brin and Page,
1998)
 Schema Discovery in Semistructured Environment
Web Usage Mining
 Web usage mining also known as Web log mining
 mining techniques to discover interesting usage patterns from the
secondary data derived from the interactions of the users while surfing
the web
Web Usage Mining
 Applications
 Target potential customers for electronic commerce
 Enhance the quality and delivery of Internet information services to the
end user
 Improve Web server system performance
 Identify potential prime advertisement locations
 Facilitates personalization/adaptive sites
 Improve site design
 Fraud/intrusion detection
 Predict user’s actions (allows prefetching)
Problems with Web Logs
 Identifying users
– Clients may have multiple streams
– Clients may access web from multiple hosts
– Proxy servers: many clients/one address
– Proxy servers: one client/many addresses
 Data not in log
– POST data (i.e., CGI request) not recorded
– Cookie data stored elsewhere
Cont…
 Missing data
– Pages may be cached
– Referring page requires client cooperation
– When does a session end?
 Use of forward and backward pointers
 Typically a 30 minute timeout is used
 Web content may be dynamic
– May not be able to reconstruct what the user saw
 Use of spiders and automated agents – automatic request
we pages
Cont…
 Like most data mining tasks, web log
mining requires preprocessing
– To identify users
– To match sessions to other data
– To fill in missing data
– Essentially, to reconstruct the click stream
Log Data - Simple Analysis
 Statistical analysis of users
– Length of path
– Viewing time
– Number of page views
 Statistical analysis of site
– Most common pages viewed
– Most common invalid URL
Web Log – Data Mining
Applications
 Association rules
– Find pages that are often viewed together
 Clustering
– Cluster users based on browsing patterns
– Cluster pages based on content
 Classification
– Relate user attributes to patterns
Web Logs
 Web servers have the ability to log all
requests
 Web server log formats:
– Most use the Common Log Format (CLF)
– New, Extended Log Format allows
configuration of log file
 Generate vast amounts of data







Common Log Format
Remotehost: browser hostname or IP #
Remote log name of user (almost
always "-" meaning "unknown")
Authuser: authenticated username
Date: Date and time of the request
"request”: exact request lines from client
Status: The HTTP status code returned
Bytes: The content-length of response
Server Logs
2017년 5월 22일
Web Mining
25
Fields








Client IP: 128.101.228.20
Authenticated User ID: - Time/Date: [10/Nov/1999:10:16:39 -0600]
Request: "GET / HTTP/1.0"
Status: 200
Bytes: Referrer: “-”
Agent: "Mozilla/4.61 [en] (WinNT; I)"
Web Usage Mining
 Commonly used approaches (Borges and Levene, 1999)
- Maps the log data into relational tables before an adapted data mining
technique is performed.
Uses the log data directly by utilizing special pre-processing techniques.
 Typical problems
- Distinguishing among unique users, server sessions, episodes, etc. in the
presence of caching and proxy servers (McCallum, et al., 2000; Srivastava,
et al., 2000).
Request
 Method: GET
– Other common methods are POST and HEAD
 URI: /
 – This is the file that is being accessed. When a
directory is specified, it is up to the Server to
decide what to return. Usually, it will be the file
named “index.html” or “home.html”
 Protocol: HTTP/1.0
Status
 Status codes are defined by the HTTP
protocol.
 Common codes include:
– 200: OK
– 3xx: Some sort of Redirection
– 4xx: Some sort of Client Error
– 5xx: Some sort of Server Error
Web Mining Taxonomy
Web Mining
Web Content
Mining
Web Page
Content Mining
Web Structure
Mining
Search Result
Mining
Web Usage
Mining
General Access
Pattern Tracking
Customized
Usage Tracking
Mining the World Wide Web
Web Mining
Web Content
Mining
Web Page Content Mining
Web Page Summarization
WebOQL(Mendelzon et.al. 1998) …:
Web Structuring query languages;
Can identify information within given
web pages
•(Etzioni et.al. 1997):Uses heuristics to
distinguish personal home pages from
other web pages
•ShopBot (Etzioni et.al. 1997): Looks for
product prices within web pages
Web Structure
Mining
Search Result
Mining
Web Usage
Mining
General Access
Pattern Tracking
Customized
Usage Tracking
Mining the World Wide Web
Web Mining
Web Content
Mining
Web Page
Content Mining
Web Structure
Mining
Web Usage
Mining
Search Result Mining
Search Engine Result
Summarization
•Clustering Search Result
(Leouski and Croft, 1996, Zamir and
Etzioni, 1997):
Categorizes documents using
phrases in titles and snippets
General Access
Pattern Tracking
Customized
Usage Tracking
Mining the World Wide Web
Web Mining
Web Content
Mining
Search Result
Mining
Web Page
Content Mining
Web Structure Mining
Using Links
•PageRank (Brin et al., 1998)
•CLEVER (Chakrabarti et al., 1998)
Use interconnections between web pages to
give weight to pages.
Using Generalization
•MLDB (1994)
Uses a multi-level database representation of
the Web. Counters (popularity) and link lists
are used for capturing structure.
Web Usage
Mining
General Access
Pattern Tracking
Customized
Usage Tracking
Mining the World Wide Web
Web Mining
Web Content
Mining
Web Page
Content Mining
Search Result
Mining
Web Structure
Mining
Web Usage
Mining
General Access Pattern Tracking
•Web Log Mining (Zaïane, Xin and Han, 1998)
Uses KDD techniques to understand
general access patterns and trends.
Can shed light on better structure and
grouping of resource providers.
Customized
Usage Tracking
Mining the World Wide Web
Web Mining
Web Content
Mining
Web Page
Content Mining
Search Result
Mining
Web Structure
Mining
General Access
Pattern Tracking
Web Usage
Mining
Customized Usage Tracking
•Adaptive Sites (Perkowitz and Etzioni, 1997)
Analyzes access patterns of each user at a time.
Web site restructures itself automatically by
learning from user access patterns.
Web Content Mining
 Agent-based Approaches:



Intelligent Search Agents
Information Filtering/Categorization
Personalized Web Agents
 Database Approaches:


Multilevel Databases
Web Query Systems
Intelligent Search Agents
 Locating documents and services on the Web:


WebCrawler, Alta Vista (http://www.altavista.com): scan millio
ns of Web documents and create index of words (too many irrelevant, outda
ted responses)
MetaCrawler: mines robot-created indices
 Retrieve product information from a variety of vendor sites using o
nly general information about the product domain:

ShopBot
Intelligent Search Agents (Cont’
d)
 Rely either on pre-specified domain information about particular types of docu
ments, or on hard coded models of the information sources to retrieve and inter
pret documents:





Harvest
FAQ-Finder
Information Manifold
OCCAM
Parasite
 Learn models of various information sources and translates these into its own
concept hierarchy:

ILA (Internet Learning Agent)
Information Filtering/Categorizati
on
 Using various information retrieval techniques and characteristics of open hype
rtext Web documents to automatically retrieve, filter, and categorize them.


HyPursuit: uses semantic information embedded in link structures and document conten
t to create cluster hierarchies of hypertext documents, and structure an information space
BO (Bookmark Organizer): combines hierarchical clustering techniques and user int
eraction to organize a collection of Web documents based on conceptual information
Personalized Web Agents
 This category of Web agents learn user preferences and discover Web informati
on sources based on these preferences, and those of other individuals with simi
lar interests (using collaborative filtering)






WebWatcher
PAINT
Syskill&Webert
GroupLens
Firefly
others
Multiple Layered Web Architecture
Layern
More Generalized Descriptions
...
Layer1
Layer0
Generalized Descriptions
Multilevel Databases
 At the higher levels, meta data or generalizations are


extracted from lower levels
organized in structured collections, i.e. relational or object-oriented database.
 At the lowest level, semi-structured information are

stored in various Web repositories, such as hypertext documents
Multilevel Databases (Cont’d)
 (Han, et. al.):

use a multi-layered database where each layer is obtained via generalizati
on and transformation operations performed on the lower layers
 (Kholsa, et. al.):

propose the creation and maintenance of meta-databases at each informati
on providing domain and the use of a global schema for the meta-database
Multilevel Databases (Cont’d)
 (King, et. al.):

propose the incremental integration of a portion of the schema from each i
nformation source, rather than relying on a global heterogeneous database
schema
 The ARANEUS system:

extracts relevant information from hypertext documents and integrates thes
e into higher-level derived Web Hypertexts which are generalizations of the
notion of database views
Multi-Layered Database (MLDB)
 A multiple layered database model
based on semi-structured data hypothesis
 queried by NetQL using a syntax similar to the relational language SQL

 Layer-0:

An unstructured, massive, primitive, diverse global information-base.
 Layer-1:


A relatively structured, descriptor-like, massive, distributed database by d
ata analysis, transformation and generalization techniques.
Tools to be developed for descriptor extraction.
 Higher-layers:

Further generalization to form progressively smaller, better structured, an
d less remote databases for efficient browsing, retrieval, and information
discovery.
Three major components in MLDB
 S (a database schema):
 outlines the overall database structure of the global MLDB
 presents a route map for data and meta-data (i.e., schema) browsing
 describes how the generalization is performed
 H (a set of concept hierarchies):
 provides a set of concept hierarchies which assist the system to generalize lower layer infor
mation to high layeres and map queries to appropriate concept layers for processing
 D (a set of database relations):
 the whole global information base at the primitive information level (i.e., layer-0)
 the generalized database relations at the nonprimitive layers
The General architecture of WebLogMiner
(a Global MLDB)
Generalized Data
Higher layers
Site 1
Site 2
Concept
Hierarchies
Resource Discovery
(MLDB)
Knowledge Discovery (WLM)
Site 3
2017년 5월 22일
Characteristic Rules
Discriminant Rules
Association Rules
Web Mining
48

Techniques for Web usage
mining
Construct multidimensional view on the Weblog database

Perform multidimensional OLAP analysis to find the top N users, top N accessed Web
pages, most frequently accessed time periods, etc.
 Perform data mining on Weblog records


Find association patterns, sequential patterns, and trends of Web accessing
May need additional information,e.g., user browsing sequences of the Web pages in the
Web server buffer
 Conduct studies to

Analyze system performance, improve system design by Web caching, Web page
prefetching, and Web page swapping
Web Usage Mining - Phases
 Three distinctive phases: preprocessing, pattern discovery, and pattern
analysis
 Preprocessing - process to convert the raw data into the data abstraction
necessary for the further applying the data mining algorithm
 Resources: server-side, client-side, proxy servers, or database.
 Raw data: Web usage logs, Web page descriptions, Web site topology, user
registries, and questionnaire.
 Conversion: Content converting, Structure converting, Usage converting
 User: The principal using a client to interactively retrieve and render
resources or resource manifestations.
 Page view: Visual rendering of a Web page in a specific client
environment at a specific point of time
 Click stream: a sequential series of page view request
 User session: a delimited set of user clicks (click stream) across one or
more Web servers.
 Server session (visit): a collection of user clicks to a single Web server
during a user session.
 Episode: a subset of related user clicks that occur within a user session.
 Content Preprocessing - the process of converting text, image,
scripts and other files into the forms that can be used by the
usage mining.
 Structure Preprocessing - The structure of a Website is formed
by the hyperlinks between page views, the structure
preprocessing can be done by parsing and reformatting the
information.
 Usage Preprocessing - the most difficult task in the usage
mining processes, the data cleaning techniques to eliminate the
impact of the irrelevant items to the analysis result.
Pattern Discovery
 Pattern Discovery is the key component of the
Web mining, which converges the algorithms and techniques from
data mining, machine learning, statistics and pattern recognition
etc research categories.
 Separate subsections: statistical analysis,
association rules, clustering, classification,
sequential pattern, dependency Modeling.
 Statistical Analysis - the analysts may perform different
kinds of descriptive statistical analyses based on different
variables when analyzing the session file ; powerful tools in
extracting knowledge about visitors to a Web site.
 Association Rules - refers to sets of pages that are accessed
together with a support value exceeding some specified
threshold.
 Clustering: a technique to group together users or data items
(pages) with the similar characteristics.
 It can facilitate the development and execution of future
marketing strategies.
 Classification: the technique to map a data item into one of
several predefined classes, which help to establish a profile
of users belonging to a particular class or category.
Pattern Analysis
 Pattern Analysis - final stage of the Web usage mining.
 To eliminate the irrelative rules or patterns and to extract the
interesting rules or patterns from the output of the pattern
discovery process.
 Analysis methodologies and tools: query
mechanism like SQL, OLAP, visualization etc.

WUM – Pre-Processing
Data Cleaning
Removes log entries that are not needed for the mining
process
Data Integration
Synchronize data from multiple server logs, metadata
User Identification
Associates page references with different users
Session/Episode Identification
Groups user’s page references into user sessions
Page View Identification
Path Completion
Fills in page references missing due to browser and proxy
caching
WUM – Issues in User Session Identification
A single IP address is used by many users
different users
Proxy
server
Web server
Different IP addresses in a single session
ISP server
Single user
Web server
Missing cache hits in the server logs
User and Session Identification Issues






Distinguish among different users to a site
Reconstruct the activities of the users within the site
Proxy servers and anonymizers
Rotating IP addresses connections through ISPs
Missing references due to caching
Inability of servers to distinguish among different visits
WUM – Solutions
Remote Agent
A remote agent is implemented in Java Applet
It is loaded into the client only once when the first page is accessed
The subsequent requests are captured and send back to the server
Modified Browser
The source code of the existing browser can be modified to gain user
specific data at the client side
Dynamic page rewriting
When the user first submit the request, the server returns the
requested page rewritten to include a session specific ID
Each subsequent request will supply this ID to the server
Heuristics
Use a set of assumptions to identify user sessions and find the
missing cache hits in the server log
WUM – Heuristics
The session identification heuristics
Timeout: if the time between pages requests exceeds a c
ertain limit, it is assumed that the user is starting a new se
ssion
IP/Agent: Each different agent type for an IP address repr
esents a different sessions
Referring page: If the referring page file for a request is n
ot part of an open session, it is assumed that the request i
s coming from a different session
Same IP-Agent/different sessions (Closest): Assigns th
e request to the session that is closest to the referring pag
e at the time of the request
Same IP-Agent/different sessions (Recent): In the cas
e where multiple sessions are same distance from a page
request, assigns the request to the session with the most r
ecent referrer access in terms of time
Cont.
The path completion heuristics
If the referring page file of a session is not part of the previo
us page file of that session, the user must have accessed a
cached page
The “back” button method is used to refer a cached page
Assigns a constant view time for each of the cached page fi
le
WUM – Association Rule Generation
Discovers the correlations between pages that are
most often referenced together in a single server
session
 Provide the information
What are the set of pages frequently accessed together by Web users?
What page will be fetched next?
What are paths frequently accessed by Web users?
Association rule
A
B [ Support = 60%, Confidence = 80% ]
Example
“50% of visitors who accessed URLs /infor-f.html and
labo/infos.html also visited situation.html”
Associations & Correlations
 Page associations from usage data
– User sessions
– User transactions
 Page associations from content data
– similarity based on content analysis
 Page associations based on structure
– link connectivity between pages
 ==> Obtain frequent itemsets
Examples:
60% of clients who accessed /products/, also accessed
/products/software/webminer.htm.
30% of clients who accessed /special-offer.html, placed
an online order in /products/software/.
(Example from IBM official Olympics Site)
 {Badminton, Diving} ===> {Table Tennis} (a = 69.7%, s =
0.35%)
WUM – Clustering
 Groups together a set of items having similar characteristics
 User Clusters
Discover groups of users exhibiting similar browsing patterns
Page recommendation
User’s partial session is classified into a single cluster
The links contained in this cluster are recommended
Cont..
Page clusters
Discover groups of pages having related content
Usage based frequent pages
Page recommendation
The links are presented based on how often URL references
occur together across user sessions
Website Usage Analysis

Why developing a Website usage / utilization analyzation tool?

Knowledge about how visitors use Website could
- Prevent disorientation and help designers place important
information/functions exactly where the visitors look for and in the way
users need it
- Build up adaptive Website server
Clustering and Classification
clients who often access
 /products/software/webminer.html tend to be from
educational institutions.
clients who placed an online order for software tend to be
students in the 20-25 age group and live in the United States.
75% of clients who download software from
/products/software/demos/ visit between 7:00 and 11:00 pm on
weekends.
Website Usage Analysis
 Discover user navigation patterns in using Website
- Establish a aggregated log structure as a preprocessor to reduce the
search space before the actual log mining phase
- Introduce a model for Website usage pattern discovery by extending the
classical mining model, and establish the processing framework of this
model
Sequential Patterns & Clusters
30% of clients who visited /products/software/,
had done a search in Yahoo using the keyword
“software” before their visit
60% of clients who placed an online order for
WEBMINER, placed another online order for software
within 15 days
Website Usage Analysis
 Website client-server architecture facilitates recording user behaviors in every
steps by
-
submit client-side log files to server when users use clear functions or exit
window/modules
 The special design for local and universal back/forward/clear functions makes
user’s navigation pattern more clear for designer by
- analyzing local back/forward history and incorporate it with universal
back/forward history
Website Usage Analysis
 What will be included in SUA
1. Identify and collect log data
2.
Transfer the data to server-side and save them in a structure desired for
analysis
3. Prepare mined data by
establishing a customized aggregated log tree/frame
4. Use modifications of the typical data mining methods, particularly an
extension of a traditional sequence discovery algorithm, to mine user navigation
patterns
Website Usage Analysis
 Problem need to be considered:
- How to identify the log data when a user go through uninteresting function/module
- What marks the end of a user session?
- Client connect Website through proxy servers
 Differences in Website usage analysis with common Web usage mining
- Client-side log files available
- Log file’s format (Web log files follow Common Log Format specified as a part of HTTP protocol)
- Not necessary for log file cleaning/filtering (which usually performed in preprocess of Web log
mining)
Web Usage Mining - Patterns D
iscovery Algorithms
 (Chen et. al.) Design algorithms for Path Traversal Patterns, finding maxim
al forward references and large reference sequences.
Path Traversal Patterns
 Procedure for mining traversal patterns:



(Step 1) Determine maximal forward references from the original l
og data (Algorithm MF)
(Step 2) Determine large reference sequences (i.e., Lk, k1) from th
e set of maximal forward references (Algorithm FS and SS)
(Step 3) Determine maximal reference sequences from large referen
ce sequences
 Focus on Step 1 and 2, and devise algorithms for the efficien
t determination of large reference sequences
Determine large reference seque
 Algorithm FS:
ces



Utilizes the key ideas of algorithm DHP:
employs hashing and pruning techniques
DHP is very efficient for the generation of candidate itemsets, in particular for the large
two-itemsets, thus greatly improving the performance bottleneck of the whole process
 Algorithm SS:


employs hashing and pruning techniques to reduce both CPU and I/O costs
by properly utilizing the information in candidate references in prior passes, is able to av
oid database scans in some passes, thus further reducing the disk I/O cost
Patterns Analysis Tools
 WebViz [pitkwa94] --- provides appropriate tools and techniques to u
nderstand, visualize, and interpret access patterns.
 Proposes OLAP techniques such as data cubes for the purpose of simplif
ying the analysis of usage statistics from server access logs. [dyreua
et al]
Patterns Discovery and Analysis
Tools
 The emerging tools for user pattern
discovery use sophisticated techniques from AI, d
ata mining, psychology, and information theory, to mine for knowledge from collected
data:
 (Pirolli et. al.) use information foraging theory to combine path traversal pa
tterns, Web page typing, and site topology information to categorize pages for ea
sier access by users.
(Cont’d)
 WEBMINER :


introduces a general architecture for Web usage mining, automatically discoverin
g association rules and sequential patterns from server access logs.
proposes an SQL-like query mechanism for querying the discovered knowledge i
n the form of association rules and sequential patterns.
 WebLogMiner


Web log is filtered to generate a relational database
Data mining on web log data cube and web log database
WEBMINER
 SQL-like Query
 A framework for Web mining, the applications of data mining
and knowledge discovery techniques, association rules and se
quential patterns, to Web data:

Association rules: using apriori algorithm


40% of clients who accessed the Web page with URL /company/produ
cts/product1.html, also accessed /company/products/product
2.html
Sequential patterns: using modified apriori algorithm

60% of clients who placed an online order in /company/products/pr
oduct1.html, also placed an online order in /company/products/p
roduct4.html within 15 days
WebLogMiner
 Database construction from server log file:


data cleaning
data transformation
 Multi-dimensional web log data cube construction and manipulation
 Data mining on web log data cube and web log database
Mining the World-Wide Web
 Design of a Web Log Miner




Web log is filtered to generate a relational database
A data cube is generated form database
OLAP is used to drill-down and roll-up in the cube
OLAM is used for mining interesting knowledge
Web log
Database
Knowledge
Data Cube
Sliced and diced
cube
R(q)
(q,p)G out degre (q)
R(p) = /n (1 ) 
1
Data Cleaning
2
Data Cube
Creation
3
OLAP
4
Data Mining
Construction of Data Cubes
(http://db.cs.sfu.ca/sections/publication/slides/slides.html)
Amount
B.C.
Province Prairies
Ontario
sum
0-20K20-40K 40-60K60K- sum
All Amount
Comp_Method, B.C.
Comp_Metho
d
Database
… ...
Discipline
sum
Each dimension contains a hierarchy of values for one attribute
A cube cell stores aggregate values, e.g., count, sum, max, etc.
A “sum” cell stores dimension summation values.
Sparse-cube technology and MOLAP/ROLAP integration.
“Chunk”-based multi-way aggregation and single-pass computation.
WebLogMiner Architecture




Web log is filtered to generate a relational database
A data cube is generated from database
OLAP is used to drill-down and roll-up in the cube
OLAM is used for mining interesting knowledge
Database
Web log
Knowledge
Data Cube
Sliced and diced
cube
R(q)
(q,p)G out degre (q)
R(p) = /n (1 ) 
1
2
Data Cube
Data Cleaning
Creation
3
4
OLAP
Data Mining
WEBSIFT
2017년 5월 22일
Web Mining
93
What is WebSIFT?
 a Web Usage Mining framework that
 performs
preprocessing
 performs knowledge discovery
 uses the structure and content information about a Web site to
automatically define a belief set.
Overview of WebSIFT
 Based on WEBMINER prototype
 Divides the Web Usage Mining process into three main parts
Overview of WebSIFT
 Input:
Access
 Referrer and agent
 HTML files
 Optional data (e.g., registration data or remote agent logs)

Overview of WebSIFT
 Preprocessing:
 uses input data to construct a user session file
 site files are used to classify pages of a site
 Knowledge discovery phase
 uses existing data mining techniques to generate rules and patterns.
 generation of general usage stats
Information Filtering
 Links between pages provide evidence for supporting the belief that those
pages are related.
 Strength of evidence for a set pages being related is proportional to the
strength of the topological connection between the set of pages.
 Based on site content, can also look at content similarity and by calculating
“distance” between pages.
Information Filtering
Information Filtering
 Uses two different methods to identify interesting results from a
list of discovered frequent itemsets
Information Filtering
 Method 1:
 declare itemsets that contain pages not directly connected to be
interesting
 corresponds to a situation where a belief that a set of pages are related
has no domain or existing evidence but there is mined evidence. 
called Beliefs with Mined Evidence algo (BME)
Information Filtering
 Method 2:
 Absence of itemsets  evidence against a belief that pages are related.
 Pages that have individual support above a threshold but are not present together in
larger frequent itemsets  evidence against the pages being related.
 domain evidence suggests that pages are related the absence of the frequent itemset
can be considered interesting. This is handled by the Beliefs with Contradicting Evidence
algo (BCE )
Experimental Evaluation
Performed on web server of U of MN Dept of Comp Sci & Eng’g web site
Log spanned eight days in Feb 1999
Physical size of log: 19.3 MB
102,838 entries
After preprocessing: 43,158 page views (divided among 10,609 user sessions)
Threshold of 0.1% for support used to generate 693 frequent itemsets with maximum set size
of six pages.
 178 unique pages represented in all the rules.
 BCE and BME algos run on frequent itemsets.






Experimental Evaluation
Experimental Evaluation
Future work
 Filtering frequent itemsets, sequential patterns and clusters
 Incorporate probabilities and fuzzy logic into information filter
 Future works include path completion verification, page usage determination, application of the
pattern analysis results, etc.
Link Analysis
2017년 5월 22일
Web Mining
107
Link Analysis
 Finding patterns in graphs
Bibliometrics – finding patterns in citation graphs
 Sociometry – finding patterns in social networks
 Collaborative Filtering – finding patterns in rank(person, item) graph
 Webometrics – finding patterns in web page links

Web Link Analysis
 Used for
ordering documents matching a user query: ranking
 deciding what pages to add to a collection: crawling
 page categorization
 finding related pages
 finding duplicated web sites

Web as Graph
 Link graph:


node for each page
directed edge (u,v) if page u contains a hyperlink to page v
 Co-citation graph


node for each page
undirected edge (u,v) iff exists a third page w linking to both u and v
 Assumption:


link from page A to page B is a recommendation of page B by A
If A and B are connected by a link, there is a higher probability that they are
on the same topic
Web structure mining
 HITS (Topic distillation)
 PageRank (Ranking web pages used by Google)
 Algorithm in Cyber-community
HITS Algorithm
--Topic Distillation on WWW
2017년 5월 22일
Web Mining
112
HITS Method





Hyperlink Induced Topic Search
Kleinberg, 1998
A simple approach by finding hubs and authorities
View web as a directed graph
Assumption: if document A has hyperlink to document B, then the author of
document A thinks that document B contains valuable information
Main Ideas
 Concerned with the identification of the most
authoritative, or definitive, Web pages on a b
road-topic
 Focused on only one topic
 Viewing the Web as a graph
 A purely link structure-based computation, ig
noring the textual content
HITS: Hubs and Authority
 Hub: web page links to a collection of prominent sites on a common topic
 Authority: Pages that link to a collection of authoritativ
e pages on a broad topic; web page pointed to by hubs
 Mutual Reinforcing Relationship: a good authority is a page that is pointed to
by many good hubs, while a good hub is a page that points to many good
authorities
Hub-Authority Relations
Hubs
Authorities
Unrelated page of
large in-degree
HITS: Two Main Steps
 A sampling component, which constructs a focused collection of several
thousand web pages likely to be rich in relevant authorities
 A weight-propagation component, which determines numerical estimates of hub
and authority weights by an iterative procedure
 As the result, pages with highest weights are returned as hubs and authorities
for the research topic
HITS: Root Set and Base Set
 Using query term to collect a root set (S) of pages from index-based search
engine (AltaVista)
 Expand root set to base set (T) by including all pages linked to by pages in
root set and all pages that link to a page in root set (up to a designated size
cut-off)
 Typical base set contains roughly 1000-5000 pages
Step 1: Constructing Subgrap
h
1.1 Creating a root set (S)
- Given a query string on a broad topic
- Collect the t highest-ranked pages for the query
from a text-based search engine
1.2 Expanding to a base set (T)
- Add the page pointing to a page in root set
- Add the page pointed to by a page in root set
Root Set and Base Set (Cont’d)
T
S
S
Step 2:
Computing Hubs and Authoritie
s
2.1 Associating weights
- Authority weight xp
- Hub weight yp
- Set all values to a uniform constant initially
2.2 Updating weights
Updating Authority Weight
xp =q suchthat qp
yq
q1
Example
xp=yq1+yq2+yq3
q2
q3
P
Updating Hub Weight
yp =

xq
q such that pq
q1
Example
yp=xq1+xq2+xq3
q2
P
q3
Flowchart
Initialization
Update
all xvalues
Set all values to c,
e.g. c =1
Update
all yvalues
1st time
Update
all xvalues
Update
all yvalues
2nd time
Results
 All x- and y-values converge rapidly so that t
ermination of the iteration is guaranteed
 It can be proved in mathematical approach
 Pages with the highest x-values are viewed
as the best authorities, while pages with the
highest y-values are regarded as the best hu
bs
Implementation




Search engine:
Root set:
Base set:
Converging speed:
 Running time:
AltaVista
200 pages
1000-5000 pages
Very rapid,
less than 20 times
About 30 minutes
HITS: Advantages
 Weight computation is an intrinsic feature from collection of linked pages
 Provides a densely linked community of related authorities and hubs
 Pure link-based computation once the root set has been assembled, with no
further regard to query terms
 Provides surprisingly good search result for a wide range of queries
Drawbacks
 Limit On Narrow Topics


Not enough authoritative pages
Frequently returns resources for a
more general topic
 adding a few edges can potentially change scores considerably
 Topic Drifting
- Appear when hubs discuss multiple
topics
Improved Work
 To improve precision:
- Combining content with link information
- Breaking large hub pages into smaller units
- Computing relevance weights for pages
 To improve speed:
- Building a
Connectivity Server that provides linkage in
formation for all pages
Web Structure Mining



Page-Rank Method
CLEVER Method
Connectivity-Server Method
1. Page-Rank Method







Introduced by Brin and Page (1998)
Mine hyperlink structure of web to produce ‘global’ importance ranking of every web page
Used in Google Search Engine
Web search result is returned in the rank order
Treats link as like academic citation
Assumption: Highly linked pages are more ‘important’ than pages with a few links
A page has a high rank if the sum of the ranks of its back-links is high
Page Rank: Computation
 Assume:






R(u)
Fu
Bu
Nu
C
E(u)
: Rank of a web page u
: Set of pages which u points to
: Set of pages that points to u
: Number of links from u
: Normalization factor
: Vector of web pages as source of rank
 Page Rank Computation:
R (v )
R(u ) = c 
 cE (u )
vBu N v
Page Rank: Implementation
 Stanford WebBase project  Complete crawling and indexing system of with
current repository 24 million web pages (old data)
 Store each URL as unique integer and each hyperlink as integer IDs
 Remove dangling links by iterative procedures
 Make initial assignment of the ranks
 Propagate page ranks in iterative manner
 Upon convergence, add the dangling links back and recompute the rankings
Page Rank: Results
 Google utilizes a number of factors to rank the search results:

proximity, anchor text, page rank
 The benefits of Page Rank are the greatest for underspecified queries, example:
‘Stanford University’ query using Page Rank lists the university home page
the first
Page Rank: Advantages
 Global ranking of all web pages – regardless of their content, based
solely on their location in web graph structure
 Higher quality search results – central, important, and authoritative
web pages are given preference
 Help find representative pages to display for a cluster center
 Other applications: traffic estimation, back-link predictor, user
navigation, personalized page rank
 Mining structure of web graph is very useful for various information
retrieval
CLEVER Method





CLient–side EigenVector-Enhanced Retrieval
Developed by a team of IBM researchers at IBM Almaden Research Centre
Continued refinements of HITS
Ranks pages primarily by measuring links between them
Basic Principles – Authorities, Hubs


Good hubs points to good authorities
Good authorities are referenced by good hubs
Problems Prior to CLEVER
 Textual content that is ignored leads to problems caused by some features of
web:



HITS returns good resources for more general topic when query topics are narrowly-focus
ed
HITS occasionally drifts when hubs discuss multiple topics
Usually pages from single Web site take over a topic and often use same html template t
herefore pointing to a single popular site irrelevant to query topic
CLEVER: Solution




Replacing the sums of Equation (1) and (2) of HITS with weighted sums
Assign to each link a non-negative weight
Weight depends on the query term and end point
Extension 1: Anchor Text
 using text that surrounds hyperlink definitions (href’s) in Web pages, often
referred as ‘anchor text’
 boost weight enhancements of links that occur near instances of query ter
ms
CLEVER: Solution (Cont’d)
 Extension 2: Mini Hub Pagelets
 breaking large hub into smaller units
 treat contiguous subsets of links as mini-hubs or ‘pagelets’
 contiguous sets of links on a hub page are more focused on single topic th
an the entire page
CLEVER: The Process
Starts by collecting a set of pages
Gathers all pages of initial link, plus any pages linking to them
Ranks result by counting links
Links have noise, not clear which pages are best
Recalculate scores
Pages with most links are established as most important, links transmit more w
eigh
 Repeat calculation no. of times till scores are refined






CLEVER: Advantages
 Used to populate categories of different subjects with minimal human assistanc
e
 Able to leverage links to fill category with best pages on web
 Can be used to compile large taxonomies of topics automatically
 Emerging new directions: Hypertext classification, focused crawling, mining com
munities
Connectivity Server Method
 Server that provides linkage information for all pages indexe
d by a search engine
 In its base operation, server accepts a query consisting of a
set of one or more URLs and return a list of all pages that
point to pages in (parents) and list of all pages that are po
inted to from pages in (children)
 In its base operation, it also provides neighbourhood graph fo
r query set
 Acts as underlying infrastructure, supports search engine app
lications
What’s Connectivity Server (Cont’d)
Neighborhood Graph
CONSERV: Web Structure Mining
 Finding Authoritative Pages (Search by topic)
(pages that is high in quality and relevant to the topic)
 Finding Related Pages (Search by URL)
(pages that address same topic as the original page, not necessarily semanticall
y identical)
Algorithms include Companion, Cocitation
CONSERV: Finding Related Page
CONSERV: Companion Algorithm
An extension to HITS algorithm
Features:
 Exploit not only links but also their order on a page
 Use link weights to reduce the influence of pages that all reside on one host
 Merge nodes that have a large number of duplicate links
 The base graph is structured to exclude grandparent nodes but include nodes t
hat share child
Companion Algorithm (Cont’d)
Four steps
1. Build a vicinity graph for u
2. Remove duplicates and near-duplicates in graph.
3. Compute link weights based on host to host connection
4. Compute a hub score and a authority score for each node in the gra
ph, return the top ranked authority nodes.
Companion Algorithm (Cont’d)
Building the Vicinity Graph
 Set up parameters: B : no of parents of u, BF : no of children per pa
rent, F : no of children of u, FB : no of parents per child
 Stoplist (pages that are unrelated to most queries and have a very
high in-degree)
 Procedure
Go Back (B) : choose parents (randomly)
Back-Forward(BF) : choose siblings (nearest)
Go Forward (F) : choose children (first)
Forward-Back(FB) : choose siblings (highest in-degree)
Companion Algorithm (Cont’d)
Remove duplicate
 Near-duplicate, if two nodes, each has more than 10 links and they have at l
east 95% of their links in common
 Replace two nodes with a node whose links are the union of the links of the
two nodes
(mirror sites, aliases)
Companion Algorithm (Cont’d)
Assign edge (link) weights
 Link on the same host has weight 0
 If there are K links from documents on a host to a single docume
nt on diff host, each link has an authority weight of 1/k
 If there are k links from a single document on a host to a set of
documents on diff host, give each link a hub weight of 1/k
(prevent a single host from having too much influence on the com
putation)
Companion Algorithm (Cont’d)
Compute
hub and
authority
scores
Extension of the
HITS algorithm
with edge
weights
Initialize all elements of the hub vector H to 1
Initialze all elements of the authority vector A to 1
While the vectors H and A have not converged:
For all nodes n in the vicinity graph N,
A[n] :=  (n',n)edges(N) H[n'] x authority_weight(n',n)
For all n in N,
H[n] :=  (n',n)edges(N) A[n'] x hub_weight(n',n)
Normalize the H and A vectors.
CONSERV: Cocitation Algorithm
 Two nodes are co-cited if they have a common parent
 The number of common parents of two nodes is their degree of co-citation
 Determine the related pages by looking for sibling nodes with the highest degr
ee of co-citation
 In some cases there is an insufficient level of cocitation to provide meaningful
results, chop off elements of URL, restart algorithm.
e.g. A.com/X/Y/Z

A.com/X/Y
Comparative Study
 Page Rank
(Google)



Assigns initial ranking and retains
them independently from queries
(fast)
In the forward direction from link to
link
Qualitative result
 Hub/Authority (CLEVER, C-Server)



Assembles different root set and
prioritizes pages in the context of
query
Looks forward and backward
direction
Qualitative result
Connectivity-Based Ranking
 Query-independent: gives an intrinsic quality score to a page
 Approach #1: larger number of hyperlinks pointing to a page, the better the
page


drawback?
each link is equally important
 Approach #2: weight each hyperlink proportionally to the quality of the page
containing the hyperlink
Query-dependent ConnectivityBased Ranking
 Carrier and Kazman
 For each query, build a subgraph of the link graph G limited to
pages on query topic
 Build the neighborhood graph
A start set S of documents matching query given by search engine
(~200)
2. Set augmented by its neighborhood, the set of documents that either
point to or are pointed to by documents in S (limit to ~50)
3. Then rank based on indegree
1.
Idea
 We desire pages that are relevant (in the neighborhood graph) and
authoritative
 As in page rank, not only the in-degree of a page p, but the quality of
the pages that point to p. If more important pages point to p, that
means p is more authoritative
 Key idea: Good hub pages have links to good authority pages
 given user query, compute a hub score and an authority score for each
document
 high authority score  relevant content
 high hub score  links to documents with relevant content
Improvements to Basic
Algorithm
 Put weights on edges to reflect importance of links, e.g., put higher
weight if anchor text associated with the link is relevant to query
 Normalize weights outgoing from a single source or coming into a
single sink. This alleviates spamming of query results
 Eliminate edges between same domain
Discovering Web communities
on the web
2017년 5월 22일
Web Mining
158
Introduction
 Introduction of the cyber-community
 Methods to measure the similarity of web pages on the web
graph
 Methods to extract the meaningful communities through the li
nk structure
What is cyber-community
 A community on the web is a group of web pages sharing a common intere
st
 Eg. A group of web pages talking about POP Music
 Eg. A group of web pages interested in data-mining
 Main properties:
 Pages in the same community should be similar to each other in conten
ts
 The pages in one community should differ from the pages in another co
mmunity
 Similar to cluster
Two different types of communiti
es
 Explicitly-defined communities

They are well known ones, such as the res eg.
ource listed by Yahoo!
Arts
Music
 Implicitly-defined communities

Classic
Painting
Pop
They are communities unexpected or invisi
ble to most users
eg. The group of web
pages interested in a
particular singer
Two different types of communiti
es
 The explicit communities are easy to identify

Eg. Yahoo!, InfoSeek, Clever System
 In order to extract the implicit communities, we need analyze the web-graph o
bjectively
 In research, people are more interested in the implicit communities
Similarity of web pages
 Discovering web communities is similar to clustering. For clustering, we must d
efine the similarity of two nodes
 A Method I:

For page and page B, A is related to B if there is a hyper-link from A to B, or fro
m B to A

Page
Not so good. Consider
theAhome page of IBM and Microsoft.
Page B
Similarity of web pages
 Method II (from Bibliometrics)

Co-citation: the similarity of A and B is measured by the numb
er of pages cite both A and B
Page A

Page B
Bibliographic coupling: the similarity of A and B is measure
d by the number of pages cited by both A and B.
Page A
Page B
Methods of clustering
 Clustering methods based on co-citation analysis:
 Methods derived from HITS (Kleinberg)

Using co-citation matrix
 All of them can discover meaningful communities
But their methods are very expensive to the whole World Wide
Web with billions of web pages.
A cheaper method
 The method from Ravi Kumar, Prabhakar Raghavan, Sridhar R
ajagopalan, Andrew Tomkins

IBM Almaden Research Center
 They call their method communities trawling (CT)
 They implemented it on the graph of 200 millions pages, it w
orked very well
Basic idea of CT
 Definition of communities
 dense directed bipartite
sub graphs
Bipartite graph: Nodes are par
titioned into two sets, F and C
 Every directed edge in the gr
aph is directed from a node u
in F to a node v in C
 dense if many of the possible
edges between F and C are pr
esent

Fans
Centers
F
C
Basic idea of CT
 Bipartite cores
 a complete bipartite subgraph with at le
ast i nodes from F and at least j nodes
from C
 i and j are tunable parameters
 A (i, j) Bipartite core
 Every community have such a core with a ce
rtain i and j.
A (i=3, j=3) bipartite core
Basic idea of CT
 A bipartite core is the identity of a community
 To extract all the communities is to enumerate all the bipar
tite cores on the web.
 Author invent an efficient algorithm to enumerate the bipar
tite cores. Its main idea is iterate pruning -- eliminatio
n-generation pruning
• Complete bipartite graph: there is an edge between
each node in F and each node in C
• (i,j)-Core: a complete bipartite graph with at least i
nodes in F and j nodes in C
• (i,j)-Core is a good signature for finding online
communities
•“Trawling”: finding cores
• Find all (i,j)-cores in the Web graph.
– In particular: find “fans” (or “hubs”) in the graph
– “centers” = “authorities”
– Challenge: Web is huge. How to find cores
efficiently?
Main idea: pruning
• Step 1: using out-degrees
– Rule: each fan must point to at least 6 different websites
– Pruning results: 12% of all pages (= 24M pages) are potential fans
– Retain only links, and ignore page contents
Step 2: Eliminate mirroring pages
 Many pages are mirrors (exactly the same page)
 They can produce many spurious fans
 Use a “shingling” method to identify and eliminate
duplicates
 Results:
 – 60% of 24M potential-fan pages are removed
 – # of potential centers is 30 times of # of potential fans
Step 4: Iterative pruning
 To find (i,j)-cores
– Remove all pages whose # of out-links is < i
– Remove all pages whose # of in-links is < j
– Do it iteratively
 Step 5: inclusion-exclusion pruning
 Idea: in each step, we
 – Either “include” a community”
 – Or we “exclude” a page from further contention
 Check a page x with j out-degree. x is a fan of a
(i,j)-core if:
 – There are i-1 fans point to all the forward
neighbors of x
 – This step can be checked easily using the index
on fans and centers
 Result: for (3,3)-cores, 5M pages remained
 Final step:
 – Since the graph is much smaller, we can afford
to “enumerate” the remaining cores
 Step 3: using in-degrees of pages
 Delete pages highly references, e.g., yahoo, altavista
 Reason: they are referenced for many reasons, not likely
forming an emerging community
 Formally: remove all pages with more than k inlinks (k =
50,for instance)
 Results:
– 60M pages pointing to 20M pages
– 2M potential fans
Weakness of CT
 The bipartite graph cannot suit all kinds of communities
 The density of the community is hard to adjust
Experiment on CT
 200 millions web pages
 IBM PC with an Intel 300MHz Pentium II processor, with 512M of memory, r
unning Linux
 i from 3 to 10 and j from 3 to 20
 200k potential communities were discovered
29% of them cannot be found in Yahoo!.
Summary
 Conclusion: The methods to discover communities from the web d
epend on how we define the communities through the link struct
ure
 Future works:

How to relate the contents to link structure
Web communities based on dense
bipartite graph patterns (WISE’01)
By Krishna Reddy and Masaru Kitsuregawa
Aim/Motivation
To find all the communities within a large
collection of web pages.
Proposed solution:
•Analyze linkage patterns
•Find DBG in the given collection of webpages
Definitions
Bipartite graph
A BG is a graph which can be partitioned into two non-empty sets T and I.
Every directed edge of BG joins a node in T to a node in I
Dense Bipartite graph
A DBG is a BG where each node of T establishes an edge with at least
alpha nodes of I and each node of I has atleast beta nodes as parents to
it
Community
The set T contains the members of the community if there exist a
DBG(T,I,alpha,beta) where alpha>= alpha_t and beta>=beta_t Where
alpha_t and beta_t > 0.
DBG(T,I,p,q)
p
q
a
s
b
t
c
u
d
Definitions
Cocite: Association among pages based on the
existence of common children (URL’s).
Relax Cocite: we allow u,v,w to group if
cocite(u,v) and cocite(v,w) are true.
a
b
p
a
p
c
c
q
d
e
d
e
r
b
q
i)
f
f
g
Algorithm
1.For a given URL find T(set of URL’s). Relax-cocite factor is 1.
a)While num_iterations<=n
 At a fixed relax-cocite factor value,find all w’s such that relax-cocite(w,y)
=true
 T= w U T
2. Community extraction

Input contains Page_set,outputDBG(T,I,alpha,beta)

Edge file has <p,q> where p is the parent of q.
Algorithm(contd…)
 For each P belongs to T,insert the edge<p,q> in edge_file if q belongs child(q).
 Sort edge file based on source.Prepare T1 with<source,freq>.Remove <p,q>
from edge_file if freq<alpha.
 Sort the edge_file based on destination.Prepare I1
with<q,freq>.Remove<p,q> from edgefile if freq<beta.
 The result is a DBG(T,I,alpha,beta).
Advantages/Disadvantages
 Extracts all DBG’s in a pageset.
 Community extracted is significantly large.
DISADV:
 Need a URL to start with.
 Community members need links to be a part of the community
Efficient Identification of
Web Communities
Gary William Flake, Steve Lawrence & C. Lee Giles
2017년 5월 22일
Web Mining
187
Presentation Structure
 Introduction


Motivation
Background
 Theory



or how they did it!
Definition
Algorithm
 Experimentation

or why they did it!
Results
Conclusions
or how did they do!
Motivation
 Exploding Web ~ 1,000,000,000 documents
 Search Engine Limitations



Crawling the web
Updating the web
Precision vs Recall
 Web Communities


Balanced Min Cut
Identification is NP-hard
16% Maximum Coverage!
Background
 Bibliometrics, Citation analysis, Social Networks
 Classical Clustering

eg. CORA
 HITS

hubs and authorities
s-t Max Flow & Min Cut
•Capacity weights
•Source & Sink
•Water In, Water Out!
G(V,E)
 Floyd & Fulkerson’s Max Flow = Min Cut Theorem
 Incremental Shortest Augmentation algorithm in poly-time
The Idea
 The Ideal Community C  V
Theorem1: A community C can be identified by calculating the s-t minimum cut
using appropriately chosen source and sink nodes.
 Proof by Contradiction
The Algorithm
1.
2.
3.
Choose Source(s) and Sink(s)
Generate G(V,E) using crawler
Find s-t Min Cut
•Virtual Sources & Sinks
•Choosing the Source
•Choosing the Sink
Source layers
Sink layers
Expectation Maximization
 Implementation Issues


Small size G(V,E) = low recall
Dependent on choice of source set
 Recurse over Algorithm

Community obtained in one iteration used as input to next iteration
 Termination not guaranteed
Experimental Results
 Testing neighborhoods …



Support Vector Machine (SVM)
The Internet Archive
Ronald Rivest
 Criterion



Precision & Recall
Seed set size
Running time
SVM Community
 Characterization


Recent: Not listed in any portal
Relatively small research community
 Seed Set

svm.first.gmd.de, svm.research.bell-labs.com, www.clrc.rhbnc.ac.uk/research/SVM, www.supportvector.net
 Performance


4 iterations of EM
11,000 URLs in the graph, 252 member web pages
Internet Archive Community
 Characterization
Large, internal communities
 Seed Set : 11 URLs
 Performance


2 iterations of EM
7,000 URLs, 289 web pages
Ronald Rivest Community
 Characterization

Community around an individual
 Seed set

http://theory.lcs.mit.edu/~rivest
 Performance



4 iterations of EM
38,000 URLs, 150 pages
Cormen’s pages as 1st and 3rd result
Summary
 Actual running time

1 sec on a 500 MHz Intel machine
 Max Flow Framework
 EM Approach
 Relevancy test
Applications
 Focused crawlers

Increased Precision & Coverage
 Automated population of portal categories

Recall Addressed
 Improved filtering


Keyword Spamming
Topical Pruning – eg. Pornography
Future Work
 Generalize the notion of Community

Parameterize with coupling factor


Low value, weakly connected communities
High value, highly connected communities
 Ideal community
 Co-learning and Co-boosting
References
 L. Page, S. Brin, "PageRank: Bringing order to the Web," Stanford Digital Libraries
working paper 1997-0072.
 Chakrabarti, Dom, Kumar, “Mining the link structure of the World Wide Web,” IEEE
Computer, 32(8), August 1999
 K. Bharat, A. Broder, “The Connectivity Server: Fast access to linkage information on
the Web.” In Ashman and Thistlewaite [2], pages 469--477. Brisbane, Australia, 1998
 B. Allan, “Finding Authorities and Hubs from Link Structures on the World Wide Web”,
ACM, May 2001
 Jeffrey Dean “Finding Related Pages in the World Wide Web”
http://citeseer.nj.nec.com/dean99finding.html
 A. Z. Border,… Graph structure in the web: experiments and models. Proc. 9th WWW C
onf., 2000.
 S. R. Kumar,… Trawling emerging cyber-communities automatically. Proc. 8th WWW Co
nf., 1999.
References





Principles of Data Mining, Hand, Mannila, Smyth. MIT Press, 2001.
Notes from Dr. M.V. Ramakrishna http://goanna.cs.rmit.edu.au/~rama/cs442/info.html
Notes from CS 395T: Large-Scale Data Mining, Inderjit Dhillon
http://www.cs.utexas.edu/users/inderjit/courses/dm2000.html
Link Analysis in Web Information Retreival, Monika Henzinger. Bulletin of the IEEE computer
Society Technical Committee on Data Engineering, 2000.
research.microsoft.com/research/db/debull/A00sept/henzinge.ps
slides from Data Mining: Concepts and Techniques, Jan and Kamber, Morgan Kaufman, 2001.
1.
2.
3.
J. Srivastava, R. Cooley, M. Deshpande, Pang-Ning Tan, Web Usage Mining: Discovery
and Applications of Usage Patterns from Web Data, SIGKDD Explorations, Vol. 1, Issue
2, 2000.
B. Mobasher, R. Cooley and J. Srivastava, Web Mining: Information and Pattern
Discovery on the World Wide Web, Proceedings of the 9th IEEE International
Conference on Tools with Artificial Intelligence (ICTAI'97), November 1997.
B. Mobasher, Namit Jain, Eui-Hong (Sam) Han, Jaideep Srivastava. Web Mining:
Pattern Discovery from World Wide Web Transactions. Technical Report TR 96-060,
University of Minnesota, Dept. of Computer Science, Minneapolis, 1996
4.
5.
6.
7.
8.
R. Cooley, P. N. Tan., and J. Srivastava. (1999). WebSIFT: the Web site
information filter system. In Proceedings of the 1999 KDD Workshop
on Web Mining, San Diego, CA. Springer-Verlag, in press.
R. W. Cooley, Web Usage Mining: Discovery and Application of
Interesting Patterns from Web data. PhD Thesis, Dept of Computer
Science, University of Minnesota, May 2000.
Cooley, R., Mobasher, B., and Srivastava, J. Web Mining: Information
and pattern Discovery on the World Wide Web. IEEE Computer, pages
558-566, 1997.
Etzioni, O. The world wide web: Quagmire or gold mine.
Communications of the ACM, 39(11):65-68, 1996.
Kosala, R. and Blockeel, H. Web Mining Research: A summary. SIGKDD
Explorations, 2(1):1-15, 2000.
 Fayyad, U., Djorgovski, S., and Weir, N. Automating the analysis and
cataloging of sky surveys. In Advances in Knowledge Discovery and Data
Mining, pages 471-493. AAAI Press, 1996.
 Langley, P. User modeling in adaptive interfaces. In Proceedings of the
Seventh International Conference on User Modeling, pages 357-370, 1999.
 Madria, S.K., Bhowmick, S.S., Ng, W.K., and Lim, E.-P. Research issues in
web data mining. In Proceedings of Data Warehousing and Knowledge
Discovery, First International Conference, DaWaK ‘99, pages 303-312, 1999.
 Masand, B. and Spiliopoulou, M. Webkdd-99: Work-shop on web usage
analysis and user profiling. SIGKDD Explorations, 1(2), 2000.
 Smyth, P., Fayyad, U.M., Burl, M.C., and Perona, P. Modeling subjective uncertainty in image
annotation. In Advances in Knowledge Discovery and Data Mining, pages 517-539, 1996.
 Spiliopoulou, M. Data mining for the web. In Principles of Data Mining and Knowledge
Discovery, Second European Symposium, PKDD ‘99, pages 588-589, 1999.
 Srivastava, J., Cooley, R., Deshpande, M., and Tan, P.-N. Web usage mining: Discovery and
applications of usage patterns from web data. SIGMOD Explorations, 1(2), 2000.
 Zaiane, O.R., Xin, M., and Han, J. Discovering Web access patterns and trends by applying
OLAP and data mining technology on Web logs. IEEE, pages 19-29, 1998.
Page Ranking
o
The PageRank Citation Ranking: Bringing Order to the Web (1998), Larry Page, Sergey Brin, R.
Motwani, T. Winograd, Stanford Digital Library Technologies Project..
o
Authoritative Sources in a Hyperlinked Environment (1998), Jon. Kleinberg, Journal of the ACM
o
The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) Sergey Brin, Lawrence P
age, Computer Networks and ISDN Systems.
o
Web Search Via Hub Synthesis (1998) Dimitris Achlioptas, Amos Fiat, Anna R. Karlin, Frank McS
herry.
o
What is this Page Known for? Computing Web Page Reputations (2000) Davood Rafiei, Alberto O
Mendelzon.

o

Link Analysis in Web Infromation Retrieval, Monika Henzinger. Bulletin of the IEEE computer So
ciety Technical Committee on Data Engineering, 2000.
Finding Authorities and Hubs From Link Structures on the World Wide Web, Allan Borodin, Gare
th O. Roberts, Jeffrey S. Rosenthal, Panayiotis Tsaparas, 2002.
 Web Communities and Classification




Enhanced hypertext categorization using hyperlinks (1998) Soumen Chakrabarti, Byron Dom, a
nd Piotr Indyk, Proceedings of SIGMOD-98, ACM International Conference on Management of Da
ta.
Automatic Resource list Compilation by Analyzing Hyperlink Structure and Associated Text (19
98) S. Chakrabarti, B. Dom, D. Gibson, J. Keinberg, P. Raghavan, and s. Rajagopalan, Proceedings of
the 7th International World Wide Web Conference.
Inferring Web Communities from Link Topology (1998) David Gibson, Jon Kleinberg, Prabhakar R
aghavan, UK Conference on Hypertext.
o
Trawling the web for emerging cyber-communities (1999) Ravi Kumar,
Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, WWW8 / C
omputer Networks.
o
Finding Related Pages in the World Wide Web (1999) Jeffrey Dean, Mon
ika R. Henzinger, WWW8 / Computer Networks.
o
A System for Collaborative Web Resource Categorization and Rankin
g Maxim Lifantsev.
 A Study of Approaches to Hypertext Categorization (2002) Yiming Yang,
Sean Slattery, Rayid Ghani, Journal of Intelligent Information Systems.
o
Hypertext Categorization using Hyperlink Patterns and Meta Data (200
1) Rayid Ghani, Sean Slattery, Yiming Yang.