Data Mining Algorithms
Web Mining
Web Mining Outline
Goal: Examine the use of data mining on
the World Wide Web
Introduction
Web Content Mining
Web Structure Mining
Web Usage Mining
Introduction
The Web is perhaps the single largest data
source in the world.
Web mining aims to extract and mine useful
knowledge from the Web.
A multidisciplinary field: data mining, machine
learning, natural language processing,
statistics, databases, information retrieval,
multimedia, etc.
Due to the heterogeneity and lack of structure
of Web data, mining is a challenging task.
Opportunities and Challenges
The amount of info on the Web is huge, and easily
accessible.
The coverage of Web info is very wide and diverse.
Info/data of almost all types exist on the Web, e.g.,
structured tables, texts, multimedia data, etc.
Much of the Web information is semi-structured
due to the nested structure of HTML code.
Much of the Web info is linked: hyperlinks exist among
pages within a site, and across different sites.
Much of the Web info is redundant. The same piece of
info, or variants of it, may appear in many pages.
Opportunities and Challenges
Web is noisy. A Web page typically contains many kinds of
info, e.g., main contents, advertisements, navigation
panels, copyright notices, etc.
Web consists of surface Web and deep Web.
– Surface Web: pages that can be browsed using a browser.
– Deep Web: can only be accessed through parameterized query interfaces.
Web is also about services.
Web is dynamic. Information on the Web changes
constantly. Keeping up with the changes and monitoring
the changes are important issues.
The Web is a virtual society. It is not only about data,
information and services, but also about interactions
among people, organizations and automatic systems, i.e.
communities.
Web Mining Other Issues
Size
– The Indexed Web contains at least 4.16 billion
pages (Thursday, 10 October, 2013).
The Dutch Indexed Web contains at least
188.34 million pages (Thursday, 10 October,
2013)
– Grows at about 1 million pages a day
– Google indexes > 45 billion documents
Diverse types of data
So it is not possible to warehouse the Web or apply
normal data mining directly
Web Data
Web pages
Intra-page structures (HTML, XML code)
Inter-page structures (actual linkage
structures between web pages)
Usage data
Supplemental data
– Profiles
– Registration information
– Cookies
Web Mining Taxonomy
Web Content Mining
– Extends work of basic search engines
Web Structure Mining
– Mine structure (links, graph) of the Web
Web Usage Mining
– Analyzes logs of Web access
Web mining applications include targeted
advertising, recommendation engines, CRM, etc.
Web Content Mining
Extends work of basic search engines
Web content mining: mining, extraction
and integration of useful data, information
and knowledge from Web page contents
Search Engines
– IR application, Keyword based, Similarity
between query and document
– Crawlers, Indexing
– Profiles
– Link analysis
Issues in Web Content Mining
Developing intelligent tools for IR
– Finding keywords and key phrases
– Discovering grammatical rules and
collocations
– Hypertext classification/categorization
– Extracting key phrases from text documents
– Learning extraction models/rules
– Hierarchical clustering
– Predicting relationships (between words)
Search Engine – Two Rank Functions
[Figure: search engine architecture with two rank functions. A Web page parser feeds an indexer, an anchor text generator, and a Web graph constructor. Relevance ranking computes similarity between query and document content using the inverted index, forward index, and term dictionary (lexicon); importance ranking performs link analysis over the Web topology graph, built from forward links, backward links (anchor text), metadata, and the URL dictionary.]
How do We Find Similar Web
Pages?
Content based approach
Structure based approach
Combining both content and structure
approaches
Relevance Ranking
• Inverted index
- A data structure for supporting text queries
- like the index in a book
[Figure: indexing disks with documents into an inverted index. Each term in the lexicon maps to a posting list of document IDs, e.g.:]
aalborg   → 3452, 11437, …
…
arm       → 4, 19, 29, 98, 143, …
armada    → 145, 457, 789, …
armadillo → 678, 2134, 3970, …
armani    → 90, 256, 372, 511, …
…
zz        → 602, 1189, 3209, …
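To make the structure concrete, here is a minimal Python sketch of an inverted index (the documents and query are invented for illustration): build a term-to-postings map, then answer a conjunctive query by intersecting posting lists.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def query_and(index, terms):
    """Documents containing all query terms (intersect posting lists)."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "web mining extracts knowledge",
        2: "data mining algorithms",
        3: "web usage mining analyses logs"}
index = build_inverted_index(docs)
print(query_and(index, ["web", "mining"]))   # -> [1, 3]
```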
Crawlers
Robot (spider) traverses the hypertext structure
in the Web.
Collect information from visited pages
Used to construct indexes for search engines
Traditional Crawler – visits entire Web and
replaces index
Periodic Crawler – visits portions of the Web and
updates subset of index
Incremental Crawler – selectively searches the
Web and incrementally modifies index
Focused Crawler – visits pages related to a
particular subject
Focused Crawler
Only visit links from a page if that page is
determined to be relevant.
Classifier is static after learning phase.
Components:
– Classifier which assigns relevance score to
each page based on crawl topic.
– Distiller to identify hub pages.
– Crawler visits pages based on classifier and
distiller scores.
Focused Crawler
The classifier relates documents to topics.
The classifier also determines how useful
outgoing links are.
Hub pages contain links to many relevant
pages; they must be visited even if they do not
have a high relevance score.
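A minimal sketch of this crawl loop in Python. The `score` and `links` callables are stand-ins for the topic classifier and the page fetcher/parser (assumptions here, not part of the slides); a real crawler would fetch pages over HTTP and honor robots.txt.

```python
import heapq

def focused_crawl(seeds, score, links, threshold=0.5, budget=100):
    """Best-first crawl: always expand the most promising known page.

    score(url) -> relevance in [0, 1]   (stand-in for the topic classifier)
    links(url) -> iterable of out-URLs  (stand-in for fetch + parse)
    """
    frontier = [(-score(u), u) for u in seeds]    # max-heap via negation
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        neg_s, url = heapq.heappop(frontier)
        if -neg_s < threshold:
            continue                               # prune irrelevant branches
        visited.append(url)
        for out in links(url):
            if out not in seen:
                seen.add(out)
                heapq.heappush(frontier, (-score(out), out))
    return visited

# Toy usage over an in-memory "web":
web = {"seed": ["good", "bad"], "good": ["better"], "bad": [], "better": []}
rel = {"seed": 0.9, "good": 0.8, "bad": 0.1, "better": 0.7}
print(focused_crawl(["seed"], rel.get, web.get))  # -> ['seed', 'good', 'better']
```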
Virtual Web View
Multiple Layered DataBase (MLDB) built on top
of the Web.
Each layer of the database is more generalized
(and smaller) and centralized than the one
beneath it.
Upper layers of MLDB are structured and can be
accessed with SQL type queries.
Translation tools convert Web documents to XML.
Extraction tools extract desired information to
place in first layer of MLDB.
Higher levels contain more summarized data
obtained through generalizations of the lower
levels.
Multilevel Databases
[Figure: the lowest MLDB layer holds raw Web data of many types — text, image, audio, video, maps, games.]
Levels of an MLDB
Layer 0:
– Unstructured, massive and global information base.
Layer 1:
– Derived from lower layers.
– Relatively structured.
– Obtained by data analysis, transformation and
generalization.
Higher Layers (Layer n):
– Further generalization to form smaller, better structured
databases for more efficient retrieval.
Web Query System
These systems attempt to make use of:
– Standard database query language – SQL
– Structural information about Web documents
– Natural language processing for queries made in
WWW searches
Examples:
– WebLog: restructuring extracted information from
Web sources.
– W3QL: combines structure queries (organization of
hypertext) and content queries (information retrieval
techniques).
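Since the upper layers of an MLDB are said to be accessible with SQL-type queries, here is a toy illustration of what such a query could look like. The table name, columns, and rows below are invented for the sketch and do not come from any actual MLDB system.

```python
import sqlite3

# Hypothetical layer-1 relation summarizing Web documents.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE doc_layer1
                (url TEXT, topic TEXT, num_links INTEGER, last_modified TEXT)""")
conn.executemany("INSERT INTO doc_layer1 VALUES (?, ?, ?, ?)",
    [("http://a.example/x", "data mining", 12, "2013-10-01"),
     ("http://b.example/y", "databases",    3, "2013-09-15")])

# SQL-type query over the structured upper layer.
for (url,) in conn.execute(
        "SELECT url FROM doc_layer1 WHERE topic = ? AND num_links > 5",
        ("data mining",)):
    print(url)
```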
Architecture of a Global MLDB
[Figure: data from Source 1 … Source n is extracted into the MLDB and generalized, guided by concept hierarchies, into progressively higher levels of generalized data; the architecture supports resource discovery (MLDB) and knowledge discovery.]
Personalization
Web access or contents tuned to better fit the
desires of each user.
Manual techniques identify user’s preferences
based on profiles or demographics.
Collaborative filtering identifies preferences
based on ratings from similar users.
Content based filtering retrieves pages
based on similarity between pages and user
profiles.
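A minimal sketch of user-based collaborative filtering as described above: predict a user's interest in a page from the ratings of users with similar rating vectors. The ratings data is made up for illustration.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[p] * v[p] for p in common)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def predict(ratings, user, page):
    """Similarity-weighted average of other users' ratings for the page."""
    num = den = 0.0
    for other, r in ratings.items():
        if other != user and page in r:
            s = cosine(ratings[user], r)
            num += s * r[page]
            den += abs(s)
    return num / den if den else None

ratings = {"alice": {"p1": 5, "p2": 3},
           "bob":   {"p1": 4, "p2": 2, "p3": 5},
           "carol": {"p1": 1, "p3": 2}}
print(predict(ratings, "alice", "p3"))   # leans toward bob's high rating
```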
Applications
ShopBot
Bookmark Organizer
Recommender Systems
Intelligent Search Engines
Document Classification
Supervised Learning
– Supervised learning is a machine learning
technique for creating a function from training data.
– Documents are categorized
– The output can predict a class label of the input object
(called classification).
Techniques used are
– Nearest Neighbor Classifier
– Feature Selection
– Decision Tree
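A minimal sketch of the nearest-neighbor classifier listed above, using bag-of-words cosine similarity; the training documents and labels are toy examples.

```python
from collections import Counter
from math import sqrt

def vec(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    return dot / (sqrt(sum(x * x for x in a.values())) *
                  sqrt(sum(x * x for x in b.values())) or 1.0)

def knn_classify(train, text, k=3):
    """Majority label among the k most similar training documents."""
    q = vec(text)
    sims = sorted(((cosine(q, vec(t)), label) for t, label in train),
                  reverse=True)[:k]
    return Counter(label for _, label in sims).most_common(1)[0][0]

train = [("pagerank link analysis web graph", "structure"),
         ("server log session click stream", "usage"),
         ("hyperlink hub authority rank", "structure")]
print(knn_classify(train, "hub and authority scores from links"))
# -> 'structure'
```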
Feature Selection
Removes terms in the training documents which
are statistically uncorrelated with the class labels
Simple heuristics:
– Remove stop words like "a", "an", "the", etc.
– Discard "too frequent" and "too rare" terms,
using empirically chosen thresholds
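A minimal sketch of these heuristics; the stop-word list and thresholds below are illustrative choices, not values from the slides.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "to", "and", "is"}

def select_features(docs, min_df=2, max_df_ratio=0.8):
    """Keep terms that are not stop words, not too rare, not too frequent."""
    df = Counter()                          # document frequency per term
    for text in docs:
        df.update(set(text.lower().split()))
    max_df = max_df_ratio * len(docs)
    return {t for t, c in df.items()
            if t not in STOP_WORDS and min_df <= c <= max_df}

docs = ["the web is a graph of pages",
        "mining the web graph",
        "usage logs of the web"]
print(select_features(docs))
# -> {'graph'}: 'web' is too frequent here; 'the'/'a'/'of'/'is' are stop words
```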
Document Clustering
Unsupervised learning: a data set of input objects
is gathered.
Goal: evolve measures of similarity to cluster a
collection of documents/terms into groups, such
that similarity within a cluster is larger than across
clusters.
Hypothesis: given a 'suitable' clustering of a
collection, if the user is interested in document/term
d/t, he is likely to be interested in other members of
the cluster to which d/t belongs.
Hierarchical
– Bottom-Up
– Top-Down
Partitional
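A minimal sketch of the bottom-up (agglomerative) variant: start with singleton clusters and repeatedly merge the two most similar ones. The similarity function is passed in; here numbers stand in for documents purely for illustration.

```python
def agglomerative(items, sim, k):
    """Bottom-up clustering: repeatedly merge the two most similar clusters
    (single-link: cluster similarity = best pairwise item similarity)."""
    clusters = [[x] for x in items]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(sim(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)     # merge the closest pair
    return clusters

# Toy usage: cluster numbers by closeness (stand-in for document similarity).
print(agglomerative([1, 2, 10, 11, 50], sim=lambda a, b: -abs(a - b), k=3))
# -> [[1, 2], [10, 11], [50]]
```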
Semi-Supervised Learning
A collection of documents is available
A subset of the collection has known labels
Goal: to label the rest of the collection.
Approach
– Train a supervised learner using the labeled subset.
– Apply the trained learner on the remaining
documents.
Idea
– Harness information in the labeled subset to enable
better learning.
– Also, check the collection for emergence of new
topics
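A minimal self-training sketch of this approach. The `classify` and `confidence` callables are assumed stand-ins for any supervised learner (such as the k-NN classifier above) and its confidence estimate; neither name comes from the slides.

```python
def self_train(labeled, unlabeled, classify, confidence, threshold=0.8):
    """Label the unlabeled pool iteratively: each round, move the documents
    the current model is most confident about into the training set.

    classify(labeled, text)   -> predicted label  (any supervised learner)
    confidence(labeled, text) -> score in [0, 1]  (how sure the model is)
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    while pool:
        sure = [(confidence(labeled, t), t) for t in pool
                if confidence(labeled, t) >= threshold]
        if not sure:
            break                          # nothing confident left; stop
        for _, text in sure:
            labeled.append((text, classify(labeled, text)))
            pool.remove(text)
    return labeled, pool                   # pool holds still-unlabeled leftovers
```

Documents that never reach the confidence threshold remain unlabeled; inspecting those leftovers is one simple way to check the collection for emerging new topics.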
Web-Structure Mining
Generate a structural summary about the
Web site and Web pages:
• Depending upon the hyperlinks, categorizing
the Web pages and the related information at the
inter-domain level.
• Discovering the Web page structure.
• Discovering the nature of the hierarchy of
hyperlinks in the website and its structure.
Web-Structure Mining
cont…
Finding information about Web pages:
Retrieving information about the relevance and the
quality of the Web page.
Finding the authoritative pages on the topic and content.
Inference on hyperlinks:
The Web page contains not only information but also
hyperlinks, which carry a huge amount of
annotation.
A hyperlink identifies the author's endorsement of the
other Web page.
Web Structure Mining
Mine structure (links, graph) of the Web
Techniques
– PageRank
– CLEVER
Create a model of the Web organization.
May be combined with content mining to
more effectively retrieve important pages.
Web as a Graph
Web pages as nodes of a graph.
Links as directed edges.
[Figure: example graph with three pages — 'my page', www.vesit.edu, and www.google.com — as nodes, and the hyperlinks between them as directed edges.]
Link Structure of the Web
Forward links (out-edges).
Backward links (in-edges).
Approximation of importance/quality: a
page may be of high quality if it is referred
to by many other pages, and by pages of
high quality.
Authorities and Hubs
Authority is a page which has relevant
information about the topic.
Hub is a page which has collection of links
to pages about that topic.
[Figure: a hub page h pointing to authorities a1–a4.]
PageRank
Introduced by Brin and Page (1998).
Mine hyperlink structure of web to produce
‘global’ importance ranking of every web
page.
Used in Google Search Engine.
Web search results are returned in rank
order.
Treats links like academic citations.
Assumption: highly linked pages are more
'important' than pages with few links.
PageRank
Used by Google
Prioritize pages returned from search by
looking at Web structure.
Importance of page is calculated based
on number of pages which point to it –
Backlinks.
Weighting is used to give more
importance to backlinks coming from
important pages.
PageRank: Main Idea
A page has a high rank if the sum of the
ranks of its back-links is high.
Google utilizes a number of factors to rank
the search results:
– proximity, anchor text, page rank
The benefits of PageRank are greatest
for underspecified queries; for example, for
the query 'Mumbai University', PageRank lists
the university home page first.
Basic Idea
Back-links coming from important pages
convey more importance to a page.
For example, if a web page has a link from
the yahoo home page, it may be just one
link but it is a very important one.
A page has high rank if the sum of the
ranks of its back-links is high.
This covers both the case when a page
has many back-links and when a page has
a few highly ranked back-links.
Definition
A page's rank is the sum of the ranks of the
pages pointing to it, each divided by that
page's number of outgoing links:

Rank(u) = Σ_{v ∈ Bu} Rank(v) / Nv

Bu = set of pages with links to u
Nv = number of links from v
Simplified PageRank Example
In the simplified model, Rank(u) = c · Σ_{v ∈ Bu} Rank(v) / Nv,
where c is a normalization constant (c < 1, to cover for
pages with no outgoing links).
[Figure: a small link graph with the resulting simplified ranks.]
Expanded Definition
R(u): page rank of page u
c: factor used for normalization (<1)
Bu: set of pages pointing to u
Nv: outbound links of v
R(v): page rank of site v that points to u
E(u): distribution of Web pages to which a random
surfer periodically jumps (set to 0.15)

R(u) = c · Σ_{v ∈ Bu} R(v) / Nv + c · E(u)
Problem 1 - Rank Sink
A cycle of pages is pointed to by some incoming link.
The loop will accumulate rank but never distribute it.
Problem 2 - Dangling Links
In general, many Web pages do not have either back links or forward links.
Dangling links do not affect the ranking of any other page directly, so they
are removed until all the PageRanks are calculated.
PageRank (cont’d)
PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)
– PR(i): PageRank for a page i which
points to target page p.
– Ni: number of links coming out of page i
HITS
Hyperlink-Induced Topic Search
Based on a set of keywords, find set of relevant
pages – R.
Identify hub and authority pages for these.
– Expand R to a base set, B, of pages linked to or from R.
– Calculate weights for authorities and hubs.
Pages with highest ranks in R are returned.
Authorities and Hubs (cont.)
Good hubs are the ones that point to good
authorities.
Good authorities are the ones that are
pointed to by good hubs.
[Figure: hubs h1–h5 pointing to authorities a1–a5.]
Finding Authorities and Hubs
First, construct a focused sub-graph of the
www.
Second, compute Hubs and Authorities
from the sub-graph.
Construction of Sub-graph
[Figure: constructing the sub-graph — a topic query is submitted to a search engine to obtain a root set of pages; a crawler then follows forward and backward links to grow the root set into the expanded set.]
Root Set and Base Set
Use the query term to collect a root set of
pages from a text-based search engine
(Lycos, AltaVista).
Root Set and Base Set (cont.)
Expand the root set into a base set by
including (up to a designated size cut-off):
– All pages linked to by pages in the root set
– All pages that link to a page in the root set
Hubs & Authorities
Calculation
Iterative algorithm on Base Set: authority weights a(p), and hub
weights h(p).
– Set authority weights a(p) = 1, and hub weights h(p) = 1 for
all p.
– Repeat following two operations
(and then re-normalize a and h to have unit norm):
a(p) = Σ_{q → p} h(q)   (sum of the hub weights of all pages q pointing to p)

h(p) = Σ_{p → q} a(q)   (sum of the authority weights of all pages q that p points to)
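A minimal sketch of these two update rules on a toy base set; the graph below is invented for illustration.

```python
from math import sqrt

def hits(graph, iters=20):
    """graph: {page: [pages it links to]} over the base set.
    Returns (authority, hub) score dicts."""
    pages = set(graph) | {q for outs in graph.values() for q in outs}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iters):
        # a(p) = sum of h(q) over pages q that point to p
        auth = {p: sum(hub[q] for q in graph if p in graph[q])
                for p in pages}
        # h(p) = sum of a(q) over pages q that p points to
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        for d in (auth, hub):                 # re-normalize to unit norm
            norm = sqrt(sum(v * v for v in d.values())) or 1.0
            for p in d:
                d[p] /= norm
    return auth, hub

graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2", "a3"], "h3": ["a2"]}
auth, hub = hits(graph)
print(max(auth, key=auth.get), max(hub, key=hub.get))   # -> a2 h2
```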
Example
[Figure: worked example — after initialization and normalization, each node has hub weight 0.45 and authority weight 0.45.]
Example (cont.)
[Figure: worked example after a further update — the weights diverge across nodes (values such as hub 0.9, authority 0.45, and pairs 0.45/0.9 and 1.35/0.9 appear).]
Algorithmic Outcome
Applying iterative multiplication (power
iteration) to any "non-degenerate" initial
vector converges to the principal eigenvector.
Hubs and authorities emerge as the outcome
of this process.
The principal eigenvector contains the
highest-scoring hubs and authorities.
Results
Although HITS is only link-based (it
completely disregards page content), results
are quite good on many tested queries.
Starting from a narrow topic, HITS tends to end
up in a more general one.
A peculiarity of hub pages: their many links can
cause the algorithm to drift, since they can point
to authorities in different topics.
Pages from a single domain/website can
dominate the result if they all point to one page,
which is not necessarily a good authority.
Possible Enhancements
Use weighted sums for link calculation.
Take advantage of “anchor text” - text
surrounding link itself.
Break hubs into smaller pieces. Analyze each
piece separately, instead of whole hub page as
one.
Disregard or minimize influence of links inside
one domain.
IBM expanded HITS into CLEVER, though it
was not seen as a viable real-time search engine.
CLEVER Method
CLient-side EigenVector-Enhanced Retrieval
Developed by a team of IBM researchers at the IBM
Almaden Research Centre
Ranks pages primarily by measuring links
between them
A continued refinement of HITS (Hypertext
Induced Topic Selection)
Basic principles – authorities and hubs:
– Good hubs point to good authorities
– Good authorities are referenced by good hubs
http://www.almaden.ibm.com/projects/clever.shtml
Problems Prior to CLEVER
Ignoring textual content leads to problems
caused by some features of the Web:
– HITS returns good resources for a more general
topic when the query topic is narrowly focused
– HITS occasionally drifts when hubs discuss
multiple topics
– Pages from a single Web site usually take over
a topic, and often use the same HTML template,
therefore pointing to a single popular site
irrelevant to the query topic
http://www.almaden.ibm.com/projects/clever.shtml
CLEVER: Solution
Extension 1: Anchor Text
– use the text that surrounds hyperlink definitions (href's)
in Web pages, often referred to as 'anchor text'
– boost the weights of links that occur near
instances of query terms
Extension 2: Mini Hubs/Pagelets
– break large hubs into smaller units
– treat contiguous subsets of links as mini-hubs or
'pagelets'
– contiguous sets of links on a hub page are more
focused on a single topic than the entire page
http://www.almaden.ibm.com/projects/clever.shtml
CLEVER: The Process
Starts by collecting a set of pages
Gathers all pages of the initial link set, plus any
pages linking to them
Ranks the results by counting links
Links are noisy; it is not yet clear which pages are
best
Recalculates the scores
Pages with the most links are established as most
important; their links transmit more weight
Repeats the calculation a number of times until the
scores are refined
http://www.almaden.ibm.com/projects/clever.shtml
CLEVER
Identify authoritative and hub pages.
Authoritative Pages :
– Highly important pages.
– Best source for requested information.
Hub Pages :
– Contain links to highly important pages.
CLEVER
The CLEVER algorithm is an extension of standard
HITS and provides an appropriate solution to the
problems that result from standard HITS.
CLEVER assigns a weight to each link based on the
terms of the queries and end-points of the link.
It combines anchor text to set weights to the links as
well.
Moreover, it breaks large hub pages into smaller
units so that each unit is focused on a single
topic.
Finally, when a large number of pages come from a
single domain, it scales down their weights to keep
any one site from dominating the ranking.
PageRank vs. HITS

PageRank (Google):
– computed for all Web pages stored in the database, prior to the query
– computes authorities only
– trivial and fast to compute

HITS (CLEVER):
– performed on the set of retrieved Web pages, for each query
– computes both authorities and hubs
– easy to compute, but real-time execution is hard
Web Usage Mining
Performs mining on Web Usage data or
Web Logs
A Web log is a listing of page reference
data, also called a click stream.
It can be analyzed from the server
perspective (better Web site design)
or from the client perspective (prefetching
of Web pages, etc.)
Web Usage Mining
Applications
Personalization
Improve structure of a site’s Web pages
Aid in caching and prediction of future page
references
Improve design of individual pages
Improve effectiveness of e-commerce (sales
and advertising)
Improve web server performance (Load
Balancing)
Web Usage Mining Activities
Preprocessing Web log
– Cleanse
– Remove extraneous information
– Sessionize
Session: sequence of pages referenced by one user at a sitting (see the sessionizing sketch after this list).
Pattern Discovery
– Count patterns that occur in sessions
– A pattern is a sequence of page references in a session.
– Similar to association rules
Transaction: session
Itemset: pattern (or subset)
Order is important
Pattern Analysis
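A minimal sketch of the sessionizing step: group log records by user and split a user's click stream wherever the gap between consecutive requests exceeds a timeout. The 30-minute timeout is a common convention assumed here, not something the slides specify, and the log tuples are invented.

```python
from collections import defaultdict

def sessionize(records, timeout=1800):
    """records: (user_id, timestamp_seconds, page) tuples.
    Returns (user, page_sequence) sessions, split on gaps > timeout."""
    by_user = defaultdict(list)
    for user, ts, page in sorted(records, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, page))
    sessions = []
    for user, hits in by_user.items():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, page) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:       # long silence: new session
                sessions.append((user, current))
                current = []
            current.append(page)
        sessions.append((user, current))
    return sessions

log = [("u1", 0, "/home"), ("u1", 60, "/products"),
       ("u1", 4000, "/home"), ("u2", 10, "/faq")]
print(sessionize(log))
# -> [('u1', ['/home', '/products']), ('u1', ['/home']), ('u2', ['/faq'])]
```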
Web Usage Mining Issues
Identification of exact user not possible.
Exact sequence of pages referenced by a
user not possible due to caching.
Session not well defined
Security, privacy, and legal issues
Web Usage Mining - Outcome
Association rules
– Find pages that are often viewed
together
Clustering
– Cluster users based on browsing patterns
– Cluster pages based on content
Classification
– Relate user attributes to patterns
Web Log Cleansing
Replace source IP address with unique
but non-identifying ID.
Replace exact URL of pages referenced
with unique but non-identifying ID.
Delete error records and records that
contain no page data (such as figures
and code)
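A minimal sketch of these cleansing steps; the log format, hash-based pseudonymization, and suffix list are illustrative assumptions rather than a prescribed procedure.

```python
import hashlib

NON_PAGE = (".gif", ".jpg", ".png", ".css", ".js")

def cleanse(entries, salt="secret"):
    """entries: (ip, url, status) tuples. Pseudonymize IPs, drop error
    records and non-page requests (images, code, stylesheets)."""
    out = []
    for ip, url, status in entries:
        if status >= 400 or url.lower().endswith(NON_PAGE):
            continue
        # Unique but non-identifying ID in place of the source IP.
        uid = hashlib.sha256((salt + ip).encode()).hexdigest()[:8]
        out.append((uid, url))
    return out

log = [("10.0.0.1", "/index.html", 200),
       ("10.0.0.1", "/logo.png", 200),
       ("10.0.0.2", "/missing.html", 404)]
print(cleanse(log))   # only the /index.html request survives, with hashed ID
```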
Data Structures
Keep track of patterns identified during
Web usage mining process
Common techniques:
– Trie
– Suffix Tree
– Generalized Suffix Tree
– WAP Tree
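As an illustration of the simplest of these, here is a minimal trie sketch that counts how many sessions share each page-sequence prefix; the sessions are invented.

```python
def build_trie(sessions):
    """Trie over page sequences; each node counts how many sessions
    passed through that prefix."""
    root = {"count": 0, "children": {}}
    for pages in sessions:
        node = root
        node["count"] += 1
        for page in pages:
            node = node["children"].setdefault(
                page, {"count": 0, "children": {}})
            node["count"] += 1
    return root

def support(trie, pattern):
    """How many sessions start with the given page sequence."""
    node = trie
    for page in pattern:
        node = node["children"].get(page)
        if node is None:
            return 0
    return node["count"]

sessions = [["/home", "/products", "/buy"],
            ["/home", "/products"],
            ["/home", "/faq"]]
trie = build_trie(sessions)
print(support(trie, ["/home", "/products"]))   # -> 2
```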
Web Usage Mining – Three Phases
http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf
Phase 1: Pre-processing
Converts the raw data into the data
abstraction necessary for applying the
data mining algorithms:
– Mapping the log data into relational
tables before an adapted data mining
technique is performed.
– Using the log data directly by utilizing
special pre-processing techniques.
Raw data – Web log
Click stream: a sequential series of page
view requests.
User session: a delimited set of user clicks
(click stream) across one or more Web
servers.
Server session (visit): a collection of user
clicks to a single Web server during a user
session.
Episode: a subset of related user clicks
that occur within a user session.
Phase 2: Pattern Discovery
Pattern discovery uses techniques
such as statistical analysis,
association rules, clustering,
classification, sequential patterns, and
dependency modeling.
Phase 3: Pattern Analysis
A process to gain knowledge about how
visitors use a Web site in order to:
– Prevent disorientation and help designers
place important information/functions exactly
where the visitors look for them, and in the way
users need it.
– Build an adaptive Web site server.
Techniques for Web Usage Mining
Construct a multidimensional view on the Web-log database:
– Perform multidimensional OLAP analysis to find the top
N users, top N accessed Web pages, most frequently
accessed time periods, etc.
Perform data mining on Web-log records:
– Find association patterns, sequential patterns, and
trends of Web access
– May need additional information, e.g., user browsing
sequences of the Web pages in the Web server buffer
Conduct studies to:
– Analyze system performance; improve system design
by Web caching, Web page prefetching, and Web page
swapping
Software for Web Usage Mining
WEBMINER:
– introduces a general architecture for Web usage
mining, automatically discovering association rules
and sequential patterns from server access logs.
– proposes an SQL-like query mechanism for querying
the discovered knowledge in the form of association
rules and sequential patterns.
WebLogMiner:
– the Web log is filtered to generate a relational database
– data mining is performed on the Web-log data cube and
Web-log database
WEBMINER
SQL-like query mechanism
A framework for Web mining:
– Association rules (using the Apriori algorithm):
40% of clients who accessed the Web page with
URL /company/products/product1.html also
accessed /company/products/product2.html
– Sequential patterns:
60% of clients who placed an online order in
/company/products/product1.html also placed
an online order in
/company/products/product4.html within 15
days
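This is not WEBMINER's actual implementation, but a minimal sketch of the kind of association statistic in the first example above: count page pairs that co-occur in sessions and report the rule's confidence. The sessions are invented.

```python
from itertools import combinations
from collections import Counter

def pair_rules(sessions, min_conf=0.4):
    """Rules 'X -> Y' with confidence = support({X, Y}) / support({X})."""
    page_count = Counter()
    pair_count = Counter()
    for s in sessions:
        pages = set(s)
        page_count.update(pages)
        pair_count.update(combinations(sorted(pages), 2))
    rules = []
    for (x, y), c in pair_count.items():
        for a, b in ((x, y), (y, x)):      # consider both rule directions
            conf = c / page_count[a]
            if conf >= min_conf:
                rules.append((a, b, conf))
    return rules

sessions = [{"product1.html", "product2.html"},
            {"product1.html", "product2.html", "faq.html"},
            {"product1.html"}]
for a, b, conf in pair_rules(sessions):
    print(f"{a} -> {b}  (confidence {conf:.0%})")
# e.g. product1.html -> product2.html  (confidence 67%)
```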
WebLogMiner
Database construction from the server log
file:
– data cleaning
– data transformation
Multi-dimensional Web-log data cube
construction and manipulation
Data mining on the Web-log data cube and
Web-log database
Mining the World-Wide Web
Design of a Web Log Miner:
– The Web log is filtered to generate a relational database
– A data cube is generated from the database
– OLAP is used to drill-down and roll-up in the cube
– OLAM is used for mining interesting knowledge
[Figure: pipeline — Web log → (1) data cleaning → database → (2) data cube creation → data cube → (3) OLAP → sliced and diced cube → (4) mining → knowledge.]