Extracting knowledge from the
World Wide Web
Monika Henzinger and Steve Lawrence
Google Inc.
Presented by Murat Şensoy
Objective
The World Wide Web provides an exceptional
opportunity to automatically analyze a large
sample of interests and activity in the world.
But how do we extract knowledge from the web?
The Challenge: The distributed and heterogeneous nature of the web makes large-scale analysis difficult.
Objective
The paper provides an overview of recent methods for:
- Sampling the Web
- Analyzing and Modeling Web Growth
Sampling the Web
- Due to the sheer size of the Web, even simple statistics about it are unknown.
- The ability to sample web pages or web servers uniformly at random is very useful for determining such statistics.
- The question is: how can we sample the Web uniformly?
Sampling the Web
Two famous sampling methods for the Web are:
- Random Walk
- IP Address Sampling
Sampling the Web with Random Walk
Main Idea: Visit pages with a probability proportional to their PageRank value, then sample the visited pages with a probability inversely proportional to their PageRank value. Thus, the probability that a page is sampled is a constant, independent of the page.
PageRank
PageRank has several definitions.
Google's creators Brin and Page published the definition of PageRank as used in Google:
Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", in Proceedings of the 7th International World Wide Web Conference, pp. 107–117, 1998.
PageRank
PageRank has another definition based on a random walk:
- The initial page is chosen uniformly at random from all pages.
- Suppose the walk is at page p at a given time step.
- With probability d, follow an out-link of p.
- With probability 1-d, jump to a page selected uniformly at random from all pages.
The PageRank of a page p is the fraction of steps that the walk spends at p, in the limit.
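As a concrete illustration, here is a minimal sketch of this random-surfer definition on a toy three-page graph; the graph, the damping factor d = 0.85, and the step count are illustrative assumptions, not values from the paper.

```python
import random
from collections import Counter

# Toy adjacency list: page -> out-links (an illustrative assumption).
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
pages = list(graph)
d = 0.85          # probability of following an out-link
steps = 100_000

visits = Counter()
page = random.choice(pages)                # initial page chosen at random
for _ in range(steps):
    visits[page] += 1
    if random.random() < d and graph[page]:
        page = random.choice(graph[page])  # follow a random out-link of the page
    else:
        page = random.choice(pages)        # jump to a uniformly random page
# The fraction of steps spent at each page approximates its PageRank.
print({p: visits[p] / steps for p in pages})
```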
PageRank
Two problems arise in the implementation:
- The random walk already assumes that it can find a random page on the web, which is exactly the problem we want to solve.
- Many hosts on the web have a large number of links within the same host and very few links leaving them.
PageRank
Henzinger proposed and implemented a modified random walk:
- Given a set of initial pages.
- Choose the start page randomly from the initial pages.
- Suppose the walk is at page p at a given time step.
- With probability d, follow an out-link of p.
- With probability 1-d, select a random host among the visited hosts, then jump to a randomly selected page out of all pages visited on this host so far.
- All pages in the initial set are also considered visited.
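A sketch of this modified walk, assuming hypothetical out_links(page) and host_of(page) helpers; a real implementation would fetch and parse live pages.

```python
import random
from collections import defaultdict

def modified_walk(initial_pages, out_links, host_of, d=0.85, steps=10_000):
    """Modified random walk; out_links and host_of are assumed callables."""
    visited_by_host = defaultdict(set)
    for p in initial_pages:                  # initial set counts as visited
        visited_by_host[host_of(p)].add(p)
    page = random.choice(initial_pages)      # random start page
    trail = []
    for _ in range(steps):
        trail.append(page)
        visited_by_host[host_of(page)].add(page)
        links = out_links(page)
        if random.random() < d and links:
            page = random.choice(links)      # follow a random out-link
        else:
            host = random.choice(list(visited_by_host))        # random visited host
            page = random.choice(list(visited_by_host[host]))  # random page seen there
    return trail
```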
Sampling the Web with Random Walk
The modified random walk visits a page with probability
approximately proportional to its PageRank value.
Afterward, the visited pages are sampled with probability
inversely proportional to their PageRank value.
Thus, the probability that a page is sampled is a constant
independent of the page.
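The second step can be implemented as simple rejection sampling; a minimal sketch, assuming pr maps visited pages to PageRank estimates and the constant c is no larger than the smallest estimate.

```python
import random

def subsample(visited, pr, c):
    # Accept page p with probability c / pr[p] (requires c <= min(pr.values())),
    # so acceptance is inversely proportional to PageRank and the overall
    # probability of sampling any given page is roughly constant.
    return [p for p in visited if random.random() < c / pr[p]]
```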
Sampling the Web with Random Walk
An example of statistics generated using this approach appears in the original slide (figure not reproduced here).
Sampling the Web with IP Address Sampling
- IPv4 addresses: 4 bytes
- IPv6 addresses: 16 bytes
There are about 4.3 billion possible IPv4 addresses.
IP address sampling is an approach based on randomly sampling IP addresses and testing for a web server at the standard port (HTTP: 80 or HTTPS: 443).
This approach works only for IPv4: the IPv6 address space, with 2^128 addresses, is far too large to explore.
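A minimal sketch of the approach, with an assumed timeout and no retries; a real survey would also need retries (see the next slide) and responsible scanning practices.

```python
import random
import socket

def has_web_server(ip, port=80, timeout=2.0):
    """Return True if something accepts a TCP connection on the given port."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def sample_servers(n):
    """Test n uniformly random IPv4 addresses for a web server on port 80."""
    hits = []
    for _ in range(n):
        ip = ".".join(str(random.randint(0, 255)) for _ in range(4))
        if has_web_server(ip):
            hits.append(ip)
    return hits
```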
Sampling the Web with IP Address Sampling
Solution: check each sampled address multiple times.
Sampling the Web with IP Address Sampling
This method finds many web servers that would not normally be considered part of the publicly indexable web:
- Servers with authorization requirements
- Servers with no content
- Hardware that provides a web interface
Sampling the Web with IP Address Sampling
A number of issues lead to minor biases:
- An IP address may host several web sites.
- Multiple IP addresses may serve identical content.
- Some web servers may not use the standard port.
There is a higher probability of finding larger sites that use multiple IP addresses to serve the same content.
Solution: use the domain name system.
Sampling the Web with IP Address Sampling
The distribution of server types found from sampling 3.6 million IP addresses in February 1999 (figure not reproduced here).
Lawrence, S. & Giles, C. L. (1999) Nature 400, 107–109.
Analyses from the same study
Only 34.2% of servers contained the common "keyword" or "description" meta-tags on their homepage.
The low usage of this simple HTML metadata standard suggests that acceptance of more complex standards, such as XML, will be very slow.
Discussion on Sampling the Web
Current techniques exhibit biases and do not achieve a uniform random sample.
- For the random walk, any implementation is limited to a finite random walk.
- For IP address sampling, the main challenge is how to sub-sample the pages accessible from a given IP address.
Analyzing and Modeling Web Growth
We can also extract valuable information by
analyzing and modeling the growth of pages
and links on the web.
The Web has a degree distribution following a power law:
P(k) ~ k^(-γ)
- γ ≈ 2.1 for the in-link distribution
- γ ≈ 2.72 for the out-link distribution
Analyzing and Modeling Web Growth
This observation led to the design of various models for the Web:
- Preferential Attachment of Barabási et al.
- Mixed Model of Pennock et al.
- Copy Model of Kleinberg et al.
- The Hostgraph Model
Preferential Attachment
As the network grows, the probability that a
given node receives an edge is proportional
to that node’s current connectivity.
‘rich get richer’
The probability that a new node is connected to node u is
P(u) = k_u / Σ_w k_w
where k_w is the degree of node w and the sum runs over all existing nodes.
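A minimal sketch of growth under this rule; the repeated-endpoints list makes choosing a node with probability proportional to its degree a single uniform draw. The parameters are illustrative.

```python
import random

def preferential_attachment(n, m=2):
    """Grow a graph to n nodes; each new node attaches m edges preferentially."""
    degree = [m] * (m + 1)                      # seed: complete graph on m+1 nodes
    endpoints = [v for v in range(m + 1) for _ in range(m)]
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    for u in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(random.choice(endpoints))  # P(v) proportional to degree(v)
        degree.append(0)
        for v in targets:
            edges.append((u, v))
            degree[u] += 1
            degree[v] += 1
            endpoints.extend([u, v])            # each node listed once per degree
    return edges, degree
```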
Preferential Attachment
The model suggests that for a node u created at time t_u, the expected degree is m(t/t_u)^0.5, where m is the number of edges added per step. Thus older pages get rich faster than newer pages (a claim the slide annotates "No Evidence").
The model explains the power-law in-link distribution. However, the model's exponent is 3 (by mean-field theory), whereas the observed exponent is 2.1.
In reality, different link distributions are observed among web pages of the same category.
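A quick numeric illustration of the mean-field prediction, with assumed values m = 3 and t = 10,000:

```python
m, t = 3, 10_000
for t_u in (10, 100, 1_000):
    # Expected degree m * (t / t_u) ** 0.5: earlier nodes accumulate more links.
    print(f"t_u = {t_u}: expected degree ~ {m * (t / t_u) ** 0.5:.1f}")
# t_u = 10: ~94.9, t_u = 100: ~30.0, t_u = 1000: ~9.5
```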
Winners don’t take all
The early models fail to account for significant deviations from power-law scaling that are common in almost all studied networks.
For example, among web pages of the same category, link distributions can diverge strongly from power-law scaling, exhibiting a roughly log-normal distribution.
Moreover, conclusions about the attack and failure tolerance of the Internet based on the early models may not fully hold within specific communities.
Winners don’t take all
NEC researchers (Pennock et al.) discovered
that the degree of "rich get richer" or "winners
take all" behavior varies in different categories
and may be significantly less than previously
thought.
Winners don’t take all
Pennock et al. introduced a new model of
network growth, mixing uniform and
preferential attachment, that accurately
accounts for the true connectivity
distributions found in web categories, the
web as a whole, and other social and
biological networks.
Winners don’t take all
The numbers in the figure represent the degree to which link growth is preferential (new links are created to already popular sites); figure not reproduced here.
Copy Model
Kleinberg et al. explained the power-law in-link distributions with a copy model that constructs a directed graph:
- A new node u is added with d out-links.
- An existing node v (the prototype) is chosen uniformly at random.
- For the j-th out-link of u: with probability 1-α, the destination is chosen uniformly at random among existing nodes; with probability α, the destination of v's j-th out-link is copied.
This model is also a mixture of uniform and preferential influences on network growth.
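A sketch of this construction as described above; d and α are illustrative, and a single self-linking seed node keeps every node supplied with d out-links to copy from.

```python
import random

def copy_model(n, d=3, alpha=0.8):
    out = {0: [0] * d}                      # seed node with d self-links
    for u in range(1, n):
        v = random.randrange(u)             # prototype chosen uniformly
        links = []
        for j in range(d):
            if random.random() < alpha:
                links.append(out[v][j])     # copy v's j-th out-link
            else:
                links.append(random.randrange(u))  # uniform over existing nodes
        out[u] = links
    return out
```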
The Hostgraph Model
- Models the Web at the host or domain level.
- Each node represents a host.
- Each directed edge represents the hyperlinks from pages on the source host to pages on the target host.
The Hostgraph Model
Bharat et al. show that the weighted in-link and weighted out-link distributions in the host graph follow a power law with γ = 1.62 and γ = 1.67, respectively. However, the number of hosts with small degree is considerably smaller than predicted by the model: there is a "flattening" of the curve for low-degree hosts.
The Hostgraph Model
Bharat et al. made a modification to the copy model, called the re-link model, to explain this flattening:
- With probability β, add a new node u with d out-links; with probability 1-β, no new node is added and instead an existing node u is selected uniformly at random and given d additional out-links.
- An existing node v is chosen uniformly at random as the prototype.
- For each of the d new links: with probability 1-α, the destination is chosen uniformly at random among existing nodes; with probability α, the corresponding out-link of v is copied.
Because with probability 1-β no new node is added, the number of low-degree nodes is reduced.
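A sketch of the re-link variant, extending the copy-model sketch above; β, α, and d are illustrative, and occasional self-links are allowed for simplicity.

```python
import random

def relink_model(steps, d=3, alpha=0.8, beta=0.75):
    out = {0: [0] * d}                       # seed node with d self-links
    for _ in range(steps):
        v = random.randrange(len(out))       # prototype, uniform over existing nodes
        if random.random() < beta:
            u = len(out)                     # add a new node, as in the copy model
            out[u] = []
        else:
            u = random.randrange(len(out))   # re-link: existing node gets d more links
        for j in range(d):
            if random.random() < alpha:
                out[u].append(out[v][j])     # copy prototype's j-th out-link
            else:
                out[u].append(random.randrange(len(out)))  # uniform destination
    return out
```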
Communities on the Web
Identification of communities on the web is valuable. Practical applications include:
• Automatic web portals
• Focused search engines
• Content filtering
• Complementing text-based searches
Community identification also allows for analysis of the
entire web and the objective study of relationships within
and between communities.
Communities on the Web
Flake et al. define a web community as:
A collection of web pages such that each member page has more hyperlinks within the community than outside of the community.
Flake et al. show that the web self-organizes such that these link-based communities identify highly related pages.
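This definition is straightforward to test for a candidate set; a minimal sketch, assuming links is an adjacency mapping from each page to the pages it links to (counting out-links only, which is a simplification):

```python
def is_community(candidate, links):
    """Check that every member links more inside the set than outside it."""
    members = set(candidate)
    for page in members:
        inside = sum(1 for q in links.get(page, ()) if q in members)
        outside = sum(1 for q in links.get(page, ()) if q not in members)
        if inside <= outside:
            return False
    return True
```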
Communities on the Web
There are alternatives for identifying Web communities:
Kumar et al. consider dense bipartite subgraphs as indications of communities.
Other approaches:
- Bibliometric methods such as co-citation and bibliographic coupling
- The PageRank algorithm
- The HITS algorithm
- Bipartite subgraph identification
- Spreading activation energy
Conclusion
There are still many open problems:
- The problem of uniformly sampling the web is still open in practice: which pages should be counted, and how can we reduce biases?
- Web growth models approximate the true nature of how the web grows: how can the current models be refined to improve accuracy, while keeping the models relatively simple and easy to understand and analyze?
- Finally, community identification remains an open area: how can the accuracy of community identification be improved, and how can communities be best structured or presented to account for differences of opinion in what is considered a community?
Thanks For Your Patience
Appendix
Google’s PageRank
We assume page A has pages T1...Tn which point to it (i.e., are
citations). The parameter d is a damping factor which can be
set between 0 and 1. We usually set d to 0.85. Also C(A) is
defined as the number of links going out of page A. The
PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
PageRank, or PR(A), can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.
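A sketch of that iterative algorithm on a toy link graph; the graph and iteration count are illustrative assumptions.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over in-linking pages T."""
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        pr = {p: (1 - d) + d * sum(pr[t] / len(out_links[t])
                                   for t in pages if p in out_links[t])
              for p in pages}
    return pr

# Toy graph: A -> B, B -> C, C -> A and C -> B.
print(pagerank({"A": ["B"], "B": ["C"], "C": ["A", "B"]}))
```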
Google’s PageRank
Example (d = 0.85, link values from a figure not reproduced here):
PageRank(C) = 0.15 + 0.85 × (1.49/2 + 0.78/1) = 1.45
PageRank for 26 million web pages can be computed in a few hours on a medium-size workstation (Brin & Page, 1998).