Extracting knowledge from the
World Wide Web
Monika Henzinger and Steve Lawrence
Google Inc.
Presented by Murat Şensoy
Objective
The World Wide Web provides an exceptional
opportunity to automatically analyze a large
sample of interests and activity in the world.
But how do we extract knowledge from the web?
The Challenge: The distributed and heterogeneous nature of the web makes large-scale analysis difficult.
Objective
The paper provides an overview of recent methods for:
- Sampling the Web
- Analyzing and Modeling Web Growth
Sampling the Web
- Due to the sheer size of the Web, even simple statistics about it are unknown.
- The ability to sample web pages or web servers uniformly at random is very useful for determining such statistics.
- The question is: how can we sample the Web uniformly?
Sampling the Web
Two famous sampling methods for the Web are:
- Random Walk
- IP Address Sampling
Sampling the Web with Random Walk
Main Idea: Visit pages with a probability proportional to their PageRank value, then sample the visited pages with a probability inversely proportional to their PageRank value. Thus, the probability that a page is sampled is a constant, independent of the page.
PageRank
PageRank has several definitions.
Google's creators Brin and Page published the definition of PageRank as used in Google:
Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", in Proceedings of the 7th International World Wide Web Conference, pp. 107–117, 1998.
PageRank
PageRank has another definition based on a random walk:
- The initial page is chosen uniformly at random from all pages.
- Suppose the walk is at page p at a given time step.
- With probability d, follow an out-link of p.
- With probability 1-d, jump to a page selected uniformly at random from all pages.
The PageRank of a page p is the fraction of steps that the walk spends at p, in the limit.
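As a concrete illustration, here is a minimal sketch of this random-surfer definition on a toy three-page graph; the graph, the damping factor d = 0.85, and the step count are illustrative assumptions, not values from the paper.

```python
import random
from collections import Counter

# Toy adjacency list: page -> out-links (an illustrative assumption).
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
pages = list(graph)
d = 0.85          # probability of following an out-link
steps = 100_000

visits = Counter()
page = random.choice(pages)                # initial page chosen at random
for _ in range(steps):
    visits[page] += 1
    if random.random() < d and graph[page]:
        page = random.choice(graph[page])  # follow a random out-link of the page
    else:
        page = random.choice(pages)        # jump to a uniformly random page
# The fraction of steps spent at each page approximates its PageRank.
print({p: visits[p] / steps for p in pages})
```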
PageRank
Two problems arise in the implementation:
- The random walk already assumes that it can find a random page on the web, which is exactly the problem we want to solve.
- Many hosts on the web have a large number of links within the same host and very few links leaving them.
PageRank
Henzinger proposed and implemented a modified random walk:
- Given a set of initial pages.
- Choose the start page randomly from the initial pages.
- Suppose the walk is at page p at a given time step.
- With probability d, follow an out-link of p.
- With probability 1-d, select a random host among the visited hosts, then jump to a randomly selected page out of all pages visited on this host so far.
- All pages in the initial set are also considered visited.
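A sketch of this modified walk, assuming hypothetical out_links(page) and host_of(page) helpers; a real implementation would fetch and parse live pages.

```python
import random
from collections import defaultdict

def modified_walk(initial_pages, out_links, host_of, d=0.85, steps=10_000):
    """Modified random walk; out_links and host_of are assumed callables."""
    visited_by_host = defaultdict(set)
    for p in initial_pages:                  # initial set counts as visited
        visited_by_host[host_of(p)].add(p)
    page = random.choice(initial_pages)      # random start page
    trail = []
    for _ in range(steps):
        trail.append(page)
        visited_by_host[host_of(page)].add(page)
        links = out_links(page)
        if random.random() < d and links:
            page = random.choice(links)      # follow a random out-link
        else:
            host = random.choice(list(visited_by_host))        # random visited host
            page = random.choice(list(visited_by_host[host]))  # random page seen there
    return trail
```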
Sampling the Web with Random Walk
The modified random walk visits a page with probability
approximately proportional to its PageRank value.
Afterward, the visited pages are sampled with probability
inversely proportional to their PageRank value.
Thus, the probability that a page is sampled is a constant
independent of the page.
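The second step can be implemented as simple rejection sampling; a minimal sketch, assuming pr maps visited pages to PageRank estimates and the constant c is no larger than the smallest estimate.

```python
import random

def subsample(visited, pr, c):
    # Accept page p with probability c / pr[p] (requires c <= min(pr.values())),
    # so acceptance is inversely proportional to PageRank and the overall
    # probability of sampling any given page is roughly constant.
    return [p for p in visited if random.random() < c / pr[p]]
```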
Sampling the Web with Random Walk
An example of statistics generated using this approach appears in the original slide (figure not reproduced here).
Sampling the Web with IP Address Sampling
- IPv4 addresses: 4 bytes
- IPv6 addresses: 16 bytes
There are about 4.3 billion possible IPv4 addresses.
IP address sampling is an approach based on randomly sampling IP addresses and testing for a web server at the standard port (HTTP: 80 or HTTPS: 443).
This approach works only for IPv4: the IPv6 address space, with 2^128 addresses, is far too large to explore.
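A minimal sketch of the approach, with an assumed timeout and no retries; a real survey would also need retries (see the next slide) and responsible scanning practices.

```python
import random
import socket

def has_web_server(ip, port=80, timeout=2.0):
    """Return True if something accepts a TCP connection on the given port."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def sample_servers(n):
    """Test n uniformly random IPv4 addresses for a web server on port 80."""
    hits = []
    for _ in range(n):
        ip = ".".join(str(random.randint(0, 255)) for _ in range(4))
        if has_web_server(ip):
            hits.append(ip)
    return hits
```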
Sampling the Web with IP Address Sampling
Solution: check each sampled address multiple times.
Sampling the Web with IP Address Sampling
This method finds many web servers that would not normally be considered part of the publicly indexable web:
- Servers with authorization requirements
- Servers with no content
- Hardware that provides a web interface
Sampling the Web with IP Address Sampling
A number of issues lead to minor biases:
- An IP address may host several web sites.
- Multiple IP addresses may serve identical content.
- Some web servers may not use the standard port.
There is a higher probability of finding larger sites that use multiple IP addresses to serve the same content.
Solution: use the domain name system.
Sampling the Web with IP Address Sampling
The distribution of server types found from sampling 3.6 million IP addresses in February 1999 (figure not reproduced here).
Lawrence, S. & Giles, C. L. (1999) Nature 400, 107–109.
Analyses from the same study
Only 34.2% of servers contained the common "keyword" or "description" meta-tags on their homepage.
The low usage of this simple HTML metadata standard suggests that acceptance of more complex standards, such as XML, will be very slow.
Discussion on Sampling the Web
Current techniques exhibit biases and do not achieve a uniform random sample.
- For the random walk, any implementation is limited to a finite random walk.
- For IP address sampling, the main challenge is how to sub-sample the pages accessible from a given IP address.
Analyzing and Modeling Web Growth
We can also extract valuable information by
analyzing and modeling the growth of pages
and links on the web.
The Web has a degree distribution following a power law:
P(k) ~ k^(-γ)
- γ ≈ 2.1 for the in-link distribution
- γ ≈ 2.72 for the out-link distribution
Analyzing and Modeling Web Growth
This observation led to the design of various models for the Web:
- Preferential Attachment of Barabási et al.
- Mixed Model of Pennock et al.
- Copy Model of Kleinberg et al.
- The Hostgraph Model
Preferential Attachment
As the network grows, the probability that a
given node receives an edge is proportional
to that node’s current connectivity.
‘rich get richer’
The probability that a new node is connected to node u is
P(u) = k_u / Σ_w k_w
where k_w is the degree of node w and the sum runs over all existing nodes.
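A minimal sketch of growth under this rule; the repeated-endpoints list makes choosing a node with probability proportional to its degree a single uniform draw. The parameters are illustrative.

```python
import random

def preferential_attachment(n, m=2):
    """Grow a graph to n nodes; each new node attaches m edges preferentially."""
    degree = [m] * (m + 1)                      # seed: complete graph on m+1 nodes
    endpoints = [v for v in range(m + 1) for _ in range(m)]
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    for u in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(random.choice(endpoints))  # P(v) proportional to degree(v)
        degree.append(0)
        for v in targets:
            edges.append((u, v))
            degree[u] += 1
            degree[v] += 1
            endpoints.extend([u, v])            # each node listed once per degree
    return edges, degree
```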
Preferential Attachment
The model suggests that for a node u created at time t_u, the expected degree is m(t/t_u)^0.5, where m is the number of edges added per step. Thus older pages get rich faster than newer pages (a claim the slide annotates "No Evidence").
The model explains the power-law in-link distribution. However, the model's exponent is 3 (by mean-field theory), whereas the observed exponent is 2.1.
In reality, different link distributions are observed among web pages of the same category.
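A quick numeric illustration of the mean-field prediction, with assumed values m = 3 and t = 10,000:

```python
m, t = 3, 10_000
for t_u in (10, 100, 1_000):
    # Expected degree m * (t / t_u) ** 0.5: earlier nodes accumulate more links.
    print(f"t_u = {t_u}: expected degree ~ {m * (t / t_u) ** 0.5:.1f}")
# t_u = 10: ~94.9, t_u = 100: ~30.0, t_u = 1000: ~9.5
```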
Winners don’t take all
The early models fail to account for significant deviations from power-law scaling that are common in almost all studied networks.
For example, among web pages of the same category, link distributions can diverge strongly from power-law scaling, exhibiting a roughly log-normal distribution.
Moreover, conclusions about the attack and failure tolerance of the Internet based on the early models may not fully hold within specific communities.
Winners don’t take all
NEC researchers (Pennock et al.) discovered
that the degree of "rich get richer" or "winners
take all" behavior varies in different categories
and may be significantly less than previously
thought.
Winners don’t take all
Pennock et al. introduced a new model of
network growth, mixing uniform and
preferential attachment, that accurately
accounts for the true connectivity
distributions found in web categories, the
web as a whole, and other social and
biological networks.
Winners don’t take all
The numbers in the figure represent the degree to which link growth is preferential (new links are created to already popular sites); figure not reproduced here.
Copy Model
Kleinberg et al. explained the power-law in-link distributions with a copy model that constructs a directed graph:
- A new node u is added with d out-links.
- An existing node v (the prototype) is chosen uniformly at random.
- For the j-th out-link of u: with probability 1-α, the destination is chosen uniformly at random among existing nodes; with probability α, the destination of v's j-th out-link is copied.
This model is also a mixture of uniform and preferential influences on network growth.
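A sketch of this construction as described above; d and α are illustrative, and a single self-linking seed node keeps every node supplied with d out-links to copy from.

```python
import random

def copy_model(n, d=3, alpha=0.8):
    out = {0: [0] * d}                      # seed node with d self-links
    for u in range(1, n):
        v = random.randrange(u)             # prototype chosen uniformly
        links = []
        for j in range(d):
            if random.random() < alpha:
                links.append(out[v][j])     # copy v's j-th out-link
            else:
                links.append(random.randrange(u))  # uniform over existing nodes
        out[u] = links
    return out
```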
The Hostgraph Model
- Models the Web at the host or domain level.
- Each node represents a host.
- Each directed edge represents the hyperlinks from pages on the source host to pages on the target host.
The Hostgraph Model
Bharat et al. show that the weighted in-link and weighted out-link distributions in the host graph follow a power law with γ = 1.62 and γ = 1.67, respectively. However, the number of hosts with small degree is considerably smaller than predicted by the model: there is a "flattening" of the curve for low-degree hosts.
The Hostgraph Model
Bharat et al. made a modification to the copy model, called the re-link model, to explain this flattening:
- With probability β, add a new node u with d out-links; with probability 1-β, no new node is added and instead an existing node u is selected uniformly at random and given d additional out-links.
- An existing node v is chosen uniformly at random as the prototype.
- For each of the d new links: with probability 1-α, the destination is chosen uniformly at random among existing nodes; with probability α, the corresponding out-link of v is copied.
Because with probability 1-β no new node is added, the number of low-degree nodes is reduced.
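A sketch of the re-link variant, extending the copy-model sketch above; β, α, and d are illustrative, and occasional self-links are allowed for simplicity.

```python
import random

def relink_model(steps, d=3, alpha=0.8, beta=0.75):
    out = {0: [0] * d}                       # seed node with d self-links
    for _ in range(steps):
        v = random.randrange(len(out))       # prototype, uniform over existing nodes
        if random.random() < beta:
            u = len(out)                     # add a new node, as in the copy model
            out[u] = []
        else:
            u = random.randrange(len(out))   # re-link: existing node gets d more links
        for j in range(d):
            if random.random() < alpha:
                out[u].append(out[v][j])     # copy prototype's j-th out-link
            else:
                out[u].append(random.randrange(len(out)))  # uniform destination
    return out
```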
Communities on the Web
Identification of communities on the web is valuable. Practical applications include:
• Automatic web portals
• Focused search engines
• Content filtering
• Complementing text-based searches
Community identification also allows for analysis of the
entire web and the objective study of relationships within
and between communities.
Communities on the Web
Flake et al. define a web community as:
A collection of web pages such that each member page has more hyperlinks within the community than outside of the community.
Flake et al. show that the web self-organizes such that these link-based communities identify highly related pages.
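This definition is straightforward to test for a candidate set; a minimal sketch, assuming links is an adjacency mapping from each page to the pages it links to (counting out-links only, which is a simplification):

```python
def is_community(candidate, links):
    """Check that every member links more inside the set than outside it."""
    members = set(candidate)
    for page in members:
        inside = sum(1 for q in links.get(page, ()) if q in members)
        outside = sum(1 for q in links.get(page, ()) if q not in members)
        if inside <= outside:
            return False
    return True
```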
Communities on the Web
There are alternatives for identifying Web communities:
Kumar et al. consider dense bipartite subgraphs as indications of communities.
Other approaches:
- Bibliometric methods such as co-citation and bibliographic coupling
- The PageRank algorithm
- The HITS algorithm
- Bipartite subgraph identification
- Spreading activation energy
Conclusion
There are still many open problems:
- The problem of uniformly sampling the web is still open in practice: which pages should be counted, and how can we reduce biases?
- Web growth models approximate the true nature of how the web grows: how can the current models be refined to improve accuracy, while keeping the models relatively simple and easy to understand and analyze?
- Finally, community identification remains an open area: how can the accuracy of community identification be improved, and how can communities be best structured or presented to account for differences of opinion in what is considered a community?
Thanks For Your Patience
Appendix
Google’s PageRank
We assume page A has pages T1...Tn which point to it (i.e., are
citations). The parameter d is a damping factor which can be
set between 0 and 1. We usually set d to 0.85. Also C(A) is
defined as the number of links going out of page A. The
PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
PageRank, or PR(A), can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.
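A sketch of that iterative algorithm on a toy link graph; the graph and iteration count are illustrative assumptions.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over in-linking pages T."""
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        pr = {p: (1 - d) + d * sum(pr[t] / len(out_links[t])
                                   for t in pages if p in out_links[t])
              for p in pages}
    return pr

# Toy graph: A -> B, B -> C, C -> A and C -> B.
print(pagerank({"A": ["B"], "B": ["C"], "C": ["A", "B"]}))
```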
Google’s PageRank
Example (d = 0.85, link values from a figure not reproduced here):
PageRank(C) = 0.15 + 0.85 × (1.49/2 + 0.78/1) = 1.45
PageRank for 26 million web pages can be computed in a few hours on a medium-size workstation (Brin & Page, 1998).