Recent Advances in Telecommunications, Signals and Systems
ISBN: 978-1-61804-169-2

Ranking WebPages Using Web Structure Mining Concepts

Zakaria Suliman Zubi
Computer Science Department, Faculty of Science, Sirte University, Sirte, Libya
Email: [email protected]

Abstract: - With the rapid growth of the Web, users easily get lost in its rich hyper structure. Providing relevant information that satisfies users' needs is the primary goal of website owners, and web mining is one of the techniques that can help them in this direction. Web mining is categorized into three areas: web content mining, web usage mining, and web structure mining. Web structure mining plays an important role in this approach. Two page ranking algorithms, PageRank and Hyperlink-Induced Topic Search (HITS), are commonly used in web structure mining; both treat all links equally when distributing rank scores. This paper compares the two algorithms. Ranking WebPages is an important task, since it helps the user find highly ranked pages that are relevant to the query. Different metrics have been proposed to rank web pages according to their quality, and this paper also briefly discusses the two most prominent ones.

Key-Words: - Web Mining, Web Content Mining, Web Usage Mining, Web Structure Mining, HITS, PageRank, Authority and Hubs.

1 Introduction

The web is a rich source of information and continues to grow in size and complexity. Retrieving a required web page efficiently and effectively is becoming a challenge [1]. Whenever a user searches for relevant pages, the user prefers those pages to be at hand. A relevant web page is one that covers the same topic as the original page but is not semantically identical to it [1]. The Web is in fact an unstructured data warehouse: it delivers a massive amount of information and also increases the complexity of handling that information from the perspectives of knowledge searchers, business analysts, and web service providers [2]. Moreover, Google reported in 2008 that there were 1 trillion unique URLs on the web [3]. The web has grown enormously and its usage is so extensive that it is essential to understand its data structure. The sheer amount of information makes it very hard for users to find, extract, filter, or evaluate the relevant parts. This raises the need for techniques that can address the following challenges [3]:

1) The web is huge.
2) Web pages are semi-structured.
3) Web information tends to be diverse in meaning.
4) The degree of quality of the extracted information.
5) Drawing conclusions (knowledge) from the extracted information.

Web mining can be used in this direction, drawing on related areas such as databases (DB), information retrieval (IR), natural language processing (NLP), and machine learning; these techniques can be used to analyze and extract useful information from the web.

The paper is organized as follows. Section 2 discusses the categories of web mining. Section 3 explains the importance of web page ranking and presents two important algorithms, Hypertext Induced Topic Selection (HITS) and PageRank. Section 4 compares these page ranking algorithms. Concluding remarks are given in Section 5.

2 Web Mining Categories

Web mining consists of three main categories according to the web data used as input in web data mining: (1) Web Content Mining; (2) Web Usage Mining; and (3) Web Structure Mining.
A. Web Content Mining

Web content mining is the process of retrieving information from the web into more structured forms and indexing it so that it can be retrieved quickly. It focuses mainly on the structure within a web document, i.e., at the intra-document level. Web content mining is related to data mining because many data mining techniques can be applied to it. Data mining deals with different types of data, including text, images, audio, and video, and web content encompasses all of these types. It is also related to text mining, since much of the web's content is text; it nevertheless differs from both, because web data is mainly semi-structured in nature, whereas text mining focuses on unstructured text. Table 1 summarizes the concepts of web content mining.

                   IR View                            DB View
View of Data       - Unstructured                     - Semi-structured
                   - Semi-structured                  - Web site as DB
Main Data          - Text documents                   - Hypertext documents
                   - Hypertext documents
Representation     - Bag of words, n-grams, terms,    - Edge-labeled graph
                     phrases, concepts or ontology    - Relational
                   - Relational
Method             - Machine learning                 - Proprietary algorithms
                   - Statistical (including NLP)      - Association rules
Application        - Categorization                   - Finding frequent sub-structures
Categories         - Clustering                       - Web site schema discovery
                   - Finding extraction rules
                   - Finding patterns in text

Tab 1. Overview of the web content mining category

B. Web Usage Mining

In many studies, web usage mining is used to identify browsing patterns by analyzing the navigational behavior of users [10]. It focuses on techniques that can predict user behavior while the user interacts with the web, and it uses the data resulting from that interaction. This activity involves the automatic discovery of user access patterns from one or more web servers. Through this mining technique we can determine what users are looking for on the Internet: some may be looking only for technical data, whereas others may be interested in multimedia data. Web usage mining can also be defined as the application of data mining techniques to discover interesting usage patterns from web usage data, in order to understand and better serve the needs of web-based applications [2]. Usage data captures the identity or origin of web users along with their browsing behavior at a web site, and web usage mining itself can be classified further depending on the kind of usage data. It tries to make sense of the data generated by web surfers' sessions or behaviors. While web content mining and web structure mining utilize the real or primary data on the web, web usage mining mines the secondary data derived from the behavior of users while interacting with the web: web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, bookmark data, and any other data derived from a person's interaction with the web. These types of data are shown in Table 2.

View of Data       - User interactivity
Main Data          - Server logs (log files)
                   - Browser logs
Representation     - Relational table
                   - Graph
                   - User behavior
Method             - Machine learning
                   - Statistical
                   - Association rules
Application        - Site construction, adaptation and management
Categories         - Marketing
                   - User modeling

Tab 2. The web usage mining category

C. Web Structure Mining

Web structure mining is defined as the process by which we discover the model of the link structure of web pages.
We classify the links and generate usable information, such as the similarity and relations among web pages, by taking advantage of the hyperlink topology [9]. PageRank and hyperlink analysis also fall into this class. The aim of web structure mining is to generate a structured summary of a website and its pages; it attempts to discover the link structure of hyperlinks at the inter-document level. Since web documents contain links, and both tasks use the real or primary data on the web, web structure mining is related to web content mining, and it is quite common to combine the two mining tasks in one application. Table 3 shows the type of data used in web structure mining applications.

View of Data       - Link structure
Main Data          - Link structure
                   - Web pages
Representation     - Graph
Method             - Proprietary algorithms (HITS, PageRank)
Application        - Categorization
Categories         - Clustering

Tab 3. Web structure mining data types

The main purpose of structure mining is to extract previously unknown relationships between web pages. This kind of mining can be used by a business to link the information of its own web site so as to enable navigation, and to organize otherwise unrelated web pages into site maps. The more links provided within the relationship of the web pages, the more the navigation yields a link hierarchy that makes navigation easy [11]. Web mining, and the use of structure mining in particular, can therefore provide strategic results for marketing a web site: more traffic directed to the pages of a particular site increases return visitation to the site and improves recall by search engines for the information or products provided by the sites serving a company or business community. It also enables marketing strategies that are more productive, through navigation of the pages linking to the homepage of the site itself. Using these concepts, relevant web pages can be ranked, based on their quality, against the query the user or customer types into the browser. In this view of web structure mining, page ranking plays an important role in navigating to relevant pages of the highest quality.

Web structure mining is enabled by providing a web structure schema, through database techniques, for web pages. It allows a search engine to pull data relating to a search query directly from the linking web page of the web site the content rests upon. This is accomplished by spiders that scan the web sites, retrieve the home page, and then link the information through reference links to bring forth the specific page containing the desired information.

3 Web Page Ranking

Searching the web involves two main steps: extracting the pages relevant to a query, and ranking them according to their quality. Ranking is essential, as it helps the user find "quality" pages related to the query. Different metrics have been proposed to rank web pages according to their quality; we briefly discuss two of the most famous ones.

With the rapid growth of the Web, providing users with relevant pages of the highest quality for their queries becomes increasingly difficult. The reasons are that some web pages are not self-descriptive, and that some links exist purely for navigational purposes; consequently, finding suitable pages through a search engine that relies on web contents or hyperlink information alone is very hard. Web structure mining reduces two main problems the web suffers from due to its vast amount of data. The first is irrelevant search results: the relevance of retrieved information becomes disorganized because search engines often tolerate only low precision criteria. The second is the inability to index the vast amount of information hosted on the web, which causes low recall for content mining.
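The precision and recall criteria mentioned above can be made concrete with a minimal sketch. The page identifiers below are hypothetical, chosen only to illustrate the two ratios:

```python
# Precision and recall for a toy search result.
# "retrieved" is what the engine returned; "relevant" is the ground truth.
retrieved = {"p1", "p2", "p3", "p4", "p5"}
relevant = {"p2", "p4", "p6", "p7"}

hits = retrieved & relevant                 # relevant pages that were retrieved

precision = len(hits) / len(retrieved)      # fraction of results that are relevant
recall = len(hits) / len(relevant)          # fraction of relevant pages found

print(precision)  # 0.4
print(recall)     # 0.5
```

Low precision corresponds to the first problem above (many irrelevant results returned); low recall corresponds to the second (much of the relevant information never indexed or found).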
This reduction comes in part from discovering the model underlying the web hyperlink structure, which is what web structure mining provides [8]. To solve the problems mentioned in this section, many algorithms have been proposed, among them PageRank [10] and Hypertext Induced Topic Selection (HITS) [2, 9].

PageRank is a frequently used algorithm in web structure mining. It measures the significance of pages by analyzing the relations between the hyperlinks of web pages; in other words, the algorithm ranks web pages based on the web structure [1, 8]. PageRank was developed by Google and is named after Larry Page, Google's co-founder and president [10]. PageRank provides a more advanced way to calculate the importance or significance of a web page than simply counting the number of pages linking to it (its "backlinks"): if a backlink comes from an "important" page, that backlink is given a higher weight than backlinks coming from unimportant pages [4]. Put simply, a link from one page to another can be considered a vote; however, not only the number of votes a page receives is considered important, but also the "importance" or "relevance" of the pages that cast those votes. Since the core function of Google is to retrieve a list of pages related to a given query based on factors such as title, tags, or keywords, PageRank is then used to adjust the results so that the more "important" pages appear at the top of the list [10].

Hubs and Authorities

On the other hand, Hyperlink-Induced Topic Search (HITS), also known as hubs and authorities, is a link analysis algorithm that rates web pages. It was developed by Jon Kleinberg as a precursor to PageRank, and it was later used in a search engine called Clever. The idea behind hubs and authorities stemmed from an insight into how web pages were created when the Internet was originally forming: certain web pages, known as hubs, served as large directories that were not themselves authoritative on the information they held, but acted as compilations of a broad catalog of information that led users directly to other, authoritative pages. In other words, a good hub is a page that points to many other pages, and a good authority is a page that is linked to by many different hubs [1]. The scheme therefore assigns two scores to each page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages [5]. HITS ranks web pages by evaluating their inlinks and outlinks: pages pointed to by many hyperlinks are called authorities, whereas pages that point to many hyperlinks are called hubs [4, 5, 11].
The following subsections describe these algorithms in more detail.

A. HITS (Hyperlink-Induced Topic Search) Algorithm

Hubs and authorities can be viewed as "fans" and "centers" in a bipartite core of a web graph, where the "fans" represent the hubs and the "centers" represent the authorities. The hub and authority scores computed for each web page indicate the extent to which the page serves as a hub pointing to good authority pages, or as an authority on a topic pointed to by good hubs. The scores are calculated for a set of pages related to a topic using an iterative process called HITS [9]. First, a query is submitted to a search engine and a set of significant documents is retrieved. This set, called the "root set," is then extended by including web pages that point to pages in the root set or are pointed to by them; the resulting set is called the "base set." An adjacency matrix A is formed such that A(i,j) = 1 if there exists at least one hyperlink from page i to page j, and A(i,j) = 0 otherwise. The HITS algorithm is then used to calculate the hub and authority scores for this set of pages. There have been modifications and improvements to the basic PageRank and hubs-and-authorities approaches, such as SALSA (Lempel and Moran 2000), topic-sensitive PageRank (Haveliwala 2002), and web page reputations (Mendelzon and Rafiei 2000); these hyperlink-based metrics have been discussed by Desikan, Srivastava, Kumar, and Tan (2002). The mechanism of using authorities and hubs is illustrated in Figure 1.

Fig 1. The hubs and authorities

Authorities and Hubs Rules

To begin the ranking, auth(p) = 1 and hub(p) = 1 for every page p. We consider two types of updates, the Authority Update Rule and the Hub Update Rule, and apply them in repeated iterations to compute the hub and authority scores of each node.

Authority Update Rule: for each page p, we update auth(p) to be

    auth(p) = Σ_{i=1}^{n} hub(i),

where n is the total number of pages linked to p and i ranges over the pages that link to p. That is, the authority score of a page is the sum of the hub scores of all pages that point to it.

Hub Update Rule: for each page p, we update hub(p) to be

    hub(p) = Σ_{i=1}^{n} auth(i),

where n is the total number of pages p links to and i ranges over the pages that p links to. Thus, a page's hub score is the sum of the authority scores of all the pages it links to.

The hub and authority scores for the nodes are computed with the following algorithm:

• Start with each node having a hub score and an authority score of 1.
• Run the Authority Update Rule.
• Run the Hub Update Rule.
• Normalize the values by dividing each hub score by the square root of the sum of the squares of all hub scores, and each authority score by the square root of the sum of the squares of all authority scores.
• Repeat from the second step as necessary.

Finally, the normalization step determines the final hub-authority scores of the nodes after repeated applications of the algorithm: since directly and iteratively applying the Hub Update Rule and the Authority Update Rule leads to diverging values, it is necessary to normalize after every iteration. The values obtained from this process then eventually converge [6].
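The update rules and normalization step above can be sketched as follows. This is a minimal illustration on a hypothetical four-page link graph, not the production implementation of any search engine:

```python
from math import sqrt

# Hypothetical base set: page -> pages it links to
# (adjacency-list form of the matrix A described above).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)

# Start with every hub and authority score equal to 1.
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(50):  # iterate until the scores effectively converge
    # Authority Update Rule: auth(p) = sum of hub scores of pages linking to p.
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Hub Update Rule: hub(p) = sum of authority scores of the pages p links to.
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Normalize each vector by its Euclidean norm so the values do not diverge.
    a_norm = sqrt(sum(v * v for v in auth.values()))
    h_norm = sqrt(sum(v * v for v in hub.values()))
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

# C is pointed to by A, B, and D, so it emerges as the strongest authority.
best_authority = max(auth, key=auth.get)
print(best_authority)  # C
```

Without the normalization in each iteration the raw sums grow without bound, which is exactly the divergence noted in the text; with it, the score vectors converge to the principal eigenvectors of A^T·A and A·A^T.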
A k-step application of the hub-authority algorithm entails applying, k times, first the Authority Update Rule and then the Hub Update Rule [6].

B. PageRank Algorithm

PageRank is a link analysis algorithm, used by the Google Internet search engine, that assigns a numerical weight to each element of a hyperlinked set of documents, such as the web, with the aim of "measuring" its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight it assigns to a given element E is called the PageRank of E, denoted PR(E) [6].

The PageRank algorithm, one of the most commonly used page ranking algorithms, states that if a page has important links pointing to it, its own links to other pages also become important. PageRank therefore takes backlinks into account and propagates ranking through links: a page has a high rank if the sum of the ranks of its backlinks is high [8, 10]. Figure 2 shows an example of backlinks: page A is a backlink of pages B and C, while pages B and C are backlinks of page D [7].

Fig 2. The backlinks

PageRank is also defined as a metric for ranking hypertext documents based on their quality. Page, Brin, Motwani, and Winograd developed this metric for the popular search engine Google, introduced by Brin and Page in 1998. The main idea is that a page has a high rank if it is pointed to by many highly ranked pages.

Example: assume a small universe of four web pages: A, B, C, and D. The initial approximation of PageRank is divided uniformly among these four documents, so each document begins with an estimated PageRank value of 0.25. (In the original form of PageRank the initial values were simply 1, which meant that the sum over all pages equaled the total number of pages on the web at the time; here we assume a probability distribution between 0 and 1, hence the initial value of 0.25.) If pages B, C, and D link only to A, they each give their 0.25 PageRank to A; all the PageRank in this simplistic system would thus accumulate at A, because all links point to A.
The rank of a page depends upon the ranks of the pages pointing to it, and this procedure is applied iteratively until the ranks of all pages are settled [8, 7]. In the example above, the value accumulated at A would be:

    PR(A) = PR(B) + PR(C) + PR(D) = 0.25 + 0.25 + 0.25 = 0.75.

Suppose instead that page B has a link to page C as well as to page A, while page D has links to all three pages. The value of a page's link-votes is divided among all of its outbound links: page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page C, and only one third of D's PageRank is counted for A's PageRank (approximately 0.083). In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank score divided by the number of its outbound links L(v) (it is assumed that links to a specific URL count only once per document). In the general case, the PageRank value for any page u can be expressed as:

    PR(u) = Σ_{v ∈ B_u} PR(v) / L(v),

where B_u is the set of pages linking to u and L(v) is the number of outbound links of page v.

We defined PageRank as a probability distribution representing the likelihood that a person randomly clicking on links will arrive at any particular page. A probability is expressed as a numeric value between 0 and 1; a 0.5 probability is usually expressed as a "50% chance" of something happening, so a PageRank of 0.5 means there is a 50% chance that a person clicking a random link will be directed to the document with that PageRank. PageRank can be computed for collections of documents of any size. Several research papers assume that the distribution is divided uniformly among all documents in the collection at the beginning of the computation. The PageRank computation requires several passes, called "iterations," through the collection to adjust the approximate PageRank values so that they more closely reflect the theoretical true value [7].
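The worked example can be checked with a minimal sketch of one simplified PageRank update. It uses the link graph from the text (B links to A and C, C links to A, D links to A, B, and C) and deliberately ignores the damping factor and dangling-page handling of the full algorithm:

```python
# One simplified PageRank update: PR(u) = sum over backlinks v of PR(v) / L(v).
# Link graph from the worked example in the text; no damping factor is used.
links = {
    "A": [],                  # A has no outbound links in this toy example
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}
pr = {p: 0.25 for p in links}  # uniform initial distribution over four pages

new_pr = {}
for u in links:
    # Sum the contributions PR(v) / L(v) from every page v that links to u.
    new_pr[u] = sum(pr[v] / len(links[v]) for v in links if u in links[v])

print(round(new_pr["A"], 3))  # 0.458  (0.125 from B + 0.25 from C + 0.083 from D)
```

Iterating this update (with a damping factor, and with sink pages such as A redistributing their rank) is what the repeated "iterations" mentioned above perform until the values converge.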
That is, the PageRank of a page u relies on the PageRank value PR(v) of each page v in the set B_u (the set of all pages linking to page u), divided by the number L(v) of links going out from page v.

4 Comparison of PageRank Algorithms

In this section we explore the importance of the page ranking algorithms commonly used for information retrieval and compare them on different criteria: the mining technique used, functionality, accuracy, input parameters, complexity, limitations, and the search engine that uses each. Table 4 illustrates the comparison.

Criteria                 PageRank                        HITS
Mining technique used    WSM                             WSM & WCM
Functionality            Computes scores at index        Computes scores of n highly
                         time; results are sorted on     relevant pages on the fly
                         the importance of pages
Accuracy                 High                            Middle
Parameters               Backlinks                       Backlinks, forward links & content
Complexity               O(log N)                        < O(log N)
Limitations              Query independent               Topic drift and efficiency problems
Search engine            Google                          Clever

Tab 4. Comparison of the page ranking algorithms

PageRank and Hyperlink-Induced Topic Search (HITS) both treat all links equally when distributing rank scores. PageRank is used in web structure mining, whereas HITS is used in both web structure mining and web content mining. PageRank computes scores at indexing time and sorts pages according to their importance, whereas HITS computes the hub and authority scores of the n most relevant pages on the fly. The input parameter used in PageRank is backlinks, while HITS uses backlinks, forward links, and page content as input parameters. The complexity of the PageRank algorithm is O(log N), whereas the complexity of the HITS algorithm is below O(log N). The accuracy of both algorithms was also assessed.

5 Conclusion

Web mining is defined as the use of data mining techniques to automatically retrieve, extract, and evaluate information for knowledge discovery from web documents and services; this information is left behind by the past behavior of users. Web structure mining, one of the three categories of web mining, identifies the relationships between web pages linked by information or by direct link connections, and it plays an important role in this approach. Many algorithms are used in web structure mining to rank the relevant pages; this paper examined PageRank and Hyperlink-Induced Topic Search (HITS), describing both in detail and presenting a comparative study in Section 4. In future work, we plan to carry out a performance analysis of PageRank and HITS, and to work on ways to categorize users and web pages in order to obtain better PageRank results.

Acknowledgment

This work has been fully supported and funded by the Libyan Higher Education Ministry through Sirte University.

About the Author

Zakaria Suliman Zubi received his Ph.D. in Computer Science in 2002 from Debrecen University in Hungary and has been an Associate Professor since 2010. He is a reviewer for several scientific journals, including journals of the World Scientific and Engineering Academy and Society (WSEAS), the Journal of Software Engineering and Applications (JSEA), the Journal of Engineering and Technology Research (JETR), and the World Academy of Science, Engineering and Technology (WASET) journal, as well as for several local journals in Libya; he is also an Associate Editor of the WSEAS Transactions on Information Science and Applications. He is a member of the International Association of Engineers (IAENG), the Association for Computing Machinery (ACM), the IEEE, and WSEAS. He has published, as author and co-author, many research papers and technical reports in local and international journals and conference proceedings.
References:
[1] S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Networks and ISDN Systems, Vol. 30, Issue 1-7, pp. 107-117, 1998.
[2] C. Ding, X. He, P. Husbands, H. Zha, and H. Simon, Link Analysis: Hubs and Authorities on the World Wide Web, Technical Report 47847, 2001.
[3] J. Hou and Y. Zhang, Effectively Finding Relevant Web Pages from Linkage Information, IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 4, 2003.
[4] J. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Journal of the ACM, 46(5), pp. 604-632, 1999.
[5] J. M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Journal of the ACM, 46(5), pp. 604-632, September 1999.
[6] N. Duhan, A. K. Sharma and K. K. Bhatia, Page Ranking Algorithms: A Survey, Proceedings of the IEEE International Conference on Advance Computing, 2009.
[7] P. Ravi Kumar and Ashutosh Kumar Singh, Web Structure Mining: Exploring Hyperlinks and Algorithms for Information Retrieval, American Journal of Applied Sciences, 7(6), pp. 840-845, 2010.
[8] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, Mining the Link Structure of the World Wide Web, IEEE Computer, Vol. 32, pp. 60-67, 1999.
[9] Zakaria Suliman Zubi and Marim Aboajela Emsaed, Sequence Mining in DNA Chips Data for Diagnosing Cancer Patients, in Proceedings of the 10th WSEAS International Conference on Applied Computer Science (ACS'10), Hamido Fujita and Jun Sasaki (Eds.), World Scientific and Engineering Academy and Society (WSEAS), Stevens Point, Wisconsin, USA, 2010, pp. 139-151.
[10] Zakaria Suliman Zubi, Using Some Web Content Mining Techniques for Arabic Text Classification, in Proceedings of the 8th WSEAS International Conference on Data Networks, Communications, Computers (DNCOCO'09), Manoj Jha, Charles Long, Nikos Mastorakis, and Cornelia Aida Bulucea (Eds.), World Scientific and Engineering Academy and Society (WSEAS), Stevens Point, Wisconsin, USA, 2009, pp. 73-78.