Recent Advances in Telecommunications, Signals and Systems
Ranking WebPages Using Web Structure Mining Concepts
Zakaria Suliman Zubi
Computer Science Department
Faculty of Science
Sirte University
Sirte, Libya
Email: [email protected]
Abstract: - With the rapid growth of the Web, users easily get lost in its rich hyper structure. Providing relevant information that satisfies users' needs is the primary goal of website owners. Web mining is one of the techniques that can help website owners in this direction. Web mining is categorized into three categories: web content mining, web usage mining and web structure mining. Web structure mining plays an important role in this approach. Two page ranking algorithms, PageRank and Hyperlink-Induced Topic Search (HITS), are commonly used in web structure mining. Both algorithms treat all links equally when distributing rank scores. A comparison between the two algorithms is also discussed in this paper. Ranking web pages is an important task, as it helps the user find highly ranked pages that are relevant to the query. Different metrics have been proposed to rank web pages according to their quality, and a brief discussion of two prominent ones is conducted in this paper as well.
Key-Words: - Web Mining, Web Content Mining, Web Usage Mining, Web Structure Mining, HITS, PageRank, Authority and Hubs.
1 Introduction
The web is a rich source of information and continues to grow in size and complexity. Retrieving the required web page from the web, efficiently and effectively, is becoming a challenge nowadays [1]. Whenever a user needs to search for relevant pages, the user prefers those relevant pages to be at hand. A relevant web page is one that covers the same topic as the original page but is not semantically identical to it [1]. As a matter of fact, the Web is an unstructured data warehouse that delivers a massive amount of information and also increases the complexity of handling that information from the different perspectives of knowledge seekers, business analysts and web service providers [2]. Besides, Google reported in 2008 that there were 1 trillion unique URLs on the web [3]. The web has grown enormously and its usage is remarkable, so it is essential to understand the data structure of the web. The massive amount of information makes it very hard for users to find, extract, filter or evaluate the relevant information. This issue raises attention to the need for techniques that can solve these challenges.
Web mining can easily be used in this direction to tackle the problem with the help of other areas like Databases (DB), Information Retrieval (IR), Natural Language Processing (NLP) and Machine Learning. These techniques can be used to extract and analyze useful information from the web. Dealing with these aspects, there are some challenges we should take into account, as follows [3]:
1) The web is huge.
2) Web pages are semi-structured.
3) Web information tends to be diverse in meaning.
4) The degree of quality of the information extracted.
5) The conclusion of knowledge from the information extracted.
The paper is organized as follows. The categories of web mining are discussed in Section 2. Section 3 explains the importance of web page ranking and two important algorithms, the Hypertext Induced Topic Selection (HITS) algorithm and the PageRank algorithm. In Section 4, we explore the comparison between the web page ranking algorithms used. The concluding remarks are given in Section 5.

2 Web Mining Categories
Web mining consists of three main categories according to the web data used as input in web data mining: (1) Web Content Mining; (2) Web Usage Mining; and (3) Web Structure Mining.
ISBN: 978-1-61804-169-2
A. Web Content Mining
Web content mining is the procedure of retrieving information from the web into more structured forms and indexing it so that it can be retrieved quickly. It focuses mainly on the structure within a web document, i.e. at the intra-document level.
Web content mining is an area related to data mining, because many data mining techniques can be applied in it. Data mining deals with different types of data, including text, images, audio and video, whereas web content mining deals with all of these types. It is also related to text mining, because much of the web content is text, but it is quite different from both, because web data is mainly semi-structured in nature while text mining focuses mainly on unstructured text. Table 1 summarizes the concepts of the web content mining category.

Web Content Mining
                        IR View                       DB View
View of Data            -Unstructured                 -Semi-structured
                        -Semi-structured              -Web site as DB
Main Data               -Text documents               -Hypertext documents
                        -Hypertext documents
Representation          -Bag of words, n-grams,       -Edge-labeled graph
                         terms, phrases, concepts     -Relational
                         or ontology
                        -Relational
Method                  -Machine learning             -Proprietary algorithms
                        -Statistical (including NLP)  -Association rules
Application Categories  -Categorization               -Finding frequent
                        -Clustering                    substructures
                        -Finding extraction rules     -Web site schema
                        -Finding patterns in text      discovery

Tab 1. Gives an overview of the web content mining category.

B. Web Usage Mining
In many research works, web usage mining is used to identify browsing patterns by analyzing the navigational behavior of users [10]. It focuses on techniques that can predict user behavior while the user interacts with the web, using the data generated on the web. This activity involves the automatic discovery of user access patterns from one or more web servers. Through this mining technique we can determine what users are looking for on the Internet: some might be looking only for technical data, whereas others might be interested in multimedia data.
Web usage mining can also be defined as the application of data mining techniques to discover interesting usage patterns from web usage data, in order to understand and better serve the needs of web-based applications [2]. Usage data captures the identity or origin of web users along with their browsing behavior at a web site. Web usage mining itself can be classified further depending on the kind of usage data considered.
Web usage mining tries to make sense of the data generated by the web surfer's sessions or behaviors. While web content mining and web structure mining utilize the real or primary data on the web, web usage mining mines the secondary data derived from the behavior of users while interacting with the web. This includes data from web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, bookmark data, and any other data that is derived from a person's interaction with the web. These types of data are shown in Table 2.

Web Usage Mining
View of Data            -User interactivity
Main Data               -Server logs (log files)
                        -Browser logs
Representation          -Relational table
                        -Graph
                        -User behavior
Method                  -Machine learning
                        -Statistical
                        -Association rules
Application Categories  -Site construction
                        -Adaptation and management
                        -Marketing
                        -User modeling

Tab 2. Illustrates the web usage mining category.
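As a small, hypothetical illustration of mining such secondary data (the log format, field layout and function name here are assumptions for the sketch, not taken from the paper), the following snippet groups the pages requested in a server access log by visiting host, yielding the per-visitor browsing sequences that usage-mining techniques would then analyze:

```python
import re
from collections import defaultdict

# Minimal Common Log Format parser: host, timestamp and request line.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*"'
)

def browsing_sessions(log_lines):
    """Group requested paths by visiting host, preserving request order."""
    sessions = defaultdict(list)
    for line in log_lines:
        m = LOG_RE.match(line)
        if m:
            sessions[m.group("host")].append(m.group("path"))
    return dict(sessions)

log = [
    '10.0.0.1 - - [01/Jan/2013:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2013:10:00:05 +0000] "GET /products.html HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Jan/2013:10:00:07 +0000] "GET /index.html HTTP/1.1" 200 512',
]
print(browsing_sessions(log))
```

The resulting per-host page sequences are the kind of raw material from which browsing patterns, frequent paths or session clusters can be derived.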
C. Web Structure Mining
Web structure mining is defined as the process by which we discover the model of the link structure of the web pages. We classify the links and generate usable information, such as the similarity and relations among pages, by taking advantage of the hyperlink topology [9]. PageRank and hyperlink analysis also fall into this class. The aim of web structure mining is to generate a structured summary of the website and its web pages. It attempts to discover the link structure of the hyperlinks at the inter-document level. As it is very common that web documents contain links, and since both tasks use the real or primary data on the web, it can be concluded that web structure mining has a relation with web content mining. It is quite common to join these two mining tasks in an application. Table 3 shows the type of data used in a web structure mining application.

Web Structure Mining
View of Data            -Link structure
Main Data               -Link structure
Representation          -Graph
Method                  -Proprietary algorithms
                        -Web page HITS
                        -Web PageRank
Application Categories  -Categorization
                        -Clustering

Tab 3. Web structure mining data types.

As a matter of fact, web structure mining is enabled by the provision of a web structure schema through database techniques for web pages. It allows a search engine to pull data relating to a search query directly from the linking web page of the web site the content rests upon. This is accomplished through the use of spiders that scan the web sites, retrieve the home page, and then link the information through reference links to bring forth the specific page containing the desired information.

Web structure mining reduces two main problems of the web that are due to its vast amount of data. The first problem is irrelevant search results: the relevance of search information becomes disorganized because search engines often only allow for low-precision criteria. The second problem is the inability to index the vast amount of information hosted on the web, which causes low recall with content mining. This reduction comes in part from the function of discovering the model underlying the web hyperlink structure, provided by web structure mining [8].

The main purpose of structure mining is to extract previously unknown relationships between web pages. This structure data mining is useful for a business to link the information of its own web site to enable navigation, and to organize the relevant web pages into site maps. The more links provided within the relationships of the web pages, the more the navigation yields the link hierarchy, allowing ease of navigation [11].

Therefore, web mining and the use of structure mining can provide strategic results for the marketing of a web site for the production of sales. The more traffic is directed to the web pages of a particular site, the higher the level of return visitation to the site and of recall by search engines relating to the information or product provided by the web sites that serve a company or any business community. This also enables marketing strategies to provide results that are more productive through navigation of the pages linking to the homepage of the site itself. Using these concepts, the relevant web pages can be ranked based on their quality with respect to the query that the user or customer issues in the browser. According to the above understanding of web structure mining, web page ranking plays an important role in navigating to relevant pages of the highest quality.

3 Web Page Ranking
Searching the web involves two main steps: extracting the pages relevant to a query and ranking them according to their quality. Ranking is essential, as it helps the user find "quality" pages that are related to the query. Different metrics have been proposed to rank web pages according to their quality. We briefly discuss two of the famous ones.

With the rapid growth of the Web, providing relevant pages of the highest quality to users based on their queries becomes increasingly difficult. The reasons are that some web pages are not self-descriptive and that some links exist purely for navigational purposes. Therefore, finding suitable pages through a search engine that relies on web contents or makes use of hyperlink information alone is very difficult.

Therefore, PageRank provides a more advanced way to calculate the importance or significance of a web page than simply counting the number of pages that link to it (called "backlinks"). If a backlink comes from an "important" page, then that backlink is given a higher weighting than backlinks that come from non-important pages [4]. In a simple way, a link from one page to another may be considered as a vote. However, not only the number of votes a page receives is considered important, but also the "importance" or the "relevance" of the pages that cast these votes.
To solve the problems mentioned in this section, many algorithms have been proposed. Among these algorithms are PageRank [10] and Hypertext Induced Topic Selection (HITS) [2, 9]. The PageRank algorithm is an often-used algorithm in web structure mining. It measures the significance of pages by analyzing the relations between the hyperlinks of the web pages, and it ranks web pages based on the web structure [1, 8]. The PageRank algorithm was developed by Google and is named after Larry Page, Google's co-founder and president [10].

Since the main functionality of Google is to retrieve a list of pages related to a given query, based on factors such as title, tags or keywords, it then uses PageRank to adjust the results so that more "important" pages are provided at the top of the page list [10].

Hubs and Authorities
On the other hand, another page ranking algorithm, called Hyperlink-Induced Topic Search (HITS) and also known as hubs and authorities, is a link analysis algorithm that rates web pages. This algorithm was developed by Jon Kleinberg as a precursor to PageRank. The idea behind hubs and authorities stemmed from a particular insight into the design of web pages: certain web pages, known as hubs, served as large directories that were not actually authoritative in the information they held, but were used as compilations of a broad catalog of information that led users directly to other, authoritative pages. In other words, a good hub represented a page that pointed to many other pages, and a good authority represented a page that was linked by many different hubs [1]. The scheme therefore assigns two scores to each page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages. The following subsection describes these algorithms in more detail.

A. HITS (Hyperlink-Induced Topic Search) Algorithm
Hyperlink-Induced Topic Search (HITS), also known as hubs and authorities, is a link analysis algorithm that rates web pages, developed by Jon Kleinberg as a predecessor to PageRank. The idea behind hubs and authorities stemmed from a particular insight into the creation of web pages when the Internet was originally forming: certain web pages, known as hubs, served as large directories which were not actually authoritative in the information they held, but were used as compilations of a broad catalog of information that led users directly to other authoritative pages. As a matter of fact, a good hub identified a page pointing to many other pages, and a good authority characterized a page that was linked by many different hubs [1].

Therefore, we can assign two scores to each page: its authority, which approximates the value of the content of the page, and its hub value, which estimates the value of its links to other pages [5]. HITS ranks web pages by evaluating their inlinks and outlinks. In this algorithm, web pages pointed to by many hyperlinks are called authorities, whereas web pages that point to many hyperlinks are called hubs [4, 5, 11]. The algorithm was developed to be used in a popular search engine called Clever.

Hubs and authorities can be viewed as "fans" and "centers" in a bipartite core of a web graph, where the "fans" represent the hubs and the "centers" represent the authorities. The hub and authority scores computed for each web page indicate the extent to which the web page serves as a hub pointing to good authority pages or as an authority on a topic pointed to by good hubs.

The scores are calculated for a set of pages related to a topic using an iterative process called HITS [9]. First a query is submitted to a search engine and a set of significant documents is retrieved. This set, called the "root set," is then extended by including web pages that point to those in the "root set" and are pointed to by those in the "root set." This new set is called the "base set." An adjacency matrix A is formed such that if there exists at least one hyperlink from page i to page j, then Ai,j = 1; otherwise Ai,j = 0. The HITS algorithm is then used to calculate the hub and authority scores for this set of pages.

There have been alterations and improvements to the basic PageRank and hubs-and-authorities approaches, such as SALSA (Lempel and Moran 2000), topic-sensitive PageRank (Haveliwala 2002) and web page reputations (Mendelzon and Rafiei 2000). These different hyperlink-based metrics have been discussed by Desikan, Srivastava, Kumar, and Tan (2002). The mechanism of using the authorities and hubs can be illustrated in Figure 1.

Fig 1. Shows the hubs and authorities.

The hub score and authority score for a node are computed with the following algorithm:
• Start with each node having a hub score and an authority score of 1.
• Run the Authority Update Rule.
• Run the Hub Update Rule.
• Normalize the values by dividing each hub score by the sum of the squares of all hub scores, and dividing each authority score by the sum of the squares of all authority scores.
• Repeat from the second step as necessary.

Authorities and Hubs Rules
To begin the ranking, auth(p) = 1 and hub(p) = 1. We consider two types of updates: the Authority Update Rule and the Hub Update Rule. In order to compute the hub/authority scores of each node, repeated iterations of the Authority Update Rule and the Hub Update Rule are applied. A k-step application of the Hub-Authority algorithm entails applying k times first the Authority Update Rule and then the Hub Update Rule [6].

Authority Update Rule
∀p, we update auth(p) to be:

auth(p) = ∑_{i=1}^{n} hub(i)

where n is the total number of pages linked to p and i is a page linked to p. That is, the authority score of a page is the sum of the hub scores of all the pages that point to it.

Hub Update Rule
∀p, we update hub(p) to be:

hub(p) = ∑_{i=1}^{n} auth(i)

where n is the total number of pages p links to and i is a page which p links to. Thus, a page's hub score is the sum of the authority scores of all the pages it links to.

Finally, a normalization step is applied when the final hub-authority scores of the nodes are determined, after repeated applications of the algorithm. As directly and iteratively applying the Hub Update Rule and the Authority Update Rule leads to diverging values, it is necessary to normalize after every iteration. The values obtained from this process will then eventually converge [6].
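The update and normalization steps above can be sketched in a few lines of code. The link graph below is a made-up example for illustration, not taken from the paper; normalization divides by the sum of squares, exactly as the steps above describe:

```python
# Toy link graph (an assumption for illustration): each key links
# to the pages in its list. C is pointed to by A, B and D.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = sorted(links)

hub = {p: 1.0 for p in pages}   # initial hub scores
auth = {p: 1.0 for p in pages}  # initial authority scores

for _ in range(50):  # k-step application of the two update rules
    # Authority Update Rule: auth(p) = sum of hub scores of pages linking to p
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Hub Update Rule: hub(p) = sum of authority scores of pages p links to
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Normalize by the sum of squares, as described in the steps above
    a_norm = sum(v * v for v in auth.values())
    h_norm = sum(v * v for v in hub.values())
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

# C has three inlinks, so it ends up with the highest authority score;
# A points to two good authorities, so it ends up as the best hub.
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

The absolute magnitudes depend on the chosen normalization, but the relative ordering of the scores is what converges and what the ranking uses.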
B. PageRank Algorithm
PageRank is a link analysis algorithm, used by the Google Internet search engine, that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the web, with the aim of "measuring" its relative importance within the set. The algorithm may be applied to any set of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element E is also called the PageRank of E and is denoted by PR(E) [6].

The PageRank algorithm, one of the most commonly used page ranking algorithms, states that if a page has important links to it, its links to other pages also become important. Therefore, PageRank takes the backlinks into consideration and propagates the ranking through links: a page has a high rank if the sum of the ranks of its backlinks is high [8, 10]. Figure 2 shows an example of backlinks: page A is a backlink of page B and page C, while page B and page C are backlinks of page D [7].
25
Fig 2. Illustrates the backlinks.

The PageRank algorithm is also defined as a metric for ranking hypertext documents based on their quality. Page, Brin, Motwani and Winograd developed this metric for the popular search engine Google, built by Brin and Page in 1998. The main idea is that a page has a high rank if it is pointed to by many highly ranked pages: the rank of a page depends upon the ranks of the pages pointing to it. This procedure is applied iteratively until the rank of every page is resolved [8, 7].

We define PageRank as a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. PageRank can be computed for collections of documents of any size. It is assumed in several research papers that the distribution is evenly divided among all documents in the collection at the beginning of the computational process. The PageRank computation requires several passes through the collection, called "iterations", to adjust the approximate PageRank values so that they more closely reflect the theoretical true value [7].

A probability is expressed as a numeric value between 0 and 1. A 0.5 probability is usually expressed as a "50% chance" of something happening. Hence, a PageRank of 0.5 means there is a 50% chance that a person clicking on a random link will be directed to the document with the 0.5 PageRank.

Example:
Assume a small universe of four web pages: A, B, C and D. The initial approximation of PageRank would be evenly divided between these four documents, so each document would begin with an estimated PageRank value of 0.25.

In the original form of PageRank, the initial values were simply 1. This meant that the sum over all pages was the total number of pages on the web at that time. Later versions assume a probability distribution between 0 and 1. Here a simple probability distribution will be used, hence the initial value of 0.25.

If pages B, C and D link only to A, they would each give their 0.25 PageRank to A. All PageRank in this simplistic system would thus gather to A, because all links would be pointing to A, and the value would be:

PR(A) = PR(B) + PR(C) + PR(D) = 0.75.

Suppose instead that page B has a link to page C as well as to page A, while page D has links to all three pages. The value of the link votes is divided among all the outbound links on a page. Thus, page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page C. Only one third of D's PageRank is counted for A's PageRank (approximately 0.083).

In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank score divided by the normalized number of outbound links L (it is assumed that links to specific URLs count only once per document).

In the general case, the PageRank value for any page u can be expressed as:

PR(u) = ∑_{v ∈ B_u} PR(v) / L(v)

That is, the PageRank value for a page u is dependent on the PageRank values of each page v in the set B_u (the set containing all pages linking to page u), divided by the number L(v) of links from page v.
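The simplified computation in this example can be checked with a short script. The link structure below follows the example in the text (B links to A and C; D links to A, B and C); the text does not say where page C links, so C linking only to A is an assumption, and no damping factor is used, matching the simplified formula:

```python
# Link structure from the example; C -> A is an assumption,
# since the text does not state page C's outlinks.
links = {
    "A": [],          # A's outlinks are not used in the example
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}
pages = list(links)
pr = {p: 0.25 for p in pages}  # even initial distribution

def step(pr):
    """One pass of PR(u) = sum over v in B_u of PR(v) / L(v)."""
    return {
        u: sum(pr[v] / len(links[v]) for v in pages if u in links[v])
        for u in pages
    }

once = step(pr)
# B contributes 0.25/2 = 0.125 to A and D contributes 0.25/3 ≈ 0.083,
# matching the vote values given in the text (C contributes its full 0.25).
print(round(once["A"], 3))  # → 0.458
```

Repeating `step` corresponds to the "iterations" mentioned above; with a damping factor added, this becomes the usual production form of the algorithm.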
4 Comparison of Page Ranking Algorithms
In this section we explore the importance of the page ranking algorithms that are commonly used for information retrieval and compare them against different criteria: the mining technique used, functionality, accuracy, parameters, complexity, limitations and the search engine in which each is used. Table 4 illustrates the comparison between these algorithms.

Criteria                      PageRank                   HITS
Mining technique used         WSM                        WSM & WCM
Functionality                 Computes scores at index   Computes scores of
                              time; results are sorted   n highly relevant
                              on the importance of       pages on the fly.
                              pages.
Accuracy (high, middle, low)  High                       Middle
Parameters                    Backlinks                  Backlinks, forward
                                                         links & content
Complexity                    O(log N)                   <O(log N)
Limitations                   Query independent          Topic drift and
                                                         efficiency problem
Search engine                 Google                     Clever

Tab 4. Describes the comparison of page ranking algorithms.

PageRank and Hyperlink-Induced Topic Search (HITS) treat all links equally when distributing the rank score. PageRank is used in web structure mining, whereas HITS is used in both web structure mining and web content mining. PageRank computes the scores at indexing time and sorts them according to the importance of the pages, whereas HITS computes the hub and authority scores of n highly relevant pages on the fly. The input parameter used in PageRank is backlinks, while HITS uses backlinks, forward links and content as input parameters. The complexity of the PageRank algorithm is O(log N), whereas the complexity of the HITS algorithm is <O(log N). The accuracy was also estimated for both algorithms.

5 Conclusion
Web mining was defined as the use of data mining techniques to automatically retrieve, extract and evaluate information for knowledge discovery from web documents and services. This information is left behind by the past behavior of users. Web structure mining is one of the three categories of web mining, in which the data are used to identify the relationships between web pages linked by information or by direct link connections; it plays an important role in this approach. Many algorithms are used in web structure mining to rank the relevant pages. Algorithms such as PageRank and Hyperlink-Induced Topic Search (HITS) were used in this paper. These algorithms were described in detail and a comparative study was presented in Section 4. In future work, we are planning to carry out a performance analysis of PageRank and HITS and to work on finding ways to categorize users and web pages in order to obtain better PageRank results.

About the Author
Zakaria Suliman Zubi received his Ph.D. in Computer Science in 2002 from Debrecen University in Hungary and has been an Associate Professor since 2010. He is a reviewer for many scientific journals, such as those of the World Scientific and Engineering Academy and Society (WSEAS), the Journal of Software Engineering and Applications (JSEA), the Journal of Engineering and Technology Research (JETR) and the World Academy of Science, Engineering and Technology (WASET) journal, a member of the International Association of Engineers (IAENG), an Associate Editor of the WSEAS Transactions on Information Science and Applications, and a reviewer for more local journals in Libya. He is a member of the Association for Computing Machinery (ACM), a member of the IEEE society, and a member of the World Scientific and Engineering Academy and Society (WSEAS). He has published, as author and co-author, many research papers and technical reports in local and international journals and conference proceedings.

Acknowledgment
This work has been fully supported and funded by the Libyan Higher Education Ministry through Sirte University.
References:
[1] S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Networks and ISDN Systems, Vol. 30, Issue 1-7, pp. 107-117, 1998.
[2] C. Ding, X. He, P. Husbands, H. Zha, and H. Simon, Link analysis: Hubs and authorities on the world. Technical report 47847, 2001.
[3] J. Hou and Y. Zhang, Effectively Finding Relevant Web Pages from Linkage Information, IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 4, 2003.
[4] J. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Journal of the ACM, 46(5), pp. 604-632, 1999.
[5] J. M. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM, 46(5):604-632, September 1999.
[6] N. Duhan, A. K. Sharma and K. K. Bhatia, Page Ranking Algorithms: A Survey, Proceedings of the IEEE International Conference on Advance Computing, 2009.
[7] P. Ravi Kumar and Ashutosh Kumar Singh, Web Structure Mining: Exploring Hyperlinks and Algorithms for Information Retrieval, American Journal of Applied Sciences, 7(6), pp. 840-845, 2010.
[8] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, Mining the Link Structure of the World Wide Web, IEEE Computer, Vol. 32, pp. 60-67, 1999.
[9] Zakaria Suliman Zubi and Marim Aboajela Emsaed, Sequence mining in DNA chips data for diagnosing cancer patients, in Proceedings of the 10th WSEAS International Conference on Applied Computer Science (ACS'10), Hamido Fujita and Jun Sasaki (Eds.), World Scientific and Engineering Academy and Society (WSEAS), Stevens Point, Wisconsin, USA, pp. 139-151, 2010.
[10] Zakaria Suliman Zubi, Using some web content mining techniques for Arabic text classification, in Proceedings of the 8th WSEAS International Conference on Data Networks, Communications, Computers (DNCOCO'09), Manoj Jha, Charles Long, Nikos Mastorakis, and Cornelia Aida Bulucea (Eds.), World Scientific and Engineering Academy and Society (WSEAS), Stevens Point, Wisconsin, USA, pp. 73-8, 2009.