Automatic Web Page Categorization by Link and Context Analysis
Giuseppe Attardi
Antonio Gulli
Fabrizio Sebastiani
Introduction
• Document retrieval on the Web
  • Search engines – keyword-based searches
  • Classified categories – each category lists Web sites relevant to that category
Introduction
• Document d → Category c
  • Requires understanding of both d and c
  • Has traditionally been done manually
• Disadvantages of manual classification
  • Cannot keep pace with the growth rate and number of Web pages
  • Highly subjective, lower quality
Introduction
• Automatic classification
  • Text categorization
    • Build the representation of a category using a training set of documents pre-categorized under it
    • Compare the representation of a given document d with the representation of the category c to decide whether d belongs to c
  • Other approaches
  • Basic idea – categorization by content
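The text categorization scheme on this slide (build a category representation from pre-categorized documents, then compare a document against it) can be sketched as follows. This is a minimal illustration, not the method of any particular system: the bag-of-words representation, the cosine similarity measure, the 0.2 threshold and the training sentences are all assumptions made for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as lowercase term counts (assumed representation)."""
    return Counter(text.lower().split())


def build_profile(training_docs):
    """Sum the term counts of all documents pre-categorized under one category."""
    profile = Counter()
    for doc in training_docs:
        profile.update(bag_of_words(doc))
    return profile


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def belongs_to(document, profile, threshold=0.2):
    """Decide whether document d belongs to category c (threshold is arbitrary)."""
    return cosine(bag_of_words(document), profile) >= threshold


# Hypothetical training set for a "Biology" category.
biology = build_profile(["cell biology and genetics", "molecular biology of the cell"])

in_category = belongs_to("an introduction to cell genetics", biology)
out_of_category = belongs_to("keyword search engines", biology)
print(in_category, out_of_category)
```

Note that this approach needs the text of d itself, which is exactly what categorization by content means.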
Introduction
• Categorization by context
  • Uses the context surrounding a link
  • Uses relevance hints that are present in the structure of HTML documents
• Advantage
  • Can deal with multimedia material, since it analyzes context and not content
• Theseus [Teseo]
Improving Web search engines
• AltaVista: “refine” capability
• Infoseek: grouping of query results, retrieving similar pages
• Automatic categorization techniques → better Web retrieval tools, organized material, e.g. Lycos, Infoseek (Content Classification Engine - CCE)
Categorization by context
• Basic idea
  • The referring Web page must contain enough hints about the document’s content
  • These hints are sufficient to classify the document
• What are these hints?
  • Anchor text of a link: <A>…</A>
  • Page title
  • Section titles
Architecture
• Tasks performed
  • Spidering
  • Structure analysis
  • URL categorization
  • Weight combination
  • Catalog update
Spidering and HTML Structure Analysis
<html>
<head> <title> Yahoo! – Science: Biology </title> </head>
<body>
...
<ul> <li> <a href="http://esg-www.mit.edu:8001/esgbio/">MIT Biology Hypertextbook</a> introductory resource including information on chemistry, biochemistry,
genetics, cell and molecular biology, and immunology.
<li>
...
Spidering and HTML Structure Analysis
• The following URL context path is created:
  http://esg-www.mit.edu:8001/esgbio:
  “MIT Biology Hypertextbook”:
  “introductory resource including information on chemistry, biochemistry, genetics, cell and molecular biology, and immunology”:
  “Yahoo! – Science: Biology”
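A rough sketch of how such a context path could be extracted, using Python's standard `html.parser`. The class name `ContextPathParser`, the tuple layout (URL, anchor text, text following the link, page title) and the rule that a link's context ends at the next `<li>` or `<a>` are assumptions for illustration; Theseus itself used a Perl parser and a Java structure analyzer, whose details are not given here. The sample is a trimmed copy of the page fragment above.

```python
from html.parser import HTMLParser


class ContextPathParser(HTMLParser):
    """Collect (url, anchor text, trailing text, page title) for each link."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paths = []
        self._in_title = False
        self._in_anchor = False
        self._current = None  # [url, anchor text, trailing text]

    def _flush(self):
        # Close the link being assembled and store its context path.
        if self._current is not None:
            url, anchor, trailing = self._current
            self.paths.append(
                (url, " ".join(anchor.split()), " ".join(trailing.split()),
                 " ".join(self.title.split()))
            )
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            self._flush()
            href = dict(attrs).get("href")
            if href:
                self._current = [href, "", ""]
                self._in_anchor = True
        elif tag == "li":
            self._flush()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a":
            self._in_anchor = False
        elif tag in ("li", "ul"):
            self._flush()

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._current is not None:
            self._current[1 if self._in_anchor else 2] += data

    def close(self):
        super().close()
        self._flush()


SAMPLE = """<html>
<head> <title> Yahoo! – Science: Biology </title> </head>
<body>
<ul> <li> <a href="http://esg-www.mit.edu:8001/esgbio/">MIT Biology Hypertextbook</a>
introductory resource including information on chemistry, biochemistry,
genetics, cell and molecular biology, and immunology.
</ul> </body> </html>"""

parser = ContextPathParser()
parser.feed(SAMPLE)
parser.close()
paths = parser.paths
for element in paths[0]:
    print(element)
```

Section titles, the third kind of hint, are not handled in this sketch; they would be tracked the same way as the page title.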
URL Categorization
• One URL may have several context paths
• Category tree – each node identifies a category
• URL categorization finds the most appropriate categories to which the URL should belong
  • Produces a sequence of weights, one for each node in the category tree
  • URL: N1=w1, N2=w2, N3=w3, …, Nn=wn
  • Each weight wi is the degree of confidence that the URL belongs under node Ni
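As a hedged illustration of producing one weight per category node from a single context path: the node names, the keyword profiles in `CATEGORY_PROFILES` and the scoring rule (fraction of a node's profile terms found in the context path) are all hypothetical, chosen only to make the idea concrete. The slides do not specify how Theseus computes its weights.

```python
# Hypothetical category tree, flattened to "path -> profile terms".
CATEGORY_PROFILES = {
    "Science": {"science"},
    "Science/Biology": {"biology", "genetics", "cell"},
    "Science/Physics": {"physics", "quantum"},
}


def tokenize(text):
    """Lowercase words with surrounding punctuation stripped."""
    return {word.strip('.,:;!?"“”').lower() for word in text.split()}


def categorize(context_path):
    """Return {node: weight} for one context path.

    Assumed scoring: the share of a node's profile terms that occur
    anywhere in the context path strings.
    """
    terms = set()
    for element in context_path:
        terms |= tokenize(element)
    return {
        node: len(profile & terms) / len(profile)
        for node, profile in CATEGORY_PROFILES.items()
    }


context_path = [
    "MIT Biology Hypertextbook",
    "introductory resource including information on chemistry, biochemistry, "
    "genetics, cell and molecular biology, and immunology",
    "Yahoo! – Science: Biology",
]
weights = categorize(context_path)
for node, weight in weights.items():
    print(f"{node}={weight}")
```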
Weight Combination
• Weights from all context paths for a URL are added and normalized
• If the weight of a node is greater than a certain threshold, the URL is categorized under that node
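A small sketch of this combination step. The slide does not say how the weights are normalized or what the threshold is, so dividing by the total weight and the value 0.3 are assumptions, as are the example weights.

```python
from collections import defaultdict


def combine(weight_lists, threshold=0.3):
    """Add per-node weights over all context paths, normalize, then threshold.

    Normalization by the sum of all node weights is an assumption.
    """
    totals = defaultdict(float)
    for weights in weight_lists:
        for node, weight in weights.items():
            totals[node] += weight
    grand_total = sum(totals.values())
    normalized = {
        node: (weight / grand_total if grand_total else 0.0)
        for node, weight in totals.items()
    }
    selected = [node for node, weight in normalized.items() if weight > threshold]
    return normalized, selected


# Hypothetical weights from two context paths of the same URL.
path_a = {"Science/Biology": 1.0, "Science/Physics": 0.0}
path_b = {"Science/Biology": 0.5, "Science/Physics": 0.5}

normalized, selected = combine([path_a, path_b])
print(normalized)
print(selected)
```

With these inputs the URL is filed under "Science/Biology" only, since its normalized weight (0.75) exceeds the threshold while that of "Science/Physics" (0.25) does not.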
Theseus
• Theseus is a tool built to verify the validity of the method
• Components
  • TreeTagger: a part-of-speech tagger
  • HTML parser written in Perl
  • HTML structure analyzer (produces the context tree) written in Java
• Experiments were run using the Arianna catalog
Theseus: Exploiting Noun Phrases
• What is noun-phrase analysis?
  • Example: “a high school female student”
  • Without noun-phrase analysis → “high school”
  • With noun-phrase analysis → detects that the subject of the phrase is not “high school”
• Does it improve the effectiveness of classification?
  • Fewer documents per category
  • Overall improvement of about 5%
Theseus: Identifying Site Structure, Link Identification
• Performs an initial breadth-first analysis to a depth of 3
• Repeated links (occurrence of 90% or more) are considered structural links and are eventually discarded
• Link identification is performed in the initial phase of site analysis
• Ability to recognize CGI references
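The two steps above (breadth-first analysis to depth 3, then discarding repeated links) might look like the sketch below. It works on an in-memory link graph rather than real HTTP fetches, and it reads "occurrence of 90% or more" as "present on at least 90% of the visited pages", which is an interpretation, not something the slide states. The `SITE` graph and the function names are hypothetical.

```python
from collections import Counter, deque

# Hypothetical site: page -> links found on that page. Every page links to "/".
SITE = {
    "/": ["/", "/about", "/news"],
    "/about": ["/", "/team"],
    "/news": ["/", "/news/1"],
    "/team": ["/"],
    "/news/1": ["/", "/news/2"],
    "/news/2": ["/", "/news/3"],
    "/news/3": ["/"],
}


def crawl(start, links, max_depth=3):
    """Breadth-first traversal that stops expanding pages at max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


def structural_links(pages, links, min_share=0.9):
    """Links present on at least min_share of the pages (assumed criterion)."""
    counts = Counter()
    for page in pages:
        counts.update(set(links.get(page, [])))
    return {target for target, n in counts.items() if n / len(pages) >= min_share}


pages = crawl("/", SITE)
structural = structural_links(pages, SITE)
print(pages)
print(structural)
```

Here "/news/3" sits at depth 4 and is never visited, and the home link "/" is flagged as structural because it appears on every visited page.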
Theseus: Integration With a Search Engine
• Example: Yahoo!
• Several benefits
  • Avoids separate spidering of Web documents
  • Supports queries within categories – “Search within this category”
• Vice versa
  • Category information can be used to group query results – improved presentation
Theseus: Assessment
• Experiment: categorize a subset of Yahoo! pages
  • Obtained the same categorization in most cases
  • Classifies approximately 500 sites per hour
  • Is more precise, e.g. “microbiology journals” instead of “biology journals”
Open Issues
• Building category profiles
  • By hand
  • Learning techniques
  • Possible solution: minimal category profiles, to be extended in the learning phase
• Proper ranking of documents in the catalog
Part-of-speech Tagging
• The task of POS tagging is to assign part-of-speech tags to words, reflecting their syntactic category. Words can often belong to different syntactic categories in different contexts, however. For instance, the string "books" has two readings: in the sentence "he books tickets" the word "books" is a third-person singular verb, but in the sentence "he reads books" it is a plural noun. A POS tagger should segment a word, determine its possible readings, and assign the right reading given the context.
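To make the "books" example concrete, here is a toy disambiguator. It is not how TreeTagger works; the four-word `LEXICON`, the coarse tags and the hand-written `PREFERRED_AFTER` table (which tag is tried first given the previous tag) are assumptions made purely for illustration.

```python
# Possible readings of each word (toy lexicon, assumed).
LEXICON = {
    "he": {"PRON"},
    "books": {"VERB", "NOUN"},
    "reads": {"VERB"},
    "tickets": {"NOUN"},
}

# Hand-written context preferences: tags to try, in order, after the previous tag.
PREFERRED_AFTER = {
    None: ["PRON", "NOUN", "VERB"],
    "PRON": ["VERB", "NOUN", "PRON"],
    "VERB": ["NOUN", "PRON", "VERB"],
    "NOUN": ["VERB", "NOUN", "PRON"],
}


def tag(sentence):
    """Pick, for each word, the first preferred tag among its possible readings."""
    tagged = []
    previous = None
    for word in sentence.lower().split():
        readings = LEXICON[word]  # raises KeyError for words outside the toy lexicon
        choice = next(t for t in PREFERRED_AFTER[previous] if t in readings)
        tagged.append((word, choice))
        previous = choice
    return tagged


first = tag("he books tickets")
second = tag("he reads books")
print(first)
print(second)
```

In the first sentence "books" follows a pronoun and is tagged as a verb; in the second it follows a verb and is tagged as a noun.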