Chapter 6
Web Content Mining
L. Malak Bagais
Web Mining
 Data mining techniques applied to the Web
 Three areas:
1. web-usage mining
2. web-structure mining
3. web-content mining
Web Usage Mining
 Does not deal with the contents of web documents
 Goals:
- to determine how a website’s visitors use web
resources
- to study their navigational patterns
 The data used for web-usage mining is essentially
secondary
Web Structure Mining
 Web-structure mining is concerned with the topology
of the Web
 Focuses on data that organizes the content and
facilitates navigation
 The principal source of information is hyperlinks,
connecting one page to another
 Chapter 8 presents web-structure mining
Web Content Mining
 Web-content mining deals with primary data on the Web
 actual content of the web documents
 The goal of web-content mining is to extract information
 helps users locate and extract information relevant to
their needs
 Web-content mining is composed of multiple data types:
text, images, audio, and video
 It also deals with crawling the Web and searching for
information
Web Content Mining
 Web-content mining techniques are used to discover
useful information from content on the web:
- textual
- audio
- video
- images
- metadata
Origin of web data
 Some of the web content is generated dynamically
using queries to database management systems
 Other web content may be hidden from general users
Problems with Web data
 Distributed data
 Large volume
 Unstructured data
 Redundant data
 Quality of data
 High percentage of volatile data
 Varied data
Web Crawler
 A computer program that navigates the hypertext
structure of the web
 Crawlers are used to ease the formation of indexes used
by search engines
 The page(s) that the crawler begins with are called the
seed URLs.
 Every link from the first page is recorded and saved in a
queue
Periodic Web Crawler
 Builds an index by visiting a number of pages, then
replaces the current index
 Known as a periodic crawler because it is activated
periodically
Focused Web Crawlers
 Generally recommended due to the large size of the Web
 Visits pages related to topics of interest
 If a page is not pertinent, the entire set of possible
pages below it is pruned
Web Crawler
 Crawling process:
 Begin with a group of URLs
- Submitted by users
- Common URLs
 Traverse breadth-first or depth-first
 Extract more URLs from each visited page
 Numerous crawlers
- Problem of redundancy
- Web partition: one robot per partition
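The crawling process above — seed URLs, a queue of extracted links, breadth-first order, and a visited set to avoid redundancy — can be sketched as follows. This is a minimal sketch: `get_links` stands in for real page fetching and link extraction, which a production crawler would implement over HTTP.

```python
from collections import deque

def crawl(seed_urls, get_links, max_pages=100):
    """Breadth-first crawl starting from a group of seed URLs.

    get_links(url) is a caller-supplied function returning the URLs
    linked from a page; the visited set prevents redundant fetches.
    """
    frontier = deque(seed_urls)   # FIFO queue -> breadth-first order
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:        # redundancy check
            continue
        visited.add(url)
        for link in get_links(url):
            if link not in visited:
                frontier.append(link)  # record every extracted link
    return visited
```

Swapping the deque for a stack (`pop()` instead of `popleft()`) would give the depth-first variant.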
Focused Crawler
The focused crawler structure consists of two major
parts:
1. The distiller
2. The classifier
The Distiller
 A distiller verifies which pages contain links to
other relevant pages, which are called hub pages.
 Identifies hypertext nodes which are considered as
good access points to more relevant pages (HITS
algorithm).
The Hypertext Classifier
 A hypertext classifier establishes a resource rating
to estimate how advantageous it would be for the
crawler to pursue the links out of that page.
 The classifier assigns a relevance score to each
document with respect to the crawl topic.
 It evaluates the relevance of hypertext documents
according to the given topic.
Focused Crawler
The pages that the crawler visits are selected using a
priority-based structure, ordered by the priorities
assigned to pages by the classifier and the distiller
Focused Crawler- how it works
 User identifies sample documents that are of interest.
 Sample documents are classified based on a
hierarchical classification tree.
 Documents are used as the seed documents to begin
the focused crawling
Focused Crawler
Each document is classified into a leaf node of the taxonomy tree
 One approach, hard focus, follows links if there is an ancestor of this
node that has been marked as good
 Another approach, soft focus, estimates the probability that a page d is
relevant as:
R(d) = Σ_{c : good(c)} P(c | d)
 where c is a node in the taxonomy tree, and good(c) indicates that c has
been labeled as of interest
 The priority of visiting a page not yet visited is the maximum of the
relevance of the visited pages that point to it
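The soft-focus relevance score and the visit priority described above can be sketched as follows. This is a minimal sketch under our own assumptions: class probabilities arrive as a dict mapping taxonomy nodes to P(c | d), and relevance scores of visited pages are kept in a dict.

```python
def relevance(p_c_given_d, good):
    """Soft-focus relevance: R(d) = sum of P(c | d) over nodes c
    that have been labeled as of interest (good)."""
    return sum(p for c, p in p_c_given_d.items() if c in good)

def visit_priority(in_links, relevance_of_visited):
    """Priority of an unvisited page: the maximum relevance among
    the already-visited pages that point to it."""
    return max(
        (relevance_of_visited[u] for u in in_links if u in relevance_of_visited),
        default=0.0,
    )
```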
Context Graph
 Focused crawling has proposed the use of context graphs,
which in turn created the context focused crawler (CFC)
 The CFC performs crawling in two steps:
1. Context graphs and classifiers are constructed using a
set of seed documents as a training set
2. Crawling is performed using the classifiers to guide it.
 How is it different from the focused crawler? Context
graphs are updated during the crawl.
Context Graph
Search Engines
Search Engine
Uses a ‘spider’ or ‘crawler’ that crawls the Web hunting
for new or updated Web pages to store in an index
Search Engine
Basic components to a search engine:
The crawler /spider
Gathers new or updated information on Internet
websites
The index
Used to store information about several websites
The search software
Performs searching through the huge index in an
effort to generate an ordered list of useful search
results
Search Engine Mechanism
Search Engines
 Generic structure of all search engines is basically the
same
 However, the search results differ from search engine
to search engine for the same search terms, why?
Responsibilities of Search Engines
 Document collection
 choose the documents to be indexed
 Document indexing
 indicate the content of the selected documents.
 Searching
 translate the user's information need into a query
 retrieval (search algorithms, ranking of web pages)
 Results
 present the outcome
Phases of Query Binding
Query binding is the
process of translating a
user need into a search
engine query
Phases of Query Binding
Three-tier process:
1. The first level involves the user formulating the
information need into a question or a list of terms using
experiences and vocabulary and entering it into the search
engine.
2. The search engine must translate the words with possible
spelling errors into processing tokens.
3. The search engine must use the processing tokens to
search the document database and retrieve the appropriate
documents.
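Step 2 above — turning the user's words into processing tokens — typically involves lowercasing, punctuation stripping, and stopword removal. A minimal sketch (the stopword list here is illustrative; real systems use much larger lists and may add spelling correction or stemming):

```python
import re

def processing_tokens(query, stopwords=frozenset({"the", "a", "an", "of"})):
    """Normalize a raw user query into processing tokens:
    lowercase, keep only alphanumeric runs, drop stopwords."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    return [w for w in words if w not in stopwords]
```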
Types of Queries
 Boolean Queries:
Boolean logic queries connect words in the search using
operators such as AND or OR.
 Natural Language Queries:
In natural language queries the user frames the query as a
question or a statement
 Thesaurus Queries:
In a thesaurus query the user selects the term from a
preceding set of terms predetermined by the retrieval
system
Types of Queries cont.
 Fuzzy Queries:
Fuzzy queries are not fully specific; they handle
misspellings and variations of the same word
 Term Searches:
The most common type of query on the Web is when a
user provides a few words or phrases for the search
 Probabilistic Queries:
Probabilistic queries refer to the way in which the IR
system retrieves documents according to relevancy .
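A Boolean query of the kind described above can be evaluated against a document's term set. A minimal sketch using a single AND/OR operator over a list of terms (the query representation is our own; real engines support nested expressions):

```python
def matches_boolean(doc_terms, op, terms):
    """Evaluate a flat Boolean query against a document's term set.

    op is "AND" (all terms must appear) or "OR" (any term suffices).
    """
    hits = [t in doc_terms for t in terms]
    return all(hits) if op == "AND" else any(hits)
```

For example, the query `web AND mining` matches only documents containing both terms.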
The Robot Exclusion
Why would developers want to exclude robots from
parts of their websites?
 The robot exclusion protocol
 to indicate restricted parts of the Website to robots that
visit a site
 for giving crawlers/spiders (“robots”) limited access to a
website
The Robot Exclusion
Website administrators and content providers can limit
robot activity through two mechanisms:
 The Robots Exclusion Protocol is used by Website
administrators to specify which parts of the site
should not be visited by a robot, by providing a file
called robots.txt on their site.
 The Robots META Tag is a special html META tag that
can be used in any Web page to indicate whether that
page should be indexed, or parsed for links.
Example of the Robots META Tag
<META NAME="ROBOTS" CONTENT="NOINDEX,
NOFOLLOW">
If a web page contains the above tag, a robot should
not index this document (indicated by the word
NOINDEX), nor parse it for links (specified using
NOFOLLOW).
The Robot Exclusion
Robots.txt
 The "User-agent: *" means this section applies to all
robots.
 The "Disallow: /" tells the robot that it should not visit
any pages on the site.
Example-1
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
Example-1
 In this example, three directories are excluded.
 Note that you need a separate "Disallow" line for
every URL prefix you want to exclude -- you cannot
say "Disallow: /cgi-bin/ /tmp/" on a single line.
 Also, you may not have blank lines in a record, as they
are used to delimit multiple records.
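A well-behaved crawler can check these rules with Python's standard `urllib.robotparser` module. A minimal sketch, parsing the rules from Example-1 directly rather than fetching them over the network (the URLs and the "MyBot" agent name are illustrative):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse the Example-1 rules as a list of lines (no network fetch).
rp.parse("""\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
""".splitlines())

# can_fetch(agent, url) applies the matching User-agent record.
print(rp.can_fetch("MyBot", "http://example.com/index.html"))    # True
print(rp.can_fetch("MyBot", "http://example.com/tmp/page.html")) # False
```

In a real crawler you would call `rp.set_url(".../robots.txt")` and `rp.read()` to fetch the file before crawling a site.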
Example-2
User-agent: Googlebot
Disallow: /
This example excludes a single robot, Google's crawler
(Googlebot), from the entire site.
What modifications to robots.txt would exclude the
Bing robot instead?
Important Considerations when
using robots.txt
 Robots can ignore your /robots.txt. Especially
malware robots that scan the web for security
vulnerabilities, and email address harvesters used by
spammers will pay no attention.
 The /robots.txt file is a publicly available file. Anyone
can see what sections of your server you don't want
robots to use.
 It is not advisable to use /robots.txt to hide
information.
Robots META tag
 Robots.txt can only be created or modified by website administrators.
 META tag can be used by individual web page authors
 The robots META tag is placed in the <HEAD> section
of the HTML page.
Robot META tag
<html>
<head>
<meta name="robots" content="noindex,
nofollow">
…
<title>..</title>
</head>
Content terms
 ALL
 NONE
 INDEX
 NOINDEX
 FOLLOW
 NOFOLLOW
 ALL = INDEX, FOLLOW
 NONE = NOINDEX, NOFOLLOW
Content combinations
<meta name="robots" content="index,follow">
= <meta name="robots" content="all">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">
= <meta name="robots" content="none">
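As an illustration, a page's robots META directives can be extracted with Python's standard `html.parser` module. This is a minimal sketch (the class name is our own; directives are normalized to lowercase for comparison):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives of a <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            content = a.get("content", "")
            # Split "NOINDEX, NOFOLLOW" into normalized directives.
            self.directives = {d.strip().lower() for d in content.split(",")}

p = RobotsMetaParser()
p.feed('<html><head><meta name="robots" content="NOINDEX, NOFOLLOW">'
       '</head></html>')
# p.directives now contains 'noindex' and 'nofollow'
```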
Exercise
Check if the KSU website has a robot exclusion file
robots.txt.
Multimedia Information Retrieval
 Retrieval from the perspective of images and videos
 A content-based retrieval system for images is the Query by Image
Content (QBIC) system:
 A three-dimensional color feature vector, where the distance
measure is simple Euclidean distance.
 k-dimensional color histograms, where the bins of the histogram
can be chosen by a partition-based clustering algorithm.
 A three-dimensional texture vector consisting of features that
measure scale, directionality, and contrast. Distance is computed
as a weighted Euclidean distance measure, where the default
weights are inverse variances of the individual features.
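The texture distance described above — a weighted Euclidean distance whose default weights are the inverse variances of the individual features — can be sketched as follows. A minimal sketch under the stated assumption that weights are supplied as per-feature variances:

```python
import math

def weighted_euclidean(x, y, variances):
    """QBIC-style texture distance: weighted Euclidean distance
    where each feature's weight is the inverse of its variance."""
    return math.sqrt(sum((a - b) ** 2 / v
                         for a, b, v in zip(x, y, variances)))
```

With unit variances this reduces to plain Euclidean distance, as used for the three-dimensional color vectors.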
Multimedia Information Retrieval
The query can be expressed directly in terms of the
feature representation itself
 For instance, Find images that are 40% blue in color and
contain a texture with specific coarseness property
 Or a specific layout
Multimedia Information Retrieval
 MIR System
www.hermitagemuseum.org/html_En/index.html
 A QBIC Layout Search demo, giving a step-by-step
demonstration of the search described in the text, can
be found at:
www.hermitagemuseum.org/fcgibin/db2www/qbicLayout.mac/qbic?selLang=English.
Multimedia Information Retrieval
 As multimedia becomes a more extensively used data
format, it is vital to deal with the issues of:
- metadata standards
- classification
- query matching
- presentation
- evaluation
 to guarantee the development and deployment of
efficient and effective multimedia information
retrieval systems