Download Web Pages - Vasile Avram

Defining Metrics to Automate the Quantitative
Analysis of Textual Information within a Web Page
by Professor Vasile AVRAM, PhD
Informatics in Economy Department
Academy of Economic Studies –
Bucharest ROMANIA
1 Search Engines
1st Collect information (keywords, URL, content, inbound and outbound links, etc.);
2nd Analyze the collected information:
- rank it
- index it
3rd Store it in the database (compressed?)
[Diagram: search engine components. Spider/crawler: follows links, finds and downloads web pages. Indexer: analyzes the information. Database: stores the cataloged information. Results engine: receives the search request and returns the search results.]
2 Ranking and SEO
Common page ranking criteria:
- Location – the position of the keyword on the page;
- Frequency – how often the search term appears on the page;
- Links – the type and number of links on the web page;
- Click-throughs – the number of click-throughs the site receives
compared with the click-throughs of the other pages shown in the
page ranking.
3 Hiding information by exploiting CSS features
Figure 1 Aspect of a webpage with CSS enabled (left) and CSS disabled (right)
Figure 2 The source of the web page
Figure 3 What a spider sees in the page
4 Determining the effective amount of text information (EATI)
within a web page
Figure 4 A snapshot of the page (IECapt; [6])
$$\mathrm{EATI} = \frac{\mathrm{ATIOCR}}{\mathrm{TIES}} \qquad (1)$$
The Effective Amount of Text Information (EATI) is the ratio
between the amount of text information obtained by applying
optical character recognition (OCR) to the snapshot of the web
page (denoted ATIOCR; see figure 4) and the text information
extracted by the spider (denoted TIES; see figure 3).
The value of the ratio can be:
- less than 1: the page contains hidden information, in inverse
proportion to the value of the metric (the smaller the metric,
the larger the amount of hidden text);
- equal to 1: the ideal case, in which what is shown is what is contained;
- greater than 1: the page shows extra text information, which
signals that it has images containing text that, in most cases,
is not considered when ranking. The larger the value, the more
extra text there is.
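The three cases above can be sketched in a few lines of Python. This is only an illustration, not the author's implementation: the slides do not say in which unit the "amount of text" is measured, so the sketch assumes a count of whitespace-separated words, and the names `word_count`, `eati` and `interpret` are hypothetical.

```python
def word_count(text: str) -> int:
    """Amount of text, measured as whitespace-separated words (an assumption)."""
    return len(text.split())


def eati(ocr_text: str, spider_text: str) -> float:
    """EATI = ATIOCR / TIES, as in formula (1)."""
    ties = word_count(spider_text)
    if ties == 0:
        raise ValueError("TIES is zero: the spider extracted no text")
    return word_count(ocr_text) / ties


def interpret(value: float) -> str:
    """Map an EATI value onto the three cases described in the text."""
    if value < 1:
        return "hidden text"
    if value > 1:
        return "extra text in images"
    return "shown equals contained"
```

In practice OCR errors make an exact value of 1 unlikely, so a real tool would probably compare against a small tolerance band around 1; the slides do not specify one.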
The working procedure used to evaluate the metric involves the
following three steps and the corresponding types of tools:
1st. Use a spider to extract the text information within a
web page and determine the TIES value required in formula (1). The
spider we built is based on the theory in [4], the libraries available
at [6], and our own functions to clean up the extracted text;
2nd. Use a snapshot application that can be called
from within a robot body to take a snapshot of the page involved in
step one and save it in an image format;
3rd. Apply an OCR tool (here, ReadIRIS Pro 11) to the
image saved in the previous step and obtain the recognized text
required to determine ATIOCR in (1).
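Step one can be illustrated with a minimal sketch that uses only Python's standard `html.parser` module. It is not the PHP spider described above; the class `TextSpider` and the function `extract_text` are hypothetical names. The point it shows is that a spider reads the markup and ignores styling, so text hidden with CSS still counts toward TIES.

```python
from html.parser import HTMLParser


class TextSpider(HTMLParser):
    """Collects the text between tags; script and style bodies are skipped."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # CSS rules are never applied, so "invisible" text is collected too.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the cleaned-up text a spider would see in the page source."""
    spider = TextSpider()
    spider.feed(html)
    spider.close()
    return " ".join(" ".join(spider.chunks).split())


page = (
    "<html><head><style>.h{display:none}</style></head><body>"
    '<p>Visible text</p><p class="h">hidden keywords</p>'
    "</body></html>"
)
print(extract_text(page))
```

Steps two and three depend on external snapshot and OCR programs and are therefore not sketched here.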
A. Determining the textual information contained by graphic elements
(TIG) metric
The procedure used to determine the textual information contained
by graphic elements (denoted TIG) within a web page is:
1st. Use a spider to extract the graphic elements (images, pictures,
shapes, etc.) together with their positional coordinates and
recompose a working web page of the same size as the original,
containing only those graphic elements positioned at their
proper coordinates;
2nd. Use a snapshot application that can be called from within a
robot body to take a snapshot of the working page built in step one
and save it in an image format accepted as input by the OCR tool;
3rd. Apply an OCR tool to the image saved in the previous step and
obtain the recognized text required to determine the value of the
textual information contained by graphic elements (TIG).
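The "recompose a working web page" part of step one might look like the sketch below. It assumes the positional coordinates of each graphic element have already been obtained (for example from a rendering engine), since the slides do not say how they are measured. The `Graphic` record and `build_graphics_only_page` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Graphic:
    """One graphic element and its position in the original page (pixels)."""
    src: str
    left: int
    top: int
    width: int
    height: int


def build_graphics_only_page(graphics, page_width, page_height):
    """Build an HTML page of the original size that contains only the graphics."""
    imgs = "\n".join(
        f'<img src="{escape(g.src, quote=True)}" '
        f'style="position:absolute;left:{g.left}px;top:{g.top}px;'
        f'width:{g.width}px;height:{g.height}px">'
        for g in graphics
    )
    return (
        "<!DOCTYPE html>\n<html><body "
        f'style="margin:0;position:relative;'
        f'width:{page_width}px;height:{page_height}px">\n'
        f"{imgs}\n</body></html>"
    )


working_page = build_graphics_only_page(
    [Graphic("logo.png", 10, 20, 100, 50)], page_width=800, page_height=600
)
```

The resulting page would then be passed to the snapshot and OCR tools of steps two and three.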
B. Determining the quantity of textual information shown to the
user (QTISU) metric
$$\mathrm{QTISU} = \mathrm{ATIOCR} - \mathrm{TIG} \qquad (2)$$
C. Determining the text information shown to the user (TISU)
from tags metric
$$\mathrm{TISU} = \frac{\mathrm{QTISU}}{\mathrm{TIES}} \times 100 \qquad (3)$$
- TISU = 100: what is shown equals what the spider extracted (no hidden
information is used);
- TISU < 100: the value is the percentage of the textual information
contained in tags that is actually shown to the user. The lower the value,
the more textual information is hidden.
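A short numeric sketch of metrics B and C follows. It reads formula (2) as the text shown to the user minus the text that comes from graphic elements, and it assumes all amounts are word counts. The function names and the sample numbers are hypothetical.

```python
def qtisu(atiocr: float, tig: float) -> float:
    """Formula (2): text shown to the user that does not come from graphics."""
    return atiocr - tig


def tisu(atiocr: float, tig: float, ties: float) -> float:
    """Formula (3): percentage of the tag text that is shown to the user."""
    if ties == 0:
        raise ValueError("TIES is zero: the spider extracted no text")
    return 100.0 * qtisu(atiocr, tig) / ties


# Hypothetical page: OCR finds 450 words, 50 of them inside images,
# while the spider extracts 500 words from the tags.
print(qtisu(450, 50))       # 400
print(tisu(450, 50, 500))   # 80.0
```

With these sample numbers, 80 percent of the text in the tags is shown, so about one fifth of it is hidden from the user.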
D. The percent of textual information revealed by graphic
elements to the user (TIRGU) metric
$$\mathrm{TIRGU} = \frac{\mathrm{TIG}}{\mathrm{ATIOCR}} \times 100 \qquad (4)$$
TIRGU = 100: the entire text information shown to the user is
contained only in the graphic elements;
TIRGU < 100: the value is the percentage of the textual information
shown to the user that comes from graphic elements. The lower the value,
the more of the shown textual information comes from tags.
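Formula (4) can be checked with a small sketch, again assuming the amounts are word counts. The function name `tirgu` and the sample numbers are hypothetical.

```python
def tirgu(tig: float, atiocr: float) -> float:
    """Formula (4): percentage of the shown text that comes from graphics."""
    if atiocr == 0:
        raise ValueError("ATIOCR is zero: the OCR tool recognized no text")
    return 100.0 * tig / atiocr


# Hypothetical page: OCR recognizes 200 words, 50 of them inside images.
from_graphics = tirgu(50, 200)
from_tags = 100.0 - from_graphics
print(from_graphics, from_tags)  # 25.0 75.0
```

The complement, 100 minus TIRGU, is the share of the shown text that comes from tags.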
5 Conclusions
References
[1] Jorge Cardoso (ed.), Semantic Web Services: Theory, Tools
and Applications, IGI Global © 2007 Books24x7.
<http://common.books24x7.com/book/id_20775/book.asp>
[2] Vasile Avram, “Effective Amount of Text Information (EATI)
in a Web Page – A Proposal for a New Metric and Method to
Determine”, Proceedings of the 9th International
Conference on Informatics in Economy, May 2009, Editura
Economică, ISBN 978-606-505-172-2, pp. 163-168
[3] Jerri L. Ledford – SEO Search Engine Optimization Bible,
Wiley Publishing 2008
[4] Google - Hidden text and links, Webmaster Tools,
www.google.com
[5] Michael Schrenk - Webbots, Spiders, and Screen Scrapers:
A Guide to Developing Internet Agents with PHP/CURL, No
Starch Press. 2007 Books24x7.
<http://common.books24x7.com/book/id_22218/book.asp>
[6] http://www.sourceforge.org – Open Source PHP libraries
for robot development
[7] P.J. Deitel, H.M. Deitel – Internet and World Wide Web How
to Program, fourth edition, Prentice Hall 2008, pages 160-190
[8] World Wide Web Consortium - The Specification of
Standards for HTML, XHTML, CSS, XML: http://www.w3.org
[9] Vasile Avram – Internet Technologies for Business:
Documents and Websites-structure and description
languages, http://www.avrams.ro/lecture-notes.htm
[10] Yahoo! Search Content Quality Guidelines,
www.yahoo.com
[11] SEO tools-Search Engine marketing, www.seologic.com