Download Web Pages - Vasile Avram

Defining Metrics to Automate the Quantitative Analysis of Textual Information within a Web Page

by Professor Vasile AVRAM, PhD
Informatics in Economy Department
Academy of Economic Studies – Bucharest, ROMANIA

1 Search Engines

1st Collect information (keywords, URL, content, links in/out, etc.);
2nd Analyze the collected information:
- ranked
- indexed
3rd Store in the database (compressed?)

[Diagram: search engine architecture. Labels: Search Engine; Web Pages; Crawler; Indexer; follow links; Spider; find pages; Downloads Pages; Analyze Information; Cataloged Information; Database; Results Engine; Search request; Search results.]

2 Ranking and SEO

Common page ranking criteria:
- Location – the position of the keyword;
- Frequency – the frequency with which the search term appears on the page;
- Links – the type and number of links on a web page;
- Click-throughs – the number of click-throughs the site receives versus the click-throughs received by the other pages shown in the ranking.

3 Hiding information by exploiting CSS features

Figure 1 Aspect of a web page with CSS enabled (left) and CSS disabled (right)
Figure 2 The source of the web page
Figure 3 What a spider sees in the page

4 Determining the effective amount of text information (EATI) within a web page

Figure 4 A snapshot of the page (IECapt; [6])

$$\mathrm{EATI} = \frac{\mathrm{ATIOCR}}{\mathrm{TIES}} \qquad (1)$$

The Effective Amount of Text Information (EATI) is the ratio of the amount of text information obtained by applying optical character recognition (OCR) to the snapshot of the web page (figure 4), denoted ATIOCR, to the text information extracted by the spider (figure 3), denoted TIES.
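Formula (1) can be illustrated with a minimal sketch. It assumes that the "amount of text information" is measured as a word count, a unit the slides do not fix, and that the OCR text and the spider text are already available as strings. The function names `text_amount` and `eati` are hypothetical and are not taken from the author's tools.

```python
import re


def text_amount(text: str) -> int:
    """Amount of text information, measured here as a word count.

    Assumption: the source does not fix a unit; words are used for illustration.
    """
    return len(re.findall(r"\w+", text))


def eati(ocr_text: str, spider_text: str) -> float:
    """Formula (1): EATI = ATIOCR / TIES."""
    ties = text_amount(spider_text)
    if ties == 0:
        raise ValueError("TIES is zero: the spider extracted no text")
    return text_amount(ocr_text) / ties


# Made-up example: the spider sees 10 words, the snapshot shows only 4 of them.
spider_text = "cheap flights cheap hotels cheap cars Welcome to our site"
ocr_text = "Welcome to our site"
print(eati(ocr_text, spider_text))  # 4 / 10 = 0.4
```

In this made-up example the OCR text is shorter than the spider text, so the ratio falls below 1.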
The value of the ratio can be:
- less than 1, in which case the page contains hidden information in inverse proportion to the value of the metric (the smaller the metric, the larger the amount of hidden text);
- equal to 1, the ideal case, in which what is shown is what the page contains;
- greater than 1, in which case the page shows extra text information, signaling that it has images containing text which, in most cases, is not considered when ranking. The larger the value, the more extra text there is.

The working procedure used to evaluate the metric involves the following three steps and corresponding types of tools:
1st. Use a spider to extract the text information within a web page and determine the TIES value required in formula (1). The spider we built is based on the theory in [4], the libraries available at [6], and our own functions to clean up the extracted text;
2nd. Use a snapshot application that can be called from within a robot body to take a snapshot of the page involved in step one and save it in an image format;
3rd. Apply an OCR tool (here ReadIRIS Pro 11) to the image saved at the previous step and obtain the recognized text required to determine ATIOCR in (1).

A. Determining the textual information contained by graphic elements (TIG) metric

The procedure used to determine the textual information contained by graphic elements (denoted TIG) within a web page is:
1st. Use a spider to extract the graphic elements (images, pictures, shapes, etc.) together with their positional coordinates, and recompose a working web page of the same size as the original that contains only those graphic elements, positioned at their proper coordinates;
2nd. Use a snapshot application that can be called from within a robot body to take a snapshot of the working page built in step one and save it in an image format accepted as input by the OCR tool;
3rd. Apply an OCR tool to the image saved at the previous step and obtain the recognized text required to determine the TIG value.

B. Determining the quantity of textual information shown to the user (QTISU) metric

$$\mathrm{QTISU} = \mathrm{ATIOCR} - \mathrm{TIG} \qquad (2)$$

C. Determining the text information shown to the user (TISU) from tags metric

$$\mathrm{TISU} = \frac{\mathrm{QTISU}}{\mathrm{TIES}} \times 100 \qquad (3)$$

- TISU = 100: what is shown equals what the spider extracted (no hidden information is used);
- TISU < 100: the percentage of the textual information contained in tags that is shown to the user; the lower the value, the more textual information is hidden.

D. The percent of textual information revealed by graphic elements to the user (TIRGU) metric

$$\mathrm{TIRGU} = \frac{\mathrm{TIG}}{\mathrm{ATIOCR}} \times 100 \qquad (4)$$

- TIRGU = 100: the entire text information shown to the user is contained only in graphic elements;
- TIRGU < 100: the percentage of the textual information shown to the user that is revealed by graphic elements; the lower the value, the more of the shown text comes from tags.

5 Conclusions

References

[1] Jorge Cardoso (ed.), Semantic Web Services: Theory, Tools and Applications, IGI Global, 2007. Books24x7. <http://common.books24x7.com/book/id_20775/book.asp>
[2] Vasile Avram, "Effective Amount of Text Information (EATI) in a Web Page – A Proposal for a New Metric and Method to Determine", Proceedings of the 9th International Conference on Informatics in Economy, May 2009, Editura Economică, ISBN 978-606-505-172-2, pp. 163-168
[3] Jerri L. Ledford, SEO: Search Engine Optimization Bible, Wiley Publishing, 2008
[4] Google, Hidden text and links, Webmaster Tools, www.google.com
[5] Michael Schrenk, Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL, No Starch Press, 2007. Books24x7. <http://common.books24x7.com/book/id_22218/book.asp>
[6] http://www.sourceforge.org – Open Source PHP libraries for robots development
[7] P.J. Deitel, H.M. Deitel, Internet and World Wide Web How to Program, fourth edition, Prentice Hall, 2008, pages 160-190
[8] World Wide Web Consortium, The Specification of Standards for HTML, XHTML, CSS, XML: http://www.w3.org
[9] Vasile Avram, Internet Technologies for Business: Documents and Websites – structure and description languages, http://www.avrams.ro/lecture-notes.htm
[10] Yahoo! Search Content Quality Guidelines, www.yahoo.com
[11] SEO tools – Search Engine marketing, www.seologic.com
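As a worked illustration of formulas (2) to (4), the following minimal sketch chains the three metrics. It assumes ATIOCR, TIG and TIES have already been measured in a common unit (for example word counts, a unit the slides do not fix). The function names and input values are hypothetical and chosen only for illustration.

```python
def qtisu(ati_ocr: float, tig: float) -> float:
    """Formula (2): quantity of shown text that does not come from graphic elements."""
    return ati_ocr - tig


def tisu(ati_ocr: float, tig: float, ties: float) -> float:
    """Formula (3): percent of the spider-extracted text that is shown to the user."""
    if ties == 0:
        raise ValueError("TIES is zero: the spider extracted no text")
    return 100.0 * qtisu(ati_ocr, tig) / ties


def tirgu(ati_ocr: float, tig: float) -> float:
    """Formula (4): percent of the shown text that is revealed by graphic elements."""
    if ati_ocr == 0:
        raise ValueError("ATIOCR is zero: the OCR recognized no text")
    return 100.0 * tig / ati_ocr


# Made-up measurements: 120 words recognized by OCR on the full snapshot,
# 30 of them inside graphic elements, 150 words extracted by the spider.
ati_ocr, tig, ties = 120, 30, 150
print(qtisu(ati_ocr, tig))       # 90
print(tisu(ati_ocr, tig, ties))  # 60.0
print(tirgu(ati_ocr, tig))       # 25.0
```

With these made-up values, 60 percent of the text found in tags is shown to the user, and a quarter of the shown text comes from graphic elements.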
