Chapter 7: Web Content Mining
DSCI 4520/5240
Dr. Nick Evangelopoulos

Introduction
- Web content mining is the mining, extraction, and integration of useful data, information, and knowledge from Web page contents:
  - textual
  - audio
  - video
  - still images
  - metadata
  - hyperlinks
- Problems with Web data:
  - Distributed data
  - Large volume
  - Unstructured data
  - Redundant data
  - Quality of data
  - High percentage of volatile data
  - Varied data
- Two approaches to Web content mining:
  - Agent-based: software agents perform the content mining
  - Database-oriented: view the Web data as belonging to a database

Web Crawler
- A computer program that navigates the hypertext structure of the Web
- Crawlers are used to ease the formation of the indexes used by search engines
- The page(s) that the crawler begins with are called the seed URLs
- Builds a new index by visiting a number of pages and then replaces the current index
- Known as a periodic crawler because it is activated periodically
- Another type is the focused crawler:
  - Generally recommended because of the large size of the Web
  - Visits pages related to topics of interest
  - If a page is not pertinent, the entire set of possible pages below it is pruned
- Crawling process:
  - Begin with a group of URLs
    - Submitted by users
    - Common URLs
  - Breadth-first or depth-first traversal
  - Extract more URLs
  - Numerous crawlers
    - Problem of redundancy
    - Partition the Web: one robot per partition

Focused Crawler
- The focused crawler structure consists of two major parts:
  - The distiller
  - The hypertext classifier
- The pages that the crawler visits are selected from a priority-based structure, ordered by the priorities that the classifier and the distiller assign to pages
- Sample documents are identified and classified based on a hierarchical classification tree
- These documents are used as the seed documents to begin the focused crawling

Context Graph
- Work on focused crawling proposed the use of context graphs, which in turn led to the context focused crawler (CFC)
- The CFC performs crawling in two steps:
  - Context graphs and classifiers are constructed using a set of seed documents as a training set
  - Crawling is performed using the classifiers to guide it

Context Graph (figure)

Implementation of a Web Crawler
- Wget is a free GNU utility that makes it possible to retrieve Web documents
- Wget supports Internet protocols:
  - HTTP (Hypertext Transfer Protocol)
  - FTP (File Transfer Protocol)
- It can recursively traverse the structure of HTML documents and FTP directory trees

Commonly Used Options for Wget

Methods for Crawl Class

Crawl Class
- Figure 7.7: Code from the main method of the Crawl class (suitable for Java programmers)

The readContent Method of Crawl Class
- Figure 7.8: Code from the readContent method of the Crawl class (suitable for Java programmers)

Code for Extracting Links from Crawl Class
- Figure 7.9
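
The crawling process from the Web Crawler slides (begin with seed URLs, traverse breadth-first, extract more URLs, avoid redundant visits) can be sketched roughly as follows. This is a minimal illustration, not the Crawl class from the figures: the class name `BreadthFirstCrawlSketch`, the `crawl` method, and the in-memory `web` map that stands in for real page fetching are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the "Web" is a small in-memory map from a URL to the
// URLs it links to, so the example runs without any network access.
class BreadthFirstCrawlSketch {

    /** Visits pages breadth-first from the seed URLs, up to maxPages pages. */
    static List<String> crawl(Map<String, List<String>> web, List<String> seeds, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(); // FIFO queue gives breadth-first order
        Set<String> seen = new HashSet<>();          // prevents redundant visits
        List<String> visited = new ArrayList<>();

        for (String seed : seeds) {
            if (seen.add(seed)) {
                frontier.addLast(seed);
            }
        }
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.removeFirst();
            visited.add(url);
            // "Extract more URLs": here a map lookup replaces fetching and parsing.
            for (String link : web.getOrDefault(url, List.of())) {
                if (seen.add(link)) {
                    frontier.addLast(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("d"),
            "c", List.of("d", "a"),
            "d", List.of());
        System.out.println(crawl(web, List.of("a"), 10)); // prints [a, b, c, d]
    }
}
```

Replacing `addLast` with `addFirst` when enqueuing links would turn the queue into a stack and give a depth-first traversal instead.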
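
The focused crawler's priority-based page selection and its pruning of non-pertinent pages might look something like the sketch below. It rests on several assumptions: a simple word count stands in for the hypertext classifier, the distiller is not modeled at all, and the names `FocusedCrawlSketch`, `Page`, `Candidate`, and `relevance` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

class FocusedCrawlSketch {

    /** Hypothetical page model: the page text plus its outgoing links. */
    static final class Page {
        final String text;
        final List<String> links;
        Page(String text, List<String> links) { this.text = text; this.links = links; }
    }

    /** Frontier entry: a URL and the priority assigned to it. */
    static final class Candidate {
        final String url;
        final double priority;
        Candidate(String url, double priority) { this.url = url; this.priority = priority; }
    }

    /** Stand-in for the classifier: counts occurrences of the (lowercase) topic word. */
    static double relevance(String text, String topic) {
        double score = 0;
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.equals(topic)) {
                score++;
            }
        }
        return score;
    }

    /** Returns the pertinent pages found, visiting high-priority URLs first. */
    static List<String> crawl(Map<String, Page> web, List<String> seeds, String topic) {
        PriorityQueue<Candidate> frontier = new PriorityQueue<>(
            Comparator.comparingDouble((Candidate c) -> c.priority).reversed());
        Set<String> seen = new HashSet<>();
        List<String> relevant = new ArrayList<>();

        for (String seed : seeds) {
            if (seen.add(seed)) {
                frontier.add(new Candidate(seed, Double.MAX_VALUE)); // seeds go first
            }
        }
        while (!frontier.isEmpty()) {
            Candidate next = frontier.poll();
            Page page = web.get(next.url);
            if (page == null) {
                continue; // unknown URL in this toy Web
            }
            double score = relevance(page.text, topic);
            if (score == 0) {
                continue; // not pertinent: the pages below it are pruned
            }
            relevant.add(next.url);
            for (String link : page.links) {
                if (seen.add(link)) {
                    // Assumption: a link inherits the score of the page it was found on.
                    frontier.add(new Candidate(link, score));
                }
            }
        }
        return relevant;
    }

    public static void main(String[] args) {
        Map<String, Page> web = Map.of(
            "seed", new Page("Mining the Web: text mining", List.of("sports", "crawler")),
            "sports", new Page("Football scores", List.of("hidden")),
            "crawler", new Page("Web mining with crawlers", List.of()),
            "hidden", new Page("Mining data", List.of()));
        System.out.println(crawl(web, List.of("seed"), "mining")); // prints [seed, crawler]
    }
}
```

In the usage example, `hidden` is on topic but is never visited, because the only path to it runs through the non-pertinent `sports` page. This is the trade-off that pruning accepts.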
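
The link-extraction step of a crawler can be illustrated with a small regular-expression sketch. This is not the code of Figure 7.9, which is not reproduced here. The class `LinkExtractorSketch`, its `extractLinks` method, and the `example.com` URLs are assumptions for illustration, and a regular expression only approximates HTML parsing.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractorSketch {

    // Matches the quoted href value of an <a ...> tag, ignoring case.
    private static final Pattern HREF = Pattern.compile(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns absolute http/https links found in html, resolved against baseUrl. */
    static List<String> extractLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            try {
                // Relative hrefs such as "b.html" are resolved against the page URL.
                URI resolved = base.resolve(matcher.group(1).trim());
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Malformed href: skip it rather than abort the crawl.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/docs/a.html\">A</a> "
            + "<A HREF='b.html'>B</A> "
            + "<a href=\"mailto:someone@example.com\">mail</a> "
            + "<a class=\"x\" href=\"http://other.example/c\">C</a></p>";
        for (String link : extractLinks(html, "http://example.com/dir/index.html")) {
            System.out.println(link);
        }
    }
}
```

The `mailto:` link is dropped because only `http` and `https` targets are useful to a crawler that retrieves Web documents.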
