Survey							
                            
		                
		                * Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Chapter 7 Web Content Mining DSCI 4520/5240 Dr. Nick Evangelopoulos Xxxxxxxx Ihr Logo Introduction  Web content mining is the mining, extraction and integration of useful data, information and knowledge from Web page contents. - textual - audio - video - still images - metadata - hyperlinks Your Logo Introduction  Problems with the web data  Distributed data  Large volume  Unstructured data  Redundant data  Quality of data  Extreme percentage volatile data  Varied data Your Logo Introduction  Two approaches of web-content mining:  agent-based software agents perform the content mining  database oriented view the Web data as belonging to a database Your Logo Web Crawler  A computer program that navigates the hypertext structure of the web  Crawlers are used to ease the formation of indexes used by search engines  The page(s) that the crawler begins with are called the seed URLs.  Builds an index visiting number of pages and then replaces the current index  Known as a periodic crawler because it is activated periodically Your Logo Web Crawler  Another type is a Focused Crawler  Generally recommended for use due to large size of the Web  Visits pages related to topics of interest  If a page is not pertinent, the entire set of possible pages below it is pruned Your Logo Web Crawler  Crawling process  Begin with group of URLs  Submitted by users  Common URLs  Breath-first or depth-first  Extract more URLs  Numerous crawlers  Problem of redundancy  Web partition  robot per partition Your Logo Focused Crawler  The focused crawler structure consists of two major parts:  The distiller  The hypertext classifier Your Logo Focused Crawler  The pages that the crawler visits are selected using a priority-based structure managed by the priority associated with pages by the classifier and the distiller Your Logo Focused Crawler  Sample documents are identified and classified based on a hierarchical classification tree  Documents are used as the seed documents to begin the focused crawling Your Logo Context Graph  Focused crawling has proposed the use of context graphs, which in turn created the context focused crawler (CFC)  The CFC performs crawling in two steps:  Context graphs and classifiers are constructed using a set of seed documents as a training set  Crawling is performed using the classifiers to guide it Your Logo Content Graph Your Logo Implementation of a Web Crawler  Wget is a free GNU utility that makes it possible to retrieve web documents  Wget supports Internet protocols  HTTP (Hyper Text Transfer Protocol)  FTP (File Transfer Protocol)  Recursively browse through the structure of HTML documents and FTP directory trees Your Logo Commonly Used Options for Wget Your Logo Methods for Crawl Class Your Logo Crawl class Figure 7.7 Code from the main of Crawl class (Suitable for Java programmers) Your Logo The readContent Method of Crawl Class  Your Logo Figure 7.8 Code from the readContent method of Crawl class (Suitable for Java programmers) Code for Extracting Links from Crawl Class Figure 7.9 Your Logo Thank you for your attention Your Logo