Web Mining
The World Wide Web is a huge, global information repository for
•News, advertisements, consumer information, financial
management, education, government, e-commerce
•Hyperlink information
•Access and usage information
•Already contains billions of pages
•Millions of pages are added every day
The WWW provides rich sources for data mining.
Challenges
•Too huge for effective data warehousing and data mining
•Too complex and heterogeneous: no standards and structure.
The Web holds unstructured data; there is no structure or
schema to the Web, and there are no standard retrieval methods.
Web data can be classified into five classes:
•Content of actual Web pages – text and graphics data
•Intra-page structure – the HTML or XML code for the page
•Inter-page structure – the linkage or navigation structure
between Web pages
•Usage data – how Web pages are accessed by visitors
(dynamic data)
•User profiles – demographic or registration information
obtained about users, e.g., the information stored in cookies
Depending on the type of data that is mined, web mining can
be divided into several classes.
Web mining taxonomy:
•Web Content Mining
– Web Page Content Mining
– Search Result Mining
•Web Structure Mining
•Web Usage Mining
– General Access Pattern Tracking
– Customized Usage Tracking
Web page content mining is the traditional searching of web
pages via their content.
Search result mining is built on top of search engines and uses
their results as the data to be mined.
Web structure mining uses the organization of pages on the web.
Web usage mining uses the logs of web access.
It can be general or targeted at specific usage or users.
Web Content Mining
Traditional search engines are keyword based.
Improvements can be made by using techniques such as concept
hierarchies and synonyms, user profiles, and analysis of the
links between pages.
Queries may use expressions of keywords (a small matching
sketch follows below),
e.g., car and repair shop, tea or coffee, DBMS but not Oracle.
Queries and retrieval should consider synonyms, e.g.,
repair and maintenance.
Major difficulties of the model:
Synonymy: a keyword T does not appear anywhere in the
document, even though the document is closely related to
T, e.g., data mining vs. knowledge discovery.
Polysemy: the same keyword may mean different things
in different contexts, e.g., mining.
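As a concrete illustration of such keyword expressions, here is a minimal Python sketch; the toy documents, the `matches` helper, and its parameters are assumptions for illustration only, not part of the notes.

```python
# A minimal sketch of boolean keyword retrieval over a toy document set.
docs = {
    1: "car repair shop open on weekends",
    2: "tea and coffee shop near the station",
    3: "Oracle DBMS tuning and administration",
    4: "open source DBMS comparison",
}

def tokens(text):
    return set(text.lower().split())

def matches(doc, must=(), any_of=(), must_not=()):
    """True if doc contains all 'must' terms, at least one 'any_of'
    term (when given), and none of the 'must_not' terms."""
    t = tokens(doc)
    return (all(w in t for w in must)
            and (not any_of or any(w in t for w in any_of))
            and not any(w in t for w in must_not))

# "car AND repair shop"
print([d for d, text in docs.items() if matches(text, must=("car", "repair", "shop"))])
# "tea OR coffee"
print([d for d, text in docs.items() if matches(text, any_of=("tea", "coffee"))])
# "DBMS but NOT Oracle"
print([d for d, text in docs.items() if matches(text, must=("dbms",), must_not=("oracle",))])
```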
Basic content mining is text mining.
The degree of relevance is based on the nearness of the keywords,
the relative frequency of the keywords, etc.
Basic techniques (a term-frequency sketch follows this list):
Stop list
A set of words that are deemed “irrelevant”, even though
they may appear frequently,
e.g., a, the, of, for, with, etc.
Stop lists may vary when the document set varies.
Word stem
Several words are small syntactic variants of each other
since they share a common word stem,
e.g., drug, drugs, drugged.
Term frequency table
Each entry freq_table(i, j) = number of occurrences of the
word t_i in document d_j. Usually the ratio, rather than the
absolute number of occurrences, is used.
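A minimal sketch of these techniques, assuming a toy document set, a small stop list, and a crude suffix-stripping stemmer (a real system would use something like the Porter stemmer); all names and data below are illustrative assumptions.

```python
from collections import Counter

STOP_LIST = {"a", "the", "of", "for", "with", "and", "to", "in"}

def stem(word):
    # Toy stemmer: map small syntactic variants to a common stem,
    # e.g. drug, drugs, drugged -> drug.
    for suffix in ("ged", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_frequencies(text):
    # Drop stop words, stem the rest, then count occurrences.
    words = [stem(w) for w in text.lower().split() if w not in STOP_LIST]
    counts = Counter(words)
    total = sum(counts.values())
    # Ratio of occurrences rather than the absolute count, as in the notes.
    return {term: n / total for term, n in counts.items()}

docs = {
    "d1": "the drug was tested with a new drug trial",
    "d2": "mining of data for web mining applications",
}
freq_table = {doc_id: term_frequencies(text) for doc_id, text in docs.items()}
print(freq_table["d1"])
```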
Intelligent search agents (software systems) use other
techniques besides keyword search.
Concept hierarchies – a concept hierarchy defines a sequence of
mappings from a set of low-level concepts to higher-level, more
general concepts,
e.g., India → {Maharashtra, Punjab, Karnataka};
Maharashtra → {Mumbai, Pune}.
Synonyms – Bombay can be used instead of Mumbai.
XML schema – Web documents written in HTML are not structured.
The Web is slowly adopting XML (eXtensible Markup Language),
which gives documents structure so that mining can be efficient.
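The concept hierarchy and synonym ideas can be sketched as a simple query-expansion step; the `hierarchy` and `synonyms` dictionaries below merely encode the toy example above and are not part of any real system.

```python
hierarchy = {
    "India": ["Maharashtra", "Punjab", "Karnataka"],
    "Maharashtra": ["Mumbai", "Pune"],
}
synonyms = {"Bombay": "Mumbai"}

def expand(term):
    """Return the term plus all lower-level concepts beneath it."""
    term = synonyms.get(term, term)          # normalise synonyms first
    result = [term]
    for child in hierarchy.get(term, []):
        result.extend(expand(child))
    return result

print(expand("Maharashtra"))   # ['Maharashtra', 'Mumbai', 'Pune']
print(expand("Bombay"))        # ['Mumbai']
print(expand("India"))         # all states and their cities
```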
Crawlers
Crawlers are used to facilitate the creation of the indices used
by search engines.
A robot, spider, or crawler is a program that traverses the
hypertext structure of the web. The page or set of pages
that the crawler starts with is referred to as the seed URL.
All discovered links are saved in a queue, and the pages they
point to are then visited and searched for information.
The collected information is saved in indices for the users of
the associated search engines.
Crawlers are activated periodically.
Periodic crawlers usually replace the entire index, while
incremental crawlers selectively search the web and update the
index incrementally.
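A minimal breadth-first crawler sketch along these lines, using only the Python standard library; it omits robots.txt handling, politeness delays, and robust error handling, and the seed URL is a placeholder.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href targets of <a> tags from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    queue, visited, index = deque([seed_url]), set(), {}
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                       # skip unreachable pages
        index[url] = html                  # saved for the search engine's indexer
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))   # enqueue discovered links
    return index

# index = crawl("https://example.com/")   # seed URL is a placeholder
```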
Focused crawlers visit pages related to topics of interest.
If a page is not relevant, its links are not followed.
The focused crawler architecture consists of three primary
components:
•A hypertext classifier that associates a relevance score with
each document w.r.t. the crawl topic. Additionally, a resource
rating estimates whether it will be beneficial to follow the
links leading out of the page.
•A distiller that determines hub pages. A hub is a Web page
that provides a collection of links to many relevant pages. Hub
pages facilitate search even though they themselves do not
contain much information.
•The crawler itself, which performs crawling via a
priority-based structure governed by the classifier and the
distiller (see the sketch after this list).
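A minimal sketch of that priority-based crawl loop, where `classify` and `extract_links` are stand-ins for the hypertext classifier and a link extractor; both are assumed callables supplied by the caller, not real components.

```python
import heapq

def focused_crawl(seed_urls, classify, extract_links, threshold=0.5, max_pages=100):
    # Max-heap via negated scores: the most relevant page is crawled first.
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    visited, relevant = set(), []
    while frontier and len(visited) < max_pages:
        neg_score, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        score = classify(url)            # relevance w.r.t. the crawl topic
        if score < threshold:
            continue                     # irrelevant page: do not follow its links
        relevant.append((url, score))
        for link in extract_links(url):
            heapq.heappush(frontier, (-score, link))
    return relevant

# Example (with stand-in callables):
# relevant = focused_crawl(["https://example.com/"],
#                          classify=lambda url: 0.9,
#                          extract_links=lambda url: [])
```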
Harvest rate – the fraction of crawled pages that are relevant;
a high harvest rate (high precision) is the performance
objective of the focused crawler.
A manually created classification tree is used as a seed for
focused crawling.
Harvest System
The Harvest system is based on the use of caching, indexing, and
crawling.
It uses gatherers and brokers.
A gatherer obtains information for indexing from an Internet
service provider, while a broker provides the index and query
interface.
Indices in Harvest are topic-specific.
Harvest gatherers use the Essence system to assist in collecting
data.
Essence classifies documents by creating a semantic index.
Semantic indexing generates different types of information for
different types of files;
it then creates indices on this information based on keywords.
Essence uses file extensions to help classify file types.
Virtual Web View
The problem of handling the large amount of data on the Web
can be solved by creating a multiple layered database (MLDB).
Each layer of this database is more generalized than the layer
beneath it.
MLDB
• Layer0: the Web itself
• Layer1: the Web page descriptor layer
– Contains descriptive information for pages on the Web
– An abstraction of Layer0: substantially smaller but still rich
enough to preserve most of the interesting, general
information
– Organized into dozens of semistructured classes
• document, person, organization, ads, directory, sales,
software, game, stocks, library_catalog,
geographic_data, scientific_data, etc.
• Layer2 and up: various Web directory services
constructed on top of Layer1
– provide multidimensional, application-specific services
– The WordNet semantic network is used: groups of English
synonyms linked by lexical and semantic relationships
• Benefits:
– Multi-dimensional Web info summary analysis
– Approximate and intelligent query answering
– High-level Web query answering (WebSQL, WebML), with
primitives such as covers, covered by, like, close to
– Web content and structure mining
– Observing the dynamics/evolution of the Web
• Is it realistic to construct such a meta-Web?
– Benefits even if it is partially constructed
– Benefits may justify the cost of tool development,
standardization and partial restructuring
Web Structure Mining
Web structure can be used for finding authoritative Web pages,
i.e., retrieving pages that are not only relevant but also of
high quality, or authoritative on the topic.
Hyperlinks can be used to infer the notion of authority.
The Web consists not only of pages, but also of
hyperlinks pointing from one page to another.
These hyperlinks contain an enormous amount of latent
human annotation.
A hyperlink pointing to another Web page can be considered the
author's endorsement of that page.
PageRank is used to measure the importance of a page and for
prioritizing pages. It is calculated based on the
number of pages that point to it.
It is based on the number of backlinks to a page. Weights are
used to give more importance to backlinks coming from
important pages.
Given a page p, let Bp be the set of pages pointing to p and Fp
the set of links out of p. Then

PR(p) = c * Σ_{q ∈ Bp} PR(q) / |Fq|

where c is a value between 0 and 1 used for normalization.
A problem called rank sink exists when A points to B and B
points back to A; it is solved by adding an additional term to
the formula (a PageRank sketch follows below).
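A minimal sketch of the iterative PageRank computation implied by the formula above, with a (1 - c)/n term added as the extra term that avoids rank sinks; the toy graph is invented for illustration.

```python
def pagerank(out_links, c=0.85, iterations=50):
    pages = list(out_links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # start with uniform ranks
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Sum PR(q)/|Fq| over the backlinks q of p (pages linking to p).
            rank = sum(pr[q] / len(out_links[q])
                       for q in pages if p in out_links[q])
            # The (1 - c)/n term is the extra term that resolves rank sinks.
            new_pr[p] = (1 - c) / n + c * rank
        pr = new_pr
    return pr

# Toy graph: A and B point to each other, C points to A.
graph = {"A": {"B"}, "B": {"A"}, "C": {"A"}}
print(pagerank(graph))
```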
Problems with the Web linkage structure:
•Not every hyperlink represents an endorsement;
some exist only for navigation or for paid advertisements.
•One authority will seldom have its Web page point
to its rival authorities in the same field.
•Authoritative pages are seldom particularly descriptive.
Hub: a set of Web pages that provides collections of
links to authorities.
HITS (Hyperlink-Induced Topic Search)
•Explore interactions between hubs and authoritative pages.
•Use an index-based search engine to form the root set.
Many of these pages are presumably relevant to the
search topic, and some of them should contain links to most
of the prominent authorities.
•Expand the root set into a base set:
include all of the pages that the root-set pages link to,
and all of the pages that link to a page in the root set, up
to a designated size cutoff.
•Apply weight propagation (see the sketch after this list):
an iterative process that determines numerical estimates
of hub and authority weights.
•Output a short list of the pages with large hub
weights, and the pages with large authority weights for
the given search topic
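A minimal sketch of the HITS weight-propagation step on a toy base set; the graph is invented, and the normalization step simply keeps the weights bounded between iterations.

```python
import math

def hits(out_links, iterations=20):
    pages = list(out_links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of pages pointing to it.
        auth = {p: sum(hub[q] for q in pages if p in out_links[q]) for p in pages}
        # Hub weight: sum of authority weights of the pages it points to.
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        # Normalise so the weights do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

base_set = {"p1": {"p2", "p3"}, "p2": {"p3"}, "p3": set(), "p4": {"p3"}}
hub, auth = hits(base_set)
print(sorted(auth, key=auth.get, reverse=True))   # pages ranked by authority
```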
Systems based on the HITS algorithm
•Clever, Google: achieve better quality search results
than those generated by term-index engines such as
AltaVista and those created by human ontologists
such as Yahoo!
Difficulties from ignoring textual contexts
•Drifting: when hubs contain multiple topics
•Topic hijacking: when many pages from a single Web
site point to the same single popular site
Web Usage Mining
Mining Web log records to discover user access patterns
of Web pages
Web logs provide rich information about Web dynamics
Typical Web log entry includes the URL requested, the
IP address from which the request originated, and a
timestamp
Logs can be examined from either a client or a server
perspective.
From the server perspective, the log is evaluated for
information about the site where the service resides; this can
be used to improve the design of the site.
From the client perspective, the log is evaluated for the
client's sequence of clicks (clickstream data); this can be used
to improve performance by prefetching and caching pages
(a log-parsing sketch follows below).
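A minimal sketch of extracting per-client clickstreams from Web log entries; the log lines use a simplified, made-up `ip timestamp url` format rather than a real server log format.

```python
from collections import defaultdict

raw_log = """\
10.0.0.1 2017-05-05T10:00:01 /index.html
10.0.0.1 2017-05-05T10:00:09 /products.html
10.0.0.2 2017-05-05T10:00:12 /index.html
10.0.0.1 2017-05-05T10:00:30 /cart.html
"""

clickstreams = defaultdict(list)          # IP address -> ordered list of clicks
for line in raw_log.splitlines():
    ip, timestamp, url = line.split()
    clickstreams[ip].append((timestamp, url))

for ip, clicks in clickstreams.items():
    print(ip, [url for _, url in sorted(clicks)])
```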
Applications
•Target potential customers for electronic commerce.
•Enhance the quality and delivery of Internet information
services to the end user:
frequent access behaviour can be used for caching and to
improve the quality of the web site.
•Improve Web server system performance.
•Identify potential prime advertisement locations to
improve sales and advertisement.
•Develop user profiles to aid in personalization.
Web Usage Mining activities
• Construct a multidimensional view of the Weblog database
– Perform multidimensional OLAP analysis to find the top
N users, top N accessed Web pages, the most frequently
accessed time periods, etc. (see the sketch after this list)
• Perform data mining on Weblog records
– Find association patterns, sequential patterns, and
trends of Web accessing
– May need additional information, e.g., user browsing
sequences of the Web pages in the Web server buffer
• Conduct studies to
– Analyze system performance and improve system design
by Web caching, Web page prefetching, and Web page
swapping
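A minimal sketch of this top-N style analysis, reusing the made-up `(ip, timestamp, url)` records from the earlier log-parsing sketch; a real system would run such queries on a data cube or database rather than in-memory lists.

```python
from collections import Counter

records = [
    ("10.0.0.1", "2017-05-05T10:00:01", "/index.html"),
    ("10.0.0.1", "2017-05-05T10:00:09", "/products.html"),
    ("10.0.0.2", "2017-05-05T10:00:12", "/index.html"),
    ("10.0.0.1", "2017-05-05T10:00:30", "/cart.html"),
]

top_users = Counter(ip for ip, _, _ in records).most_common(3)
top_pages = Counter(url for _, _, url in records).most_common(3)
top_hours = Counter(ts[:13] for _, ts, _ in records).most_common(3)  # hour buckets

print("Top users:", top_users)
print("Top pages:", top_pages)
print("Busiest hours:", top_hours)
```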