WEB USAGE MINING
NEGATIVE ASSOCIATION
S. Vignesh, 1HK07CS073, HKBKCE

Web Mining
• Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services
• Discovering useful information from the World Wide Web and its usage patterns
• My definition: using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web

Web Mining
• Data mining techniques
   • Association rules
   • Sequential patterns
   • Classification
   • Clustering
   • Outlier discovery
• Applications to the Web
   • E-commerce
   • Information retrieval (search)
   • Network management

Examples of Discovered Patterns
• Association rules
   • 98% of AOL users also have E-trade accounts
• Classification
   • People with age less than 40 and salary > 40k trade on-line
• Clustering
   • Users A and B access similar URLs
• Outlier detection
   • User A spends more than twice the average amount of time surfing on the Web

Web Mining
• The WWW is a huge, widely distributed, global information service centre for
   • Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
   • Hyper-link information
   • Access and usage information
• The WWW provides rich sources of data for data mining

Why Mine the Web?
• Enormous wealth of information on the Web
   • Financial information (e.g. stock quotes)
   • Book/CD/Video stores (e.g. Amazon)
   • Restaurant information (e.g. Zagats)
   • Car prices (e.g. Carpoint)
• Lots of data on user access patterns
   • Web logs contain the sequence of URLs accessed by users
• Possible to mine interesting nuggets of information
   • People who ski also travel frequently to Europe
   • Tech stocks have corrections in the summer and rally from November until February

Why is Web Mining Different?
• The Web is a huge collection of documents except for
   • Hyper-link information
   • Access and usage information
• The Web is very dynamic
   • New pages are constantly being generated
• Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to
   • Exploit hyper-links and access patterns
   • Be incremental

Web Mining Applications
• E-commerce (Infrastructure)
   • Generate user profiles
   • Targeted advertising
   • Fraud
   • Similar image retrieval
• Information retrieval (Search) on the Web
   • Automated generation of topic hierarchies
   • Web knowledge bases
   • Extraction of schema for XML documents
• Network Management
   • Performance management
   • Fault management

User Profiling
• Important for improving customization
   • Provide users with pages, advertisements of interest
   • Example profiles: on-line trader, on-line shopper
• Generate user profiles based on their access patterns
   • Cluster users based on frequently accessed URLs
   • Use a classifier to generate a profile for each cluster
• Engage technologies
   • Tracks web traffic to create anonymous user profiles of Web surfers
   • Has profiles for more than 35 million anonymous users

Internet Advertising
• Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and E-commerce sites
• Plenty of startups doing internet advertising
   • Doubleclick, AdForce, Flycast, AdKnowledge
• Internet advertising is probably the “hottest” web mining application today

Internet Advertising
• Scheme 1:
   • Manually associate a set of ads with each user profile
   • For each user, display an ad from the set based on profile
• Scheme 2:
   • Automate association between ads and users
   • Use ad click information to cluster users (each user is associated with the set of ads that he/she clicked on)
   • For each cluster, find the ads that occur most frequently in the cluster; these become the ads for the set of users in the cluster

Internet Advertising
• Use collaborative filtering (e.g.
Likeminds, Firefly)
• Each user Ui has a rating for a subset of ads (based on click information, time spent, items bought etc.)
   • Rij: rating of user Ui for ad Aj
• Problem: compute user Ui’s rating for an unrated ad Aj
[Figure: ratings over ads A1, A2, A3, with an unknown rating marked “?”]

Internet Advertising
• Key idea: user Ui’s rating for ad Aj is set to Rkj, where Uk is the user whose rating of ads is most similar to Ui’s
• User Ui’s rating for an ad Aj that has not been previously displayed to Ui is computed as follows:
   • Consider a user Uk who has rated ad Aj
   • Compute Dik, the distance between Ui and Uk’s ratings on common ads
   • Ui’s rating for ad Aj = Rkj (Uk is the user with smallest Dik)
• Display to Ui the ad Aj with highest computed rating

Fraud
• With the growing popularity of E-commerce, systems to detect and prevent fraud on the Web become important
• Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought)
• If the buying pattern changes significantly, then signal fraud
• HNC software uses domain knowledge and neural networks for credit card fraud detection

Retrieval of Similar Images
• Given: a set of images
• Find:
   • All images similar to a given image
   • All pairs of similar images
• Sample applications:
   • Medical diagnosis
   • Weather prediction
   • Web search engine for images
   • E-commerce

Retrieval of Similar Images
• QBIC, Virage, Photobook
• Compute a feature signature for each image
   • QBIC uses color histograms
   • WBIIS, WALRUS use wavelets
• Use a spatial index to retrieve the database image whose signature is closest to the query’s signature
• WALRUS decomposes an image into regions
   • A single signature is stored for each region
   • Two images are considered to be similar if they have enough similar region pairs
[Figure: query image and images retrieved by WALRUS]

Problems with Web Search Today
• Today’s search engines are plagued by problems:
   • the abundance problem (99% of info of no interest to 99% of people)
   • limited coverage of the Web (internet sources hidden behind search interfaces)
      • Largest crawlers
cover < 18% of all web pages
   • limited query interface based on keyword-oriented search
   • limited customization to individual users

Problems with Web Search Today
• Today’s search engines are plagued by problems:
   • Web is highly dynamic
      • Lots of pages added, removed, and updated every day
   • Very high dimensionality

Improve Search By Adding Structure to the Web
• Use Web directories (or topic hierarchies)
   • Provide a hierarchical classification of documents (e.g., Yahoo!)
[Figure: topic hierarchy under the Yahoo home page, with nodes Recreation, Travel, Sports, Business, Companies, Science, Finance, News, Jobs]
• Searches performed in the context of a topic restrict the search to only a subset of web pages related to the topic

Automatic Creation of Web Directories
• In the Clever project, hyper-links between Web pages are taken into account when categorizing them
   • Use a Bayesian classifier
   • Exploit knowledge of the classes of immediate neighbors of the document to be classified
   • Show that simply taking text from neighbors and using standard document classifiers to classify the page does not work
• Inktomi’s Directory Engine uses “Concept Induction” to automatically categorize millions of documents

Network Management
• Objective: to deliver content to users quickly and reliably
   • Traffic management
   • Fault management
[Figure: service provider network with routers and servers]

Why is Traffic Management Important?
• While annual bandwidth demand is increasing ten-fold on average, annual bandwidth supply is rising only by a factor of three
• The result is frequent congestion at servers and on network links
   • During a major event (e.g., Princess Diana’s death), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world
   • Olympic sites during the games
   • NASA sites close to launch and landing of shuttles

Traffic Management
• Key ideas
   • Dynamically replicate/cache content at multiple sites within the network and closer to the user
   • Multiple paths between any pair of sites
   • Route user requests to the server closest to the user or the least loaded server
   • Use the path with the least congested network links
• Akamai, Inktomi

Traffic Management
[Figure: service provider network with routers and servers, showing a request, a congested link, and a congested server]

Traffic Management
• Need to mine network and Web traffic to determine
   • What content to replicate?
   • Which servers should store replicas?
   • Which server to route a user request to?
   • What path to use to route packets?
• Network design issues
   • Where to place servers?
   • Where to place routers?
   • Which routers should be connected by links?
• One can use association rules and sequential pattern mining algorithms to cache/prefetch replicas at servers

Fault Management
• Fault management involves
   • Quickly identifying failed/congested servers and links in the network
   • Re-routing user requests and packets to avoid congested/down servers and links
• Need to analyze alarm and traffic data to carry out root cause analysis of faults
• Bayesian classifiers can be used to predict the root cause given a set of alarms

Web Mining Issues
• Size
   • Grows at about 1 million pages a day
   • Google indexes 9 billion documents
• Number of web sites
   • Netcraft survey says 72 million sites (http://news.netcraft.com/archives/web_server_survey.html)
• Diverse types of data
   • Images
   • Text
   • Audio/video
   • XML
   • HTML

Number of Active Sites
[Figure: Total Sites Across All Domains, August 1995 - October 2007]

Systems Issues
• Web data sets can be very large
   • Tens to hundreds of terabytes
• Cannot mine on a single server!
   • Need large farms of servers
• How to organize hardware/software to mine multi-terabyte data sets
   • Without breaking the bank!

Different Data Formats
• Structured data
• Unstructured data
• OLE DB offers some solutions!

Web Data
• Web pages
• Intra-page structures
• Inter-page structures
• Usage data
• Supplemental data
   • Profiles
   • Registration information
   • Cookies

Web Usage Mining
• Pages contain information
• Links are ‘roads’
• How do people navigate the Internet?
   • Web Usage Mining (clickstream analysis)
• Information on navigation paths is available in log files
• Logs can be mined from a client or a server perspective

Website Usage Analysis
• Why analyze Website usage?
• Knowledge about how visitors use a Website could
   • Provide guidelines for web site reorganization; help prevent disorientation
   • Help designers place important information where the visitors look for it
   • Support pre-fetching and caching of web pages
   • Provide an adaptive Website (personalization)
• Questions which could be answered
   • What are the differences in usage and access patterns among users?
   • What user behaviors change over time?
   • How do usage patterns change with quality of service (slow/fast)?
   • What is the distribution of network traffic over time?

Website Usage Analysis
• Analog: Web Log File Analyser
• Gives basic statistics such as
   • number of hits
   • average hits per time period
   • what are the popular pages in your site
   • who is visiting your site
   • what keywords are users searching for to get to you
   • what is being downloaded
• http://www.analog.cx/

Web Usage Mining Process
[Figure: Web usage mining process]

Web Mining Outline
• Goal: examine the use of data mining on the World Wide Web
   • Web Content Mining
   • Web Structure Mining
   • Web Usage Mining

Web Mining Taxonomy
[Figure: Web mining taxonomy, modified from [zai01]]

Web Content Mining
• Examine the contents of web pages as well as the results of web searching
• Can be thought of as extending the work performed by basic search engines
• Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to the users
• Web Content Mining is the process of extracting knowledge from web contents

Semi-structured Data
• Content is, in general, semi-structured
• Example:
   • Title
   • Author
   • Publication_Date
   • Length
   • Category
   • Abstract
   • Content

Structuring Textual Data
• Many methods are designed to analyze structured data
• If we can represent documents by a set of attributes, we will be able to use existing data mining methods
• How to represent a document?
   • Vector based representation (referred to as “bag of words” as it is invariant to permutations)
   • Use statistics to add a numerical dimension to unstructured text

Document Representation
• A document representation aims to capture what the document is about
• One possible approach:
   • Each entry describes a document
   • Attributes describe whether or not a term appears in the document

Document Representation
• Another approach:
   • Each entry describes a document
   • Attributes represent the frequency with which a term appears in the document

Document Representation
• Stop word removal: many words are not informative and thus irrelevant for document representation
   • the, and, a, an, is, of, that, …
• Stemming: reducing words to their root form (reduces dimensionality)
   • A document may contain several occurrences of words like fish, fishes, fisher, and fishers, but would not be retrieved by a query with the keyword “fishing”
   • Different words that share the same word stem should be represented by that stem (“fish”) instead of the actual word
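
The document representation steps above (stop word removal, stemming, then binary or frequency attributes per term) can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used in the slides: the function names (`toy_stem`, `tokenize`, `represent`), the two sample documents, and the suffix list are assumptions made for the example, and the suffix stripping is a toy stand-in for a real stemmer such as Porter's algorithm.

```python
import re
from collections import Counter

# Stop words taken from the example list in the slides (not a complete list).
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}

# Toy suffix list, an assumption for illustration only; a real system
# would use an established stemmer such as Porter's algorithm.
SUFFIXES = ("ing", "ers", "er", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split on non-letters, drop stop words, stem the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(w) for w in words if w not in STOP_WORDS]


def represent(documents):
    """Return (vocabulary, binary vectors, frequency vectors)."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # Frequency approach: how often each term appears in the document.
    frequency = [[c[term] for term in vocabulary] for c in counts]
    # Binary approach: whether or not each term appears in the document.
    binary = [[1 if n > 0 else 0 for n in row] for row in frequency]
    return vocabulary, binary, frequency


# Hypothetical sample documents, chosen to exercise the "fish" example.
docs = [
    "The fisher and the fishers catch fish",
    "Fishing is the sport of fishes",
]
vocabulary, binary, frequency = represent(docs)
print(vocabulary)  # ['catch', 'fish', 'sport']
print(binary)      # [[1, 1, 0], [0, 1, 1]]
print(frequency)   # [[1, 3, 0], [0, 2, 1]]
```

With stemming applied, fisher, fishers, fishes, and fishing all collapse to the single attribute "fish", so a query for “fishing” would match both sample documents, and the vocabulary shrinks from seven distinct non-stop words to three.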