Web Mining & Pattern Discovery

Paul George
CSE 8331
Submitted to: Dr. M. Dunham
Department of Computer Science, Southern Methodist University, Texas, USA

TABLE OF CONTENTS

ABSTRACT
INTRODUCTION
Web Content Mining
    Agent-Based Approach
        Intelligent Search Agents
        Information Filtering/Categorization
        Personalized Web Agents
    Database Approach
        Multilevel Databases
        Web Query Systems
    Overview of Crawlers
    Personalization
Web Usage Mining
PATTERN DISCOVERY
    Preprocessing Tasks
        Data Cleaning
        Transaction Identification
    Discovery Techniques
        Path Analysis
        Association Rules
        Clustering and Classification
        Sequential Patterns
WEB USAGE MINING ARCHITECTURE – WEBMINER
    Tools
BENEFITS
APPLICATIONS
RESEARCH AREAS
CONCLUSION
BIBLIOGRAPHY
ABSTRACT

Web mining has been, and continues to be, the focus of much research. It can be classified into three categories: Web content mining, the process of discovering information from the various resources available on the WWW; Web structure mining, the process of discovering knowledge from the interconnections of hypertext documents; and Web usage mining, the process of pattern discovery and analysis. In this paper, Web content mining approaches are briefly discussed, and Web usage mining is concentrated upon in the area of pattern discovery. I have also listed some applications where Web mining can be applied, and briefly covered WEBMINER, a system used for Web usage mining. I conclude by listing the issues and research directions for the topics covered.

INTRODUCTION

First, let us define data mining. Data mining can be defined as finding hidden information in a database; it has also been called exploratory data analysis, data-driven discovery, and deductive learning [1]. Web mining is data mining applied to the World Wide Web, i.e., the mining of data related to the World Wide Web. Web data can be any of the following:

- Web page content
- HTML/XML code
- Automatically generated data stored in server access logs, referrer logs, and cookies residing on the client
- E-commerce transaction data

When data mining is applied to the Web, it can perform several functions:

- Information extraction: acquiring and interpreting useful information from Web data, which may lead to business intelligence.
- Resource discovery: discovering the locations of unfamiliar files on the network, which may or may not be relevant.
- Generalization: discovering information patterns [4].

Fig. 1 shows the Web mining classification. As the figure shows, Web mining is divided into Web content mining, Web structure mining, and Web usage mining. In the following sections I discuss the types of approaches in Web content mining, cover Web structure mining very briefly, and concentrate on pattern discovery in Web usage mining.

[Fig. 1: Web mining classification [3]. Web mining divides into Web content mining (with agent-based and database approaches), Web structure mining, and Web usage mining.]

Web Content Mining

Web content mining can be thought of as a process of information or resource discovery from resources on the WWW [5]. There are two approaches in Web content mining: the agent-based approach and the database approach.

Agent-Based Approach

The agent-based approach involves artificial intelligence systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize Web-based information [7]. Agent-based Web mining systems can be classified into three categories:

Intelligent Search Agents

Many intelligent search agents, e.g., Harvest [10] and FAQ-Finder [11], search for relevant information by using domain-specific characteristics to interpret and organize the discovered information [7].
The agents mentioned above rely either on documents that contain domain-specific information or on "hard coded models of the information sources to retrieve and interpret documents" [7].

Information Filtering/Categorization

Agents in this category use information retrieval (IR) techniques and the characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize documents. For example, HyPursuit [12] uses semantic information embedded in link structures, as well as document content, to create cluster hierarchies of hypertext documents and to structure an information space [7].

Personalized Web Agents

Another category of Web agents comprises those that learn user preferences, or obtain them from the WWW, and discover Web sources that correspond to these preferences, and possibly to those of other individuals with similar interests. An example of such an agent is WebWatcher [7].

Database Approach

The database approach focuses on integrating and organizing the heterogeneous and semi-structured data on the Web into more structured, higher-level collections of resources, such as relational or object-oriented databases. Analysis can then be performed on these resources using standard querying mechanisms [8].

Multilevel Databases

This idea has been proposed as a way to organize Web-based information. The database is organized into hierarchies: the lowest level contains primitive semi-structured information stored in various Web repositories, such as hypertext documents, while each higher level contains a generalization of the level below, organized into structured collections such as relational or object-oriented databases. For example, the ARANEUS system [13] extracts relevant information from hypertext documents and integrates it into higher-level derived Web hypertexts, which are generalizations of the notion of database views [8].

Web Query Systems

These systems attempt to utilize standard database query languages such as SQL, and even natural language processing, and combine them with the types of queries used in World Wide Web searches. For example, W3QL [14] combines structure queries, based on the organization of hypertext documents, with content queries, based on information retrieval techniques [8].

Basic content mining is a type of text mining. It extends the work performed by basic search engines, and it can also improve a traditional search engine through techniques such as analyzing links between pages and building user profiles.

Overview of Crawlers

A crawler (also called a spider or robot) is a program that traverses the hypertext structure of the Web [1]. Crawlers are used in Web content mining by search engines. Before explaining how a crawler works, let me define a seed URL.

Seed URLs: the page or pages that the crawler starts with.

How it works: starting from the seed URLs, the crawler records and saves all links in a queue. These new pages are in turn searched and their links saved. As these robots search the Web, they collect information about each page, such as keywords, and store it in indices for users of the associated search engine [1]. A minimal sketch of this loop is given after the list below.

Types of crawlers:

- Periodic crawler: activated periodically; it may visit a certain number of pages and then stop, build an index, and replace the existing index.
- Incremental crawler: updates the index, as opposed to replacing it.
- Focused crawler: visits only pages related to topics of interest.
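To make the traversal loop above concrete, here is a minimal sketch of a breadth-first crawler, assuming pages are fetched over HTTP and links are extracted with Python's standard-library HTML parser. The seed URLs, page budget, and toy keyword index are illustrative assumptions, not the design of any particular search engine.

    # Minimal breadth-first crawler sketch (illustrative assumptions only).
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href targets of anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links += [v for k, v in attrs if k == "href" and v]

    def crawl(seed_urls, max_pages=100):
        # Starting from the seed URLs, save discovered links in a queue,
        # visit them in turn, and record a toy per-page keyword statistic.
        queue, visited, index = deque(seed_urls), set(), {}
        while queue and len(visited) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue  # unreachable page: skip it
            visited.add(url)
            index[url] = html.lower().count("mining")  # stand-in for keyword indexing
            parser = LinkExtractor()
            parser.feed(html)
            queue.extend(u for u in (urljoin(url, l) for l in parser.links)
                         if u.startswith("http"))
        return index

In these terms, a periodic crawler would simply rerun crawl() and replace the index, an incremental crawler would merge the new results into the existing index, and a focused crawler would enqueue only links judged relevant to its topic.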
To give an idea of where crawlers fit in a search engine, Fig. 2 shows the general architecture of a search engine [3].

[Fig. 2: Architecture of a search engine [3].]

Personalization

Personalization is another area in Web content mining. Here, the contents of a Web page are modified to better fit the desires of the user. The goal is to entice a current customer to purchase something he or she may not have thought about purchasing. Techniques used include cookies, databases, and more complex methods [1].

Personalization can be viewed as a type of [1]:

- Classification: a user's behavior is determined from the behavior of the class to which the user belongs.
- Clustering: a user's behavior is determined from the users to whom he or she is judged to be similar.
- Prediction: used to predict what the user would like to see.

There are three types of Web page personalization [1]:

- Manual techniques: work from user registration preferences, or through rules based on profiles that classify individuals.
- Collaborative filtering: personalization is achieved by recommending information that has previously been given high ratings by similar users.
- Content-based filtering: retrieves pages based on the similarity between the pages and the user's profile.

Web Usage Mining

This is the process of discovering user access patterns (user habits) from data automatically collected in daily access logs [4]. Web usage mining performs mining on Web usage data, or Web logs. A Web log is a listing of page reference data [1]; it is at times referred to as clickstream data, as each entry corresponds to a mouse click. Logs can be examined from two perspectives [1]:

- Server: the log provides information about the Web site itself and can be used to improve its design.
- Client: by evaluating a client's sequence of clicks, information about that user (or group of users) is detected. This information can be used to perform prefetching and caching of pages, making pages load quickly.

Analyzing logs can help organizations market their products effectively through promotional campaigns and advertisements. It can also provide important information on how to restructure a Web site for an effective organizational presence [7]. Web usage mining can be divided into two parts: pattern discovery, which I cover in the following section, and pattern analysis [7].

Web log / Web usage mining can be used to [3]:

- Enhance server performance
- Improve Web-site navigation
- Target customers for electronic commerce
- Identify potential prime advertisement locations

PATTERN DISCOVERY

Having discussed the various uses of Web usage mining, there are several data pre-processing issues that must be taken care of before the mining algorithms can be run. These include developing a model of access log data; developing techniques to clean and filter the raw data to eliminate outliers and/or irrelevant items; grouping individual page accesses into transactions; integrating various data sources, such as user registration information; and specializing generic data mining algorithms to take advantage of the specific nature of access log data [7].

Preprocessing Tasks

Preprocessing can be divided into two activities: data cleaning and transaction identification.

Data Cleaning

This is one of the most important activities, as the presence of irrelevant data during the analysis phase may yield wrong results. Elimination of irrelevant items can be reasonably accomplished by checking the suffix of the URL name. For instance, all log entries with filename suffixes such as gif, jpeg, GIF, JPEG, jpg, JPG, and map can be removed [7].
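As a small illustration of this suffix-checking heuristic, the sketch below drops image and map requests from a log. The Common Log Format field layout assumed here is a widespread convention, not necessarily the format of any particular server.

    # Sketch: filter irrelevant entries out of an access log by URL suffix.
    IRRELEVANT_SUFFIXES = (".gif", ".jpeg", ".jpg", ".map")

    def clean_log(lines):
        kept = []
        for line in lines:
            # Common Log Format: host ident user [date] "METHOD /url HTTP/1.0" status bytes
            try:
                url = line.split('"')[1].split()[1]
            except IndexError:
                continue  # malformed entry: drop it
            # lower() makes the check cover GIF/JPEG/JPG spellings too
            if not url.lower().endswith(IRRELEVANT_SUFFIXES):
                kept.append(line)
        return kept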
It may happen that a page is accessed many times but listed in the log only once, because of caching. Recent methods that make use of cookies, cache busting, and explicit user registration try to overcome this problem, but as detailed in [17], all of these methods have drawbacks: cookies can be deleted by the user; cache busting defeats the speed advantage that caching was created to provide, and can be disabled; and user registration is voluntary, with users often providing false information.

User identification is another problem. Machine names cannot be used, as more than one user may be behind a single name (here, "name" refers to the IP address of the machine); for example, a company network may have a single IP address through which several people access the same URL. [18] discusses an algorithm that checks whether each incoming request is reachable from the pages already visited: if a page is requested that is not directly linked to the previous pages, multiple users are assumed to exist on the same machine [7]. Other methods of identifying users, based on the IP address, browser agent, and temporal information, are discussed in [17].

Transaction Identification

The raw server log can be thought of in two ways: as a single transaction of many page references, or as a set of many transactions, each of a single page reference. Sequences of page references must be grouped into logical units representing Web transactions or user sessions. A user session is all of the page references made by a user during a single visit to a site. A transaction differs from a user session in that the size of a transaction can range from a single page reference to all of the page references in a user session, depending on the criteria used to identify transactions [7].

The goal of transaction identification is to create meaningful clusters of references for each user. The task can thus range from dividing a large transaction into multiple smaller ones to merging small transactions into larger ones [19].

Two types of transactions are defined in [19]. The first type is navigation-content, where each transaction consists of a single content reference and all of the navigation references in the traversal path leading to that content reference; these transactions can be used to mine for path traversal patterns. The second type is content-only, which consists of all of the content references for a given user session; these transactions can be used to discover associations between the content pages of a site [19]. A given page reference is classified as either navigational or content based on the time spent on the page. This kind of page typing is further delineated in [18], where various page types, such as index pages and personal home pages, are used in the discovery of user patterns [19].

A general transaction definition is given in [19]; it assumes that each transaction is made up of references from only one user, so forming it is a merge activity. Once the log has been arranged according to the general transaction definition, one of three divide modules can be applied to it: reference length, maximal forward reference, or time window. The first two methods identify transactions based on a user browsing behavior model, which uses the concepts of navigation and content pages. According to this model there are two ways of defining a transaction: one is to define a transaction as all of the navigation references up to and including each content reference for a given user; the second is to define a transaction as all of the content references. The time window module is used as a benchmark against which the other two algorithms are compared [19].

The reference length module is based on the assumption that the amount of time a user spends on a page correlates with whether the page should be classified as a navigation or content page for that user [19].

In the maximal forward reference module, each transaction is defined to be the set of pages in the path from the first page in the log up to the page before a backward reference is made; a new transaction is started when the next forward reference is made. "A forward reference is defined to be a page not already in the set of pages for the current transaction"; similarly, a backward reference is a page that is already contained in the set of pages for the current transaction [19].

The "time window transaction identification module divides the log for a user into time intervals no larger than a specified parameter". The three modules are discussed in detail in [19].
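To illustrate, here is a minimal sketch of the maximal forward reference idea, assuming one user's log has already been reduced to an ordered list of page references; the real algorithm in [20] and WEBMINER's module differ in detail.

    def maximal_forward_references(pages):
        # Emit a transaction each time a maximal path is abandoned by a
        # backward reference (a page already in the current path).
        transactions, path, extending = [], [], False
        for page in pages:
            if page in path:                         # backward reference
                if extending:                        # path was maximal: emit it
                    transactions.append(list(path))
                path = path[: path.index(page) + 1]  # back up to that page
                extending = False
            else:                                    # forward reference
                path.append(page)
                extending = True
        if extending:
            transactions.append(path)
        return transactions

    # e.g. ["A", "B", "C", "B", "D"] -> [["A", "B", "C"], ["A", "B", "D"]]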
Discovery Techniques

Once the log has been divided into transactions, techniques can be applied to perform pattern mining. Some of them are listed below:

- Path Analysis
- Association Rules
- Clustering and Classification
- Sequential Patterns

Path Analysis

Path analysis performs pattern discovery using graphs. A graph represents some relation defined on Web pages; for a Web site, a graph may be drawn with Web pages as nodes and the hypertext links between pages as directed edges [7]. The navigation-content transactions of [19], the maximal forward reference transactions of [20], or the user sessions of [18], some of which were introduced above, can all be used for path analysis. Path analysis can be used, for example, to determine the most frequently visited paths in a Web site [7].

[18] discusses approaches to extracting structure from the Web that can be used to form higher-level abstractions, reducing the complexity and increasing the richness of an overview. The approach is based on graph-structure representations of similarity, or strength of association, and relations. Three types of graph structures are discussed. The first represents the link topology of a Web locality, using arcs labeled with unit strengths to connect nodes, where a node represents a page and an arc represents the hyperlink connecting two pages. (A Web locality may be thought of as a "complex abstract space in which we have arranged Web pages of different functional categories or types".) The second type represents inter-page text content similarity, labeling the arcs connecting nodes with the computed text similarities between the corresponding Web pages. The third type represents the flow of users through the locality, labeling the arcs between two nodes with the number of users that go from one page to the other [18].

Examples of information that can be discovered through path analysis are [7]:

- 75% of clients who accessed /company/product2 did so by starting at /company and proceeding through /company/new, /company/products, and /company/product1. This suggests that there is useful information on the product2 page, but that the path to the page is not clear.
- 65% of clients left the site after four or fewer page references. This suggests that more information should be provided in the first four pages from the points of entry into the site.
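As a toy illustration of the third (user flow) graph type, the sketch below labels each arc with the number of times users traversed the corresponding link, from which heavily travelled paths can be read off. The transactions are assumed to be ordered page lists such as those produced by the modules above.

    from collections import Counter

    def link_flow(transactions):
        # Label each hyperlink (arc) with its traversal count.
        flow = Counter()
        for path in transactions:
            for src, dst in zip(path, path[1:]):
                flow[(src, dst)] += 1
        return flow

    # link_flow(transactions).most_common(5) lists the most travelled links.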
Association Rules

The problem of mining association rules is to find all rules that satisfy a user-specified minimum support and minimum confidence. An association rule is of the form X => Y, where X and Y are sets of items and X ∩ Y is empty. The support of an item (or set of items), denoted s, is the percentage of transactions in which that item (or set of items) occurs. The confidence of an association rule X => Y, denoted α, is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X. Confidence measures the strength of the rule, while support measures how often it occurs in the database. Applications include cross-marketing, catalog design, store layout, and customer segmentation based on buying patterns. Note that in Web mining we mostly work with transaction logs rather than databases [1][23].

Association rule discovery techniques are generally applied to databases of transactions, where each transaction consists of a set of items. The difficulty is to discover all associations and correlations among data items where the presence of one set of items in a transaction implies, with a certain degree of confidence, the presence of other items. In the context of Web mining, this problem amounts to discovering the correlations among references to the various files available on the server by a given client; each transaction comprises the set of URLs accessed by a client in one visit to the server [7]. Since such transaction databases/logs usually contain extremely large amounts of data, current association rule discovery techniques prune the search space according to the support of the items under consideration [7].

The problem of discovering generalized association rules can be divided into three parts [23]:

1. Find all sets of items (itemsets) whose support is greater than the user-specified minimum support. Itemsets with minimum support are called frequent itemsets.
2. Use the frequent itemsets to generate the desired rules.
3. Prune all uninteresting rules from this set.
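A toy brute-force sketch of steps 1 and 2 follows, computing support and confidence exactly as defined above. It enumerates every candidate itemset, which is fine only for illustration; real miners such as Apriori [21] prune this search space.

    from itertools import combinations

    def frequent_itemsets(transactions, min_support):
        # Step 1: every itemset whose support (fraction of transactions
        # containing it) meets the user-specified minimum.
        items = sorted({i for t in transactions for i in t})
        n = len(transactions)
        frequent = {}
        for size in range(1, len(items) + 1):
            for cand in combinations(items, size):
                s = sum(1 for t in transactions if set(cand) <= t) / n
                if s >= min_support:
                    frequent[cand] = s
        return frequent

    def rules(frequent, transactions, min_conf):
        # Step 2: from each frequent itemset, emit X => Y whose confidence,
        # support(X u Y) / support(X), meets the threshold.
        n, out = len(transactions), []
        for itemset, s in frequent.items():
            for k in range(1, len(itemset)):
                for x in combinations(itemset, k):
                    sx = sum(1 for t in transactions if set(x) <= t) / n
                    if sx and s / sx >= min_conf:
                        out.append((x, tuple(set(itemset) - set(x)), s, s / sx))
        return out

    # Toy log: each visit is the set of URLs one client accessed.
    visits = [{"page1", "page2"}, {"page1", "page2", "offer"}, {"page1"}]
    print(rules(frequent_itemsets(visits, 0.5), visits, 0.6))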
There are several algorithms for finding association rules; I review some that have been found to improve on the "basic" algorithms (e.g., the Apriori algorithm). In most cases, taxonomies (is-a hierarchies) over the items are available. Earlier work on association rules took into account only leaf-level items in the taxonomy, ignoring the taxonomies themselves. Finding rules at different levels of the taxonomy can be valuable because [23]:

- Rules at lower levels may not have minimum support.
- Taxonomies can be used to prune uninteresting or redundant rules.

[23] presents algorithms that run 2 to 5 times faster than the basic algorithm (the Apriori algorithm [21]), and up to 100 times faster on a real-life dataset. The "Cumulate" algorithm [23] was developed by adding optimizations to the basic algorithm; the name indicates that all itemsets of a certain size are counted in one pass. The "Stratify" algorithm [23] assumes the optimizations of Cumulate are applicable; two variants of Stratify, "Estimate" and "EstMerge", are also described. Basic, Cumulate, and EstMerge are compared in [23] on the basis of minimum support, number of transactions, fan-out, number of roots, number of items, number of levels, and depth ratio. Details of these algorithms are studied in [23].

[26] discusses the FDM (Fast Distributed Mining) algorithm for generating association rules in a distributed environment. FDM generates a small number of candidate sets and substantially reduces the number of messages that must be passed when mining association rules. [26] discusses three versions of FDM, as well as the "Count Distribution" algorithm for parallel mining; more information on, and comparisons of, these algorithms can be found in [26]. [21] presents the algorithms "Apriori" and "AprioriTid", together with a hybrid of both, "AprioriHybrid", that combines their best features.

Using association rule discovery techniques, we can find correlations such as the following [7]:

- 55% of clients who accessed the Web page with URL page1.html also accessed page2.html; or
- 25% of clients who accessed /company/offer.html placed an online order in /company/offer/product1.

Discovery of such rules can help organizations engaged in electronic commerce develop effective marketing strategies. In addition, association rules discovered from WWW access logs can give an indication of how best to organize the organization's Web space [7].

Clustering and Classification

Discovering classification rules [29] allows one to "develop a profile of items belonging to a particular group according to their common attributes". This "profile can then be used to classify new data items that are added to the database". In Web mining, classification techniques allow one to develop a profile of the clients who access particular server files, based on demographic information available on those clients or on their access patterns. For example, classification applied to WWW access logs may lead to the discovery of relationships such as the following [7]:

- clients from state or government agencies who visit the site tend to be interested in the page /company/products/product1.html; or
- 40% of clients who placed an online order for product2 from the Web page /company/products/product2.html were in the 20-25 age group and lived on the West Coast.

In some cases, information about clients can be captured by servers from the client browsers, including information available on the client side in history files, cookie files, etc. Other methods of obtaining profile and demographic information on clients include user registration and online survey forms [7].

[29] discusses a fast, scalable classifier for data mining. It observes that most classification algorithms are designed only for memory-resident data, which limits their suitability for mining large data sets, and presents SLIQ, which stands for "Supervised Learning in Quest" (Quest is a data mining project at the IBM Almaden Research Center). SLIQ is a decision tree classifier that can handle both numeric and categorical data; it is studied in detail in [29].

"Clustering analysis allows one to group together clients or data items that have similar characteristics". Clustering of client information or data items from Web transaction logs can assist in marketing strategies, both online and offline, such as automated return mail to clients falling within a certain cluster, or dynamically changing a particular site for a client on a return visit, based on the past classification of that client [7].
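As a minimal sketch of grouping clients with similar access patterns, assume each session is the set of pages a client visited; the greedy single-pass scheme below is only an illustration, not one of the clustering algorithms used in practice.

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    def cluster_sessions(sessions, threshold=0.5):
        # Greedy "leader" clustering: a session joins the first cluster
        # whose representative is similar enough, else starts a new one.
        clusters = []  # list of (representative, members) pairs
        for s in sessions:
            for rep, members in clusters:
                if jaccard(s, rep) >= threshold:
                    members.append(s)
                    break
            else:
                clusters.append((s, [s]))
        return clusters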
Sequential Patterns

Sequential pattern mining finds patterns in transaction logs/databases such that the presence of a set of items is followed by another item in the time-stamp-ordered transaction set. In Web server transaction logs, a visit by a client is recorded over a period of time; the "time stamp associated with a transaction in this case will be a time interval which is determined and attached to the transaction during the data cleaning or transaction identification processes". The discovery of sequential patterns in Web server access logs allows Web-based organizations to predict user visit patterns and helps them target advertising at specific users [7].

[27] discusses the problem of recognizing frequent episodes in the sequences of events occurring in transaction logs. An episode is defined as "a collection of events that occur within a given size in a partial order"; moreover, the individual page accesses must occur within a given time frame, which is specified by the user. Episodes can be classified as serial, parallel, or general: in a serial episode the events are totally ordered; in a parallel episode there need not be any particular order, but the events must still satisfy the time constraint; a general episode satisfies some partial order [27][1]. [27] presents an algorithm for finding all frequent episodes, and also studies how to recognize serial and parallel episodes.

[28] discusses three algorithms for mining sequential data: "AprioriSome", "AprioriAll", and "DynamicSome". It concentrates mostly on the problem of finding text sequences that match a given regular expression. Each algorithm can be divided into five basic phases: the sort phase, itemset phase, transformation phase, sequence phase, and maximal phase. These algorithms are closely studied and compared in [28].

By analyzing this information, a Web mining system can determine temporal relationships among data items, such as the following [7]:

- 50% of clients who visited /search/topics/ had done a search in Yahoo on the keyword "mining" within the past week.

An example of determining similar time sequences using sequential patterns is the need to find the common characteristics of all clients that visited a particular file within the time period [t1, t2], or to find the most accessed page of the week [7].
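To make the serial-episode notion concrete, here is a toy checker for whether an ordered episode occurs within a user-specified time window in a time-stamped log; the actual recognition algorithms are those of [27].

    def occurs_serially(log, episode, window):
        # log: list of (timestamp, page) pairs sorted by timestamp.
        # True if the pages of `episode` appear in order, all within
        # `window` seconds of the first one.
        for i, (t0, page) in enumerate(log):
            if page != episode[0]:
                continue
            pos, limit = 1, t0 + window
            if pos == len(episode):
                return True
            for t, p in log[i + 1:]:
                if t > limit:
                    break
                if p == episode[pos]:
                    pos += 1
                    if pos == len(episode):
                        return True
        return False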
I have also found two other techniques that can be applied during the pattern discovery phase: statistical analysis and dependency modeling. Both are studied in [30].

Following pattern discovery is pattern analysis, which I will not cover in this paper; it is studied in detail in [7].

WEB USAGE MINING ARCHITECTURE – WEBMINER

A general architecture for Web usage mining is presented in [7][9]; WEBMINER is a system that implements parts of this general architecture. The first part of the architecture is the domain-dependent processing of the data; the second part is the largely domain-independent application, which includes pattern discovery and analysis as part of the system's data mining engine. The overall architecture for the Web mining process is depicted in Figure 3 [7].

[Figure 3: A General Architecture for Web Usage Mining [7].]

To briefly explain the figure: data cleaning is the first step performed in the Web usage mining process, and we have discussed some of the techniques used to clean log data. "Currently, the WEBMINER system uses the simplistic method of checking filename suffixes. Some low level data integration tasks may also be performed at this stage, such as combining multiple logs, incorporating referrer logs, etc." [7][9]

After data cleaning, the log entries are grouped into clusters using one or a series of transaction identification modules; we have discussed a few techniques for separating data into transactions. The "WEBMINER system currently has reference length, maximal forward reference, and time window divide modules, and a time window merge module" [7].

"Access log data may not be the only source of data for the Web mining process. User registration data, for example, is playing an increasingly important role, particularly as more security and privacy conscious client-side applications restrict server access to a variety of information, such as the client user IDs". The data collected through user registration can then be integrated with the access log data. There are also known or discovered attributes of referenced pages that could be integrated into a higher-level database schema, such as page types, usage frequency, and link structures. "While WEBMINER currently does not incorporate user registration data, various data integration issues are being explored in the context of Web usage mining" [7][9].

"In WEBMINER, a simple Query mechanism has been implemented by adding some primitives to an SQL-like language". This allows the user to communicate his patterns of interest to the mining engine; the information in the query is used to reduce the scope, and thus the cost, of the mining process [7]. The development of a more general query mechanism, along with appropriate Web-based user interfaces and visualization techniques, is still under research [7].
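WEBMINER's actual primitives are given in [7]. Purely as a hypothetical illustration of how such a query reduces the scope of mining, a front end could filter the log before the discovery algorithms run; the field and function names below are invented for the example.

    def restrict(log, min_date=None, domain=None):
        # Keep only entries matching the user's query constraints, so the
        # mining algorithms see a smaller, cheaper-to-process log.
        out = []
        for entry in log:  # entry: dict with "date", "host", "url", ...
            if min_date and entry["date"] < min_date:
                continue
            if domain and not entry["host"].endswith(domain):
                continue
            out.append(entry)
        return out

    # e.g. mine rules only over .edu visitors seen since the start of 1997:
    # rules = mine_association_rules(restrict(log, "1997-01-01", ".edu"))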
Tools

Some of the newest Web data mining tools from Megaputer are WebAnalyst and XSellAnalyst. More details can be found in [31].

BENEFITS

Let us look at some of the benefits of Web mining [31]:

Match your available resources to visitor interests. Resources can be products you "sell, information fragments you distribute online, banner ads from your client advertisers, e-mail fragments from a mailing list, or anything else" distributed online. "Metadata of these resources are then stored in a database. WebAnalyst helps learn visitor interests by collecting and analyzing information generated by interactions with your website, such as clickstream data, search requests, and cookies. WebAnalyst can use the gleaned knowledge to rank your resources by their relevance to the user's interests. Servicing a user request for information, with the best matching resources, results in a higher visitor-to-customer conversion rate for your ebusiness". [31]

Increase the value of each visitor. By carrying out collaborative filtering, we can predict what kind of information a visitor may be interested in, and the products he or she might consider purchasing (a toy sketch of this idea appears at the end of this section). "These predictions are used to present the visitor with related products and resources", increasing the chances of a purchase. "This knowledge significantly increases the value of a customer for an e-business when used in individualized cross-selling and up-selling promotions, and thus increasing revenue." [31]

Improve the visitor's experience at the website. "A sound combination of data and text mining techniques can help determine user interests - early in the process of the visitor's interaction with the website. This allows the website to act interactively and proactively and deliver the most relevant customized resources to the visitor". In the world of the Internet, easy access to relevant information might make the difference between a profitable customer and a lost opportunity. "By increasing the customer's satisfaction, you reduce attrition and build brand loyalty". [31]

Perform targeted resource management. Since visitors differ in buying behavior, you may notice that some of them are your best potential customers, ready to click and buy, while others are prospecting for information while familiarizing themselves with your brand; these prospecting customers may become "very important and profitable customers" in the future. There is also a group of visitors who enjoy only free rides: "These folks will use promotional resources that you offer to the fullest extent, but will never purchase anything. All these visitors come through a single pipe to your website and are in a common queue for your website resources". It is to your advantage if you can tell each type of visitor apart. "Your website performance is limited and you might want to prioritize requests coming from your best prospects. If you are distributing promotional resources of high value, you might want to spend your promotional budget wisely by offering and delivering your promotional materials only to your best prospects - not to every Web surfer on the planet. WebAnalyst can work with load-balancing products to provide the best quality of service to your best customers". [31]

Collect information in new ways. "While for the majority of e-vendors the task of collecting data is just an intermediate step necessary for better targeting their marketing, for others this task might be the main motivation for creating a website itself". Traditional data collection methods such as promotions, surveys, and focus groups have many well-known problems, including high cost, poor response rates, and low accuracy. "Now imagine that you can offer your promotional items online through a content-rich website, where visitors can find useful information in addition to submitting their contact information and requesting the promotion. WebAnalyst can learn the visitor's preferences (at virtually no cost) based on the content that the user was browsing. Of course, WebAnalyst is designed to work hand-in-hand with your privacy management system, allowing you to collect valuable data while respecting the privacy of your visitors". [31]

Test the relevance of content and website architecture. Perhaps you would like to increase usability, or optimize your website for the eyes of your best prospects, by taking a close look at the website's content and architecture. "Log analyzers can help you visualize the most navigated paths through your website, averaged over all visitors. When optimizing your website structure, your main concern should be to improve experience of your most promising prospects, and not just everybody. Roughly 15% of your website visitors comprise really valuable prospects. The remaining 85% have little value to you other than sustaining the brand recognition traffic. Thus you have to segregate your least important prospects and subtract their contribution from the overall picture of the site navigation. What is left represents the real quality of your website. This is the picture that can help you really improve your bottom line". [31]
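As promised under "Increase the value of each visitor", here is a toy sketch of user-based collaborative filtering: score an item for the current visitor from the ratings of visitors with overlapping profiles. This only illustrates the idea; WebAnalyst's actual models are not described in [31], and the weighting scheme here is an assumption.

    def predict_interest(target, others, item):
        # target, others[i]: dicts mapping item -> rating.
        # Weighted average of similar visitors' ratings for `item`,
        # with profile overlap (Jaccard) as the similarity weight.
        num = den = 0.0
        for profile in others:
            shared = set(target) & set(profile)
            if item not in profile or not shared:
                continue
            sim = len(shared) / len(set(target) | set(profile))
            num += sim * profile[item]
            den += sim
        return num / den if den else None  # None: no similar rater found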
APPLICATIONS

Some of the applications of Web mining are shown below [31]:

E-tailers (includes B2B and B2C ventures): WebAnalyst is easily applied to any B2C or B2B e-tailing scheme. Any company that profits by selling goods or services via the Web may benefit from WebAnalyst's ability to find new cross-sell opportunities, enable comprehensive prospect profiling, and improve customer satisfaction. [31]

Advertising-based sites (includes entertainment sites, media portals, advertising providers): When your revenue is advertising-based, you know that blindly serving ads to visitors will not result in a large click-through rate. Instead, ads must be intelligently targeted to the user, providing the visitor with products and services that they are interested in. WebAnalyst's data collection services give customizable access to every bit of information that passes between the visitor and the server; its data mining modules analyze the data and match the visitor's profile with the ads they are most interested in; and its long-term data mining modules find new patterns and refine existing models to improve response rates even further. [31]

Information repositories (includes libraries, technical support sites, media sites, content providers): Information overload is a problem that grows larger every day. You would like to use your staff for content creation, yet you find yourself spending exponentially more time on indexing, summarization, and other metadata tasks. WebAnalyst's semantic text analysis capabilities can automate these tasks and create user navigation systems on the fly. [31]

Web integrators (includes Web development, consulting, and ASPs): If you are a Web integrator looking to provide Web intelligence solutions, WebAnalyst is a flexible, modular system designed for customization. [31]

Web content mining can also be used for discovering unexpected information from competitors' Web sites [32]. Finding unexpected information is useful in many applications; for example, it is useful for a company to find unexpected information about its competitors, e.g., unexpected services and products that they offer. It is very difficult for a human user to view each page to discover such information, so automated assistance is needed. [32] proposes a number of methods to help the user find various types of unexpected information from his or her competitors' Web sites; experimental results show that these techniques are both useful in practice and efficient.

Neural networks can also be used for Web content filtering [33]: for example, identifying pornographic Web sites so that access to them can be restricted. Methods such as PICS, URL blocking, term filtering, and intelligent content analysis are discussed in [33].

There are many applications, but only a few have been discussed above. Web-based applications which use Web mining include the following [34]:

- Business Intelligence
- Computational Societies and Markets
- Conversational Systems
- Customer Relationship Management (CRM)
- Direct Marketing
- Electronic Commerce and Electronic Business
- Electronic Library
- Information Markets
- Price Dynamics and Pricing Algorithms
- Measuring and Analyzing Web Merchandising
- Web-Based Decision Support Systems
- Web-Based Distributed Information Systems
- Web-Based EDI
- Web-Based Learning Systems
- Web Marketing
- Web Publishing

RESEARCH AREAS

The area of Web mining is still new; hence, there is still much room for improvement in the various techniques used in Web mining.
The techniques being applied to Web content mining draw heavily on work in information retrieval, databases, intelligent agents, etc. [7]. Some important areas of ongoing research are:

- Continuing to develop methods of clustering log entries into user transactions, including criteria such as the time differential among entries, the time spent on a page relative to the page size, and user profile information collected during user registration [19].
- Association rules do not consider the quantities of the items bought in a transaction, which are useful for some applications [21].
- Techniques must be developed to remove the uninteresting rules that are generated along with the interesting ones [23].
- Creating intelligent tools that can assist in the interpretation of mined knowledge, and developing an approach by which various logs can be integrated into a more comprehensive model [7].
- Privacy. "The issue revolves around the fact that most users want to maintain strict anonymity on the Web. On the other hand site administrators are interested in finding out the demographics of users as well as the usage statistics of different sections of their website. W3C has an ongoing initiative for Platform for Privacy Preferences (P3P). P3P provides a protocol which allows the site administrators to publish the privacy policies followed by a site in a machine readable format. When the user visits the site for the first time the browser reads the privacy policies followed by the site and then compares that with the security setting configured by the user. If policies are satisfactory the browser continues requesting pages from the site, otherwise a negotiation protocol is used to arrive at a setting which is acceptable to the user". [30]

Other research areas include Web semantic mining and Web farming.

CONCLUSION

In this paper I have discussed Web mining and its classifications, i.e., content mining and usage mining. The two approaches to content mining have been looked into, and I have given a small overview of crawlers and how they work. Web usage mining has been covered in greater depth than content mining: the four techniques, i.e., path analysis, association rules, classification and clustering, and sequential patterns, have been covered, and I have touched upon the architecture of a Web usage mining system, WEBMINER. Finally, I have listed the benefits of Web mining and its applications, followed by open issues in the topics covered and interesting research areas in this field.

BIBLIOGRAPHY

1. Margaret Dunham. Data Mining: Introductory and Advanced Topics (textbook).
2. Yan Wang. Web mining and knowledge discovery of usage patterns - A survey. http://db.uwaterloo.ca/~tozsu/courses/cs748t/surveys/wang-slides.pdf
3. Oleksandr Romanko, McMaster University. Web mining. http://www.cas.mcmaster.ca/~cs4tf3/romanko_slides.pdf
4. May Chau. Web mining technology and academic librarianship: Human-machine connections for the twenty-first century. http://www.firstmonday.dk/issues/issue4_6/chau/
5. Cooley, Bamshad and Jaideep, 1997, op. cit., at http://www-users.cs.umn.edu/~mobasher/webminer/survey/survey.html
6. Oren Etzioni. "The World Wide Web: Quagmire or Gold Mine," Communications of the ACM, volume 39, number 11 (November 1996), pp. 65-68.
7. R. Cooley, B. Mobasher, and J. Srivastava. Web Mining: Information and Pattern Discovery on the World Wide Web. 1997.
8. A. Joshi. Web mining. http://www.cs.umbc.edu/~ajoshi/web-mine/
9. Bamshad Mobasher. WEBMINER: A System for Pattern Discovery from World Wide Web Transactions. http://maya.cs.depaul.edu/~mobasher/Research-01.html
10. C. M. Brown, B. B. Danzig, D. Hardy, U. Manber, and M. F. Schwartz. The Harvest information discovery and access system. In Proc. 2nd International World Wide Web Conference, 1994.
11. K. Hammond, R. Burke, C. Martin, and S. Lytinen. FAQ-Finder: A case-based approach to knowledge navigation. In Working Notes of the AAAI Spring Symposium: Information Gathering from Heterogeneous, Distributed Environments. AAAI Press, 1995.
12. R. Weiss, B. Velez, M. A. Sheldon, C. Namprempre, P. Szilagyi, A. Duda, and D. K. Gifford. HyPursuit: A hierarchical network search engine that exploits content-link hypertext clustering. In Hypertext'96: The Seventh ACM Conference on Hypertext, 1996.
13. P. Atzeni, G. Mecca, and P. Merialdo. Semistructured and structured data in the Web: Going back and forth. In Proceedings of the Workshop on the Management of Semistructured Data (in conjunction with ACM SIGMOD), 1997.
14. D. Konopnicki and O. Shmueli. W3QS: A query system for the World Wide Web. In Proc. of the 21st VLDB Conference, pages 54-65, Zurich, 1995.
15. Hypertext and information retrieval and Web mining. http://www.cyberartsweb.org/cpace/ht/lanman/wm1.htm
16. S. K. Madria, S. S. Bhowmick, W. K. Ng, and E. P. Lim. Research issues in Web data mining. In Proceedings of Data Warehousing and Knowledge Discovery, First International Conference (DaWaK'99), pages 303-312, 1999.
17. J. Pitkow. In search of reliable usage data on the WWW. In Sixth International World Wide Web Conference, pages 451-463, Santa Clara, CA, 1997.
18. P. Pirolli, J. Pitkow, and R. Rao. Silk from a sow's ear: Extracting usable structures from the Web. In Proc. of the 1996 Conference on Human Factors in Computing Systems (CHI-96), Vancouver, British Columbia, Canada, 1996.
19. R. Cooley, B. Mobasher, and J. Srivastava. Grouping Web page references into transactions for mining World Wide Web browsing patterns. Technical Report TR 97-021, University of Minnesota, Dept. of Computer Science, Minneapolis, 1997.
20. M. S. Chen, J. S. Park, and P. S. Yu. Data mining for path traversal patterns in a Web environment. In Proceedings of the 16th International Conference on Distributed Computing Systems, pages 385-392, 1996.
21. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conference, pages 487-499, Santiago, Chile, 1994.
22. M. A. W. Houtsma and A. N. Swami. Set-oriented mining for association rules in relational databases. In Proc. of the 11th Int'l Conf. on Data Engineering, pages 25-33, Taipei, Taiwan, 1995.
23. R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. of the 21st VLDB Conference, pages 407-419, Zurich, Switzerland, 1995.
25. Benefits of Web mining. http://www.dmreview.com/editorial/dmdirect/012800_doherty.htm
26. David W. Cheung, Jiawei Han, Vincent T. Ng, Ada W. Fu, and Yongjian Fu. A Fast Distributed Algorithm for Mining Association Rules.
27. H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering frequent episodes in sequences. In Proc. of the First Int'l Conference on Knowledge Discovery and Data Mining, pages 210-215, Montreal, Quebec, 1995.
28. R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proc. of the Fifth Int'l Conference on Extending Database Technology, Avignon, France, 1996.
29. M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. of the Fifth Int'l Conference on Extending Database Technology, Avignon, France, 1996.
30. J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan. Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. SIGKDD Explorations, 1(2), 2000.
31. Benefits of Web data mining. http://www.megaputer.com/products/wa/benefits.php3
32. Bing Liu, Yiming Ma, and Philip S. Yu. Discovering Unexpected Information from Your Competitors' Web Sites. 2001.
33. Applications. http://neuron.et.ntust.edu.tw/homework/91/NN/91NNHomework2/Web%20mining/web_app.htm
34. Web applications. http://mail.cs.uiuc.edu/pipermail/colt/2000-October/000164.html