A Study on Different Types of Web Mining

Renu, Deepti Gaur
ITM University, Gurgaon, India
Email: [email protected], [email protected]

Proc. of Int. Conf. on Emerging Trends in Engineering and Technology
DOI: 03.AETS.2013.3.229
© Association of Computer Electronics and Electrical Engineers, 2013

Abstract— The enormous amount of information on the World Wide Web makes it an obvious candidate for data mining research. The application of data mining techniques to the World Wide Web is referred to as Web mining, a term that has been used in three distinct ways: Web content mining, Web usage mining, and Web structure mining. In this survey paper, we discuss these three types of Web mining, the approach of Web usage mining, and the basic association-rule algorithm called the Apriori algorithm.

Index Terms— Web mining, Web content mining, Web usage mining, Web structure mining

I. INTRODUCTION

Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services. It is used to discover useful information from the World Wide Web and its usage patterns. The subtasks [4] of Web mining are:
1. Resource finding: retrieving the intended documents and services on the Web.
2. Information selection/pre-processing: automatically extracting and pre-processing specific information from newly discovered Web resources.
3. Generalization: the discovery of general patterns at individual Web sites and across multiple sites.
4. Analysis: the validation and/or interpretation of the mined patterns.
5. Visualization: presenting the result of an interactive analysis in a visual, easy-to-understand fashion.

The three Web mining categories depend on which kind of data is to be mined: mining for information, mining the Web link structure, or mining the user navigation patterns; see Figure 1. Mining for information focuses on techniques for assisting a user in finding documents that meet a certain criterion; this is Web content mining, the discovery of useful information from Web contents, which include images, text, audio, video, etc. Mining the link structure aims at developing techniques that exploit the collective judgment of Web page quality available in the form of hyperlinks; this is Web structure mining, which tries to discover the model underlying the link structures of the Web, based on the topology of the hyperlinks with or without descriptions of the links. Mining for user navigation patterns focuses on techniques that study user behavior when navigating the Web; this is Web usage mining, the discovery of user access patterns from Web servers. Web usage data include data from Web server and proxy server logs, browser logs, access logs, registration data, user sessions or transactions, user profiles, user queries, bookmark data, mouse clicks and scrolls, cookies, and any other data produced by interaction.

A. Web Content Mining

Web content mining deals with discovering useful information or knowledge from the contents of Web pages. Web content is very rich, consisting of text, images, audio, video, etc., as well as metadata (data about data) and hyperlinks. The data may be unstructured (free text), structured (data from a database), semi-structured (HTML), or multimedia data (which has received less attention than text or hypertext), although much of the Web is unstructured. There are two approaches to Web content mining, explained as follows (see Figure 1).

Figure 1. Web Mining Categories
Agent-Based Approach: Agent-based Web mining systems can be categorized into three types:
1. Intelligent search agents: several intelligent Web agents have been developed that search for relevant information, using domain characteristics and user profiles to organize and interpret the discovered information.
2. Information filtering/categorization: a number of Web agents use various information retrieval techniques and the characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them.
3. Personalized Web agents: these agents learn user preferences and use them to discover Web information sources, drawing also on the preferences of other individuals with similar interests.

Database Approach: This approach focuses on techniques for organizing the semi-structured data on the Web into more structured collections of resources, and on using standard database querying mechanisms and data mining techniques to analyze them.
1. Multilevel databases: the lowest level of the database contains semi-structured information stored in various Web repositories (which store and manage large collections of data objects), such as hypertext documents. At the higher levels, metadata or generalizations are extracted from the lower levels and organized in structured collections, i.e. relational or object-oriented databases.
2. Web query systems: many Web-based systems and languages utilize standard database query languages such as SQL, structural information about Web documents, and even natural language processing for the queries used in World Wide Web searches.

B. Web Structure Mining

Web structure mining is the process of discovering structure information from the Web. It generates a structural summary of a Web page or Web site and discovers the link structure of the hyperlinks at the inter-document level. It uncovers the nature of the hierarchy of hyperlinks within a Web site and its structure, and discovers similarities between sites. The subtasks of Web structure mining are:
1. Finding information about a Web page: retrieving information about the relevance and quality of the Web page, and finding the authorities on the topic and the content.
2. Inference on hyperlinks: a Web page contains not only information but also hyperlinks, which carry a large amount of annotation. A hyperlink represents an endorsement, by the author, of the linked Web page.
3. Authorities and hubs: a hub is a page that links to many authorities, and an authority is a page with good content on the query topic that is pointed to by many hub pages, i.e. it is relevant and popular.

II. ALGORITHMS FOR WEB STRUCTURE MINING

A. HITS Algorithm

HITS (Hyperlink-Induced Topic Search) was proposed by Jon M. Kleinberg. The approach consists of two phases:
1. The query terms are used to collect a starting set of pages (around 200) from an index-based search engine; this is the root set. The base set is generated by augmenting the root set with all the pages the root-set pages link to and all the pages linking to a page in the root set, up to a designed size cutoff, such as 3000-6000 pages.
2. A weight-propagation phase is initiated. It is an iterative process that computes numerical estimates of hub and authority weights. A non-negative authority weight a_p and a non-negative hub weight h_p are associated with each page p in the base set, and all a and h values are initialized to a uniform constant. The weights are then updated according to:

  a_p = Σ_{q : q → p} h_q    (1)
  h_p = Σ_{q : p → q} a_q    (2)

Equation (1) says that if a page is pointed to by many good hubs, its authority weight should increase: it is the sum of the current hub weights of all pages pointing to it. Equation (2) says that if a page points to many good authorities, its hub weight should increase: it is the sum of the current authority weights of all pages it points to. Now define a matrix A whose rows and columns correspond to Web pages, with entry A_ij = 1 if page i links to page j, and 0 otherwise. Let a and h be the authority and hub weight vectors, whose i-th components give the degree of authority and hubbiness of the i-th page. Then

  h = A a,    a = A^T h,

where A^T is the transpose of matrix A. Unfolding these two equations n times, we have

  h = (A A^T) h = (A A^T)^2 h = ... = (A A^T)^n h,
  a = (A^T A) a = (A^T A)^2 a = ... = (A^T A)^n a,

so, after normalization, the hub and authority vectors converge to the principal eigenvectors of A A^T and A^T A respectively.
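To make the weight-propagation phase concrete, the following is a minimal Python sketch of the HITS iteration. It is our own illustration, not Kleinberg's reference implementation; the (source, target) edge representation, the fixed iteration count, and the per-iteration normalization are assumptions.

    import math

    def hits(pages, links, iterations=50):
        # links is a set of (source, target) pairs over the base set.
        auth = {p: 1.0 for p in pages}
        hub = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # Eq. (1): authority weight = sum of hub weights of in-neighbours.
            auth = {p: sum(hub[q] for (q, r) in links if r == p) for p in pages}
            # Eq. (2): hub weight = sum of authority weights of out-neighbours.
            hub = {p: sum(auth[r] for (q, r) in links if q == p) for p in pages}
            # Normalize so the weights stay bounded across iterations.
            na = math.sqrt(sum(v * v for v in auth.values())) or 1.0
            nh = math.sqrt(sum(v * v for v in hub.values())) or 1.0
            auth = {p: v / na for p, v in auth.items()}
            hub = {p: v / nh for p, v in hub.items()}
        return auth, hub

    # Example: A and C both link to B, and B links to C.
    auth, hub = hits(["A", "B", "C"], {("A", "B"), ("C", "B"), ("B", "C")})

On this tiny graph, B receives the highest authority weight (two hubs point to it), while A and C act as hubs.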
B. Page Rank Algorithm

The PageRank algorithm was presented and published by Sergey Brin and Larry Page at the Seventh International World Wide Web Conference (WWW7) in April 1998. It is a search ranking algorithm that uses the hyperlinks on the Web, and it is the algorithm on which the search engine Google, a huge success, was built. The following ideas, based on rank prestige, are used to derive the PageRank algorithm:
1. A hyperlink from one page to another is an implicit conveyance of authority to the target page. Thus, the more in-links a page i receives, the higher page i's prestige.
2. The pages that point to page i also have their own prestige scores. A page with a higher prestige score pointing to i counts for more than a page with a lower prestige score pointing to i. In other words, a page is important if it is pointed to by other important pages.

Following rank prestige in social networks, the importance of page i (i's PageRank score) is obtained by summing up the PageRank scores of all pages pointing to i. Since a page may point to many other pages, its prestige score should be shared among all the pages it points to. Model the Web as a directed graph G = (V, E), where V is the set of vertices or nodes (the set of all pages) and E is the set of directed edges (the hyperlinks). Let the total number of pages on the Web be n (that is, n = |V|). The PageRank score of page i, denoted P(i), is defined by

  P(i) = Σ_{(j,i) ∈ E} P(j)/O_j,

where O_j is the number of out-links of page j. Let P be the n-dimensional column vector of PageRank values, P = (P(1), P(2), ..., P(n))^T, and let A be the adjacency matrix of the graph with

  A_ij = 1/O_i if (i, j) ∈ E, and 0 otherwise.

The system of n equations can then be written as

  P = A^T P,    (3)

which is the characteristic equation of an eigensystem, where the solution P is an eigenvector with corresponding eigenvalue 1. Since this is a circular definition, an iterative algorithm is used to solve it.

The problem is that Eqn (3) alone does not suffice, because the Web graph does not meet the required conditions: it contains dangling pages with no out-links, and it is neither irreducible nor aperiodic, so the corresponding Markov chain (Eqn (3) can also be derived from a Markov-chain model of Web surfing) need not converge to a unique stationary distribution. After augmenting the Web graph to satisfy these conditions, the following PageRank equation is produced:

  P = (1 − d) e + d A^T P,

where e is a column vector of all 1's. The power iteration method for PageRank is:

  PageRank-Iterate(G)
    P_0 ← e/n
    k ← 1
    repeat
      P_k ← (1 − d) e + d A^T P_{k−1};
      k ← k + 1;
    until ||P_k − P_{k−1}|| < ε
    return P_k

The PageRank formula for each page i is thus

  P(i) = (1 − d) + d Σ_{j=1..n} A_ji P(j),

which is equivalent to the formula given in the original PageRank papers:

  P(i) = (1 − d) + d Σ_{(j,i) ∈ E} P(j)/O_j.

The parameter d is called the damping factor and can be set to a value between 0 and 1.
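As an illustration, here is a minimal Python sketch of the power-iteration method above. It is our own simplification, not Google's implementation: it ignores dangling pages, assumes a (source, target) edge-set input, and uses d = 0.85 and an L1 convergence test as arbitrary defaults.

    def pagerank(pages, links, d=0.85, eps=1e-6, max_iter=100):
        # links is a set of (source, target) pairs; O_j is the out-degree of j.
        out_deg = {p: sum(1 for (q, r) in links if q == p) for p in pages}
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}   # P_0 = e/n
        for _ in range(max_iter):
            new = {
                i: (1 - d) + d * sum(rank[j] / out_deg[j]
                                     for (j, t) in links
                                     if t == i and out_deg[j] > 0)
                for i in pages
            }
            # Stop when ||P_k - P_{k-1}|| falls below epsilon.
            if sum(abs(new[p] - rank[p]) for p in pages) < eps:
                return new
            rank = new
        return rank

    # Example: B and C both link to A; A links back to B.
    scores = pagerank(["A", "B", "C"], {("B", "A"), ("C", "A"), ("A", "B")})

Here A accumulates the highest score, since it receives in-links from both other pages.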
C. Web Usage Mining

Web usage mining is the discovery of meaningful patterns [2],[5] from data generated by client-server transactions on one or more Web servers. Typical sources of this data are:
1. Automatically generated data stored in server access logs (which record all requests processed by the server), referrer logs, agent logs, and client-side cookies.
2. E-commerce and product-oriented user events (e.g. shopping cart changes, ad or product click-throughs, etc.).
3. User profiles and/or user ratings.
4. Metadata, page attributes, the structure of each page contained in a Web site, and the content of the Web pages.

Web log format: A Web server log file records the requests made to the Web server. The most popular log formats are the Common Log Format (CLF) and the extended CLF. A CLF file is created by the Web server to keep track of the requests that occur on a Web site. The fields are:

  <ip_addr> <base_url> <date> <method> <file> <protocol> <code> <bytes> <referrer> <user_agent>

Approach of Web usage mining: Web usage mining comprises the following steps: data collection, data pretreatment (preprocessing), knowledge discovery, and pattern analysis.

Data collection: The first step in Web usage mining consists of collecting the relevant Web data that will be analyzed to provide useful information about the users' behavior. The two main sources of data for Web usage mining are data on the Web server side and data on the client side. If intermediaries are introduced into the client-server communication, they also become sources of usage data, such as proxy servers and packet sniffers.

Data preprocessing: Some databases are inconsistent, insufficient, and noisy. Data preprocessing [7], or pretreatment, transforms such databases into an integrated, consistent database that can be mined. It includes user identification, data fusion and cleaning, session identification, and path completion. The Web usage mining process is incomplete without the preprocessing stage.

Data fusion and cleaning: In some cases, multiple servers are used to reduce the load on any single server. The merging of log files from several Web and application servers is known as data fusion, see Figure 2. When a user's requests are spread over multiple Web or application servers, data fusion merges the data and resolves user identification across sessions.

Figure 2. Data Fusion

Data cleaning [2] eliminates irrelevant items, i.e. irrelevant records in the Web access log. It mostly removes extraneous references to embedded objects that are not important for the purpose of analysis, including references to graphics, style files, or sound files, as well as erroneous references; records that provide no useful information for analysis or data mining tasks are dropped. The following two kinds of records are unnecessary and should be removed:
1. Records of graphics, videos, and format information.
2. Records with a failed HTTP status code: records with status codes over 299 or under 200 should be removed.

Algorithm for data cleaning:

  For each record in the log database
    Read the URI-stem field
    If the URI-stem ends in *.gif, *.jpg, or *.css then
      Remove the record
    Else
      Save the record
    End if
  Next record
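The cleaning step can be sketched in Python as follows. This is a minimal illustration under our own assumptions: the regular expression targets a combined/extended CLF layout (field order varies between servers), and the suffix list and status-code test mirror the two removal rules above.

    import re

    # Assumed combined/extended CLF layout; adjust to the actual log format.
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<file>\S+) (?P<protocol>[^"]*)" '
        r'(?P<code>\d{3}) (?P<bytes>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')

    IRRELEVANT_SUFFIXES = ('.gif', '.jpg', '.css')

    def clean_log(lines):
        """Yield parsed records, dropping embedded objects and failed requests."""
        for line in lines:
            m = LOG_PATTERN.match(line)
            if m is None:
                continue                      # erroneous/unparseable reference
            rec = m.groupdict()
            if not 200 <= int(rec['code']) <= 299:
                continue                      # failed HTTP status code
            if rec['file'].lower().endswith(IRRELEVANT_SUFFIXES):
                continue                      # graphics/style-file record
            yield rec

The resulting record dictionaries feed directly into the user- and session-identification steps described next.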
User identification: Web usage mining does not require knowledge of a user's history; users may visit or send requests to the server more than once, and each visit generates separate log entries and sessions. User identification determines who accessed the Web site and which pages were accessed, producing what are also known as user activity records. A user is identified by the combination of the IP address and the user agent in the log files: when a client sends a request to the server, the server writes a log entry that also records the user agent sent by the client. This can be explained with the help of the following example.

TABLE I: USER DATA
  Time   IP               URL   Ref   Agent
  0.04   192.168.100.101  A     -     X
  0.10   192.168.100.101  B     A     X
  0.12   192.168.100.102  A     -     Y
  0.15   192.168.100.102  B     A     Y
  0.20   192.168.100.102  C     B     Y
  0.25   192.168.100.102  D     C     Y
  0.28   192.168.100.101  C     B     X
  0.33   192.168.100.101  D     C     X
  0.35   192.168.100.102  D     C     Y

The users can then be found from Table I by the combination IP + Agent, as follows:

TABLE II: USER 1
  Time   IP               URL   Ref   Agent
  0.04   192.168.100.101  A     -     X
  0.10   192.168.100.101  B     A     X
  0.28   192.168.100.101  C     B     X
  0.33   192.168.100.101  D     C     X

TABLE III: USER 2
  Time   IP               URL   Ref   Agent
  0.12   192.168.100.102  A     -     Y
  0.15   192.168.100.102  B     A     Y
  0.20   192.168.100.102  C     B     Y
  0.25   192.168.100.102  D     C     Y
  0.35   192.168.100.102  D     C     Y

Tables II and III show that 192.168.100.101 visits more than once, and so does 192.168.100.102. Thus we identify users by the IP address together with the user agent.

Session identification: Session identification divides the page accesses of each user into individual sessions. A session is a sequence of page views by a single user during a single visit; it is the record of that user's activity in the log file and is used to find how many sessions a single user creates when visiting the Web site. Sessions can be identified in two ways: time oriented and structure oriented. (A Python sketch of user identification and time-oriented sessionization follows Table VII below.)

Time oriented: The time-oriented method depends on the timestamp (date and time) of each request in the server log file. There are two variants:
i) the difference between the first and the last request is <= 30 minutes, or
ii) the difference between one request and the next is <= 10 minutes;
each such group of requests is treated as one session.

Structure oriented: The structure-oriented method relies on the referrer field of the server log and on whether the session that the referrer belongs to is currently "open"; a request may therefore belong to more than one open constructed session. Consider the example in Table IV.

TABLE IV: SESSION DATA
  Time   IP               URL   Ref   Agent
  0.04   192.168.100.101  A     -     X
  0.10   192.168.100.101  B     A     X
  0.12   192.168.100.102  A     -     Y
  0.15   192.168.100.102  B     A     Y
  0.20   192.168.100.102  C     B     Y
  0.25   192.168.100.102  D     C     Y
  0.48   192.168.100.101  C     B     X
  0.52   192.168.100.101  D     C     X
  0.58   192.168.100.102  D     C     Y

In Table V, under structure-oriented sessions, the session of user 192.168.100.102 is open from timestamp 0.12 to 0.25 and again at 0.58, and is considered one session.

TABLE V: SESSIONS WITH STRUCTURE ORIENTED
  Time   IP               URL   Ref   Agent
  0.12   192.168.100.102  A     -     Y
  0.15   192.168.100.102  B     A     Y
  0.20   192.168.100.102  C     B     Y
  0.25   192.168.100.102  D     C     Y
  0.58   192.168.100.102  D     C     Y

TABLE VI: SESSION 1
  Time   IP               URL   Ref   Agent
  0.12   192.168.100.102  A     -     Y
  0.15   192.168.100.102  B     A     Y
  0.20   192.168.100.102  C     B     Y
  0.25   192.168.100.102  D     C     Y

TABLE VII: SESSION 2
  Time   IP               URL   Ref   Agent
  0.58   192.168.100.102  D     C     Y
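The sketch promised above illustrates user identification by (IP, agent) and time-oriented sessionization. It is a minimal example under our own assumptions: records are dictionaries as produced by the cleaning sketch earlier, plus a numeric 'time' field in minutes, and the 10-minute inter-request gap of variant (ii) is used as the threshold.

    from collections import defaultdict

    def identify_users(records):
        """Group log records into users by the (IP, user agent) pair."""
        users = defaultdict(list)
        for rec in records:
            users[(rec['ip'], rec['agent'])].append(rec)
        return users

    def time_oriented_sessions(requests, gap=10.0):
        """Split one user's requests into sessions: a new session starts
        whenever two consecutive requests are more than `gap` minutes apart."""
        requests = sorted(requests, key=lambda r: r['time'])
        sessions, current = [], []
        for rec in requests:
            if current and rec['time'] - current[-1]['time'] > gap:
                sessions.append(current)
                current = []
            current.append(rec)
        if current:
            sessions.append(current)
        return sessions

    # Usage: group the cleaned records, then sessionize each user separately.
    # users = identify_users(records)
    # for (ip, agent), reqs in users.items():
    #     print(ip, agent, [len(s) for s in time_oriented_sessions(reqs)])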
Path completion: After the sessions have been identified, as shown in Tables VI and VII, path completion begins, because the Web pages a user actually visited must be confirmed. Causes of path incompletion include the local cache, the agent cache, and the browser's back button, which can result in important accesses not being recorded in the access log file, so that the number of URLs recorded in the log may be less than the real number. To discover the user's travel pattern, the missing pages in the user's access path should be appended. Path completion uses the missing-reference method, since a user's backtracking is not stored in the server log file.

Pattern discovery: Statistical methods such as frequency analysis, mean, and median are used to analyze and mine the pretreated data [1]. Clustering of users helps discover groups of users with similar navigation patterns (useful for providing personalized Web content). Classification is the technique of mapping a data item into one of several predefined classes. Association rules discover correlations among pages accessed together by a client. Sequential patterns extract frequently occurring inter-session patterns, such that the presence of a set of items is followed by another item in time order. Dependency modeling determines whether there are any significant dependencies among the variables in the Web.

Association rules: Let I = {i1, i2, ..., in} be a set of n distinct items, and let DB = {T1, T2, ..., Tm} be the transaction database consisting of m transactions, where each transaction Ti = {i1, i2, ..., ik} is a set of k items from I. An association rule [3] is then specified as X => Y, where X, Y ⊆ I and X ∩ Y = Ø. Such rules have two attributes associated with them: support and confidence. Let S be the percentage of transactions in DB that contain X ∪ Y; then S is the support of X => Y. Let C be the percentage of transactions in DB containing X that also contain Y; then the rule X => Y holds with confidence C. X is known as the antecedent and Y as the consequent.

Association rules in Web usage mining: In Web usage mining, association rules [6] refer to sets of pages that are accessed together with a minimum support value, which can help in organizing the Web space efficiently. For example, if 70% of the users who accessed get/programs/courses/x.asp also accessed get/programs/courses/y.asp, but only 30% of those who accessed get/programs/courses accessed get/programs/courses/y.asp, this suggests that some information in x.asp is leading clients to access y.asp, which helps the designers decide whether to place a link between the two pages. Association rule mining is thus one of the most popular pattern-discovery methods.
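A small worked example of support and confidence over sessions (treated as transactions) is sketched below; the session data and page names are hypothetical, chosen only for illustration.

    def support(itemset, transactions):
        """Fraction of transactions containing every item in `itemset`."""
        hits = sum(1 for t in transactions if itemset <= t)
        return hits / len(transactions)

    def confidence(x, y, transactions):
        """Confidence of the rule X => Y: support(X u Y) / support(X)."""
        return support(x | y, transactions) / support(x, transactions)

    # Four sessions over pages A, B, C (made-up data).
    sessions = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B", "C"}]
    print(support({"A", "B"}, sessions))        # 0.5: 2 of 4 sessions
    print(confidence({"A"}, {"B"}, sessions))   # 2/3: 2 of the 3 A-sessions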
Steps of the Apriori algorithm:
1. Scan the database iteratively to find the support count of each k-itemset, for k = 1, ..., p, where p is at most the maximum number of items in a transaction. All itemsets whose support count is greater than or equal to the user-specified minimum support are the frequent itemsets.
2. Generate association rules from the frequent itemsets: for every frequent itemset X and every nonempty proper subset Y of X, if support(X)/support(Y) ≥ minimum confidence, output the rule Y => (X − Y).

Apriori Algorithm: Finds the frequent itemsets using an iterative, level-wise approach based on candidate generation.
Input: database DB of transactions; minimum support threshold min_sup.
Output: L, the frequent itemsets in DB.
Method:

  C1 = {candidate 1-itemsets};
  L1 = {c ∈ C1 | c.count ≥ min_sup};
  for (k = 2; L_{k−1} ≠ Ø; k++) do begin
    Ck = Candidate-gen(L_{k−1});
    for all transactions Ti ∈ DB do begin
      Ct = subset(Ck, Ti);          // candidates contained in Ti
      for all candidates c ∈ Ct do
        c.count++;
    end
    Lk = {c ∈ Ck | c.count ≥ min_sup};
  end
  Answer = ∪k Lk;

  Procedure Candidate-gen(L_{k−1})
    Ck ← Ø;
    for all l1, l2 ∈ L_{k−1}
        with l1 = {i1, ..., i_{k−2}, i_{k−1}}, l2 = {i1, ..., i_{k−2}, i′_{k−1}}
        and i_{k−1} < i′_{k−1} do begin
      c ← {i1, ..., i_{k−2}, i_{k−1}, i′_{k−1}};
      Ck ← Ck ∪ {c};
      for each (k−1)-subset s of c do
        if s ∉ L_{k−1} then
          delete c from Ck;         // prune candidates with an infrequent subset
    end
    return Ck;

FP-Growth Algorithm: The FP-growth algorithm was the first algorithm based on the pattern-growth concept. It constructs an FP-tree structure and mines frequent patterns by traversing the constructed FP-tree. An FP-tree is a compact data structure that represents the data set in tree form.

FP-tree structure: The FP-tree structure [8] holds sufficient information to mine the complete set of frequent patterns. It consists of a prefix tree over the frequent 1-itemsets and a frequent-item header table. A node in the prefix tree has three fields: item-name, count, and node-link.
1. Item-name registers which item the node represents.
2. Count is the number of transactions containing the frequent 1-items on the path from the root to this node.
3. Node-link is the link to the next node in the FP-tree with the same item-name.
Each entry in the frequent-item header table has two fields: item-name and head of node-link, where the head of node-link points to the first node in the prefix tree with that item-name.

Construction of the FP-tree:
1. First, create the root of the tree, labeled "null".
2. Scan the database D again (it was already scanned once to create the 1-itemsets and then L).
3. Process the items in each transaction in L order (i.e. sorted by descending support count).
4. Create a branch for each transaction, with each item carrying its support count (written after a colon).
5. Whenever the same node is encountered in another transaction, increment the support count of the common node or prefix.
6. To facilitate tree traversal, build an item header table so that each item points to its occurrences in the tree via a chain of node-links.
7. The problem of mining frequent patterns in the database is thereby transformed into that of mining the FP-tree.
(A Python sketch of this construction is given after the advantages list below.)

Mining the FP-tree by creating conditional (sub-)pattern bases:
1. Start from each frequent length-1 pattern, as the initial suffix pattern.
2. Construct its conditional pattern base, which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern.
3. Then construct its conditional FP-tree and perform mining on that tree.
4. The pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from the conditional FP-tree.
5. The union of all frequent patterns generated by step 4 gives the required frequent itemsets.

Advantages of the FP-growth algorithm:
1. FP-growth is an order of magnitude faster than the Apriori algorithm and tree projection.
2. There is no candidate generation and no candidate test.
3. It uses a compact data structure.
4. It eliminates repeated database scans.
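To make construction steps 1-6 concrete, below is a minimal Python sketch of FP-tree construction. The class layout and the set-based transaction format are our own assumptions, and the mining phase (conditional pattern bases) is omitted for brevity.

    from collections import defaultdict

    class FPNode:
        """Prefix-tree node with the three fields described above."""
        def __init__(self, item, parent):
            self.item = item          # item-name
            self.count = 1            # transactions sharing this prefix
            self.parent = parent
            self.children = {}

    def build_fp_tree(transactions, min_sup):
        # First scan: support counts of 1-itemsets; keep the frequent ones.
        counts = defaultdict(int)
        for t in transactions:
            for item in t:
                counts[item] += 1
        frequent = {i for i, c in counts.items() if c >= min_sup}
        root = FPNode(None, None)                 # step 1: the "null" root
        header = defaultdict(list)                # item-name -> node-link chain
        # Second scan: insert each transaction in L (descending support) order.
        for t in transactions:
            items = sorted((i for i in t if i in frequent),
                           key=lambda i: (-counts[i], i))
            node = root
            for item in items:
                child = node.children.get(item)
                if child is not None:
                    child.count += 1              # shared prefix: bump count
                else:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header[item].append(child)    # extend the node-link chain
                node = child
        return root, header

    # Example: tree, links = build_fp_tree([{"A","B"}, {"B","C"}, {"A","B","C"}], 2)

The header table returned here plays the role of the frequent-item header table: for each item it chains together all nodes carrying that item-name, which is what the mining phase would traverse.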
Pattern analysis: Pattern analysis is the final stage of the whole Web usage mining process. Its main goal is to eliminate irrelevant or unwanted rules or patterns and to extract the interesting rules or patterns from the output of the pattern-discovery process. The output of the earlier stages of Web usage mining is often not directly usable by Web site administrators, who seek answers to questions such as "How are people using the site?" and "Which pages are being accessed most frequently?". Such questions require analyzing the structure of the hyperlinks as well as the contents of the pages, which can be done with the help of several analysis methodologies and tools. The common techniques used for pattern analysis are visualization techniques, data and knowledge querying, OLAP (online analytical processing) techniques, and usability analysis.

Visualization techniques: Visualization has been used very successfully to help people understand various types of phenomena, both real and abstract, and it is therefore a natural choice for understanding the behavior of Web users. Groth (1999) argues that visualization is simply the graphical presentation of data.

OLAP techniques: Online analytical processing (OLAP) is emerging as a very powerful paradigm for the strategic analysis of databases in business settings. The key characteristics of strategic analysis are: very large data volumes, support for various kinds of information aggregation, explicit support for the temporal dimension, and long-range analysis in which overall trends are more important than the details of individual data items. OLAP can be performed directly on top of relational databases, but industry has developed specialized tools to make it more efficient and effective.

Data and knowledge querying: A major reason for the great success of relational database technology is the existence of a high-level, declarative query language, which allows an application to express what conditions the data it needs must satisfy, rather than how to retrieve that data. Given the large number of patterns that may be mined, there is a definite need for a mechanism to specify the focus of the analysis. First, constraints may be placed on the database to restrict the portion of the database to mine from. Second, querying may be performed on the knowledge extracted by the mining process, in which case a language for querying knowledge, rather than data, is needed.

Usability analysis: The first step in this method is to develop instrumentation that collects data about software usability. This data is then used to build computerized models and simulations that explain the data. Finally, various visualization and data-presentation techniques are used to help an analyst understand the phenomenon. The usability-analysis approach can also be used to model the browsing behavior of users on the Web. However, most of these techniques are disliked by users because of slow speeds, inflexibility, difficulty of maintenance, and limited functionality; much work remains for both researchers and developers to produce a more efficient, flexible, and powerful set of tools for this task.

III. CONCLUSION

Web mining is used to extract useful information from very large amounts of Web data. In this article, we have tried to give a brief introduction to the broad field of Web mining.
We motivated this field of research, gave more formal definitions of the terms used herein, and presented a brief overview of currently available Web mining techniques, including the preprocessing and pattern-discovery phases of Web usage mining.

REFERENCES

[1] Punithavalli and V. Sujatha, "A Study of Web Navigation Pattern Using Clustering Algorithm in Web Log Files," IJSER, vol. 2, issue 9, September 2011.
[2] B. Purswani and A. Upadhyay, "Web Usage Mining has Pattern Discovery," IJSER, vol. 3, issue 2, ISSN 2250-3153, February 2013.
[3] M. Kiruthika, D. Dixit, and R. Jadhav, "Pattern Discovery Using Association Rules," IJASCA, vol. 2, no. 12, 2011.
[4] H. K. Singh and B. Singh, "Web Data Mining Research: A Survey," IEEE, 2010.
[5] J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan, "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data," SIGKDD Explorations, vol. 1, issue 2, January 2000.
[6] M. D. Mitharam, "Association Rule in Web Usage Mining," IJSER, vol. 3, issue 4, April 2012.
[7] M. D. Mitharam, "Preprocessing in Web Usage Mining," IJSER, vol. 3, issue 2, ISSN 2229-5518, February 2012.
[8] R. Ivancsy and I. Vajk, "Frequent Pattern Mining in Web Log Data," Acta Polytechnica Hungarica, vol. 3, no. 1, pp. 77-90, 2006.