Chapter III
Web Data Mining Analysis

3.1 Data Mining

Data mining has an important place in today's world. It has become an important research area because a huge amount of data is available in most applications. This huge amount of data must be processed in order to extract useful information and knowledge, since these are not explicit. Data mining is the process of discovering interesting knowledge from large amounts of data [1].

Figure 3.1 Phases of Data Mining

The data source for data mining can be different types of databases, text files or other types of files containing different kinds of data. Data mining is an interdisciplinary research field related to database systems, statistics, machine learning, information retrieval and so on. Data mining is an iterative process consisting of the following steps:

• data cleaning
• data integration
• data selection
• data transformation
• data mining
• pattern evaluation
• knowledge presentation

The complete data mining process is given in Figure 3.1. Data cleaning handles missing and redundant data in the source file. Real-world data can be incomplete, inconsistent and corrupted; in this step, missing values are filled in or removed, noisy values are smoothed and outliers are identified, with each of these deficiencies handled by different techniques. Data integration combines data from various sources. The source data can be multiple distinct databases with different data definitions; in this case, the data integration step loads the data from these multiple sources into a single coherent data store. In the data selection step, the data relevant for the mining task is retrieved from the data source.

Data transformation converts the source data into a format suitable for data mining. It includes basic data management tasks such as smoothing, aggregation, generalization, normalization and attribute construction. In the data mining step, intelligent methods are applied in order to extract data patterns. Pattern evaluation is the task of identifying the interesting patterns among the extracted pattern set. Knowledge presentation uses visualization techniques to present the discovered knowledge to the user.

Data mining has various application areas including banking, biology, e-commerce and so on; these are the most well-known and classical application areas. Newer data mining applications include processing spatial data, multimedia data, time-related data and the World Wide Web.

The World Wide Web is one of the largest and most widely known data sources. Today, the Web contains billions of documents edited by millions of people, and the total size of these documents amounts to many terabytes. The documents on the Web are distributed over millions of computers connected by telephone lines, optical fibres and radio modems. The Web is growing at a very high rate in traffic volume, number of documents and complexity of web sites. Due to this trend, the demand for extracting valuable information from this huge data source is increasing every day. This has led to a new area called Web mining [2], the application of data mining techniques to the World Wide Web.

3.2 Web Mining

3.2.1 General Overview of Web Mining

With the rapid and explosive growth of information available over the Internet, the World Wide Web has become a powerful platform to store, disseminate and retrieve information, as well as to mine useful knowledge.
Due to the huge, diverse, dynamic and unstructured nature of Web data, Web data research has encountered many challenges, such as scalability, multimedia and temporal issues. As a result, Web users are always drowning in an "ocean" of information and face the problem of information overload when interacting with the web. Typically, the following problems are mentioned in Web-related research and applications.

3.2.1.1 Finding relevant information

To find specific information on the web, users often either browse Web documents directly or use a search engine as a search assistant. When a user utilizes a search engine to locate information, the user enters one or several keywords as a query, and the search engine returns a list of pages ranked by their relevance to the query. However, there are usually two major concerns associated with query-based Web search [3], namely low precision and low recall. Low precision is caused by the many irrelevant pages returned by the search engine, while low recall is due to the inability to index all Web pages available on the Internet, which makes it difficult to locate unindexed information that is actually relevant.

3.2.1.2 Finding needed information

Most search engines perform in a query-triggered way, mainly on the basis of the one or several keywords entered. Sometimes the results returned by the search engine do not exactly match what a user really needs, because the same keyword can refer to entirely different things. For example, when a user with an information technology background searches for information on the "Python" programming language but enters only the single word "python" as the query, the user might be presented with information about the python snake rather than the programming language. In other words, the semantics of Web data [4] is rarely taken into account in the context of Web search.

3.2.1.3 Learning useful knowledge

With a traditional Web search service, query results relevant to the query input are returned to Web users as a ranked list of pages. In some cases, users are interested not only in browsing the returned collection of Web pages but also in extracting potentially useful knowledge out of them (a data mining orientation). More interestingly, a number of studies [5-7] have recently been conducted on how to utilize the Web as a knowledge base for decision making or knowledge discovery.

3.2.1.4 Recommendation/personalization of information

While a user interacts with the web, there is a wide diversity in the user's navigational preferences, which results in the need for different contents and presentations of information. Thus, to improve the quality of Internet services and increase the user click rate on a specific website, it is necessary for a Web developer or designer to know what the user really wants to do, to predict which pages the user is potentially interested in, and to present customized Web pages to the user by learning the user's navigational pattern knowledge [4, 8, 9].

The above problems place the existing search engines and other Web applications under significant stress. A variety of efforts have been made to deal with these difficulties by developing advanced computational intelligence techniques and algorithms from different research domains, such as databases, data mining, machine learning, information retrieval and knowledge management.
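The low precision and low recall problems described in Section 3.2.1.1 can be stated formally. The following are the standard information-retrieval definitions, added here only for reference; they are not reproduced from the cited sources.

```latex
\[
\mathrm{precision} = \frac{|\{\text{relevant pages}\} \cap \{\text{retrieved pages}\}|}{|\{\text{retrieved pages}\}|},
\qquad
\mathrm{recall} = \frac{|\{\text{relevant pages}\} \cap \{\text{retrieved pages}\}|}{|\{\text{relevant pages}\}|}
\]
```

Low precision therefore means that many of the retrieved pages are irrelevant, while low recall means that many relevant pages are never retrieved, for example because they have not been indexed.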
The challenges listed above have led to research into the effective discovery and use of resources on the World Wide Web, and this effort has resulted in a new research area called Web mining. Indeed, there is no major difference between data mining and web mining: Web mining can be defined as the application of data mining techniques to extract knowledge from web data, including web documents, hyperlinks between documents, usage logs of websites and so on [10]. There are two different approaches to defining Web mining. The first is a "process-centric view", which defines Web mining as a sequence of ordered tasks [11]. The second is a "data-centric view", which defines Web mining with respect to the types of web data that are used in the mining process [12]. The data-centric definition has become more widely accepted [3, 13, 14].

Web mining can be classified with respect to the data it uses. The Web involves three types of data [13]: the actual data on the WWW, the web log data obtained from the users who browsed the web pages, and the web structure data. Thus, web mining focuses on three important dimensions: web structure mining, web content mining and web usage mining. A detailed overview of the web mining categories is given in Section 3.3.

3.2.2 Types of Web Data

The World Wide Web contains various information sources in different formats. As stated above, the World Wide Web involves three types of data; the categorization is given in Figure 3.2. Web content data is the data that web pages are designed to present to the users. It consists of free text, semi-structured data such as HTML pages, and more structured data such as automatically generated HTML pages, XML files or data in tables related to web content. Textual, image, audio and video data types fall into this category.

The most common web content data on the web is HTML pages. HTML (Hypertext Markup Language) is designed to specify the logical organization of documents with hypertext extensions. HTML was first implemented by Tim Berners-Lee at CERN and was popularized by the Mosaic browser developed at NCSA; in the 1990s it became widespread with the growth of the Web, and it has since been extended in various ways. The WWW depends on web page authors and vendors sharing the same HTML conventions. Different browsers can render an HTML document in different ways; to illustrate, one browser may indent the beginning of a paragraph, while another may only leave a blank line. However, the base structure remains the same and the organization of the document is constant. HTML instructions divide the text of a web page into blocks called elements. HTML elements can be examined in two categories: those that define how the body of the document is to be displayed by the browser, and those that define information about the document, such as the title or relationships to other documents.

Figure 3.2 Types of Web Data

Another common kind of web content data is XML documents. XML is a markup language for documents containing structured information. Structured information contains both the content and information about what that content includes and stands for. Almost all documents have some structure. XML has been accepted as a markup language, that is, a mechanism to identify structures in a document, and the XML specification defines a standard way to add markup to documents. XML does not specify semantics or a tag set; in fact, it is a meta-language for describing markup languages.
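To make the meta-language point concrete, the short sketch below embeds a tiny XML document whose tag names (catalogue, book, title, author) are invented for this illustration; XML itself attaches no meaning to them, and it is the processing application, here a few lines of Python using the standard xml.etree.ElementTree module, that decides what the tags stand for.

```python
import xml.etree.ElementTree as ET

# A small XML document with author-defined tags; XML itself assigns
# no meaning to <book>, <title> or <author>.
document = """
<catalogue>
    <book year="2000">
        <title>Data Mining: Concepts and Techniques</title>
        <author>J. Han</author>
    </book>
</catalogue>
"""

root = ET.fromstring(document)
for book in root.findall("book"):
    # The application decides that <title> and <author> describe a book.
    print(book.findtext("title"), "-", book.findtext("author"), book.get("year"))
```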
XML provides a mechanism to define tags and the structural relationships between them. The semantics of an XML document are defined either by the applications that process it or by style sheets.

Dynamic server pages are also an important part of web content data. Dynamic content is any web content that is processed or compiled by the web server before the result is sent to the web browser, whereas static content is sent to the browser without modification. Common forms of dynamic content are Active Server Pages (ASP), PHP (Hypertext Preprocessor) pages and Java Server Pages (JSP). Today, several web servers support more than one type of active server page.

The second type of web data, web structure data, describes how pages are connected to one another by hyperlinks and is commonly represented as a web graph, in which pages are nodes and hyperlinks are directed edges. The size of the web graph varies from one domain to another; an example web graph for a particular web domain is given in Figure 3.3.

Figure 3.3 An Example Web Graph for a Particular Web Domain

The edges of the web graph have the following semantics: outgoing arcs stand for the hypertext links contained in the corresponding page, and incoming arcs represent the hypertext links through which the corresponding page is reached. The web graph is used in applications such as web indexing, detection of web communities and web searching. The whole web graph grows at an amazing rate. More specifically, in January 2010 it was estimated that the whole web graph consisted of about 10.2 billion nodes [15] and 150 billion edges, since the average node has roughly seven hypertext links (directed edges) to other pages [16]. In addition, approximately 20.3 million nodes are added every day and many nodes are modified or removed, so that the Web graph might currently contain more than 10 billion nodes and about 100 billion edges in all.

Web usage data includes web log data from web server access logs, proxy server logs, browser logs, registration data, cookies and any other data generated as the result of web users' interactions with web servers. Web log data is created on the web server. Every web server has a unique IP address and a domain name. When a user enters a URL in a browser, the request is sent to the web server; the web server then fetches the page and sends it to the user's browser. Web server data is created from the interaction between a web user and a web site through the web server. A web server log, containing web server data, is created as a result of the http process that runs on web servers [17]. All types of server activities, such as successes, errors and lack of response, are logged into a server log file [18]. Web servers dynamically produce and update four types of "usage" log files: the access log, agent log, error log and referrer log. Web access logs have fields containing web server data, including the date, time, user's IP address, user action, request method and requested data. Error logs include data about specific events such as "file not found", "document contains no data" or configuration errors, providing the server administrator with information about problematic and erroneous links on the server; another type of data recorded in the error log is aborted transmissions. Agent logs provide data about the browser, browser version and operating system of the requesting user.

Generally, web server logs are stored in the Common Logfile Format [CLF] or the Extended Logfile Format [ELF]. The Common Logfile Format includes the date, time, client IP, remote log name of the user, bytes transferred, server name, requested URL and the http status code returned.
The Extended Logfile Format includes the bytes sent and received, server name, IP address, port, request query, requested service name, time elapsed for the transaction to complete, version of the transfer protocol used, user agent (the browser program making the request), cookie ID, and referrer. Web server logging tools, also known as web traffic analyzers, analyze the log files of a web server and produce reports from this data source. These data can be used in planning and optimizing the web site structure.

3.3 Web Mining Categories

Web mining can be broadly divided into three categories according to the kinds of data to be mined [10]:

• Web Content Mining
• Web Structure Mining
• Web Usage Mining

Figure 3.4 Taxonomy of Web Mining

Web content mining is the task of extracting knowledge from the content of documents on the World Wide Web, such as mining the content of HTML files. Web document text mining, resource discovery based on concept indexing and agent-based technology fall into this category. Web structure mining is the process of extracting knowledge from the link structure of the World Wide Web. Web usage mining, also known as web log mining, is the process of discovering interesting patterns from web access logs on servers. The taxonomy of web mining is given in Figure 3.4.

3.3.1 Web Content Mining

Web content mining is the process of extracting useful information from the contents of Web documents. Content data is the collection of information designed to be conveyed to the users; it may consist of text, images, audio, video, or structured records such as lists and tables. Text mining and its application to Web content has been the most widely studied form of web content mining. Some of the issues in text mining are topic discovery, extracting association patterns, clustering of web documents and classification of web pages. These tasks also involve techniques from other disciplines such as Information Retrieval (IR) and Natural Language Processing (NLP). There is also a significant body of work on discovering knowledge from images in the fields of image processing and computer vision, but the application of these techniques to Web content mining has not been very effective yet.

Unstructured documents such as plain text files also fall under web content mining. Unstructured documents are free texts on the WWW, such as newspaper pages. Most research in this area uses a bag-of-words representation for unstructured documents [19]. Other research in web content mining includes Latent Semantic Indexing (LSI) [20], which transforms the original document vectors into a lower-dimensional space by analysing the structure of terms in the document collection. Another important information resource used in web content mining is the position of words in the document [21, 22], and there is research focusing on this for solving the document categorization problem. The use of text compression is also a new research direction for the text classification task. Web content mining applications range from text classification or categorization to finding and extracting patterns or rules. Topic detection and tracking are also research areas related to web content mining. The text mining methods and the document representations used in web content mining are given in Table 3.1 below [3].
Table 3.1 Text Mining Methods Included in Web Content Mining

Document representations: Bag of Words; TFIDF; Bag of Words with n-grams; Word positions; Relational; Phrases; Concept categories; Terms; Hyponyms and synonyms; Sentences and clauses; Named entities.

Methods: Episode Rules; Naive Bayes; Bayes Nets; Support Vector Machines; Hidden Markov Models; Maximum Entropy; Rule-Based Systems; Boosted Decision Trees; Neural Networks; Logistic Regression; Clustering Algorithms; K-Nearest Neighbour; Decision Trees; Self-Organizing Maps; Unsupervised Hierarchical Clustering; Statistical Analysis; Propositional Rule-Based Systems; Relative Entropy; Association Rules; Rule Learning; Text Compression.

3.3.2 Web Structure Mining

As stated above, the web graph is composed of web pages as nodes and hyperlinks as edges, the edges representing the connection between two web pages. Web structure mining can be defined as the task of discovering structure information from the web. The aim of web structure mining is to produce structural information about a web site and its web pages. Unlike web content mining, which mainly concentrates on the information of a single document, web structure mining tries to discover the link structure of the hyperlinks between documents. By using the topology of the hyperlinks, web structure mining can classify web pages and produce results such as the similarity and relationship between different web sites.

Web structure mining can be classified into two categories based on the type of structure data used: link information and document structure. Given a collection of web pages and their topology, interesting facts related to page connectivity can be discovered; there has been a detailed study of inter-page relations and hyperlink analysis [23], which provides an up-to-date survey. In addition, web document contents can also be represented in a tree-structured format, based on the different HTML and XML tags within the page, and recent studies [24, 25] have focused on automatically extracting document object model (DOM) structures from documents.

Interesting facts describing the connectivity of a Web subset can be discovered from a given collection of connected web documents. The structure information obtained from web structure mining includes the following:

• the frequency of local links in the web tuples of a web table
• the frequency of web tuples in a web table containing links within the same document
• the frequency of web tuples in a web table that contain global links, i.e. links that point towards different web sites
• the frequency of identical web tuples that appear in the web tables.

In general, if a web page is connected to another web page by hyperlinks, or the web pages are neighbours, the relationship between these pages needs to be discovered. The relations between such pages can be categorized with respect to different properties: they may be related by synonyms or ontology, they may have similar contents, they may reside on the same server, or the same person may have created them. Another task of web structure mining is to discover the nature of the hierarchy or network of hyperlinks in the web sites of a particular domain.
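Both of these tasks operate on the web graph described earlier. As a purely illustrative aid, the sketch below builds such a graph for a handful of hypothetical pages as an adjacency list and computes the in-degree and out-degree of each page; the page names and the representation chosen are assumptions for the example, not part of the cited studies.

```python
from collections import defaultdict

# Hypothetical pages and the hyperlinks (directed edges) between them.
links = [
    ("index.html", "about.html"),
    ("index.html", "products.html"),
    ("products.html", "order.html"),
    ("about.html", "index.html"),
]

out_links = defaultdict(list)   # page -> pages it links to
in_degree = defaultdict(int)    # page -> number of incoming links

for source, target in links:
    out_links[source].append(target)
    in_degree[target] += 1

for page in {p for edge in links for p in edge}:
    print(page, "out-degree:", len(out_links[page]), "in-degree:", in_degree[page])
```

Counts of this kind underlie the link-frequency measures listed above.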
Discovering such hierarchies may help to generalize the flow of information in web sites representing a particular domain, so that query processing can be performed more easily and efficiently. Web structure mining has a strong relation to web content mining, because web documents contain links and both use the real, or primary, data on the Web; it is common to see the two mining areas applied to the same task.

Web structure data describes the organization of the content. Intra-page structure information covers the arrangement of the various HTML or XML tags within a given page, whereas inter-page structure information consists of the hyperlinks connecting one page to another. The web graph is constructed from the hyperlink information in web pages and has been widely adopted as the core description of the web structure; it is the most widely accepted way of representing web structure related to web page connectivity (dynamic and static links). The Web graph is a representation of the WWW at a given time [26]. It stores the link structure and connectivity between the HTML documents on the WWW: each node in the graph corresponds to a unique web page or document, and an edge represents an HTML link from one page to another. The general properties of web graphs are given below:

• Directed, very large and sparse.
• Highly dynamic: nodes and edges are added and deleted very often, the content of existing nodes is also subject to change, and pages and hyperlinks are created on the fly.
• Apart from the primary connected component, there are also smaller disconnected components.

3.3.3 Web Usage Mining

Web usage mining is the process of applying data mining techniques to discover interesting patterns from Web usage data. Web usage mining provides a better understanding for serving the needs of Web-based applications [27]. The quality of the patterns discovered in a web usage mining process highly depends on the quality of the data used in the mining process. Web usage data contains information about the Internet addresses of web users together with their navigational behaviour. The basic information sources for web usage mining can be classified into three categories:

Application Server Data: Application server software such as WebLogic, BroadVision or StoryServer, used for e-commerce applications, has important structural properties that allow many e-commerce applications to be built on top of it. One of the most important properties of application servers is their ability to keep track of several types of business transactions and record them in application server logs.

Application Level Data: At the application level, the number of event types increases while moving to the upper layers, and application level data can be logged in order to generate histories of specially defined events. This type of data is classified into three categories based on the source of the information: server side, client side and proxy side data. Server side data gives information about the behaviour of all users, whereas client side data gives information about a single user using that particular client; proxy side data lies somewhere in between the two.

Web Server Data: This is the most commonly used data type in web usage mining applications. It is the data obtained from the user logs kept by a web server. The basic information source in most web usage mining applications is the access log files on the server side.
When a user agent (e.g., Internet Explorer, Mozilla or Netscape) requests a URL in a domain, the information related to that operation is recorded in an access log file. The access log file on the server side contains log information for every user who opened a session; these logs include the list of items that a user agent has accessed. The file follows the Common Log Format [CLF], whose records have seven common fields:

1. user's IP address
2. access date and time
3. request method (GET or POST)
4. URL of the page accessed
5. transfer protocol (HTTP 1.0, HTTP 1.1)
6. return (status) code
7. number of bytes transmitted.

The information in these records can be used to recover session information. The attributes needed to obtain session information from these tuples are:

1. user's IP address
2. access date and time
3. URL of the page accessed.

Possible application areas of web usage mining are prediction of the user's behaviour within the site, comparison between expected and actual web site usage, and reconstruction of the web site structure based on the interests of its users. Much research has been carried out in the database, information retrieval, intelligent agents and topology fields, which provides the basic foundation for web content mining and web structure mining; web usage mining, however, is a relatively new research area that has attracted more and more attention in recent years. After this general introduction to web usage mining, the phases of web usage mining are given in the next section.

3.4 Characteristics of Web Data

Data on the web has its own distinctive features compared to the data in conventional database management systems. Web data usually exhibits the following characteristics.

The data on the Web is huge in amount. Currently, it is hard to estimate the exact data volume available on the Internet due to the exponential growth of Web data every day. For example, in 2009 the Google search engine had an index of 14 billion web pages and web-accessible documents, and as of November 2010 the top search engines claim to index from 10 billion to 50 billion web documents. The enormous volume of data on the Web makes it difficult to handle Web data with traditional database techniques.

The data on the Web is distributed and heterogeneous. Because the Web is essentially an interconnection of various nodes over the Internet, Web data is usually distributed across a wide range of computers or servers located at different places around the world. Meanwhile, Web data often exhibits an intrinsically multimedia nature: in addition to textual information, which mostly expresses content as text, many other types of Web data, such as images, audio files and video clips, are often included in a Web page. This requires Web data processing techniques that can deal with the heterogeneity of multimedia data.

The data on the Web is unstructured. There are, so far, no rigid and uniform data structures or schemas that Web pages must strictly follow, as is required in conventional database management. Instead, Web designers are able to organize related information on the Web in their own ways, as long as the information arrangement meets the basic layout requirements of Web documents, such as the HTML format.
Although Web pages in well-defined HTML format contain some preliminary structure, e.g. tags or anchors, these structural components primarily benefit the presentation quality of Web documents rather than reveal the semantics contained in them. As a result, there is an increasing need to deal with the unstructured nature of Web documents and to extract the mutual relationships hidden in Web data, so as to help users locate the Web information or services they need.

The data on the Web is dynamic. The implicit and explicit structure of Web data is updated frequently. In particular, because of the different applications of web-based data management systems, a variety of presentations of Web documents are generated as the contents of the underlying databases are updated, and dangling links and relocation problems arise when domain or file names change or disappear. This leads to frequent schema modifications of Web documents, from which traditional information retrieval techniques often suffer.

The aforementioned features indicate that Web data is a specific type of data, different from the data residing in traditional database systems. As a result, there is an increasing demand to develop more advanced techniques to address Web information search and data management. According to their aims and purposes, these studies and developments mainly concern two aspects of Web data management: how to accurately find the needed information on the Internet, i.e. Web information search, and how to efficiently and fully manage and utilize the information and knowledge available from the Internet, i.e. Web data/knowledge management. In particular, with the recent popularity and development of the Internet, such as the Semantic Web, Web 3.0 and so on, more and more advanced Web data based services and applications are emerging that allow Web users to easily locate needed information and efficiently share information in a collaborative environment.

3.5 Web Data Search

Web search engine technology [28, 29] has emerged to cater for the rapid growth and exponential flux of Web data on the Internet and to help Web users find the desired information; it has resulted in various commercial Web search engines available online, such as Yahoo!, Google, AltaVista, TamilHunt and so on. Search engines can be categorized into two types: general-purpose search engines and specific-purpose search engines. General-purpose search engines, for example the well-known Google search engine, aim to retrieve for Web users as many Web pages relevant to the query as possible from across the Internet. The returned Web pages are ranked according to their relevance weights with respect to the query, and user satisfaction with the search results depends on how quickly and how accurately users can find the desired information. Specific-purpose search engines, on the other hand, aim at searching Web pages for a specific task or an identified community. For example, Google Scholar and DBLP are two representatives of specific-purpose search engines. The former is a search engine for academic papers and books as well as their citation information across different disciplines, while the latter is designed for a specific researcher community, namely computer science, and provides various information regarding conferences and journals in the computer science domain, such as conference homepages, abstracts or the full text of papers published in computer science journals or conference proceedings. DBLP has become a helpful and practicable tool for researchers and engineers in the computer science area to find the needed literature easily, or to assess the track record of a researcher conveniently.
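The relevance-weight ranking mentioned above is typically driven by link-analysis algorithms such as PageRank [28, 33]. The sketch below is a minimal, illustrative power-iteration over a tiny hypothetical link graph; the four page names and the damping factor of 0.85 are assumptions made for this example and are not values taken from the cited papers.

```python
# Minimal PageRank power iteration over a small hypothetical web graph.
# Every page here has at least one out-link, so dangling nodes are not handled.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

damping = 0.85
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # iterate until the ranks roughly stabilize
    new_rank = {}
    for page in pages:
        incoming = sum(rank[q] / len(links[q]) for q in pages if page in links[q])
        new_rank[page] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Pages that receive many links from highly ranked pages accumulate a higher score, which a general-purpose engine can then combine with query relevance when ordering its results.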
No matter which type a search engine is, it owns a background text database, which is indexed by a set of keywords extracted from the collected documents. To achieve a high recall and accuracy rate, Web search engines are required to provide an efficient and effective mechanism to collect and manage the Web data, together with the capability to match the user query against the background index quickly and to rank the returned Web contents efficiently, so that the Web user can locate the desired pages in a short time or by clicking only a few hyperlinks. To achieve these aims, a variety of algorithms and strategies are involved in handling the above-mentioned tasks [28-34], which has led to a hot and popular topic in the context of web-based research, namely Web data management.

3.6 Web Log Data Collection

Data gathered from web servers is placed into special files called logs and can be used for web usage mining. Usually this data is called web log data, as all visitors' activities are logged into this file [3]. In real life, web log files are a huge source of information; for example, the web log file generated by an online information site can reach a size of 30-40 MB in one month, while an advertising company may collect a file of 6 MB during a single day.

There are many commercial web log analysis tools (Angoss, Clementine, MINEit, NetGenesis) [35-38]. Most of them focus on statistical information such as the largest number of users per time period, the business type of the users visiting the web site (.edu, .com) or their geographical location (.in, .uk, .sg), and the popularity of pages calculated from the number of times they have been visited. However, statistics that do not describe the relationships between visited pages leave much valuable information undiscovered [38, 39]. This lack of analytic depth has stimulated the web log research area to expand into an individual research field that is beneficial and vital to e-business. The main goals that can be achieved by mining web log data are the following:

• Web log examination makes it possible to restructure the web site so that clients can access the desired pages with minimum delay. The problem of identifying usable structures on the WWW is related to understanding what facilities are available for dealing with this problem and how to utilize them [41].

• Web log inspection allows navigation to be improved. This can manifest itself in organizing important information in the right places, arranging links to other pages in the correct sequence, and preloading frequently used pages.

• More advertisement capital can be attracted by placing adverts on the most frequently accessed pages.

• Interesting patterns of customer behaviour can be identified. For example, valuable information can be gained by discovering the most popular paths to specific web pages and the paths users take upon leaving these pages. These findings can be used effectively for redesigning the web site to better channel users to specific web pages.
• Non-customers can be turned into customers, increasing profit [42]. Analysis should be carried out on both groups, customers and non-customers, in order to identify characteristic patterns. Such findings help to review customers' habits and help site maintainers to incorporate the observed patterns into the site architecture, thereby assisting in turning non-customers into customers.

From empirical findings [43] it is observed that people tend to revisit pages they have just visited and access only a few pages frequently; humans browse in small clusters of related pages and generate only short sequences of repeated URLs. This shows that there is no need to increase the number of information pages on the web site; it is more important to concentrate on the efficiency of the material placed and the accessibility of these clusters of pages. The general benefits obtained from analysing web logs are allocating resources more efficiently, finding new growth opportunities, improving marketing campaigns, new product planning, increasing customer retention, discovering cross-selling opportunities and better forecasting.

3.7 The Common Log Format

Various web servers generate differently formatted logs: CERFnet, Cisco PIX, Gauntlet, IIS Standard/Extended, NCSA Common/Combined, Netscape Flexible, Open Market Extended, Raptor Eagle. Nevertheless, the most common log format is the Common Log Format (CLF), which appears exactly as follows (see Figure 3.5):

host/ip rfcname logname [DD/MMM/YYYY:HH:MM:SS -0000] "METHOD /PATH HTTP/1.0" code bytes

Figure 3.5 Example of the Common Log Format: IP address, authentication (rfcname and logname), date, time, GMT zone, request method, page name, HTTP version, status of the page retrieving process and number of bytes transferred.

host/ip - the visitor's hostname or IP address.

rfcname - returns the user's authentication. It operates by looking up the specific TCP/IP connection and returning the user name of the process owning the connection; if no value is present, a "-" is recorded.

logname - when local authentication and registration are used, the user's log name appears here; otherwise, if no value is present, a "-" is recorded.

DD/MMM/YYYY:HH:MM:SS -0000 - defines the date, consisting of the day (DD), month (MMM) and year (YYYY); the time stamp is defined by hours (HH), minutes (MM) and seconds (SS). Since web sites can be accessed at any time of the day and the server logs the user's local time, the last field records the offset from Greenwich Mean Time (for example, Indian Standard Time is +05:30).

method - the methods found in log files are PUT, GET, POST and HEAD [44]. PUT allows a user to transfer a file to the web server; by default, PUT is used by web site maintainers with administrator privileges, for example to upload files through a form on the web, and access for purposes other than site maintenance is forbidden. GET transfers the whole content of the web document to the user. POST informs the web server that a new object is created and linked; the content of the new object is enclosed as the body of the request, and such information usually goes as input to Common Gateway Interface (CGI) programs. HEAD returns only the header of the page body and is usually used to check the availability of a page.

path - stands for the path and files retrieved from the web server.

HTTP/1.0 - defines the version of the protocol used by the user to retrieve information from the web server.

code - identifies the success status. For example, 200 means that the file was retrieved successfully, 404 that the file was not found, 304 that the file was reloaded from the cache, and 204 indicates that the upload completed normally.

bytes - the number of bytes transferred from the web server to the other machine.
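To illustrate how the fields described above are actually extracted and then turned into sessions (the pre-processing step discussed in Section 3.8), the sketch below parses a few hypothetical CLF records with a regular expression and groups the requests of each IP address into sessions using a 30-minute inactivity timeout. The sample records, the regular expression and the timeout value are assumptions made for this example; they are not prescribed by the log format itself.

```python
import re
from datetime import datetime, timedelta

# Hypothetical access-log records in Common Log Format.
raw_log = [
    '10.0.0.1 - - [12/Mar/2011:10:02:11 +0530] "GET /index.html HTTP/1.0" 200 4523',
    '10.0.0.1 - - [12/Mar/2011:10:05:40 +0530] "GET /products.html HTTP/1.0" 200 1874',
    '10.0.0.1 - - [12/Mar/2011:11:15:02 +0530] "GET /index.html HTTP/1.0" 200 4523',
    '10.0.0.2 - - [12/Mar/2011:10:03:55 +0530] "GET /about.html HTTP/1.0" 404 0',
]

clf_pattern = re.compile(
    r'(?P<host>\S+) (?P<rfcname>\S+) (?P<logname>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<code>\d{3}) (?P<bytes>\S+)'
)

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold

sessions = {}   # host -> list of sessions, each a list of (time, path)
last_seen = {}  # host -> timestamp of the previous request

for line in raw_log:
    fields = clf_pattern.match(line).groupdict()
    when = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    host = fields["host"]
    # Start a new session on the first request or after a long pause.
    if host not in last_seen or when - last_seen[host] > SESSION_TIMEOUT:
        sessions.setdefault(host, []).append([])
    sessions[host][-1].append((when, fields["path"]))
    last_seen[host] = when

for host, visits in sessions.items():
    print(host, "->", [[path for _, path in visit] for visit in visits])
```

Under these assumptions, the three requests from 10.0.0.1 fall into two sessions, because the gap before the third request exceeds the 30-minute threshold.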
Figure 3.6 CLF followed by additional data fields: the web page the visitor came from (referrer page), browser type and cookie information.

It is possible to adjust the web server's options to collect additional information such as REFERER_URL, HTTP_USER_AGENT and HTTP_COOKIE (Fleishman 1996). REFERER_URL records the URL from which the visitor came, HTTP_USER_AGENT identifies the browser version the visitor uses, and the HTTP_COOKIE variable is a persistent token that carries the visitor's identification number across browsing sessions. The CLF then takes the form depicted in Figure 3.6.

3.8 Web Log Data Pre-Processing Steps

Web log data pre-processing is a complex process; it can take up to 80% of the total KDD time [45] and consists of the stages presented in Figure 3.7. The aim of data pre-processing is to select essential features, clean the data of irrelevant records and finally transform the raw data into sessions. The latter step is unique, since session creation is appropriate only for web log datasets and involves additional work caused by the user identification problem and by various unsettled standpoints on how sessions should be identified. All these stages need to be analysed in more detail in order to understand why pre-processing plays such an important role in the KDD process when mining complex web log data.

Figure 3.7 Pre-processing web log data is one of the most complex parts of the KDD process.

3.9 Web Personalization

Web personalization is the process of customizing a Web site to the needs of specific users, taking advantage of the knowledge acquired from the analysis of the users' navigational behaviour (usage data) in correlation with other information collected in the Web context, namely structure, content and user profile data. Due to the explosive growth of the Web, the domain of Web personalization has gained great momentum both in research and in commercial areas. The steps of a Web personalization process include: (a) the collection of Web data, (b) the modelling and categorization of these data (pre-processing phase), (c) the analysis of the collected data, and (d) the determination of the actions that should be performed. The methods employed in order to analyse the collected data include content-based filtering, collaborative filtering, rule-based filtering and Web usage mining. The site is personalized through the highlighting of existing hyperlinks, the dynamic insertion of new hyperlinks that seem to be of interest to the current user, or even the creation of new index pages.

Content-based filtering systems are based solely on individual users' preferences: the system tracks each user's behaviour and recommends items that are similar to items the user liked in the past. Collaborative filtering systems invite users to rate objects or divulge their preferences and interests and then return information that is predicted to be of interest to them; this is based on the assumption that users with similar behaviour (for example, users who rate similar objects) have analogous interests. In rule-based filtering, users are asked to answer a set of questions. These questions are derived from a decision tree, so that as the user proceeds in answering them, the final result (for example a list of products) is tailored to the user's needs. Content-based, rule-based and collaborative filtering may also be used in combination, in order to reach more accurate conclusions.
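As a minimal illustration of the collaborative filtering idea just described, the sketch below compares users through the cosine similarity of their rating vectors and scores an unseen item for a target user from the ratings of similar users. The users, items and ratings are invented for the example; real recommender systems use considerably more elaborate models.

```python
from math import sqrt

# Hypothetical user ratings (user -> {item: rating}).
ratings = {
    "alice": {"book_a": 5, "book_b": 3, "book_c": 4},
    "bob":   {"book_a": 4, "book_b": 3, "book_c": 5, "book_d": 4},
    "carol": {"book_a": 1, "book_b": 5, "book_d": 2},
}

def cosine(u, v):
    """Cosine similarity of two users over the items they both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(target, item):
    """Similarity-weighted average rating of `item` from the other users."""
    num, den = 0.0, 0.0
    for other, other_ratings in ratings.items():
        if other == target or item not in other_ratings:
            continue
        sim = cosine(ratings[target], other_ratings)
        num += sim * other_ratings[item]
        den += sim
    return num / den if den else None

# Predict how "alice" might rate an item she has not seen yet.
print("predicted rating for alice on book_d:", round(predict("alice", "book_d"), 2))
```

Users whose past ratings resemble the target user's contribute more strongly to the prediction, which is exactly the "similar behaviour implies analogous interests" assumption stated above.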
In this thesis the main focus is on Web usage mining. This process relies on the application of statistical and data mining methods to Web log data, resulting in a set of useful patterns that indicate users' navigational behaviour. The data mining methods that are employed are association rule mining, sequential pattern discovery, clustering and classification. This knowledge is then used by the system in order to personalize the site according to each user's behaviour and profile.

The block diagram illustrated in Figure 3.8 represents the functional architecture of a Web personalization system in terms of the modules and data sources described earlier. The content management module processes the Web site's content and classifies it into conceptual categories; the site's content can be enhanced with additional information acquired from other Web sources using advanced search techniques. Given the site map structure and the usage logs, a Web usage miner provides results regarding usage patterns, user behaviour, session and user clusters, click-stream information and so on. Additional information about individual users can be obtained from the user profiles.

Figure 3.8 Modules of a Web personalization system

Moreover, any information extracted from the Web usage mining process concerning each user's navigational behaviour can then be added to his or her profile. All this information about nodes, links, Web content, typical behaviours and patterns is conceptually abstracted and classified into semantic categories. Any information extracted from the interrelation between knowledge acquired using usage mining techniques and knowledge acquired from content management then provides the framework for evaluating possible alternatives for restructuring the site. A publishing mechanism performs the site modification, ensuring that each user navigates through the optimal site structure. The available content options for each user are ranked according to the user's interests.

3.10 User Profiling

In order to personalize a Web site, the system should be able to distinguish between different users or groups of users. This process is called user profiling, and its objective is the creation of an information base that contains the preferences, characteristics and activities of the users. In the Web domain, and especially in e-commerce, user profiling has developed significantly, since Internet technologies provide easier means of collecting information about the users of a Web site, who in the case of e-business sites are potential customers. A user profile can be either static, when the information it contains is never or rarely altered (e.g. demographic information), or dynamic, when the profile data change frequently. Such information is obtained either explicitly, using online registration forms and questionnaires and resulting in static user profiles, or implicitly, by recording the navigational behaviour and/or the preferences of each user, resulting in dynamic user profiles. A way of uniquely identifying a visitor through a session is by using cookies.
A cookie is defined as "the data sent by a Web server to a Web client, stored locally by the client and sent back to the server on subsequent requests." In other words, a cookie is simply an HTTP header consisting of a text-only string that is inserted into the memory of a browser. It is used to uniquely identify a user during Web interactions within a site and contains data parameters that allow the remote HTML server to keep a record of the user's identity and the actions he or she takes at the remote Web site. The contents of a cookie file depend on the Web site being visited. In general, information about the visitor's identification is stored, along with password information. Additional information, such as credit card details if a card is used during a transaction, as well as details concerning the visitor's activities at the Web site, for example which pages were visited, which purchases were made, or which advertisements were selected, can also be included. Often, cookies point back to more detailed customer information stored at the Web server.

Another way of uniquely identifying users through a Web transaction is by using identd, an identification protocol specified in RFC 1413 that provides a means to determine the identity of the user of a particular TCP connection. Given a TCP port number pair, it returns a character string which identifies the owner of that connection (the client) on the Web server's system. Finally, a user can be identified by making the assumption that each IP address corresponds to one user; in some cases, IP addresses are resolved into domain names that are registered to a person or a company, so that more specific information is gathered.

As already mentioned, user profiling information can be explicitly obtained by using online registration forms requesting information about the visitor, such as name, age, sex, likes and dislikes. Such information is stored in a database, and each time the user logs on to the site it is retrieved and updated according to the visitor's browsing and purchasing behaviour.

All of the aforementioned techniques for profiling users have certain drawbacks. First, where a system depends on cookies for gathering user information, there is the possibility that the user has turned off cookie support in the browser. Another problem with cookie technology is that, because a cookie file is stored locally on the user's computer, the user might delete it, and when that user revisits the Web site he or she will be regarded as a new visitor. Furthermore, if no additional information is provided (e.g. some logon id), an identification problem occurs when more than one user browses the Web using the same computer. A similar problem occurs when using identd, inasmuch as the client should be configured in a mode that permits plain-text transfer of ids. A potential problem in identifying users by resolving IP addresses is that in most cases the address is that of the ISP, which does not suffice for specifying the user's location. On the other hand, when gathering user information through registration forms or questionnaires, many users submit false information about themselves and their interests, resulting in the creation of misleading profiles. In the latter case, there are two further options: either regarding each user as a member of a group and creating aggregate user profiles, or addressing any changes to each user individually.
When addressing the users as a group, the method used is the creation of aggregate user profiles based on rules and patterns extracted by applying Web usage mining techniques to Web server logs. Using this knowledge, the Web site can be appropriately customized.

3.11 Privacy Issues

The most important issue that must be addressed during the user profiling process is privacy violation. Many users are reluctant to give away personal information either implicitly or explicitly, and are hesitant to visit Web sites that use cookies (if they are aware of their existence) or avoid disclosing personal data in registration forms. In both cases, the user loses anonymity and is aware that all of his or her actions will be recorded and used, in many cases without consent. Additionally, even if a user has agreed to supply personal information to a site, through cookie technology such information can be exchanged between sites, resulting in its disclosure without the user's permission.

P3P (Platform for Privacy Preferences) is a W3C proposed recommendation [P3P] that suggests an infrastructure for the privacy of data interchange. This standard enables Web sites to express their privacy practices in a standardized format that can be automatically retrieved and interpreted by user agents. The process of reading privacy policies is therefore simplified for users, since key information about what data is collected by a Web site can be automatically conveyed to the user, and discrepancies between a site's practices and the user's preferences concerning the disclosure of personal data can be automatically flagged. P3P, however, does not provide a mechanism for ensuring that sites actually act according to their policies.

3.12 Tools and Applications

Some of the most popular Web sites that use methods such as decision tree guides, collaborative filtering and cookies in order to profile users and create customized Web pages are listed below, together with a brief description of the most important tools available for user profiling. An overview, along with product references, is provided in Table 3.2; the table lists each vendor and product and indicates which of the following techniques it relies on: collaborative filtering, cookies, user registration and page customization.

Table 3.2 User Profiling Tools

• BroadVision [BRO] - One-To-One
• Macromedia [MAC] - LikeMinds
• Microsoft Firefly [MSF] - Passport
• NetPerceptions [NPE] - GroupLens
• Neuromedia [NME] - NeuroStudio
• Open Sesame [OSE] - Learn Sesame

Popular Web sites such as Yahoo!, Excite or Microsoft Network [MSN] allow users to customize their home pages based on their selections of available content, using information supplied by the users and cookies thereafter; in that way, each time the user logs in to the site, he or she sees a page containing information addressed to his or her interests. Rule-based filtering is used by online retailers such as Dell and Apple Computer, giving users the ability to easily customize the product configuration before ordering. As far as recommendation systems are concerned, the most popular example is amazon.com: the system analyses past purchases and posts suggestions on the shopper's customized recommendations page, and users who have not made a purchase before can rate books and see listings of books they might like. The same approach, based on user ratings, is used in many similar online shops, such as CDNOW. Commercial Web sites, including many search engines such as AltaVista or Lycos, have associations with commercial marketing companies, such as DoubleClick Inc.
These sites use cookies to monitor their visitors' activities, and any information collected is stored as a profile in DoubleClick's database. DoubleClick then uses this profile information to decide which advertisements or services should be offered to each user when he or she visits one of the affiliated DoubleClick sites. Of course, this information is collected and stored without the users' knowledge and, more importantly, without their consent.

There are several systems available for creating user profiles, varying according to the user profiling method they use. These include: a) BroadVision's One-To-One, a high-end marketing tool designed to let sites recognize customers and display relevant products and services (customers include Kodak Picture Network and US West); b) NetPerceptions' GroupLens, a collaborative filtering solution requiring other users to actively or passively rate content (clients include Amazon.com and Musicmaker); c) Open Sesame's Learn Sesame, a cookie-based product (clients include Ericsson and Toronto Dominion Bank); d) the early leader in collaborative filtering, Firefly Passport, developed by the MIT Media Lab and now owned by Microsoft (clients include Yahoo, Ziff-Davis and Barnes & Noble); e) Macromedia's LikeMinds Preference Server, another collaborative filtering system that examines users' behaviour and finds other users with similar behaviours in order to create a prediction or product recommendation (clients include Cinemax-HBO's Movie Matchmaker and Columbia House's Total E entertainment site); f) Neuromedia's NeuroStudio, intelligent-agent software that allows Webmasters to give users the option of creating customized page layouts, using either cookies or user log-in (customers include Intel and the Y2K Links Database site); and g) Apple's WebObjects, a set of development tools that allow customized data design (clients include The Apple Store and Cybermeals) [46].

3.13 Summary

In this chapter, a detailed KDD schema was presented and each of its steps was explained. The relationship between data mining and its branch, web mining, was established, and the essential characteristics of web mining were investigated. A taxonomy was depicted, showing that web mining consists of three subareas: web structure mining, web usage mining and web content mining. The peculiarities of each web mining subarea and the tasks that can be achieved using the various kinds of web-related data were briefly described, and an analysis of the data collection sources was provided together with the different formats of web data. The material collected and presented in this chapter is a comprehensive guide to the web mining area.

References

1. J. Han, M. Kamber (2000): Data Mining: Concepts and Techniques. Morgan Kaufmann.
2. O. Etzioni (1996), "The World Wide Web: Quagmire or Gold Mine", Communications of the ACM, 39(11): 65-68.
3. R. Kosala, H. Blockeel (2000), "Web Mining Research: A Survey", SIGKDD Explorations, 2(1): 1-15.
4. R. Ghani, A. Fano (2002), "Building Recommender Systems Using a Knowledge Base of Product Semantics", in Proceedings of the Workshop on Recommendation and Personalization in E-Commerce, at the 2nd International Conference on Adaptive Hypermedia and Adaptive Web Based Systems (AH2002), pp. 11-19, Malaga, Spain.
5. S. Chakrabarti et al. (2002), "The Structure of Broad Topics on the Web", in Proceedings of the 11th International World Wide Web Conference, pp. 251-262, Honolulu, Hawaii, USA.
6. A. G. Büchner, M. D. Mulvenna (1998), "Discovering Internet Marketing Intelligence through Online Analytical Web Usage Mining", SIGMOD Record, 27(4): 54-61.
7. G. Chang et al. (eds.) (2001), Mining the World Wide Web: An Information Search Approach. The Information Retrieval Series, Vol. 10, KAP.
8. D. Pierrakos et al. (2003), "Web Community Directories: A New Approach to Web Personalization", in Proceedings of the 1st European Web Mining Forum (EWMF'03), pp. 113-129, Cavtat-Dubrovnik, Croatia.
9. B. Mobasher (2004), "Web Usage Mining and Personalization", in Practical Handbook of Internet Computing, M. P. Singh (ed.), CRC Press, pp. 15.1-37.
10. J. Srivastava, P. Desikan, V. Kumar (2002), "Web Mining: Accomplishments & Future Directions", National Science Foundation Workshop on Next Generation Data Mining (NGDM'02).
11. O. Etzioni (1996), "The World Wide Web: Quagmire or Gold Mine", Communications of the ACM, 39(11): 65-68.
12. R. Cooley (2000), Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. Ph.D. Thesis, University of Minnesota, May 2000.
13. S. K. Madria, S. S. Bhowmick, W. K. Ng, E. P. Lim (1999), "Research Issues in Web Data Mining", DaWaK: 303-312.
14. J. Borges, M. Levene (1998), "Mining Association Rules in Hypertext Databases", in Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), New York City.
15. B. Murray, A. Moore (2002), Sizing the Internet. White paper, Cyveillance.
16. J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins (1999), "The Web as a Graph: Measurements, Models, and Methods", in Proceedings of the International Conference on Combinatorics and Computing, pp. 1-18.
17. A. G. Büchner, M. D. Mulvenna (1998), "Discovering Internet Marketing Intelligence through Online Analytical Web Usage Mining", SIGMOD Record, 27(4): 54-61.
18. J. C. Bertot, C. R. McClure, W. E. Moen, J. Rubin (1997), "Web Usage Statistics: Measurement Issues and Analytical Techniques", Government Information Quarterly, 14(4): 373-395.
19. G. Salton, M. McGill (1983), Introduction to Modern Information Retrieval. McGraw-Hill.
20. S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman (1990), "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, 41(6): 391-407.
21. H. Ahonen, O. Heinonen, M. Klemettinen, A. Verkamo (1998), "Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections", in Advances in Digital Libraries (ADL'98), Santa Barbara, California, USA.
22. W. W. Cohen (1995), "Learning to Classify English Text with ILP Methods", in Advances in Inductive Logic Programming (ed. L. De Raedt), IOS Press.
23. P. Desikan, J. Srivastava, V. Kumar, P.-N. Tan (2002), "Hyperlink Analysis - Techniques & Applications", Army High Performance Computing Center Technical Report.
24. K. Wang, H. Liu (1998), "Discovering Typical Structures of Documents: A Road Map Approach", in Proceedings of the ACM SIGIR Symposium on Information Retrieval.
25. C. H. Moh, E. P. Lim, W. K. Ng (2000), "DTD-Miner: A Tool for Mining DTD from XML Documents", WECWIS: 144-151.
26. M. Gandhi, K. Jeyebalan, J. Kallukalam, A. Rapkin, P. Reilly, N. Widodo (2004), Web Research Infrastructure Project Final Report, Cornell University.
27. J. Srivastava, R. Cooley, M. Deshpande, P.-N. Tan (2000), "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data", SIGKDD Explorations, 1(2).
28. S. Brin, L. Page (1998), The PageRank Citation Ranking: Bringing Order to the Web (http://www-db.stanford.edu/~backrub/pageranksub.ps).
29. C. Ding et al. (2002), PageRank, HITS and a Unified Framework for Link Analysis, Lawrence Berkeley National Laboratory Technical Report, University of California, Berkeley, CA.
30. A. Borodin et al. (2001), "Finding Authorities and Hubs from Hyperlink Structures on the World Wide Web", in Proceedings of the 10th International World Wide Web Conference, pp. 415-429, Hong Kong, China.
31. T. Haveliwala (2002), "Topic-Sensitive PageRank", in Proceedings of the 11th International World Wide Web Conference, pp. 517-526, Honolulu, Hawaii, USA.
32. S. Kamvar, T. Haveliwala, C. Manning, G. Golub (2003), "Extrapolation Methods for Accelerating PageRank Computations", in Proceedings of WWW'03, pp. 261-270, Budapest, Hungary.
33. L. Page, S. Brin, R. Motwani, T. Winograd (1998), The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Computer Science Department, Stanford University.
34. M. Richardson, P. Domingos (2001), "The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank", in Advances in Neural Information Processing Systems (NIPS 2001), pp. 1441-1448, Vancouver, British Columbia, Canada. MIT Press, Cambridge, MA.
35. Angoss. [accessed 2003.11.14]. Available from Internet: <http://www.angoss.com/angoss.html/>.
36. Clementine. [accessed 2005.09.03]. Available from Internet: <http://www.spss.com/clementine/>.
37. MINEit. [accessed 2001.11.21]. Available from Internet: <http://www.mineit.com/products/easyminer/>.
38. NetGenesis. [accessed 2003.05.04]. Available from Internet: <http://www.netgen.com/>.
39. J. Pitkow, K. Bharat (1994), "WEBVIZ: A Tool for World-Wide Web Access Log Analysis", in Proceedings of the First International World Wide Web Conference, pp. 35-41.
40. R. Cooley, B. Mobasher, J. Srivastava (1997), "Grouping Web Page References into Transactions for Mining World Wide Web Browsing Patterns", in Proceedings of the IEEE Knowledge and Data Engineering Exchange Workshop (KDEX'97), pp. 2-7.
41. P. Pirolli, J. Pitkow, R. Rao (1996), "Silk from a Sow's Ear: Extracting Usable Structure from the Web", in Proceedings of the Conference on Human Factors in Computing Systems (CHI 96), pp. 118-125.
42. L. Faulstich, M. Spiliopoulou, K. Winkler (1999), "A Data Miner Analyzing the Navigational Behaviour of Web Users", in Proceedings of the Workshop on Machine Learning in User Modeling of the ACAI'99 International Conference, pp. 44-49.
43. L. Tauscher, S. Greenberg (1997), "How People Revisit Web Pages: Empirical Findings and Implications for the Design of History Systems", International Journal of Human Computer Studies, 47(1): 97-138.
44. T. Savola, M. Brown, J. Jung, B. Brandon, R. Meegan, K. Murphy, J. O'Donnell, S. R. Pietrowicz (1996), Using HTML, 1043 p.
45. S. Ansari, R. Kohavi, L. Mason, Z. Zheng (2001), "Integrating E-Commerce and Data Mining: Architecture and Challenges", in Proceedings of the IEEE International Conference on Data Mining, pp. 27-34.
46. Richard Dean, Personalizing Your Web Site, http://Webbuilder.netscape.com/Webbuilding/pages/Business/Personal/index.html