Web Mining
(Web Usage Mining)
Web Mining – The Idea
 In recent years the growth of the World Wide Web has
exceeded all expectations. Today there are several
billion HTML documents, pictures, and other
multimedia files available via the Internet, and the
number is still rising. Given the Web's impressive
variety, however, retrieving interesting content has
become a very difficult task.
Opportunities and Challenges
 The Web offers an unprecedented opportunity and challenge to
data mining:
 The amount of information on the Web is huge, and easily
accessible.
 The coverage of Web information is very wide and diverse. One can
find information about almost anything.
 Information/data of almost all types exist on the Web, e.g.,
structured tables, texts, multimedia data, etc.
 Much of the Web information is semi-structured due to the nested
structure of HTML code.
 Much of the Web information is linked. There are hyperlinks
among pages within a site, and across different sites.
 Much of the Web information is redundant. The same piece of
information or its variants may appear in many pages.
Opportunities and Challenges
 The Web is noisy. A Web page typically contains a mixture of many
kinds of information, e.g., main contents, advertisements,
navigation panels, copyright notices, etc.
 The Web is also about services. Many Web sites and pages enable
people to perform operations with input parameters, i.e., they
provide services.
 The Web is dynamic. Information on the Web changes constantly.
Keeping up with the changes and monitoring the changes are
important issues.
 Above all, the Web is a virtual society. It is not only about data,
information and services, but also about interactions among
people, organizations and automatic systems, i.e., communities.
Web Mining
 Web is the single largest data source in the world
 Due to heterogeneity and lack of structure of web data,
mining is a challenging task
 Multidisciplinary field:
data mining, machine learning, natural language
processing, statistics, databases, information
retrieval, multimedia, etc.
Web Mining
 The term was coined by Oren Etzioni (1996)
 Application of data mining techniques to automatically
discover and extract information from Web data
Data Mining vs. Web Mining
 Traditional data mining
 data is structured and relational
 well-defined tables, columns, rows, keys, and
constraints.
 Web data
 Semi-structured and unstructured
 readily available data
 rich in features and patterns
Web Data
 Web Structure
 Web Content
 Web Usage
Classification of Web Mining Techniques
•Content mining: extract models from web contents,
such as text, images, video, and semi-structured
(HTML or XML) or structured documents (digital libraries)
•Structure mining: aims at finding the underlying
topology and organization of web resources
•Usage mining: discover usage patterns from web
server log files, user queries, and registration data
Web-Structure Mining
 Generate structural summaries about the Web site and
Web pages
Categorizing the Web pages and the related information
at the inter-domain level, based on their hyperlinks.
Discovering the Web page structure.
Discovering the nature of the hierarchy of hyperlinks
in the website and its structure.
Web Mining
Web Structure Mining | Web Content Mining | Web Usage Mining
Web-Structure Mining (cont.)
 Finding information about web pages
Retrieving information about the relevance and the
quality of a web page.
Finding the authoritative pages on a topic and its content.
 Inference on hyperlinks
A web page contains not only information but also
hyperlinks, which carry a huge amount of annotation.
A hyperlink signals the author's endorsement of the other
web page.
Web-Usage Mining
 What is Usage Mining?
Discovering user 'navigation patterns' from web data.
Predicting user behavior while the user interacts
with the web.
Helps to improve large collections of resources.
Web-Usage Mining (cont.)
 Usage Mining Techniques
• Data Preparation: data collection, data selection, data cleaning
• Data Mining: association rules, sequential patterns,
classification, clustering
Web Content Mining
 Process of information or resource discovery
from the content of millions of sources across the
World Wide Web
 E.g. Web data contents: text, Image, audio, video,
metadata and hyperlinks
 Goes beyond key word extraction, or some simple
statistics of words and phrases in documents.
Web Content Mining
 Pre-processing data before web content mining:
feature selection (Piramuthu 2003)
 Post-processing data can reduce ambiguous
searching results (Sigletos & Paliouras 2003)
 Web Page Content Mining
 Mines the contents of documents directly
 Search Engine Mining
 Improves on the content search of other tools like search
engines.
Web Content Mining
 Web content mining is related to data mining and text
mining. [Bing Liu. 2005]
 It is related to data mining because many data mining
techniques can be applied in Web content mining.
 It is related to text mining because much of the web
contents are texts.
 Web data are mainly semi-structured and/or
unstructured, while data mining deals primarily with
structured data and text mining with unstructured text.
Web Usage Mining
Web Usage Mining is the application of data
mining techniques to discover usage patterns
from Web data, in order to understand and better
serve the needs of Web-based applications.
(Srivastava et al.)
Usage mining reflects the behavior of humans as
they interact with the Internet.
Web Usage Mining - the usage
 Restructure a website
 Extract user access patterns to target ads
 Number of access to individual files
 Predict user behavior based on previously learned
rules and users’ profile
 Present dynamic information to users based on
their interests and profiles
Introduction
 The WWW continues to grow at an astounding rate, resulting in
increased complexity of tasks such as web site design, web server
design, and simply navigating through a web site
 An important input to these design tasks is analysis of how a web site is
used. Usage information can be used to restructure a web site in order to
better serve the needs of users of a site
 Web usage mining is the application of data mining techniques to large web
data repositories in order to produce results that can be used in these
design tasks.
 Some of the data mining algorithms that are commonly used in web usage
mining are:
 Association rule generation
 Sequential Pattern generation
 Clustering
Introduction(con.)
 The input for the web usage mining process is a file, referred
to as a user session file, that gives an exact accounting of who
accessed the web site, what pages were requested and in what
order, and how long each page was viewed
 A web server log does not reliably represent a user session file.
Hence, several preprocessing tasks must be performed prior
to applying data mining algorithms to the data collected from
server logs.
High Level Web Usage Mining Process
Phases in the DM Process – CRISP-DM
Like data mining, web usage mining may be viewed in the context
of the Cross Industry Standard Process for Data mining. According
to CRISP-DM, a given data mining project has a life cycle consisting
of six phases.
The CRISP-DM Phases
1) Business understanding phase:
Clearly declare the project objectives and requirements in terms
of the business or research unit as a whole.
2) Data understanding phase:
Collect the data and discover initial insights
3) Data preparation phase:
Covers all aspects of preparing the final data set, used for all
subsequent phases, from the initial raw data.
4) Modeling phase:
Select and apply appropriate modeling techniques
5) Evaluation phase:
The models delivered by the preceding phase are evaluated for
quality and effectiveness before being deployed for use.
6) Deployment phase:
Put the models to use; report the results or deploy them across
the business or research unit.
Web Usage Data
 A framework for web usage mining is proposed by
Srivastava et al.
 The process consists of four phases:
 The input stage
 The preprocessing stage
 The pattern discovery stage
 The pattern analysis stage
Web Usage Data - Input stage
The files that are retrieved come from these sources:
• Server access logs
• Server referrer logs
• Agent logs
• Registration information (if any)
• Information concerning the site topology
Web Usage Data – preprocessing stage
The raw web logs do not arrive in a format appropriate
for data mining.
The most common tasks:
 Data cleaning and filtering
 De-spidering
 User identification
 Session identification
 Path completion
Web Usage Data – pattern discovery stage
In this stage the web data are ready for the application of
statistical and data mining methods for discovering
patterns:
 Standard statistical analysis
 Clustering algorithms
 Association rules
 Classification algorithms
 Sequential patterns
Web Usage Data – pattern analysis stage
 Not all the patterns uncovered in the pattern discovery
stage would be considered interesting or useful.
 In the pattern analysis stage, human analysts examine
the output from the pattern discovery stage and gather
the most interesting, useful and actionable patterns.
Web Usage Data – clickstream analysis
 Web usage mining is sometimes referred to as clickstream
analysis.
A clickstream is the aggregate sequence of page visits executed by
a particular user navigating through a web site.
 Clickstream data also consists of logs, cookies, metatags and
other data used to transfer webpages from server to browser.
 Other requests of the browser like image files must be aggregated
into page views at the preprocessing stage.
 Then a series of page views can be woven together into a session.
Web Usage Data – Web server log files
 For each request from a user's browser to a web server, an entry is
recorded automatically in a file called the web log file, log file, or web log.
 A sample from the EPA (Environmental Protection Agency) web log
data available from the Internet Traffic Archive:
Web Usage Data – Web server log files – the fields
 Remote Host field
This field consists of the Internet IP address of the remote host
making the request, such as "141.243.1.172".
 Date/Time field
The date/time field, with this format: [DD:HH:MM:SS]
 HTTP Request field
This field consists of the information that the client's browser has
requested from the web server.
Contains: the request method, the URI (Uniform Resource Identifier),
the header, and the protocol.
The URI can be used to analyze the frequency of visitor requests for pages
and files.
The header information can be used to determine, for example, which
keywords are being used by visitors in search engines that point to your site.
 Status Code field
Not all browser requests succeed. This field provides a three-digit
response from the web server to the client's browser, indicating the status
of the request.
 Transfer Volume (Bytes) field
Indicates the size of the file (web page, graphics, etc.), in bytes, sent by
the web server to the client's browser.
This field is useful for helping to monitor the network traffic, the load
carried by the network throughout the 24-hour cycle.
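The fields described above can be pulled apart programmatically. A minimal Python sketch, assuming a Common Log Format entry; the sample line is illustrative:

```python
import re

# Regex for a Common Log Format entry (a sketch; real logs vary by server
# configuration): remotehost rfc931 authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<datetime>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_log_line(line):
    """Split one log entry into its component fields, or return None."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The HTTP request field itself holds the method, URI, and protocol.
    parts = fields["request"].split()
    if len(parts) == 3:
        fields["method"], fields["uri"], fields["protocol"] = parts
    return fields

entry = '141.243.1.172 - - [29:23:53:25] "GET /Software.html HTTP/1.0" 200 1497'
parsed = parse_log_line(entry)
print(parsed["host"], parsed["uri"], parsed["status"])
```

Entries that do not match the pattern (corrupted lines, other formats) come back as `None` and can simply be dropped during cleaning.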
Web Usage Data – Web server log files
Web Logs come in various formats, which vary depending on the
configuration of the web server.
 Common Log Format (CLF): Supported by a variety of web servers.
It has seven fields: remote host, identification, authuser, date/time,
HTTP request, status code, and transfer volume.
 Extended Common Log Format (ECLF): A variation of the common
log format that adds the referrer and user agent fields.
 Microsoft IIS Log format: This format records more fields than the
other formats, so that more information can be uncovered.
Web Usage Data – Auxiliary Information
 Besides web logs, further auxiliary information may be available in
the form of user registration information, user demographic
information, and so on.
These data usually reside on separate servers from the web log data
and will need to be merged with the web logs before preprocessing can
be done.
 Finally, to perform the preprocessing task, the analyst will need to
know the topology or structure of the website, the network of
hierarchies and relationships among the web pages, and so on.
Preprocessing For Web Usage Mining
Preprocessing is needed in order to:
 Clean up the data
 Rid the web log file of nonhuman access behavior (spiders, crawlers,
and other automatic web bots)
 Identify each distinct user
 Identify the user session
 Perform path completion
Preprocessing For Web Usage Mining
 Clean up the data
The data cleaning and filtering portion of the preprocessing phase
consists of the following three steps:
1) Variable extraction
2) Time stamp derivation
3) Page extension exploration and filtering
Preprocessing - Data cleaning and filtering
Data cleaning/filtering step 1: variable extraction
1) From the date/time field, extract the date variable
2) From the date/time field, extract the time variable
3) From the HTTP request field, extract the request method
4) From the HTTP request field, extract the page(URI)
5) From the HTTP request field, extract the protocol version
Data cleaning/filtering step 2: creating a Time Stamp
1) Find the number of days between the web log entry date and the
software’s base line date
2) Multiply this number of days by 86,400, the number of seconds
in a day
3) Find the time in seconds since midnight that is represented by the
time in the web log entry
4) Add (2) and (3)
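These four steps can be sketched in Python; the baseline date and the sample entry below are illustrative:

```python
from datetime import date

def seconds_since_midnight(hh, mm, ss):
    """Step 3: convert an HH:MM:SS time into seconds since midnight."""
    return hh * 3600 + mm * 60 + ss

def make_time_stamp(entry_date, secs, baseline=date(1995, 1, 1)):
    """Steps 1, 2 and 4: days since the baseline date, times 86,400
    seconds per day, plus the seconds since midnight of the entry."""
    days = (entry_date - baseline).days        # step 1
    return days * 86_400 + secs                # steps 2 and 4

# An entry logged on 30 August 1995 at 23:53:25:
stamp = make_time_stamp(date(1995, 8, 30), seconds_since_midnight(23, 53, 25))
print(stamp)  # 20908405
```

The resulting single integer makes time differences between entries (needed later for session identification) a plain subtraction.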
Preprocessing - Data cleaning and filtering
The figure shows the results from the variable extraction and time stamp
creation.
The baseline date for this example is January 1, 1995.
Preprocessing - Data cleaning and filtering
Page extension exploration and filtering
 Problem: The HTTP protocol requires a separate connection for every
file that is requested from the web server. Therefore, a user’s request
to view a particular page often results in several log entries since
graphics and scripts are downloaded in addition to the HTML file. In
most cases, only the log entry of the HTML file request is relevant and
should be kept for the user session file
 Solution: Elimination of items believed irrelevant can be reasonably
accomplished by checking the suffix of the URL name. All log entries
with filename suffixes such as GIF, JPEG, JPG, and map can be removed.
However, the list can be modified depending on the site being
analyzed
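A sketch of this suffix-based filtering in Python; the suffix list and the sample requests are illustrative and, as noted above, should be adapted to the site being analyzed:

```python
# Suffixes considered irrelevant for the user session file (an
# illustrative list; modify it for the site under analysis).
IRRELEVANT_SUFFIXES = {".gif", ".jpeg", ".jpg", ".map"}

def keep_entry(uri):
    """Return True if the requested URI should stay in the session file."""
    uri = uri.lower().split("?")[0]  # ignore query strings, normalize case
    return not any(uri.endswith(suffix) for suffix in IRRELEVANT_SUFFIXES)

requests = ["/index.html", "/logos/small.gif", "/docs/grant.txt", "/icons/ok2.jpg"]
print([uri for uri in requests if keep_entry(uri)])
```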
Preprocessing – De-Spidering The Web Log File
 Web search engines need the most current information available from
the WWW to provide this information to their customers.
 They dispatch spiders, crawlers and automatic web bots to crawl
around the Web performing exhaustive searches of Web sites.
 This behavior is not considered interesting from a web usage mining
standpoint.
 The most direct method of deleting these from the web logs is to
identify the spider’s name in the user agent field, when supplied. For
contact purposes, the bots often also include a URL or an e-mail
address.
 Example of crawlers: Google bot, MSN bot, Yahoo! Slurp, etc.
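A simple way to flag such entries from the user agent field, sketched in Python; the signature list is an illustrative subset, not a complete catalog of crawlers:

```python
# Well-known crawler signatures (an illustrative subset); matching is
# case-insensitive on the user agent field of the log entry.
BOT_SIGNATURES = ("googlebot", "msnbot", "slurp", "crawler", "spider")

def is_spider(user_agent):
    """Flag a log entry as crawler traffic based on its user agent string."""
    agent = user_agent.lower()
    return any(signature in agent for signature in BOT_SIGNATURES)

print(is_spider("Mozilla/5.0 (compatible; Googlebot/2.1)"))       # True
print(is_spider("Mozilla/5.0 (Windows NT 10.0) Firefox/102.0"))   # False
```

Entries for which `is_spider` returns True are removed before user identification.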
Preprocessing – User Identification
The Internet is essentially stateless.
User identification is one way of introducing state into this stateless
system.
Another means of identifying users is the use of cookies.
- Cookies can be used to connect current web page accesses to previous
accesses. In addition to tracking user access, the most common uses
for cookies are:
 To avoid requiring returning registered users to sign in again each
time they access the site
 To personalize the user’s experience: for example with individualized
recommendations
 To maintain the user’s shopping cart for e-commerce sites.
Preprocessing – User Identification(con.)
 The remote host field, or IP address field, may be used to identify users.
However, the widespread use of proxy servers, corporate firewalls and
local caches makes the use of the IP address as a substitute for user
identification problematic.
 For example several users may be accessing the same site, using a proxy server,
which will provide the web server with the same IP address for each user.
Preprocessing – User Identification(con.)
 Since users generally do not provide their own identification, we should seek
alternative methods to identify them.
 Using some heuristics we can recognize users from one another:
 If the agent field differs for two web log entries, the requests are from two different
users.
 There are at least two users represented here.
Preprocessing – User Identification(con.)
Based on this you can assume the following paths through the web site taken by
each user:
 User1: A->B->E->K->I->O->E->L
 User2: A->C->G->M->H->N
Preprocessing – User Identification(con.)
However, if we apply the information available from the referrer field, and the
web site topology, we can uncover the highly likely result that “user 1” here is
actually two different users.
 User1: A->B->E->K->I->O->E->L
 User2: A->C->G->M->H->N
 User3: I->O
Preprocessing – User Identification(con.)
In general the following procedure could be used to identify users:
1) Sort the web log file by IP address and then by time stamp.
2) For each distinct IP address, identify each distinct agent.
3) For each user identified in step 2, apply path information collected from the
referrer field and the site topology to determine whether this behavior is
more likely the result of two or more users.
4) To identify each user, combine the user identification information from steps
1 to 3 with available cookie and registration information.
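Steps 1 and 2 of this procedure can be sketched in Python. The miniature log below is made up; steps 3 and 4 would additionally need referrer, topology, and cookie data:

```python
from itertools import groupby

# Hypothetical miniature web log: (ip, time_stamp, agent, uri).
log = [
    ("1.2.3.4", 10, "Mozilla/4.0 (Win98)", "/A"),
    ("1.2.3.4", 40, "Mozilla/4.0 (Win98)", "/B"),
    ("1.2.3.4", 20, "Mozilla/5.0 (Linux)", "/A"),
    ("5.6.7.8", 15, "Mozilla/4.0 (WinNT)", "/A"),
]

# Step 1: sort by IP address, then by time stamp.
log.sort(key=lambda entry: (entry[0], entry[1]))

# Step 2: treat each distinct (IP, agent) pair as a candidate user;
# setdefault/extend merges groups that are not adjacent after the sort.
users = {}
for key, entries in groupby(log, key=lambda entry: (entry[0], entry[2])):
    users.setdefault(key, []).extend(entry[3] for entry in entries)

for user, path in users.items():
    print(user, path)
```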
Preprocessing – Session Identification
Sessionizing or session identification is a process by which the aggregate
page requests made by a particular user over a long period of time are
partitioned into individual sessions.
The most straightforward approach is to assign a timeout after a certain length
of time has passed since the user’s last request.
A timeout of 25.5 minutes was established based on experimental data, while
many web usage analysts and commercial applications set the timeout threshold
at 30 minutes.
Preprocessing – Session Identification(con.)
Considering the example above, we get the following four sessions:
session1 (user 1): A->B->E->K
session2 (user 2): A->C->G->M->H->N
session3 (user 3): I->O
session4 (user 1): E->L
Preprocessing – Session Identification(con.)
Session identification procedure:
1) For each distinct user identified in the preceding section, assign a unique
session ID.
2) Define the timeout threshold t.
3) For each user, perform the following:
   1) Find the time difference between every two consecutive web log entries.
   2) If this difference exceeds the threshold t, assign a new session ID to
      the later entry.
4) Sort the entries by session ID.
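The timeout-based procedure can be sketched in Python; the 30-minute threshold and the sample entries are illustrative:

```python
# Timeout threshold t, expressed in seconds (30 minutes is the common
# choice noted above).
TIMEOUT = 30 * 60

def sessionize(entries, timeout=TIMEOUT):
    """Partition one user's (time_stamp, page) requests into sessions."""
    sessions = []
    previous_time = None
    for time_stamp, page in sorted(entries):
        # Open a new session on the first entry or after a long gap.
        if previous_time is None or time_stamp - previous_time > timeout:
            sessions.append([])
        sessions[-1].append(page)
        previous_time = time_stamp
    return sessions

entries = [(0, "A"), (100, "B"), (400, "E"), (5000, "E"), (5100, "L")]
print(sessionize(entries))  # [['A', 'B', 'E'], ['E', 'L']]
```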
Preprocessing – Path Completion
 Not all page views seen by the user are recorded in the web server log. For
example many people use the Back button in their browsers. In such a case
the cached version of the page is used. This leads to some holes in the web
server’s record of the user’s path through the web site. This problem is caused
because of Local Caching.
 Knowledge of site topology must be applied to complete these paths, in a
process known as Path Completion.
 Once the missing pages have been identified, they are inserted into the
session file.
Preprocessing – Path Completion(con.)
 Considering again the session 2 identified in the preceding example:
 session2 (user 2): A->C->G->M->H->N
The path completion process leads us to the following sessions:
session1 (user 1): A->B->E->K
session2 (user 2): A->C->G->M->G->C->H->N
session3 (user 3): I->O
session4 (user 1): E->L
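Path completion can be sketched in Python under the simplifying assumption of a tree-shaped site whose topology is known; the `links` and `parent` tables below are hypothetical, chosen to reproduce the session 2 example:

```python
# Hypothetical site topology: links[p] is the set of pages directly
# reachable from p; parent[p] is the page from which p is reached.
links = {"A": {"B", "C"}, "B": set(), "C": {"G", "H"},
         "G": {"M"}, "M": set(), "H": {"N"}, "N": set()}
parent = {"B": "A", "C": "A", "G": "C", "H": "C", "M": "G", "N": "H"}

def complete_path(path):
    """Insert pages the user must have revisited via the Back button
    (served from the local cache, hence missing from the server log)."""
    completed = [path[0]]
    for page in path[1:]:
        # Walk back up the hierarchy until the next page becomes reachable.
        while page not in links[completed[-1]]:
            completed.append(parent[completed[-1]])
        completed.append(page)
    return completed

print(complete_path(["A", "C", "G", "M", "H", "N"]))
```

The sketch assumes every requested page is eventually reachable by backtracking; a production implementation would need to handle non-tree topologies and dead ends.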
Preprocessing – Further Steps
Besides the specialized preprocessing steps for handling web log data described
so far, the web usage miner must still apply the usual data mining preprocessing
steps. Some of them include:
• Data quality monitoring
• Handling missing data
• Identifying misclassifications
• Identifying outliers using both graphical and numerical methods
• Normalization and standardization
Modeling For Web Usage Mining
Association Rules
 Relates pages that are most often referenced together in a single
server session
 Sets of pages that are accessed together with a support value
exceeding some specified threshold
 These pages may not directly be connected by hyperlinks
 Useful for Web designers to restructure their Web sites
 These rules serve as a heuristic for prefetching documents in order
to reduce user-perceived latency when loading a page from a remote
site
 Several algorithms have been developed. Some widely used ones are
the Apriori algorithm and FP-Growth (Frequent Pattern Growth).
Association Rules(con.)
 X == > Y
 (support, confidence)
 60% of clients who accessed /products/, also accessed
/products/software/webminer.htm.
 30% of clients who accessed /special-offer.html, placed an online
order in /products/software/.
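Support and confidence for such a rule X ==> Y can be computed directly over a session file; the sessions below are made up for illustration:

```python
# Hypothetical sessions, each a set of pages viewed together.
sessions = [
    {"/products/", "/products/software/webminer.htm"},
    {"/products/", "/products/software/webminer.htm"},
    {"/products/", "/special-offer.html"},
    {"/special-offer.html"},
    {"/products/"},
]

def support(itemset, sessions):
    """Fraction of sessions containing every page in the itemset."""
    return sum(itemset <= s for s in sessions) / len(sessions)

def confidence(x, y, sessions):
    """Of the sessions containing X, the fraction also containing Y."""
    return support(x | y, sessions) / support(x, sessions)

x = {"/products/"}
y = {"/products/software/webminer.htm"}
print(support(x | y, sessions), confidence(x, y, sessions))  # 0.4 0.5
```

Algorithms such as Apriori avoid computing support for every possible itemset by only extending itemsets whose support already exceeds the threshold.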
Mining Sequential Patterns
 Support for a pattern now depends on the ordering of the items, which
was not true for Association Rules.
 For example: a transaction consisting of URLs ABCD in that order
contains BC as a subsequence, but does not contain CB
 Useful for predicting future patterns in order to place advertisements for
a certain user group.
 Example: 60% of clients who placed an online order for WEBMINER,
placed another online order for software within 15 days
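The ordering constraint can be checked with a small Python helper; this mirrors the ABCD example above:

```python
def contains_subsequence(transaction, pattern):
    """True if the pattern's items occur in the transaction in the same
    order (not necessarily contiguously)."""
    it = iter(transaction)
    # 'item in it' advances the iterator, so order is enforced.
    return all(item in it for item in pattern)

print(contains_subsequence("ABCD", "BC"))  # True
print(contains_subsequence("ABCD", "CB"))  # False
```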
Clustering
 Group together a set of items having similar characteristics
 Clustering has been widely used in Web Usage Mining to
group together similar sessions.
 In the data domain, there are two main types of clustering for
discovery and analysis:
 Usage clusters
• Establish groups of users exhibiting similar browsing patterns
• Useful for inferring user demographics in order to perform market
segmentation and personalization.
 Page clusters
• Discover groups of pages that have related content
• Useful for search engines and Web assistance providers
E.g.: clients who often access /products/software/webminer.html tend
to be from educational institutions.
Classification
 Mapping a data item into one of several predefined classes
 Develop a profile of users belonging to a particular class or category
 Requires feature extraction and selection that best describe the
properties of a given class or category
 Techniques
• Decision tree classifiers, naïve Bayesian classifiers, k-nearest neighbor
classifiers, support vector machines, etc.
 E.g.

30% of users who place online orders in /Product/Music are in the 19-25
age group and live on the West coast
Clustering – a one-dimensional example
[Figure: a number line of test scores from 50 to 95 grouped into letter
grades, showing intra-cluster distances within each group and
inter-cluster distances (the gaps) between groups.]
Let's try to group this set of test scores into letter grades: maximize
the inter-cluster distance and minimize the intra-cluster distance.
Clustering: just specify the number of groups; the groups themselves are
defined by the data.
Classification: map data into predefined groups.
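For points on a line, cutting at the largest gaps maximizes the inter-cluster distance, which matches the test-score example. A Python sketch with illustrative scores:

```python
def cluster_1d(scores, k):
    """Split sorted scores into k groups by cutting at the k-1 largest
    gaps, the optimal inter-cluster split for one-dimensional data."""
    scores = sorted(scores)
    gaps = [(scores[i + 1] - scores[i], i) for i in range(len(scores) - 1)]
    # Indices of the k-1 widest gaps, restored to left-to-right order.
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[: k - 1])
    clusters, start = [], 0
    for i in cuts:
        clusters.append(scores[start : i + 1])
        start = i + 1
    clusters.append(scores[start:])
    return clusters

scores = [52, 55, 58, 70, 72, 75, 88, 90, 95]
print(cluster_1d(scores, 3))  # [[52, 55, 58], [70, 72, 75], [88, 90, 95]]
```

Note that only the number of groups is specified; the groups themselves emerge from the data, exactly as the slide contrasts clustering with classification.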
Privacy Issues
 Web Usage Mining tools integrate different data sources (Web
logs, cookie data, as well as personal data) to accurately track
users' behavior. This raises the issue of user privacy, a topic
that is currently highly relevant for the whole data mining
area.
 One of the main proposals to deal with privacy issues in the
Web area is the Platform for Privacy Preferences (P3P).
 To address the related privacy issues, researchers have also
framed the problem as developing effective user models without
accessing the precise information available in individual data
records, so as not to violate users' privacy.
References:
o Data Mining the Web: Uncovering Patterns in Web Content,
Structure, and Usage - Zdravko Markov, Daniel T. Larose
o http://www.sciencedirect.com/science/article/pii/S0169023X04001387
o http://www.sciencedirect.com/science/article/pii/S1877050911000202
o http://en.wikipedia.org/wiki/Web_mining
o http://www.slideshare.net/Tommy96/powerpoint-presentation4036474/download
o http://www.slideshare.net/Tommy96/webminingppt/download
o http://www.slideshare.net/Tommy96/web-mining-tutorial/download