Web Mining
(Web Usage Mining)
Web Mining – The Idea
 In recent years the growth of the World Wide Web has
exceeded all expectations. Today there are several
billion HTML documents, pictures, and other
multimedia files available via the Internet, and the
number is still rising. Given the Web's impressive
variety, however, retrieving interesting content has
become a very difficult task.
Opportunities and Challenges
 The Web offers an unprecedented opportunity and challenge to
data mining:
 The amount of information on the Web is huge, and easily
accessible.
 The coverage of Web information is very wide and diverse. One can
find information about almost anything.
 Information/data of almost all types exist on the Web, e.g.,
structured tables, texts, multimedia data, etc.
 Much of the Web information is semi-structured due to the nested
structure of HTML code.
 Much of the Web information is linked. There are hyperlinks
among pages within a site, and across different sites.
 Much of the Web information is redundant. The same piece of
information or its variants may appear in many pages.
Opportunities and Challenges
 The Web is noisy. A Web page typically contains a mixture of many
kinds of information, e.g., main contents, advertisements,
navigation panels, copyright notices, etc.
 The Web is also about services. Many Web sites and pages enable
people to perform operations with input parameters, i.e., they
provide services.
 The Web is dynamic. Information on the Web changes constantly.
Keeping up with the changes and monitoring the changes are
important issues.
 Above all, the Web is a virtual society. It is not only about data,
information and services, but also about interactions among
people, organizations and automatic systems, i.e., communities.
Web Mining
 Web is the single largest data source in the world
 Due to heterogeneity and lack of structure of web data,
mining is a challenging task
 Multidisciplinary field:
data mining, machine learning, natural language
processing, statistics, databases, information
retrieval, multimedia, etc.
Web Mining
 The term was coined by Oren Etzioni (1996)
 Application of data mining techniques to automatically
discover and extract information from Web data
Data Mining vs. Web Mining
 Traditional data mining
 data is structured and relational
 well-defined tables, columns, rows, keys, and
constraints.
 Web data
 Semi-structured and unstructured
 readily available data
 rich in features and patterns
Web Data
 Web Structure
 Web Content
 Web Usage
Classification of Web Mining Techniques
•Content mining: extract models from web contents,
such as text, images, video, and semi-structured
(HTML or XML) or structured documents (digital libraries)
•Structure mining: aims at finding the underlying
topology and organization of web resources
•Usage mining: discover usage patterns from web
server log files, user queries, and registration data
Web-Structure Mining
 Generate structural summaries about the Web site and
Web pages
Categorizing the Web pages and the related information
at the inter-domain level, based on their hyperlinks.
Discovering the Web page structure.
Discovering the nature of the hierarchy of hyperlinks
in the website and its structure.
Web Mining
Web Structure Mining | Web Content Mining | Web Usage Mining
Web-Structure Mining (cont.)
 Finding information about web pages
Retrieving information about the relevance and the
quality of a web page.
Finding the authoritative pages on a topic and its content.
 Inference on hyperlinks
A web page contains not only information but also
hyperlinks, which carry a huge amount of annotation.
A hyperlink signals the author's endorsement of the other
web page.
Web-Usage Mining
 What is Usage Mining?
Discovering user 'navigation patterns' from web data.
Predicting user behavior while the user interacts
with the web.
Helps to improve large collections of resources.
Web-Usage Mining (cont.)
 Usage Mining Techniques
• Data Preparation: data collection, data selection, data cleaning
• Data Mining: association rules, sequential patterns,
classification, clustering
Web Content Mining
 Process of information or resource discovery
from the content of millions of sources across the
World Wide Web
 E.g. Web data contents: text, Image, audio, video,
metadata and hyperlinks
 Goes beyond key word extraction, or some simple
statistics of words and phrases in documents.
Web Content Mining
 Pre-processing data before web content mining:
feature selection (Piramuthu 2003)
 Post-processing data can reduce ambiguous
searching results (Sigletos & Paliouras 2003)
 Web Page Content Mining
 Mines the contents of documents directly
 Search Engine Mining
 Improves on the content search of other tools like search
engines.
Web Content Mining
 Web content mining is related to data mining and text
mining. [Bing Liu. 2005]
 It is related to data mining because many data mining
techniques can be applied in Web content mining.
 It is related to text mining because much of the web
contents are texts.
 Web data are mainly semi-structured and/or
unstructured, while data mining deals primarily with
structured data and text mining with unstructured text.
Web Usage Mining
Web Usage Mining is the application of data
mining techniques to discover usage patterns
from Web data, in order to understand and better
serve the needs of Web-based applications.
(Srivastava et al.)
Usage mining reflects the behavior of humans as
they interact with the Internet.
Web Usage Mining - the usage
 Restructure a website
 Extract user access patterns to target ads
 Number of access to individual files
 Predict user behavior based on previously learned
rules and users’ profile
 Present dynamic information to users based on
their interests and profiles
Introduction
 The WWW continues to grow at an astounding rate, resulting in
increased complexity of tasks such as web site design, web server
design, and simply navigating through a web site
 An important input to these design tasks is analysis of how a web site is
used. Usage information can be used to restructure a web site in order to
better serve the needs of users of a site
 Web usage mining is the application of data mining techniques to large web
data repositories in order to produce results that can be used in these
design tasks.
 Some of the data mining algorithms that are commonly used in web usage
mining are:
 Association rule generation
 Sequential Pattern generation
 Clustering
Introduction(con.)
 The input for the web usage mining process is a file, referred
to as a user session file, that gives an exact accounting of who
accessed the web site, what pages were requested and in what
order, and how long each page was viewed
 A web server log does not reliably represent a user session file.
Hence, several preprocessing tasks must be performed prior
to applying data mining algorithms to the data collected from
server logs.
High Level Web Usage Mining Process
Phases in the DM Process – CRISP-DM
Like data mining, web usage mining may be viewed in the context
of the Cross Industry Standard Process for Data mining. According
to CRISP-DM, a given data mining project has a life cycle consisting
of six phases.
The CRISP-DM Phases
1) Business understanding phase:
Clearly declare the project objectives and requirements in terms
of the business or research unit as a whole.
2) Data understanding phase:
Collect the data and discover initial insights
3) Data preparation phase:
Covers all aspects of preparing the final data set, used for all
subsequent phases, from the initial raw data.
4) Modeling phase:
Select and apply appropriate modeling techniques
5) Evaluation phase:
The models delivered by the preceding phase are evaluated for
quality and effectiveness before being deployed for use.
6) Deployment phase:
Put the models to use; report the results or deploy them across
the business or research unit.
Web Usage Data
 A framework for web usage mining is proposed by
Srivastava et al.
 The process consists of four phases:
 The input stage
 The preprocessing stage
 The pattern discovery stage
 The pattern analysis stage
Web Usage Data - Input stage
The files that are retrieved come from these sources:
• Server access logs
• Server referrer logs
• Agent logs
• Registration information (if any)
• Information concerning the site topology
Web Usage Data – preprocessing stage
The raw web logs do not arrive in a format appropriate
for data mining.
The most common tasks:
 Data cleaning and filtering
 De-spidering
 User identification
 Session identification
 Path completion
Web Usage Data – pattern discovery stage
In this stage the web data are ready for the application of
statistical and data mining methods for discovering
patterns:
 Standard statistical analysis
 Clustering algorithms
 Association rules
 Classification algorithms
 Sequential patterns
Web Usage Data – pattern analysis stage
 Not all the patterns uncovered in the pattern discovery
stage would be considered interesting or useful.
 In the pattern analysis stage, human analysts examine
the output from the pattern discovery stage and gather
the most interesting, useful and actionable patterns.
Web Usage Data – clickstream analysis
 Web usage mining is sometimes referred to as clickstream
analysis.
A clickstream is the aggregate sequence of page visits executed by
a particular user navigating through a web site.
 Clickstream data also consists of logs, cookies, metatags and
other data used to transfer webpages from server to browser.
 Other requests of the browser like image files must be aggregated
into page views at the preprocessing stage.
 Then a series of page views can be woven together into a session.
Web Usage Data – Web server log files
 For each request from a user's browser to a web server, an entry is
recorded automatically in a file called the web log file, log file, or web log.
 A sample from the EPA (Environmental Protection Agency) web log
data available from the Internet Traffic Archive:
Web Usage Data – Web server log files – the fields
 Remote Host field
This field consists of the Internet IP address of the remote host
making the request, such as "141.243.1.172".
 Date/Time field
The date/time field, with this format: [DD:HH:MM:SS]
 HTTP Request field
This field consists of the information that the client's browser has
requested from the web server.
Contains: the request method, the URI (Uniform Resource Identifier),
the header, and the protocol.
The URI can be used to analyze the frequency of visitor requests for pages
and files.
The header information can be used to determine, for example, which
keywords are being used by visitors in search engines that point to your site.
 Status Code field
Not all browser requests succeed. This field provides a three-digit
response from the web server to the client's browser, indicating the status
of the request.
 Transfer Volume (Bytes) field
Indicates the size of the file (web page, graphics, etc.), in bytes, sent by
the web server to the client's browser.
This field is useful for helping to monitor the network traffic, the load
carried by the network throughout the 24-hour cycle.
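The fields described above can be pulled apart programmatically. A minimal Python sketch, assuming a Common Log Format entry; the sample line is illustrative:

```python
import re

# Regex for a Common Log Format entry (a sketch; real logs vary by server
# configuration): remotehost rfc931 authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<datetime>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_log_line(line):
    """Split one log entry into its component fields, or return None."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The HTTP request field itself holds the method, URI, and protocol.
    parts = fields["request"].split()
    if len(parts) == 3:
        fields["method"], fields["uri"], fields["protocol"] = parts
    return fields

entry = '141.243.1.172 - - [29:23:53:25] "GET /Software.html HTTP/1.0" 200 1497'
parsed = parse_log_line(entry)
print(parsed["host"], parsed["uri"], parsed["status"])
```

Entries that do not match the pattern (corrupted lines, other formats) come back as `None` and can simply be dropped during cleaning.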
Web Usage Data – Web server log files
Web Logs come in various formats, which vary depending on the
configuration of the web server.
 Common Log Format (CLF): Supported by a variety of web servers.
It has seven fields: remote host, identification, authuser, date/time,
HTTP request, status code, and transfer volume.
 Extended Common Log Format (ECLF): A variation of the common
log format that adds the referrer and user agent fields.
 Microsoft IIS Log format: This format records more fields than the
other formats, so that more information can be uncovered.
Web Usage Data – Auxiliary Information
 Besides web logs, further auxiliary information may be available in
the form of user registration information, user demographic
information, and so on.
These data usually reside on separate servers from the web log data
and will need to be merged with the web logs before preprocessing can
be done.
 Finally, to perform the preprocessing task, the analyst will need to
know the topology or structure of the website, the network of
hierarchies and relationships among the web pages, and so on.
Preprocessing For Web Usage Mining
Preprocessing is needed in order to:
 Clean up the data
 Rid the web log file of nonhuman access behavior (spiders, crawlers,
and other automatic web bots)
 Identify each distinct user
 Identify the user session
 Perform path completion
Preprocessing For Web Usage Mining
 Clean up the data
The data cleaning and filtering portion of the preprocessing phase
consists of the following three steps:
1) Variable extraction
2) Time stamp derivation
3) Page extension exploration and filtering
Preprocessing - Data cleaning and filtering
Data cleaning/filtering step 1: variable extraction
1) From the date/time field, extract the date variable
2) From the date/time field, extract the time variable
3) From the HTTP request field, extract the request method
4) From the HTTP request field, extract the page(URI)
5) From the HTTP request field, extract the protocol version
Data cleaning/filtering step 2: creating a Time Stamp
1) Find the number of days between the web log entry date and the
software’s base line date
2) Multiply this number of days by 86,400, the number of seconds
in a day
3) Find the time in seconds since midnight that is represented by the
time in the web log entry
4) Add (2) and (3)
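These four steps can be sketched in Python; the baseline date and the sample entry below are illustrative:

```python
from datetime import date

def seconds_since_midnight(hh, mm, ss):
    """Step 3: convert an HH:MM:SS time into seconds since midnight."""
    return hh * 3600 + mm * 60 + ss

def make_time_stamp(entry_date, secs, baseline=date(1995, 1, 1)):
    """Steps 1, 2 and 4: days since the baseline date, times 86,400
    seconds per day, plus the seconds since midnight of the entry."""
    days = (entry_date - baseline).days        # step 1
    return days * 86_400 + secs                # steps 2 and 4

# An entry logged on 30 August 1995 at 23:53:25:
stamp = make_time_stamp(date(1995, 8, 30), seconds_since_midnight(23, 53, 25))
print(stamp)  # 20908405
```

The resulting single integer makes time differences between entries (needed later for session identification) a plain subtraction.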
Preprocessing - Data cleaning and filtering
The figure shows the results from the variable extraction and time stamp
creation.
The baseline date for this example is January 1, 1995.
Preprocessing - Data cleaning and filtering
Page extension exploration and filtering
 Problem: The HTTP protocol requires a separate connection for every
file that is requested from the web server. Therefore, a user’s request
to view a particular page often results in several log entries since
graphics and scripts are downloaded in addition to the HTML file. In
most cases, only the log entry of the HTML file request is relevant and
should be kept for the user session file
 Solution: Elimination of items believed irrelevant can be reasonably
accomplished by checking the suffix of the URL name. All log entries
with filename suffixes such as GIF, JPEG, JPG, and map can be removed.
However, the list can be modified depending on the site being
analyzed
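A sketch of this suffix-based filtering in Python; the suffix list and the sample requests are illustrative and, as noted above, should be adapted to the site being analyzed:

```python
# Suffixes considered irrelevant for the user session file (an
# illustrative list; modify it for the site under analysis).
IRRELEVANT_SUFFIXES = {".gif", ".jpeg", ".jpg", ".map"}

def keep_entry(uri):
    """Return True if the requested URI should stay in the session file."""
    uri = uri.lower().split("?")[0]  # ignore query strings, normalize case
    return not any(uri.endswith(suffix) for suffix in IRRELEVANT_SUFFIXES)

requests = ["/index.html", "/logos/small.gif", "/docs/grant.txt", "/icons/ok2.jpg"]
print([uri for uri in requests if keep_entry(uri)])
```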
Preprocessing – De-Spidering The Web Log File
 Web search engines need the most current information available from
the WWW to provide this information to their customers.
 They dispatch spiders, crawlers and automatic web bots to crawl
around the Web performing exhaustive searches of Web sites.
 This behavior is not considered interesting from a web usage mining
standpoint.
 The most direct method of deleting these from the web logs is to
identify the spider’s name in the user agent field, when supplied. For
contact purposes, the bots often also include a URL or an e-mail
address.
 Example of crawlers: Google bot, MSN bot, Yahoo! Slurp, etc.
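A simple way to flag such entries from the user agent field, sketched in Python; the signature list is an illustrative subset, not a complete catalog of crawlers:

```python
# Well-known crawler signatures (an illustrative subset); matching is
# case-insensitive on the user agent field of the log entry.
BOT_SIGNATURES = ("googlebot", "msnbot", "slurp", "crawler", "spider")

def is_spider(user_agent):
    """Flag a log entry as crawler traffic based on its user agent string."""
    agent = user_agent.lower()
    return any(signature in agent for signature in BOT_SIGNATURES)

print(is_spider("Mozilla/5.0 (compatible; Googlebot/2.1)"))       # True
print(is_spider("Mozilla/5.0 (Windows NT 10.0) Firefox/102.0"))   # False
```

Entries for which `is_spider` returns True are removed before user identification.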
Preprocessing – User Identification
The Internet is essentially stateless.
User identification is one way of introducing state into this stateless
system.
Another means of identifying users is the use of cookies.
- Cookies can be used to connect current web page accesses to previous
accesses. In addition to tracking user access, the most common uses
for cookies are:
 To avoid requiring returning registered users to sign in again each
time they access the site
 To personalize the user’s experience: for example with individualized
recommendations
 To maintain the user’s shopping cart for e-commerce sites.
Preprocessing – User Identification(con.)
 The remote host field, or IP address field, may be used to identify users.
However, the widespread use of proxy servers, corporate firewalls and
local caches makes the use of the IP address as a substitute for user
identification problematic.
 For example several users may be accessing the same site, using a proxy server,
which will provide the web server with the same IP address for each user.
Preprocessing – User Identification(con.)
 Since users generally do not provide their own identification, we should seek
alternative methods to identify them.
 Using some heuristics we can recognize users from one another:
 If the agent field differs for two web log entries, the requests are from two different
users.
 There are at least two users represented here.
Preprocessing – User Identification(con.)
Based on this you can assume the following paths through the web site taken by
each user:
 User1: A->B->E->K->I->O->E->L
 User2: A->C->G->M->H->N
Preprocessing – User Identification(con.)
However, if we apply the information available from the referrer field, and the
web site topology, we can uncover the highly likely result that “user 1” here is
actually two different users.
 User1: A->B->E->K->I->O->E->L
 User2: A->C->G->M->H->N
 User3: I->O
Preprocessing – User Identification(con.)
In general the following procedure could be used to identify users:
1) Sort the web log file by IP address and then by time stamp.
2) For each distinct IP address, identify each distinct agent.
3) For each user identified in step 2, apply path information collected from the
referrer field and the site topology to determine whether this behavior is
more likely the result of two or more users.
4) To identify each user, combine the user identification information from steps
1 to 3 with available cookie and registration information.
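Steps 1 and 2 of this procedure can be sketched in Python. The miniature log below is made up; steps 3 and 4 would additionally need referrer, topology, and cookie data:

```python
from itertools import groupby

# Hypothetical miniature web log: (ip, time_stamp, agent, uri).
log = [
    ("1.2.3.4", 10, "Mozilla/4.0 (Win98)", "/A"),
    ("1.2.3.4", 40, "Mozilla/4.0 (Win98)", "/B"),
    ("1.2.3.4", 20, "Mozilla/5.0 (Linux)", "/A"),
    ("5.6.7.8", 15, "Mozilla/4.0 (WinNT)", "/A"),
]

# Step 1: sort by IP address, then by time stamp.
log.sort(key=lambda entry: (entry[0], entry[1]))

# Step 2: treat each distinct (IP, agent) pair as a candidate user;
# setdefault/extend merges groups that are not adjacent after the sort.
users = {}
for key, entries in groupby(log, key=lambda entry: (entry[0], entry[2])):
    users.setdefault(key, []).extend(entry[3] for entry in entries)

for user, path in users.items():
    print(user, path)
```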
Preprocessing – Session Identification
Sessionizing or session identification is a process by which the aggregate
page requests made by a particular user over a long period of time are
partitioned into individual sessions.
The most straightforward approach is to assign a timeout after a certain length
of time has passed since the user’s last request.
A timeout of 25.5 minutes was established based on experimental data, while
many web usage analysts and commercial applications set the timeout threshold
at 30 minutes.
Preprocessing – Session Identification(con.)
Considering the example above, we get the following four sessions:
session1 (user 1): A->B->E->K
session2 (user 2): A->C->G->M->H->N
session3 (user 3): I->O
session4 (user 1): E->L
Preprocessing – Session Identification(con.)
Session identification procedure:
1) For each distinct user identified in the preceding section, assign a unique
session ID.
2) Define the timeout threshold t.
3) For each user, perform the following:
   1) Find the time difference between every two consecutive web log entries.
   2) If this difference exceeds the threshold t, assign a new session ID to
      the later entry.
4) Sort the entries by session ID.
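The timeout-based procedure can be sketched in Python; the 30-minute threshold and the sample entries are illustrative:

```python
# Timeout threshold t, expressed in seconds (30 minutes is the common
# choice noted above).
TIMEOUT = 30 * 60

def sessionize(entries, timeout=TIMEOUT):
    """Partition one user's (time_stamp, page) requests into sessions."""
    sessions = []
    previous_time = None
    for time_stamp, page in sorted(entries):
        # Open a new session on the first entry or after a long gap.
        if previous_time is None or time_stamp - previous_time > timeout:
            sessions.append([])
        sessions[-1].append(page)
        previous_time = time_stamp
    return sessions

entries = [(0, "A"), (100, "B"), (400, "E"), (5000, "E"), (5100, "L")]
print(sessionize(entries))  # [['A', 'B', 'E'], ['E', 'L']]
```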
Preprocessing – Path Completion
 Not all page views seen by the user are recorded in the web server log. For
example many people use the Back button in their browsers. In such a case
the cached version of the page is used. This leads to some holes in the web
server’s record of the user’s path through the web site. This problem is caused
because of Local Caching.
 Knowledge of site topology must be applied to complete these paths, in a
process known as Path Completion.
 Once the missing pages have been identified, they are inserted into the
session file.
Preprocessing – Path Completion(con.)
 Considering again the session 2 identified in the preceding example:
 session2 (user 2): A->C->G->M->H->N
The path completion process leads us to the following sessions:
session1 (user 1): A->B->E->K
session2 (user 2): A->C->G->M->G->C->H->N
session3 (user 3): I->O
session4 (user 1): E->L
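Path completion can be sketched in Python under the simplifying assumption of a tree-shaped site whose topology is known; the `links` and `parent` tables below are hypothetical, chosen to reproduce the session 2 example:

```python
# Hypothetical site topology: links[p] is the set of pages directly
# reachable from p; parent[p] is the page from which p is reached.
links = {"A": {"B", "C"}, "B": set(), "C": {"G", "H"},
         "G": {"M"}, "M": set(), "H": {"N"}, "N": set()}
parent = {"B": "A", "C": "A", "G": "C", "H": "C", "M": "G", "N": "H"}

def complete_path(path):
    """Insert pages the user must have revisited via the Back button
    (served from the local cache, hence missing from the server log)."""
    completed = [path[0]]
    for page in path[1:]:
        # Walk back up the hierarchy until the next page becomes reachable.
        while page not in links[completed[-1]]:
            completed.append(parent[completed[-1]])
        completed.append(page)
    return completed

print(complete_path(["A", "C", "G", "M", "H", "N"]))
```

The sketch assumes every requested page is eventually reachable by backtracking; a production implementation would need to handle non-tree topologies and dead ends.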
Preprocessing – Further Steps
Besides the specialized preprocessing steps for handling web log data described
so far, the web usage miner must still apply the usual data mining preprocessing
steps. Some of them include:
• Data quality monitoring
• Handling missing data
• Identifying misclassifications
• Identifying outliers using both graphical and numerical methods
• Normalization and standardization
Modeling For Web Usage Mining
Association Rules
 Relates pages that are most often referenced together in a single
server session
 Sets of pages that are accessed together with a support value
exceeding some specified threshold
 These pages may not directly be connected by hyperlinks
 Useful for Web designers to restructure their Web sites
 These rules serve as a heuristic for prefetching documents in order
to reduce user-perceived latency when loading a page from a remote
site
 Several algorithms have been developed. Some widely used ones are
the Apriori algorithm and FP-Growth (Frequent Pattern Growth).
Association Rules(con.)
 X == > Y
 (support, confidence)
 60% of clients who accessed /products/, also accessed
/products/software/webminer.htm.
 30% of clients who accessed /special-offer.html, placed an online
order in /products/software/.
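Support and confidence for such a rule X ==> Y can be computed directly over a session file; the sessions below are made up for illustration:

```python
# Hypothetical sessions, each a set of pages viewed together.
sessions = [
    {"/products/", "/products/software/webminer.htm"},
    {"/products/", "/products/software/webminer.htm"},
    {"/products/", "/special-offer.html"},
    {"/special-offer.html"},
    {"/products/"},
]

def support(itemset, sessions):
    """Fraction of sessions containing every page in the itemset."""
    return sum(itemset <= s for s in sessions) / len(sessions)

def confidence(x, y, sessions):
    """Of the sessions containing X, the fraction also containing Y."""
    return support(x | y, sessions) / support(x, sessions)

x = {"/products/"}
y = {"/products/software/webminer.htm"}
print(support(x | y, sessions), confidence(x, y, sessions))  # 0.4 0.5
```

Algorithms such as Apriori avoid computing support for every possible itemset by only extending itemsets whose support already exceeds the threshold.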
Mining Sequential Patterns
 Support for a pattern now depends on the ordering of the items, which
was not true for Association Rules.
 For example: a transaction consisting of URLs ABCD in that order
contains BC as a subsequence, but does not contain CB
 Useful for predicting future patterns in order to place advertisements for
a certain user group.
 Example: 60% of clients who placed an online order for WEBMINER,
placed another online order for software within 15 days
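The ordering constraint can be checked with a small Python helper; this mirrors the ABCD example above:

```python
def contains_subsequence(transaction, pattern):
    """True if the pattern's items occur in the transaction in the same
    order (not necessarily contiguously)."""
    it = iter(transaction)
    # 'item in it' advances the iterator, so order is enforced.
    return all(item in it for item in pattern)

print(contains_subsequence("ABCD", "BC"))  # True
print(contains_subsequence("ABCD", "CB"))  # False
```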
Clustering
 Group together a set of items having similar characteristics
 Clustering has been widely used in Web Usage Mining to
group together similar sessions.
 In the data domain, there are two main types of clustering for
discovery and analysis:
 Usage clusters
• Establish groups of users exhibiting similar browsing patterns
• Useful for inferring user demographics in order to perform market
segmentation and personalization.
 Page clusters
• Discover groups of pages that have related content
• Useful for search engines and Web assistance providers
E.g.: clients who often access /products/software/webminer.html tend
to be from educational institutions.
Classification
 Mapping a data item into one of several predefined classes
 Develop a profile of users belonging to a particular class or category
 Requires feature extraction and selection that best describe the
properties of a given class or category
 Techniques
• Decision tree classifiers, naïve Bayesian classifiers, k-nearest neighbor
classifiers, support vector machines, etc.
 E.g.

30% of users who place online orders in /Product/Music are in the 19-25
age group and live on the West coast
Clustering – a one-dimensional example
[Figure: a number line of test scores from 50 to 95 grouped into letter
grades, showing intra-cluster distances within each group and
inter-cluster distances (the gaps) between groups.]
Let's try to group this set of test scores into letter grades: maximize
the inter-cluster distance and minimize the intra-cluster distance.
Clustering: just specify the number of groups; the groups themselves are
defined by the data.
Classification: map data into predefined groups.
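For points on a line, cutting at the largest gaps maximizes the inter-cluster distance, which matches the test-score example. A Python sketch with illustrative scores:

```python
def cluster_1d(scores, k):
    """Split sorted scores into k groups by cutting at the k-1 largest
    gaps, the optimal inter-cluster split for one-dimensional data."""
    scores = sorted(scores)
    gaps = [(scores[i + 1] - scores[i], i) for i in range(len(scores) - 1)]
    # Indices of the k-1 widest gaps, restored to left-to-right order.
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[: k - 1])
    clusters, start = [], 0
    for i in cuts:
        clusters.append(scores[start : i + 1])
        start = i + 1
    clusters.append(scores[start:])
    return clusters

scores = [52, 55, 58, 70, 72, 75, 88, 90, 95]
print(cluster_1d(scores, 3))  # [[52, 55, 58], [70, 72, 75], [88, 90, 95]]
```

Note that only the number of groups is specified; the groups themselves emerge from the data, exactly as the slide contrasts clustering with classification.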
Privacy Issues
 Web Usage Mining tools integrate different data sources (Web
logs, cookie data, as well as personal data) to accurately track
users' behavior. This raises the issue of user privacy, a topic
that is currently highly relevant for the whole data mining
area.
 One of the main proposals to deal with privacy issues in the
Web area is the Platform for Privacy Preferences (P3P).
 To address the related privacy issues, researchers have also
framed the problem as developing effective user models without
accessing the precise information available in individual data
records, so as not to violate users' privacy.
References:
o Data Mining the Web: Uncovering Patterns in Web Content,
Structure, and Usage - Zdravko Markov, Daniel T. Larose
o http://www.sciencedirect.com/science/article/pii/S0169023X04001387
o http://www.sciencedirect.com/science/article/pii/S1877050911000202
o http://en.wikipedia.org/wiki/Web_mining
o http://www.slideshare.net/Tommy96/powerpoint-presentation4036474/download
o http://www.slideshare.net/Tommy96/webminingppt/download
o http://www.slideshare.net/Tommy96/web-mining-tutorial/download