WEB USAGE MINING
NEGATIVE ASSOCIATION
S. Vignesh
1HK07CS073
HKBKCE
Web Mining

 Web Mining is the use of data mining techniques to automatically discover and extract information from web documents and services
 Discovering useful information from the World Wide Web and its usage patterns
 My definition: using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web
Web Mining

 Data mining techniques
   Association rules
   Sequential patterns
   Classification
   Clustering
   Outlier discovery
 Applications to the Web
   E-commerce
   Information retrieval (search)
   Network management
Examples of Discovered Patterns

 Association rules
   98% of AOL users also have E-trade accounts
 Classification
   People with age less than 40 and salary > 40K trade on-line
 Clustering
   Users A and B access similar URLs
 Outlier detection
   User A spends more than twice the average amount of time surfing on the Web
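Association rules such as the AOL/E-trade example above are judged by their support (how often the items co-occur) and confidence (how often the consequent follows the antecedent). A minimal sketch, using made-up user records rather than real data:

```python
# Minimal sketch: support and confidence of one association rule over
# hypothetical "services used" records (the data below is illustrative).

def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n
    confidence = both / ante if ante else 0.0
    return support, confidence

users = [
    {"AOL", "E-trade"},
    {"AOL", "E-trade"},
    {"AOL"},
    {"Yahoo"},
]
s, c = rule_stats(users, {"AOL"}, {"E-trade"})
# Here the rule AOL -> E-trade has support 2/4 and confidence 2/3.
```

Real miners such as Apriori enumerate all rules above chosen support/confidence thresholds; this sketch only scores one rule.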
Web Mining

 The WWW is a huge, widely distributed, global information service centre for
   Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
   Hyper-link information
   Access and usage information
 The WWW provides rich sources of data for data mining
Why Mine the Web?

 Enormous wealth of information on the Web
   Financial information (e.g. stock quotes)
   Book/CD/video stores (e.g. Amazon)
   Restaurant information (e.g. Zagats)
   Car prices (e.g. Carpoint)
 Lots of data on user access patterns
   Web logs contain the sequences of URLs accessed by users
 Possible to mine interesting nuggets of information
   People who ski also travel frequently to Europe
   Tech stocks have corrections in the summer and rally from November until February
Why is Web Mining Different?

 The Web is a huge collection of documents, plus
   Hyper-link information
   Access and usage information
 The Web is very dynamic
   New pages are constantly being generated
 Challenge: develop new Web mining algorithms, and adapt traditional data mining algorithms, to
   Exploit hyper-links and access patterns
   Be incremental
Web Mining Applications

 E-commerce (infrastructure)
   Generate user profiles
   Targeted advertising
   Fraud detection
   Similar image retrieval
 Information retrieval (search) on the Web
   Automated generation of topic hierarchies
   Web knowledge bases
   Extraction of schema for XML documents
 Network management
   Performance management
   Fault management
User Profiling

 Important for improving customization
   Provide users with pages and advertisements of interest
   Example profiles: on-line trader, on-line shopper
 Generate user profiles based on their access patterns
   Cluster users based on frequently accessed URLs
   Use a classifier to generate a profile for each cluster
 Engage Technologies
   Tracks web traffic to create anonymous user profiles of Web surfers
   Has profiles for more than 35 million anonymous users
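The clustering step above can be sketched in a few lines: group users whose sets of accessed URLs overlap strongly. This greedy Jaccard-similarity pass is a stand-in for a real clustering algorithm, and the URLs and threshold are illustrative:

```python
# Sketch: cluster users by overlap of their accessed-URL sets.
# Greedy single pass, Jaccard similarity; data and threshold are made up.

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cluster_users(access, threshold=0.5):
    """Assign each user to the first cluster whose representative URL set
    is similar enough, else start a new cluster."""
    clusters = []  # each cluster: (representative URL set, [user ids])
    for user, urls in access.items():
        for rep, members in clusters:
            if jaccard(urls, rep) >= threshold:
                members.append(user)
                break
        else:
            clusters.append((set(urls), [user]))
    return [members for _, members in clusters]

access = {
    "A": {"/quotes", "/trade", "/portfolio"},
    "B": {"/quotes", "/trade"},          # similar to A -> same cluster
    "C": {"/cart", "/checkout"},         # disjoint -> own cluster
}
groups = cluster_users(access)
```

A classifier would then label each resulting group (e.g. "on-line trader" for the first cluster) to produce the profile.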
Internet Advertising

 Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and e-commerce sites
 Plenty of startups doing internet advertising
   Doubleclick, AdForce, Flycast, AdKnowledge
 Internet advertising is probably the “hottest” web mining application today
Internet Advertising

 Scheme 1:
   Manually associate a set of ads with each user profile
   For each user, display an ad from the set based on the profile
 Scheme 2:
   Automate the association between ads and users
   Use ad click information to cluster users (each user is associated with the set of ads that he/she clicked on)
   For each cluster, find the ads that occur most frequently in the cluster; these become the ads for the set of users in the cluster
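The last step of Scheme 2 can be sketched directly: given a cluster of users and their click sets, pick the ads clicked most often within the cluster. The click data below is fabricated for illustration:

```python
# Sketch of Scheme 2's ad-selection step: the ads shown to a cluster are
# those clicked most frequently by its members. Data is illustrative.
from collections import Counter

def ads_for_cluster(clicks, cluster, top=2):
    """Return the `top` most frequently clicked ads among `cluster` users."""
    counts = Counter(ad for user in cluster for ad in clicks[user])
    return [ad for ad, _ in counts.most_common(top)]

clicks = {
    "U1": {"A1", "A2"},
    "U2": {"A1", "A3"},
    "U3": {"A1", "A2"},
}
top_ads = ads_for_cluster(clicks, ["U1", "U2", "U3"])
# A1 was clicked by all three users, A2 by two -> they become the cluster's ads.
```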
Internet Advertising

 Use collaborative filtering (e.g. Likeminds, Firefly)
 Each user Ui has a rating for a subset of ads (based on click information, time spent, items bought, etc.)
 Rij: the rating of user Ui for ad Aj
 Problem: compute user Ui’s rating for an unrated ad Aj

[Slide figure: ratings matrix of users by ads A1, A2, A3, with “?” marking the rating to predict]
Internet Advertising

 Key idea: user Ui’s rating for ad Aj is set to Rkj, where Uk is the user whose rating of ads is most similar to Ui’s
 User Ui’s rating for an ad Aj that has not previously been displayed to Ui is computed as follows:
   Consider each user Uk who has rated ad Aj
   Compute Dik, the distance between Ui’s and Uk’s ratings on common ads
   Ui’s rating for ad Aj = Rkj, where Uk is the user with the smallest Dik
   Display to Ui the ad Aj with the highest computed rating
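The nearest-neighbour scheme above fits in a few lines: among users who rated the target ad, copy the rating of the one whose ratings on common ads are closest. The ratings below are made up:

```python
# Sketch of the nearest-neighbour collaborative filtering step above.
# Ratings are fabricated; distance is Euclidean over commonly rated ads.

def distance(r1, r2):
    """Dik: Euclidean distance between two users' ratings on common ads."""
    common = set(r1) & set(r2)
    if not common:
        return float("inf")
    return sum((r1[a] - r2[a]) ** 2 for a in common) ** 0.5

def predict(ratings, ui, aj):
    """Ui's predicted rating for ad aj = Rkj of the nearest user Uk."""
    candidates = [u for u in ratings if u != ui and aj in ratings[u]]
    uk = min(candidates, key=lambda u: distance(ratings[ui], ratings[u]))
    return ratings[uk][aj]

ratings = {
    "U1": {"A1": 5, "A2": 1},             # has not seen A3
    "U2": {"A1": 5, "A2": 2, "A3": 4},    # close to U1 on A1, A2
    "U3": {"A1": 1, "A2": 5, "A3": 1},    # far from U1
}
pred = predict(ratings, "U1", "A3")
# U2 is nearest to U1, so U1 inherits U2's rating of 4 for A3.
```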
Fraud

 With the growing popularity of e-commerce, systems to detect and prevent fraud on the Web become important
 Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought)
 If the buying pattern changes significantly, then signal fraud
 HNC Software uses domain knowledge and neural networks for credit card fraud detection
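The signature idea can be sketched with a simple statistical check: summarize a user's spending history by its mean and spread, and flag purchases that deviate sharply. The threshold and history are illustrative, not how any real system is tuned:

```python
# Sketch: flag a purchase that deviates strongly from the user's spending
# signature (mean and standard deviation). History and k are illustrative.
import statistics

def is_suspicious(history, amount, k=3.0):
    """True if `amount` is more than k standard deviations from the mean."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return abs(amount - mean) > k * sd

history = [42.0, 55.0, 48.0, 51.0, 44.0]   # a user's recent purchase amounts
flagged = is_suspicious(history, 500.0)    # far outside the signature
ok = is_suspicious(history, 60.0)          # within normal variation
```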
Retrieval of Similar Images

 Given: a set of images
 Find:
   All images similar to a given image
   All pairs of similar images
 Sample applications:
   Medical diagnosis
   Weather prediction
   Web search engine for images
   E-commerce
Retrieval of Similar Images

 QBIC, Virage, Photobook
 Compute a feature signature for each image
   QBIC uses color histograms
   WBIIS and WALRUS use wavelets
 Use a spatial index to retrieve the database image whose signature is closest to the query’s signature
 WALRUS decomposes an image into regions
   A single signature is stored for each region
   Two images are considered similar if they have enough similar region pairs

[Slide figure: a query image and the images retrieved for it by WALRUS]
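The histogram-signature idea can be shown in miniature: reduce each image to a colour histogram and return the database image whose histogram is closest to the query's. The 4-bin histograms below are invented; real systems use many more bins and an index instead of a linear scan:

```python
# Sketch of colour-histogram image signatures (QBIC-style, much simplified).
# Each image is a 4-bin colour histogram; similarity is L1 distance.
# All histograms here are fabricated for illustration.

def l1(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

def most_similar(db, query):
    """Return the database image whose signature is closest to the query."""
    return min(db, key=lambda name: l1(db[name], query))

db = {
    "sunset.jpg": [0.7, 0.2, 0.1, 0.0],   # mostly warm colours
    "forest.jpg": [0.1, 0.1, 0.7, 0.1],   # mostly greens
}
query = [0.6, 0.3, 0.1, 0.0]              # warm-toned query image
best = most_similar(db, query)
```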
Problems with Web Search Today

 Today’s search engines are plagued by problems:
   The abundance problem (99% of info is of no interest to 99% of people)
   Limited coverage of the Web (internet sources hidden behind search interfaces); the largest crawlers cover < 18% of all web pages
   Limited query interface, based on keyword-oriented search
   Limited customization to individual users

Problems with Web Search Today

 Today’s search engines are plagued by problems:
   The Web is highly dynamic
     Lots of pages are added, removed, and updated every day
   Very high dimensionality
Improve Search by Adding Structure to the Web

 Use Web directories (or topic hierarchies)
   Provide a hierarchical classification of documents (e.g., Yahoo!)

[Slide figure: Yahoo home page topic hierarchy: Recreation (Travel, Sports), Business (Companies, Finance, Jobs), Science, News]

 Searches performed in the context of a topic restrict the search to only the subset of web pages related to that topic
Automatic Creation of Web Directories

 In the Clever project, hyper-links between Web pages are taken into account when categorizing them
   Uses a Bayesian classifier
   Exploits knowledge of the classes of the immediate neighbors of the document to be classified
   Shows that simply taking text from the neighbors and using standard document classifiers to classify the page does not work
 Inktomi’s Directory Engine uses “Concept Induction” to automatically categorize millions of documents
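The idea of exploiting neighbour classes can be illustrated with a toy blend of two signals: a per-class score from a text classifier and the class distribution of linked pages. This is not the Clever algorithm, just a sketch of the intuition; the scores, classes, and weight are all invented:

```python
# Toy sketch of hyper-link-aware categorisation: blend text-classifier
# scores with the known classes of a page's linked neighbours.
# Not the Clever algorithm; all numbers here are illustrative.

def classify(text_scores, neighbor_classes, weight=0.5):
    """Blend per-class text scores with the neighbour class distribution
    and return the highest-scoring class."""
    n = len(neighbor_classes) or 1
    combined = {}
    for cls, score in text_scores.items():
        votes = neighbor_classes.count(cls) / n
        combined[cls] = (1 - weight) * score + weight * votes
    return max(combined, key=combined.get)

text_scores = {"Sports": 0.45, "Finance": 0.55}   # text alone says Finance
neighbors = ["Sports", "Sports", "Finance"]        # classes of linked pages
label = classify(text_scores, neighbors)
# The neighbour evidence tips the decision to Sports.
```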
Network Management

 Objective: to deliver content to users quickly and reliably
   Traffic management
   Fault management

[Slide figure: a service provider network of routers and servers]
Why is Traffic Management Important?

 While annual bandwidth demand is increasing ten-fold on average, annual bandwidth supply is rising only by a factor of three
 The result is frequent congestion at servers and on network links
   During a major event (e.g., Princess Diana’s death), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world
   Olympic sites during the games
   NASA sites close to launch and landing of shuttles
Traffic Management

 Key ideas
   Dynamically replicate/cache content at multiple sites within the network and closer to the user
     Route user requests to the server closest to the user, or to the least loaded server
   Multiple paths between any pair of sites
     Use the path with the least congested network links
   Akamai, Inktomi

[Slide figure: a request routed through a service provider network around a congested link and a congested server]
Traffic Management

 Need to mine network and Web traffic to determine
   What content to replicate?
   Which servers should store replicas?
   Which server to route a user request to?
   What path to use to route packets?
 Network design issues
   Where to place servers?
   Where to place routers?
   Which routers should be connected by links?
 One can use association rule and sequential pattern mining algorithms to cache/prefetch replicas at servers
Fault Management

 Fault management involves
   Quickly identifying failed/congested servers and links in the network
   Re-routing user requests and packets to avoid congested/down servers and links
 Need to analyze alarm and traffic data to carry out root-cause analysis of faults
   Bayesian classifiers can be used to predict the root cause given a set of alarms
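A Bayes-style root-cause predictor can be sketched with simple counts: estimate, from past incidents, how likely each alarm is under each cause, and pick the cause maximising prior times likelihood. The incident history below is fabricated:

```python
# Sketch: naive Bayes-style root-cause prediction from a set of alarms,
# with Laplace smoothing. The incident history is fabricated.

def predict_cause(history, alarms):
    """history: list of (cause, set_of_alarms) incidents.
    Return the cause maximising P(cause) * prod P(alarm | cause)."""
    causes = {c for c, _ in history}

    def score(cause):
        incidents = [a for c, a in history if c == cause]
        prior = len(incidents) / len(history)
        likelihood = 1.0
        for alarm in alarms:
            seen = sum(1 for a in incidents if alarm in a)
            likelihood *= (seen + 1) / (len(incidents) + 2)  # smoothed
        return prior * likelihood

    return max(causes, key=score)

history = [
    ("link-failure", {"loss", "latency"}),
    ("link-failure", {"loss"}),
    ("server-down", {"timeout", "latency"}),
]
cause = predict_cause(history, {"loss", "latency"})
# Packet loss plus latency matches past link failures best.
```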
Web Mining Issues

 Size
   The Web grows at about 1 million pages a day
   Google indexes 9 billion documents
   Number of web sites: the Netcraft survey says 72 million sites
(http://news.netcraft.com/archives/web_server_survey.html)
 Diverse types of data
   Images
   Text
   Audio/video
   XML
   HTML

[Slide figure: Netcraft chart of the number of active sites and total sites across all domains, August 1995 to October 2007]
Systems Issues

 Web data sets can be very large
   Tens to hundreds of terabytes
 Cannot mine on a single server!
   Need large farms of servers
 How to organize hardware/software to mine multi-terabyte data sets
   Without breaking the bank!
Different Data Formats

 Structured data
 Unstructured data
 OLE DB offers some solutions!

Web Data

 Web pages
 Intra-page structures
 Inter-page structures
 Usage data
 Supplemental data
   Profiles
   Registration information
   Cookies
Web Usage Mining

 Pages contain information
 Links are ‘roads’
 How do people navigate the Internet?
    Web Usage Mining (clickstream analysis)
 Information on navigation paths is available in log files
 Logs can be mined from a client or a server perspective
Website Usage Analysis

 Why analyze Website usage?
 Knowledge about how visitors use a Website could
   Provide guidelines for web site reorganization; help prevent disorientation
   Help designers place important information where visitors look for it
   Guide pre-fetching and caching of web pages
   Provide an adaptive Website (personalization)
 Questions which could be answered
   What are the differences in usage and access patterns among users?
   Which user behaviors change over time?
   How do usage patterns change with quality of service (slow/fast)?
   What is the distribution of network traffic over time?
Website Usage Analysis
Analog – Web Log File Analyser
Gives basic statistics such as
• number of hits
• average hits per time period
• what are the popular pages in your site
• who is visiting your site
• what keywords are users searching for to get to you
• what is being downloaded
http://www.analog.cx/
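The per-page hit counts such a tool reports come straight from the server log. A minimal sketch, parsing a few fabricated Common Log Format lines (Analog itself does far more, including time-period and referrer breakdowns):

```python
# Sketch: counting hits per page from web server log lines in the
# Common Log Format. The log entries below are fabricated.
from collections import Counter

def page_hits(log_lines):
    """Count requests per URL from Common Log Format entries."""
    hits = Counter()
    for line in log_lines:
        # The quoted field is 'METHOD URL PROTOCOL'; take the URL.
        request = line.split('"')[1]
        url = request.split()[1]
        hits[url] += 1
    return hits

logs = [
    '1.2.3.4 - - [10/Oct/2007:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '1.2.3.5 - - [10/Oct/2007:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '1.2.3.6 - - [10/Oct/2007:13:57:12 -0700] "GET /about.html HTTP/1.0" 200 1104',
]
hits = page_hits(logs)
# /index.html was requested twice, /about.html once.
```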
Web Usage Mining Process
Web Mining Outline

 Goal: examine the use of data mining on the World Wide Web
   Web Content Mining
   Web Structure Mining
   Web Usage Mining

Web Mining Taxonomy

[Slide figure: web mining taxonomy, modified from [zai01]]
Web Content Mining

 Examines the contents of web pages as well as the results of web searching
 Can be thought of as extending the work performed by basic search engines
 Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to the users
 Web Content Mining is the process of extracting knowledge from web contents
Semi-structured Data

 Content is, in general, semi-structured
 Example:
   Title
   Author
   Publication_Date
   Length
   Category
   Abstract
   Content
Structuring Textual Data

 Many methods are designed to analyze structured data
 If we can represent documents by a set of attributes, we will be able to use existing data mining methods
 How to represent a document?
   Vector-based representation (referred to as “bag of words”, as it is invariant to permutations)
   Use statistics to add a numerical dimension to unstructured text
Document Representation

 A document representation aims to capture what the document is about
 One possible approach:
   Each entry describes a document
   Attributes describe whether or not a term appears in the document

Document Representation

 Another approach:
   Each entry describes a document
   Attributes represent the frequency with which a term appears in the document
Document Representation

 Stop word removal: many words are not informative and thus irrelevant for document representation
   the, and, a, an, is, of, that, …
 Stemming: reducing words to their root form (reduces dimensionality)
   A document may contain several occurrences of words like fish, fishes, fisher, and fishers, but would not be retrieved by a query with the keyword “fishing”
   Different words share the same word stem and should be represented with the stem (“fish”) instead of the actual word
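The three steps above can be sketched end to end: remove stop words, reduce words to a stem, and count term frequencies. The suffix-stripping "stemmer" here is deliberately crude (real systems use something like Porter's algorithm), and the stop list is a small illustrative sample:

```python
# Sketch of the document-representation pipeline above: stop-word removal,
# a crude suffix-stripping stemmer, and a term-frequency vector.
# The stop list and suffix rules are illustrative, not a real stemmer.

STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}

def crude_stem(word):
    """Strip a common suffix so fish/fishes/fisher/fishers all map to 'fish'."""
    for suffix in ("ers", "ing", "es", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_vector(text):
    """Map a document to {stem: frequency}, skipping stop words."""
    vector = {}
    for token in text.lower().split():
        if token in STOP_WORDS:
            continue
        stem = crude_stem(token)
        vector[stem] = vector.get(stem, 0) + 1
    return vector

vec = term_vector("the fisher and the fishers like fishing")
# All three fish-words collapse to the stem 'fish'.
```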