WEB MINING
PRESENTED BY:
VIKASH KUMAR.
Web Mining
Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services.
It covers discovering useful information from the World Wide Web and its usage patterns.
My definition: using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web.
Data Mining Techniques
  Association rules
  Sequential patterns
  Classification
  Clustering
Classification of Web Mining Techniques
  Web Content Mining
  Web-Structure Mining
  Web-Usage Mining
Web-Structure Mining
Generates a structural summary of the Web site and its Web pages.
Categorizes Web pages and the related information at the inter-domain level, based on their hyperlinks.
Discovers the Web page structure.
Discovers the nature of the hierarchy of hyperlinks in the website and its structure.
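As a rough illustration of a structural summary, the sketch below extracts hyperlinks from a few pages and counts how many pages point to each URL. The three-page site, the `example.com` URLs, and the helper names (`LinkExtractor`, `build_link_graph`, `in_degrees`) are assumptions made for this example; they are not part of the original presentation.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def build_link_graph(pages):
    """Map each page URL to the set of URLs it links to."""
    graph = {}
    for url, html in pages.items():
        parser = LinkExtractor(url)
        parser.feed(html)
        graph[url] = set(parser.links)
    return graph


def in_degrees(graph):
    """Count how many distinct other pages link to each URL."""
    counts = defaultdict(int)
    for source, targets in graph.items():
        for target in targets:
            if target != source:
                counts[target] += 1
    return dict(counts)


# Hypothetical three-page site, held in memory instead of being crawled.
pages = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a>',
    "http://example.com/b.html": '<a href="/">Home</a>',
}
graph = build_link_graph(pages)
degrees = in_degrees(graph)
```

Pages with a high in-degree are natural candidates for hubs or entry points when summarizing the link hierarchy of a site.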
Web-Usage Mining
What is usage mining?
Discovering user navigation patterns from web data.
Predicting user behavior while the user interacts with the web.
Helps to improve large collections of resources.
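One simple way to picture navigation patterns and behavior prediction is to split a click log into sessions, count page-to-page transitions, and predict the most frequent next page. The sketch below does that under stated assumptions: the `(user, timestamp, url)` record layout, the 30-minute timeout, the sample data, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def sessionize(records, timeout=1800):
    """Group (user, timestamp, url) records into per-user sessions.

    A new session starts when a user is idle for more than `timeout` seconds.
    """
    sessions = []
    last_seen = {}  # user -> timestamp of that user's previous request
    current = {}    # user -> session (list of URLs) currently being built
    for user, ts, url in sorted(records, key=lambda r: (r[0], r[1])):
        if user not in current or ts - last_seen[user] > timeout:
            current[user] = []
            sessions.append(current[user])
        current[user].append(url)
        last_seen[user] = ts
    return sessions


def transition_counts(sessions):
    """Count how often each page is followed directly by each other page."""
    counts = defaultdict(Counter)
    for session in sessions:
        for src, dst in zip(session, session[1:]):
            counts[src][dst] += 1
    return counts


def predict_next(counts, url):
    """Return the most frequent successor of `url`, or None if unseen."""
    if url not in counts or not counts[url]:
        return None
    return counts[url].most_common(1)[0][0]


# Hypothetical click log: (user id, seconds since start, requested URL).
records = [
    ("u1", 0, "/home"), ("u1", 30, "/products"), ("u1", 60, "/cart"),
    ("u2", 10, "/home"), ("u2", 50, "/products"),
    ("u1", 5000, "/home"), ("u1", 5020, "/about"),
]
sessions = sessionize(records)
counts = transition_counts(sessions)
next_page = predict_next(counts, "/home")
```

This first-order transition model is only one possible choice; sequential pattern mining over longer paths follows the same idea with longer windows.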
Web Usage Mining Process
Web Usage Mining
  Search engines
  Personalization
  Website design
Website Usage Analysis


Why analyze Website usage?
Knowledge about how visitors use a Website could:
  Provide guidelines for web site reorganization and help prevent disorientation
  Help designers place important information where the visitors look for it
  Support pre-fetching and caching of web pages
  Provide an adaptive Website (personalization)
Questions which could be answered
  What are the differences in usage and access patterns among users?
  Which user behaviors change over time?
  How do usage patterns change with quality of service (slow/fast)?
  What is the distribution of network traffic over time?
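The last question, the distribution of traffic over time, can be approximated by counting requests per hour in a server access log. The sketch below assumes log lines in the Common Log Format with a bracketed timestamp; the sample lines and the function name `requests_per_hour` are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the bracketed timestamp of a Common Log Format line.
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def requests_per_hour(log_lines):
    """Count requests by hour of day (0-23), using the log's local time."""
    counts = Counter()
    for line in log_lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # skip lines without a recognizable timestamp
        when = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        counts[when.hour] += 1
    return counts


# Hypothetical access-log excerpt.
sample_log = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2000:13:58:01 -0700] "GET /about.html HTTP/1.0" 200 512',
    '10.0.0.1 - - [10/Oct/2000:14:02:10 -0700] "GET /cart.html HTTP/1.0" 200 1024',
]
hourly = requests_per_hour(sample_log)
```

Grouping by user or by response time instead of by hour would address the other questions in the same way.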
Web Content Mining
The process of information or resource discovery from the content of millions of sources across the World Wide Web.
E.g. Web data contents: text, images, audio, video, metadata and hyperlinks.
Goes beyond keyword extraction or simple statistics of words and phrases in documents.
Examples of Discovered Patterns
Association rules: 98% of AOL users also have E-trade accounts
Classification: people with age less than 40 and salary > 40k trade on-line
Clustering: users A and B access similar URLs
Outlier detection: user A spends more than twice the average amount of time surfing on the Web
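A statement such as "X% of AOL users also have E-trade accounts" is the confidence of the association rule AOL -> E-trade. The sketch below shows how support and confidence are computed; the five-user data set and the function names are invented for illustration and do not reproduce the 98% figure.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def support(transactions, items):
    """Fraction of all transactions that contain `items`."""
    return count_containing(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with `antecedent` that also have `consequent`."""
    base = count_containing(transactions, antecedent)
    if base == 0:
        return 0.0
    both = count_containing(transactions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical data: the set of services each user has an account with.
users = [
    {"AOL", "E-trade"},
    {"AOL", "E-trade"},
    {"AOL", "E-trade"},
    {"AOL"},
    {"E-trade"},
]
rule_support = support(users, {"AOL", "E-trade"})         # 3 of 5 users
rule_confidence = confidence(users, {"AOL"}, {"E-trade"})  # 3 of 4 AOL users
```

Rule mining algorithms typically keep only rules whose support and confidence both exceed chosen thresholds.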
Web Mining
The WWW is a huge, widely distributed, global information service centre for:
  Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
  Hyperlink information
  Access and usage information
The WWW provides rich sources of data for data mining.
Why Mine the Web?
Enormous wealth of information on the Web
  Financial information (e.g. stock quotes)
  Book/CD/Video stores (e.g. Amazon)
  Restaurant information (e.g. Zagats)
  Car prices (e.g. Carpoint)
Lots of data on user access patterns
  Web logs contain the sequence of URLs accessed by users
Possible to mine interesting nuggets of information
  People who ski also travel frequently to Europe
  Tech stocks have corrections in the summer and rally from November until February
User Profiling
Important for improving customization
  Provide users with pages, advertisements of interest
  Example profiles: on-line trader, on-line shopper
Generate user profiles based on their access patterns
  Cluster users based on frequently accessed URLs
  Use classifier to generate a profile for each cluster
Engage technologies
  Tracks web traffic to create anonymous user profiles of Web surfers
  Has profiles for more than 35 million anonymous users
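Clustering users on frequently accessed URLs can be sketched with a set-similarity measure. The example below groups users whose URL sets have a Jaccard similarity at or above a threshold; the similarity measure, the 0.5 threshold, the sample users, and the function names are all assumptions, and the classifier step that would turn each cluster into a profile is not shown.

```python
def jaccard(a, b):
    """Jaccard similarity of two URL sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_users(url_sets, threshold=0.5):
    """Group users into connected components of the 'similar enough' relation."""
    users = list(url_sets)
    parent = {u: u for u in users}

    def find(u):
        # Follow parent pointers to the cluster representative.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for i, u in enumerate(users):
        for v in users[i + 1:]:
            if jaccard(url_sets[u], url_sets[v]) >= threshold:
                parent[find(u)] = find(v)  # merge the two clusters

    groups = {}
    for u in users:
        groups.setdefault(find(u), []).append(u)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical users and the URLs they access most often.
url_sets = {
    "A": {"/quotes", "/portfolio", "/trade"},
    "B": {"/quotes", "/trade", "/news"},
    "C": {"/books", "/cart", "/checkout"},
    "D": {"/books", "/cart"},
}
clusters = cluster_users(url_sets)
```

Here the first cluster resembles an on-line trader profile and the second an on-line shopper profile, matching the example profiles named above.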
Problems with Web Search Today
Today’s search engines are plagued by problems:
  The abundance problem (99% of information is of no interest to 99% of people)
  Limited coverage of the Web (internet sources hidden behind search interfaces); the largest crawlers cover < 18% of all web pages
  Limited query interface based on keyword-oriented search
  Limited customization to individual users
  The Web is highly dynamic: lots of pages are added, removed, and updated every day
  Very high dimensionality
Web Mining Issues
Size
  Grows at about 1 million pages a day
  Google indexes 9 billion documents
  Number of web sites: Netcraft survey says 72 million sites (http://news.netcraft.com/archives/web_server_survey.html)
Diverse types of data
  Images
  Text
  Audio/video
  XML
  HTML
Web Mining Applications
E-commerce (Infrastructure)
  Generate user profiles
  Targeted advertising
  Fraud
  Similar image retrieval
Information retrieval (Search) on the Web
  Automated generation of topic hierarchies
  Web knowledge bases
  Extraction of schema for XML documents
Network Management
  Performance management
  Fault management
Retrieval of Similar Images
Given:
  A set of images
Find:
  All images similar to a given image
  All pairs of similar images
Sample applications:
  Medical diagnosis
  Weather prediction
  Web search engine for images
  E-commerce

Fraud
With the growing popularity of e-commerce, systems to detect and prevent fraud on the Web become important.
Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought).
If the buying pattern changes significantly, then signal fraud.
HNC software uses domain knowledge and neural networks for credit card fraud detection.
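The signature idea can be sketched as a small per-user summary that new purchases are compared against. The version below is a deliberately simple statistical rule, not the neural-network approach attributed to HNC: the signature fields, the z-score threshold of 3, the rule that both the amount and the category must be unusual, and the sample history are all assumptions.

```python
from statistics import mean, pstdev


def build_signature(purchases):
    """Summarize a user's history of (amount, category) purchases."""
    amounts = [amount for amount, _ in purchases]
    return {
        "mean": mean(amounts),
        "std": pstdev(amounts),
        "categories": {category for _, category in purchases},
    }


def looks_suspicious(signature, amount, category, z_threshold=3.0):
    """Flag a purchase whose amount is far from the user's norm AND whose
    category the user has never bought from (a hypothetical rule)."""
    spread = signature["std"] or 1.0  # avoid dividing by zero for constant history
    z_score = abs(amount - signature["mean"]) / spread
    unusual_amount = z_score > z_threshold
    new_category = category not in signature["categories"]
    return unusual_amount and new_category


# Hypothetical purchase history for one user.
history = [(20, "books"), (25, "books"), (30, "music"), (22, "books"), (28, "music")]
signature = build_signature(history)
flag_large = looks_suspicious(signature, 900, "jewelry")
flag_normal = looks_suspicious(signature, 27, "books")
```

Requiring both conditions keeps false alarms low in this toy setting; a real system would tune or learn the rule from labeled fraud cases.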
Conclusion
Major limitations of Web mining research:
  Lack of suitable test collections that can be reused by researchers.
  Difficult to collect Web usage data across different Web sites.
Future research directions:
  Multimedia data mining: a picture is worth a thousand words.
  Multilingual knowledge extraction: Web page translations.
  Wireless Web: WML and HDML.
  The Hidden Web: forms, dynamically generated Web pages.
  Semantic Web

THANK YOU