Web Mining
by:
Katharotiya Manthan
Overview






Web Mining
Semantic Web
Ontologies
Semantic Web Mining
Future Work
References
Problems With Web
Interaction




Finding Relevant Information
Creating New Knowledge using Existing
Resources
Personalization of Information
Learning about Consumers or Individual
Users
Web Mining



The term was coined by Oren Etzioni
(1996)
Application of data mining techniques to the Web
Web Mining into Subtasks




Resource finding
Information Selection and pre-processing
Generalization
Analysis
Different Types

Web Usage Mining

Web Content Mining

Web Structure Mining
Data Mining vs. Web Mining

Traditional data mining



data is structured and relational
well-defined tables, columns, rows, keys,
and constraints.
Web data



Semi-structured and unstructured
readily available data
rich in features and patterns
Web Structure Mining

Generate structural summary about the
Web site and Web page


Extraction of patterns from the hyperlinks
Mining of the structure of the document
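
As a minimal sketch of hyperlink extraction, the snippet below uses Python's standard html.parser module to collect the outgoing links of each page and count incoming links. The two-page site and the names LinkCollector and build_link_graph are assumptions made for illustration; they do not come from the slides.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def build_link_graph(pages):
    """Map each page URL to the list of URLs it links to."""
    graph = {}
    for url, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        graph[url] = collector.links
    return graph


# Hypothetical two-page site used only for illustration.
pages = {
    "/company": '<a href="/company/products">Products</a> <a href="/company/new">New</a>',
    "/company/products": '<a href="/company">Home</a>',
}
graph = build_link_graph(pages)

# Count incoming links per page, a simple structural summary.
in_degree = {}
for source, targets in graph.items():
    for target in targets:
        in_degree[target] = in_degree.get(target, 0) + 1
```

The in-degree counts are only one possible summary; link-analysis methods would start from the same graph.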
Web Usage Mining

Discovering user ‘navigation patterns’
from web data.


Prediction of user behavior while the user interacts
with the web.
Helps to improve large collections of resources.
Usage Mining Techniques

Data Preparation




Data Collection
Data Selection
Data Cleaning
Data Mining


Navigation Patterns
Sequential Patterns
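
The data preparation steps can be sketched as follows, assuming the server log has already been parsed into (user, timestamp, path) records. The sample records, the ignored file suffixes and the 30-minute session timeout are illustrative assumptions, not values given in the slides.

```python
from datetime import datetime, timedelta

# Hypothetical, already-parsed log records: (user id, timestamp, requested path).
raw_log = [
    ("u1", "2024-01-01 10:00:00", "/company"),
    ("u1", "2024-01-01 10:00:01", "/images/logo.gif"),
    ("u1", "2024-01-01 10:02:00", "/company/products"),
    ("u1", "2024-01-01 12:00:00", "/company"),
    ("u2", "2024-01-01 10:05:00", "/company/products"),
]

IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")
SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold


def clean(records):
    """Data cleaning: drop requests for embedded resources such as images."""
    return [r for r in records if not r[2].lower().endswith(IGNORED_SUFFIXES)]


def sessionize(records):
    """Group each user's requests into sessions split by inactivity."""
    parsed = sorted(
        (user, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), path)
        for user, ts, path in records
    )
    sessions = []
    last_seen = {}  # user -> (time of last request, that user's current session)
    for user, ts, path in parsed:
        previous = last_seen.get(user)
        if previous is None or ts - previous[0] > SESSION_TIMEOUT:
            current = []
            sessions.append(current)
        else:
            current = previous[1]
        current.append(path)
        last_seen[user] = (ts, current)
    return sessions


sessions = sessionize(clean(raw_log))
```

The resulting sessions are the input for the navigation and sequential pattern analyses.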
Data Mining Techniques

Navigation Patterns




Example:
70% of users who accessed /company/product2 did so by
starting at /company and proceeding through
/company/new, /company/products and /company/product1
80% of users who accessed the site started from
/company/products
65% of users left the site after four or fewer page references
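
Statistics of this kind can be computed from sessionized logs. The sketch below is one possible approach; the four sample sessions and the helper names share and follows_path are hypothetical, so the resulting numbers differ from the percentages quoted above.

```python
# Hypothetical sessions: each is the ordered list of pages one visitor requested.
sessions = [
    ["/company", "/company/new", "/company/products",
     "/company/product1", "/company/product2"],
    ["/company/products", "/company/product2"],
    ["/company/products", "/company/product1"],
    ["/company", "/company/new"],
]


def share(sessions, predicate):
    """Fraction of sessions for which predicate(session) is true."""
    if not sessions:
        return 0.0
    return sum(1 for s in sessions if predicate(s)) / len(sessions)


def follows_path(session, path):
    """True if the pages of `path` occur in `session` in the given order."""
    remaining = iter(session)
    return all(page in remaining for page in path)


target = "/company/product2"
route = ["/company", "/company/new", "/company/products",
         "/company/product1", target]

# Among sessions that reached the target, how many took the given route?
reached_target = [s for s in sessions if target in s]
via_route = share(reached_target, lambda s: follows_path(s, route))

# Share of sessions that entered the site at /company/products.
started_at_products = share(sessions, lambda s: s[0] == "/company/products")

# Share of sessions that ended after four or fewer page references.
short_visits = share(sessions, lambda s: len(s) <= 4)
```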
Cont…

Sequential Patterns


30% of users who visited /company/product/ had
done a Google search within the past week with
‘camera’ as the search text.
60% of users who placed an online order
in /company/product1 also placed an order
in /company/product4 within 15 days
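
A time-constrained rule such as the second one could be checked as sketched below. The order records and the function name sequential_confidence are assumptions for illustration; real systems would use a dedicated sequential pattern mining algorithm rather than testing a single rule.

```python
from datetime import date, timedelta

# Hypothetical order records: (user id, product page, order date).
orders = [
    ("u1", "/company/product1", date(2024, 1, 1)),
    ("u1", "/company/product4", date(2024, 1, 10)),
    ("u2", "/company/product1", date(2024, 1, 3)),
    ("u2", "/company/product4", date(2024, 2, 20)),
    ("u3", "/company/product1", date(2024, 1, 5)),
    ("u4", "/company/product4", date(2024, 1, 6)),
]


def sequential_confidence(orders, first, then, window):
    """Share of users who ordered `first` and also ordered `then`
    no more than `window` afterwards."""
    by_user = {}
    for user, page, day in orders:
        by_user.setdefault(user, []).append((page, day))

    antecedent = 0  # users who ordered `first`
    both = 0        # of those, users who ordered `then` within the window
    for events in by_user.values():
        first_days = [d for p, d in events if p == first]
        if not first_days:
            continue
        antecedent += 1
        then_days = [d for p, d in events if p == then]
        if any(timedelta(0) <= t - f <= window
               for f in first_days for t in then_days):
            both += 1
    return both / antecedent if antecedent else 0.0


confidence = sequential_confidence(
    orders, "/company/product1", "/company/product4", timedelta(days=15)
)
```

In this sample, three users ordered from /company/product1 and only one of them ordered from /company/product4 within 15 days.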
Web Content Mining

Process of information or resource
discovery from the content of millions of
sources across the World Wide Web


E.g. Web data contents: text, images, audio, video,
metadata and hyperlinks
Goes beyond keyword extraction or
simple statistics of words and
phrases in documents.
Semantic Web

The Semantic Web is an evolving
development of the World Wide Web in
which the meaning (semantics) of
information and services on the web is
defined, making it possible for the web
to "understand" and satisfy the
requests of people and machines to use
the web content.
XML, RDF and Web Data




Structured and Unstructured Data
W3C Standards for RDF
Semantic Web: Different Kinds of
databases
Tight Coupling and Loose Coupling
RDF - Resource Description
Framework

Data Model consists of three object
types:



Resources
Properties
Statements
Example


The sentence "Ora Lassila is the creator of the resource
http://www.w3.org/Home/Lassila" has the following parts:



Subject (Resource): http://www.w3.org/Home/Lassila
Predicate (Property): Creator
Object (Literal): "Ora Lassila"
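
The statement can be represented as a (subject, predicate, object) triple. The sketch below uses a plain named tuple rather than any RDF library; the Statement type and the objects_of helper are illustrative assumptions.

```python
from collections import namedtuple

# A statement is a (subject, predicate, object) triple.
Statement = namedtuple("Statement", ["subject", "predicate", "object"])

statement = Statement(
    subject="http://www.w3.org/Home/Lassila",  # resource
    predicate="Creator",                        # property
    object="Ora Lassila",                       # literal value
)


def objects_of(statements, subject, predicate):
    """Return every object stated for the given subject and property."""
    return [
        s.object
        for s in statements
        if s.subject == subject and s.predicate == predicate
    ]


creators = objects_of([statement], "http://www.w3.org/Home/Lassila", "Creator")
```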
Ontologies

Ontologies are developed to provide
machine-processable semantics of
information sources that can be
communicated between different agents
(software and humans).
Developing an Ontology




Defining classes in the ontology,
Arranging the classes in a taxonomic
(subclass–superclass) hierarchy
Defining slots and describing allowed
values for these slots,
Filling in the values for slots for
instances.
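
The four steps can be illustrated with a small sketch. The OntologyClass type and the product/camera example are hypothetical; ontology languages and editors provide far richer constructs than this.

```python
class OntologyClass:
    """A class with an optional superclass and a set of slots.

    Each slot maps to a set of allowed values, or None if any value is allowed.
    """

    def __init__(self, name, parent=None, slots=None):
        self.name = name
        self.parent = parent
        self.own_slots = slots or {}

    def all_slots(self):
        """Slots defined here plus those inherited from superclasses."""
        inherited = self.parent.all_slots() if self.parent else {}
        return {**inherited, **self.own_slots}

    def is_subclass_of(self, other):
        """True if `other` is this class or one of its ancestors."""
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False

    def create_instance(self, **values):
        """Fill in slot values, rejecting unknown slots and disallowed values."""
        slots = self.all_slots()
        for slot, value in values.items():
            if slot not in slots:
                raise ValueError(f"{self.name} has no slot {slot!r}")
            allowed = slots[slot]
            if allowed is not None and value not in allowed:
                raise ValueError(f"{value!r} is not allowed for slot {slot!r}")
        return {"class": self.name, **values}


# Steps 1 and 2: define classes and arrange them in a hierarchy.
product = OntologyClass("Product", slots={"name": None})
# Step 3: define a slot and its allowed values.
camera = OntologyClass("Camera", parent=product, slots={"sensor": {"CCD", "CMOS"}})
# Step 4: fill in slot values for an instance.
my_camera = camera.create_instance(name="Model X", sensor="CMOS")
```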
Semantic Web Mining


Closing the gap between Semantic Web
and Web Mining.
Use of ontologies
Mining the Semantic Web
Evaluation of Semantic Web Mining


Web Mining Vs. Semantic Web Mining
A Note On E-Commerce
Research initiatives


Vivísimo proposes a clustering approach
for web document organization
Haveliwala et al. also propose a methodology
for evaluating strategies for similarity
search on the Web.

Jaccard coefficient
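
The Jaccard coefficient of two sets A and B is |A ∩ B| / |A ∪ B|. The sketch below applies it to the word sets of two short texts; the sample sentences and the whitespace tokenization are assumptions for illustration and are not taken from the cited work.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here for two empty sets
    return len(a & b) / len(a | b)


doc1 = "web mining applies data mining to web data".split()
doc2 = "semantic web mining combines web mining and the semantic web".split()
similarity = jaccard(doc1, doc2)
```

The two texts share 2 distinct words out of 9 overall, so the coefficient is 2/9.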
Future Work

The utility of web mining can be
demonstrated by making exploratory
changes to web sites, e.g., adding
links from hot parts of a web site to cold
parts, and then extracting, visualizing
and interpreting the resulting changes in
access patterns.
Cont…


There is often a tension in the design
of algorithms between
accommodating a wide range of data
and customizing the algorithm to
capitalize on known constraints or
regularities.
Web content mining can also be
introduced into implementations of this
architecture.
References






http://en.wikipedia.org/wiki/Web_mining
http://www.engr.sjsu.edu/meirinaki/papers/NEMIS.pdf
http://www.w3.org
http://www.cs.washington.edu/research/projects/WebWare1/www/softbots/papers/agents97.pdf
http://infomesh.net/2001/swintro/
http://www.ksl.stanford.edu/people/dlm/etai/etaiabstract.html