Web Mining
The World Wide Web is a huge, global information repository for
•News, advertisements, consumer information, financial
management, education, government, e-commerce
•Hyperlink information
•Access and usage information
•Already contains billions of pages
•Millions of pages are added every day
The WWW provides rich sources for data mining.
Challenges
•Too huge for effective data warehousing and data mining
•Too complex and heterogeneous: no standards and structure.
The Web holds unstructured data; there is no structure or
schema to the Web, and there are no standard retrieval methods.
Web data can be classified into five classes:
•Content of actual Web pages – text and graphics data
•Intra-page structure – the HTML or XML code for the page
•Inter-page structure – the linkage or navigation structure
between Web pages
•Usage data – how Web pages are accessed by visitors
(dynamic data)
•User profiles – demographic or registration information
obtained about users, e.g., the information stored in cookies
Depending on the type of data that is mined, web mining can
be divided into several classes.
Web mining taxonomy:
•Web Content Mining
– Web Page Content Mining
– Search Result Mining
•Web Structure Mining
•Web Usage Mining
– General Access Pattern Tracking
– Customized Usage Tracking
Web page content mining is the traditional searching of web
pages via their content.
Search result mining is built on top of search engines and uses
their results as the data to be mined.
Web structure mining uses the organization of pages on the web.
Web usage mining uses the logs of web access.
It can be general or targeted at specific usage or users.
Web Content Mining
Traditional search engines are keyword based.
Improvements can be made by using techniques such as concept
hierarchies and synonyms, user profiles, and analysis of the
links between pages.
Queries may use expressions of keywords (a small matching
sketch follows below),
e.g., car and repair shop, tea or coffee, DBMS but not Oracle.
Queries and retrieval should consider synonyms, e.g.,
repair and maintenance.
Major difficulties of the model:
Synonymy: a keyword T does not appear anywhere in the
document, even though the document is closely related to
T, e.g., data mining vs. knowledge discovery.
Polysemy: the same keyword may mean different things
in different contexts, e.g., mining.
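As a concrete illustration of such keyword expressions, here is a minimal Python sketch; the toy documents, the `matches` helper, and its parameters are assumptions for illustration only, not part of the notes.

```python
# A minimal sketch of boolean keyword retrieval over a toy document set.
docs = {
    1: "car repair shop open on weekends",
    2: "tea and coffee shop near the station",
    3: "Oracle DBMS tuning and administration",
    4: "open source DBMS comparison",
}

def tokens(text):
    return set(text.lower().split())

def matches(doc, must=(), any_of=(), must_not=()):
    """True if doc contains all 'must' terms, at least one 'any_of'
    term (when given), and none of the 'must_not' terms."""
    t = tokens(doc)
    return (all(w in t for w in must)
            and (not any_of or any(w in t for w in any_of))
            and not any(w in t for w in must_not))

# "car AND repair shop"
print([d for d, text in docs.items() if matches(text, must=("car", "repair", "shop"))])
# "tea OR coffee"
print([d for d, text in docs.items() if matches(text, any_of=("tea", "coffee"))])
# "DBMS but NOT Oracle"
print([d for d, text in docs.items() if matches(text, must=("dbms",), must_not=("oracle",))])
```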
Basic content mining is text mining.
The degree of relevance is based on the nearness of the keywords,
the relative frequency of the keywords, etc.
Basic techniques (a term-frequency sketch follows this list):
Stop list
A set of words that are deemed “irrelevant”, even though
they may appear frequently,
e.g., a, the, of, for, with, etc.
Stop lists may vary when the document set varies.
Word stem
Several words are small syntactic variants of each other
since they share a common word stem,
e.g., drug, drugs, drugged.
Term frequency table
Each entry freq_table(i, j) = number of occurrences of the
word t_i in document d_j. Usually the ratio, rather than the
absolute number of occurrences, is used.
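A minimal sketch of these techniques, assuming a toy document set, a small stop list, and a crude suffix-stripping stemmer (a real system would use something like the Porter stemmer); all names and data below are illustrative assumptions.

```python
from collections import Counter

STOP_LIST = {"a", "the", "of", "for", "with", "and", "to", "in"}

def stem(word):
    # Toy stemmer: map small syntactic variants to a common stem,
    # e.g. drug, drugs, drugged -> drug.
    for suffix in ("ged", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_frequencies(text):
    # Drop stop words, stem the rest, then count occurrences.
    words = [stem(w) for w in text.lower().split() if w not in STOP_LIST]
    counts = Counter(words)
    total = sum(counts.values())
    # Ratio of occurrences rather than the absolute count, as in the notes.
    return {term: n / total for term, n in counts.items()}

docs = {
    "d1": "the drug was tested with a new drug trial",
    "d2": "mining of data for web mining applications",
}
freq_table = {doc_id: term_frequencies(text) for doc_id, text in docs.items()}
print(freq_table["d1"])
```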
Intelligent search agents (software systems) use other
techniques besides keyword search.
Concept hierarchies – a concept hierarchy defines a sequence of
mappings from a set of low-level concepts to higher-level, more
general concepts,
e.g., India → {Maharashtra, Punjab, Karnataka};
Maharashtra → {Mumbai, Pune}.
Synonyms – Bombay can be used instead of Mumbai.
XML schema – Web documents written in HTML are not structured.
The Web is slowly adopting XML (eXtensible Markup Language),
which gives documents structure so that mining can be efficient.
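The concept hierarchy and synonym ideas can be sketched as a simple query-expansion step; the `hierarchy` and `synonyms` dictionaries below merely encode the toy example above and are not part of any real system.

```python
hierarchy = {
    "India": ["Maharashtra", "Punjab", "Karnataka"],
    "Maharashtra": ["Mumbai", "Pune"],
}
synonyms = {"Bombay": "Mumbai"}

def expand(term):
    """Return the term plus all lower-level concepts beneath it."""
    term = synonyms.get(term, term)          # normalise synonyms first
    result = [term]
    for child in hierarchy.get(term, []):
        result.extend(expand(child))
    return result

print(expand("Maharashtra"))   # ['Maharashtra', 'Mumbai', 'Pune']
print(expand("Bombay"))        # ['Mumbai']
print(expand("India"))         # all states and their cities
```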
Crawlers
Crawlers are used to facilitate the creation of the indices used
by search engines.
A robot, spider, or crawler is a program that traverses the
hypertext structure of the web. The page or set of pages
that the crawler starts with is referred to as the seed URL.
All discovered links are saved in a queue, and the pages they
point to are then visited and searched for information.
The collected information is saved in indices for the users of
the associated search engines.
Crawlers are activated periodically.
Periodic crawlers usually replace the entire index, while
incremental crawlers selectively search the web and update the
index incrementally.
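A minimal breadth-first crawler sketch along these lines, using only the Python standard library; it omits robots.txt handling, politeness delays, and robust error handling, and the seed URL is a placeholder.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href targets of <a> tags from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    queue, visited, index = deque([seed_url]), set(), {}
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                       # skip unreachable pages
        index[url] = html                  # saved for the search engine's indexer
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))   # enqueue discovered links
    return index

# index = crawl("https://example.com/")   # seed URL is a placeholder
```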
Focused crawlers visit pages related to topics of interest.
If a page is not relevant, its links are not followed.
The focused crawler architecture consists of three primary
components:
•A hypertext classifier that associates a relevance score with
each document w.r.t. the crawl topic. Additionally, a resource
rating estimates whether it will be beneficial to follow the
links leading out of the page.
•A distiller that determines hub pages. A hub is a Web page
that provides a collection of links to many relevant pages. Hub
pages facilitate search even though they themselves do not
contain much information.
•The crawler itself, which performs crawling via a
priority-based structure governed by the classifier and the
distiller (see the sketch after this list).
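A minimal sketch of that priority-based crawl loop, where `classify` and `extract_links` are stand-ins for the hypertext classifier and a link extractor; both are assumed callables supplied by the caller, not real components.

```python
import heapq

def focused_crawl(seed_urls, classify, extract_links, threshold=0.5, max_pages=100):
    # Max-heap via negated scores: the most relevant page is crawled first.
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    visited, relevant = set(), []
    while frontier and len(visited) < max_pages:
        neg_score, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        score = classify(url)            # relevance w.r.t. the crawl topic
        if score < threshold:
            continue                     # irrelevant page: do not follow its links
        relevant.append((url, score))
        for link in extract_links(url):
            heapq.heappush(frontier, (-score, link))
    return relevant

# Example (with stand-in callables):
# relevant = focused_crawl(["https://example.com/"],
#                          classify=lambda url: 0.9,
#                          extract_links=lambda url: [])
```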
Harvest rate – the fraction of crawled pages that are relevant;
a high harvest rate (high precision) is the performance
objective of the focused crawler.
A manually created classification tree is used as a seed for
focused crawling.
Harvest System
The Harvest system is based on the use of caching, indexing, and
crawling.
It uses gatherers and brokers.
A gatherer obtains information for indexing from an Internet
service provider, while a broker provides the index and query
interface.
Indices in Harvest are topic-specific.
Harvest gatherers use the Essence system to assist in collecting
data.
Essence classifies documents by creating a semantic index.
Semantic indexing generates different types of information for
different types of files;
it then creates indices on this information based on keywords.
Essence uses file extensions to help classify file types.
Virtual Web View
The problem of handling the large amount of data on the Web
can be solved by creating a multiple layered database (MLDB).
Each layer of this database is more generalized than the layer
beneath it.
MLDB
• Layer0: the Web itself
• Layer1: the Web page descriptor layer
– Contains descriptive information for pages on the Web
– An abstraction of Layer0: substantially smaller but still rich
enough to preserve most of the interesting, general
information
– Organized into dozens of semistructured classes
• document, person, organization, ads, directory, sales,
software, game, stocks, library_catalog,
geographic_data, scientific_data, etc.
• Layer2 and up: various Web directory services
constructed on top of Layer1
– provide multidimensional, application-specific services
– The WordNet semantic network is used: groups of English
synonyms linked by lexical and semantic relationships
• Benefits:
– Multi-dimensional Web info summary analysis
– Approximate and intelligent query answering
– High-level Web query answering (WebSQL, WebML), with
primitives such as covers, covered by, like, close to
– Web content and structure mining
– Observing the dynamics/evolution of the Web
• Is it realistic to construct such a meta-Web?
– Benefits even if it is partially constructed
– Benefits may justify the cost of tool development,
standardization and partial restructuring
Web Structure Mining
Web structure can be used for finding authoritative Web pages,
i.e., retrieving pages that are not only relevant but also of
high quality, or authoritative on the topic.
Hyperlinks can be used to infer the notion of authority.
The Web consists not only of pages, but also of
hyperlinks pointing from one page to another.
These hyperlinks contain an enormous amount of latent
human annotation.
A hyperlink pointing to another Web page can be considered the
author's endorsement of that page.
PageRank is used to measure the importance of a page and for
prioritizing pages. It is calculated based on the
number of pages that point to it.
It is based on the number of backlinks to a page. Weights are
used to give more importance to backlinks coming from
important pages.
Given a page p, let Bp be the set of pages pointing to p and Fp
the set of links out of p. Then

PR(p) = c * Σ_{q ∈ Bp} PR(q) / |Fq|

where c is a value between 0 and 1 used for normalization.
A problem called rank sink exists when A points to B and B
points back to A; it is solved by adding an additional term to
the formula (a PageRank sketch follows below).
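A minimal sketch of the iterative PageRank computation implied by the formula above, with a (1 - c)/n term added as the extra term that avoids rank sinks; the toy graph is invented for illustration.

```python
def pagerank(out_links, c=0.85, iterations=50):
    pages = list(out_links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # start with uniform ranks
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Sum PR(q)/|Fq| over the backlinks q of p (pages linking to p).
            rank = sum(pr[q] / len(out_links[q])
                       for q in pages if p in out_links[q])
            # The (1 - c)/n term is the extra term that resolves rank sinks.
            new_pr[p] = (1 - c) / n + c * rank
        pr = new_pr
    return pr

# Toy graph: A and B point to each other, C points to A.
graph = {"A": {"B"}, "B": {"A"}, "C": {"A"}}
print(pagerank(graph))
```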
Problems with the Web linkage structure:
•Not every hyperlink represents an endorsement;
some exist only for navigation or for paid advertisements.
•One authority will seldom have its Web page point
to its rival authorities in the same field.
•Authoritative pages are seldom particularly descriptive.
Hub: a set of Web pages that provides collections of
links to authorities.
HITS (Hyperlink-Induced Topic Search)
•Explore interactions between hubs and authoritative pages.
•Use an index-based search engine to form the root set.
Many of these pages are presumably relevant to the
search topic, and some of them should contain links to most
of the prominent authorities.
•Expand the root set into a base set:
include all of the pages that the root-set pages link to,
and all of the pages that link to a page in the root set, up
to a designated size cutoff.
•Apply weight propagation (see the sketch after this list):
an iterative process that determines numerical estimates
of hub and authority weights.
•Output a short list of the pages with large hub
weights, and the pages with large authority weights for
the given search topic
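A minimal sketch of the HITS weight-propagation step on a toy base set; the graph is invented, and the normalization step simply keeps the weights bounded between iterations.

```python
import math

def hits(out_links, iterations=20):
    pages = list(out_links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of pages pointing to it.
        auth = {p: sum(hub[q] for q in pages if p in out_links[q]) for p in pages}
        # Hub weight: sum of authority weights of the pages it points to.
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        # Normalise so the weights do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

base_set = {"p1": {"p2", "p3"}, "p2": {"p3"}, "p3": set(), "p4": {"p3"}}
hub, auth = hits(base_set)
print(sorted(auth, key=auth.get, reverse=True))   # pages ranked by authority
```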
Systems based on the HITS algorithm
•Clever, Google: achieve better quality search results
than those generated by term-index engines such as
AltaVista and those created by human ontologists
such as Yahoo!
Difficulties from ignoring textual contexts
•Drifting: when hubs contain multiple topics
•Topic hijacking: when many pages from a single Web
site point to the same single popular site
Web Usage Mining
Mining Web log records to discover user access patterns
of Web pages
Web logs provide rich information about Web dynamics
Typical Web log entry includes the URL requested, the
IP address from which the request originated, and a
timestamp
Logs can be examined from either a client or a server
perspective.
From the server perspective, the log is evaluated for
information about the site where the service resides; this can
be used to improve the design of the site.
From the client perspective, the log is evaluated for the
client's sequence of clicks (clickstream data); this can be used
to improve performance by prefetching and caching pages
(a log-parsing sketch follows below).
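A minimal sketch of extracting per-client clickstreams from Web log entries; the log lines use a simplified, made-up `ip timestamp url` format rather than a real server log format.

```python
from collections import defaultdict

raw_log = """\
10.0.0.1 2017-05-05T10:00:01 /index.html
10.0.0.1 2017-05-05T10:00:09 /products.html
10.0.0.2 2017-05-05T10:00:12 /index.html
10.0.0.1 2017-05-05T10:00:30 /cart.html
"""

clickstreams = defaultdict(list)          # IP address -> ordered list of clicks
for line in raw_log.splitlines():
    ip, timestamp, url = line.split()
    clickstreams[ip].append((timestamp, url))

for ip, clicks in clickstreams.items():
    print(ip, [url for _, url in sorted(clicks)])
```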
Applications
•Target potential customers for electronic commerce.
•Enhance the quality and delivery of Internet information
services to the end user:
frequent access behaviour can be used for caching and to
improve the quality of the web site.
•Improve Web server system performance.
•Identify potential prime advertisement locations to
improve sales and advertisement.
•Develop user profiles to aid in personalization.
Web Usage Mining activities
• Construct a multidimensional view of the Weblog database
– Perform multidimensional OLAP analysis to find the top
N users, top N accessed Web pages, the most frequently
accessed time periods, etc. (see the sketch after this list)
• Perform data mining on Weblog records
– Find association patterns, sequential patterns, and
trends of Web accessing
– May need additional information, e.g., user browsing
sequences of the Web pages in the Web server buffer
• Conduct studies to
– Analyze system performance and improve system design
by Web caching, Web page prefetching, and Web page
swapping
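A minimal sketch of this top-N style analysis, reusing the made-up `(ip, timestamp, url)` records from the earlier log-parsing sketch; a real system would run such queries on a data cube or database rather than in-memory lists.

```python
from collections import Counter

records = [
    ("10.0.0.1", "2017-05-05T10:00:01", "/index.html"),
    ("10.0.0.1", "2017-05-05T10:00:09", "/products.html"),
    ("10.0.0.2", "2017-05-05T10:00:12", "/index.html"),
    ("10.0.0.1", "2017-05-05T10:00:30", "/cart.html"),
]

top_users = Counter(ip for ip, _, _ in records).most_common(3)
top_pages = Counter(url for _, _, url in records).most_common(3)
top_hours = Counter(ts[:13] for _, ts, _ in records).most_common(3)  # hour buckets

print("Top users:", top_users)
print("Top pages:", top_pages)
print("Busiest hours:", top_hours)
```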