Chapter 10
Link Analysis
Data Mining Techniques So Far…
• Chapter 5 – Statistics
• Chapter 6 – Decision Trees
• Chapter 7 – Neural Networks
• Chapter 8 – Nearest Neighbor Approaches: Memory-Based Reasoning and Collaborative Filtering
• Chapter 9 – Market Basket Analysis and Association Rules
Introduction
• Airline Route Maps are useful
• Hyperlinks were revolutionary
– Apple’s HyperCard (Bill Atkinson)
• It is claimed that no more than 6 degrees of separation lie between any two people on the planet
• Link Analysis is the data mining technique that addresses relationships and connections
• Link Analysis is based on Graph Theory
Introduction
• Like any data mining (DM) technique, Link Analysis has its limitations
• However, it is quite effective in these and similar situations:
– Identifying authoritative sources of information on the WWW by analyzing page links
– Understanding physician referral patterns
– Analyzing telephone call patterns
Basic Graph Theory
• Graphs are an abstraction used to represent relationships
• Graphs consist of
– Nodes (vertices), which are the things in the graph that have relationships
– Edges, which are pairs of nodes connected by a relationship
• Visualization is a key characteristic of a graph
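The node and edge definitions above can be sketched as a small adjacency list. This is only an illustration: the city names and the `build_graph` helper are assumptions, not part of the chapter.

```python
from collections import defaultdict

# Hypothetical airline-route edges; each edge is a pair of nodes.
edges = [("LA", "Denver"), ("Denver", "Boston"), ("LA", "Chicago")]

def build_graph(edge_list):
    """Build an undirected adjacency list: node -> set of neighbouring nodes."""
    graph = defaultdict(set)
    for a, b in edge_list:
        graph[a].add(b)  # the relationship holds in both directions
        graph[b].add(a)
    return dict(graph)

graph = build_graph(edges)
nodes = sorted(graph)  # every node that takes part in at least one edge
```

Storing neighbours in sets keeps the "is there an edge between these two nodes?" question cheap to answer.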
Basic Graph Theory
• A path is an ordered sequence of nodes connected by edges
– Example: flight segments (legs) such as LA – Denver – Boston
• A weighted graph is one in which the edges have weights associated with them
– Example: a weight can reflect how strongly two products are associated by being purchased together
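A minimal sketch of paths and edge weights, assuming made-up flight distances as weights. The `weights` table and both helper functions are hypothetical.

```python
# Hypothetical weighted, undirected graph: weights are made-up distances in miles.
weights = {
    ("LA", "Denver"): 860,
    ("Denver", "Boston"): 1750,
}

def edge_weight(a, b):
    """Return the weight of the undirected edge a-b, or None if there is no such edge."""
    return weights.get((a, b), weights.get((b, a)))

def path_weight(path):
    """Sum the edge weights along a path; None if two consecutive nodes are not connected."""
    total = 0
    for a, b in zip(path, path[1:]):
        w = edge_weight(a, b)
        if w is None:
            return None  # not a valid path in this graph
        total += w
    return total
```

With this data, `path_weight(["LA", "Denver", "Boston"])` adds the two leg weights, while `path_weight(["LA", "Boston"])` returns `None` because no direct edge exists.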
Graph Theory Classic Problems
1. Finding a path in the graph that visits every edge exactly once (Seven Bridges of Königsberg, where the edges are bridges and the nodes are land masses)
2. Finding the shortest path that visits every node in the graph exactly once (Traveling Salesman)
– A completely connected graph with n nodes has n! (n factorial) distinct orderings of its nodes, each one a path that contains all nodes (5! = 5 * 4 * 3 * 2 * 1 = 120)
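Both classic problems can be illustrated in a few lines. The sketch below assumes Euler's well-known condition (a connected graph has a path using every edge exactly once only if zero or two nodes have odd degree) and labels the four land masses A to D. The labels and function names are illustrative.

```python
from collections import Counter
from itertools import permutations

# Seven Bridges of Königsberg as a multigraph: 4 land masses, 7 bridges.
bridges = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]

def odd_degree_nodes(edge_list):
    """Return the nodes touched by an odd number of edges."""
    degree = Counter()
    for a, b in edge_list:
        degree[a] += 1
        degree[b] += 1
    return sorted(n for n, d in degree.items() if d % 2 == 1)

def has_euler_path(edge_list):
    """Euler's condition; assumes the graph is connected."""
    return len(odd_degree_nodes(edge_list)) in (0, 2)

def count_orderings(nodes):
    """Count every ordering of the nodes by brute force (n! of them)."""
    return sum(1 for _ in permutations(nodes))
```

All four land masses have odd degree, so no walk crosses every bridge exactly once. The brute-force count shows why exhaustive search for the Traveling Salesman problem becomes impractical as n grows.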
Directed vs Undirected Graphs
• Undirected graphs – edges between nodes go in both directions (A to B; B to A)
• Directed graphs – edges between nodes go in only one direction (A to B is different from B to A)
– Ex: WWW
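The difference shows up directly in how edges are stored. In this sketch (page names and helpers are hypothetical), a directed edge is recorded only from source to destination, the way a hyperlink points from one page to another.

```python
def build_directed(edge_list):
    """Build a directed adjacency list: node -> set of nodes it points to."""
    out_links = {}
    for src, dst in edge_list:
        out_links.setdefault(src, set()).add(dst)
        out_links.setdefault(dst, set())  # make sure the target is a known node
    return out_links

def has_edge(out_links, a, b):
    """True if there is an edge from a to b (direction matters)."""
    return b in out_links.get(a, set())

# Hypothetical pages: page_a links to page_b, but not the other way round.
web = build_directed([("page_a", "page_b")])
```

An undirected graph would add the reverse entry as well, so both lookups would succeed.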
Google – Directed Graph Example
• Web pages = nodes
• Hyperlinks = edges
• Spiders and Web crawlers keep updating the graph
• Kleinberg’s Algorithm
– Hub – a page that links to many authorities
– Authority – a page that is linked to by many hubs
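The mutual definition of hubs and authorities suggests an iterative computation. The sketch below is one common way to write it (repeated score updates with normalization), not necessarily the exact formulation the chapter has in mind. The page names and the `hits` function are assumptions.

```python
import math

def hits(links, iterations=50):
    """links: dict mapping each page to the set of pages it links to.
    Returns (hub, authority) score dicts."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # A page's hub score is the sum of the authority scores of pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the scores do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical mini-web: three hub pages pointing at two candidate authorities.
links = {"h1": {"a1", "a2"}, "h2": {"a1", "a2"}, "h3": {"a1"},
         "a1": set(), "a2": set()}
hub, auth = hits(links)
```

Here `a1` ends up with the highest authority score because all three hubs point to it, and `h1` and `h2` outscore `h3` as hubs because they point to both authorities.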
Google – example continued
• Authority versus mere popularity
– Ranking a site by the number of unrelated sites linking to it yields popularity
– Ranking a site by the number of subject-related hubs that point to it yields authority
– Ranking by authority helps to overcome a situation that often arises when ranking by popularity, where the real authority (e.g., a home page) is ranked lower because the links to it lack popularity
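A deliberately simplified sketch of the contrast, assuming each linking page carries a topic label. All data and function names are hypothetical, and counting on-topic links is a stand-in for the idea, not Kleinberg's algorithm itself.

```python
from collections import Counter

# Hypothetical in-links: (linking page, topic of the linking page, target page).
inlinks = [
    ("celebrity-blog", "gossip", "fan-page"),
    ("news-site", "news", "fan-page"),
    ("shopping-site", "retail", "fan-page"),
    ("python-tutorials", "python", "python-home"),
    ("python-faq", "python", "python-home"),
]

def popularity(links):
    """Count every in-link, whatever the topic of the linking page."""
    return Counter(target for _, _, target in links)

def authority(links, topic):
    """Count only in-links that come from pages about the given topic."""
    return Counter(target for _, t, target in links if t == topic)
```

By raw link count `fan-page` ranks first, but for the topic `"python"` only `python-home` receives subject-related links, so it ranks first on authority.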
Examples of Link Analysis
• Recent Int’l Data Mining Conference
– http://www.siam.org/meetings/sdm04/
• Chapter10-Example1.pdf
• Chapter10-Example2.pdf
• Chapter10-Example3.pdf
• Megaputer (PolyAnalyst vendor) page:
– http://www.megaputer.com/products/pa/algorithms/la.php3
End of Chapter 10