Download Introduction to Spatial Data Mining Spatial Data mining

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

K-means clustering wikipedia , lookup

K-nearest neighbors algorithm wikipedia , lookup

Cluster analysis wikipedia , lookup

Transcript
Introduction to Spatial Data Mining
Instructor: Wei Ding
CS 697 Spatial Data Mining
S
Spring
2009
1
Spatial Data mining:
Find interesting,
g potentially useful,
non-trivial patterns in spatial data.
Patterns:
P
tt
:
• classification
• association analysis
anal sis
• cluster analysis
S ti l attribute
Spatial
tt ib t – geographically
hi ll referenced
f
d
• Neighborhood and extent
• Location
L ti (longitude,
(l
it d latitude,
l tit d elevation,
l ti
etc.)
t )
2
Illustrating Classification Task
Tid
Attrib1
1
Yes
Large
Attrib2
125K
Attrib3
No
Class
2
No
Medium
100K
No
3
No
Small
70K
No
4
Yes
Medium
120K
No
5
No
Large
95K
Yes
6
No
Medium
60K
No
7
Yes
Large
g
220K
No
8
No
Small
85K
Yes
9
No
Medium
75K
No
10
No
Small
90K
Yes
Learn
Model
10
Tid
Attrib1
11
No
Small
Attrib2
55K
Attrib3
?
Class
12
Yes
Medium
80K
?
13
Yes
Large
110K
?
14
No
Small
95K
?
15
N
No
L
Large
67K
?
Apply
Model
10
3
Classification
y Given a collection of records (training set)
y Each record contains a set of attributes, one of the attributes is
tthee cclass.
ass.
y Find a model for class attribute as a function of the values of
other attributes.
y Goal: previously unseen records should be assigned a class as
accurately as possible.
y A test set is used to determine the accuracy of the model.
Usually,
y the ggiven data set is divided into trainingg and test sets,
with training sets used to build the model and test set used to
validate it.
4
Classification: Example of a Decision
Tree
Tid Refund Marital
Status
Taxable
Income Cheat
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced 95K
Yes
6
No
Married
No
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
60K
Splitting Attributes
Refund
Yes
No
NO
MarSt
Single, Divorced
TaxInc
< 80K
NO
NO
> 80K
YES
10
Model: Decision Tree
Training Data
5
Decision Tree Classification Task
Tid
Attrib1
1
Yes
Large
Attrib2
125K
Attrib3
No
2
No
Medium
100K
No
3
No
Small
70K
No
4
Yes
Medium
120K
No
5
No
Large
95K
Yes
6
No
Medium
60K
No
7
Yes
Large
220K
No
8
No
Small
85K
Yes
9
No
Medium
75K
No
10
No
Small
90K
Yes
Class
Learn
Model
10
10
6
Tid
Attrib1
11
No
Small
Attrib2
55K
Attrib3
?
Class
12
Yes
Medium
80K
?
13
Yes
Large
110K
?
14
No
Small
95K
?
15
No
Large
67K
?
Apply
Model
Married
Decision
Tree
Apply Model to Test Data
Test Data
Start from the root of tree.
Refund
Yes
No
NO
Refund Marital
Status
Taxable
Income Cheat
No
80K
Married
?
10
MarSt
Married
Single, Divorced
TaxInc
NO
< 80K
> 80K
YES
NO
7
Apply Model to Test Test
Data
Data
Refund
Yes
10
No
NO
MarSt
Married
Single, Divorced
TaxInc
< 80K
NO
8
NO
> 80K
YES
Refund Marital
Status
Taxable
Income Cheat
No
80K
Married
?
Apply Model to Test Test
Data
Data
Refund
Yes
Refund Marital
Status
Taxable
Income Cheat
No
80K
Married
?
10
No
NO
MarSt
Married
Single, Divorced
TaxInc
NO
< 80K
> 80K
YES
NO
9
Apply Model to Test Test
Data
Data
Refund
Yes
10
No
NO
MarSt
Married
Single, Divorced
TaxInc
< 80K
NO
10
NO
> 80K
YES
Refund Marital
Status
Taxable
Income Cheat
No
80K
Married
?
Apply Model to Test Data
Test Data
Refund
Yes
Refund Marital
Status
Taxable
Income Cheat
No
80K
Married
?
10
No
NO
MarSt
Married
Single, Divorced
TaxInc
NO
< 80K
> 80K
YES
NO
11
Apply Model to Test Test
Data
Data
Refund
Yes
Taxable
Income Cheat
No
80K
Married
?
10
No
NO
MarSt
Married
Single, Divorced
TaxInc
< 80K
NO
12
Refund Marital
Status
NO
> 80K
YES
Assign Cheat to “No”
Association Rule Mining
y Given a set of transactions,
transactions find rules that will predict the
occurrence of an item based on the occurrences of other
te s in the
t e transaction
t a sact o
items
Market-Basket transactions
TID
Items
1
Bread, Milk
2
3
4
5
Bread, Diaper, Beer, Eggs
Milk, Diaper, Beer, Coke
Bread Milk,
Bread,
Milk Diaper,
Diaper Beer
Bread, Milk, Diaper, Coke
Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs,Coke},
{Beer, Bread} → {Milk},
Implication means co-occurrence,
co occurrence not
causality!
13
Interpret an Association Rule
y {Diaper} → {Beer}
y The rule suggests that a strong relationship exists
bbetween
t
the
th sale
l off di
diapers andd bbeer bbecause many
customers who buy diapers also buy beer.
y Retailers can use this type of rules to help them identify
new opportunities for cross-selling their products to the
customers.
14
Spatial Association Rules
y Earth science data
y Association patterns may reveal interesting connections among
tthee ocean,
ocea , land,
a , aand at
atmospheric
osp e c pprocesses
ocesses
y Criminal data
y Association patterns may reveal criminal behavior between
ppeople
p and their environment
15
Evaluation of Association Patterns
y Association analysis algorithms have the potential to generate
a large number of patterns
y Thousands or even millions of patterns from real commercial
databases
y Problem: How to identify the most interesting ones?
y One person’s trash might be another person’s treasure
y Objective interestingness measure: statistical arguments
y Subjective interestingness measure: subjective arguments
y {Butter}Æ{Bread},
{Butter}Æ{Bread} not interesting because it is expected
y {Diaper}Æ{Beer}, interesting because it is quite unexpected
16
Example of Cluster Analysis: Spatial Cluster
The 1854
Asiatic
Cholera in
London
A cluster whose
centroid is a water
pump
17
Cluster Analysis
y Cluster analysis divides data into groups (clusters) that are
meaningful, useful, or both.
y Clustering for understanding (be meaningful): the clusters
should capture the natural structure of the data.
y Clusteringg for utilityy (be
( useful):
) cluster analysis
y provides
p
an
abstraction from individual data objects to the clusters in which
those data objects reside.
18
What is Cluster Analysis
y Finding groups of objects such that the objects in a group will
be similar (or related) to one another and different from (or
uunrelated
e ate to) the
t e objects
o jects in ot
other
e ggroups
oups
Intra-cluster
di t
distances
are
minimized
Inter-cluster
distances are
maximized
i i d
19
Distance
y Not necessarily be the Euclidean Distance.
Distance
y Euclidean distance is the “ordinary” distance between two points
tthat
at one
o e would
ou measure
easu e with
t a ruler.
ue.
y Different measures of distance with respect to different
views.
views
20
Notion of Clusters Can be Ambiguous
How many clusters?
Six Clusters
Two Clusters
Four Clusters
Different measures of distance!
21
Applications of Cluster Analysis
y Understanding
y Group
G
related
l t d ddocuments
t ffor
browsing, group genes and proteins
that have similar functionality, or
group stocks with similar price
fluctuations
y Summarization
y Reduce the size of large data sets
Discovered Clusters
1
2
3
4
Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN,
Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,
DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN,
Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down,
Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN,
Sun-DOWN
Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN,
ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,
Computer-Assoc-DOWN,Circuit-City-DOWN,
p
y
Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,
Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN
Fannie-Mae-DOWN,Fed-Home-Loan-DOWN,
MBNA-Corp-DOWN,Morgan-Stanley-DOWN
Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,
Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP,
Schlumberger-UP
Clustering precipitation
in Australia
Industry Group
Technology1-DOWN
Technology2 DOWN
Technology2-DOWN
Financial-DOWN
Oil-UP
Summary
y Spatial Data Mining is a technology that blends traditional
data analysis methods with sophisticated algorithms for
pprocessing
ocess g large
a ge vo
volumes
u es of
o spat
spatial
a data.
ata.
y It has opened up exciting opportunities for exploring and
analyzing new types of data and for analyzing old types of
data in new ways.
23