Spatial Data Mining
CS 697
Assignment 1
February 16, 2010
Pradnya Khutafale, Peter Lucas,
and Chris Maio
Advisor: Dr. Wei Ding
Computer Science Department
UMass Boston
Discovery of
Climate Indices
using Clustering
Principal Investigators
Vipin Kumar (University of Minnesota)
Michael Steinbach (University of Minnesota)
Collaborators
Steven Klooster (Cal. State Univ, Monterey Bay)
Christopher Potter (NASA Ames Research Center)
Pang-Ning Tan (Michigan State University)
Researchers
Department of Computer Science
and Engineering
Michael Steinbach
Pang-Ning Tan
Vipin Kumar



Leading educators in the field of
spatial data mining
Investigating the use of data mining
techniques to find interesting spatiotemporal patterns in Earth Science data
Regarded as leaders in the field of
climate indices identification and
data mining research
Discovery of Climate Indices using Clustering
Researchers
NASA Ames Research Center
team members:
Chris Potter
Steven Klooster
Working on cutting edge
computer science methods
and technologies to be
utilized for finding solutions
to complex environmental
problems.
Presentation Outline

Background: (Chris)


Climate Change
Earth System Linkages

Earth Science Data and Climate Indices (Chris)

Existing Eigenvalue Techniques and Limits (Pete)

New Clustering Based Methodology (Pete)

Results and Comparisons (Pradnya)

Conclusions and Future Research (Pradnya and Pete)
Background
Climate Change
Extinctions of plants and animals
Sea-level Rise
Rise in global temperatures
IPCC Predictions
Background
Climate Change Impacts



Climate change leads to
significant changes in rainfall
and soil moisture (drought and
flood)
Agricultural activities (crop
growth cycle) and world food
supplies are affected greatly by
climatic factors (desertification)
Climate change increases the
frequency, intensity, and
distribution of natural hazards,
such as hurricanes and other
storms
Background
Earth System Linkages



Ocean, atmosphere, and
land processes are highly
coupled
Climate phenomena in one
location can affect the
climate at a faraway
location; these links are known as
climate teleconnections
Understanding climate
“teleconnections” is key to
knowing and predicting
ecosystem response to
climate change
Earth Science Data
Time Series Data


Sea Surface
Temperature (SST)
Sea Level Pressure (SLP)
Earth Science Data
Data Acquisition
There are thousands of floats, buoys, and other remote sensing devices throughout the
oceans collecting an enormous amount of oceanographic data, which is periodically transmitted to
shore via satellite (Naval Research Laboratory).
Discovery of Climate Indices using Clustering
12
Earth Science Data
Preprocessing Required

Spatial and temporal nature
of data poses a number of
challenges

Noisy

Cycles of varying lengths
and regularity

Strong seasonal component

Displays long term trends

Displays temporal and
spatial Autocorrelation
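
One common way to handle the strong seasonal component is to standardize each calendar month separately, so that every January is compared only with other Januaries. The sketch below assumes a monthly series that starts in January; the function name `monthly_zscore` is hypothetical, and the slides do not say which preprocessing the researchers actually applied.

```python
from statistics import mean, pstdev

def monthly_zscore(series, period=12):
    """Remove the seasonal cycle by standardizing each calendar month on its own.

    Assumes the series is sampled monthly and starts in the first month
    of the cycle (an assumption made for this illustration).
    """
    out = [0.0] * len(series)
    for m in range(period):
        vals = series[m::period]          # all values for calendar month m
        mu = mean(vals)
        sd = pstdev(vals)
        for k, v in enumerate(vals):
            # A month with no variability carries no anomaly signal.
            out[m + k * period] = (v - mu) / sd if sd > 0 else 0.0
    return out
```

After this step, what remains in each series is the departure from the usual value for that time of year, which is the part that clustering or SVD should be looking at.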
Climate Indices




Climate Indices = Data time
series that summarize physical
behavior of different regions of
ocean and atmosphere
Distill climate variability at
regional or global scale into a
single and manageable time
series
Usually based on sea level
pressure and sea surface
temperature
Past methods of identifying indices were
painstakingly slow and tedious
Climate Indices
Climate Index: Nino 1+2
Climate Indices
El Nino Correlations
SST of El Nino correlated indices
Climate Indices
Detection of Climate Indices



Earth Scientists have devoted a significant
amount of time discovering climate indices
Traditional approaches include direct
observation of climate phenomena (El
Nino)
Use of linear algebra techniques including
eigenvalue analysis
Climate Indices
Eigenvalue Analysis


Driven by the massive amount of
data obtained from satellites
and remote sensing devices
Provides a way to quickly and
automatically detect patterns
in large amounts of data
Jason-2 IR satellite image
Climate Indices
Eigenvalue Analysis

Eigenvalue techniques include:



Principal Component Analysis (PCA)
Singular Value Decomposition (SVD)
Limitations of Eigenvalue Analysis


Weaker signals may be masked by stronger signals
All discovered signals must be orthogonal to each
other, making it difficult to attach a physical
interpretation to them
Climate Indices
Alternative Clustering Methodology




Uses data mining
techniques and an enormous
amount of remote sensing
data to find climate indices
Analysis yields clusters
that represent ocean
regions with relatively
homogeneous behavior
Centroids of these areas
summarize the behavior
of a particular region
Finding “meaningful”
clusters will enable Earth
Scientists to better predict
changes in climate system
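
To make the idea of a centroid concrete: if a cluster is a set of ocean grid points, its centroid can be taken as the point-by-point average of their time series, and that single series is the candidate index. This is a minimal sketch under that assumption; `cluster_centroid` and the toy data are hypothetical.

```python
def cluster_centroid(series_by_point, members):
    """Average the time series of all member points, one time step at a time."""
    length = len(series_by_point[members[0]])
    return [
        sum(series_by_point[m][t] for m in members) / len(members)
        for t in range(length)
    ]

# Toy data: three grid points with three time steps each (illustrative values).
sst = {"a": [1, 2, 3], "b": [3, 4, 5], "c": [10, 10, 10]}
candidate_index = cluster_centroid(sst, ["a", "b"])
```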
Climate Indices
Benefits of Clustering

Discovered signals do not need to be orthogonal
or statistically independent of one another

Signals are more easily interpreted

Weaker signals are more readily detected

It provides an efficient way to determine the
influence of a large set of points (all ocean points) on
another large set of points (all land points)
Climate Indices
Results of Clustering Methodology



Candidate indices highly
correlated with known
indices, representing
rediscovery of well-known
indices and
validation of the method
Variants to well-known
indices which may be
better predictors of land
behavior for some
regions of land
Cluster centroids that
have medium or low
correlation with known
indices may represent
new Earth science
phenomena
Eigenvalue Techniques
Finding Spatial or Temporal
Patterns using SVD Analysis
SVD: Singular Value Decomposition



Earth Scientists typically used SVD
analysis to identify climate indices
Goal : To find a new set of attributes that
better describe variability in the data,
through dimensionality reduction
Its operation can be thought of as
revealing the internal structure of the
data in a way which best explains the
variance in the data
Karl Pearson, Statistician
1857 – 1936
Eigenvalue Techniques
Overview of SVD Analysis

These techniques are applied to a
data set in the form of a data
matrix (m by n)

m rows (objects)

n columns (attributes)

Data Matrix: a variation of
record data in that it consists
of all numeric attributes
Example of a data matrix
Eigenvalue Techniques
Overview of SVD Analysis



Assume the data objects in a
matrix all have the same
fixed set of attributes
Each data object can be
thought of as a point, or
vector, in multidimensional
space
Each spatial dimension
represents a distinct
attribute describing the
object
Simple Example of SVD Analysis


An intuitive explanation of SVD is hard to find on the web
Again, SVD is a way to expose the underlying structure of a matrix
Simple example using golf: 3 golfers play 9 holes and each makes par on every hole
How can we predict a player's score on a given hole?





Assume two vectors, Player Ability and Hole Difficulty
Predicted score = Player Ability * Hole Difficulty
With holes as rows and golfers as columns of the score matrix:
Hole Difficulty is the left singular vector
Player Ability is the right singular vector
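
The golf example can be checked numerically. The sketch below uses NumPy's `numpy.linalg.svd`; the par values are made up for illustration (the slides do not list them), and the matrix is laid out with holes as rows and golfers as columns so that the left singular vector plays the role of hole difficulty.

```python
import numpy as np

# Hypothetical par values for the 9 holes (illustrative, not from the slides).
par = np.array([4, 5, 3, 4, 4, 4, 5, 3, 4], dtype=float)

# Rows are holes, columns are the 3 golfers; everyone makes par on every hole.
scores = np.column_stack([par, par, par])

U, s, Vt = np.linalg.svd(scores, full_matrices=False)

hole_difficulty = U[:, 0]   # first left singular vector, one entry per hole
player_ability = Vt[0, :]   # first right singular vector, one entry per golfer

# Rank-1 reconstruction: score = hole difficulty * scale factor * player ability.
predicted = s[0] * np.outer(hole_difficulty, player_ability)
```

Because every golfer has the same scores, the matrix has rank 1: a single pair of singular vectors reproduces it exactly and the remaining singular values are numerically zero.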

Eigenvalue Techniques
Finding Spatial or Temporal
Patterns using SVD Analysis



Given a data matrix whose rows consist of time series from
various points on the globe, the objective is to discover the
strong temporal or spatial patterns in the data
SVD decomposes the matrix into two sets of patterns, which
correspond to a set of spatial patterns (left singular
vectors) and a set of temporal patterns (right singular
vectors).
We can plot the temporal patterns as regular line plots and the
spatial patterns on a spatial grid to visualize these
patterns.
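
The same decomposition on a space-by-time matrix can be sketched with synthetic data. Everything below is an assumption made for illustration: the data are simulated (one oscillation whose strength varies by location, plus noise), and the variable names are hypothetical. Real SST or SLP data would replace `data`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_months = 50, 120

# Synthetic stand-in for real data: a single oscillation seen with a
# different strength (and sign) at each location, plus measurement noise.
t = np.arange(n_months)
signal = np.sin(2 * np.pi * t / 48)            # true temporal pattern
loading = np.linspace(-1.0, 1.0, n_points)     # true spatial pattern
data = np.outer(loading, signal) + 0.1 * rng.standard_normal((n_points, n_months))

# Rows are locations, columns are time steps; center each location's series.
data = data - data.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(data, full_matrices=False)
spatial_pattern = U[:, 0]       # one weight per location (plot on a map)
temporal_pattern = Vt[0, :]     # one value per month (plot as a line)
explained = s**2 / np.sum(s**2) # share of variance captured by each pattern
```

The sign of a singular vector pair is arbitrary, so a recovered pattern may appear flipped relative to the true one; only the product of the spatial and temporal patterns is determined.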
Eigenvalue Techniques
Example : Plotting SST
(Sea Surface Temp)
Strongest spatial pattern of SST
Temporal pattern of SST (blue)
plotted against the NINO4 index
(green)
Eigenvalue Techniques
Limitations of SVD Analysis



Only useful for finding a few of the strongest
signals
Smaller patterns in data may be obscured
Signals must be orthogonal to each other
(uncorrelated)

May not identify all patterns in data

Efficiency can be a concern
Clustering Methods
Clustering Based Methodology for
the Discovery of Climate Indices

Two key steps for finding climate indices
1. Find candidate indices using clustering
2. Evaluate these candidate indices for Earth
Science significance
Clustering method used for this study:
SNN clustering algorithm
“Shared Nearest Neighbor”
Clustering Methods
Finding Candidate Indices
Using Clustering
SNN Clustering Algorithm


First finds the nearest neighbors of
each data point
Next, redefines the similarity between
pairs in terms of how many nearest
neighbors the two points share

Using this definition of similarity the
algorithm identifies core points

These Core Points are used to build
clusters

SNN clustering can reach time complexity
O(n*log(n)) when neighbors are found with a spatial index (O(n^2) by brute force)
Graph of functions n(log n) and n
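
The steps listed above can be sketched in a few functions. This is a simplified illustration, not the researchers' implementation: it finds neighbors by brute force (which is quadratic in the number of points), it leaves non-core points unlabeled (`-1`), and the parameter names `k`, `eps` and `min_pts` are hypothetical.

```python
from math import dist

def knn_lists(points, k):
    """k nearest neighbors of every point, by brute force (O(n^2))."""
    neighbors = []
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: dist(p, points[j]))
        neighbors.append(set(order[:k]))
    return neighbors

def snn_similarity(neighbors, i, j):
    """Redefined similarity: how many nearest neighbors two points share."""
    return len(neighbors[i] & neighbors[j])

def snn_clusters(points, k, eps, min_pts):
    """Label core points by cluster; non-core points keep the label -1."""
    n = len(points)
    nbrs = knn_lists(points, k)
    # Neighbors whose shared-neighbor similarity reaches the threshold eps.
    strong = [{j for j in nbrs[i] if snn_similarity(nbrs, i, j) >= eps}
              for i in range(n)]
    # Core points have at least min_pts such strong neighbors.
    core = [i for i in range(n) if len(strong[i]) >= min_pts]
    core_set = set(core)

    labels = [-1] * n
    cluster_id = 0
    for start in core:
        if labels[start] != -1:
            continue
        # Grow a cluster by following strong links between core points.
        labels[start] = cluster_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in strong[i]:
                if j in core_set and labels[j] == -1:
                    labels[j] = cluster_id
                    stack.append(j)
        cluster_id += 1
    return labels
```

For climate data the "points" would be grid-point time series and the distance would more naturally be based on correlation; Euclidean distance is used here only to keep the sketch short.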
Clustering Methods
Evaluation of Candidate
Indices




Indices must be evaluated in terms of Earth Science
significance
(meaning the strength of the association between the
behavior of a candidate index and land climate)
Goal is to find a numerical measure of the strength of the
association between the behavior of an index and land
climate
To evaluate the influence of climate indices on land, the
researchers use Area-Weighted Correlation
Definition: The weighted average of the correlation of
the candidate index with all land points, where the weight
is based on the area of the land grid point
Clustering Methods
Calculating Area-weighted Correlation



Step 1: Compute the correlation of the time series of the
candidate index with the time series associated with
each land point
Step 2: Compute the weighted average of the correlations,
where the weight associated with each land point is its area
The resulting area-weighted correlation
is at most 1 and at least 0, which suggests that the magnitudes
(absolute values) of the correlations are averaged
General formula for a weighted average:
$\mathrm{WA} = \frac{\sum_c W_c M_c}{\sum_c W_c}$
W_c = weight of each value M_c
M_c = some value to average
General correlation index: 1 is the strongest
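
The two steps above can be sketched as follows. The sketch averages the absolute value of each correlation, which is an assumption inferred from the stated range of 0 to 1; the function names and the toy numbers are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def area_weighted_correlation(index, land_series, areas):
    """Step 1: correlate the index with every land point.
    Step 2: average the correlation magnitudes, weighting each point by its area."""
    corrs = [abs(pearson(index, s)) for s in land_series]
    return sum(a * c for a, c in zip(areas, corrs)) / sum(areas)

# Toy usage: one candidate index and two land points of equal area.
index = [1, 2, 3, 4]
land = [[2, 4, 6, 8], [1, -1, 1, -1]]
score = area_weighted_correlation(index, land, [1.0, 1.0])
```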
Clustering Methods
Comparison of Area-Weighted
Correlations



Development of a baseline
to compare the values of
area-weighted
correlations of candidate
indices
Histogram of area-weighted
correlation of
1000 random time series
No random time series has an
area-weighted correlation > 0.1.
This value is the baseline that
indicates whether a candidate
index is a good one
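
The baseline idea can be sketched as a small simulation: score many random series with the same area-weighted correlation and take the largest score as the threshold a real candidate has to beat. The Gaussian noise model, the function names, and the use of absolute correlations are assumptions for this illustration; the 0.1 value on the slide comes from the researchers' own data and is not reproduced here.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / sqrt(sxx * syy)

def area_weighted_correlation(index, land_series, areas):
    """Area-weighted average of the correlation magnitudes with all land points."""
    corrs = [abs(pearson(index, s)) for s in land_series]
    return sum(a * c for a, c in zip(areas, corrs)) / sum(areas)

def random_baseline(land_series, areas, n_random=1000, seed=0):
    """Score n_random Gaussian noise series; return (largest score, all scores)."""
    rng = random.Random(seed)
    length = len(land_series[0])
    scores = []
    for _ in range(n_random):
        noise = [rng.gauss(0, 1) for _ in range(length)]
        scores.append(area_weighted_correlation(noise, land_series, areas))
    return max(scores), scores
```

A candidate index whose score is clearly above the largest random score is unlikely to be related to land climate by chance alone.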
Clustering Methods
Validation of Comparison
Baseline



Shown below are the area-weighted correlations of 11 known
indices
Note that 10 of the 11 indices have an area-weighted correlation
> 0.1
If a candidate index shows an area-weighted correlation > 0.1,
investigate it further
Graph of area-weighted
correlation of
well-known climate indices
Results
SST Based Candidate Indices



Used SST data over the time period from 1958
to 1998 and applied SNN clustering
Obtained 107 clusters
Cluster centroids were used to categorize
clusters into G0,G1,G2 and G3 groups
depending on their correlation to known
indices
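
The grouping rule can be written as a small function using the thresholds given on the results slides (0.8, 0.4 and 0.25). How the exact boundary values are assigned, and the use of the largest absolute correlation with any known index, are assumptions made for this sketch; the function names and sample values are hypothetical.

```python
def best_correlation(corr_with_known):
    """Largest correlation magnitude between a centroid and any known index."""
    return max(abs(c) for c in corr_with_known.values())

def assign_group(best_corr):
    """Map a centroid's best correlation to the groups G0 to G3."""
    c = abs(best_corr)
    if c >= 0.8:
        return "G0"   # rediscovered well-known index
    if c >= 0.4:
        return "G1"   # variant of a known index
    if c > 0.25:
        return "G2"   # weakly related, possibly new
    return "G3"       # essentially unrelated to known indices

# Illustrative correlations of one centroid with two known indices.
example = {"SOI": -0.85, "NINO1+2": 0.42}
group = assign_group(best_correlation(example))
```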
Results
107 Sea Surface Temperature
(SST) Clusters


Find correlation
with known
indices such as SOI,
NINO 1+2, etc.
Find Area
Weighted
correlation with
land
Results
SST Cluster Correlation
Correlation between known indices with SST cluster centroids
and SVD Components
Results
G0: Clusters with correlation to known
indices >= 0.8

Very highly correlated
Rediscovered well-known indices
Serve to validate the approach
Figure labels: NINO 1+2, NINO 3, NINO 3.4, NINO 4
Results
G0: SST Cluster Correlation
Correlation between known indices with SST cluster centroids
and SVD Components
Results
G1: Clusters with correlation to known
indices from 0.4 to 0.8
Results
G1: Cluster 29 vs. El Nino Indices
Cluster 29
Results
G2: Clusters with correlation to known
indices from 0.25 to 0.4



Less correlated
May represent new earth science
phenomena
May be new index
Results
Cluster 62 vs. El Nino Indices
Cluster 62
Results
G3: Clusters with correlation to known
indices <= 0.25



Less correlated
May represent new earth science
phenomena or weaker version of
known phenomena
New index
Results
SLP-Based Candidate Indices
SLP data over the time period from
1958 to 1998
Correlation measured as the difference
of all pairs of cluster centroids
Negatively correlated pairs are interesting
candidates
25 clusters found

25 Sea Level Pressure Based Clusters
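
One reading of the SLP procedure is: correlate every pair of cluster centroids, keep the pairs that are negatively correlated, and use the difference of the two centroids as the candidate series. The sketch below follows that reading, which is an interpretation of the slide rather than a confirmed description of the method; the function names and toy centroids are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / sqrt(sxx * syy)

def negative_pairs(centroids):
    """Return (i, j, correlation, difference series) for negatively
    correlated centroid pairs, most negative first."""
    pairs = []
    for i, j in combinations(range(len(centroids)), 2):
        r = pearson(centroids[i], centroids[j])
        if r < 0:
            diff = [a - b for a, b in zip(centroids[i], centroids[j])]
            pairs.append((i, j, r, diff))
    return sorted(pairs, key=lambda p: p[2])

# Toy centroids (illustrative values, not real SLP data).
centroids = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 2, 3, 5]]
candidates = negative_pairs(centroids)
```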
Results
SLP Clusters Pairwise
Correlation
Note: Only negative correlation values shown
Comparisons
Comparison with SVD based
Indices
Correlation of Cluster Centroids with
land temperature
Correlation of first 30 SVD components
with land temperature
Comparisons
SST Clusters : Performance
Comparison
Correlation for known indices with SST cluster centroids and SVD
components
Comparisons
SLP Clusters : Performance
Comparison
Comparisons
SLP clusters Performance
Comparison
Area-weighted correlation for known indices with SLP cluster centroids
and SVD components
Conclusions






Demonstrated that clustering is a viable
alternative to eigenvalue based approach for
the discovery of climate indices
Can replicate many well-known climate
indices
Have also discovered variants of known
indices that may be “better” for some regions
Some indices may represent new Earth
Science phenomena
No need for discovered indices to be
orthogonal
No need to pre-select the area to analyze
Future Work




Investigation of candidate indices by Earth
Scientists
Investigate whether there are climate
indices that cannot be represented by
clusters
Noise elimination and other preprocessing
improvements
Aggregation
QUESTIONS ???