Discovery of Patterns in the Global Climate System using Data Mining
Vipin Kumar
Army High Performance Computing Research Center
Department of Computer Science
University of Minnesota
http://www.cs.umn.edu/~kumar
Research sponsored by AHPCRC/ARL, DOE, NASA, and NSF
© Vipin Kumar
August 20, 2003
What is Data Mining?

Many Definitions
– Non-trivial extraction of implicit, previously unknown and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
What is (not) Data Mining?

What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”

What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, … in Boston area)
– Group together similar documents returned by search engine according to their context (Amazon rainforest, Amazon.com, etc.)
Why Mine Data? Commercial Viewpoint

Lots of data is being collected and warehoused
– Web data: Yahoo! collects 10GB/hour
– Purchases at department/grocery stores: Walmart records 20 million transactions per day
– Bank/Credit Card transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Scientific Viewpoint

Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite: NASA EOSDIS archives over 1 petabyte of Earth Science data per year
– telescopes scanning the skies: Sky survey data
– gene expression data
– scientific simulations: terabytes of data generated in a few hours
Traditional techniques infeasible for raw data
Data mining may help scientists
– in automated analysis of massive data sets
– in hypothesis formation
Mining Large Data Sets - Motivation
There is often information “hidden” in the data that is not readily evident
Human analysts may take too long to discover useful information
Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 compared with number of analysts, 1995 to 1999; y-axis from 0 to 4,000,000]
Ref: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications
Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
[Figure: Data Mining shown at the overlap of Statistics/Machine Learning/AI/Pattern Recognition and Database systems]
Role of Parallel & Distributed Computing


High Performance Computing (HPC) is often critical for scalability to large data sets
– Many algorithms use more than O(n) computation time
– Sequential computers have limited memory, thus requiring multiple, expensive I/O passes over data
Distributed computing is needed because data is distributed
– due to privacy reasons
– physically dispersed over many different geographic locations
[Figure: Data Mining shown at the overlap of Statistics/Machine Learning/AI/Pattern Recognition, Database systems, and High Performance Computing]
Data Mining Tasks...
Data

Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes
11 | No | Married | 60K | No
12 | Yes | Divorced | 220K | No
13 | No | Single | 85K | Yes
14 | No | Married | 75K | No
15 | No | Single | 90K | Yes
Predictive Modeling

Find a model for class attribute as a function of the values of other attributes

Tid | Refund | Marital Status | Taxable Income | Evade
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes

[Figure: Learn Classifier. The table is used to learn a model for predicting tax evasion: a decision tree that tests Married first, then Income against 100K and 80K, with leaves NO and YES]
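As a rough illustration of what "a model for the class attribute as a function of the other attributes" can look like, the sketch below hard-codes a small decision tree and checks it against the ten training records from the slide. The comparison operators in the slide's tree did not survive extraction, so the tests used here (income above 100K, income below 80K) are an assumption chosen because they agree with all ten records. Record 6's income is also assumed to be the stray "60K" in the transcript, and the function names are illustrative.

```python
# Training records from the slide: (refund, marital status, income in K, evade).
# Assumption: record 6's income is the stray "60K" found in the transcript.
RECORDS = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def predict_evade(marital_status, income_k):
    """Hand-written decision tree; the thresholds are an assumed reading."""
    if marital_status == "Married":
        return "No"
    if income_k > 100:  # assumed test: Income <= 100K continues down the tree
        return "No"
    if income_k < 80:   # assumed test: Income >= 80K leads to YES
        return "No"
    return "Yes"


def training_accuracy(records):
    """Fraction of records whose label the tree reproduces."""
    hits = sum(predict_evade(m, inc) == label for _, m, inc, label in records)
    return hits / len(records)
```

Calling `training_accuracy(RECORDS)` measures fit on the training data only; it says nothing about how such a tree would generalize to unseen records.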
Predictive Modeling: Applications

Targeted Marketing

Customer Attrition/Churn

Classifying Galaxies
Class:
• Stages of Formation (Early, Intermediate, Late)
Attributes:
• Image features,
• Characteristics of light waves received, etc.
Sky Survey Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Courtesy: http://aps.umn.edu
Clustering

Given a set of data points, find groupings such that
– Data points in one cluster are more similar to one another
– Data points in separate clusters are less similar to one another
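One common way to obtain such groupings is k-means, sketched below in plain Python. This is only an illustration of the clustering task as defined above, not the technique used later in the talk. Taking the first k points as initial centroids is a simplifying assumption, and the function name is illustrative.

```python
import math


def kmeans(points, k, n_iter=20):
    """Plain k-means on a list of equal-length numeric tuples."""
    # Assumption: the first k points serve as initial centroids; real code
    # would usually pick them at random or with a smarter seeding scheme.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids
```

For two well-separated groups of points, `kmeans(points, 2)` returns one label per point plus the two centroids. Euclidean distance is used here as the similarity measure, which is itself a choice the definition above leaves open.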
Clustering: Applications

Market Segmentation

Gene expression clustering

Document Clustering
Category | Total Articles | Correctly Placed
Financial | 555 | 364
Foreign | 341 | 260
National | 273 | 36
Metro | 943 | 746
Sports | 738 | 573
Entertainment | 354 | 278
Association Rule Discovery


Given a set of records, find dependency rules which will predict occurrence of an item based on occurrences of other items in the record

TID | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Applications
– Marketing and Sales Promotion
– Supermarket shelf management
– Inventory Management
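The two rules listed above can be checked against the five transactions with the standard support and confidence measures. The sketch below assumes the usual definitions (support is the fraction of transactions containing all items of the rule; confidence is that support divided by the support of the left-hand side). The helper names are illustrative.

```python
# The five transactions from the slide.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Estimated P(rhs | lhs): support of the whole rule over support of lhs."""
    both = set(lhs) | set(rhs)
    return support(both, transactions) / support(lhs, transactions)
```

For {Milk} --> {Coke}, three of the five transactions contain both items and four contain Milk, so support is 3/5 and confidence is 3/4. For {Diaper, Milk} --> {Beer}, support is 2/5 and confidence is 2/3.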
Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection
Typical network traffic at University level may reach over 100 million connections per day
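A very simple way to make "significant deviation from normal behavior" concrete is a z-score test, sketched below. It is one possible baseline under the assumption that normal values cluster around a mean, and is not a description of the fraud or intrusion detectors mentioned above. The function name and the default threshold of 3 are illustrative.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Indices of values lying more than `threshold` std devs from the mean."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical: nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mean) / spread > threshold]
```

For example, twenty observations of 10 followed by a single 100 flag only the last index. With heavy-tailed data, such as connection counts, a robust statistic like the median would likely be a better choice than the mean.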
Discovery of Patterns in the Earth Science Data
NASA ESE questions:
– How is the global Earth system changing?
– What are the primary forcings?
– How does Earth system respond to natural & human-induced changes?
– What are the consequences of changes in the Earth system?
– How well can we predict future changes?
Global snapshots of values for a number of variables on land surfaces or water
Data sources:
– weather observation stations
– earth orbiting satellites (since 1981)
– model-based data
[Figure: data layout with axes Latitude, Longitude, and Time; labels: grid cell, zone, NPP, Pressure, Precipitation, SST]
Climate Indices:
Connecting the Ocean/Atmosphere and the Land

A climate index is a time series of sea surface temperature or sea level pressure
Climate indices capture teleconnections
– The simultaneous variation in climate and related processes over widely separated points on the Earth
[Figure: Correlation Between ANOM 1+2 and Land Temp (>0.2); map of latitude versus longitude with a correlation scale from -0.8 to 0.8]
[Figure: Nino 1+2 Index time series with El Nino Events marked]
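The correlation map described above can be sketched as follows: correlate one index time series with the temperature series of every land grid cell and keep the cells above a cutoff. The data layout (a dict from cell coordinates to a monthly series) and the use of the absolute value with the 0.2 cutoff are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length series (neither constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlated_cells(index, grid, min_abs_corr=0.2):
    """Return {cell: r} for cells whose |r| with the index exceeds the cutoff.

    `grid` maps a hypothetical (lat, lon) key to that cell's time series.
    """
    result = {}
    for cell, series in grid.items():
        r = pearson(index, series)
        if abs(r) > min_abs_corr:
            result[cell] = r
    return result
```

A teleconnection then shows up as a group of cells far from the index region whose correlation with the index is clearly nonzero.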
Discovery of Climate Indices Using Clustering
[Figure: SST Clusters With Relatively High Correlation to Land Temperature; map of latitude versus longitude with clusters 29, 67, 75, 78, and 94 labelled]

A novel clustering technique was developed to identify regions of uniform behavior in spatio-temporal data. The use of clustering for discovering climate indices is driven by the intuition that a climate phenomenon is expected to involve a significant region of the ocean or atmosphere where the behavior is relatively uniform over the entire area.

A cluster-based approach for discovering climate indices provides better physical interpretation than those based on the SVD/EOF paradigm, and provides candidate indices with better predictive power than known indices for some land areas.

Some SST clusters reproduce well-known climate indices. In particular, we were able to replicate the four El Nino SST-based indices: cluster 94 corresponds to NINO 1+2, 67 to NINO 3, 78 to NINO 3.4, and 75 to NINO 4. The correlations of these clusters to their corresponding indices are higher than 0.9.

Some SST clusters, e.g., cluster 29, are significantly different from known indices, but provide better correlation with land climate variables than known indices for many parts of the globe. The bottom figure shows the difference in correlation to land temperature between cluster 29 and the El Nino indices. Areas in yellow indicate where cluster 29 has higher correlation.

[Figure: Cluster 29 versus El Nino Indices; map of latitude versus longitude with a correlation-difference scale from -0.6 to 0.6]
Mining the Climate Data: Clustering
# grid points: 67K Land, 40K Ocean
Current data size range: 20 – 400 MB
Monthly data over a range of 17 to 50 years

El Nino Regions Defined by Earth Scientists

Niño Region | Range Longitude | Range Latitude
1+2 (94) | 90°W-80°W | 10°S-0°
3 (67) | 150°W-90°W | 5°S-5°N
3.4 (78) | 170°W-120°W | 5°S-5°N
4 (75) | 160°E-150°W | 5°S-5°N

Cluster | Nino Index | Correlation
94 | NINO 1+2 | 0.9225
67 | NINO 3 | 0.9462
78 | NINO 3.4 | 0.9196
75 | NINO 4 | 0.9165

[Figure: Clusters of SST that have high impact on land temperature]
SST Cluster Moderately Correlated to
Known Indices
Ref: Steinbach et al 2002/2003 (KDD 2003)
[Figure: Cluster 62; map of latitude versus longitude]
[Figure: Cluster 62 - SOI ANOM12 ANOM3 ANOM4 ANOM34 (mincorr = 0.20); map of latitude versus longitude with a correlation scale from -0.6 to 0.6]
Correlation of Known Indices with SST Cluster Centroids and SVD Components

Climate Index | Cluster Centroids: Best-shifted Correlation | Cluster Centroids: Best Centroid | SVD Components: Best SVD Correlation | SVD Components: Best Component
SOI | -0.7006 | 75 (G0) | -0.5427 | 3
NAO | -0.2973 | 19 (G2) | 0.1774 | 8
AO | -0.2383 | 29 (G1) | 0.2301 | 8
PDO | 0.5172 | 20 (G1) | -0.4684 | 7
QBO | -0.2675 | 20 (G1) | 0.3187 | 11
CTI | 0.9147 | 67 (G0) | 0.6316 | 3
WP | 0.2590 | 78 (G0) | 0.1904 | 3
NINO1+2 | 0.9225 | 94 (G0) | -0.5419 | 1
NINO3 | 0.9462 | 67 (G0) | -0.6449 | 1
NINO3.4 | 0.9196 | 78 (G0) | -0.6844 | 1
NINO4 | 0.9165 | 75 (G0) | -0.6894 | 1
SLP Clusters
[Figure: map of SLP clusters with labels NAO, AO, SOI, SOI, and DMI]
Pair of SLP Clusters that Correspond to SOI
[Figure: Centroids of SLP clusters 13 and 20; time series of Centroid 20 and Centroid 13, x-axis years 87 to 99, y-axis from -3 to 3]
[Figure: Cluster centroid 20 – 13 versus SOI; time series of Centroid 13 - 20 and SOI, x-axis years 87 to 99, y-axis from -3 to 3]
Correlation = 0.75
Finding New Patterns: Indian Monsoon Dipole Mode Index

Recently a new index, the Indian Ocean Dipole Mode index (DMI), has been discovered.

DMI is defined as the difference in SST anomaly between the region 5S-5N, 55E-75E and the region 0-10S, 85E-95E.

DMI is an indicator of a weak monsoon over the Indian subcontinent and heavy rainfall over East Africa.

We can reproduce this index as a difference of pressure indices of clusters 16 and 22.
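The DMI definition above translates directly into code: average the SST anomaly over each box and subtract. The sketch below assumes a hypothetical data layout (a dict from (lat, lon) in degrees, south and west negative, to a monthly anomaly series) and an unweighted average over cells. It also includes a trailing 12-month moving average of the kind commonly used to smooth such indices.

```python
def box_mean(anomalies, lat_range, lon_range):
    """Month-by-month mean of all cells inside a lat/lon box (unweighted)."""
    cells = [
        series
        for (lat, lon), series in anomalies.items()
        if lat_range[0] <= lat <= lat_range[1]
        and lon_range[0] <= lon <= lon_range[1]
    ]
    return [sum(vals) / len(vals) for vals in zip(*cells)]


def dipole_mode_index(anomalies):
    """West-box mean minus east-box mean, per the definition in the text."""
    west = box_mean(anomalies, (-5, 5), (55, 75))    # 5S-5N, 55E-75E
    east = box_mean(anomalies, (-10, 0), (85, 95))   # 0-10S, 85E-95E
    return [w - e for w, e in zip(west, east)]


def moving_average(series, window=12):
    """Trailing moving average; output is shorter by window - 1 values."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]
```

The same subtract-two-series pattern applies to the cluster-based version, with the two cluster centroid series taking the place of the two box means.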
[Figure: Plot of cluster 16 – cluster 22 versus the Indian Ocean Dipole Mode index. (Indices smoothed using 12 month moving average.)]
Mining the Climate Data: Associations
Ref: Tan et al 2001
FPAR-Hi ==> NPP-Hi (sup=5.9%, conf=55.7%)
[Figure: map highlighting Grassland/Shrubland areas]
Association rule is interesting because it appears mainly in regions with grassland/shrubland vegetation type
Detection of Ecosystem Disturbances
Detection of sudden changes in greenness over extensive areas from these large global satellite data sets required development of automated techniques that take into account the timing, location, and magnitude of such changes.

An algorithm was designed to identify any significant and sustained declines in FPAR during an 18 year time period. This algorithm transforms a non-stationary time series to a sequence of disturbance events. Techniques were also developed to discover associations between ecosystem disturbance regimes and historical climate anomalies.
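To give a feel for turning a time series into a sequence of disturbance events, the sketch below flags runs of consecutive months in which a value stays well below its long-term mean. It is a simplified stand-in, not the published algorithm: the baseline (overall mean), the `drop` margin, and the `min_months` duration are all hypothetical parameters chosen for illustration.

```python
def sustained_declines(series, drop=0.1, min_months=12):
    """Return (start, end) index pairs of runs staying below baseline - drop.

    Assumptions: the baseline is the plain mean of the series, and a run
    counts as an event only if it lasts at least min_months samples.
    """
    baseline = sum(series) / len(series)
    events, start = [], None
    for i, value in enumerate(series):
        below = value < baseline - drop
        if below and start is None:
            start = i  # a candidate run begins
        elif not below and start is not None:
            if i - start >= min_months:
                events.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the series.
    if start is not None and len(series) - start >= min_months:
        events.append((start, len(series) - 1))
    return events
```

A real FPAR series would first need its seasonal cycle removed; otherwise every winter would look like a decline.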
These algorithms and techniques have allowed Earth Science researchers to gain a deeper insight into the interplay among natural disasters, human activities and the rise of carbon dioxide in Earth's atmosphere during two recent decades.

Release: 03-51AR
NASA DATA MINING REVEALS A NEW HISTORY OF NATURAL DISASTERS
NASA is using satellite data to paint a detailed global picture of the interplay among natural disasters, human activities and the rise of carbon dioxide in the Earth's atmosphere during the past 20 years.
http://amesnews.arc.nasa.gov/releases/2003/03_51AR.html
Understanding Global Teleconnections of Climate to Regional
Model Estimates of Amazon Ecosystem Carbon Fluxes
[Figure: Average NPP at 55.0 W, 15.0 S vs. Average AO; time series of NPP and AO, x-axis years 82 to 98, y-axis from -3 to 3]

Discovered, using correlation analysis, a strong connection between the rainfall patterns generated by the South American monsoon system and terrestrial greenness over a large section of the southern Amazon region.

This is the first direct evidence of large-scale effects of the Atlantic Ocean rainfall systems on yearly greenness changes in the Amazon region, and the finding has important implications for the impacts of "slash and burn" deforestation on this crucial ecosystem of the world.
High Resolution EOS Data

EOS satellites provide high resolution measurements
– Finer spatial grids
8 km × 8 km grid produces 10,848,672 data points
1 km × 1 km grid produces 694,315,008 data points
– More frequent measurements
– Multiple instruments
Earth Observing System (e.g., Terra and Aqua satellites)
Generates terabytes of data per day
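The two grid sizes quoted above are consistent with each other: refining a square grid from 8 km to 1 km spacing multiplies the number of cells by 8 × 8 = 64. A small sketch of that arithmetic, with an illustrative helper name:

```python
def scaled_cell_count(cells, old_km, new_km):
    """Cells needed when a square grid is refined from old_km to new_km spacing."""
    factor = old_km / new_km          # refinement along each axis
    return round(cells * factor * factor)
```

`scaled_cell_count(10_848_672, 8, 1)` reproduces the 694,315,008 figure on the slide. This is why any algorithm that is more than linear in the number of grid cells becomes much harder to run at the finer resolution.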
High resolution data allows us to answer more detailed questions:
– Detecting patterns such as trajectories, fronts, and movements of regions with uniform properties
– Finding relationships between leaf area index (LAI) and topography of a river drainage basin
– Finding relationships between fire frequency and elevation as well as topographic position
http://www.crh.noaa.gov/lmk/soo/docu/basicwx.htm
Discovery of Changes from the Global Carbon Cycle and Climate System Using Data Mining: Journal Publications

Potter, C., Tan, P., Steinbach, M., Klooster, S., Kumar, V., Myneni, R., Genovese, V., 2003. Major disturbance events in terrestrial ecosystems detected using global satellite data sets. Global Change Biology, July, 2003.

Potter, C., Klooster, S. A., Myneni, R., Genovese, V., Tan, P., Kumar, V., 2003. Continental scale comparisons of terrestrial carbon sinks estimated from satellite data and ecosystem modeling 1982-98. Global and Planetary Change (in press).

Potter, C., Klooster, S. A., Steinbach, M., Tan, P., Kumar, V., Shekhar, S., Nemani, R., Myneni, R., 2003. Global teleconnections of climate to terrestrial carbon flux. J. Geophys. Res. - Atmospheres (in press).

Potter, C., Klooster, S., Steinbach, M., Tan, P., Kumar, V., Myneni, R., Genovese, V., 2003. Variability in Terrestrial Carbon Sinks Over Two Decades: Part 1 – North America. Geophysical Research Letters (in press).

Potter, C., Klooster, S., Steinbach, M., Tan, P., Kumar, V., Shekhar, S., and Carvalho, C., 2002. Understanding Global Teleconnections of Climate to Regional Model Estimates of Amazon Ecosystem Carbon Fluxes. Global Change Biology (in press).

Potter, C., Zhang, P., Shekhar, S., Kumar, V., Klooster, S., and Genovese, V., 2002. Understanding the Controls of Historical River Discharge Data on Largest River Basins. (in preparation)
Discovery of Changes from the Global Carbon Cycle and Climate System Using Data Mining: Conference/Workshop Publications

Steinbach, M., Tan, P., Kumar, V., Potter, C., and Klooster, S., 2003. Discovery of Climate Indices Using Clustering, KDD 2003, Washington, D.C., August 24-27, 2003.

Zhang, P., Huang, Y., Shekhar, S., and Kumar, V., 2003. Exploiting Spatial Autocorrelation to Efficiently Process Correlation-Based Similarity Queries, Proc. of the 8th Intl. Symp. on Spatial and Temporal Databases (SSTD '03).

Zhang, P., Huang, Y., Shekhar, S., and Kumar, V., 2003. Correlation Analysis of Spatial Time Series Datasets: A Filter-And-Refine Approach, Proc. of the Seventh Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD '03).

Ertoz, L., Steinbach, M., and Kumar, V., 2003. Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data, Proc. of Third SIAM International Conference on Data Mining.

Tan, P., Steinbach, M., Kumar, V., Potter, C., Klooster, S., and Torregrosa, A., 2001. Finding Spatio-Temporal Patterns in Earth Science Data, KDD 2001 Workshop on Temporal Data Mining, San Francisco.

Kumar, V., Steinbach, M., Tan, P., Klooster, S., Potter, C., and Torregrosa, A., 2001. Mining Scientific Data: Discovery of Patterns in the Global Climate System, Proc. of the 2001 Joint Statistical Meeting, Atlanta.