Download Scientific Data Mining and Analysis Chandrika Kamath Center for Applied Scientific Computing

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia, lookup

Nonlinear dimensionality reduction wikipedia, lookup

Transcript
Scientific Data Mining and Analysis
Chandrika Kamath
Center for Applied Scientific Computing
Lawrence Livermore National Laboratory
March 17, 2004
This work was performed under the auspices of the U.S. Department of Energy by University of
California Lawrence Livermore National Laboratory under contract no. W-7405-Eng-48.
http://www.llnl.gov/casc/sapphire
Data mining terminology
z
z
z
z
Data mining: the semi-automatic discovery of patterns,
associations, anomalies, and statistically significant
structures in data
Pattern recognition: the discovery and characterization
of patterns
Pattern: an ordering with an underlying structure
Feature: extractable measurement or attribute
Pattern: radio galaxy with a bent-double
morphology
Features: number of “blobs”
maximum intensity in a blob
spatial relationship between
blobs (distances and angles)
FIRST: sundog.stsci.edu
CK 2
Large-scale data mining - from a Terabyte
to a Megabyte
Raw
Data
Target
Data
Preprocessed
Data
Data Preprocessing
Data Fusion
Sampling
Multi-resolution
analysis
Transformed
Data
Patterns
Knowledge
Pattern Recognition Interpreting Results
DimensionDe-noising
reduction
Object identification
Featureextraction
Normalization
Classification
Clustering
Regression
An iterative and interactive process
Visualization
Validation
CK 3
First, we need to handle massive, multiresolution data from different sensors
z
z
z
Different data formats and output types
Size of the data
– sampling
– multi-resolution techniques
Data from different sensors at different resolutions at
different times
– data registration
X-ray
Infrared
Optical
Radio
Example: Images of the Crab Nebula (chandra.harvard.edu)
CK 4
Next, we need to find the objects in image
and mesh data
z
z
Data can be noisy, with missing values
– denoising can smooth the data
Image processing techniques to identify the object
– extensive variation across objects and images
– algorithms have several parameters
– must be robust across images
– extensible to 3D meshes and unstructured grids
FIRST
Example: the result of image segmentation
CK 5
Once an object has been identified, we
need to extract features to represent it
z
z
z
Scale, rotation, and translation invariant
Robust: insensitive to small changes in data
Appropriate to the problem
– involve domain scientists
– extract many and then reduce
N
N
Example: Position angle is not a robust feature
CK 6
We may need to reduce the number of
features through feature selection
Features
f1 f 2 K f n
Data items
z
z
z
z
z
Features
' '
'
1 2
p
f f Kf
p<n
May need to normalize features
Number of features to keep
Involve domain scientists
Need to interpret the results
Techniques may or may not depend on the pattern
recognition algorithm used
CK 7
There are several challenges in applying
pattern recognition to scientific data
z
Classification: given a training set, build
a “model” that can assign a label to an
unclassified object
– training sets are often small
– unbalanced
– contain subjective errors
f2
A
+
+
+
__
_
_
_+
++
+
+
f1
B
z
Clustering: group the data items based
on their similarity
– number of groups
– quality of clustering
– effects of outliers
f2
x
x
x
x
x
xx
x
x
xx
xx
f1
CK 8
Other issues make scientific data mining
challenging
z
z
z
z
z
z
z
Visualization and validation of results
– use to create a larger training set;
subject to labeling “drift”
– info-viz: one-off solution; problems
with large datasets
Parallelism: size of the data shrinks as it
is processed
Real time responses: streaming data
Spatio-temporal data
“Interesting ” vs. “labeled” data
Lack of ground truth
Mining a growing data set
Bent-double
Non-bent double
CK 9