Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Scientific Data Mining and Analysis Chandrika Kamath Center for Applied Scientific Computing Lawrence Livermore National Laboratory March 17, 2004 This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract no. W-7405-Eng-48. http://www.llnl.gov/casc/sapphire Data mining terminology z z z z Data mining: the semi-automatic discovery of patterns, associations, anomalies, and statistically significant structures in data Pattern recognition: the discovery and characterization of patterns Pattern: an ordering with an underlying structure Feature: extractable measurement or attribute Pattern: radio galaxy with a bent-double morphology Features: number of “blobs” maximum intensity in a blob spatial relationship between blobs (distances and angles) FIRST: sundog.stsci.edu CK 2 Large-scale data mining - from a Terabyte to a Megabyte Raw Data Target Data Preprocessed Data Data Preprocessing Data Fusion Sampling Multi-resolution analysis Transformed Data Patterns Knowledge Pattern Recognition Interpreting Results DimensionDe-noising reduction Object identification Featureextraction Normalization Classification Clustering Regression An iterative and interactive process Visualization Validation CK 3 First, we need to handle massive, multiresolution data from different sensors z z z Different data formats and output types Size of the data – sampling – multi-resolution techniques Data from different sensors at different resolutions at different times – data registration X-ray Infrared Optical Radio Example: Images of the Crab Nebula (chandra.harvard.edu) CK 4 Next, we need to find the objects in image and mesh data z z Data can be noisy, with missing values – denoising can smooth the data Image processing techniques to identify the object – extensive variation across objects and images – algorithms have several parameters – must be robust across images – extensible to 3D meshes and unstructured grids FIRST Example: the result of image segmentation CK 5 Once an object has been identified, we need to extract features to represent it z z z Scale, rotation, and translation invariant Robust: insensitive to small changes in data Appropriate to the problem – involve domain scientists – extract many and then reduce N N Example: Position angle is not a robust feature CK 6 We may need to reduce the number of features through feature selection Features f1 f 2 K f n Data items z z z z z Features ' ' ' 1 2 p f f Kf p<n May need to normalize features Number of features to keep Involve domain scientists Need to interpret the results Techniques may or may not depend on the pattern recognition algorithm used CK 7 There are several challenges in applying pattern recognition to scientific data z Classification: given a training set, build a “model” that can assign a label to an unclassified object – training sets are often small – unbalanced – contain subjective errors f2 A + + + __ _ _ _+ ++ + + f1 B z Clustering: group the data items based on their similarity – number of groups – quality of clustering – effects of outliers f2 x x x x x xx x x xx xx f1 CK 8 Other issues make scientific data mining challenging z z z z z z z Visualization and validation of results – use to create a larger training set; subject to labeling “drift” – info-viz: one-off solution; problems with large datasets Parallelism: size of the data shrinks as it is processed Real time responses: streaming data Spatio-temporal data “Interesting ” vs. “labeled” data Lack of ground truth Mining a growing data set Bent-double Non-bent double CK 9