Data Mining in the Multidimensional Parameter Space
Yanxia Zhang
National Astronomical Observatories, CAS
Nov. 27, 2003
Outline
Why and What
DM Technology
Future Directions
Why
Necessity Is the Mother of Invention
Data avalanche
VO
DM&KDD
[Figure: one sky field imaged across wavelengths. Panels: IRAS 25 µm, 2MASS 2 µm, DSS optical, IRAS 100 µm, WENSS 92 cm, NVSS 20 cm, GB 6 cm, ROSAT ~keV]
What
DM (KDD):
Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information from data in large databases
Alternative names:
Data mining: a misnomer?
Knowledge discovery in databases (KDD: SIGKDD),
knowledge extraction, data archeology, data dredging,
information harvesting, business intelligence, etc.
Taxonomy of DM
In large scientific databases, DM in two flavors:
– Event-based mining
– Relationship-based mining
Event-Based Mining for Science
Event-based mining is based upon events or trends in data.
Known events / known algorithms - use existing physical models
(descriptive models) to locate known phenomena of interest either spatially or
temporally within a large database.
Known events / unknown algorithms - use pattern recognition and
clustering properties of data to discover new observational (in our case,
astrophysical) relationships among known phenomena.
Unknown events / known algorithms - use expected physical
relationships (predictive models) among observational parameters of
astrophysical phenomena to predict the presence of previously unseen events
within a large complex database.
Unknown events / unknown algorithms - use thresholds or trends to
identify transient or otherwise unique ("one-of-a-kind") events and therefore to
discover new phenomena.
Relationship-Based Mining for Science
Relationship-based mining is based on associations.
Spatial associations -- identify events (astronomical objects) at the same
location in the sky.
Temporal associations -- identify events occurring during the same or
related periods of time.
Coincidence associations -- use clustering techniques to identify
events that are co-located within a multi-dimensional parameter space.
Science Requirements for DM
Cross-Identification - refers to the classical problem of associating
the source list in one database with the source list in another.
Cross-Correlation - refers to the search for correlations,
tendencies, and trends between physical parameters in multi-dimensional
data, usually across databases.
Nearest-Neighbor Identification - refers to the general
application of clustering algorithms in multi-dimensional parameter
space, usually within a database.
Systematic Data Exploration - refers to the application of the
broad range of event-based and relationship-based queries to a database
in the hope of making a serendipitous discovery of new objects or a new
class of objects.
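Cross-identification can be illustrated with a brute-force positional match. This is a minimal sketch, not a production cross-matcher: the function names, the 2-arcsecond match radius, and the two toy catalogs are assumptions made for illustration.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cross_identify(catalog_a, catalog_b, radius_arcsec=2.0):
    """For each source in catalog_a, keep the nearest catalog_b source within the radius.

    Each catalog is a list of (name, ra_deg, dec_deg) tuples.
    Returns {name_a: (name_b, separation_arcsec)}.
    """
    matches = {}
    for name_a, ra_a, dec_a in catalog_a:
        best = None
        for name_b, ra_b, dec_b in catalog_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b) * 3600.0
            if sep <= radius_arcsec and (best is None or sep < best[1]):
                best = (name_b, sep)
        if best is not None:
            matches[name_a] = best
    return matches

# Hypothetical source lists; names and coordinates are made up.
optical = [("opt-1", 150.0000, 2.0000), ("opt-2", 150.1000, 2.1000)]
infrared = [("ir-1", 150.0002, 2.0001), ("ir-2", 151.0000, 3.0000)]
matches = cross_identify(optical, infrared)
```

The double loop costs O(N×M); catalogs of realistic size would need a spatial index in place of the inner loop.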
Outline
Why and What
DM Technology
Future Directions
DM: Confluence of Multiple Disciplines
[Figure: DM at the confluence of database systems, data warehouses, and OLAP; ML & AI; information science; statistics; visualization; and other disciplines]
DM: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: KDD pipeline. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
DM Functionality
Concept description: Characterization and Comparison:
Generalize, summarize, and possibly contrast data characteristics,
e.g., stars vs. galaxies.
Association:
From association and correlation to causality.
Finding rules like “stars → point sources”.
Classification and Prediction:
Classify data based on the values in a classifying attribute, e.g.,
classify objects based on spectra, or classify galaxies and stars
based on images.
Predict some unknown or missing attribute values based on other
information.
DM Functionality (Cont.)
Clustering:
Group data to form new classes, e.g., cluster spectra
data to find distribution patterns.
Time-series analysis:
Trend and deviation analysis: Find and characterize
evolution trends, sequential patterns, similar
sequences, and deviations in data, e.g., variable stars.
Similarity-based pattern-directed analysis: Find and
characterize user-specified patterns in large databases.
Cyclicity/periodicity analysis: Find segment-wise or
total cycles or periodic behaviours in time-related data.
Other pattern-directed or statistical analysis.
DM: On What Kind of Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB systems and information repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW
Challenges in DM
Mining methodology issues
Mining different kinds of knowledge in databases.
Interactive mining of knowledge at multiple levels of
abstraction.
Incorporation of background knowledge
DM query languages and ad-hoc DM.
Expression and visualization of DM results.
Handling noise and incomplete data
Pattern evaluation: the interestingness problem.
Performance issues:
Efficiency and scalability of DM algorithms.
Parallel, distributed and incremental mining methods.
Challenges in DM (Cont.)
Issues related to the variety of data types:
Handling relational and complex types of data
Mining information from heterogeneous databases and
global information systems.
Issues related to applications and social impacts:
Application of discovered knowledge.
– Domain-specific data mining tools
– Intelligent query answering
– Process control and decision making.
Integration of the discovered knowledge with existing
knowledge: A knowledge fusion problem.
Protection of data security and integrity.
Mining Association Rules
Association rule mining:
Finding associations or correlations among a set of items
or objects in transaction databases, relational databases,
and data warehouses.
Applications:
Basket data analysis, cross-marketing, catalog design,
loss-leader analysis, clustering, etc.
Examples.
Rule form: LHS → RHS [support, confidence].
buys(x, diapers) → buys(x, beer) [0.5%, 60%]
major(x, CS) ^ takes(x, DB) → grade(x, A) [1%, 75%]
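The [support, confidence] pair attached to a rule can be computed directly from a transaction list. A minimal sketch, assuming support is the fraction of transactions containing both sides and confidence is that fraction divided by the support of the left-hand side; the function names and the five toy baskets are illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, lhs, rhs):
    """support(LHS and RHS together) / support(LHS); 0.0 if LHS never occurs."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        return 0.0
    return support(transactions, set(lhs) | set(rhs)) / lhs_support

# Toy transaction database (made up for illustration).
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"beer"},
]
rule_support = support(baskets, {"diapers", "beer"})           # 2 of 5 baskets
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})   # 2 of the 3 diaper baskets
```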
Methods for Mining Associations
The Apriori principle: Any subset of a frequent itemset must
be frequent.
(Agrawal & Srikant’94, Mannila, Klementen, et al’94)
Partition technique (Savasere, Omiecinski, Navathe’95)
Sampling technique (Toivonen’96)
Multi-level or generalized association (Agrawal & Srikant’95,
Han & Fu’95)
Quantitative association rule mining (Srikant & Agrawal’96,
Lent et al.’97, Miller’97).
Constraint-based or query-based association (Ng, et al’98,
Tsur et al’98)
From association to correlation (Brin et al’97)
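The Apriori principle lets a level-wise search discard any candidate that has an infrequent subset. The sketch below is a compact illustration of that idea, not the optimized algorithm from the cited papers; the function name, the toy baskets, and the 0.6 support threshold are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset (frozenset): support} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        found = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                found[c] = s
        return found

    # Level 1: frequent single items.
    level = frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of frequent (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent_itemsets = apriori(baskets, min_support=0.6)
```

Here every single item and every pair is frequent, while {a, b, c} appears in only two of five baskets and is dropped.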
Classification
Data categorization based on a set of training objects.
Applications: classification of stars, galaxies, AGN, etc.
Example: classify AGN and provide the symptoms
which describe each class or subclass.
The classification task: Based on the features present in
the class-labeled training data, develop a description or
model for each class. The model is used for
classification of future test data,
better understanding of each class, and
prediction of certain properties and behaviors.
Three Schemes in Classification
Knowledge to be mined:
Summarization (characterization), comparison,
association, classification, clustering, trend, deviation
and pattern analysis, etc.
Mining knowledge at different abstraction levels:
primitive level, high level, multiple-level, etc.
Databases to be mined:
Relational, transactional, object-oriented, object-relational,
active, spatial, time-series, text, multi-media,
heterogeneous, legacy, etc.
Techniques adopted:
Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization, neural network, etc.
Major Classification Methods
Decision tree-based classification:
Training set vs test set or cross-validation
Overfitting problem and tree pruning
Boosting techniques.
Bayesian classification:
Naïve Bayesian classification
Bayesian belief networks
Boosting techniques (e.g., AdaBoost).
Neural network approach:
Multi-layer networks and back-propagation.
Genetic algorithms:
Genetic operators and fitness function selection.
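As one concrete instance of the methods listed, naïve Bayesian classification can be sketched in a few lines. This is a Gaussian variant written for illustration only: the two features (a concentration index and a magnitude), their values, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Pick the class with the highest log-posterior, treating features as independent."""
    def log_posterior(prior, means, variances):
        lp = math.log(prior)
        for v, m, var in zip(row, means, variances):
            lp += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return lp
    return max(model, key=lambda label: log_posterior(*model[label]))

# Made-up training set: (concentration index, magnitude) -> class label.
X = [(0.90, 18.1), (0.95, 17.5), (0.88, 19.0),
     (0.40, 18.3), (0.35, 17.9), (0.45, 19.2)]
y = ["star", "star", "star", "galaxy", "galaxy", "galaxy"]
model = fit_gaussian_nb(X, y)
label = predict_gaussian_nb(model, (0.92, 18.0))
```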
Predictive Modeling in Databases
Predictive modeling: Predict data values or construct
generalized linear models based on the database data.
One can only predict value ranges or category distributions.
Method outline:
Minimal generalization
Attribute relevance analysis
Generalized linear model construction
Prediction.
Determine the major factors which influence the prediction.
Data relevance analysis: uncertainty measurement,
entropy analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis.
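The entropy analysis mentioned under data relevance analysis can be made concrete with information gain, which scores how much an attribute reduces uncertainty about the target. A minimal sketch; the attribute names and the four toy records are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus its expected entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical records: 'extended' separates the classes, 'band' does not.
rows = [
    {"extended": "no", "band": "r", "class": "star"},
    {"extended": "no", "band": "i", "class": "star"},
    {"extended": "yes", "band": "r", "class": "galaxy"},
    {"extended": "yes", "band": "i", "class": "galaxy"},
]
gain_extended = information_gain(rows, "extended", "class")
gain_band = information_gain(rows, "band", "class")
```

An attribute with a gain near zero, like `band` here, is a candidate for removal before model construction.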
Data Clustering Analysis
Clustering:
Partitioning a set of data (or objects) into a set of classes,
called clusters, such that the members of each class share
some interesting common properties.
High-quality clusters:
the intra-class similarity is high.
the inter-class similarity is low.
Measuring data clustering quality
Distance functions
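One simple way to quantify the two quality criteria is to compare the mean distance between points in the same cluster with the mean distance between points in different clusters. A sketch assuming Euclidean distance as the distance function and at least one pair of each kind; the function name and sample points are illustrative.

```python
import math
from itertools import combinations

def mean_intra_inter_distance(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance).

    Assumes at least one same-cluster pair and one different-cluster pair exist.
    """
    intra, inter = [], []
    for (p, label_p), (q, label_q) in combinations(zip(points, labels), 2):
        d = math.dist(p, q)
        (intra if label_p == label_q else inter).append(d)
    return sum(intra) / len(intra), sum(inter) / len(inter)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels = [0, 0, 1, 1]
intra, inter = mean_intra_inter_distance(points, labels)
```

A small within-cluster distance (high intra-class similarity) and a large between-cluster distance (low inter-class similarity) indicate a good partition.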
Three Categories of Clustering Techniques
Partitioning-based:
Basically enumerate various partitions and then
score them by some criterion.
K-means, K-medoids, etc.
Hierarchy-based:
Create a hierarchical decomposition of the set of
data (or objects) using some criterion.
Model-based:
A model is hypothesized for each of the clusters
Find the best fit of the data to the hypothesized models.
E.g., Bayesian classification (AutoClass), Cobweb.
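K-means, the partitioning method named above, alternates between assigning points to the nearest centroid and recomputing centroids. A minimal sketch: seeding with the first k points is a simplifying assumption that keeps the example deterministic, and the function name and data are illustrative.

```python
import math

def kmeans(points, k, iterations=100):
    """Return (centroids, assignment) after Lloyd-style iterations."""
    centroids = [tuple(p) for p in points[:k]]   # naive seeding (assumption)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                          for p in points]
        if new_assignment == assignment:
            break                                # converged: nothing moved
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.8, 1.1), (7.9, 8.3)]
centroids, assignment = kmeans(points, k=2)
```

The result depends on the seeds, so practical use usually involves several random restarts and keeping the partition with the best score.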
Database Clustering Methods
CLARANS (Ng & Han’94):
An extension to k-medoid algorithm based on
randomized search.
BIRCH (Zhang et al’96):
CF tree (a balanced tree structure).
DBSCAN (Ester et al’96):
Connects regions of sufficiently high density into clusters.
STING (Wang et al’97):
A hierarchical cell structure that stores statistical
information.
CLIQUE (Agrawal et al’98):
Cluster high dimensional data.
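The density-based idea behind DBSCAN can be sketched as follows. This is a brute-force illustration rather than the indexed implementation from the original paper; the parameter names `eps` and `min_pts`, their values, and the sample points are assumptions.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE                 # may later become a border point
            continue
        cluster += 1                          # i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster           # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

points = [(0, 0), (0, 0.5), (0.5, 0), (5, 5), (5, 5.5), (5.5, 5), (20, 20)]
labels = dbscan(points, eps=1.0, min_pts=3)
```

The isolated point at (20, 20) has too few neighbors to join any dense region and is reported as noise.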
Time-Series DM
Trend and deviation analysis
Find trend (data evolution regularity) and deviations.
Regression analysis, visualization techniques.
Subsequence analysis: similarity search
Subsequence matching: normalization + matching
Template specification: shape and macro specification.
Sequential pattern analysis
Sequential association rules
Periodicity analysis
full periods vs. partial periods, cyclic association rules.
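Trend and deviation analysis by regression can be sketched as a least-squares line plus a residual threshold. The 2-sigma cut, the function names, and the synthetic series are assumptions chosen for illustration.

```python
import statistics

def linear_trend(times, values):
    """Ordinary least-squares fit: values ~ slope * t + intercept."""
    t_mean = statistics.fmean(times)
    v_mean = statistics.fmean(values)
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, v_mean - slope * t_mean

def deviations(times, values, n_sigma=2.0):
    """Indices whose residual from the fitted trend exceeds n_sigma standard deviations."""
    slope, intercept = linear_trend(times, values)
    residuals = [v - (slope * t + intercept) for t, v in zip(times, values)]
    sigma = statistics.pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > n_sigma * sigma]

# Synthetic series: a linear trend with one injected deviation at index 6.
times = list(range(10))
values = [2 * t + 1 for t in times]
values[6] += 5
outliers = deviations(times, values)
```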
Similarity Search in DM
Faloutsos et al. (1994) :
Extract features from each window
Fourier Transform & R*-tree structure.
Agrawal et al. (1995) :
Amplitude scaling, offset translation
Distance is determined from the sequence envelopes
Agrawal et al. (1995) :
SDL pattern language to encode queries about “shapes”
Jagadish et al. (1997) :
domain-independent framework
“find all objects that are similar to some objects in class A
and are not similar to any object in class B”
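Amplitude scaling and offset translation can be neutralized by normalizing each window before comparing it with the query. The sketch below uses a plain sliding window with Euclidean distance; it does not reproduce the Fourier/R*-tree indexing or the envelope method of the cited papers, and the function names and sample series are assumptions.

```python
import math
import statistics

def z_normalize(seq):
    """Remove offset (mean) and amplitude (standard deviation) from a sequence."""
    mean = statistics.fmean(seq)
    std = statistics.pstdev(seq)
    if std == 0:
        return [0.0] * len(seq)       # constant window: no shape to compare
    return [(v - mean) / std for v in seq]

def best_match(series, query):
    """Return (start index, distance) of the window most similar in shape to query."""
    q = z_normalize(query)
    best_index, best_dist = None, math.inf
    for start in range(len(series) - len(query) + 1):
        window = z_normalize(series[start:start + len(query)])
        d = math.dist(window, q)
        if d < best_dist:
            best_index, best_dist = start, d
    return best_index, best_dist

query = [0, 1, 0, -1]
# The same shape appears at index 4, scaled by 20 and shifted by 10.
series = [5, 5, 5, 5, 10, 30, 10, -10, 5, 5]
index, distance = best_match(series, query)
```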
Periodic Pattern Search in Time-Related Data Sets
Full cycle analysis:
Fourier transformation, other statistical analysis methods
Fragment-wise cyclic behavior analysis:
Example: Jack reads the NY Times every day at 9:00 am.
Given (natural) periods vs. arbitrary periods.
A data cube and OLAP-based technique: (Han, Gong and
Yin’98)
Cyclic association rules:
Associations which form cycles.
Cyclic Association Rules (B. Özden, S. Ramaswamy, A.
Silberschatz, 1998)
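Full-cycle analysis by Fourier transformation can be illustrated with a direct discrete Fourier transform that picks the strongest frequency. A sketch assuming evenly sampled data; the function name and the synthetic sine signal are illustrative, and the O(n²) loop is written for clarity rather than speed.

```python
import cmath
import math

def dominant_period(values):
    """Return the period (in samples) of the strongest Fourier component."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]      # drop the constant (zero-frequency) term
    best_k, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k

# Synthetic signal: 64 samples of a sine wave that repeats every 8 samples.
signal = [math.sin(2 * math.pi * t / 8) for t in range(64)]
period = dominant_period(signal)
```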
Conclusions
Data warehouse: An industry trend
DW stores a huge amount of subject-oriented, cleansed, integrated,
consolidated, time-related data.
OLAP provides an interactive data analysis environment
OLAM: Integration of mining with OLAP
Takes advantage of data warehouse infrastructure.
From batch mining to interactive, multi-dimensional mining
Many interesting research and implementation issues
Database mining and warehouse mining are both important
directions to pursue.
Outline
Why and What
DM Technology
Future Directions
Future Work on DM Research
Integration with data warehouse, OLAP, and relational
technology
Scalability: efficient algorithms, parallel/distributed and
incremental mining
Ad-hoc mining query language and its optimization
Multiple, integrated DM functions and methods
Mining on new kinds of data: time-series data, text,
multimedia, spatial and Web
Visual DM and knowledge visualization
Application exploration
Interactive, exploratory DM environment
Integration with Data Warehouse and
OLAP Technology
Data warehouse: A strong industry trend
Huge amounts of subject-oriented, cleansed, integrated, consolidated,
time-related data are stored in data warehouses
OLAP provides an interactive data analysis environment
Integrating mining with OLAP leads to multi-dimensional
DM:
On-Line Analytical Mining (OLAM :-).
Efficiency and Scalability in DM
Efficient algorithms in every DM function
Class description: Summarization and comparison
Classification and prediction
Clustering
Time-series and trend analysis
Real-time, fast response in exploratory DM
Progressive, multiple precision data analysis
Parallel and distributed DM algorithms
Incremental DM methods
Why Parallel and Distributed DM?
Massive data sets:
From megabytes to gigabytes and terabytes.
Costly data mining algorithms:
Association, classification, clustering, prediction.
Data in applications are geographically
distributed.
Parallel and networked computers are widely
available.
DM Query Optimization
Ad-hoc DM query language: DM SQL.
DM query optimization:
How to carve a DM view?
How to push user-specified rule constraints?
How to integrate interestingness measures in mining?
Interactivity of DM: New challenge.
Multiple, Integrated DM Functions and
Methods
Multiple mining functions
Concept description: characterization and
discrimination
Classification and prediction
Clustering
Association and correlation analysis
Multiple mining methods
Statistical approaches
Machine learning approaches
Neural network approach
Other approaches: mathematical models, etc.
Mining Complex Types of Data
Text data mining:
Library database, e-mails, book stores, Web pages.
Spatial data mining:
geographic information systems, engineering databases, medical
image database.
Multimedia data mining:
image and video/audio databases.
Web mining:
unstructured and semi-structured data
Web access pattern analysis: easily doable.
Visual DM
The power of visual comprehension:
A picture = a thousand words
Pattern recognition and exploratory mining
Visual mining techniques:
Data visualization
Integration with other mining methods
Visual representation of knowledge
Charts, graphs, trees, curves, cubes
Multi-dimensional representation: color, shape, texture,
gray-level, etc.
Exploration of DM Applications
Need more success stories:
Insurance and market analysis, NBA strategy analysis.
Most current DM systems lack a “thick
semantic layer”,
much like the early relational database systems without
application software.
Customized data mining systems:
Market analysis DM systems
Insurance and customer analysis systems
Towards an Integrated, Exploratory DM
Environment
Exploratory DM:
Interactive, user-centered, exploratory mining process
High performance and fast response
Integrated multiple DM functions and methods:
Try different approaches to see which one is better
Try different functions to see which patterns are more
interesting
Automated mining and interactive mining: not too far
apart!
Towards VO-based DM
Success of DM
Success of VO
We hope that day comes soon
Thank you!