Data Mining in the Multidimensional Parameter Space
Yanxia Zhang
National Astronomical Observatories, CAS
Nov. 27, 2003
1
Outline
• Why and What
• DM Technology
• Future Directions
2
Why
Necessity Is the Mother of Invention
Data avalanche
VO
DM&KDD
[Figure: multiwavelength image panels labeled IRAS 25 μm, 2MASS 2 μm, DSS optical, IRAS 100 μm, WENSS 92 cm, NVSS 20 cm, GB 6 cm, ROSAT ~keV]
3
What
• DM (KDD):
  – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information from data in large databases
• Alternative names:
  – Data mining: a misnomer?
  – Knowledge discovery in databases (KDD: SIGKDD), knowledge extraction, data archeology, data dredging, information harvesting, business intelligence, etc.
4
Taxonomy of DM
In large scientific databases, DM comes in two flavors:
– Event-based mining
– Relationship-based mining
5
Event-Based Mining for Science
Event-based mining is based upon events or trends in data.
• Known events / known algorithms: use existing physical models (descriptive models) to locate known phenomena of interest either spatially or temporally within a large database.
• Known events / unknown algorithms: use pattern recognition and clustering properties of data to discover new observational (in our case, astrophysical) relationships among known phenomena.
• Unknown events / known algorithms: use expected physical relationships (predictive models) among observational parameters of astrophysical phenomena to predict the presence of previously unseen events within a large complex database.
• Unknown events / unknown algorithms: use thresholds or trends to identify transient or otherwise unique ("one-of-a-kind") events and therefore to discover new phenomena.
6
Relationship-Based Mining for Science
Relationship-based mining is based on associations.
• Spatial associations: identify events (astronomical objects) at the same location in the sky.
• Temporal associations: identify events occurring during the same or related periods of time.
• Coincidence associations: use clustering techniques to identify events that are co-located within a multi-dimensional parameter space.
7
Science Requirements for DM
• Cross-Identification: the classical problem of associating the source list in one database with the source list in another.
• Cross-Correlation: the search for correlations, tendencies, and trends between physical parameters in multi-dimensional data, usually across databases.
• Nearest-Neighbor Identification: the general application of clustering algorithms in multi-dimensional parameter space, usually within a database.
• Systematic Data Exploration: the application of the broad range of event-based and relationship-based queries to a database in the hope of making a serendipitous discovery of new objects or a new class of objects.
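
Cross-identification, as described above, can be illustrated with a minimal Python sketch. It assumes each catalog is a list of (id, RA, Dec) tuples in degrees and that a match is simply the nearest counterpart within a fixed radius; the catalog entries, identifiers, and the 3-arcsecond radius are made up for illustration, and the brute-force loop would need a spatial index for real survey catalogs.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # The haversine form stays numerically stable for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cross_identify(catalog_a, catalog_b, radius_arcsec):
    """Map each source id in catalog_a to the nearest catalog_b id within the radius, or None.

    Brute force: O(len(catalog_a) * len(catalog_b)) separations are computed.
    """
    radius_deg = radius_arcsec / 3600.0
    matches = {}
    for id_a, ra_a, dec_a in catalog_a:
        best = None  # (id_b, separation) of the closest candidate found so far
        for id_b, ra_b, dec_b in catalog_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= radius_deg and (best is None or sep < best[1]):
                best = (id_b, sep)
        matches[id_a] = best[0] if best else None
    return matches

# Hypothetical toy catalogs (values invented for the example).
optical = [("opt-1", 10.0000, 20.0000), ("opt-2", 50.0000, -5.0000)]
radio = [("rad-7", 10.0003, 20.0002), ("rad-9", 120.0000, 45.0000)]
print(cross_identify(optical, radio, radius_arcsec=3.0))
```

Taking the nearest neighbor within a radius is the simplest possible choice; it ignores positional uncertainties and source density, which a production cross-match would have to model.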
8
Outline
• Why and What
• DM Technology
• Future Directions
9
DM: Confluence of Multiple Disciplines
[Diagram: DM at the center, drawing on database systems, data warehouses, and OLAP; ML & AI; information science; statistics; visualization; other disciplines]
10
DM: A KDD Process
• Data mining: the core of the knowledge discovery process.
[Diagram, from source to result: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
11
DM Functionality
• Concept description: Characterization and Comparison:
  – Generalize, summarize, and possibly contrast data characteristics, e.g., stars vs. galaxies.
• Association:
  – From association, correlation, to causality.
  – Finding rules like “stars → point sources”.
• Classification and Prediction:
  – Classify data based on the values in a classifying attribute, e.g., classify objects based on spectra, or classify galaxies and stars based on images.
  – Predict some unknown or missing attribute values based on other information.
12
DM Functionality (Cont.)
• Clustering:
  – Group data to form new classes, e.g., cluster spectra data to find distribution patterns.
• Time-series analysis:
  – Trend and deviation analysis: Find and characterize evolution trends, sequential patterns, similar sequences, and deviation data, e.g., variable stars.
  – Similarity-based pattern-directed analysis: Find and characterize user-specified patterns in large databases.
  – Cyclicity/periodicity analysis: Find segment-wise or total cycles or periodic behaviours in time-related data.
• Other pattern-directed or statistical analysis
13
DM: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB systems and information repositories
• Object-oriented and object-relational databases
• Spatial databases
• Time-series data and temporal data
• Text databases and multimedia databases
• Heterogeneous and legacy databases
• WWW
14
Challenges in DM
• Mining methodology issues:
  – Mining different kinds of knowledge in databases.
  – Interactive mining of knowledge at multiple levels of abstraction.
  – Incorporation of background knowledge.
  – DM query languages and ad-hoc DM.
  – Expression and visualization of DM results.
  – Handling noise and incomplete data.
  – Pattern evaluation: the interestingness problem.
• Performance issues:
  – Efficiency and scalability of DM algorithms.
  – Parallel, distributed and incremental mining methods.
15
Challenges in DM (Cont.)
• Issues related to the variety of data types:
  – Handling relational and complex types of data.
  – Mining information from heterogeneous databases and global information systems.
• Issues related to applications and social impacts:
  – Application of discovered knowledge:
    – Domain-specific data mining tools
    – Intelligent query answering
    – Process control and decision making.
  – Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem.
  – Protection of data security and integrity.
16
Mining Association Rules
• Association rule mining:
  – Finding associations or correlations among a set of items or objects in transaction databases, relational databases, and data warehouses.
• Applications:
  – Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, etc.
• Examples:
  – Rule form: LHS → RHS [support, confidence].
  – buys(x, diapers) → buys(x, beers) [0.5%, 60%]
  – major(x, CS) ^ takes(x, DB) → grade(x, A) [1%, 75%]
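
The [support, confidence] pair attached to a rule can be computed directly from a transaction list. The sketch below is a minimal illustration in Python; the function names and the five toy baskets are assumptions made for the example, not data from the talk.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Estimate of P(rhs | lhs): support(lhs and rhs together) / support(lhs).

    Raises ZeroDivisionError if no transaction contains `lhs`.
    """
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Hypothetical market baskets, each a set of items.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"beer"},
]

# Rule: buys(x, diapers) -> buys(x, beer)
print(support(baskets, {"diapers", "beer"}))       # share of baskets with both items
print(confidence(baskets, {"diapers"}, {"beer"}))  # share of diaper baskets that also hold beer
```

Support measures how often the whole rule applies in the database; confidence measures how reliable the implication is once the left-hand side holds.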
17
Methods for Mining Associations
• The Apriori principle: Any subset of a frequent itemset must be frequent (Agrawal & Srikant’94, Mannila, Klementen, et al.’94)
• Partition technique (Savasere, Omiecinski, Navathe’95)
• Sampling technique (Toivonen’96)
• Multi-level or generalized association (Agrawal & Srikant’95, Han & Fu’95)
• Quantitative association rule mining (Srikant & Agrawal’96, Lent et al.’97, Miller’97)
• Constraint-based or query-based association (Ng et al.’98, Tsur et al.’98)
• From association to correlation (Brin et al.’97)
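
The Apriori principle above translates into a level-wise search: frequent k-itemsets are built only from frequent (k-1)-itemsets, and any candidate with an infrequent subset is discarded before it is counted. The following is a small, unoptimized Python sketch of that idea; the function name, the toy transactions, and the 0.5 threshold are assumptions for illustration, and none of the refinements cited above (partitioning, sampling, constraints) are included.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset.

    `transactions` is a list of sets; support is the fraction of transactions
    containing the itemset.
    """
    n = len(transactions)

    def frequent(candidates):
        # Count each candidate by scanning all transactions, then apply the threshold.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    level = frequent({frozenset([item]) for t in transactions for item in t})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: merge pairs of frequent (k-1)-itemsets that form a k-itemset.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result

# Hypothetical transactions over three items.
demo = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for itemset, s in sorted(apriori(demo, 0.5).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

In this toy run every pair is frequent but the triple {a, b, c} appears in only two of five transactions, so the search stops after the third level.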
18
Classification
• Data categorization based on a set of training objects.
• Applications: classification of stars, galaxies, AGN, etc.
• Example: classify AGN and provide the symptoms which describe each class or subclass.
• The classification task: Based on the features present in the class-labeled training data, develop a description or model for each class. It is used for
  – classification of future test data,
  – better understanding of each class, and
  – prediction of certain properties and behaviors.
19
Three Schemes in Classification
• Knowledge to be mined:
  – Summarization (characterization), comparison, association, classification, clustering, trend, deviation and pattern analysis, etc.
  – Mining knowledge at different abstraction levels: primitive level, high level, multiple-level, etc.
• Databases to be mined:
  – Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, etc.
• Techniques adopted:
  – Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
20
Major Classification Methods
• Decision tree-based classification:
  – Training set vs. test set or cross-validation
  – Overfitting problem and tree pruning
  – Boosting techniques.
• Bayesian classification:
  – Naïve Bayesian classification
  – Bayesian belief networks
  – Boosting techniques (e.g., AdaBoost).
• Neural network approach:
  – Multi-layer networks and back-propagation.
• Genetic algorithms:
  – Genetic operators and fitness function selection.
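
As one concrete instance of the methods listed above, here is a minimal sketch of naïve Bayesian classification in Python, assuming continuous features modeled by an independent Gaussian per class and feature. The two features (an image-size measure and a concentration measure) and all training values are invented for illustration only.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples, labels):
    """Estimate, for each class, its prior and a (mean, variance) pair per feature."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # small floor avoids division by zero
        model[y] = (len(rows) / len(samples), stats)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log-posterior for feature vector x."""
    def log_posterior(prior, stats):
        total = math.log(prior)
        for value, (mean, var) in zip(x, stats):
            # Log of the Gaussian density; features are treated as independent.
            total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return total
    return max(model, key=lambda y: log_posterior(*model[y]))

# Hypothetical training set: (size, concentration) per object.
train_x = [(1.0, 0.9), (1.1, 1.0), (0.9, 1.1), (3.0, 0.4), (3.2, 0.5), (2.8, 0.3)]
train_y = ["star", "star", "star", "galaxy", "galaxy", "galaxy"]
model = fit_gaussian_nb(train_x, train_y)
print(predict_gaussian_nb(model, (1.05, 0.95)))
print(predict_gaussian_nb(model, (2.9, 0.45)))
```

With a real survey the model would be fit on one labeled subset and evaluated on a held-out test set or by cross-validation, as the decision-tree item above notes.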
21
Predictive Modeling in Databases
• Predictive modeling: Predict data values or construct generalized linear models based on the database data.
• One can only predict value ranges or category distributions.
• Method outline:
  – Minimal generalization
  – Attribute relevance analysis
  – Generalized linear model construction
  – Prediction.
• Determine the major factors which influence the prediction.
  – Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
• Multi-level prediction: drill-down and roll-up analysis.
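
The entropy analysis mentioned under data relevance analysis can be sketched as an information-gain calculation: an attribute is relevant to the extent that splitting on it reduces the entropy of the target. The Python below is a minimal illustration; the attribute names and the four toy records are assumptions, not data from the talk.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in the entropy of `target` after splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    # Weighted average entropy of the target within each attribute value.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical catalog records.
rows = [
    {"morphology": "point", "band": "optical", "class": "star"},
    {"morphology": "point", "band": "radio", "class": "star"},
    {"morphology": "extended", "band": "optical", "class": "galaxy"},
    {"morphology": "extended", "band": "radio", "class": "galaxy"},
]
for attribute in ("morphology", "band"):
    print(attribute, information_gain(rows, attribute, "class"))
```

In this toy table "morphology" separates the classes completely while "band" carries no information about the class, so only the former would be kept as a major factor.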
22
Data Clustering Analysis
• Clustering:
  – Partitioning a set of data (or objects) into a set of classes, called clusters, such that the members of each class share some interesting common properties.
• High-quality clusters:
  – the intra-class similarity is high.
  – the inter-class similarity is low.
• Measuring data clustering quality:
  – Distance functions
23
Three Categories of Clustering Techniques
• Partitioning-based:
  – Basically enumerate various partitions and then score them by some criterion.
  – K-means, K-medoids, etc.
• Hierarchy-based:
  – Create a hierarchical decomposition of the set of data (or objects) using some criterion.
• Model-based:
  – A model is hypothesized for each of the clusters; find the best fit of the data to that model.
  – E.g., Bayesian classification (AutoClass), Cobweb.
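
For the partitioning-based category, K-means is the simplest example. The sketch below is a plain-Python version of the standard iteration (assign each point to its nearest center, then move each center to the mean of its points), assuming points are numeric tuples and squared Euclidean distance is the scoring criterion; the data, the seed, and the function names are illustrative assumptions.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100, seed=0):
    """Return (centers, clusters) after running the K-means iteration until it stabilizes."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k distinct data points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster (kept if the cluster is empty).
        new_centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Two well-separated toy groups in a 2-D parameter space.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(data, k=2)
print(centers)
print(clusters)
```

K-means only finds a local optimum that depends on the starting centers, which is one reason the partitions are described above as being scored by some criterion rather than solved exactly.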
24
Database Clustering Methods
• CLARANS (Ng & Han’94):
  – An extension to the k-medoid algorithm based on randomized search.
• BIRCH (Zhang et al.’96):
  – CF tree (a balanced tree structure).
• DBSCAN (Ester et al.’96):
  – Connects regions of sufficiently high density into clusters.
• STING (Wang et al.’97):
  – A hierarchical cell structure that stores statistical information.
• CLIQUE (Agrawal et al.’98):
  – Clusters high-dimensional data.
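
To make the density-based idea behind DBSCAN concrete, here is a compact Python sketch: a point with at least min_pts neighbors within eps is a core point, clusters grow by following core points, and points reachable from no core point are labeled noise. It uses a brute-force neighbor search rather than the spatial index of the published method, and the parameter values and toy points are assumptions for illustration.

```python
def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_pts:  # j is itself a core point, so keep expanding
                queue.extend(reach)
    return labels

# Two dense toy groups and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
print(dbscan(data, eps=1.5, min_pts=3))
```

Unlike K-means, the number of clusters is not specified in advance; it follows from the density parameters.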
25
Time-Series DM
• Trend and deviation analysis
  – Find trend (data evolution regularity) and deviations.
  – Regression analysis, visualization techniques.
• Subsequence analysis: similarity search
  – Subsequence matching: normalization + matching
  – Template specification: shape and macro specification.
• Sequential pattern analysis
  – Sequential association rules
• Periodicity analysis
  – Full periods vs. partial periods, cyclic association rules.
26
Similarity Search in DM
• Faloutsos et al. (1994):
  – Extract features from each window
  – Fourier Transform & R*-tree structure.
• Agrawal et al. (1995):
  – Amplitude scaling, offset translation
  – Distance is determined from the sequence envelopes
• Agrawal et al. (1995):
  – SDL pattern language to encode queries about “shapes”
• Jagadish et al. (1997):
  – Domain-independent framework
  – “Find all objects that are similar to some objects in class A and are not similar to any object in class B”
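
The "normalization + matching" idea, with invariance to offset translation and amplitude scaling, can be illustrated by a brute-force Python sketch: every window of the series and the query are z-normalized, then compared by Euclidean distance. This is a simplification chosen for clarity and is not the method of any of the cited papers (there are no Fourier features, R*-tree, or envelopes here); the toy series and function names are assumptions.

```python
import math

def z_normalize(seq):
    """Shift to zero mean and scale to unit standard deviation (constant input maps to zeros)."""
    mean = sum(seq) / len(seq)
    std = math.sqrt(sum((v - mean) ** 2 for v in seq) / len(seq))
    if std == 0:
        return [0.0] * len(seq)
    return [(v - mean) / std for v in seq]

def best_match(series, query):
    """Return (offset, distance) of the window of `series` most similar to `query`.

    Assumes len(query) <= len(series). Every window is checked in turn.
    """
    q = z_normalize(query)
    best = None
    for start in range(len(series) - len(query) + 1):
        window = z_normalize(series[start:start + len(query)])
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(window, q)))
        if best is None or d < best[1]:
            best = (start, d)
    return best

# The query shape appears in the series shifted by +10 and scaled by 20.
series = [5, 5, 5, 5, 10, 30, 50, 30, 10, 5, 5, 5]
query = [0, 1, 2, 1, 0]
print(best_match(series, query))
```

The feature-extraction and indexing steps listed above exist precisely to avoid this exhaustive scan on large databases.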
27
Periodic Pattern Search in Time-Related Data Sets
• Full cycle analysis:
  – Fourier transformation, other statistical analysis methods
• Fragment-wise cyclic behavior analysis:
  – Example: Jack reads the NY Times every day at 9:00 am.
  – Given (natural) periods vs. arbitrary periods.
  – A data cube and OLAP-based technique (Han, Gong and Yin’98)
• Cyclic association rules:
  – Associations which form cycles.
  – Cyclic Association Rules (B. Özden, S. Ramaswamy, A. Silberschatz, 1998)
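
Full cycle analysis by Fourier transformation can be sketched in a few lines of Python: compute the discrete Fourier transform of an evenly sampled series and report the period of the strongest component. The direct O(n^2) transform, the function name, and the synthetic sine-wave input are assumptions for illustration; unevenly sampled data (common for variable-star light curves) would need a different estimator.

```python
import cmath
import math

def dominant_period(samples):
    """Return the period, in samples, of the strongest non-constant Fourier component.

    Assumes evenly spaced samples. Uses a direct O(n^2) discrete Fourier transform.
    """
    n = len(samples)
    mean = sum(samples) / n
    centered = [v - mean for v in samples]  # remove the constant (zero-frequency) term
    best_k, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k

# Synthetic signal: 64 samples of a sine wave that repeats every 8 samples.
signal = [math.sin(2 * math.pi * t / 8) for t in range(64)]
print(dominant_period(signal))
```

This finds only periods that hold over the whole series; the fragment-wise and cyclic-association approaches above target patterns that recur in just part of each cycle.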
28
Conclusions
• Data warehouse: An industry trend
  – DW stores a huge amount of subject-oriented, cleansed, integrated, consolidated, time-related data.
• OLAP provides an interactive data analysis environment
• OLAM: Integration of mining with OLAP
  – Takes advantage of the data warehouse infrastructure.
  – From batch mining to interactive, multi-dimensional mining
• Many interesting research and implementation issues
• Database mining and warehouse mining are both important directions to pursue.
29
Outline
• Why and What
• DM Technology
• Future Directions
30
Future Work on DM Research
• Integration with data warehouse, OLAP, and relational technology
• Scalability: efficient algorithms, parallel/distributed and incremental mining
• Ad-hoc mining query language and its optimization
• Multiple, integrated DM functions and methods
• Mining on new kinds of data: time-series data, text, multimedia, spatial and Web
• Visual DM and knowledge visualization
• Application exploration
• Interactive, exploratory DM environment
31
Integration with Data Warehouse and OLAP Technology
• Data warehouse: A strong industry trend
  – A huge amount of subject-oriented, cleansed, integrated, consolidated, time-related data is stored in data warehouses
• OLAP provides an interactive data analysis environment
• Integrating mining with OLAP leads to multi-dimensional DM:
  – On-Line Analytical Mining (OLAM :-).
32
Efficiency and Scalability in DM
• Efficient algorithms in every DM function
  – Class description: Summarization and comparison
  – Classification and prediction
  – Clustering
  – Time-series and trend analysis
• Real-time, fast response in exploratory DM
  – Progressive, multiple precision data analysis
• Parallel and distributed DM algorithms
• Incremental DM methods
33
Why Parallel and Distributed DM?
• Massive data sets:
  – From megabytes to giga- and terabytes.
• Costly data mining algorithms:
  – Association, classification, clustering, prediction.
• Data in applications are geographically distributed.
• Parallel and networked computers are widely available.
34
DM Query Optimization
• Ad-hoc DM query language: DM SQL.
• DM query optimization:
  – How to carve a DM view?
  – How to push user-specified rule constraints?
  – How to integrate interestingness measures in mining?
• Interactivity of DM: New challenge.
35
Multiple, Integrated DM Functions and Methods
• Multiple mining functions
  – Concept description: characterization and discrimination
  – Classification and prediction
  – Clustering
  – Association and correlation analysis
• Multiple mining methods
  – Statistical approaches
  – Machine learning approaches
  – Neural network approach
  – Other approaches: mathematical models, etc.
36
Mining Complex Types of Data
• Text data mining:
  – Library databases, e-mails, book stores, Web pages.
• Spatial data mining:
  – Geographic information systems, engineering databases, medical image databases.
• Multimedia data mining:
  – Image and video/audio databases.
• Web mining:
  – Unstructured and semi-structured data
  – Web access pattern analysis: easily doable.
37
Visual DM
• The power of visual comprehension:
  – A picture = a thousand words
  – Pattern recognition and exploratory mining
• Visual mining techniques:
  – Data visualization
  – Integration with other mining methods
• Visual representation of knowledge
  – Charts, graphs, trees, curves, cubes
  – Multi-dimensional representation: color, shape, texture, gray-level, etc.
38
Exploration of DM Applications
• Need more success stories:
  – Insurance and market analysis, NBA strategy analysis.
• Most current DM systems lack a “thick semantic layer”
  – like the early relational database systems without application software.
• Customized data mining systems:
  – Market analysis DM systems
  – Insurance and customer analysis systems
39
Towards an Integrated, Exploratory DM Environment
• Exploratory DM:
  – Interactive, user-centered, exploratory mining process
  – High performance and fast response
• Integrated multiple DM functions and methods:
  – Try different approaches to see which one is better
  – Try different functions to see which patterns are more interesting
• Automated mining and interactive mining: not too far apart!
40
Towards VO-based DM
Success of DM
Success of VO
Hope the day comes sooner
41
Thank you !!!
42