Data avalanche - China-VO

Chinese Virtual Observatory
Zhang Yanxia
China-VO Group
2006.11.30 in Guilin
Outline
• Why
• What
• How
• Example
• Challenge
• Summary
China-VO 2006, Guilin
11/29-12/03
Astronomy facing
“data avalanche”
Necessity Is the Mother of Invention
DM&KDD
[Figure: sky images in eight wavebands. Panels: IRAS 25 µm, 2MASS 2 µm, DSS Optical, IRAS 100 µm, WENSS 92 cm, NVSS 20 cm, GB 6 cm, ROSAT ~keV]
Issues in Astronomy
Ofer Lahav, 2006, astro-ph/0610703
Summary of the 4th meeting on “Statistical Challenges in Modern Astronomy”, held at Penn State University in June 2006
• Compression (e.g. galaxy images and spectra)
• Classification (e.g. stars, galaxies, or gamma-ray bursts)
• Reconstruction (e.g. of blurred galaxy images, or of the mass distribution from weak gravitational lensing)
• Feature extraction (e.g. signature features of stars, galaxies and quasars)
• Parameter estimation (e.g. stellar parameter measurement, photometric redshift prediction, orbital parameters of extrasolar planets, or cosmological parameters)
• Model selection (e.g. are there 0, 1, 2, … planets around a star, or is a cosmological model with non-zero neutrino mass more favorable?)
Science Requirements for DM
(Borne K. D., 2001, Proc. of the MPA/ESO/MPE Workshop, 671)
• Cross-Identification - refers to the classical problem of associating the source list in one database to the source list in another.
• Cross-Correlation - refers to the search for correlations, tendencies, and trends between physical parameters in multi-dimensional data, usually across databases.
• Nearest-Neighbor Identification - refers to the general application of clustering algorithms in multi-dimensional parameter space, usually within a database.
• Systematic Data Exploration - refers to the application of the broad range of event-based and relationship-based queries to a database in the hope of making a serendipitous discovery of new objects or a new class of objects.
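As a minimal sketch of cross-identification, the Python below pairs each source in one list with its nearest neighbour on the sky in a second list, and accepts the pair only if the angular separation is within a tolerance. The function names, the (RA, Dec) tuple layout in degrees, the toy catalogues and the 2-arcsecond default radius are all assumptions made for illustration; none of them comes from the talk. A brute-force scan like this only suits small lists.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula: numerically stable for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cross_identify(catalog_a, catalog_b, radius_arcsec=2.0):
    """Return (index_a, index_b) pairs whose nearest neighbour lies within the radius."""
    radius_deg = radius_arcsec / 3600.0
    matches = []
    for i, (ra_a, dec_a) in enumerate(catalog_a):
        best_j, best_sep = None, float("inf")
        for j, (ra_b, dec_b) in enumerate(catalog_b):
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep < best_sep:
                best_j, best_sep = j, sep
        if best_j is not None and best_sep <= radius_deg:
            matches.append((i, best_j))
    return matches

# Hypothetical toy catalogues: (RA, Dec) in degrees.
optical = [(10.0000, 20.0000), (150.5000, -30.2000)]
infrared = [(150.5002, -30.2001), (10.0001, 20.0001), (200.0, 5.0)]
matches = cross_identify(optical, infrared)
```

For survey-sized lists the inner loop would normally be replaced by a spatial index, which is where the nearest-neighbour techniques in the list above come in.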
KDD: Opportunity and Challenges
[Diagram: factors surrounding KDD]
• Competitive pressure
• Data rich, knowledge poor (the resource)
• Data mining technology mature
• Enabling technology (interactive MIS, OLAP, parallel computing, Web, etc.)
KDD: A Definition
KDD is the automatic extraction of non-obvious,
hidden knowledge from large volumes of data.
10^6-10^12 bytes: never see the whole data set or put it in the memory of computers
• Data mining algorithms?
• What knowledge?
• How to represent and use it?
Benefits of Knowledge Discovery
[Figure: axes Value and Volume; stages EDP, MIS, DSS; labels Generate, Disseminate, Rapid Response]
EDP: Electronic Data Processing
MIS: Management Information Systems
DSS: Decision Support Systems
DM: A KDD Process
• Data mining: the core of the knowledge discovery process.
[Diagram, bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Work at each process of DM
[Bar chart, vertical axis 0 to 60; bars: DM object, Data preparation, Data processing, Analysis and evaluation]
Primary Tasks of Data Mining
• Classification: finding the description of several predefined classes and classifying a data item into one of them.
• Regression: maps a data item to a real-valued prediction variable.
• Clustering: identifying a finite set of categories or clusters to describe the data.
• Dependency modeling: finding a model which describes significant dependencies between variables.
• Deviation and change detection: discovering the most significant changes in the data.
• Summarization: finding a compact description for a subset of data.
Feature selection
• Filter method
• Wrapper method
• Embedded method
• Feature weighted method
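To make the first entry concrete, here is a small sketch of a filter method: each feature is scored on its own, independently of any classifier, and the top-scoring ones are kept. Scoring by absolute Pearson correlation with the target is only one possible choice and is an assumption here, as are the function names and the toy data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)

def filter_select(rows, target, k):
    """Rank feature columns by |correlation with target| and keep the top k indices."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, target)), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

# Hypothetical data: feature 0 tracks the label, feature 1 is constant,
# feature 2 is only weakly related to the label.
rows = [[1.0, 5.0, 0.3], [2.0, 5.0, 0.1], [3.0, 5.0, 0.4], [4.0, 5.0, 0.2]]
target = [0, 0, 1, 1]
selected = filter_select(rows, target, k=2)
```

A wrapper method would instead score candidate feature subsets by training and evaluating the classifier itself, which is costlier but accounts for interactions between features.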
Feature extraction
• PCA
• Factor analysis (Principal FA/Maximum Likelihood FA)
• Projection pursuit
• ICA
• Non-linear PCA/ICA
• Random projection
• Principal curves
• MDS
• LLE
• ISOMAP
• Topological continuous map
• Neural network
• Vector quantization
• Kernel PCA/ICA
• LDA (linear discriminant analysis)
• QDA (quadratic discriminant analysis)
• FDA (Fisher discriminant analysis)
• GDA (generalized discriminant analysis)
• KDDA (kernel direct discriminant analysis)
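As an illustration of the first method in the list, the sketch below finds the leading principal component of a small data set by power iteration on the sample covariance matrix, then projects the data onto it. This is a pure-Python teaching sketch under stated assumptions: the function names and the toy data are hypothetical, the data must have non-zero variance, and a real analysis would use a linear-algebra library and keep more than one component.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Leading PCA direction, found by power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # starting vector; assumed not orthogonal to the answer
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Fix the sign so the largest-magnitude entry is positive.
    pivot = max(v, key=abs)
    if pivot < 0:
        v = [-x for x in v]
    return v

def project(rows, direction):
    """Coordinates of the mean-centred rows along the given direction."""
    n, d = len(rows), len(direction)
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [sum((row[j] - means[j]) * direction[j] for j in range(d))
            for row in rows]

# Hypothetical two-feature data lying close to the line y = 2x.
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.0], [4.0, 8.1], [5.0, 9.9]]
pc1 = first_principal_component(data)
scores = project(data, pc1)
```

Here the two correlated features collapse onto a single score per object, which is the dimensionality reduction that the methods in this list aim for in different ways.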
Classification Methods
• Based on statistical theory: SVMs, ML, LDA, FDA, QDA, KNN
• Based on NN: LVQ, RBF, PNN, KSOM, BBN, SLP, MLP
• Based on decision trees: REPTree, RandomTree, CART, C5.0, J48, DecisionStump, RandomForest, NBTree, AC2, Cal5, ADTree, KDTree
• Based on decision rules: Decision Table, CN2, ITrule, AQ
• Based on Bayesian theory: Naive Bayes classifier, NBTree
• Based on meta learning: AdaBoost, boosting, bagging
• Based on evolution theory: genetic algorithm
• Based on fuzzy theory: fuzzy set, rough set
• Ensembles of classifiers
[Diagram: Data → Mining algorithm → Patterns]
Regression Methods
• (Penalized) logistic regression
• Bayesian regression analysis
• Additive regression
• Locally weighted regression
• Voted perceptron network
• Projection pursuit regression
• Recursive partitioning regression
• Alternating conditional expectation
• Stepwise regression
• Recursive least squares
• Fourier transform regression
• Rule-based regression
• Principal component regression
• Instance-based regression
• Multivariate adaptive regression splines
• Regression trees (CART, RETIS, M5, random forest, KDtree)
• Simple windowed regression
• SVM
• NN
Methods to estimate errors
• Train-test
• Cross-validation
• Bootstrap
• Leave-one-out
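A short sketch of k-fold cross-validation follows: the sample is split into k folds, each fold is held out once for testing while the rest is used for training, and the accuracies are averaged. Setting k equal to the sample size gives leave-one-out. The function names, the threshold "classifier" and the toy data are hypothetical stand-ins chosen only to keep the example self-contained.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(features, labels, fit, predict, k=10, seed=0):
    """Mean accuracy over k folds; each fold is held out once for testing."""
    folds = k_fold_indices(len(labels), k, seed)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        hits = sum(predict(model, features[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k

# Hypothetical stand-in classifier: a threshold halfway between the class means.
# It assumes both classes appear in every training split.
def fit_threshold(xs, ys):
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2

def predict_threshold(threshold, x):
    return int(x > threshold)

xs = [0.1 * i for i in range(20)] + [5.0 + 0.1 * i for i in range(20)]
ys = [0] * 20 + [1] * 20
accuracy = cross_validate(xs, ys, fit_threshold, predict_threshold, k=10)
```

A plain train-test split corresponds to using a single fold as the test set, while the bootstrap would draw the training sample with replacement instead of partitioning it.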
Evaluation of methods
• Accuracy
• Speed
• Comprehensibility
• Time to learn
• Generalization
Model Selection for Classification
• Accuracy
• G-mean
• F-measure
• ROC (Receiver Operating Characteristic curve)
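The first three criteria can all be computed from the counts of a binary confusion matrix, as the sketch below shows. It uses the common definitions (G-mean as the geometric mean of sensitivity and specificity, F-measure as the harmonic mean of precision and recall); the function name and the example counts are hypothetical, and the code assumes each denominator is non-zero.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Accuracy, G-mean and F-measure from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    g_mean = math.sqrt(sensitivity * specificity)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "g_mean": g_mean, "f_measure": f_measure}

# Hypothetical imbalanced sample: 10 positives among 1000 objects.
m = binary_metrics(tp=5, fp=5, tn=985, fn=5)
```

In this made-up case the accuracy is 0.99 even though half of the rare class is missed, while G-mean (about 0.71) and F-measure (0.5) expose the problem, which is why accuracy alone can mislead on imbalanced samples.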
Model Selection for Regression
• AIC (Akaike information criterion)
• BIC (Bayesian information criterion)
• SRM (Structural Risk Minimization)
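For least-squares fits with Gaussian errors, AIC and BIC reduce, up to an additive constant, to n ln(RSS/n) plus a penalty of 2k or k ln(n), where RSS is the residual sum of squares, n the number of points and k the number of free parameters. The sketch below applies these forms to three candidate fits; the function names and the RSS values are invented for illustration.

```python
import math

def aic_gaussian(rss, n, k):
    """AIC for a least-squares fit with Gaussian errors, up to an additive constant."""
    return n * math.log(rss / n) + 2 * k

def bic_gaussian(rss, n, k):
    """BIC under the same assumptions; the penalty grows with log(n)."""
    return n * math.log(rss / n) + k * math.log(n)

# Hypothetical fits to the same 100 points:
# (name, residual sum of squares, number of free parameters).
candidates = [("linear", 50.0, 2), ("quadratic", 30.0, 3), ("degree-9", 29.5, 10)]
n = 100
best_by_aic = min(candidates, key=lambda c: aic_gaussian(c[1], n, c[2]))[0]
best_by_bic = min(candidates, key=lambda c: bic_gaussian(c[1], n, c[2]))[0]
```

Lower values are better for both criteria. With these numbers the degree-9 fit has the smallest residuals but is rejected by both, because its slight gain in fit does not pay for seven extra parameters.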
Example 1
Lim Tjen-Sien et al. Machine Learning, 40, 203-229 (2000)
33 algorithms on 16 different samples
22 decision trees
CART, S-Plus tree, C4.5, FACT, QUEST, IND, OC1, LMDT, CAL5, T1
9 statistical methods
LDA, QDA, NN, LOG, FDA, PDA, MDA, POL
2 neural networks
LVQ, RBF
Example 1
Lim Tjen-Sien et al. Machine Learning, 40, 203-229 (2000)
Example 2
Example 3
Zhao, Y., Zhang, Y., 2006, submitted to COSPAR
Example 3
Zhang, Y., Zhao, Y., 2006, submitted to ChJAA
For NB, ADTree and MLP, the overall accuracy is 97.5%, 98.5% and 98.1%, respectively.
Example 4
Zhang, Y., Luo, A., Zhao, Y., 2006, submitted to COSPAR
Using best-forward search, j-h, b-v and j + 2.5 lg Fpeak are the optimal features selected from the 10 features.
A Decision Table classifier is applied, with 10-fold cross-validation for training and testing.
98.03%
Example 5
Li, Y., Zhang, Y., Zhao, Y., 2006, submitted to Chinese Science
k-Nearest neighbor classifier
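For readers unfamiliar with the method, a k-nearest neighbor classifier labels an object by majority vote among the k training objects closest to it in feature space. The sketch below is a generic illustration, not the implementation used in the cited work: the function name, the Euclidean distance, the two-feature training set and the "star"/"quasar" labels are all assumptions.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set with two classes.
train_points = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
                (1.1, 1.0), (1.2, 1.3), (0.9, 1.1)]
train_labels = ["star", "star", "star", "quasar", "quasar", "quasar"]
label_a = knn_predict(train_points, train_labels, (0.25, 0.2))
label_b = knn_predict(train_points, train_labels, (1.0, 1.2))
```

The method has no training phase, so its cost lies in the neighbor search, and features on very different scales should be normalized before distances are compared.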
Example 6
Zhang, Y., Zhao, Y., 2006, ADASS XV, 351, 173
Challenges and Influential Aspects
[Diagram: challenges surrounding Knowledge Discovery]
• Massive data sets, high dimensionality (efficiency, scalability)
• Different sources of data (distributed, heterogeneous databases; noisy, missing and irrelevant data, etc.)
• Changing data and knowledge
• Handling of different types of data with different degrees of supervision
• Interactive, visualization
• Understandability of patterns, various kinds of requests and results (decision lists, inference networks, concept hierarchies, etc.)
Summary
• Linear or non-linear
• Gaussian or non-Gaussian
• Continuous or discrete
• Missing or not
• Comparison of the number of attributes with that of records
• Choose the appropriate method or ensemble algorithms according to the task and data characteristics
Prospect
With the wings of DM,
find more, better or best knowledge!
Thank you for your attention!