10.1117/2.1200811.1283
Data mining in astronomy
Yanxia Zhang and Yongheng Zhao
Effective information-management strategies could accelerate the pace
of discovery.
Like other data-rich disciplines such as physics, biology, geology, and oceanography, astronomy is facing a data avalanche
due to advances in telescope and detector technology, the exponential increase in computing capabilities, improvements in
data-collection methods, and successful applications of theoretical simulations. As the era approaches in which data covers the
full range of wavelengths from radio to gamma-rays, the expected data volumes will add up to terabytes, soon to be followed by petabytes.
Proper management and processing of massive data sets requires efficient federation of database technologies. However,
mining knowledge from huge data volumes is the ultimate goal
and development of data-mining techniques is therefore critical.
Knowledge discovery in databases (KDD) is the process of extracting useful knowledge from data. Data mining, the application of specific algorithms to discover rare or previously unknown types of object or phenomenon, is a particular step in the
process.1 KDD is inherently interactive and iterative, as shown
in Figure 1. Common KDD functions are classification, cluster
analysis, and regression.
In classification one develops a description or model for each
class of data labeled with discrete integers (as opposed to cluster analysis, which is sometimes called ‘unsupervised classification’). Classification is used for the organization of future test
data, better understanding of each data class, and predictions
of certain properties and behaviors. It is based on spectra or
images and, for example, may be used to describe galaxies by
morphology.
Cluster (or clustering) analysis is a multivariate procedure
based on placing objects into more or less homogeneous groups
such that the relationship between groups is revealed. It lacks
an underlying body of statistical theory and is heuristic in nature, requiring decisions to be made by individual users (which
can strongly affect results). Cluster analysis is used to classify
groups or objects more objectively than subjectively and can help
astronomers find unusual objects within a flood of data. Examples include discoveries of high-redshift quasars, type-2 quasars
(highly luminous active galactic nuclei whose centers are obscured by gas and dust), and brown dwarfs.
Figure 1. Knowledge discovery in databases.
In regression analysis the input data labels are real and continuous. Therefore, if an algorithm can handle data with both
real and integer targets, it can be used for classification and
regression. Discoveries in astronomy from regression include
the Hertzsprung-Russell diagram and Hubble’s law relating a
galaxy’s recessional velocity to its distance. Problems in astronomy that can be solved by regression include photometric
or spectral redshift measurements of galaxies and quasars and
physical parameter estimations of stars.
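As a minimal sketch of regression with a real-valued target, the snippet below fits Hubble's law, v = H0 * d, by least squares through the origin. The function name and the data values are assumptions made for illustration only; the numbers are synthetic and are not measurements.

```python
def fit_through_origin(distances, velocities):
    """Least-squares slope for the model v = H0 * d (no intercept term)."""
    numerator = sum(d * v for d, v in zip(distances, velocities))
    denominator = sum(d * d for d in distances)
    return numerator / denominator


# Synthetic, illustrative values only (distance in Mpc, velocity in km/s).
distances_mpc = [10.0, 20.0, 30.0, 40.0]
velocities_kms = [700.0, 1400.0, 2100.0, 2800.0]

h0 = fit_through_origin(distances_mpc, velocities_kms)

# Use the fitted slope to predict the velocity at a new distance.
predicted_velocity = h0 * 50.0
```

The same pattern (fit a continuous relation on known objects, then predict for new ones) underlies tasks such as photometric-redshift estimation, although practical systems use far richer models.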
KDD techniques
KDD is a new and growing field which can address many
of the problems facing modern astronomy. Many knowledge-discovery methods are in use and under development, some
generic while others remain domain specific. Six common, essential elements qualify a data-mining approach as a KDD technique. All KDD methods2, 3 share the same principles of efficiency, accuracy, comprehensibility, automation, and generalization, taking the shortest time possible to learn.
Data-mining algorithms are a core part of KDD. They can
be supervised, semisupervised, or unsupervised. Supervised
learning uses training data to infer a model which is then
applied to test data. Unsupervised learning relies exclusively on
test data. In other words, supervised learning requires labeled input data, while unsupervised learning does not. The semisupervised
approach uses a combination of labeled and unlabeled data to
train a classifier. A large amount of unlabeled data can often be
supplemented with a small amount of labeled data to construct
a useful classifier.
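One common way to combine labeled and unlabeled data is self-training. The sketch below is an illustration under stated assumptions, not the authors' method: it uses a nearest-centroid classifier, and the function names, the `margin` parameter, and the toy data are all hypothetical. A point receives a pseudo-label only when it is clearly closer to one class centroid than to the next best. At least two classes are required.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def self_train(labeled, unlabeled, rounds=5, margin=1.0):
    """Grow the labeled set by pseudo-labeling confidently classified points.

    labeled maps each class label to a list of points. Returns the enlarged
    mapping and the points that stayed unlabeled.
    """
    labeled = {label: list(points) for label, points in labeled.items()}
    pool = list(unlabeled)
    for _ in range(rounds):
        centers = {label: centroid(points) for label, points in labeled.items()}
        remaining = []
        for p in pool:
            ranked = sorted(centers, key=lambda label: math.dist(p, centers[label]))
            best, runner_up = ranked[0], ranked[1]
            gap = math.dist(p, centers[runner_up]) - math.dist(p, centers[best])
            if gap >= margin:
                labeled[best].append(p)  # confident: adopt the pseudo-label
            else:
                remaining.append(p)      # ambiguous: leave unlabeled
        if len(remaining) == len(pool):
            break                        # nothing new was labeled this round
        pool = remaining
    return labeled, pool


# Hypothetical toy data: one labeled example per class plus unlabeled points.
seed = {"star": [(0.0, 0.0)], "quasar": [(4.0, 4.0)]}
unlabeled_points = [(0.5, 0.2), (3.8, 4.1), (1.0, 0.5), (3.0, 3.5), (2.0, 2.0)]

grown, still_unlabeled = self_train(seed, unlabeled_points)
```

In this toy run the point midway between the two classes stays unlabeled, which is the intended behavior: ambiguous objects are not forced into a class.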
Generally, supervised-learning algorithms produce a better
success rate than unsupervised approaches with respect to the
value of the resulting knowledge. For example, reduction of high
dimensionality relies on feature selection and extraction, which
removes irrelevant or redundant variables. Feature-selection
methods include the filter, wrapper, and embedded methods.4
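A filter method scores each feature independently of any learning algorithm and keeps the best-scoring ones. The sketch below ranks features by the absolute Pearson correlation with the target, which is one possible scoring choice among many. The function names, feature names, and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return cov / math.sqrt(var_x * var_y)


def filter_select(columns, target, top_k):
    """Return the names of the top_k features ranked by |correlation| with target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy table: one informative, one noisy, one constant feature.
columns = {
    "color_index": [0.1, 0.4, 0.5, 0.9, 1.2],
    "noise": [0.3, -0.2, 0.3, -0.2, 0.1],
    "constant_flag": [1.0, 1.0, 1.0, 1.0, 1.0],
}
target = [0.2, 0.8, 1.0, 1.8, 2.4]

selected = filter_select(columns, target, top_k=1)
ranking = filter_select(columns, target, top_k=3)
```

Wrapper and embedded methods differ in that they involve the learning algorithm itself when evaluating features, which is usually more expensive than this kind of filter.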
Learning algorithms are complex and are generally considered
the hardest part of any KDD technique; they can be realized using
different approaches.5, 6 Classification and regression are normally performed by supervised-learning techniques. Many algorithms, such as k-nearest neighbors, support-vector machines,
neural networks, naïve Bayes, decision trees, decision rules,
metalearning, genetic algorithms, fuzzy sets, rough sets, and ensembles of classifiers, have been applied to solve classification
problems. Frequently used regression methods include locally
weighted, kernel, and projection-pursuit regression, k-nearest
neighbors, and neural networks.
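To make the supervised case concrete, here is a minimal k-nearest-neighbors classifier: a query object takes the majority label of its k closest training objects. The feature values and class names are hypothetical, and Euclidean distance is assumed as the similarity measure.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points closest to the query."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set (for example, two color indices).
train_points = [
    (0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
    (1.0, 1.1), (1.2, 0.9), (0.9, 1.0),
]
train_labels = ["star", "star", "star", "quasar", "quasar", "quasar"]

label_a = knn_predict(train_points, train_labels, (0.2, 0.2))
label_b = knn_predict(train_points, train_labels, (1.0, 1.0))
```

Replacing the majority vote with the mean of the neighbors' real-valued targets turns the same procedure into k-nearest-neighbors regression.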
Cluster analysis is usually realized by unsupervised-learning
techniques. It groups objects of similar kinds into categories
and sorts different objects into groups by their degree of association. It uses a number of different algorithms, such as K-means, K-medoids, AutoClass, self-organizing maps, principal-component analysis, and expectation maximization.
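As an illustration of the unsupervised case, the following sketch implements the basic K-means iteration (Lloyd's algorithm): assign every point to its nearest center, then move each center to the mean of its members. The initial centers are passed in explicitly to keep the example deterministic; the function name and data are assumptions for illustration.

```python
import math


def k_means(points, initial_centers, max_iterations=100):
    """Alternate assignment and centroid update until the centers stop moving."""
    centers = [tuple(c) for c in initial_centers]
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: index of the nearest center for every point.
        assignments = [
            min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for i, old_center in enumerate(centers):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centers.append(
                    tuple(sum(coords) / len(members) for coords in zip(*members))
                )
            else:
                new_centers.append(old_center)  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, assignments


# Hypothetical toy data: two well-separated groups of three points.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
]
centers, assignments = k_means(points, initial_centers=[points[0], points[3]])
```

The result depends on the number of clusters and on the starting centers, both chosen by the user, which is one instance of the user decisions that can strongly affect clustering results.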
Outlier detection aims to detect objects that behave in an unexpected way or have abnormal properties. It can find rare,
unknown, or bad data. The techniques used are commonly divided into six categories: distribution-, depth-, distance-, clustering-, density-, and deviation-based methods.7
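A distance-based criterion is perhaps the simplest of these to sketch: an object is flagged when even its k-th nearest neighbor lies farther away than a chosen threshold. The function name, the parameters `k` and `threshold`, and the data below are illustrative assumptions; the input must contain more than k points.

```python
import math


def knn_distance_outliers(points, k=2, threshold=3.0):
    """Return indices of points whose k-th nearest neighbor is beyond threshold."""
    outliers = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        if distances[k - 1] > threshold:
            outliers.append(i)
    return outliers


# Hypothetical toy data: a tight group of four points and one isolated point.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (8.0, 8.0)]
outlier_indices = knn_distance_outliers(points, k=2, threshold=3.0)
```

This brute-force version compares every pair of points, so it scales quadratically with the number of objects; large catalogs would need spatial indexing or approximate neighbor searches.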
The future of KDD
Automation of KDD would offer many advantages. Numerous projects are currently underway to achieve this goal, such
as the International Virtual Observatory Alliance (IVOA),8 as
well as the GRIST9 and astrostatistics10 programs. In addition, we
previously proposed an architecture for multiwavelength data
mining.11 In this system, users with no database knowledge may
create their own databases and federate multiwavelength data
using automated database-creation and cross-match tools. The
use of such data-mining tools will enable scientists to work with
large data samples. For example, a recently designed automatic
system for photometric-redshift estimation will become an
essential tool to automatically determine the physical parameters of galaxies, quasars, and stars.12
Progress in this field requires international collaboration
among experts from various disciplines, including computer scientists, database and data-mining specialists, statisticians, and astronomers. Only then will the astronomical community (and
other data-rich sciences) share in the intellectual prosperity afforded by optimal investigation of the available data. We believe
that our work on the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) project at the National Astronomical Observatories of the Chinese Academy of Sciences will be
a successful example of how to integrate data acquisition and
knowledge retrieval.
This article is funded by the National Natural Science Foundation of
China under grant No. 10778724 and by Chinese National 863 project
No. 2006AA01A120.
Author Information
Yanxia Zhang and Yongheng Zhao
National Astronomical Observatories
Chinese Academy of Sciences
Beijing, China
Yanxia Zhang is an associate professor. She specializes in the
study of multiwavelength astronomy and in data-mining algorithms.
Yongheng Zhao is project manager of the LAMOST project. A
professor since 1996, he specializes in the study of high-energy
astrophysics, and data mining and analysis in astronomy.
References
1. U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, Knowledge discovery and data
mining: towards a unifying framework, Proc. Int’l Conf. Knowl. Disc. Data Mining 2,
pp. 82–88, Portland, 1996.
2. U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, From data mining to knowledge discovery: an overview, in U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R.
Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, pp. 1–34,
AAAI Press/The MIT Press, Menlo Park, CA, 1996.
3. W. J. Frawley, G. Piatetsky-Shapiro, and C. Matheus, Knowledge discovery in
databases: an overview, in G. Piatetsky-Shapiro and W. J. Frawley (eds.), Knowledge
Discovery in Databases, pp. 1–30, AAAI Press/MIT Press, Cambridge, MA, 1991.
4. H. Zheng and Y. Zhang, Feature selection for high-dimensional data in astronomy,
Adv. Space Res. 41, pp. 1960–1964, 2008.
5. Y. Zhang, Y. Zhao, and C. Cui, Data mining and knowledge discovery in database of
astronomy, Prog. Astron. 20 (4), pp. 312–323, 2002.
6. Y. Zhang, H. Zheng, and Y. Zhao, Knowledge discovery in astronomical data, Proc.
SPIE 7019, p. 701938, 2008. doi:10.1117/12.788417
7. Y. Zhang, A. Luo, and Y. Zhao, Outlier detection in astronomical data, Astronomical
Data Analysis II, Proc. SPIE 5493, pp. 521–529, 2004. doi:10.1117/12.550998
8. http://www.ivoa.net/
9. http://grist.caltech.edu/
10. http://astrostatistics.psu.edu/
11. Y. Zhang, Y. Zhao, and H. Zheng, System architectural design of multiwavelength
data mining, Proc. SPIE 7017, p. 70171M, 2008. doi:10.1117/12.788398
12. D. Wang, Y. Zhang, and Y. Zhao, An automatic system for photometric redshift estimation based on sky survey data, Proc. SPIE 7019, p. 701937, 2008.
doi:10.1117/12.788429
© 2008 SPIE