Cluster Analysis of Massive Datasets in Astronomy

Woncheol Jang∗

March 7, 2006

∗ Institute of Statistics and Decision Sciences, Duke University, Durham, NC 27708, USA
Abstract
Clusters of galaxies are a useful proxy to trace the mass distribution of the universe. By measuring the mass of clusters of galaxies at different scales, one can follow the evolution of the mass distribution (Martínez and Saar, 2002). It can be shown that finding galaxy clusters is equivalent to finding density contour clusters (Hartigan, 1975): connected components of the level set $S_c \equiv \{f > c\}$, where $f$ is a probability density function. Cuevas et al. (2000, 2001) proposed a nonparametric method for estimating density contour clusters, which finds the clusters by means of a minimal spanning tree. While their algorithm is conceptually simple, it requires intensive computation for large datasets. We propose a more efficient clustering method based on their algorithm that uses the Fast Fourier Transform (FFT). The method is applied to a study of galaxy clustering in large astronomical sky survey data.

Key Words: Density contour cluster; level set; clustering; Fast Fourier Transform.
1 Introduction
In the social and physical sciences, clustering often plays an important role
in analyzing data. For example, clusters of galaxies are a useful proxy to trace the mass distribution of the universe. By measuring the mass of clusters of galaxies at different scales, one can follow the evolution of the mass distribution (Martínez and Saar, 2002).
In most cases, the objectives of clustering are to find the locations and the
number of clusters. Although these two problems are separate, it is tempting
to solve both of them simultaneously. The usual tools for clustering are
similarities or distances between objects.
One popular approach is model-based clustering (Fraley and Raftery, 2002; McLachlan and Peel, 2000). It assumes that the data are generated according to a mixture distribution with G components, each component belonging to a parametric family such as the normal. The parameters and the number of clusters are estimated from the data, and each observation can be assigned to a cluster with a probability of originating from that cluster. While model-based clustering provides a way to estimate the number of clusters and the membership of each observation, the results are often sensitive to the assumed parametric family; see Stuetzle (2003), who also pointed out that model-based clustering is only suitable for ellipsoidal clusters.
An alternative is the nonparametric approach, which is based on the premise that a cluster corresponds to a mode, a region carrying high probability over its neighborhood. The goal of this approach is to find the modes and assign each observation to a cluster. Hartigan (1975) captured this concept by introducing density contour clusters: clusters are connected components of the level set $S_c \equiv \{f > c\}$. Indeed, it can be shown that finding galaxy clusters is equivalent to finding density contour clusters.
In this paper, we present a fast clustering algorithm for Hartigan's density contour clusters based on Cuevas et al. (2000, 2001), which we will refer to as the CFF algorithm. They suggested using unions of balls centered at data points to estimate the connected components of the level set and provided an algorithm to extract the connected components of the estimated level set. While the CFF algorithm is conceptually simple, it requires a massive amount of computation for large datasets. We propose a more efficient clustering method based on this algorithm. Instead of using data points, we use grid points as the centers of the balls. As a result, the Fast Fourier Transform (FFT) can be employed to reduce the computational cost for large datasets.
The rest of this paper is organized as follows. In the following section, we review the current status of level set inference. We then introduce the original CFF algorithm and our algorithm, a modified version of the CFF algorithm, in section 3. The method is applied to a study of galaxy clustering on large astronomical datasets in section 4. Finally, in section 5, we discuss possible extensions of our method.
2 Level Set Estimation
Suppose that $Y_1, \ldots, Y_n$ are independent observations from an unknown density $f$ on $\mathbb{R}^d$. From Hartigan's point of view, density contour clustering is equivalent to estimating the level set $S_c = \{f > c\}$. Here $c$ is a constant and often it is suggested by the situation under study.

A naive estimator for the level set is the plug-in estimator $\hat{S}_c \equiv \{\hat{f} > c\}$; see Cuevas and Fraiman (1997) and references given therein. Here $\hat{f}$ is a kernel density estimator:
$$\hat{f}(y) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{y_1 - Y_{i1}}{h}, \ldots, \frac{y_d - Y_{id}}{h}\right),$$
where $y = (y_1, \ldots, y_d)$, $h = h_n$ is a sequence of bandwidths satisfying $h_n \to 0$ and $n h_n^d \to \infty$, and $K$ is a multivariate kernel function satisfying
$$\int K(y)\, dy = 1, \qquad \int y\, K(y)\, dy = 0, \qquad \int y y^T K(y)\, dy = \mu_2(K)\, I.$$
Here $\mu_2(K) = \int y_j^2 K(y)\, dy$ is independent of $j$ (Wand and Jones, 1995).
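For concreteness, one standard choice satisfying these conditions (used purely as an illustration here; the paper does not fix a particular kernel at this point) is the spherically symmetric Gaussian kernel
$$K(y) = (2\pi)^{-d/2} \exp\!\left(-\tfrac{1}{2}\|y\|^2\right), \qquad \int K(y)\,dy = 1, \quad \int y\,K(y)\,dy = 0, \quad \int y y^T K(y)\,dy = I,$$
so that $\mu_2(K) = 1$.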
Cuevas and Fraiman (1997) studied asymptotic properties of plug-in estimators of the type $\{\hat{f} > c_n\}$ with $c_n \to 0$ for support estimation, and established consistency and convergence rates.

From a different point of view, the level set can be used to develop a tool for statistical quality control and outlier detection. If an observation falls outside the level set, the process can be classified as out of control. It is then of interest to know the asymptotic behavior of this type of classification error probability; Báillo et al. (2001) obtained convergence rates for such probabilities.
It may not be easy to construct the plug-in estimator in practice because of the complicated geometrical structure of the estimator (Cuevas et al., 2000). A simpler alternative is to use a finite union of balls:
$$\tilde{S}_c^1 = \bigcup_{i=1}^{k_n} B(Y_{(i)}, \epsilon_n),$$
where the $Y_{(i)}$ are those observations (in the original sample $Y_1, \ldots, Y_n$) belonging to $\hat{S}_c$, $B(Y_{(i)}, \epsilon_n)$ is a closed ball centered at $Y_{(i)}$ with radius $\epsilon_n$, and $k_n$ is the number of $Y_{(i)}$. Note that $k_n$ is random.

$\tilde{S}_c^1$ can be viewed as a histogram version of the plug-in estimator because the plug-in estimator is itself a finite union of the sets $Y_{(i)} + h S_c(K)$, provided $S_c(K) = \{K > c\}$ is bounded.
The properties of estimators of this type were studied originally by Devroye and Wise (1980) with applications to statistical quality control.

Two set metrics have been used in the literature for asymptotic inference: the distance in measure $d_\mu(S, T)$ (similar to the $L_1$ metric in density estimation) and the Hausdorff metric $d_H$ (similar to the supremum norm in density estimation):
$$d_\mu(S, T) \equiv \mu(S \,\triangle\, T), \qquad d_H(T, S) \equiv \inf\{\epsilon > 0 : T \subset S_\epsilon,\ S \subset T_\epsilon\},$$
where $\triangle$ denotes the symmetric difference, $\mu$ is the Lebesgue measure, and $S_\epsilon$ is the union of all open balls with radius $\epsilon$ around points of $S$.
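As a simple illustration (a toy example of our own, not from the original text), take $S = [0, 1]$ and $T = [0, 1.2]$ on the real line. Then
$$d_\mu(S, T) = \mu\big((1, 1.2]\big) = 0.2, \qquad d_H(T, S) = 0.2,$$
since $T \subset S_\epsilon$ and $S \subset T_\epsilon$ for every $\epsilon > 0.2$ but not for smaller $\epsilon$.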
Devroye and Wise (1980) first proved the $d_\mu$-consistency of $\tilde{S}_c^1$ under conditions on $\epsilon_n$ similar to those imposed on the bandwidth in kernel smoothing, and Korostelev and Tsybakov (1993) obtained $d_\mu$ convergence rates of $\tilde{S}_c^1$ when the level set is a domain with a piecewise Lipschitz boundary.

Consistency and convergence rates with respect to $d_H$ can be found in Cuevas and Rodríguez-Casal (2004) and Cuevas and Fraiman (1997).
3 Clustering Algorithm

3.1 CFF Algorithm
To find clusters, one must estimate the level set and extract the connected components of the estimated level set. In the machine learning literature, finding the connected components of the estimated level set can be viewed as a constraint satisfaction problem (Russell and Norvig, 2002). While best-first greedy search is often used for this type of search problem, finding the optimal solution may not be feasible. A possible alternative is heuristic search, but it may require a large amount of memory. For really big search spaces, this method could run out of memory (personal communication with Andrew Moore) since it requires visiting every data point. In contrast to algorithms for the constraint satisfaction problem, the CFF algorithm only needs to visit those observations belonging to the level set.
The key idea of the CFF algorithm is first to find the subset of the data belonging to the level set and then to form clusters by agglomerating these data points. In short, the CFF algorithm consists of two key steps (a small illustrative sketch follows the steps).

Step 1 Among the original data, find the observations $Y_{(i)}$ which belong to the estimated level set $\hat{S}_c$.

Step 2 Identify the connected components of the estimated level set $\hat{S}_c$ by unions of open balls centered at the $Y_{(i)}$ with radius $\epsilon_n$. This means that two points $Y_{(i)}$ fall in the same cluster whenever they can be joined by a path consisting of a finite number of edges, each of length smaller than $2\epsilon_n$.
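The following is a minimal R sketch of these two steps in two dimensions. It is our own illustration rather than the authors' implementation: the Gaussian kernel, the bandwidth h, the level c0 and the radius eps are placeholder choices, and the simulated data are purely artificial. A single-linkage dendrogram cut at height $2\epsilon_n$ recovers exactly the connected components described in Step 2.

    ## A minimal R sketch of the two CFF steps in d = 2 dimensions.
    # Simple bivariate Gaussian kernel density estimate evaluated at the points in `at`.
    kde2d.at <- function(data, at, h) {
      n <- nrow(data)
      sapply(seq_len(nrow(at)), function(j) {
        u1 <- (at[j, 1] - data[, 1]) / h
        u2 <- (at[j, 2] - data[, 2]) / h
        sum(exp(-0.5 * (u1^2 + u2^2))) / (n * h^2 * 2 * pi)
      })
    }

    set.seed(1)
    Y <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),   # two artificial "clusters"
               matrix(rnorm(200, mean = 3, sd = 0.3), ncol = 2))
    h   <- 0.3    # bandwidth (placeholder)
    c0  <- 0.05   # level c (placeholder)
    eps <- 0.3    # ball radius eps_n (placeholder)

    # Step 1: keep the observations Y_(i) whose density estimate exceeds c.
    fhat <- kde2d.at(Y, Y, h)
    Yin  <- Y[fhat > c0, , drop = FALSE]

    # Step 2: two retained points lie in the same connected component exactly when they
    # can be joined by edges shorter than 2 * eps; a single-linkage cut at height 2 * eps
    # yields those components.
    labels <- cutree(hclust(dist(Yin), method = "single"), h = 2 * eps)
    table(labels)   # number and sizes of the resulting clusters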
Cuevas et al. (2000) also suggested an alternative version of their procedure for small $k_n$. This can happen if the sample size is small or $c$ is relatively large. The key idea of the alternative procedure is to replace the $Y_{(i)}$ with smoothed bootstrap observations $Z_{(i)}$ belonging to the estimated level set, drawn from $\hat{f}$.

Since the CFF algorithm is nonparametric, it outperforms other clustering algorithms, such as mixture models and hierarchical single-linkage clustering, in noisy-background settings such as the astronomical sky survey data in Figure 2 (Wong and Moore, 2002).
3.2 CFF Algorithm with FFT
The CFF algorithm is conceptually simple, but it requires massive computation for large datasets. Even for the first step, we need to compute density estimates at every observation. Especially in high dimensions, this task can be daunting even with today's computing power. There have been substantial developments in the machine learning community on this type of problem, namely density estimation in high dimensions with large datasets. For example, Moore (1999) provided an algorithm to fit an EM-based mixture model with KD-trees, and Gray and Moore (2003) presented an algorithm for fast kernel density estimation for high-dimensional massive datasets. While these methods approximate density estimates very quickly by cutting off the search early without computing exact densities, some issues remain in the second step.
The second step is equivalent to finding a minimum spanning tree, and Wong and Moore (2002) proposed an alternative implementation based on the GeoMS2 algorithm (Narasimhan et al., 2000). Though Wong and Moore improved on the CFF algorithm, their algorithm did not address the choice of $\epsilon_n$, an extra smoothing parameter which the CFF algorithm requires as input. While Cuevas et al. (2000) suggested a few empirical rules for possible choices of $\epsilon_n$, those rules still require intensive computation and may not be useful in practice for large datasets.
To save computing cost and provide a convenient choice of $\epsilon_n$, we propose a modified version of the CFF algorithm. The key idea is to replace the data points with grid points $t_1, \ldots, t_m$. In other words, we estimate $\hat{S}_c$ with
$$\tilde{S}_c^2 \equiv \bigcup_{i=1}^{k_m} B(t_{(i)}, \epsilon_m),$$
where the $t_{(i)}$ are equally spaced grid points belonging to $\hat{S}_c$, $k_m$ is the total number of grid points belonging to $\hat{S}_c$, and $\epsilon_m$ is the grid size.
For the choice of the number of grid points, Wand (1994) provided some guidelines for the multivariate case. For the two-dimensional case, Table 1 in Wand (1994) shows that $32^2$ grid points with linear binning achieve almost the same accuracy as 10,000 data points. Hall and Wand (1996) also addressed the minimum grid size needed to achieve a certain degree of asymptotic efficiency. Our empirical rule is to choose the nearest integer to $n^{1/d} k^{-1}$ as the number of grid points in each dimension, where $k$ is any number between two and three, so the total number of grid points is $(n^{1/d} k^{-1})^d = n k^{-d}$; a worked example follows.
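As a concrete, purely illustrative application of this rule, take the redshift slice analyzed in section 4.4, which contains n = 33,157 galaxies in d = 2 dimensions, and pick k = 2.5 (any value between two and three would do):

    n <- 33157; d <- 2; k <- 2.5
    g <- round(n^(1/d) / k)   # grid points per dimension: round(182.1 / 2.5) = 73
    m <- g^d                  # total grid points: 73^2 = 5329, close to n * k^(-d) = 5305

Note that the actual analysis of section 4.4 uses a 412 by 71 grid instead, because cosmologists prefer to preserve the physical aspect ratio of RA and DEC; the rule above is only a default.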
Having used the grid spacing as the radius of the balls, one can apply the Fast Fourier Transform (FFT) to compute the density estimates at the grid points and thereby speed up the computation. While the FFT only requires $O(m \log m)$ operations to compute the density estimates at every grid point, other methods usually require $O(n \log n)$. Note that we choose $m < n$.
To implement our algorithm, we use the following steps, as described in Cuevas et al. (2000); a small R sketch of the same procedure is given after the steps.

Let $T$ be the number of connected components and set its initial value to 0.

Step 1 Compute $\hat{f}$ at every grid point using the FFT and find the subset of grid points $\{t_{(i)} : t_{(i)} \in \hat{S}_c\}$.

Step 2 Choose a grid point from this set and call it $t_{(1)}$. Compute the distance $r_1 = \|t_{(1)} - t_{(2)}\|$, where $t_{(2)}$ is the nearest grid point to $t_{(1)}$.

Step 3a If $r_1 > 2\epsilon_m$, the ball $B(t_{(1)}, \epsilon_m)$ is a connected component of $\hat{S}_c$. Put $T = T + 1$ and repeat step 2 with any grid point in $\hat{S}_c$ except $t_{(1)}$.

Step 3b If $r_1 \le 2\epsilon_m$, compute $r_2 = \min\{\|t_{(3)} - t_{(1)}\|, \|t_{(3)} - t_{(2)}\|\}$, where $t_{(3)}$ is the grid point closest to the set $\{t_{(1)}, t_{(2)}\}$.

Step 4a If $r_2 > 2\epsilon_m$, put $T = T + 1$ and repeat step 2 with any grid point in $\hat{S}_c$ except $t_{(1)}$ and $t_{(2)}$.

Step 4b If $r_2 \le 2\epsilon_m$, compute, by recurrence,
$$r_K = \min\{\|t_{(K+1)} - t_{(i)}\| : i = 1, \ldots, K\},$$
where $t_{(K+1)}$ is the grid point closest to the set $\{t_{(1)}, \ldots, t_{(K)}\}$. Repeat until a distance $r_K > 2\epsilon_m$ is found. Then put $T = T + 1$ and return to step 2.

Step 5 Repeat steps 2-4 until every grid point has been considered; the total number of clusters, i.e., of connected components of $\hat{S}_c$, is then $T$.
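A compact R sketch of the same procedure is given below. It is only an illustration under our own assumptions: KernSmooth's bkde2D (a binned estimator computed with the FFT) stands in for the FFT step, the level c0 and bandwidth are placeholders, and the sequential agglomeration of steps 2-4 is replaced by the equivalent single-linkage cut at $2\epsilon_m$ used earlier.

    library(KernSmooth)   # bkde2D: binned bivariate kernel density estimate computed via the FFT

    fft.level.set.clusters <- function(X, bandwidth, gridsize, c0) {
      est <- bkde2D(X, bandwidth = bandwidth, gridsize = gridsize)
      # Step 1: grid points t_(i) with fhat(t_(i)) > c.
      grid <- expand.grid(x = est$x1, y = est$x2)   # x varies fastest,
      inSc <- as.vector(est$fhat) > c0              # matching the column-major fhat matrix
      t_in <- as.matrix(grid[inSc, ])
      stopifnot(nrow(t_in) >= 2)
      # eps_m: we take the larger of the two grid spacings (a placeholder choice).
      eps <- max(diff(est$x1)[1], diff(est$x2)[1])
      # Steps 2-5: connected components of the union of balls B(t_(i), eps_m),
      # obtained here with a single-linkage cut at height 2 * eps_m.
      labels <- cutree(hclust(dist(t_in), method = "single"), h = 2 * eps)
      list(grid.points = t_in, cluster = labels, eps = eps, T = max(labels))
    }

    ## Example use (artificial data and placeholder parameters):
    # X   <- cbind(runif(5000, 160, 220), runif(5000, -37, -23))   # fake RA/DEC values
    # out <- fft.level.set.clusters(X, bandwidth = c(1, 1), gridsize = c(100, 50), c0 = 0.001)
    # out$T    # estimated number of clusters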
In contrast to the original CFF algorithm, our algorithm is computationally efficient for massive datasets (a rough arithmetic comparison follows the list):

1. In step 1, our algorithm only requires $O(m \log m)$ operations while the original algorithm needs at least $O(n \log n)$ operations; note that $m < n$.

2. In step 2, the usual Euclidean minimum spanning tree methods require $O(k_n \log k_n)$ operations while our agglomeration step only needs $O(k_m \log k_m)$, where $k_m$ is usually smaller than $k_n$ since $m < n$.
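To give a rough sense of the scale (our own back-of-the-envelope arithmetic using the dataset sizes reported in section 4), with $n = 202{,}882$ galaxies we have $n \log_2 n \approx 3.6 \times 10^6$, whereas with $m = 29{,}252$ grid points we have $m \log_2 m \approx 4.3 \times 10^5$, roughly an eight-fold reduction in the step-1 cost alone, before accounting for the corresponding reduction from $k_n$ to $k_m$ in step 2.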
Another advantage of our method is that one can use existing R/S-Plus libraries such as KernSmooth to compute the density estimates at the grid points. Although it is possible to compute density estimates at the data points with $O(n \log n)$ operations, no current R/S-Plus library provides such an efficient computation.

The idea of using a fixed grid has also been used by Chaudhuri et al. (1999) in the context of set estimation. They define a set estimator, which they call the s-shape, with applications to digital images, and show the consistency of their estimator when the data are generated from a continuous distribution.
For small or moderate sample sizes (where $m > n$), one may not gain computational efficiency from our algorithm, but the same effect can still be achieved with the alternative bootstrap approach proposed by Cuevas et al. (2000).
4 Case Study
Considering the relatively short history of statistics, the interaction between statistics and astronomy is older than one may imagine. Important statistical concepts and methods such as least squares were in fact developed by astronomers (Babu and Djorgovski, 2004). However, from the mid-19th century the relationship weakened as astronomers focused more on astrophysics while statisticians turned to applications in agriculture and the biological sciences.
Over the last two decades, however, the advent of new massive sky surveys has started to bring the two fields back together. A series of Statistical Challenges in Modern Astronomy conferences has been hosted at Penn State University to address a vast range of statistical issues in astronomy with modern statistical methodology.

One of the main statistical issues in astronomy is clustering astronomical sky survey data. In this section, we introduce the scientific background of this subject and apply our method to large astronomical sky survey data.
4.1 Scientific Background
Traditionally, astronomical sky surveys were carried out by small groups of cosmologists spending sleepless nights at a handful of telescopes. Today, however, technology such as digital imaging cameras is opening a new era for cosmologists. With the arrival of huge digitized datasets, some fundamental questions in cosmology have been revisited, for example: (1) How did the universe begin? (2) How old is the universe? (3) Is the universe still expanding? (4) What is the eventual fate of the universe?
Whereas the Big Bang model has been very successful in answering the first question, the remaining questions involve the mass distribution of the universe. A key assumption in modern cosmology is that the evolution of the mass distribution of the universe is sensitive to the cosmological parameters.
Indeed, the mass distribution of the universe can be described by a surprisingly simple model due to Press and Schechter (1974). Their model provides an analytic form for a sort of cumulative distribution function of the mass of clusters of galaxies at different scales in terms of the cosmological parameters. Since most matter cannot be observed except through clusters of galaxies, clusters are a useful tool to follow the mass distribution of the universe. Furthermore, due to the finite velocity of light, the farther away an object is, the further in the past we observe it. Therefore, by measuring the mass of clusters of galaxies at different scales (times), one can learn the history of the universe.

To estimate the cosmological parameters, one can consider a goodness-of-fit type of test statistic to calculate confidence intervals for the cosmological parameters by matching the Press-Schechter model to the mass of clusters of galaxies (Jang, 2003). This is beyond the scope of this paper and remains future work. In short, galaxy clustering is a crucial step in estimating these cosmological parameters, and those estimates would lead to answers to the remaining questions.
A summary of typical astronomical sky survey data analysis steps is given
in Figure 1.
4.2 Statistical Models
Let $X_1, X_2, \ldots, X_n$ be the positions of galaxies, where $X_i = (X_{1i}, X_{2i}, Z_i)$. The first two components, right ascension (RA) and declination (DEC), are the longitude and latitude with respect to the Earth. The third component, redshift, is related to distance and is defined as
$$z \equiv \frac{\lambda_o - \lambda_e}{\lambda_e},$$
where $\lambda_o$ is the observed wavelength of a spectral line and $\lambda_e$ is the wavelength of the same spectral line measured in a laboratory. Using an argument similar to that for Doppler shifts of sound waves, we can estimate the distance from the redshift.
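For instance (an illustrative number of our own, not taken from the survey), a spectral line with laboratory wavelength $\lambda_e = 656.3$ nm (the H$\alpha$ line) observed at $\lambda_o \approx 721.9$ nm gives
$$z = \frac{721.9 - 656.3}{656.3} \approx 0.10,$$
which is the redshift scale of the slice analyzed in section 4.4.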
We assume that $X_i$ is a realization of a Poisson process with intensity measure $\Lambda_t(C) = \int_C \lambda(x)\, dx$, the mean number of galaxies inside $C$ at time $t$. Here $\lambda(x)$ is the intensity function.
Galaxy Sky Survey Data → Galaxy Clustering → Fit Cosmological Model on the Mass of the Clusters → Estimating Large-scale Structure

Figure 1: Typical analysis procedure for astrophysics sky survey data.
Cosmologists assume that the mean number of galaxies inside a region $C$ is directly proportional to the total mass inside the region. Hence the intensity measure $\Lambda_t(C)$, the mean number of galaxies inside $C$ at time $t$, satisfies
$$\Lambda_t(C) \propto \int_C \rho_t(x)\, dx.$$
Here $\rho_t(x)$ is the mass density function at time $t$, i.e.,
$$\int_A \rho_t(x)\, dx \equiv \text{total mass in a region } A.$$
The mass density is often expressed in terms of the density parameter $\Omega$, defined by $\Omega = \frac{8\pi G}{3 H^2 c^2}\rho$, where $G$ is Newton's constant of gravitation, $H$ is the Hubble constant and $c$ is the speed of light. The density parameter is directly related to the spatial curvature: space is negatively curved ("open") for $\Omega < 1$, flat for $\Omega = 1$, and positively curved ("closed") for $\Omega > 1$.
Observations to date have provided evidence in favor of an approximately flat universe, $\Omega = 1$. We may decompose the density parameter into a sum of contributions from different sources of energy: the density parameter for matter, $\Omega_M$, and that for the cosmological constant, $\Omega_\Lambda$. It is believed that $\Omega_M$ is close to 0.3 and $\Omega_\Lambda$ is close to 0.7.
It is believed that in the early universe quantum fluctuations were frozen in by a sudden exponential inflation, so that regions of high normalized mass density became virialized objects, i.e., clusters. To be a virialized object, a region must satisfy the following geometric condition:
$$C = \{\, x : \rho(x \mid z) > \delta_c \,\}, \qquad (1)$$
where $\delta_c$ is a complicated nonlinear function of the redshift $z$ and $\Omega_M$ (Reichart et al., 1999).

Given that $\rho$ is a kind of probability density function, it is clear from condition (1) that galaxy clustering is equivalent to estimating a level set.
4.3 Data: Mock 2dF catalogue
With the development of modern instruments, the astronomical sky survey data collection procedure is much different from what it used to be. For example, the Sloan Digital Sky Survey (SDSS) uses a 2.5-meter telescope at Apache Point, New Mexico, with a wide-field CCD imaging camera, imaging the sky in five broad photometric bands covering the wavelength range accessible to CCDs from the ground.
Though several real astronomical sky surveys, including the SDSS, are available to the public, we use simulated data, the Mock 2dF catalogue, for our data analysis. Unlike for real data, the cosmological parameters of simulated data are known and can be used later to measure how accurate our estimates are.

The Mock 2dF catalogues were built to develop faster algorithms for dealing with the very large numbers of galaxies involved and to support the development of new statistics (Cole et al., 1998). All Mock 2dF catalogues mimic the 2dF catalogue, which was constructed using the 2dF instrument built by the Anglo-Australian Observatory. The 2dF survey measured redshifts for 250,000 galaxies selected from the APM survey, a projected catalogue built using the Automatic Plate Measuring (APM) machine.
Figure 2 shows a two-dimensional projection of the Mock 2dF catalogue; each point represents a galaxy. While the majority of galaxies belong to one of the clusters, the rest can be considered noise. For the data analysis we use a subset of the Mock 2dF catalogue with matter density parameter $\Omega_M = 0.3$ and cosmological constant $\Omega_\Lambda = 0.7$. It contains 202,882 galaxies, and each galaxy has four attributes: RA, DEC, redshift and apparent magnitude. Apparent magnitude is the brightness of the object and can be used to calculate the mass of the object, since mass follows light.

Figure 2: Mock 2dF catalogue.
4.4 Results
Our main goal is to find the mass distribution of clusters as a function of time, or redshift $z$. We use the following steps (a small sketch of the per-cluster aggregation follows the list).

1. Given $z$, estimate $\rho$ with a nonparametric density estimator.

2. Assign each galaxy to a cluster with our clustering algorithm.

3. Add up the absolute magnitudes of the galaxies in each cluster and use the sums as estimates of the masses of the clusters.

4. Repeat steps 1-3 over different $z$ and compute the mass distribution of clusters as a function of $z$.
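The following fragment sketches steps 2-3 in R under our own naming assumptions: `galaxies` is a hypothetical data frame for one redshift slice with columns RA, DEC and mag (absolute magnitude), and fft.level.set.clusters is the illustrative function sketched in section 3.2. The assignment rule below is one plausible reading of step 2, not necessarily the one used in the paper.

    ## Assumed inputs (hypothetical names): `galaxies` (columns RA, DEC, mag) and `fit`,
    ## the output of the illustrative fft.level.set.clusters() from section 3.2.
    # Step 2 (a plausible assignment rule): give each galaxy the label of the nearest
    # level-set grid point, or NA ("noise") if no grid point lies within 2 * eps_m.
    nearest <- apply(as.matrix(galaxies[, c("RA", "DEC")]), 1, function(g) {
      d <- sqrt((fit$grid.points[, 1] - g[1])^2 + (fit$grid.points[, 2] - g[2])^2)
      j <- which.min(d)
      if (d[j] <= 2 * fit$eps) fit$cluster[j] else NA
    })
    # Step 3: sum the absolute magnitudes within each cluster as a mass proxy.
    mass.proxy <- tapply(galaxies$mag, nearest, sum)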
For the first step, the data were divided into 10 slices of equal redshift width, and a bivariate kernel density estimator was then fitted to estimate the joint distribution of RA and DEC given redshift. Figure 3 (a) shows the slice of the 2dF data with $0.10 < z < 0.125$, which contains 33,157 galaxies, and Figure 3 (b) presents a contour plot of the density estimates.

To keep the original scale of the data, a spherically symmetric kernel was used, which means the bandwidth matrix is a constant times the identity matrix. To choose the smoothing bandwidth, we used the cross-validation selector based on the results in Jang (2006).
To implement our algorithm, the first step is to compute the density estimates at each grid point with the FFT. We used the R library KernSmooth, developed by Matt Wand, to compute the density estimates. Figure 3 (c) shows the grid points which belong to the estimated level set $\{\hat{f} > \delta_c\}$.

For the second step, we wrote a C program interfacing with R. We used a 412 by 71 grid for this particular example; this number of grid points (29,252) and grid size are used for all datasets. The main reason for this choice is that cosmologists want to keep the original physical ratio of RA and DEC, and we want to use more than 50 grid points in each dimension.
The clustering results are given in Figure 3 (d), where each color represents a different cluster; 1,945 clusters were found among the 33,157 galaxies. In astronomical sky surveys, cosmologists are mainly interested in the number and sizes of clusters, from which they can compute the mass distribution of the clusters over time. For the given range of redshift, cosmologists agree with our findings, and fitting the cosmological model to the mass distribution of the clusters is an ongoing research project.
5 Discussion
The explosion of data in scientific problems provides new opportunities for nonparametric methods. Our algorithm improves on the original CFF algorithm in terms of computational cost by using the FFT. We also address the issue of choosing the extra smoothing parameter $\epsilon_n$ and provide a practical rule for the choice of the grid size.
Constructing confidence sets for clusters can be used to address the uncertainty of the clustering results. While there is a substantial literature on making confidence statements about a curve $f$ in the context of nonparametric regression and nonparametric density estimation, most of it produces pointwise confidence bands for $f$. Therefore, it is not easy to construct confidence statements about features of $f$, such as density contour clusters, from such bands. Jang et al. (2004) provide a method to construct uniform confidence sets for densities and density contour clusters; however, implementing this method in practice is still challenging.

It may be interesting to combine our method with other methods developed in the machine learning community, such as Gray and Moore (2003).
References
Babu, G. J. and Djorgovski, S. G. (2004). Some statistical and computational challenges, and opportunities in astronomy. Statistical Science 18 322–332.
Báillo, A., Cuesta-Albertos, J. A. and Cuevas, A. (2001). Convergence rates in nonparametric estimation of level sets. Statistics and
Probability Letters 53 27–35.
Chaudhuri, A. R., Basu, A., Bhandari, S. and Chaudhuri, B. (1999).
An efficient approach to consistent set estimation. Sankhyā, Series B 61
496–513.
Cole, S., Hatton, S., Weinberg, D. H. and Frenk, C. S. (1998).
Mock 2df and sdss galaxy redshift surveys. Monthly Notices of the Royal
Astronomical Society 300 945–966.
Cuevas, A., Febrero, M. and Fraiman, R. (2000). Estimating the
number of clusters. The Canadian Journal of Statistics 28 367–382.
Cuevas, A., Febrero, M. and Fraiman, R. (2001). Cluster analysis: a further approach based on density estimation. Computational Statistics & Data Analysis 36 441–459.
Cuevas, A. and Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics 25 2300–2312.
Cuevas, A. and Rodriguez-Casal, A. (2004). On boundary estimation.
Advances in Applied Probability 36 340–354.
Devroye, L. and Wise, G. (1980). Detection of abnormal behavior via
nonparametric estimation of the support. SIAM Journal on Applied Mathematics 38 480–488.
Fraley, C. and Raftery, A. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical
Association 97 611–631.
Gray, A. and Moore, A. (2003). Rapid evaluation of multiple density
models. In Artificial Intelligence and Statistics.
Hall, P. and Wand, M. (1996). On the accuracy of binned kernel density
estimators. Journal of Multivariate Analysis 56 165–184.
Hartigan, J. (1975). Clustering Algorithms. Wiley, New York.
Jang, W. (2003). Nonparametric Density Estimation and Galaxy Clustering. In Statistical Challenges in Astronomy 443-445. Springer, New York.
Jang, W. (2006). Nonparametric density estimation and clustering in astronomical sky surveys. Computational Statistics & Data Analysis 50
760–774.
Jang, W., Genovese, C. and Wasserman, L. (2004). Nonparametric
confidence sets for densities. Tech. Rep. 795, Department of Statistics,
Carnegie Mellon University.
Korostelev, A. and Tsybakov, A. (1993). Minimax Theory of Image
Reconstruction. Springer. New York.
Martínez, V. and Saar, E. (2002). Statistics of the Galaxy Distribution. Chapman and Hall, London.
McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley, New York.
Moore, A. (1999). Very fast EM-based mixture model clustering using multiresolution KD-trees. In Advances in Neural Information Processing Systems, 543–549.
Narasimhan, G., Zhu, J. and Zachariasen, M. (2000). Experiments with computing geometric minimum spanning trees. In Proceedings of ALENEX'00, 183–196. Lecture Notes in Computer Science, Springer-Verlag, New York.
Press, W. H. and Schechter, P. (1974). Formation of galaxies and
clusters of galaxies by self-similar gravitational condensation. Astrophysical
Journal 187 425–438.
Reichart, D., Nichol, R., Castander, F., Burke, D., Romer, A. K., Holden, B., Collins, C. and Ulmer, M. (1999). A deficit of high-redshift, high-luminosity X-ray clusters: evidence for a high value of $\Omega_M$? Astrophysical Journal 518 521–532.
Russell, S. J. and Norvig, P. (2002). Artificial Intelligence: A Modern Approach, 2nd edn. Prentice Hall, Upper Saddle River.
Stuetzle, W. (2003). Estimating the cluster tree of a density by analyzing
the minimal spanning tree of a sample. Journal of Classification 20 25–47.
Wand, M. (1994). Fast computation of multivariate kernel estimators. Journal of Computational and Graphical Statistics 3 433–445.
Wand, M. and Jones, M. (1995). Kernel Smoothing. Chapman and Hall,
London.
Wong, W.-K. and Moore, A. (2002). Efficient algorithms for nonparametric clustering with clutter. In Computing Science and Statistics
34 541–553.
Figure 3: Subset of the Mock 2dF catalogue. (a) Mock 2dF catalogue with 0.1 < z < 0.125, plotted as RA versus DEC; (b) contour plot from the kernel density estimation.