Download SimulationDatabases

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cosmic distance ladder wikipedia , lookup

Weak gravitational lensing wikipedia , lookup

Redshift wikipedia , lookup

Gravitational lens wikipedia , lookup

Astronomical spectroscopy wikipedia , lookup

Transcript
Analysing Cosmological
Simulations in the Virtual
Observatory:
Designing and Mining the Millennium
Simulation Database
Gerard Lemson
German Astrophysical Virtual Observatory
(GAVO)
ARI, Heidelberg
MPE, Garching bei München
1
Acknowledgments


Alex Szalay
Virgo consortium, in particular:
 Volker Springel, Simon White, Gabriella DeLucia, Jeremy Blaizot
(MPA, Munich, Germany),
 Carlos Frenk, Richard Bower, John Helly (ICC, Durham, UK)

Similar efforts/sites to Millennium Database
 Durham (mirror of Millennium DB)
 Horizon/GalICS (Lyon)
 ITVO (Trieste)

2
GAVO is funded by the German Federal Ministry for Education and Research
(BMBF)
Garching, June 26, 2008
Summary
 VO aims to provide access to remote data for use/analysis by 3rd
parties.
 Data analysis requires
 advanced methods for analysis
 data
 Data sets are often very large, often far away (makes them even
larger!)
 To analyse remote datasets, one needs to be able to bring the
analysis to the data.
 “Standard” approach using flat files and C/IDL/etc code sub-optimal
 To analyse very large datasets we also need advanced methods of
data organisation and data access
 Structured approach supported by relational database system allows
one to concentrate on science, iso worry about I/O optimisation etc
 And the questions can become pretty complex !
3
Garching, June 26, 2008
Case study: The Millennium Simulation
Springel V. et al. 2005 Nature 435, 629
4
Garching, June 26, 2008
Millennium Simulation
 Virgo consortium







Gadget 3
10 billion particles, dark matter only
500 Mpc periodic box
Concordance model (as of 2004) initial conditions
64 snapshots
350000 CPU hours
O(30Tb) raw + post-processed data
 Post-processing data products complex and large
 Challenge to analyse, even locally!
 SimDAP-like approach required for remote access.
5
Garching, June 26, 2008
Intermezzo: Data Access is Hitting a Wall
(courtesy Alex Szalay)
FTP and GREP are not adequate






You can GREP/FTP 1 MB in a second
You can GREP/FTP 1 GB in a minute
You can GREP/FTP 1 TB in 2 days
You can GREP/FTP 1 PB in 3 years
SFTP much slower
and 1PB ~2,000 disks
 At some point you need
indices to limit search
parallel data search and analysis
 This is where databases can help
6
Garching, June 26, 2008
Analysis and Databases
(courtesy Alex Szalay)
 Much statistical analysis deals with








Creating uniform samples -- data filtering
Assembling relevant subsets
Estimating completeness
Censoring bad data
Counting and building histograms
Generating Monte-Carlo subsets
Likelihood calculations
Hypothesis testing
 Traditionally these are performed on files
 Most of these tasks are much better done inside a
database
7
Garching, June 26, 2008
Advantages of relational databases
 Encapsulation of data in terms of logical structure, no
need to know about internals of data storage
 Standard query language for finding information
 Advanced query optimizers (indexes, clustering)
 Transparent internal parallelization
 Authenticated remote access for multiple users at same
time




8
Forces one to think carefully about data structure
Speeds up path from science question to answer
Facilitates communication (query code is cleaner)
Facilitates adaptation to IVOA standards (ADQL)
Garching, June 26, 2008
Millennium Simulation Phenomenology
 Density field on 2563 mesh
 CIC
 Gaussian smoothed: 1.25,2.5,5,10 Mpc/h
 Friends-of-Friends (FOF) groups
 SUBFIND Subhalos
 Galaxies from 2 semi-analytical models (SAMs)
 MPA (L-Galaxies, DeLucia & Blaizot, 2006; Bertone et al 2007)
 Durham (GalForm, Bower et al, 2006)
 Subhalo and galaxy formation histories: merger trees
 Mock catalogues on light-cone
 Pencil beams (Kitzbichler & White, 2006)
 All-sky (depth of SDSS spectral sample)
(Blaizot et al, 2005)
 In preparation: Spectra for light cone galaxies
9
Garching, June 26, 2008
10
Garching, June 26, 2008
Millennium Simulation Phenomenology
 Density field on 2563 mesh
 CIC
 Gaussian smoothed: 1.25,2.5,5,10 Mpc/h
 Friends-of-Friends (FOF) groups
 SUBFIND Subhalos
 Galaxies from 2 semi-analytical models (SAMs)
 MPA (L-Galaxies, DeLucia & Blaizot, 2006; Bertone et al 2007)
 Durham (GalForm, Bower et al, 2006)
 Subhalo and galaxy formation histories: merger trees
 Mock catalogues on light-cone
 Pencil beams (Kitzbichler & White, 2006)
 All-sky (depth of SDSS spectral sample)
(Blaizot et al, 2005)
 In preparation: Spectra for light cone galaxies
11
Garching, June 26, 2008
FOF groups, (sub)halos and galaxies
12
Garching, June 26, 2008
Millennium Simulation Phenomenology
 Density field on 2563 mesh
 CIC
 Gaussian smoothed: 1.25,2.5,5,10 Mpc/h
 Friends-of-Friends (FOF) groups
 SUBFIND Subhalos
 Galaxies from 2 semi-analytical models (SAMs)
 MPA (L-Galaxies, DeLucia & Blaizot, 2006; Bertone et al 2007)
 Durham (GalForm, Bower et al, 2006)
 Subhalo and galaxy formation histories: merger trees
 Mock catalogues on light-cone
 Pencil beams (Kitzbichler & White, 2006)
 All-sky (depth of SDSS spectral sample)
(Blaizot et al, 2005)
 In preparation: Spectra for light cone galaxies
13
Garching, June 26, 2008
Time evolution: merger
trees
14
Garching, June 26, 2008
Millennium Simulation Phenomenology
 Density field on 2563 mesh
 CIC
 Gaussian smoothed: 1.25,2.5,5,10 Mpc/h
 Friends-of-Friends (FOF) groups
 SUBFIND Subhalos
 Galaxies from 2 semi-analytical models (SAMs)
 MPA (L-Galaxies, DeLucia & Blaizot, 2006; Bertone et al 2007)
 Durham (GalForm, Bower et al, 2006)
 Subhalo and galaxy formation histories: merger trees
 Mock catalogues on light-cone
 Pencil beams (Kitzbichler & White, 2006)
 All-sky (depth of SDSS spectral sample)
(Blaizot et al, 2005)
 In preparation: Spectra for light cone galaxies
15
Garching, June 26, 2008
Mock catalogues
16
Garching, June 26, 2008
Millennium Simulation Phenomenology
 Density field on 2563 mesh
 CIC
 Gaussian smoothed: 1.25,2.5,5,10 Mpc/h
 Friends-of-Friends (FOF) groups
 SUBFIND Subhalos
 Galaxies from 2 semi-analytical models (SAMs)
 MPA (L-Galaxies, DeLucia & Blaizot, 2006; Bertone et al 2007)
 Durham (GalForm, Bower et al, 2006)
 Subhalo and galaxy formation histories: merger trees
 Mock catalogues on light-cone
 Pencil beams (Kitzbichler & White, 2006)
 All-sky (depth of SDSS spectral sample)
(Blaizot et al, 2005)
 In preparation: Spectra for light cone galaxies
17
Garching, June 26, 2008
Synthetic spectra (not yet available)
18
Garching, June 26, 2008
Hierarchy of Data Products
FOF Group
Parent FOF group
Subhalo
Located in
Light Cone
Galaxy
Density Field
Mesh Cell
SUBFIND result
Located in
original
Spectrum
SAM Galaxy
Merger Tree
Parent halo
Subhalo
MergerTree
Tree relationships
19
Garching, June 26, 2008
Designing the Database


Need a model for data, including relations between
different objects
Model needs to support science: “20 questions”
(following Gray & Szalay)
1.
2.
3.
4.
5.
6.
7.
8.
20
Return the galaxies residing in halos of mass between 10^13 and 10^14
solar masses.
Return the galaxy content at z=3 of the progenitors of a halo identified at
z=0
Return the complete halo merger tree for a halo identified at z=0
Find all the z=3 progenitors of z=0 red ellipticals (i.e. B-V>0.8 B/T > 0.5)
Find the descendents at z=1 of all LBG's (i.e. galaxies with SFR>10
Msun/yr) at z=3
Find all the z=2 galaxies which were within 1Mpc of a LBG (i.e.
SFR>10Msun/yr) at some previous redshift.
Find the multiplicity function of halos depending on their environment
(overdensity of density field smoothed on certain scale)
Find the dependency of halo properties on environment
Garching, June 26, 2008
Data model features
 Each object its table
 properties are columns
 each a unique identifier
 Relations implemented through foreign keys,




pointers to unique identifier column
FOF to mesh cell it lies in
Sub-halo to its FOF group
galaxy to its sub-halo etc
 Special design needed for
 Hierarchical relations: merger trees
 Spatial relations: multi-dimensional indexes required
 Support for random sample selection
21
Garching, June 26, 2008
Formation histories:
Subhalo and Galaxy merger trees
 Tree structure
 halos have single descendant
 halos have main progenitor
 Hierarchical structures usually handled using
recursive code
 inefficient for data access
 not (well) supported in RDBs
 Tree indexes
 depth first ordering of nodes defines identifier
 pointer to last progenitor in subtree
22
Garching, June 26, 2008
23
Garching, June 26, 2008
Merger trees :
select prog.*
from galaxies des
,
galaxies prog
where des.galaxyId = 0
and prog.galaxyId
between des.galaxyId
and des.lastProgenitorId
Branching points :
select descendantId
from galaxies des
where descendantId != -1
group by descendantId
having count(*) > 1
24
Garching, June 26, 2008
Spatial queries, random samples
 Spatial queries require multi-dimensional
indexes.
 (x,y,z) does not work: need discretisation
 index on (ix,iy,iz) with ix=floor(x/10) etc
 More sophisticated: space filling curves
 bit-interleaving/oct-tree/Z-Index
 Peano-Hilbert curve
 Need custom functions for range queries
(Implemented in T-SQL)
 Random sampling using a RANDOM column
 RANDOM from [0,1000000]
25
Garching, June 26, 2008
The Millennium Database web site
 SQLServer 2005 database
 Web application (Java in Apache Tomcat web server)




portal: http://www.mpa-garching.mpg.de/millennium/
public DB access: http://www.g-vo.org/Millennium
private access: http://www.g-vo.org/MyMillennium
MyDB
 Access methods
 browser with plotting capabilities through VOPlot applet
 wget + IDL, R
 TOPCAT (3.1)
26
Garching, June 26, 2008
27
Garching, June 26, 2008
Usage statistics





28
Up since August 2006 (astro-ph/0608019)
~225 registered users
> 5 million queries
> 40 billion rows
~130 papers, ~50% not related to Virgo consortium
(see http://www.mpa-garching.mpg.de/millennium )
Garching, June 26, 2008
Some science questions and their
implementation as SQL
If time permits, in any case 1-1 demo possible.
29
Find light cone galaxies in a slice in
redshift, RA and Dec
select
from
where
and
30
ra,dec,redshift_obs
kitzbichler2006a_obs
redshift_obs between 1 and 1.1
dec between -.05 and .0
Garching, June 26, 2008
Color-magnitude for random sample of
galaxies
select
,
,
from
where
and
and
31
mag_bdust
mag_bdust - mag_vdust as color
type
delucia2006a
snapnum=63
random between 0 and 100
mag_b < 0
Garching, June 26, 2008
32
Garching, June 26, 2008
Get merger tree for identified galaxy
select p.snapnum
,
p.x,p.y,p.z
,
p.stellarmass
,
p.mag_b-p.mag_v as color
from delucia2006a d
,
delucia2006a p
where d.galaxyid=0
and p.galaxyid between d.galaxyid
and d.lastprogenitorid
33
Garching, June 26, 2008
34
Garching, June 26, 2008
35
Garching, June 26, 2008
Histogram of density field at redshifts
0,1,2,3; Gaussian smoothing 5 Mpc/h
select
,
,
from
where
group
order
36
snapnum
.01*floor(f.g5/.01) as g5
count(*) as num
mfield f
f.snapnum in (63,41,32,27)
by snapnum,.01*floor(f.g5/.01)
by 1,2
Garching, June 26, 2008
37
Garching, June 26, 2008
FOF multiplicity function
at redshifts 0,1,2,3,
select snapnum
,
.1*floor(log10(np)/.1) as lognp
,
count(*) as num
from fof
where snapnum in (63,41,32,27)
group by snapnum
,
.1*floor(log10(np)/.1)
order by 1,2
38
Garching, June 26, 2008
39
Garching, June 26, 2008
FOF mass multiplicity function,
conditioned on density in environment
select .1*floor(log10(fof.np)/.1)
as lognp
,
count(*) as num
from mfield f
,
fof
where fof.snapnum=f.snapnum
and fof.phkey = f.phkey
and f.snapnum=63
and f.g5 between 1 and 1.1
group by .1*floor(log10(fof.np)/.1)
order by 1
40
Garching, June 26, 2008
41
Garching, June 26, 2008
Thank you !
42