PS1 PSPS
Object Data Manager Design
PSPS Critical Design Review
November 5-6, 2007
IfA
Outline
 ODM Overview
 Critical Requirements Driving Design
 Work Completed
 Detailed Design
 Spatial Querying [AS]
 ODM Prototype [MN]
 Hardware/Scalability [JV]
 How Design Meets Requirements
 WBS and Schedule
 Issues/Risks
[AS] = Alex, [MN] = Maria, [JV] = Jan
slide 2
ODM Overview
The Object Data Manager will:
 Provide a scalable data archive for the Pan-STARRS data products
 Provide query access to the data for Pan-STARRS users
 Provide detailed usage tracking and logging
slide 3
ODM Driving Requirements
 Total size 100 TB
• 1.5 x 10^11 P2 detections
• 8.3 x 10^10 P2 cumulative-sky (stack) detections
• 5.5 x 10^9 celestial objects
 Nominal daily rate (divide by 3.5 x 365)
• P2 detections: 120 Million/day
• Stack detections: 65 Million/day
• Objects: 4.3 Million/day
 Cross-Match requirement: 120 Million / 12 hrs ~ 2800 / s
 DB size requirement:
• 25 TB / yr
• ~100 TB by end of PS1 (3.5 yrs)
slide 4
Work completed so far
 Built a prototype
 Scoped and built prototype hardware
 Generated simulated data
• 300M SDSS DR5 objects, 1.5B Galactic plane objects
 Initial Load done – Created 15 TB DB of simulated data
• Largest astronomical DB in existence today
 Partitioned the data correctly using Zones algorithm
 Able to run simple queries on distributed DB
 Demonstrated critical steps of incremental loading
 It is fast enough
• Cross-match > 60k detections/sec
• Required rate is ~3k/sec
slide 5
Detailed Design
 Reuse SDSS software as much as possible
 Data Transformation Layer (DX) – Interface to IPP
 Data Loading Pipeline (DLP)
 Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
 Query Manager (QM: CasJobs for prototype)
slide 6
High-Level Organization
[Diagram: data flows from the Data Transformation Layer (DX) into the Data Loading Pipeline (DLP): a LoadAdmin server plus LoadSupport1..n servers holding objZoneIndx, Detections_l1..ln, LnkToObj_l1..ln and orphans tables, connected via linked servers. From there it flows into the Data Storage (DS) layer: the PS1 main database (PartitionsMap, Objects, LnkToObj, Detections view, Meta) and slice databases P1..Pm holding partitioned [Objects_px], [LnkToObj_px] and [Detections_px] tables. The Query Manager (QM) and Web Based Interface (WBI) sit on top of the PS1 database. Legend: full table, [partitioned table], output table, partitioned view.]
slide 7
Detailed Design
 Reuse SDSS software as much as possible
 Data Transformation Layer (DX) – Interface to IPP
 Data Loading Pipeline (DLP)
 Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
 Query Manager (QM: CasJobs for prototype)
slide 8
Data Transformation Layer (DX)
 Based on SDSS sqlFits2CSV package
• LINUX/C++ application
• FITS reader driven off header files
 Convert IPP FITS files to
• ASCII CSV format for ingest (initially)
• SQL Server native binary later (3x faster)
 Follow the batch and ingest verification procedure described in the ICD
• 4-step batch verification
• Notification and handling of broken publication cycle
 Deposit CSV or binary input files in directory structure
• Create “ready” file in each batch directory
 Stage input data on LINUX side as it comes in from IPP
slide 9
DX Subtasks
[Diagram: DX subtask breakdown into Initialization, Job/Batch Ingest, Batch Verification and Batch Conversion; sub-steps include FITS schema, FITS reader, CSV converter, CSV writer, interface with IPP, naming convention, uncompress batch, read batch, verify batch, verify manifest, verify FITS integrity, verify FITS content, verify FITS data, handle broken cycle, CSV converter, binary converter, “batch_ready” file, interface with DLP.]
slide 10
DX-DLP Interface
 Directory structure on staging FS (LINUX):
• Separate directory for each JobID_BatchID
• Contains a “batch_ready” manifest file
– Name, #rows and destination table of each file
• Contains one file per destination table in ODM
– Objects, Detections, other tables
 Creation of the “batch_ready” file is the signal to the loader to ingest the batch
 Batch size and frequency of ingest cycle TBD
slide 11
Detailed Design
 Reuse SDSS software as much as possible
 Data Transformation Layer (DX) – Interface to IPP
 Data Loading Pipeline (DLP)
 Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
 Query Manager (QM: CasJobs for prototype)
slide 12
Data Loading Pipeline (DLP)
 sqlLoader – SDSS data loading pipeline
• Pseudo-automated workflow system
• Loads, validates and publishes data
– From CSV to SQL tables
• Maintains a log of every step of loading
• Managed from Load Monitor Web interface
 Has been used to load every SDSS data release
• EDR, DR1-6, ~15 TB of data altogether
• Most of it (since DR2) loaded incrementally
• Kept many data errors from getting into the database
– Duplicate ObjIDs (symptom of other problems)
– Data corruption (CSV format invaluable in catching this)
slide 13
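To make the CSV-to-table step concrete, here is a minimal sketch of the kind of bulk load the pipeline performs; the path, database and table names are illustrative, not the loader's actual ones:

-- Sketch: bulk-load one CSV batch file into a task-DB table
-- (FIELDTERMINATOR/ROWTERMINATOR match the CSV format produced by DX).
BULK INSERT taskDB.dbo.P2PsfFits
FROM '\\loadserver\staging\JobID_BatchID\P2PsfFits.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    TABLOCK          -- take a bulk-update lock for faster loading
)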
sqlLoader Design
 Existing functionality
• Shown for SDSS version
• Workflow, distributed loading, Load Monitor
 New functionality
• Schema changes
• Workflow changes
• Incremental loading
– Cross-match and partitioning
slide 14
sqlLoader Workflow
 Distributed design achieved with linked servers and SQL Server Agent
 LOAD stage can be done in parallel by loading into temporary task databases
 PUBLISH stage writes from task DBs to final DB
 FINISH stage creates indices and auxiliary (derived) tables
 Loading pipeline is a system of VB and SQL scripts, stored procedures and functions
[Diagram: workflow steps Export (EXP), Check CSV (CHK), Build Task DBs (BLD), Build SQL Schema, Validate (VAL), Backup (BCK) and Detach (DTC) in the LOAD stage; Publish (PUB) in the PUBLISH stage; Cleanup (CLN) and Finish (FIN) in the FINISH stage.]
slide 15
Load Monitor Tasks Page
slide 16
Load Monitor Active Tasks
slide 17
Load Monitor Statistics Page
slide 18
Load Monitor – New Task(s)
slide 19
Data Validation
 Tests for data integrity and consistency
 Scrubs data and finds problems in upstream pipelines
 Most of the validation can be performed within the individual task DB (in parallel)
[Diagram: validation tests]
• Test Uniqueness of Primary Keys: test the unique key in each table
• Test Foreign Keys: test for consistency of keys that link tables
• Test Cardinalities: test consistency of numbers of various quantities
• Test HTM IDs: test the Hierarchical Triangular Mesh IDs used for spatial indexing
• Test Link Table Consistency: ensure that links are consistent
slide 20
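Two of these tests can be sketched directly in SQL (table and key names follow the PS1 schema used elsewhere in this review; the loader's actual validation procedures may differ):

-- Uniqueness of primary keys: any rows returned are duplicates.
SELECT objID, COUNT(*) AS cnt
FROM   Objects
GROUP  BY objID
HAVING COUNT(*) > 1

-- Link-table consistency: P2ToObj rows that point to a missing object.
SELECT p.objID
FROM   P2ToObj p
LEFT JOIN Objects o ON o.objID = p.objID
WHERE  o.objID IS NULL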
Distributed Loading
[Diagram: Samba-mounted CSV/binary files feed a master LoadAdmin server (Load Monitor, master schema) and several slave LoadSupport servers, each holding task data and a view of its task DB; the publish/finish steps move data from the task DBs into the publish DB.]
slide 21
Schema Changes
 Schema in task and publish DBs is driven off a list of schema DDL files to execute (xschema.txt)
 Requires replacing DDL files in the schema/sql directory and updating xschema.txt with their names
 PS1 schema DDL files have already been built
 Index definitions have also been created
 Metadata tables will be automatically generated using metadata scripts already in the loader
slide 22
Workflow Changes
 Cross-Match and Partition steps will be added to the workflow
 Cross-match will match detections to objects
 Partition will horizontally partition data, move it to slice servers, and build DPVs on the main server
[Diagram: revised workflow with the new XMatch and Partition steps inserted between Validate and PUBLISH; LOAD stage: Check CSVs, Create Task DBs, Build SQL Schema, Validate, XMatch, Partition.]
slide 23
Matching Detections with Objects
 Algorithm described fully in prototype section
 Stored procedures to cross-match detections will be part of the LOAD stage in the loader pipeline
 Vertical partition of the Objects table kept on the load server for matching with detections
 Zones cross-match algorithm used to do 1” and 2” matches
 Detections with no matches saved in Orphans table
slide 24
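As an illustration of the last step, unmatched detections can be routed to the Orphans table with a simple anti-join; #allMatches stands in for the cross-match result set named in the prototype timings later in this review, so the real stored procedure will differ in detail:

-- Sketch: detections with no 1" or 2" object match become orphans.
INSERT INTO Orphans (detectID)
SELECT d.detectID
FROM   Detections_l1 AS d
WHERE  NOT EXISTS (SELECT 1
                   FROM #allMatches AS m
                   WHERE m.detectID = d.detectID)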
XMatch and Partition Data Flow
[Diagram: detections on the loadsupport server (Detections, ObjZoneIndx) pass through the XMatch step, producing Detections_l, LinkToObj_l and Orphans; objects are updated, detections and links are merged into partitions (Detections_m, LinkToObj_m), chunks are pulled to the slice servers (Pm), and Objects_m/LinkToObj_m partitions are pulled and switched into the PS1 main database (Objects, LnkToObj).]
slide 25
Detailed Design
 Reuse SDSS software as much as possible
 Data Transformation Layer (DX) – Interface to IPP
 Data Loading Pipeline (DLP)
 Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
 Query Manager (QM: CasJobs for prototype)
slide 26
Data Storage – Schema
slide 27
PS1 Table Sizes Spreadsheet
[Spreadsheet: per-table size estimates for 5.5 x 10^9 objects (5 x 10^9 stars, 5 x 10^8 galaxies), 1.51 x 10^11 P2 detections and ~8.3 x 10^10 stack detections. For each table (AltModels, CameraConfig, FileGroupMap, IndexMap, Objects, ObjZoneIndx, PartitionMap, PhotoCal, PhotozRecipes, SkyCells, Surveys, DropP2ToObj, DropStackToObj, P2AltFits, P2FrameMeta, P2ImageMeta, P2PsfFits, P2ToObj, P2ToStack, StackDeltaAltFits, StackHiSigDeltas, StackLowSigDelta, StackMeta, StackModelFits, StackPsfFits, StackToObj, StationaryTransient) the spreadsheet lists the number of columns, bytes/row, total rows, a flag for whether the table grows with data release, and the fraction of the table kept in the Primary filegroup.
Estimated totals, including a 20% allowance for indices:
Prototype (0.3 x DR1): 7.9 TB
DR1: 26.2 TB
DR2: 49.2 TB
DR3: 72.2 TB
DR4 / full PS1 (3.5 yrs): 83.7 TB]
Note: These estimates are for the whole PS1, assuming 3.5 years. 7 bytes added to each row for overhead as suggested by Alex.
slide 28
PS1 Table Sizes - All Servers
Table            Year 1   Year 2   Year 3   Year 3.5
Objects            4.63     4.63     4.61     4.59
StackPsfFits       5.08    10.16    15.20    17.76
StackToObj         1.84     3.68     5.56     6.46
StackModelFits     1.16     2.32     3.40     3.96
P2PsfFits          7.88    15.76    23.60    27.60
P2ToObj            2.65     5.31     8.00     9.35
Other Tables       3.41     6.94    10.52    12.67
Indexes +20%       5.33     9.76    14.18    16.48
Total             31.98    58.56    85.07    98.87
Sizes are in TB
slide 29
Data Storage – Test Queries
 Drawn from several sources
• Initial set of SDSS 20 queries
• SDSS SkyServer Sample Queries
• Queries from PS scientists (Monet, Howell, Kaiser, Heasley)
 Two objectives
• Find potential holes/issues in schema
• Serve as test queries
– Test DBMS integrity
– Test DBMS performance
 Loaded into CasJobs (Query Manager) as sample queries for prototype
slide 30
Data Storage – DBMS
 Microsoft SQL Server 2005
• Relational DBMS with excellent query optimizer
 Plus
• Spherical/HTM (C# library + SQL glue)
– Spatial index (Hierarchical Triangular Mesh)
• Zones (SQL library)
– Alternate spatial decomposition with declination zones
• Many stored procedures and functions
– From coordinate conversions to neighbor search functions
• Self-extracting documentation (metadata) and diagnostics
slide 31
Documentation and Diagnostics
slide 32
Data Storage – Scalable Architecture
 Monolithic database design (a la SDSS) will not suffice
 SQL Server does not have a cluster implementation
• Do it by hand
 Partitions vs Slices
• Partitions are file-groups on the same server
– Parallelize disk accesses on the same machine
• Slices are data partitions on separate servers
• We use both!
 Additional slices can be added for scale-out
 For PS1, use SQL Server Distributed Partitioned Views (DPVs)
slide 33
Distributed Partitioned Views
 Difference between DPVs and file-group partitioning
• FGs live in the same database
• DPVs span separate DBs
• FGs are for scale-up
• DPVs are for scale-out
 Main server has a view of a partitioned table that includes remote partitions (we call them slices to distinguish them from FG partitions)
 Accomplished with SQL Server’s linked server technology
 NOT truly parallel, though
slide 34
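A minimal sketch of such a view, assuming three slices reachable through linked servers S1..S3, each holding a Detections_Sx member table CHECK-constrained to its ObjectID range (all names are illustrative):

-- The DPV on the head node is just a UNION ALL over the remote members;
-- the CHECK constraints on objID let the optimizer skip irrelevant slices.
CREATE VIEW Detections
AS
SELECT * FROM S1.SliceDB1.dbo.Detections_S1
UNION ALL
SELECT * FROM S2.SliceDB2.dbo.Detections_S2
UNION ALL
SELECT * FROM S3.SliceDB3.dbo.Detections_S3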
Scalable Data Architecture
 Shared-nothing architecture
 Detections split across cluster
 Objects replicated on Head and Slice DBs
 DPVs of Detections tables on the Head node DB
 Queries on Objects stay on the head node
 Queries on detections use only local data on slices
[Diagram: the Head node holds Objects, copies Objects_S1..S3 and the Detections DPV over Detections_S1..S3; slice servers S1, S2, S3 each hold their own Detections_Sx and Objects_Sx tables.]
slide 35
Hardware - Prototype
[Diagram: prototype cluster.
Storage: 10A = 10 x [13 x 750 GB], 3B = 3 x [12 x 500 GB].
Function codes: LX = Linux, L = Load server, S/Head = DB server, M = MyDB server, W = Web server.
Server naming convention: PS0x = 4-core, PS1x = 8-core.
Servers: PS01 = LX, PS13 = L1, PS05 = L2/M, PS11 = Head, PS12 = S1, PS03 = S2, PS04 = S3, PS02 = W.
Function / total space / RAID config / disk-rack config:
Staging: 10 TB, RAID5
Loading: 9 TB, RAID10, 14D/3.5W
DB: 39 TB, RAID10, 12D/4W
MyDB/Web: 0 TB, RAID10]
slide 36
Hardware – PS1
 Ping-pong configuration to maintain high availability and query performance
 2 copies of each slice and of the main (head) node database on fast hardware (hot spares)
 3rd spare copy on slow hardware (can be just disk)
 Updates/ingest go to the offline copy; copies are switched when ingest and replication finish
 Second copy is synchronized while the first copy is online
 Both copies are live when there is no ingest
 3x the basic configuration for PS1
[Diagram: queries always hit the Live copy (Copy 1 or Copy 2) while the other copy is Offline for ingest and is replicated to the Spare (Copy 3); roles rotate after each ingest cycle.]
slide 37
Detailed Design
 Reuse SDSS software as much as possible
 Data Transformation Layer (DX) – Interface to IPP
 Data Loading Pipeline (DLP)
 Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
 Query Manager (QM: CasJobs for prototype)
slide 38
Query Manager
 Based on SDSS CasJobs
 Configured to work with the distributed database and DPVs
 Direct links (contexts) to slices can be added later if necessary
 Segregates quick queries from long ones
 Saves query results server-side in MyDB
 Gives users a powerful query workbench
 Can be scaled out to meet any query load
 PS1 Sample Queries available to users
 PS1 Prototype QM demo
slide 39
ODM Prototype Components
 Data Loading Pipeline
 Data Storage
 CasJobs
• Query Manager (QM)
• Web Based Interface (WBI)
 Testing
slide 40
Spatial Queries (Alex)
slide 41
Spatial Searches in the ODM
slide 42
Common Spatial Questions
Points in region queries
1. Find all objects in this region
2. Find all “good” objects (not in masked areas)
3. Is this point in any of the regions?
Region in region
4. Find regions near this region and their area
5. Find all objects with error boxes intersecting region
6. What is the common part of these regions?
Various statistical operations
7. Find the object counts over a given region list
8. Cross-match these two catalogs in the region
slide 43
Sky Coordinates of Points
 Many different coordinate systems
• Equatorial, Galactic, Ecliptic, Supergalactic
 Longitude-latitude constraints
 Searches often in a mix of different coordinate systems
• gb>40 and dec between 10 and 20
• Problem: coordinate singularities, transformations
 How can one describe constraints in an easy, uniform fashion?
 How can one perform fast database queries in an easy fashion?
• Fast: indexes
• Easy: simple query expressions
slide 44
Describing Regions
Spacetime metadata for the VO (Arnold Rots)
 Includes definitions of
• Constraint: single small or great circle
• Convex: intersection of constraints
• Region: union of convexes
 Supports both angle and Cartesian descriptions
 Constructors for
• CIRCLE, RECTANGLE, POLYGON, CONVEX HULL
 Boolean algebra (INTERSECTION, UNION, DIFF)
 Proper language to describe the abstract regions
 Similar to GIS, but much better suited for astronomy
slide 45
Things Can Get Complex
[Diagram: overlapping regions A and B with tolerance ε.]
Green area: A ∩ (B − ε) should find B if it contains an A and is not masked.
Yellow area: A ∩ (B ± ε) is an edge case and may find B if it contains an A.
slide 46
We Do Spatial 3 Ways
 Hierarchical Triangular Mesh (extension to SQL)
• Uses table-valued functions
• Acts as a new “spatial access method”
 Zones: fits SQL well
• Surprisingly simple & good
 3D Constraints: a novel idea
• Algebra on regions, can be implemented in pure SQL
slide 47
PS1 Footprint
 Using the projection cell definitions as centers for
tessellation (T. Budavari)
slide 48
CrossMatch: Zone Approach
 Divide space into declination zones
 Objects ordered by zoneid, ra (on the sphere a wrap-around margin is needed)
 Point search: look in neighboring zones within an ~(ra ± Δ) bounding box
 All inside the relational engine
 Avoids “impedance mismatch”
 Can “batch” comparisons
 Automatically parallel
 Details in Maria’s thesis
[Diagram: a search radius r around a point maps to neighboring zones of height zoneMax and an RA range ra ± Δ.]
slide 49
Indexing Using Quadtrees
 Cover the sky with hierarchical pixels
 COBE – start with a cube
 Hierarchical Triangular Mesh (HTM) uses trixels
• Samet, Fekete
 Start with an octahedron, and split each triangle into 4 children, down to 20 levels deep
 Smallest triangles are 0.3”
 Each trixel has a unique htmID
[Diagram: trixel subdivision; triangle 2 splits into 2,0–2,3 (htmIDs 20–23), and triangle 2,3 splits into 2,3,0–2,3,3 (htmIDs 220–223).]
slide 50
Space-Filling Curve
[Diagram: trixels laid out along a space-filling curve; e.g. trixel 1,2 corresponds to the range [0.12, 0.13), and its children 1,2,0–1,2,3 to the sub-ranges [0.120, 0.121) through [0.123, 0.130).]
Triangles correspond to ranges; all points inside the triangle are inside the range.
slide 51
SQL HTM Extension
 Every object has a 20-deep htmID (44 bits)
 Clustered index on htmID
 Table-valued functions for spatial joins
• Given a region definition, routine returns up to 10 ranges of covering triangles
• Spatial query is mapped to ~10 range queries
 Current implementation rewritten in C#
 Excellent performance, little calling overhead
 Three layers
• General geometry library
• HTM kernel
• IO (parsing + SQL interface)
slide 52
Writing Spatial SQL
-- region description is contained by @area
DECLARE @cover TABLE (HtmIdStart bigint, HtmIdEnd bigint)
INSERT @cover SELECT * FROM dbo.fHtmCover(@area)
--
DECLARE @region TABLE (convexId bigint, x float, y float, z float, c float)
INSERT @region SELECT * FROM dbo.fGetHalfSpaces(@area)
--
SELECT o.ra, o.dec, 1 AS flag, o.objid
FROM (SELECT objID AS objid, cx, cy, cz, ra, [dec]
      FROM Objects q JOIN @cover AS c
        ON q.htmID BETWEEN c.HtmIdStart AND c.HtmIdEnd
     ) AS o
WHERE NOT EXISTS (
    SELECT p.convexId
    FROM @region AS p
    WHERE (o.cx*p.x + o.cy*p.y + o.cz*p.z < p.c)
    GROUP BY p.convexId
)
slide 53
Status
 All three libraries extensively tested
 Zones used for Maria’s thesis, plus various papers
 New HTM code in production use since July on SDSS
 Same code also used by STScI HLA, Galex
 Systematic regression tests developed
 Footprints computed for all major surveys
 Complex mask computations done on SDSS
 Loading: zones used for bulk crossmatch
 Ad hoc queries: use HTM-based search functions
 Excellent performance
slide 54
Prototype (Maria)
slide 55
PS1 PSPS
Object Data Manager Design
PSPS Critical Design Review
November 5-6, 2007
IfA
slide 56
Detail Design
 General Concepts
 Distributed Database architecture
 Ingest Workflow
 Prototype
slide 57
Zones (spatial partitioning and indexing algorithm)
 Partition and bin the data into declination zones
• ZoneID = floor((dec + 90.0) / zoneHeight)
 A few tricks required to handle spherical geometry
 Place the data close on disk
• Clustered index on ZoneID and RA
 Fully implemented in SQL
 Efficient
• Nearby searches
• Cross-Match (especially)
 Fundamental role in addressing the critical requirements
• Data volume management
• Association speed
• Spatial capabilities
Zoned Table (ZoneHeight = 8 arcsec in this example; CX, CY, CZ are the Cartesian unit-vector columns):
ObjID  ZoneID*   RA     Dec
1          0      0.0   -90.0
2      20250    180.0     0.0
3      20250    181.0     0.0
4      40500    360.0   +90.0
slide 59
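A minimal sketch of how such a zoned table can be built and clustered, using the ZoneID formula above with an 8 arcsec zone height (the production objZoneIndx build will differ in detail):

-- Compute ZoneID from Dec and cluster the data on (ZoneID, RA).
DECLARE @zoneHeight float
SET @zoneHeight = 8.0 / 3600.0          -- 8 arcsec in degrees

SELECT objID,
       CAST(FLOOR(([dec] + 90.0) / @zoneHeight) AS int) AS zoneID,
       ra, [dec], cx, cy, cz
INTO   objZoneIndx
FROM   Objects

CREATE CLUSTERED INDEX idx_zone_ra ON objZoneIndx (zoneID, ra)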
SQL CrossNeighbors
SELECT *
FROM prObj1 z1
JOIN zoneZone ZZ
  ON ZZ.zoneID1 = z1.zoneID
JOIN prObj2 z2
  ON ZZ.zoneID2 = z2.zoneID
WHERE
  z2.ra BETWEEN z1.ra - ZZ.alpha AND z1.ra + ZZ.alpha
AND
  z2.dec BETWEEN z1.dec - @r AND z1.dec + @r
AND
  (z1.cx*z2.cx + z1.cy*z2.cy + z1.cz*z2.cz) > cos(radians(@r))
slide 60
Good CPU Usage
slide 61
Partitions
 SQL Server 2005 introduces technology to handle tables which are partitioned across different disk volumes and managed by a single server.
 Partitioning makes management and access of large tables and indexes more efficient
• Enables parallel I/O
• Reduces the amount of data that needs to be accessed
• Related tables can be aligned and collocated in the same place, speeding up JOINs
slide 62
Partitions
 2 key elements
• Partitioning function
– Specifies how the table or index is partitioned
• Partitioning scheme
– Using a partitioning function, the scheme specifies the placement of the partitions on file groups
 Data can be managed very efficiently using Partition Switching
• Add a table as a partition to an existing table
• Switch a partition from one partitioned table to another
• Reassign a partition to form a single table
 Main requirement
• The table must be constrained on the partitioning column
slide 63
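The two elements and the switch operation look roughly like this in T-SQL; the boundary values and file group names below are purely illustrative, not the PS1 production values:

-- Partitioning function: how objID ranges map to partitions.
CREATE PARTITION FUNCTION pfObjID (bigint)
    AS RANGE RIGHT FOR VALUES (30000000000000000, 60000000000000000)

-- Partitioning scheme: where each partition lives (one file group each).
CREATE PARTITION SCHEME psObjID
    AS PARTITION pfObjID TO (fg1, fg2, fg3)

-- A table created on the scheme is constrained on the partitioning column,
-- so a staged table with a matching CHECK constraint can be switched in:
-- ALTER TABLE Objects_staged SWITCH TO Objects PARTITION 2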
Partitions
 For the PS1 design,
• Partitions mean File Group Partitions
• Tables are partitioned into ranges of ObjectID, which correspond to declination ranges
• ObjectID boundaries are selected so that each partition has a similar number of objects
slide 64
Distributed Partitioned Views
 Tables participating in the Distributed Partitioned View (DPV) reside in different databases, which reside on different instances or different (linked) servers
slide 65
Concept: Slices
 In the PS1 design, the bigger tables will be partitioned across servers
 To avoid confusion with the File Group Partitioning, we call them “Slices”
 Data is glued together using Distributed Partitioned Views
 The ODM will manage slices. Using slices improves system scalability.
 For the PS1 design, tables are sliced into ranges of ObjectID, which correspond to broad declination ranges. Each slice is subdivided into partitions that correspond to narrower declination ranges.
 ObjectID boundaries are selected so that each slice has a similar number of objects.
slide 66
Detail Design Outline
 General Concepts
 Distributed Database architecture
 Ingest Workflow
 Prototype
slide 67
PS1 Distributed DB system
[Diagram: the loading layer (LoadAdmin plus LoadSupport1..n servers holding objZoneIndx, detections, Detections_l, LnkToObj_l and orphans tables) feeds, via linked servers, the slice servers P1..Pm (partitioned [Objects_p], [LnkToObj_p], [Detections_p] tables plus PartitionsMap and Meta) and the PS1 main database (PartitionsMap, Objects, LnkToObj, Meta and the Detections partitioned view). The Query Manager (QM) and Web Based Interface (WBI) query the PS1 database. Legend: full table, output table, partitioned view, [partitioned table].]
slide 68
Design Decisions: ObjID
 Objects have their positional information encoded in their objID
• fGetPanObjID (ra, dec, zoneH)
• ZoneID is the most significant part of the ID
 It gives scalability, performance, and spatial functionality
 Object tables are range partitioned according to their object ID
slide 69
ObjectID Clusters Data Spatially
Example: Dec = –16.71611583, RA = 101.287155, ZH = 0.008333
ZID = (Dec + 90) / ZH = 08794.0661
ObjectID = 087941012871550661
ObjectID is unique when objects are separated by >0.0043 arcsec
slide 70
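The encoding in this example can be reproduced with a small T-SQL sketch. This is not the ODM's actual fGetPanObjID, just a hypothetical function showing how the zone number ends up in the most significant digits, followed by RA in micro-degrees and the sub-zone offset:

CREATE FUNCTION dbo.fSketchPanObjID (@ra float, @dec float, @zoneH float)
RETURNS bigint
AS
BEGIN
    DECLARE @zid float, @zone bigint, @raPart bigint, @zFrac bigint
    SET @zid    = (@dec + 90.0) / @zoneH              -- e.g. 8794.0661
    SET @zone   = FLOOR(@zid)                         -- most significant part
    SET @raPart = ROUND(@ra * 1000000.0, 0)           -- RA in micro-degrees (9 digits)
    SET @zFrac  = ROUND((@zid - @zone) * 10000.0, 0)  -- sub-zone offset (4 digits)
    -- 5-digit zone | 9-digit RA | 4-digit offset -> 087941012871550661
    RETURN @zone * 10000000000000 + @raPart * 10000 + @zFrac
END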
Design Decisions: DetectID
 Detections have their positional information encoded in the detection identifier
• fGetDetectID (dec, observationID, runningID, zoneH)
• Primary key (objID, detectionID), to align detections with objects within partitions
• Provides efficient access to all detections associated to one object
• Provides efficient access to all detections of nearby objects
slide 71
DetectionID Clusters Data in Zones
Example: Dec = –16.71611583, ZH = 0.008333
ZID = (Dec + 90) / ZH = 08794.0661
ObservationID = 1050000, Running ID = 1234567
DetectID = 0879410500001234567
slide 72
ODM Capacity
5.3.1.3 The PS1 ODM shall be able to ingest into the ODM a total of
• 1.5 x 10^11 P2 detections
• 8.3 x 10^10 cumulative sky (stack) detections
• 5.5 x 10^9 celestial objects
together with their linkages.
slide 73
PS1 Table Sizes - Monolithic
Table            Year 1   Year 2   Year 3   Year 3.5
Objects            2.31     2.31     2.31     2.31
StackPsfFits       5.07    10.16    15.20    17.74
StackToObj         0.92     1.84     2.76     3.22
StackModelFits     1.15     2.29     3.44     4.01
P2PsfFits          7.87    15.74    23.61    27.54
P2ToObj            1.33     2.67     4.00     4.67
Other Tables       3.19     6.03     8.87    10.29
Indexes +20%       4.37     8.21    12.04    13.96
Total             26.21    49.24    72.23    83.74
Sizes are in TB
slide 74
What goes into the main Server
[Diagram: the PS1 main database holds the full Objects, LnkToObj, PartitionsMap and Meta tables (Objects and LnkToObj partitioned by ObjectID), plus distributed partitioned views over the slice servers P1..Pm reached through linked servers. Legend: full table [partitioned table], output table, distributed partitioned view.]
slide 75
What goes into slices
[Diagram: each slice server Px holds its partitioned [Objects_px], [LnkToObj_px] and [Detections_px] tables plus local copies of PartitionsMap and Meta; the PS1 main database keeps the full Objects, LnkToObj, PartitionsMap and Meta tables. Legend: full table [partitioned table], output table, distributed partitioned view.]
slide 76
slide 77
Duplication of Objects & LnkToObj
 Objects are distributed across slices
 Objects, P2ToObj, and StackToObj are duplicated in the slices to parallelize “inserts” & “updates”
 Detections belong in their object’s slice
 Orphans belong to the slice where their position would allocate them
• Orphans near slice boundaries will need special treatment
 Objects keep their original object identifier
• Even though positional refinement might change their zoneID and therefore the most significant part of their identifier
slide 78
Glue = Distributed Views
[Diagram: the Detections distributed partitioned view in the PS1 main database is the union of the [Detections_p1]..[Detections_pm] tables on the slice servers P1..Pm, reached through linked servers. Legend: full table [partitioned table], output table, distributed partitioned view.]
slide 79
Partitioning in Main Server
 Main server is partitioned (Objects) and collocated (LnkToObj) by objID
 Slices are partitioned (Objects) and collocated (LnkToObj) by objID
[Diagram: the PS1 main database and slice servers P1..Pm connected through linked servers, with the Query Manager (QM) and Web Based Interface (WBI) on top.]
slide 80
PS1 Table Sizes - Main Server
Table            Year 1   Year 2   Year 3   Year 3.5
Objects            2.31     2.31     2.31     2.31
StackPsfFits        -        -        -        -
StackToObj         0.92     1.84     2.76     3.22
StackModelFits      -        -        -        -
P2PsfFits           -        -        -        -
P2ToObj            1.33     2.67     4.00     4.67
Other Tables       0.41     0.46     0.52     0.55
Indexes +20%       0.99     1.46     1.92     2.15
Total              5.96     8.74    11.51    12.90
Sizes are in TB
slide 81
PS1 Table Sizes - Each Slice
                  Year 1   Year 2   Year 3   Year 3.5
Table               m=4      m=8     m=10     m=12
Objects            0.58     0.29     0.23     0.19
StackPsfFits       1.27     1.27     1.52     1.48
StackToObj         0.23     0.23     0.28     0.27
StackModelFits     0.29     0.29     0.34     0.33
P2PsfFits          1.97     1.97     2.36     2.30
P2ToObj            0.33     0.33     0.40     0.39
Other Tables       0.75     0.81     1.00     1.01
Indexes +20%       1.08     1.04     1.23     1.19
Total              6.50     6.23     7.36     7.16
Sizes are in TB (m = number of slices)
slide 82
PS1 Table Sizes - All Servers
Table            Year 1   Year 2   Year 3   Year 3.5
Objects            4.63     4.63     4.61     4.59
StackPsfFits       5.08    10.16    15.20    17.76
StackToObj         1.84     3.68     5.56     6.46
StackModelFits     1.16     2.32     3.40     3.96
P2PsfFits          7.88    15.76    23.60    27.60
P2ToObj            2.65     5.31     8.00     9.35
Other Tables       3.41     6.94    10.52    12.67
Indexes +20%       5.33     9.76    14.18    16.48
Total             31.98    58.56    85.07    98.87
Sizes are in TB
slide 83
Detail Design Outline
 General Concepts
 Distributed Database architecture
 Ingest Workflow
 Prototype
slide 84
PS1 Distributed DB system
[Diagram: same distributed layout as slide 68 — loading servers (LoadAdmin, LoadSupport1..n), slice servers P1..Pm and the PS1 main database with its Detections partitioned view, fronted by the Query Manager (QM) and Web Based Interface (WBI). Legend: full table, output table, partitioned view, [partitioned table].]
slide 85
“Insert” & “Update”
 SQL Insert and Update are expensive operations due to logging and re-indexing
 In the PS1 design, Insert and Update have been refactored into sequences of:
Merge + Constrain + Switch Partition
 Frequency
• f1: daily
• f2: at least monthly
• f3: TBD (likely to be every 6 months)
slide 86
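A minimal sketch of the Merge + Constrain + Switch sequence for one partition (table names, the objID boundaries, and index/filegroup details are illustrative, not the production values):

-- 0. Switch the current partition out to a staging table (metadata only).
ALTER TABLE P2PsfFits SWITCH PARTITION 2 TO P2PsfFits_old

-- 1. Merge: old rows plus the new daily rows, written with SELECT INTO
--    (minimally logged); the PK index is then rebuilt on the result.
SELECT *
INTO   P2PsfFits_staged
FROM   P2PsfFits_old
UNION ALL
SELECT * FROM P2PsfFits_daily

-- 2. Constrain: a CHECK constraint proves the table fits one partition range.
ALTER TABLE P2PsfFits_staged WITH CHECK
    ADD CONSTRAINT ckObjIdRange
    CHECK (objID >= 30000000000000000 AND objID < 60000000000000000)

-- 3. Switch the merged table back in as partition 2 (metadata only).
ALTER TABLE P2PsfFits_staged SWITCH TO P2PsfFits PARTITION 2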
Ingest Workflow
[Diagram: CSV detections are zoned (DZone) and cross-matched against ObjectsZ at 1" (X(1")) and, for the remaining non-matches, at 2" (X(2")); matches are resolved into P2ToObj links and P2PsfFits rows, and detections that still have no match go to Orphans.]
slide 87
Ingest @ frequency = f1
[Diagram: daily ingest flow from the LOADER (ObjectsZ, P2ToObj, P2PsfFits, Orphans) to SLICE_1 (Objects_1, P2ToObj_1, P2ToPsfFits_1, Orphans_1) and the MAIN server (Objects, StackToObj, P2ToObj, Metadata+); numbered steps mark the order of operations.]
slide 88
Updates @ frequency = f2
[Diagram: at least monthly, refreshed Objects are pushed from the LOADER to SLICE_1 (Objects_1, P2ToObj_1, P2ToPsfFits_1, Orphans_1) and the MAIN server (Objects, P2ToObj, StackToObj, Metadata+).]
slide 89
Updates @ frequency = f2
[Diagram: same update flow as the previous slide.]
slide 90
Snapshots @ frequency = f3
[Diagram: at frequency f3 a snapshot of Objects, P2ToObj and StackToObj plus Metadata+ is taken on the MAIN server.]
slide 91
Batch Update of a Partition
[Diagram: the rows of the existing partitions (A1, A2, A3) and the new data are merged with SELECT INTO ... WHERE into staging tables B1, B2, B3, each with its PK index; the staging tables are then switched in to replace the old partitions.]
slide 92
Scaling-out
 Apply Ping-Pong strategy to satisfy query performance during ingest:
2 x (1 main + m slices)
[Diagram: two full copies of the PS1 main database and the slice servers P1..Pm, connected through linked servers and fronted by the Query Manager (QM); one copy serves queries while the other is updated. Legend: database, duplicate, full table, [partitioned table], partitioned view, duplicate partitioned view.]
slide 93
Scaling-out
 More robustness, fault-tolerance, and reliability calls for:
3 x (1 main + m slices)
[Diagram: three copies of the main database and slice servers, as on the previous slide.]
slide 94
Adding New slices
SQL Server range partitioning capabilities make it easy (a sketch of the corresponding SQL follows below):
 Recalculate partitioning limits
 Transfer data to new slices
 Remove data from slices
 Define and apply new partitioning schema
 Add new partitions to main server
 Apply new partitioning schema to main server
slide 95
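A sketch of the repartitioning statements involved; the partition function/scheme names, file group and new boundary value are illustrative:

-- Tell the scheme which file group the next new partition should use,
-- then split the range so the new slice gets its own objID interval.
ALTER PARTITION SCHEME psObjID NEXT USED fgNewSlice
ALTER PARTITION FUNCTION pfObjID() SPLIT RANGE (90000000000000000)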
Adding New Slices
slide 96
Detail Design Outline
 General Concepts
 Distributed Database architecture
 Ingest Workflow
 Prototype
slide 97
ODM Ingest Performance
5.3.1.6 The PS1 ODM shall be able to ingest the data from the IPP at two times the nominal daily arrival rate*
* The nominal daily data rate from the IPP is defined as the total data volume to be ingested annually by the ODM divided by 365.
 Nominal daily data rate:
• 1.5 x 10^11 / 3.5 / 365 = 1.2 x 10^8 P2 detections / day
• 8.3 x 10^10 / 3.5 / 365 = 6.5 x 10^7 stack detections / day
slide 99
Number of Objects
                  miniProto    myProto      Prototype
SDSS* Stars       5.7 x 10^4   1.3 x 10^7   1.1 x 10^8
SDSS* Galaxies    9.1 x 10^4   1.1 x 10^7   1.7 x 10^8
Galactic Plane    1.5 x 10^6   3 x 10^6     1.0 x 10^9
TOTAL             1.6 x 10^6   2.6 x 10^7   1.3 x 10^9
PS1 (for comparison): 5.5 x 10^9 objects
* “SDSS” includes a mirror of the 11.3 < Dec < 30 objects to Dec < 0
Total GB of CSV loaded data: 300 GB
CSV bulk insert load: 8 MB/s
Binary bulk insert: 18-20 MB/s
Creation
Started: October 15th 2007
Finished: October 29th 2007 (??)
Includes
• 10 epochs of P2PsfFits detections
• 1 epoch of Stack detections
slide 100
Prototype in Context
Survey               Objects      Detections
SDSS DR6             3.8 x 10^8
2MASS                4.7 x 10^8
USNO-B               1.0 x 10^9
Prototype            1.3 x 10^9   1.4 x 10^10
PS1 (end of survey)  5.5 x 10^9   2.3 x 10^11
slide 102
Size of Prototype Database
Table            Main    Slice1  Slice2  Slice3  Loader   Total
Objects          1.30    0.43    0.43    0.43    1.30      3.89
StackPsfFits     6.49    -       -       -       -         6.49
StackToObj       6.49    -       -       -       -         6.49
StackModelFits   0.87    -       -       -       -         0.87
P2PsfFits        -       4.02    3.90    3.35    0.37     11.64
P2ToObj          -       4.02    3.90    3.35    0.12     11.39
Total           15.15    8.47    8.23    7.13    1.79     40.77
Extra Tables     0.87    4.89    4.77    4.22    6.86     21.61
Grand Total     16.02   13.36   13.00   11.35    8.65     62.38
Table sizes are in billions of rows
slide 103
Size of Prototype Database
Table             Main    Slice1  Slice2  Slice3  Loader    Total
Objects           547.6   165.4   165.3   165.3   137.1    1180.6
StackPsfFits      841.5   -       -       -       -         841.6
StackToObj        300.9   -       -       -       -         300.9
StackModelFits    476.7   -       -       -       -         476.7
P2PsfFits         -       879.9   853.0   733.5    74.7    2541.1
P2ToObj           -       125.7   121.9   104.8     3.8     356.2
Total            2166.7  1171.0  1140.2  1003.6   215.6    5697.1
Extra Tables      207.9   987.1   960.2   840.7   957.3    3953.2
Allocated / Free 1878.0  1223.0  1300.0  1121.0   666.0    6188.0
Grand Total      4252.6  3381.1  3400.4  2965.3  1838.9   15838.3
Table sizes are in GB
9.6 TB of data in a distributed database
slide 104
Well-Balanced Partitions
Server   Partition  Rows         Fraction  Dec Range
Main     1          432,590,598  33.34%    32.59
Slice 1  1          144,199,105  11.11%    14.29
Slice 1  2          144,229,343  11.11%     9.39
Slice 1  3          144,162,150  11.12%     8.91
Main     2          432,456,511  33.33%    23.44
Slice 2  1          144,261,098  11.12%     8.46
Slice 2  2          144,073,972  11.10%     7.21
Slice 2  3          144,121,441  11.11%     7.77
Main     3          432,496,648  33.33%    81.98
Slice 3  1          144,270,093  11.12%    11.15
Slice 3  2          144,090,071  11.10%    14.72
Slice 3  3          144,136,484  11.11%    56.10
slide 105
Ingest and Association Times
Task                               Measured Minutes
Create Detections Zone Table              39.62
X(0.2") 121M X 1.3B                       65.25
Build #noMatches Table                     1.50
X(1") 12k X 1.3B                           0.65
Build #allMatches Table (121M)             6.58
Build Orphans Table                        0.17
Create P2PsfFits Table                    11.63
Create P2ToObj Table                      14.00
Total of Measured Times                  140.40
slide 106
Ingest and Association Times
Task                               Estimated Minutes
Compute DetectionID, HTMID                   30
Remove NULLs                                 15
Index P2PsfFits on ObjID                     15
Slices Pulling Data from Loader               5
Resolve 1 Detection - N Objects              10
Total of Estimated Times                     75
(Estimates range from educated guesses to wild guesses.)
slide 107
Total Time to I/A daily Data
Task                                Time (hours)   Time (hours)
Ingest 121M Detections (binary)          0.32           -
Ingest 121M Detections (CSV)              -             0.98
Total of Measured Times                  2.34           2.34
Total of Estimated Times                 1.25           1.25
Total Time to I/A Daily Data             3.91           4.57
Requirement: Less than 12 hours (more than 2800 detections / s)
Detection Processing Rate: 8600 to 7400 detections / s
Margin on Requirement: 3.1 to 2.6
Using multiple loaders would improve performance
slide 108
Insert Time @ slices
Task                                Estimated Minutes
Import P2PsfFits (binary out/in)          20.45
Import P2PsfFits (binary out/in)           2.68
Import Orphans                             0.00
Merge P2PsfFits                           58 (educated guess)
Add constraint P2PsfFits                 193
Merge P2ToObj                             13
Add constraint P2ToObj                    54
Total of Measured Times                  362
That is about 6 h with 8 partitions/slice (~1.3 x 10^9 detections/partition)
slide 109
Detections Per Partition
Years  Total Detections  Slices  Partitions per Slice  Total Partitions  Detections per Partition
0.0    0.00                 4            8                   32           0.00
1.0    4.29 x 10^10         4            8                   32           1.34 x 10^9
1.0    4.29 x 10^10         8            8                   64           6.7 x 10^8
2.0    8.57 x 10^10         8            8                   64           1.34 x 10^9
2.0    8.57 x 10^10        10            8                   80           1.07 x 10^9
3.0    1.29 x 10^11        10            8                   80           1.61 x 10^9
3.0    1.29 x 10^11        12            8                   96           1.34 x 10^9
3.5    1.50 x 10^11        12            8                   96           1.56 x 10^9
slide 110
Total Time for Insert @ slice
Task                        Time (hours)
Total of Measured Times          0.25
Total of Estimated Times         5.3
Total Time for daily insert      6
Daily insert may operate in parallel with daily ingest and association.
Requirement: Less than 12 hours
Margin on Requirement: 2.0
Using more slices will improve insert performance.
slide 111
Summary
 Ingest + Association < 4 h using 1 loader (@f1 = daily)
• Scales with the number of servers
• Current margin on requirement 3.1
• Room for improvement
 Detection Insert @ slices (@f1 = daily)
• 6 h with 8 partitions/slice
• It may happen in parallel with loading
 Detections Lnks Insert @ main (@f2 < monthly)
• Unknown
• 6 h available
 Objects insert & update @ slices (@f2 < monthly)
• Unknown
• 6 hours available
 Objects update @ main server (@f2 < monthly)
• Unknown
• 12 h available. Transfer can be pipelined as soon as objects have been processed
slide 112
Risks
 Estimates of Insert & Update at slices could be underestimated
• Need more empirical evaluation exercising parallel I/O
 Estimates and layout of disk storage could be underestimated
• Merges and Indexes require 2x the data size
slide 113
Hardware/Scalability (Jan)
slide 114
PS1 Prototype Systems Design
Jan Vandenberg, JHU
Early PS1 Prototype
slide 115
Engineering Systems to Support the Database Design
 Sequential read performance is our life-blood. Virtually all science queries will be I/O-bound.
 ~70 TB raw data: 5.9 hours for a full scan on IBM’s fastest 3.3 GB/s champagne-budget SAN
• Need a 20 GB/s I/O engine just to scan the full data in less than an hour. Can’t touch this on a monolith.
 Data mining is a challenge even with good index coverage
• ~14 TB worth of indexes: 4-odd times bigger than SDSS DR6.
 Hopeless if we rely on any bulk network transfers: must do the work where the data is
 Loading/Ingest is more CPU-bound, though we still need solid write performance
slide 116
Choosing I/O Systems
 So killer sequential I/O performance is a key systems
design goal. Which gear to use?
• FC/SAN?
• Vanilla SATA?
• SAS?
slide 117
Fibre Channel, SAN
 Expensive but not-so-fast physical links (4 Gbit, 10 Gbit)
 Expensive switch
 Potentially very flexible
 Industrial-strength manageability
 Little control over RAID controller bottlenecks
slide 118
Straight SATA
 Fast
 Pretty cheap
 Not so industrial-strength
slide 119
SAS
 Fast: 12 Gbit/s FD building blocks
 Nice and mature, stable
 SCSI’s not just for swanky drives anymore: takes SATA drives!
 So we have a way to use SATA without all the “beige”.
 Pricey? $4400 for a full 15x750GB system ($296/drive == close to Newegg media cost)
slide 120
SAS Performance, Gory Details
 SAS v. SATA differences
[Chart: native SAS vs. SATA performance, MB/s against number of disks (1-7); SAS delivers roughly 20% more throughput.]
slide 121
Per-Controller Performance
 One controller can’t quite accommodate the throughput of an entire storage enclosure.
[Chart: controller limits, MB/s against number of disks (1-13); measured single-controller throughput flattens near the 6 Gbit limit, below the ideal scaling line.]
slide 122
Resulting PS1 Prototype I/O Topology
 1100 MB/s single-threaded sequential reads per server
[Chart: aggregate design I/O performance, MB/s against number of disks (1-18); dual controllers continue to scale past the single-controller 6 Gbit limit, approaching the ideal line.]
slide 123
RAID-5 v. RAID-10?
 Primer, anyone?
 RAID-5 perhaps feasible with contemporary
controllers…
 …but not a ton of redundancy
 But after we add enough disks to meet performance
goals, we have enough storage to run RAID-10 anyway!
slide 124
RAID-10 Performance
 0.5*RAID-0 for single-threaded reads
 RAID-0 perf for 2-user/2-thread workloads
 0.5*RAID-0 writes
slide 125
PS1 Prototype Servers
[Diagram: prototype loader (Linux staging, L1, L2) and prototype DB (head H1 and slices S1, S2, S3).]
slide 126
PS1 Prototype Servers
PS1 Prototype
slide 127
PS1 Prototype Servers
slide 128
Projected PS1 Systems Design
[Diagram: projected PS1 cluster; Linux staging plus loaders L1..LN feed three replicas R1, R2, R3, each with head nodes H1, H2 and slices S1..S8, plus an additional group G1.]
slide 129
Backup/Recovery/Replication Strategies
 No formal backup
• …except maybe for mydb’s, f(cost*policy)
 3-way replication
• Replication != backup
– Little or no history (though we might have some point-in-time capabilities via metadata)
– Replicas can be a bit too cozy: must notice badness before replication propagates it
• Replicas provide redundancy and load balancing…
• Fully online: zero time to recover
• Replicas needed for happy production performance plus ingest, anyway
 Off-site geoplex
• Provides continuity if we lose HI (local or trans-Pacific network outage, facilities outage)
• Could help balance trans-Pacific bandwidth needs (service continental traffic locally)
slide 130
Why No Traditional Backups?
 Money no object… do traditional backups too!!!
 Synergy, economy of scale with other collaboration
needs (IPP?)… do traditional backups too!!!
 Not super pricey…
 …but not very useful relative to a replica for our
purposes
• Time to recover
slide 131
Failure Scenarios (Easy Ones)
 Zero downtime, little effort:
• Disks (common)
– Simple* hotswap
– Automatic rebuild from hotspare or replacement
drive
• Power supplies (not uncommon)
– Simple* hotswap
• Fans (pretty common)
– Simple* hotswap
* Assuming sufficiently non-beige gear
slide 132
Failure Scenarios (Mostly Harmless Ones)
 Some downtime and replica cutover:
• System board (rare)
• Memory (rare and usually proactively detected and handled via scheduled maintenance)
• Disk controller (rare, potentially minimal downtime via cold-spare controller)
• CPU (not utterly uncommon, can be tough and time consuming to diagnose correctly)
slide 133
Failure Scenarios (Slightly Spooky Ones)
 Database mangling by human or pipeline error
• Gotta catch this before replication propagates it everywhere
• Need lots of sanity checks before replicating
• (and so off-the-shelf near-realtime replication tools don’t help us)
• Need to run replication backwards from older, healthy replicas. Probably less automated than healthy replication.
 Catastrophic loss of datacenter
• Okay, we have the geoplex
– …but we’re dangling by a single copy ’till recovery is complete
– …and this may be a while.
– …but are we still in trouble? Depending on colo scenarios, did we also lose the IPP and flatfile archive?
slide 134
Failure Scenarios (Nasty Ones)
 Unrecoverable badness fully replicated before detection
 Catastrophic loss of datacenter without geoplex
 Can we ever catch back up with the data rate if we need
to start over and rebuild with an ingest campaign? Don’t
bet on it!
slide 135
Operating Systems, DBMS?
 Sql2005 EE x64
• Why?
• Why not DB2, Oracle RAC, PostgreSQL, MySQL,
<insert your favorite>?
 (Win2003 EE x64)
 Why EE? Because it’s there. <indexed DPVs?>
 Scientific Linux 4.x/5.x, or local favorite
 Platform rant from JVV available over beers
slide 136
Systems/Database Management
 Active Directory infrastructure
 Windows patching tools, practices
 Linux patching tools, practices
 Monitoring
 Staffing requirements
slide 137
Facilities/Infrastructure Projections for PS1
 Power/cooling
• Prototype is 9.2 kW (2.6 Tons AC)
• PS1: something like 43 kW, 12.1 Tons
 Rack space
• Prototype is 69 RU, <2 42U racks (includes 14U of rackmount UPS at JHU)
• PS1: about 310 RU (9-ish racks)
 Networking: ~40 Gbit Ethernet ports
 …plus sundry infrastructure, ideally already in place (domain controllers, monitoring systems, etc.)
slide 138
Operational Handoff to UofH
 Gulp.
slide 139
How Design Meets Requirements
 Cross-matching detections with objects
• Zone cross-match part of loading pipeline
• Already exceeded requirement with prototype
 Query performance
• Ping-pong configuration for query during ingest
• Spatial indexing and distributed queries
• Query manager can be scaled out as necessary
 Scalability
• Shared-nothing architecture
• Scale out as needed
• Beyond PS1 we will need truly parallel query plans
slide 140
WBS/Development Tasks
Development tasks:
• Refine Prototype/Schema
• Staging/Transformation
• Initial Load
• Load/Resolve Detections
• Resolve/Synchronize Objects
• Create Snapshot
• Replication Module
• Query Processing
• Hardware
• Redistribute Data
• Documentation
• Testing
Supporting items: Workflow Systems, Logging, Data Scrubbing, SSIS (?) + C#, QM/Logging
Individual task estimates range from 1 to 4 PM.
Total Effort: 35 PM
Delivery: 9/2008
slide 141
Personnel Available
 2 new hires (SW Engineers) 100%
 Maria 80%
 Ani 20%
 Jan 10%
 Alainna 15%
 Nolan Li 25%
 Sam Carliles 25%
 George Fekete 5%
 Laszlo Dobos 50% (for 6 months)
slide 142
Issues/Risks
 Versioning
• Do we need to preserve snapshots of monthly versions?
• How will users reproduce queries on subsequent versions?
• Is it OK that a new version of the sky replaces the previous one every month?
 Backup/recovery
• Will we need 3 local copies rather than 2 for safety?
• Is restoring from the offsite copy feasible?
 Handoff to IfA beyond scope of WBS shown
• This will involve several PMs
slide 143
Mahalo!
Query Manager
[Screenshot: Query Manager query page. Callouts: MyDB table that query results go into; context that the query is executed in; load one of the sample queries into the query buffer; check query syntax; name that this query job is given; get graphical query plan; query buffer; run query in quick (1 minute) mode; submit query to the long (8-hour) queue.]
slide 145
Query Manager
[Screenshot: stored procedure view, showing the stored procedure arguments and its SQL code.]
slide 146
Query Manager
[Screenshot: MyDB page. MyDB context is the default, but other contexts can be selected; the user can browse DB views, tables, functions and procedures; the space used and total space available are shown; multiple tables can be selected and dropped at once; the table list can be sorted by name, size, type.]
slide 147
Query Manager
[Screenshot: table detail page, showing the query that created this table.]
slide 148
Query Manager
[Screenshot: radial search form, with the context to run the search on, the search radius, and the table to hold results.]
slide 149