The World Wide Telescope: an Archetype for Online-Science
Jim Gray (Microsoft)
Alex Szalay (Johns Hopkins University)
Microsoft Academic Days in Silicon Valley
http://research.microsoft.com/~gray/talks

First, an aside: 2 other projects
• TerraServer
  – joint with USGS
• Giga Byte File Transfers
  – joint with Caltech and CERN

TerraServer
• Seamless mosaic of US
• ~20 TB of imagery
• 30 M web hits/day
• A scalability laboratory
  TerraServer Bricks – A High Availability Cluster Alternative (2004)
  TerraServer Cluster and SAN Experience (2004)
  TerraService.NET: An Introduction to Web Services (2002)
  Microsoft TerraServer: A Spatial Data Warehouse (1999)
  The Microsoft TerraServer™ (1998)
[Figure label: KVM / IP]

Giga Byte Per Second File Mover
• CERN to Pasadena
  – Windows TCP/IP, NTFS
  – Quantifying performance
  – Working on better algorithms
  – Opteron
  – Disk-to-disk at 550 MBps now (~2 TB/hour).
• GOAL: 1 GBps disk-to-disk.
• Gigabyte Bandwidth Enables Global Co-Laboratories
[Figure: CERN-Caltech Transfer Speeds. Sequential disk IO tests for Newisys->Newisys; file transfer MBps and 1-stream TCP MBps (0 to 1000), Mar-04 to Sep-04, with the PCI-X limit and the GBps Land Speed Record marked.]

The Evolution of Science
• Observational Science
  – Scientist gathers data by direct observation
  – Scientist analyzes data
• Analytical Science
  – Scientist builds analytical model
  – Makes predictions.
• Computational Science
  – Simulate analytical model
  – Validate model and make predictions
• Data Exploration Science
  – Data captured by instruments, or data generated by simulator
  – Processed by software
  – Placed in a database / files
  – Scientist analyzes database / files

Information Avalanche
• In science, industry, government, …
  – Better observational instruments and
  – better simulations are producing a data avalanche
[Image courtesy C. Meneveau & A. Szalay @ JHU]
• Examples
  – BaBar: Grows 1 TB/day (2/3 simulation information, 1/3 observational information)
  – CERN: LHC will generate 1 GB/s, ~10 PB/y
  – VLBA (NRAO) generates 1 GB/s today
  – Pixar: 100 TB/movie
• New emphasis on informatics:
  – Capturing, Organizing, Summarizing, Analyzing, Visualizing
[Images: BaBar, Stanford; P&E Gene Sequencer, from http://www.genome.uci.edu/; Space Telescope]

The Big Picture
[Diagram: Experiments & Instruments, Other Archives, Literature, Simulations; facts; questions; answers]

The Big Problems
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it
• How to coexist with others
• Query and Vis tools
• Support/training
• Performance
  – Execute queries in a minute
  – Batch query scheduling

FTP - GREP
• Download (FTP and GREP) are not adequate
  – You can GREP 1 MB in a second
  – You can GREP 1 GB in a minute
  – You can GREP 1 TB in 2 days
  – You can GREP 1 PB in 3 years.
• Oh!, and 1 PB ~ 3,000 disks
• At some point we need indices to limit search, and parallel data search and analysis
• This is where databases can help
• Next generation technique: Data Exploration
  – Bring the analysis to the data!

The Speed Problem
• Many users want to search the whole DB: ad hoc queries, often combinatorial
• Want ~1 minute response
• Brute force (parallel search):
  – 1 disk = 50 MBps => ~1M disks/PB ~ 300M$/PB
• Indices (limit search, do column store)
  – 1,000x less equipment: 1M$/PB
• Pre-compute answer
  – No one knows how to do it for all questions.

Next-Generation Data Analysis
• Looking for
  – Needles in haystacks: the Higgs particle
  – Haystacks: Dark matter, Dark energy
• Needles are easier than haystacks
• Global statistics have poor scaling
  – Correlation functions are N^2, likelihood techniques N^3
• As data and computers grow at the same rate, we can only keep up with N log N
• A way out?
  – Relax notion of optimal (data is fuzzy, answers are approximate)
  – Don’t assume infinite computational resources or memory
• Combination of statistics & computer science

Analysis and Databases
• Much statistical analysis deals with
  – Creating uniform samples
  – Data filtering
  – Assembling relevant subsets
  – Estimating completeness
  – Censoring bad data
  – Counting and building histograms
  – Generating Monte-Carlo subsets
  – Likelihood calculations
  – Hypothesis testing
• Traditionally these are performed on files
• Most of these tasks are much better done inside a database
• Move Mohamed to the mountain, not the mountain to Mohamed.

Organization & Algorithms
• Use of clever data structures (trees, cubes):
  – Up-front creation cost, but only N log N access cost
  – Large speedup during the analysis
  – Tree-codes for correlations (A. Moore et al 2001)
  – Data Cubes for OLAP (all vendors)
• Fast, approximate heuristic algorithms
  – No need to be more accurate than cosmic variance
  – Fast CMB analysis by Szapudi et al (2001)
    • N log N instead of N^3 => 1 day instead of 10 million years
• Take cost of computation into account
  – Controlled level of accuracy
  – Best result in a given time, given our computing resources

World Wide Telescope
Virtual Observatory http://www.ivoa.net/
• Premise: Most data is (or could be) online
• The Internet is the world’s best telescope:
  – It has data on every part of the sky
  – In every measured spectral band: optical, x-ray, radio..
  – As deep as the best instruments (2 years ago).
  – It is up when you are up. The “seeing” is always great (no working at night, no clouds, no moons, no..).
  – It’s a smart telescope: links objects and data to literature on them.

Why Astronomy?
• Community has lots of data
• Data is real and well documented
  – High-dimensional (with confidence intervals)
  – Spatial, temporal
• Diverse and distributed
  – Many different instruments from many different places and many different times
• Community wants to share/cross compare
  – Can freely share data and algorithms.
  – “DataMining, Not Data MINE!!” Mark Ellisman, UCSD
• They are well organized
• Community is small and homogeneous
• No commercial or privacy concerns
  – All the problems are technical or social.

The WWT Components
• Data Sources
  – Literature
  – Archives
• Unified Definitions
  – Units,
  – Semantics/Concepts/Metrics, Representations,
  – Provenance
• Object model
• Classes and methods
• Portals

Data Sources
• Literature online and cross indexed
  – Simbad, ADS, NED, http://simbad.u-strasbg.fr/Simbad, http://adswww.harvard.edu/, http://nedwww.ipac.caltech.edu/
• Many curated archives online
  – FIRST, DPOSS, 2MASS, USNO, IRAS, SDSS, VizieR,…
  – Typically files with English meta-data and some programs
• Groups, Researchers, Amateurs Publish
  – Datasets online in various formats
  – Data publications are ephemeral (may disappear)
  – Many have unknown provenance
• Documentation varies; some good and some none.

Unified Definitions
• Universal Content Definitions http://vizier.u-strasbg.fr/doc/UCD.htx
  – Collated all table heads from all the literature
  – 100,000 terms reduced to ~1,500
  – Rough consensus that this is the right thing.
  – Refinement in progress as people use UCDs
• Defines
  – Units:
    • gram, radian, second, jansky...
  – Semantic Concepts / Metrics
    • Std error, Chi^2 fit, magnitude, flux @ passband, velocity,

Provenance
• Most data will be derived.
• To do science, need to trace derived data back to source.
• So programs and inputs must be registered.
• Must be able to re-run them.
• Example: Space Telescope Calibrated Data
  – Run on demand
  – Can specify software version (to get old answers)
• Scientific Data Provenance and Curation are largely unsolved problems (some ideas but no science).

Object Model
• General acceptance of XML
• Recent acceptance of XML Schema (XSD over DTD)
• Wait-and-See about SOAP/WSDL/…
  – “Web Services are just Corba with angle brackets.”
  – FTP is good enough for me.
• Personal opinion:
  – Web Services are much more than “Corba + <>”
  – Huge focus on interop
  – Huge focus on integrated tools
• But the community says “Show me!”
  – Many technologists convinced, but not yet the astronomers
[Diagram labels: Your program, Web Server; Your program, Data in your address space, Web Service]

Classes and Methods
• First Class: VO table http://www.us-vo.org/VOTable/
  – Represents an answer set in XML
    • Defined by an XML Schema (XSD)
    • Metadata (in terms of UCDs)
    • Data representation (numbers and text)
  – First method
    • Cone Search: Get objects in this cone http://voservices.org/cone/
[Diagram labels: Your program, Data in your address space, Web Service]

Other Classes
• Space-Time class
  – http://hea-www.harvard.edu/~arots/nvometa/STCdoc.pdf
• Image Class (returns pixels)
  – SdssCutout
  – Simple Image Access Protocol http://bill.cacr.caltech.edu/cfdocs/usvo-pubs/files/ACF8DE.pdf
  – HyperAtlas http://bill.cacr.caltech.edu/usvo-pubs/files/hyperatlas.pdf
• Spectral
  – Simple Spectral Access Protocol
  – 500K spectra available at http://voservices.net/wave
• Query Services
  – ADQL and SkyNode http://skyservice.pha.jhu.edu/develop/vo/adql/
  – And http://SkyQuery.Net
• Registry:
  – see below
[Diagram labels: Your program, Data in your address space, Web Service]

The Registry
• UDDI seemed inappropriate
  – Complex
  – Irrelevant questions
  – Relevant questions missing
• Evolved Dublin Core
  – Represent Datasets, Services, Portals
  – Needs to be machine readable
  – Federation (DNS model)
  – Push & Pull: register then harvest
• http://www.ivoa.net/twiki/bin/view/IVOA/IvoaResReg

Demo
• SkyServer:
  –
    Navigator showing cutout web service
  – List: showing many calls and variant use.
• SkyQuery:
  – Show integration of various archives.
  – Explain spatial join xMatch operator.

SkyServer.SDSS.org
• A modern Astronomy archive
  – Raw pixel data lives in file servers
  – Catalog data (derived objects) lives in database
  – Online query to any and all
• Also used for education
  – 150 hours of online Astronomy
  – Implicitly teaches data analysis
• Interesting things
  – Spatial data search
  – Client query interface via Java Applet
  – Query interface via Emacs
  – Popular
  – Cloned by other surveys (a template design)
  – Web services are the core of it.

SkyQuery.Net: A Prototype WWT
• Started with SDSS data and schema
• Imported 12 other datasets into that spine schema (a day per dataset plus load time)
• Unified them with a portal
• Implicit spatial join among the datasets.
• All built on Web Services
  – Pure XML
  – Pure SOAP
  – Used .NET toolkit

Federation: SkyQuery.Net
• Combine 4 archives initially
• Added 9 more
• Send query to portal; portal joins data from archives.
• Problem: want to do multi-step data analysis (not just single query).
• Solution: Allow personal databases on portal
• Problem: some queries are monsters
• Solution: “batch schedule” on portal server, which deposits the answer in the personal database.

SkyQuery Structure
• Each SkyNode publishes
  – Schema Web Service
  – Database Web Service
• Portal
  – Plans query (2 phase)
  – Integrates answers
  – Is a web service
[Diagram labels: Image Cutout, SDSS, INT, SkyQuery Portal, FIRST, 2MASS]

MyDB
http://skyserver.sdss.org/cas
• Portal allows federation of data but…
• Intermediate results may be large.
• Intermediate results feed into the next analysis step.
• Sending them back and forth to the client is costly and sometimes infeasible.
• Solution: create a working DB for the client at the Portal: MyDB

MyDB
http://skyserver.sdss.org/cas
• Anyone can create a personal DB at SkyServer portal.
  – It is about 100 MB
  – It is private
• Simple queries done immediately
• Complex queries done by batch scheduler
• All queries can create/read/write MyDB tables
• Very popular with “serious” users.
• MyDB will be sharable by a group.

Open SkyQuery
• SkyQuery is being adopted by AstroGrid as the reference implementation for OGSA-DAI (Open Grid Services Architecture, Data Access and Integration).
• SkyNode: basic archive object http://www.ivoa.net/twiki/bin/view/IVOA/SkyNode
• SkyQuery Language (VoQL) is evolving. http://www.ivoa.net/twiki/bin/view/IVOA/IvoaVOQL

Outline

The WWT Components
• Data Sources
  – Literature
  – Archives
• Unified Definitions
  – Units,
  – Semantics/Concepts/Metrics, Representations,
  – Provenance
• Object model
• Classes and methods
• Portals
WWT is a poster child for the Data Grid.

What we learned
• Astro is a community of 10,000
• Homogeneous & Cooperative
• If you can’t do it for Astro, do not bother with 3M bio-info.
• Agreement
  – Takes time
  – Takes endless meetings
• Big problems are non-technical
  – Legacy is a big problem.
• Plumbing and tools are there. But…
  – What is the object model?
  – What do you want to save?
  – How to document provenance?

References (all are MSR TRs)
  Where the Rubber Meets the Sky: Bridging the Gap between Databases and Science
  When Database Systems Meet the Grid
  There Goes the Neighborhood: Relational Algebra for Spatial Data Search
  Extending the SDSS Batch Query System to the National Virtual Observatory Grid
  The World-Wide Telescope, an Archetype for Online Science
  Data Mining the SDSS SkyServer Database
  The SDSS SkyServer – Public Access to the Sloan Digital Sky Server Data
  Web Services for the Virtual Observatory
  Online Scientific Data Curation, Publication, and Archiving
  Petabyte Scale Data Mining: Dream or Reality?
  Designing and Mining Multi-Terabyte Astronomy Archives: The Sloan Digital Sky Survey
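
Appendix: Cone Search sketch
To make the Cone Search method from the Classes and Methods slide concrete ("Get objects in this cone"), here is a minimal Python sketch. It is an illustration only and not the code behind http://voservices.org/cone/: the function names, the dictionary-based catalog, and the sample coordinates are all assumptions made for this example. It filters an in-memory list by great-circle distance, which is the brute-force scan that the Speed Problem slide argues a real archive should replace with an index.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions.

    All inputs are in degrees (right ascension, declination).
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form: well behaved for the small angles typical of cone searches.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(catalog, ra, dec, radius_deg):
    """Return catalog entries within radius_deg of (ra, dec).

    Hypothetical helper: a linear scan over every object, with no index.
    """
    return [
        obj for obj in catalog
        if angular_separation_deg(ra, dec, obj["ra"], obj["dec"]) <= radius_deg
    ]


# Made-up sample catalog; the ids and coordinates are not real survey data.
catalog = [
    {"id": "a", "ra": 180.00, "dec": 0.00},
    {"id": "b", "ra": 180.05, "dec": 0.02},
    {"id": "c", "ra": 181.00, "dec": 0.00},
    {"id": "d", "ra": 10.00, "dec": -45.00},
]

hits = cone_search(catalog, ra=180.0, dec=0.0, radius_deg=0.1)
print([obj["id"] for obj in hits])
```

Because every object is tested, the cost grows linearly with catalog size; a spatial index would let the archive skip most of the sky before computing any distances.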