Information At Your Fingertips
Web Services
Jim Gray & Tom Barclay
Microsoft Research
Alex Szalay
Johns Hopkins University
1
Communications Excitement!!
                Point-to-Point         Broadcast
Immediate       conversation, money    lecture, concert
Time Shifted    mail                   book, newspaper
Network + DB (DataBase)
It's ALL going electronic
Immediate is being stored for analysis (so ALL database)
Analysis & Automatic Processing are being added
2
Slide borrowed from Craig Mundie
Information Excitement!
• All information will be online (somewhere)
text, speech, sound, vision, graphics, spatial, time…
• You might record everything
– read: 10MB/day, 400 GB/lifetime (5 disks today)
– hear: 400MB/day, 16 TB/lifetime (2 disks/year today)
– see: 1MB/s, 40GB/day, 1.6 PB/lifetime (150 disks/year maybe someday)
• Information at Your Fingertips
–Make it easy to capture & present
–Make it easy to store & organize & access
–Make it easy to analyze & summarize
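The lifetime totals quoted above follow from the daily rates by simple multiplication. A minimal Python sketch, assuming decimal units (1 GB = 1000 MB); the names are illustrative. Dividing each lifetime total by its daily rate shows that all three figures imply the same span of 40,000 days, roughly 110 years.

```python
# Back-of-envelope check of the "record everything" figures.
# Assumption: decimal units (1 GB = 1000 MB, and so on).
MB = 10**6
GB = 10**9
TB = 10**12

# Daily rates and lifetime totals as quoted on the slide.
DAILY_BYTES = {"read": 10 * MB, "hear": 400 * MB, "see": 40 * GB}
LIFETIME_BYTES = {"read": 400 * GB, "hear": 16 * TB, "see": 1600 * TB}  # 1.6 PB


def implied_lifetime_days(channel: str) -> int:
    """Days of recording needed to reach the quoted lifetime total."""
    return LIFETIME_BYTES[channel] // DAILY_BYTES[channel]


if __name__ == "__main__":
    for channel in DAILY_BYTES:
        days = implied_lifetime_days(channel)
        print(f"{channel}: {days} days (~{days / 365.25:.0f} years)")
```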
3
How much information is there?
• Soon everything can be recorded and indexed
• Most bytes will never be seen by humans.
• Data summarization, trend detection, anomaly detection are key technologies
See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian: How much information http://www.sims.berkeley.edu/research/projects/how-much-info/
[Figure: byte scale Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, with markers for A Book, A Photo, A Movie, All LoC books (words), All Books MultiMedia, and Everything Recorded!]
Small prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli
4
How do we get information today?
• Human searches web (with an index)
• Human browses pages
5
How do we get information tomorrow?
• Agents gather and digest it for us.
• Q: How?
• A (Microsoft): Dot Net
– Discovery: UDDI, WSDL
– Explore: SOAP
[Figure: Digital Dashboard and My Agents connected to Web Services through SOAP and WSDL]
6
How do you publish information?
• Get the data.
• Conceptualize the data schema
• Provide methods that return data subsets: f, g, x, y…
– Challenge: how much processing on your server?
• Publish the schema and methods.
• We are exploring these issues.
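The publishing steps above can be sketched in a few lines of Python. Everything here is hypothetical: the `Place` schema, the `places_in_rect` and `count_places_in_rect` methods, and the sample rows are invented for illustration. The point is only that a publisher exposes a described schema plus methods that return subsets. A count method lets a client learn how large an answer is before asking for it, which is one way to limit server-side processing.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class Place:
    """Hypothetical published schema: one row per named place."""
    name: str
    lon: float  # degrees east
    lat: float  # degrees north


# "Get the data": a tiny in-memory stand-in for a real database.
_PLACES = [
    Place("Alpha", -122.3, 47.6),
    Place("Beta", -76.6, 39.3),
    Place("Gamma", -118.2, 34.1),
]


def describe_schema() -> List[str]:
    """Publish the schema: field names and types a client can rely on."""
    return [f"{f.name}: {f.type.__name__}" for f in fields(Place)]


def places_in_rect(west: float, south: float, east: float, north: float) -> List[Place]:
    """Method returning a data subset: places inside a lon/lat rectangle."""
    return [
        p for p in _PLACES
        if west <= p.lon <= east and south <= p.lat <= north
    ]


def count_places_in_rect(west: float, south: float, east: float, north: float) -> int:
    """Cheap companion method: size of the subset without shipping the rows."""
    return len(places_in_rect(west, south, east, north))


if __name__ == "__main__":
    print(describe_schema())
    print(count_places_in_rect(-125.0, 30.0, -115.0, 50.0))
```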
7
TerraServer Example
• What is TerraServer?
– 3TB Internet Map DB available since June 1998
– USGS photo and topo maps of the US
– Integrated with Home Advisor
– Shows off SQL Server availability & scalability
– Designed for basic computer systems and low speed communications
• What is TerraService?
– A .NET web service
– Makes TerraServer data available to other apps
8
Application Goals
• BIG: 1 TB of data including catalog, temporary space, etc. (slide annotation: 3 TB)
• PUBLIC: available on the world wide web
• INTERESTING: to a wide audience
• ACCESSIBLE: using standard browsers (IE, Netscape)
• REAL: a LOB application (users can buy imagery)
• FREE: cannot require an NDA or payment from a user for access
• FAST: usable on low-speed (56kbps) and high-speed (T-1+) connections
• EASY: we do not want a large group to develop, deploy, or maintain the application
• AVAILABLE: always, 24x7x52, 99.99% of the time
• PROGRAMMABLE: .NET applications can integrate TerraServer data into their apps
10
Demo
http://terraserver.microsoft.com
Show: photo, topo, gazetteer, demographics
11
Hardware
• 8 Compaq DL360 “Photon” Web Servers
• 4 Compaq ProLiant 8500 Db Servers
• One SQL database per rack; each rack contains 4.5 TB
• 261 total drives / 13.7 TB total
• Fiber SAN Switches
• Meta Data: stored on 101 GB “Fast, Small Disks” (18 x 18.2 GB)
• Imagery Data: stored on 4 339 GB “Slow, Big Disks” (15 x 73.8 GB)
• To add 90 72.8 GB disks in Feb 2001 to create 18 TB SAN
[Figure: SAN drive-letter layout (E through S) for SQL\Inst1 and SQL\Inst2]
12
TerraServer Experience
• Successful Web Site
– Met all 8 goals: interesting, big, real, public, fast, easy, accessible, and free
– High Availability: Windows Data Center & Compaq SAN Technology
– Top 1000 Web Site: continues to be popular
• New Feature Requests
– Programmable access to meta-data
– User selectable image sizes, i.e. “a map server”
– Permission to use TerraServer data within server applications
13
What is a Web Service?
A programmable application component accessible via standard Web protocols
• UDDI (Universal Description, Discovery, and Integration): provide a Directory of Services on the Internet
• SOAP Discovery: you can ask a site for a description of the Web Services it offers
• SOAP Contract Language: Web Services are defined in terms of the formats and ordering of messages
• SOAP: Web Service consumers can send and receive messages using XML
• XML & HTTP: all these capabilities are built using open Internet protocols
[Figure: a Web Service layered on Open Internet Protocols]
14
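To make "send and receive messages using XML" concrete, here is a minimal Python sketch that builds a SOAP 1.1 request envelope with the standard library. The envelope namespace is the SOAP 1.1 one; the service namespace `http://example.org/hypothetical-service` and the parameter names are assumptions made up for this illustration and do not describe any real service contract.

```python
import xml.etree.ElementTree as ET

SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/hypothetical-service"     # hypothetical, for illustration


def build_soap_request(method: str, params: dict) -> bytes:
    """Wrap a method call and its parameters in a SOAP 1.1 envelope."""
    ET.register_namespace("soap", SOAP_ENV_NS)
    envelope = ET.Element(f"{{{SOAP_ENV_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{method}")
    for name, value in params.items():
        # Each parameter becomes a child element of the method element.
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


if __name__ == "__main__":
    # Parameter names below are invented; a real client would take them from the WSDL.
    message = build_soap_request("GetPlaceList", {"placeName": "Seattle", "maxItems": 5})
    print(message.decode("utf-8"))
```

A real client would POST this body over HTTP to the service endpoint and parse the XML response in the same way.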
.NET TerraService Architecture
[Figure: architecture diagram. Labels: Standard Browsers; Smart Clients (Windows Forms, .NET Framework); Map UI (Web Forms); Map Server Http Handler; TerraServer Web Service; ADO.NET / OLEDB; Existing DB Server with three SQL 2000 databases, 1.0 TB each, 705 M rows]
15
TerraServer Web Services
Terra-Tile-Service
• Query Gazetteer
• Retrieve imagery meta-data
• Retrieve imagery
• Simple Projection conversions
Clients can present TerraServer imagery in new ways.
Landmark-Service
• Geo-coded places, e.g. Schools, Golf Courses, Hospitals, etc.
• Place Polygons, e.g. Zip Codes, Cities, etc.
Allows “overlay” information for Terra-Tile-Service applications
16
Web Service Methods
• Place Search
– GetPlaceFacts
– GetPlaceList
– GetPlaceListInRect
– CountPlacesInRect
• Projection
– ConvertLonLatPtToUtmPt
– ConvertUtmPtToLonLatPt
– ConvertLonLatToNearestPlace
– GetTheme
– GetLatLonMetrics
• Tile
– GetAreaFromPt
– GetAreaFromRect
– GetAreaFromTileId
– GetTileMetaFromLonLatPt
– GetTileMetaFromTileId
– GetTile (Image)
• Landmark
– GetLandmarkTypes
– CountOfLandmarkPointsByRect
– GetLandmarkPointsByRect
– CountOfLandmarkShapesByRect
– GetLandmarkShapesByRect
http://terraservice.net
17
Soil Viewer Uses TerraService
18
Custom End Product
[Figure: Web Interpretation, XML, Soil Data Viewer, Soil Report, Map]
19
What Tom Showed You
• Converted a Web Server
– HTML get post
– Server returns pictures to people
• to a Web Service
– SOAP service
– returns XML self-describing data
– Application integrates data (Agriculture and Geo data)
20
Rosetta Stone
• Distributed computing + basic services → Dot Net
• Yellow Pages → UDDI: Universal Description, Discovery, and Integration
• ? → Schema, XLANG
• RPC (remote procedure call), CORBA, DCOM, RMI → SOAP: Simple Object Access Protocol
• IDL (interface definition language) → WSDL: Web Services Description Language
• XDR (eXternal Data Representation) → XML: eXtensible Markup Language
21
Sky Server
– Like TerraServer, but with pictures of the sky.
– But also LOTS of data on each object
• Luminosity (multi-spectra), morphology, spectrum
• So, it is a data mining application (a data mining web service)
• Cross-correlation is challenging because
– Multi-resolution
– Data is dirty/fuzzy (error bars, cosmic rays, airplanes…)
– Time varying
• 50 K Spectro Objects ~ 100 attributes + 30 lines
• 15M Photo Objects ~ 400 attributes
22
Astronomy Data
• In the “old days” astronomers took photos.
• Starting in the 1960’s they began to digitize.
• New instruments are digital (100s of GB/nite)
• Detectors are following Moore’s law.
• Data avalanche: doubles every year
[Figure, courtesy of Alex Szalay: log-scale plot from 0.1 to 1000 over 1970 to 2000, with curves labeled Glass and CCDs. Total area of 3m+ telescopes in the world in m2, total number of CCD pixels in megapixels, as a function of time. Growth over 25 years is a factor of 30 in glass, 3000 in pixels.]
23
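The growth factors in the figure caption translate directly into doubling times. A minimal Python sketch, assuming smooth exponential growth over the 25 years (an idealization; the function name is illustrative):

```python
import math


def doubling_time_years(growth_factor: float, span_years: float) -> float:
    """Doubling time implied by a total growth factor over a span of years,
    assuming steady exponential growth."""
    return span_years * math.log(2) / math.log(growth_factor)


if __name__ == "__main__":
    print(f"glass:  {doubling_time_years(30, 25):.1f} years per doubling")
    print(f"pixels: {doubling_time_years(3000, 25):.1f} years per doubling")
```

Under that assumption, telescope glass doubles roughly every 5 years while CCD pixel counts double roughly every 2.2 years, which is in line with the remark that detectors follow Moore's law.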
Astronomy Data
• Astronomers have a few Petabytes now.
– 1 pixel (byte) / sq arc second ~ 4TB
– Multi-spectral, temporal, … → 1PB
• They mine it looking for
– new (kinds of) objects or more of interesting ones (quasars),
– density variations in 400-D space
– correlations in 400-D space
• Data doubles every year.
• Data is public after a year.
• So, 50% of the data is public.
• Some have private access to 5% more data.
• So: 50% vs 55% access for everyone
24
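The 50% figure follows from the doubling assumption: if the total volume doubles every year, everything older than one year is exactly half of what exists today. A small Python sketch of that reasoning, assuming steady exponential growth (the function name is illustrative):

```python
def public_fraction(doubling_time_years: float, embargo_years: float) -> float:
    """Fraction of all data collected so far that is older than the embargo,
    assuming the total volume grows exponentially with the given doubling time."""
    return 2.0 ** (-embargo_years / doubling_time_years)


if __name__ == "__main__":
    # Doubling every year with a one-year proprietary period.
    print(public_fraction(doubling_time_years=1.0, embargo_years=1.0))
```

A slower doubling time or a shorter embargo raises the public share; for example, doubling every two years with the same one-year embargo gives about 71%.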
Astronomy Data
• But…..
• How do I get at that 50% of the data?
• Astronomers have a culture of publishing.
– FITS files and many tools. http://fits.gsfc.nasa.gov/fits_home.html
– Encouraged by NASA.
• Publishing data “details” is difficult.
Astronomers want to do it but it is VERY hard.
(What programs were used? What were the processing steps? How were errors treated?…)
25
Virtual Observatory
http://www.astro.caltech.edu/nvoconf/
http://www.voforum.org/
• Premise: Most data is (or could be) online
• So, the Internet is the world’s best telescope:
– It has data on every part of the sky
– In every measured spectral band: optical, x-ray, radio..
– As deep as the best instruments (1 year ago).
– It is up when you are up.
– The “seeing” is always great (no working at night, no clouds, no moons, no..).
– It’s a smart telescope: links objects and data to literature on them.
26
Virtual Observatory
The Age of Mega-Surveys
• Large number of new surveys
– multi-TB in size, 100 million objects or more
– individual archives planned, or under way
– Data publication an integral part of the survey
– Software bill a major cost in the survey
• Multi-wavelength view of the sky
– coverage in more than 13 wavelengths in 5 years
• Impressive early discoveries
– finding exotic objects by unusual colors
• L,T dwarfs, high-z quasars
– finding objects by time variability
• gravitational micro-lensing
Surveys: MACHO, 2MASS, DENIS, SDSS, PRIME, DPOSS, GSC-II, COBE, MAP, NVSS, FIRST, GALEX, ROSAT, OGLE ...
27
Slide courtesy of Alex Szalay, modified by Jim
Virtual Observatory
Federating the Archives
• The next generation mega-surveys are different
– top-down design
– large sky coverage
– sound statistical plans
– well controlled/documented data processing
• Each survey has a publication plan
• Data mining will lead to stunning new discoveries
• Federating these archives
→ Virtual Observatory
Slide courtesy of Alex Szalay
28
The Multiwavelength Crab Nebula
• Crab star, 1053 AD
• Nova first sighted 1054 A.D. by Chinese Astronomers
• Now: Crab Nebula in X-ray, optical, infrared, and radio
29
Slide courtesy of Robert Brunner @ CalTech.
Exploring Parameter Space
Given an arbitrary parameter space:
• Data Clusters
• Points between Data Clusters
• Isolated Data Clusters
• Isolated Data Groups
• Holes in Data Clusters
• Isolated Points
Nichol et al. 2001
30
Slide courtesy of Robert Brunner @ CalTech.
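One simple way to flag the last category, isolated points, is to look at each point's distance to its k-th nearest neighbour. The Python sketch below only illustrates that idea under assumed parameters (`k` and `threshold` are arbitrary choices). It is not the method of Nichol et al., and its brute-force loop would not scale to survey-sized catalogs.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def kth_neighbor_distance(points: List[Point], index: int, k: int) -> float:
    """Euclidean distance from points[index] to its k-th nearest other point."""
    distances = sorted(
        math.dist(points[index], other)
        for j, other in enumerate(points)
        if j != index
    )
    return distances[k - 1]


def isolated_points(points: List[Point], k: int = 2, threshold: float = 1.0) -> List[int]:
    """Indices of points whose k-th nearest neighbour is farther than threshold.

    Requires more than k points in total.
    """
    return [
        i for i in range(len(points))
        if kth_neighbor_distance(points, i, k) > threshold
    ]


if __name__ == "__main__":
    # A tight cluster near the origin plus one far-away point.
    sample = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
    print(isolated_points(sample, k=2, threshold=1.0))
```

The same function works unchanged in higher-dimensional parameter spaces, since `math.dist` accepts points of any equal dimension.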
Virtual Observatory and Education
• In the beginning science was empirical.
• Then theoretical branches evolved.
• Now, we have computational branches.
– The computational branch has been simulation
– It is becoming data analysis/visualization
• The Virtual Observatory can be used to
– Teach astronomy: make it interactive, demonstrate ideas and phenomena
– Teach computational science skills and the process of scientific discovery
31
Sloan Digital Sky Survey
http://sdss.org/
• A group of astronomers has been building a telescope for the last 12 years
(with 90M$ from Sloan Foundation, NSF, and a dozen universities)!
• Now data is arriving:
– 250GB/nite (20 nights per year).
– 100 M stars, 100 M galaxies, 1 M spectra.
• Public data at http://sdss.org/
– 5% of the survey, 600 sq degrees, 15 M objects, 60GB.
– This data includes most of the known high z quasars.
– It has a lot of science left in it but… that is just the start.
32
Demo of Sky Server
Alex built SkyServer (based on TerraServer design).
http://skyserver.sdss.org/
Demo: famous places, navigator, data, shopping cart, spectrum, SQL?
33
Virtual Observatory Challenges
• Size: multi-Petabyte
40,000 square degrees is 2 Trillion pixels
– One band (at 1 sq arcsec): 4 Terabytes
– Multi-wavelength: 10-100 Terabytes
– Time dimension: >> 10 Petabytes
– Need auto parallelism tools
• Unsolved Meta-Data problem
– Hard to publish data & programs
– Hard to find/understand data & programs
• Current tools inadequate
– new analysis & visualization tools
• Transition to the new astronomy
– Sociological issues
34
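The size estimates on this slide can be reproduced with a little arithmetic. The Python sketch below is one possible reading, not the authors' calculation: a pixel scale of 0.5 arcsec and 2 bytes per pixel are both assumptions (neither appears on the slide), chosen because together they reproduce the quoted 2 trillion pixels and roughly 4 Terabytes for one band.

```python
ARCSEC_PER_DEGREE = 3600


def pixel_count(area_sq_deg: float, pixel_scale_arcsec: float) -> float:
    """Number of square pixels of the given side length covering an area of sky."""
    area_sq_arcsec = area_sq_deg * ARCSEC_PER_DEGREE ** 2
    return area_sq_arcsec / pixel_scale_arcsec ** 2


if __name__ == "__main__":
    # Assumed values, not from the slide: 0.5 arcsec pixels, 2 bytes per pixel.
    pixels = pixel_count(area_sq_deg=40_000, pixel_scale_arcsec=0.5)
    one_band_bytes = pixels * 2
    print(f"pixels:   {pixels:.3e}")
    print(f"one band: {one_band_bytes / 1e12:.1f} TB")
```

With those assumptions, ten or so bands land in the quoted 10-100 Terabyte range, and repeated observations over time push the total toward Petabytes.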
3-steps to Virtual Observatory
• Get SDSS and Palomar online
– Alex Szalay, Jan Vandenberg, Ani Thakar….
– Roy Williams, Robert Brunner, Julian Bunn
• Do queries and crossID matches with CalTech and SDSS to expose
– Schema, Units,…
– Dataset problems
– the typical use scenarios.
• Implement WebServices at CalTech and SDSS
35
The Challenges
• How to federate the Archives to make a VO?
• The hope: XML is the answer.
• The reality: XML is syntax and tools:
FITS on XML will be good but….. Explaining the data will still be very difficult.
• Define Astronomy Objects and Methods.
– Based on UDDI, WSDL, SOAP.
– Each archive is a service
• http://TerraService.net/ shows the idea.
– Working with Caltech (Brunner, Williams, Djorgovski, Bunn)
– But, how does data mining work?
36
SkyServer as a WebService
WSDL+SOAP
just add details :-)
Archive ss = new VOService(SkyServer);
Attributes A[] = ss.GetObjects(ra,dec,radius);
…
?? What are the objects (attributes…)?
?? What are the methods (GetObjects()...)?
?? What query language? SQL, Xquery…?
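To see what "just add details" involves, here is a minimal in-memory Python sketch of the `GetObjects(ra, dec, radius)` idea as a cone search. It picks one possible answer to each open question, and all of it is assumed for illustration: the `SkyObject` fields, the `VOService` class, the use of degrees for every angle, and the sample catalog. A real archive would sit behind WSDL+SOAP and a database rather than a Python list.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SkyObject:
    """Hypothetical object record: position plus free-form attributes."""
    obj_id: int
    ra: float   # right ascension, degrees
    dec: float  # declination, degrees
    attributes: Dict[str, float] = field(default_factory=dict)


def angular_separation_deg(ra1: float, dec1: float, ra2: float, dec2: float) -> float:
    """Great-circle separation between two sky positions (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(min(1.0, h))))


class VOService:
    """Hypothetical archive wrapper exposing a single cone-search method."""

    def __init__(self, catalog: List[SkyObject]):
        self._catalog = list(catalog)

    def get_objects(self, ra: float, dec: float, radius: float) -> List[SkyObject]:
        """Return objects within `radius` degrees of the position (ra, dec)."""
        return [
            obj for obj in self._catalog
            if angular_separation_deg(ra, dec, obj.ra, obj.dec) <= radius
        ]


if __name__ == "__main__":
    catalog = [
        SkyObject(1, 180.00, 0.0, {"r_mag": 17.2}),
        SkyObject(2, 180.05, 0.0, {"r_mag": 19.8}),
        SkyObject(3, 190.00, 10.0, {"r_mag": 15.1}),
    ]
    service = VOService(catalog)
    print([obj.obj_id for obj in service.get_objects(180.0, 0.0, 0.1)])
```

Even this toy version has to commit to units, a coordinate system, and an attribute set, which is exactly where the "??" questions above come from.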
37
Summary
• All information at your fingertips.
• How do we publish information so that our agents can digest it?
• Example: TerraServer -> TerraService
• The Virtual Observatory Concept
– The Internet is the world's best telescope
• For astronomy
• For teaching astronomy and
• For teaching computational science
38