Data driven discovery:
opportunities and challenges
Tony Tyson
UC Davis
Next Generation Research & the University of California
Planning
 Watch the tail of the distribution
 Disruptive technologies drive user behavior
 Rare modalities will become commonplace
Future of computing at scale
The “Big” in Big Data
• What you do with it
• More challenging than volume or storage
• The big opportunity and challenge
Full end-to-end simulations
LSST Wide-Fast-Deep survey
A survey of 37 billion objects
in space and time
Each sky patch will be visited over 800 times:
30 trillion measurements
15 terabytes per night, for ten years
Complex high-dimensional data
“Genome project” approach to astronomy
 Avoid the cost of building a new facility and running a
new experiment every time we ask a new science question
 One exhaustive survey of the optical universe
 A 3.2-gigapixel image every 18 seconds for 10 years
 Calibrated, trusted data: over 500 PB
500 PB image collection + 15 PB catalog
 Many simulated universes
 Multiple 100–1000 PB databases
 Exascale data enables many “experiments”
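The headline numbers above are internally consistent; the short calculation below reproduces them (the figures come from the slides, the arithmetic is the only addition). Note that 15 TB/night over ten years gives roughly 55 PB of raw images; the 500 PB total includes calibrated and derived data products.

```python
# Back-of-the-envelope check of the survey numbers quoted above.

objects = 37e9            # objects surveyed
visits_per_patch = 800    # times each sky patch is revisited
measurements = objects * visits_per_patch
print(f"measurements: {measurements:.1e}")   # ~3.0e13, i.e. 30 trillion

tb_per_night = 15
nights = 10 * 365
raw_tb = tb_per_night * nights               # raw images only, ~55 PB;
print(f"raw images over ten years: {raw_tb / 1e3:.0f} PB")
```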
Automated discovery
Data exploration
This is also required for
automated data quality assessment
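One simple ingredient of automated data quality assessment is flagging survey metrics that deviate sharply from their running distribution. The sketch below is illustrative only; the metric, threshold, and method are assumptions, not taken from any actual LSST pipeline.

```python
import statistics

def flag_outliers(metrics, threshold=5.0):
    """Flag values far from the median in robust (MAD) units.

    A toy stand-in for automated data-quality screening: any nightly
    summary statistic (seeing, zero-point, sky brightness) could be
    checked this way before the data are released.
    """
    med = statistics.median(metrics)
    mad = statistics.median(abs(m - med) for m in metrics) or 1e-9
    # 1.4826 * MAD estimates the standard deviation for Gaussian data
    return [i for i, m in enumerate(metrics)
            if abs(m - med) / (1.4826 * mad) > threshold]

# e.g. hypothetical nightly seeing measurements (arcsec), one bad night
seeing = [0.7, 0.72, 0.69, 0.71, 2.4, 0.70, 0.68]
print(flag_outliers(seeing))  # -> [4]
```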
Comparing data with theory:
Cosmological Simulations
 Hard to analyze the data afterwards:
a database is needed
 Compare to real data
 The next generation of simulations, with
petabytes of output, is under way
(Exascale-Sky)
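As a toy illustration of comparing simulation output with real data, one can bin both and compute a goodness-of-fit statistic. The bins and counts below are invented for the example; a real comparison would run as queries against the simulation database.

```python
# Compare binned observed counts (say, galaxies per magnitude bin)
# with the corresponding counts from a simulated universe.

def chi_square(observed, expected):
    """Pearson chi-square between observed and expected bin counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed_counts = [102, 215, 389, 610]   # hypothetical survey counts
simulated_counts = [98, 230, 400, 580]   # hypothetical simulation output

stat = chi_square(observed_counts, simulated_counts)
print(f"chi-square = {stat:.2f} over {len(observed_counts)} bins")
```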
The Science of Big Data
 Data growing exponentially in all sciences
 Changes the nature of science
from hypothesis-driven to data-driven discovery
 Cuts across all sciences
 Convergence of physical and life sciences
through Big Data (statistics and computing)
 A new scientific revolution
The Crunch
 Science community starving for storage and I/O
• Put data-intensive computations as close to the
data as possible
 Current architectures cannot scale much further
• Need to get off the curve leading to power wall
 A new, Fourth Paradigm of science is emerging
• Many common patterns across all scientific
disciplines
5 Year Trend
 Sociology:
• Data collection in ever larger collaborations
• Analysis decoupled: smaller groups work off
archived data
• Multi-PB data sets
 Some form of a scalable Cloud solution inevitable
• Who will operate it, what business model, what scale?
• Science needs different tradeoffs than eCommerce
 Scientific data will never be fully co-located
• Geographic origin tied to experimental facilities
• Streaming algorithms, data pipes for distributed
workflows
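The streaming pattern above can be sketched with mergeable summaries: since the data will never be fully co-located, each site makes one pass over its own stream, and only small aggregate triples travel between facilities. This sketch uses the parallel (Chan) variant of Welford's one-pass mean/variance algorithm; the site data are invented for the example.

```python
def summarize(stream):
    """One pass over a local stream -> (count, mean, M2)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2

def merge(a, b):
    """Combine two site summaries without touching the raw data."""
    na, ma, m2a = a
    nb, mb, m2b = b
    n = na + nb
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2

site_a = summarize([1.0, 2.0, 3.0])   # streamed at facility A
site_b = summarize([4.0, 5.0])        # streamed at facility B
n, mean, m2 = merge(site_a, site_b)
print(n, mean, m2 / (n - 1))          # global count, mean, sample variance
```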
[Campus network diagram: a border router connects at 10 Gb to the Internet (CENIC) and to research fiber (1–10 Gb); a 10 Gb core feeds area switches; buildings and LANs connect at 100 Mb–1 Gb, with VLANs spanning the core.]
Infrastructure
• Big bandwidth between data centers
• Exascale computations at centers
• Adequate bandwidth to users for
visualization (streaming HD)
• Plus one more important ingredient…
What’s needed?
[Diagram after Jim Gray (not drawn to scale): scientists bring science data and questions; “miners” contribute data-mining algorithms; “plumbers” build the database that stores the data and executes queries; question-and-answer and visualization tools sit on top.]
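Gray's division of labor can be shown in miniature: the "plumbers" maintain a database that stores the data and executes queries, while scientists phrase their questions declaratively. The table, columns, and values below are invented for the illustration.

```python
import sqlite3

# A tiny in-memory stand-in for a survey catalog database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE objects (id INTEGER, ra REAL, dec REAL, mag REAL)")
con.executemany("INSERT INTO objects VALUES (?, ?, ?, ?)",
                [(1, 10.1, -5.2, 21.3),
                 (2, 10.4, -5.0, 18.7),
                 (3, 200.0, 45.0, 22.9)])

# "Find bright objects in this patch of sky" as a declarative query:
rows = con.execute(
    "SELECT id, mag FROM objects "
    "WHERE ra BETWEEN 10 AND 11 AND dec BETWEEN -6 AND -4 AND mag < 22 "
    "ORDER BY id"
).fetchall()
print(rows)   # -> [(1, 21.3), (2, 18.7)]
```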
We need to train a cadre of
scientists who are deep in both
CS/statistics and their
domain science
Example: Data Science Initiative @ UC Davis
Big Data @ UC Davis
[Diagram: domain sciences, infrastructure, and analytics connected through a cycle of discover, develop, training, and distribute.]