Data driven discovery: opportunities and challenges
Tony Tyson, UC Davis
Next Generation Research & the University of California Planning

Watch the tail of the distribution
• Disruptive technologies drive user behavior
• Rare modalities will become commonplace

Future of computing at scale

The "Big" in Big Data
• What you do with it
• More challenging than volume or storage
• The big opportunity and challenge

Full end-to-end simulations

LSST Wide-Fast-Deep survey
• A survey of 37 billion objects in space and time
• Each sky patch will be visited over 800 times: 30 trillion measurements
• 15 terabytes per night, for ten years
• Complex, high-dimensional data

"Genome project" approach to astronomy
• Avoid the cost of building a new facility and running a new experiment every time we ask a new science question
• One exhaustive survey of the optical universe
• A 3.2-gigapixel image every 18 seconds for 10 years
• Calibrated, trusted data: over 500 PB

500 PB image collection + 15 PB catalog
• Many simulated universes
• Multiple 100–1000 PB databases
• Exascale data enables many "experiments"

Automated discovery
• Data exploration
• Also required for automated Data Quality Assessment

Comparing data with theory: cosmological simulations
• Hard to analyze the data afterwards -> need a database
• Compare to real data
• Next-generation simulations with petabytes of output are under way (Exascale-Sky)

The Science of Big Data
• Data is growing exponentially, in all sciences
• Changes the nature of science from hypothesis-driven to data-driven discovery
• Cuts across all sciences
• Convergence of the physical and life sciences through Big Data (statistics and computing)
• A new scientific revolution

The Crunch
• The science community is starving for storage and I/O
  - Put data-intensive computations as close to the data as possible
• Current architectures cannot scale much further
  - Need to get off the curve leading to the power wall
• A new, Fourth Paradigm of science is emerging
  - Many common patterns across all scientific disciplines

5 Year Trend
• Sociology:
  - Data collection in ever larger collaborations
  - Analysis decoupled, done off archived data by smaller groups
  - Multi-PB data sets
• Some form of scalable Cloud solution is inevitable
  - Who will operate it, at what scale, and under what business model?
  - Science needs different tradeoffs than e-commerce
• Scientific data will never be fully co-located
  - Geographic origin is tied to experimental facilities
  - Streaming algorithms and data pipes for distributed workflows

[Diagram: campus research network. Building LANs (100 Mb – 1 Gb) connect through border and area routers (10 Gb) to the core (10 – 100 Gb), with 1 – 10 Gb research fiber to the Internet (CENIC).]

Infrastructure
• Big bandwidth between data centers
• Exascale computations at the centers
• Adequate bandwidth to users for visualization (streaming HD)
• Plus one more important ingredient…

What's needed? (not drawn to scale)
[Diagram, after Jim Gray: scientists bring science data and questions; miners supply data-mining algorithms; "plumbers" run the database that stores the data and executes queries; question-and-answer and visualization tools connect them.]

We need to train a cadre of scientists who are deep both in CS/Statistics and their domain science.

Example: Data Science Initiative @ UC Davis
[Diagram: Big Data @ UC Davis spans domain sciences, infrastructure, and analytics, with activities to discover, develop, distribute, and train.]
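The LSST figures quoted above can be roughly cross-checked with back-of-the-envelope arithmetic. A minimal sketch, assuming values the talk does not state (2 bytes per raw pixel, ~10 observing hours per night, ~300 usable nights per year):

```python
# Rough consistency check of the LSST data rates quoted in the slides.
# ASSUMPTIONS not stated in the talk: 2 bytes/pixel (16-bit raw),
# ~10 observing hours per night, ~300 usable nights per year.
PIXELS = 3.2e9           # 3.2-gigapixel camera (from the slides)
BYTES_PER_PIXEL = 2      # assumed 16-bit raw pixels
CADENCE_S = 18           # one image every 18 seconds (from the slides)
NIGHT_S = 10 * 3600      # assumed observing seconds per night
NIGHTS_PER_YEAR = 300    # assumed usable nights per year

bytes_per_image = PIXELS * BYTES_PER_PIXEL        # ~6.4 GB per image
images_per_night = NIGHT_S / CADENCE_S            # 2000 images
tb_per_night = bytes_per_image * images_per_night / 1e12
pb_raw_10yr = tb_per_night * NIGHTS_PER_YEAR * 10 / 1000

print(f"{tb_per_night:.1f} TB per night")     # prints "12.8 TB per night"
print(f"{pb_raw_10yr:.0f} PB raw in 10 years")  # prints "38 PB raw in 10 years"
```

The nightly rate lands near the quoted "15 terabytes per night"; the raw 10-year total comes to tens of petabytes, so the "over 500 PB" figure presumably also counts calibrated and reprocessed image products, not raw pixels alone.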
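The "streaming algorithms" point in the 5 Year Trend slide is about one-pass computations that run next to the data at each facility and ship only small summaries, never the petabytes themselves. A minimal illustration (Welford's one-pass mean/variance; the input values are made up for the example):

```python
# One-pass "streaming" statistics in the style the slides call for:
# Welford's algorithm computes mean and variance over a stream without
# ever holding the data in memory, so it can run where the data lives
# and return only a tiny summary.
def welford(stream):
    """Return (count, mean, sample variance) from a single pass."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n           # running mean update
        m2 += delta * (x - mean)    # running sum of squared deviations
    var = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, var

# Example on made-up measurements; works on any iterable, including
# one that reads records lazily from disk or a network pipe.
n, mean, var = welford([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(n, mean, var)  # mean is 5.0
```

Summaries like (n, mean, m2) from different sites can also be merged exactly, which is what makes this style a good fit for data that, as the slide says, will never be fully co-located.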