Data driven discovery: opportunities and challenges
Tony Tyson, UC Davis
Next Generation Research & the University of California Planning

Watch the tail of the distribution
• Disruptive technologies drive user behavior
• Rare modalities will become commonplace

Future of computing at scale

The "Big" in Big Data
• What you do with it
• More challenging than volume or storage
• The big opportunity and challenge

Full end-to-end simulations

LSST Wide-Fast-Deep survey
• A survey of 37 billion objects in space and time
• Each sky patch will be visited over 800 times: 30 trillion measurements
• 15 terabytes per night, for ten years
• Complex, high-dimensional data

"Genome project" approach to astronomy
• Avoid the cost of building a new facility and running a new experiment every time we ask a new science question
• One exhaustive survey of the optical universe
• A 3.2-gigapixel image every 18 seconds for 10 years
• Calibrated, trusted data: over 500 PB

500 PB image collection + 15 PB catalog
• Many simulated universes
• Multiple 100–1000 PB databases
• Exascale data enables many "experiments"

Automated discovery
• Data exploration
• Also required for automated Data Quality Assessment

Comparing data with theory: cosmological simulations
• Hard to analyze the data afterwards -> need a database
• Compare to real data
• Next-generation simulations with petabytes of output are under way (Exascale-Sky)

The Science of Big Data
• Data is growing exponentially, in all sciences
• Changes the nature of science from hypothesis-driven to data-driven discovery
• Cuts across all sciences
• Convergence of the physical and life sciences through Big Data (statistics and computing)
• A new scientific revolution

The Crunch
• The science community is starving for storage and I/O
  - Put data-intensive computations as close to the data as possible
• Current architectures cannot scale much further
  - Need to get off the curve leading to the power wall
• A new, Fourth Paradigm of science is emerging
  - Many common patterns across all scientific disciplines

5 Year Trend
• Sociology:
  - Data collection in ever larger collaborations
  - Analysis decoupled, done off archived data by smaller groups
  - Multi-PB data sets
• Some form of scalable Cloud solution is inevitable
  - Who will operate it, at what scale, and under what business model?
  - Science needs different tradeoffs than e-commerce
• Scientific data will never be fully co-located
  - Geographic origin is tied to experimental facilities
  - Streaming algorithms and data pipes for distributed workflows

[Diagram: campus research network. Building LANs (100 Mb – 1 Gb) connect through border and area routers (10 Gb) to the core (10 – 100 Gb), with 1 – 10 Gb research fiber to the Internet (CENIC).]

Infrastructure
• Big bandwidth between data centers
• Exascale computations at the centers
• Adequate bandwidth to users for visualization (streaming HD)
• Plus one more important ingredient…

What's needed? (not drawn to scale)
[Diagram, after Jim Gray: scientists bring science data and questions; miners supply data-mining algorithms; "plumbers" run the database that stores the data and executes queries; question-and-answer and visualization tools connect them.]

We need to train a cadre of scientists who are deep both in CS/Statistics and their domain science.

Example: Data Science Initiative @ UC Davis
[Diagram: Big Data @ UC Davis spans domain sciences, infrastructure, and analytics, with activities to discover, develop, distribute, and train.]
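The LSST figures quoted above can be roughly cross-checked with back-of-the-envelope arithmetic. A minimal sketch, assuming values the talk does not state (2 bytes per raw pixel, ~10 observing hours per night, ~300 usable nights per year):

```python
# Rough consistency check of the LSST data rates quoted in the slides.
# ASSUMPTIONS not stated in the talk: 2 bytes/pixel (16-bit raw),
# ~10 observing hours per night, ~300 usable nights per year.
PIXELS = 3.2e9           # 3.2-gigapixel camera (from the slides)
BYTES_PER_PIXEL = 2      # assumed 16-bit raw pixels
CADENCE_S = 18           # one image every 18 seconds (from the slides)
NIGHT_S = 10 * 3600      # assumed observing seconds per night
NIGHTS_PER_YEAR = 300    # assumed usable nights per year

bytes_per_image = PIXELS * BYTES_PER_PIXEL        # ~6.4 GB per image
images_per_night = NIGHT_S / CADENCE_S            # 2000 images
tb_per_night = bytes_per_image * images_per_night / 1e12
pb_raw_10yr = tb_per_night * NIGHTS_PER_YEAR * 10 / 1000

print(f"{tb_per_night:.1f} TB per night")     # prints "12.8 TB per night"
print(f"{pb_raw_10yr:.0f} PB raw in 10 years")  # prints "38 PB raw in 10 years"
```

The nightly rate lands near the quoted "15 terabytes per night"; the raw 10-year total comes to tens of petabytes, so the "over 500 PB" figure presumably also counts calibrated and reprocessed image products, not raw pixels alone.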
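The "streaming algorithms" point in the 5 Year Trend slide is about one-pass computations that run next to the data at each facility and ship only small summaries, never the petabytes themselves. A minimal illustration (Welford's one-pass mean/variance; the input values are made up for the example):

```python
# One-pass "streaming" statistics in the style the slides call for:
# Welford's algorithm computes mean and variance over a stream without
# ever holding the data in memory, so it can run where the data lives
# and return only a tiny summary.
def welford(stream):
    """Return (count, mean, sample variance) from a single pass."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n           # running mean update
        m2 += delta * (x - mean)    # running sum of squared deviations
    var = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, var

# Example on made-up measurements; works on any iterable, including
# one that reads records lazily from disk or a network pipe.
n, mean, var = welford([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(n, mean, var)  # mean is 5.0
```

Summaries like (n, mean, m2) from different sites can also be merged exactly, which is what makes this style a good fit for data that, as the slide says, will never be fully co-located.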