U.S. Department of Energy
Office of Science

U.S. Department of Energy’s Office of Science
New Opportunities for Data and Information Management: Finding the Dots, Connecting the Dots, Understanding the Dots

2006 AAAS Annual Meeting
February 19, 2006
St. Louis, MO

Raymond L. Orbach
Director, Office of Science
U.S. Department of Energy

DOE Office of Science
• Supports basic research that underpins DOE missions
• Constructs and operates large scientific facilities for the U.S. scientific community
    • Accelerators, synchrotron light sources, neutron sources
• Seven Program Offices
    • Advanced Scientific Computing Research (ASCR)
    • Basic Energy Sciences (BES)
    • Biological and Environmental Research (BER)
    • Fusion Energy Sciences (FES)
    • High Energy Physics (HEP)
    • Nuclear Physics (NP)
    • Workforce Development (WD)

The FY 2007 President’s Request for science funding is a 14.1% increase and sets the Office of Science on a path to doubling by 2016

[Chart: Office of Science Budget Doubling from FY 2006 to FY 2016. Vertical axis: Budget Authority, As Spent Dollars in Billions (0 to 7). Horizontal axis: Fiscal Year (1995 to 2016). Reference line: FY 1995 level plus inflation. Annotation: SC budget doubles to $7.2B in FY 2016 from $3.6B in FY 2006.]

An historic opportunity for our country – a renaissance for U.S. science and continued global competitiveness.

Data Storage Funding

Data Storage Funding Including R&D (ASCR+HEP+NP)
    FY 2006    FY 2007
    $34M       $37.6M

Current experiment and simulation data storage capacity for the Office of Science is about 100 petabytes and is expected to more than double by FY 2009.
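The doubling figures quoted above (budget, storage funding, storage capacity) imply rather different annual growth rates. The following is a minimal arithmetic sketch, not part of the original talk. The helper name `annual_growth_rate` is hypothetical, and the capacity calculation assumes the "about 100 petabytes" figure refers to FY 2006, so that doubling by FY 2009 spans three years.

```python
def annual_growth_rate(start, end, years):
    """Constant compound annual growth rate taking `start` to `end` in `years` years."""
    return (end / start) ** (1.0 / years) - 1.0


# Budget: $3.6B in FY 2006 to $7.2B in FY 2016 (figures from the slide).
budget_rate = annual_growth_rate(3.6, 7.2, 10)

# Data storage funding: $34M in FY 2006 to $37.6M in FY 2007 (figures from the slide).
funding_rate = annual_growth_rate(34.0, 37.6, 1)

# Storage capacity: "more than double by FY 2009".
# ASSUMPTION: the ~100 PB baseline is FY 2006, so doubling takes 3 years.
# Because the slide says "more than double", this is a lower bound.
capacity_rate_min = annual_growth_rate(100.0, 200.0, 3)

print(f"Budget, average annual growth:        {budget_rate:.1%}")
print(f"Storage funding, FY 2006 to FY 2007:  {funding_rate:.1%}")
print(f"Storage capacity, minimum annual:     {capacity_rate_min:.1%}")
```

Under these assumptions the budget doubling corresponds to roughly 7.2% average growth per year, storage funding grows by about 10.6% in one year, and storage capacity must grow by at least about 26% per year.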
Data Sources

Three Pillars of Scientific Discovery: Experiment, Theory, and Simulation

Two different kinds of very large data sets:
• Experimental data
    • High energy physics, environment and climate observation data, biological mass-spectrometry
    • Data needs to be retained for long term
• Simulation data
    • Astrophysics, climate, fusion, catalysis, QCD
    • From computationally expensive large simulations
    • Post processing of data using quantum Monte Carlo, analytics and graphical analysis, perturbation theory, and molecular dynamics

PetaCache Project
HEP Data Analysis: Beyond Data Mining

BaBar Data Challenge:
• 2 petabytes stored, 10-100 terabytes intense access/inquiry
• 1–15 kilobytes (small) data objects
• Hundreds of users, thousands of batch jobs

PetaCache project (SLAC: David Leith and Richard Mount)
Revolutionize access to huge datasets:
• First innovative solid-state disk as intermediate storage for HPC data searches
• 100 times smaller latency than disk
• At least 500 times faster throughput than disk
• Builds Feature Database structures to accelerate the retrieval of data

Expected Impact
BaBar: From analyst’s idea to seeing the result – nine months becomes one day.

Connecting the Dots in Science
ORNL: Nagiza Samatova

Finding the Dots
Sheer Volume of Data
• Climate
    • Now: 20-40 Terabytes/year
    • 5 years: 5-10 Petabytes/year
• Fusion
    • Now: 100 Megabytes/15 min
    • 5 years: 1000 Megabytes/2 min

Understanding the Dots
Advanced Mathematics and Algorithms
• Huge dimensional space
• Combinatorial challenge
• Complicated by noisy data
• Requires high-performance computers

Providing Predictive Understanding
• Produce hydrogen-based energy
• Stabilize carbon dioxide
• Clean and dispose toxic waste
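The "Sheer Volume of Data" figures above mix units and time windows, so the size of the projected increase is easier to see once they are put on a common footing. The sketch below is an illustration rather than part of the original talk. It assumes decimal prefixes (1 petabyte = 1000 terabytes), and the helper name `rate_mb_per_s` is hypothetical.

```python
def rate_mb_per_s(megabytes, minutes):
    """Average data rate in megabytes per second for `megabytes` produced every `minutes`."""
    return megabytes / (minutes * 60.0)


# Fusion: 100 MB per 15 min now, 1000 MB per 2 min in five years (from the slide).
fusion_now = rate_mb_per_s(100, 15)
fusion_future = rate_mb_per_s(1000, 2)
fusion_factor = fusion_future / fusion_now

# Climate: 20-40 TB/year now, 5-10 PB/year in five years (from the slide).
TB_PER_PB = 1000  # ASSUMPTION: decimal prefixes
climate_now_tb = (20, 40)
climate_future_tb = (5 * TB_PER_PB, 10 * TB_PER_PB)

# Smallest and largest growth factors consistent with the quoted ranges.
climate_factor_min = climate_future_tb[0] / climate_now_tb[1]
climate_factor_max = climate_future_tb[1] / climate_now_tb[0]

print(f"Fusion:  {fusion_now:.2f} MB/s now, {fusion_future:.2f} MB/s in five years "
      f"(about {fusion_factor:.0f}x)")
print(f"Climate: between {climate_factor_min:.0f}x and {climate_factor_max:.0f}x "
      f"growth in yearly volume")
```

Under these assumptions the fusion data rate grows by a factor of about 75, and the yearly climate data volume by a factor between 125 and 500.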
Connecting the Dots in Combustion, Fusion, and Structural Biology

Finding the DOTS – Large-scale simulations in support of combustion grand challenges are generating terabytes of data per simulation. Of particular interest in these simulations are transient events such as ignition, extinction, and re-ignition, which are not well understood. Similar problems also exist in high-resolution, ultra-high-speed images of edge turbulence in the National Spherical Torus Experiment at PPPL. In structural biology, the interaction between two proteins forming a molecular machine can be described as the set of contacting amino acid residues. The set of features is very large, and is generated by the combinations of different chemical identities, orientation patterns, and spatial arrangements of the residues.

Connecting the DOTS – In combustion, it is unclear what features in the simulation data and their nonlinear dynamic effects could be used to characterize such events. Simulations need to be carried out to explore different possibilities. In fusion, extracting features that could characterize the plasma blobs is relevant to the analysis of Poincaré sections for the particle orbits. For the two interacting proteins, the number of distinctly different variants of subunits forming the molecular machine runs to millions or billions, even after applying sophisticated filtering algorithms. The correlations between the subunits establish the connection between the dots.

Understanding the DOTS –
• A complete understanding of the correlations and chemical reactions inherent in the turbulent flow during combustion is still beyond our reach.
• In fusion, each particle orbit in a Poincaré section is generated when a particle intersects a plane perpendicular to the magnetic axis. Identifying and classifying the orbits is of significant importance in understanding and stabilizing the plasma.
• Multiple connectable groups of amino acids can be constructed for the interacting proteins, with probabilities giving the likelihood for each variant. Finding the "optimal" solution is important. For example, high-scoring interfaces may represent a dynamic picture of the protein machine workings, or additional "ports" suitable for not-yet-discovered protein subunits and other co-factors.

Office of Science Decadal Data Challenge

Mathematical and Computational Challenges and Needs
“Curse of Dimensionality” – Interpretation of high-dimensional data

Challenges:
• Going beyond classical Bayesian theory of probabilistic quantification to address long-range and non-linear correlations between features in noisy data
• Mathematical description of complex geometric shapes in their spatial and temporal dimensions
• Enumeration and optimization of multivariate functions on complex graphs that describe relationships between identified features
• Low-rank approximations and generalized separation of variables to reduce the dimension without destroying information
• New harmonic and discrete mathematics and new algorithms for fast extraction of correlations and patterns

Office of Science Response to the Data Challenge

The Office of Science will initiate a long-term research program to address the “Curse of Dimensionality.” Some of the elements of the research program are:

Bayesian Theory – New research to develop efficient ways for dealing with both local and long-range correlations between features, including Bayesian estimators to correctly estimate the simultaneous appearance of “striking” features at precisely defined locations, and mechanisms to incorporate partial analytical models to supplement missing statistics.
Mathematical description of complex geometric shapes – New research on the stochastic theory of shapes to classify geometric shapes in terms of stochastic models, which are essential for the rigorous comparisons needed for pattern discovery. We intend to develop high-performance, scalable algorithms for querying, searching, tracking, and reconstruction of high-dimensional shapes from incomplete information.

Enumeration and optimization of multivariate functions on complex graphs – New research to develop efficient methodologies for the hierarchical enumeration of composite objects, including analytical methods for dynamically constraining the search space. We intend to develop optimization methods to deal with novel spaces formed by graphs of identified features (dots) and their relationships (connections). Such spaces typically have hundreds of variables and dimensions. Additionally, we intend to develop computational libraries to efficiently handle an enormous number of possible variants through construction of subgraph indexing schemes and efficient lookup methods.
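To give a concrete feel for what a "subgraph indexing scheme" with "efficient lookup" might look like, here is a minimal sketch. It is not the method proposed in the talk: it uses one common and simple idea, an inverted index from edge label pairs to the graphs that contain them. Every name here (`EdgeLabelIndex`, `edge_keys`, the graph ids and residue labels) is hypothetical. The lookup is only a filter: it returns graphs that could contain the query pattern, and a full subgraph-matching step would still be needed to confirm each candidate.

```python
from collections import defaultdict


def edge_keys(labels, edges):
    """Return the set of unordered label pairs, one per edge of the graph."""
    return {tuple(sorted((labels[u], labels[v]))) for u, v in edges}


class EdgeLabelIndex:
    """Inverted index: (label, label) edge key -> ids of graphs containing such an edge."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._ids = set()

    def add(self, graph_id, labels, edges):
        """Index a graph given its node labels (dict) and its edges (pairs of node ids)."""
        self._ids.add(graph_id)
        for key in edge_keys(labels, edges):
            self._postings[key].add(graph_id)

    def candidates(self, labels, edges):
        """Ids of graphs containing every edge key of the query (a necessary condition only)."""
        result = set(self._ids)
        for key in edge_keys(labels, edges):
            result &= self._postings.get(key, set())
        return result


# Hypothetical toy data: nodes are features ("dots"), edges are relationships.
index = EdgeLabelIndex()
index.add("A", {0: "LYS", 1: "ASP", 2: "GLY"}, [(0, 1), (1, 2)])
index.add("B", {0: "LYS", 1: "ASP"}, [(0, 1)])
index.add("C", {0: "GLY", 1: "GLY"}, [(0, 1)])

one_edge = index.candidates({0: "ASP", 1: "LYS"}, [(0, 1)])
two_edges = index.candidates({0: "LYS", 1: "ASP", 2: "GLY"}, [(0, 1), (1, 2)])
print(sorted(one_edge))   # graphs with an ASP-LYS edge
print(sorted(two_edges))  # graphs with both an ASP-LYS and an ASP-GLY edge
```

The design choice is to make lookup cost depend on the number of edge keys in the query rather than on the number of stored variants. With millions or billions of variants, a filter of this kind would only be a first stage before more expensive structural comparison.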