eScience and Grid
Tools and techniques for the next generation scientist
Professor Brian Vinter
Head of the Copenhagen eScience Center

eScience
«The next 10 to 20 years will see computational science firmly embedded in the fabric of science – the most profound development in the scientific method in over three centuries.»
US Department of Energy, 2003

Mega-Science
The next scientific period will be dominated by Mega-Science projects
• 10^4 researchers on a single project
• Extreme data production
• Highly integrated collaboration between different groups of scientists
Examples
• CERN LHC
• ALMA
• Mars project

Data Production
1997: Total data worldwide approx. 12 exabytes (incl. documents, film, TV, pictures, …) (1)
1999: 2-3 exabytes of data produced (2)
2002: Approx. 5 exabytes of data produced (2)
1 Exabyte = 1000 Petabytes
1 Petabyte = 1000 Terabytes
1 Terabyte = 1000 Gigabytes
1 Gigabyte = 1000 Megabytes
Global data availability doubles every 4-5 years.
1) http://www.lesk.com/mlesk/ksg97/ksg.html
2) http://www.sims.berkeley.edu/research/projects/how-much-info-2003/

eScience Components
Modeling and simulation
Data acquisition and handling
Visualization
HPC and Grid

Why is it getting more difficult?
[Figure: simulated systems of 54, 442 and 1372 molecules]

System sizes and time scales
[Figure: processes (photoionization, proton transfer, protein folding) placed by time scale, from 10^-15 to 10^3 seconds, against system size, from 1 to 10^6 atoms; the systems range from H2 and O2 through single peptides, biomimetic compounds, proteins and biopolymers to ribosomes; one region is marked "This seminar"]

Nano-modeling
Extremely CPU- and data-intensive algorithms
Complex structure calculations
Multiple days of execution even on a supercomputer
Runs on both PCs and supercomputers

eScience and Bio/Med
We expect very good results from eScience in biology and medicine
The foremost advantages will come from introducing a mathematical, causal understanding of biological systems
• Bioinformatics is already doing this
An emerging field: Systems Biology
• Systems Medicine is also starting internationally

Calculations in treatment
Computational methods are already important in medical planning
• Radiation planning
• Bypass flow modeling
• Robotic surgery
• …

Personalized medicine
Every human is unique
Also at the genetic level
In our genome, which is written with the alphabet ACGT, we have a number of micro mutations called single nucleotide polymorphisms (SNPs)
These SNPs are often without consequence but
• Some make us sick
• Some are indicators of a faulty gene
• Others influence how we respond to a drug
The last complication makes it very hard to make drugs for the general population
We want to move from commodity medicine to custom tailored drugs

An example
Approx. 60% of today's medicines are metabolized by cytochrome P450 enzymes
• Some people have highly efficient P450 while others have very slow and inefficient P450
• Knowledge of a patient's P450 level will allow us to dose medicine for the individual much more efficiently
This is already in early use

And this is eScience how?
Developing a drug is not a linear process
The human genome is written with billions of letters
• Any person has millions of SNP mutations
• Finding the SNP that has an effect is a highly complex computational task

eScience and geology
Geology and hydrology have also been using computational methods for a long time
There are very interesting aspects in combining different methods
• e.g. including biological systems in the models
• Inverse mapping of seismic data
It turns out that we use the same techniques in medicine
• And soon in industry

Grid
Minimum intrusion Grid
[Figure: users and resources connected through the Grid]

Processing plants
Like the power grid, the computing Grid has many types of power producers
• High yield power plants (fossil fuel, nuclear, …): supercomputers and large farms
• Low yield producers (windmills, etc.): individual PCs and games consoles
• Very low yield producers (solar panels, etc.): Web browsers

One Click
Interactive Applications

VGrids
Best thing since sliced bread
VGrids are Virtual Organizations in MiG
They are a dead easy way to create collaborations
• Share files
• Share resources
• Private entry page
• Public Web page

Portals
VOs can generate their own private entry pages, including application portals

Files in VGrids
A user keeps her personal home directory regardless of which VGrid she works in
But VGrids have a common directory that only members of the VGrid are allowed to access
• These are represented as directories in the user's home directory
VGrid owners can create sub-VGrids

Examples
eScience on Grid

GeneRecon
GeneRecon seeks to identify genetic factors behind hereditary diseases
The overall idea is to compare two sets of genomes
• One where the disease is observed
• One where the disease is not observed
Approx. 1000 individuals in each set
GeneRecon is developed at the Bioinformatics Research Center, Århus University

GeneRecon
The algorithm is a Markov-chain Monte Carlo method
A test run consists of approx.
30,000 individual tests
• One test runs from 1 to 10 days on a PC
• In total no less than 82 CPU years
MiG hosted the execution on Grid and brought the total run time below a month

Statistics
1315 jobs were submitted to Grid at the same time
0 jobs were lost
Total time
• First result: 2:04:44
• Last result: 28 days, 5:42:54
[Table: min, avg and max of execution time and queue time; the extracted values 678 101, 2.08, 0.01, 46 392 and 55 505 can no longer be matched to their cells]

Groundwater modeling on Funen
Calibration of the Assens model:
1 model evaluation = 30 min
920 model evaluations = 19 days
[Figure: aggregated objective function against number of model evaluations]

Days to hours
[Figure: objective function against time in hours for AUTOCAL (1 PC) and AUTOCAL OfficeGRID (10 PCs); diagram of one master with four clients]

Drug Design
Molecular docking is a time-consuming calculation, which this project performs in two steps
The first step is a coarse calculation that can eliminate molecules that won't dock
• This process can run on PCs and PS3s; a lot of work is being done towards efficient utilization of the CELL CPU for molecular docking
The molecules that survive the first step are then modeled more precisely at the quantum level on classic supercomputers and clusters

SeGrid
Still a proposal
The idea is to share sensitive data through Grid and use the Grid technology to manage access control and automatic anonymization

More information
www.eScience.dk
Portal for KU's eScience activities
www.migrid.org
Portal for the Minimum intrusion Grid
www.rcuk.ac.uk/escience/
The very ambitious UK eScience program
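
Worked check of the quoted figures
The workload numbers given for GeneRecon and for the Assens calibration can be reproduced with a few lines of arithmetic. The sketch below is only a sanity check, not part of either project: the function names are made up for illustration, a year is assumed to be 365 days, and the 10-PC figure assumes perfect parallel efficiency, which the slides do not claim.

```python
from datetime import timedelta

DAYS_PER_YEAR = 365  # assumption: calendar year, no leap days


def cpu_years(tests: int, days_per_test: float) -> float:
    """Total sequential work in CPU years (hypothetical helper)."""
    return tests * days_per_test / DAYS_PER_YEAR


def calibration_time(evaluations: int, minutes_each: int) -> timedelta:
    """Wall-clock time when evaluations run one after another."""
    return timedelta(minutes=evaluations * minutes_each)


# GeneRecon: approx. 30,000 tests, each taking 1 to 10 days on a PC.
low = cpu_years(30_000, 1)    # lower bound, matches "no less than 82 CPU years"
high = cpu_years(30_000, 10)  # upper bound if every test took 10 days

# Assens model: 920 evaluations of 30 minutes each on a single PC.
assens = calibration_time(920, 30)

# Idealized bound for 10 PCs (assumption: perfect speedup, no overhead).
ideal_10_pcs = assens / 10

print(f"GeneRecon: {low:.1f} to {high:.1f} CPU years")
print(f"Assens calibration on 1 PC: {assens}")
print(f"Ideal time on 10 PCs: {ideal_10_pcs}")
```

The lower bound of the GeneRecon range is the "82 CPU years" on the slide, and the single-PC calibration time rounds to the quoted 19 days. The 10-PC value is only a best case to compare against the measured curves.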