Discovery Net

Yike Guo, John Darlington (Dept. of Computing), John Hassard (Depts. of Physics and Bioengineering), Bob Spence (Dept. of Electrical Engineering), Tony Cass (Department of Biochemistry), Sevket Durucan (T. H. Huxley School of Environment)
Imperial College London

AIM
To design, develop and implement an infrastructure to support real-time processing, interaction, integration, visualisation and mining of massive amounts of time-critical data generated by high throughput devices.

The Consortium
Industry Connection: 4 spin-off companies + related companies (AstraZeneca, Pfizer, GSK, Cisco, IBM, HP, Fujitsu, Gene Logic, Applera, Evotec, International Power, Hydro Quebec, BP, British Energy, ….)

Industrial Contribution
• Hardware: sensors (photodiode arrays, hybrid photodiodes, PMTs), systems (optics, mechanical systems, DSPs, FPGAs)
• Software: analysis packages, algorithms, data warehousing and mining systems
• Intellectual Property: access to IP portfolio suite at no cost
• Data: raw and processed data from biotechnology, pharmacogenomic, remote sensing (GUSTO installations, satellite data from geo-hazard programmes) and renewable energy data (from our own remote tidal power systems)

High Throughput Sensing Characteristics
Different devices but same computational characteristics:
• Data intensive & data dispersive
• Large scale, heterogeneous, distributed data
• Real-time data manipulation
Need to:
• calibrate
• integrate
• analyse
[Diagram labels: Distributed Reference DBs, Distributed Users, Distributed Devices, Distributed warehousing, Collaborative applications]

• Discovery issues: Distributed Knowledge Discovery, Management, Incremental, Interactive Discovery & Collaborative Discovery
• Information issues: annotations, semantics, reference, integrated view of data
• Data issues: different measurements for same object: data registration, normalisation, calibration & quality control
• GRID issues: wide area, high volume, scalability (data, users), collaboration

DNet Architecture
High Throughput Sensing (HTS) Applications
• Large-scale Dynamic Real-time Decision support
• Large-scale Dynamic System Knowledge Discovery
Knowledge Discovery
• Grid-based Data Mining, Collaborative Visualisation
Information Structuring
• Information Integration & Composition, Semantics & Domain-based Ontologies, Sharing
Distributed Data Engineering
• Data Registration, Data Normalisation, Data Quality
Utilising Grid Infrastructure for HT Computing
[Diagram annotations, in slide order: Grid Basic Infrastructure; Globus/Condor/SRB; Based on Globus & ORB Infrastructure; High Throughput Computing Services; Based on Kensington Discovery Platform; Grid-based Knowledge Discovery]

Testbed Applications
HTS Applications: Large-scale Dynamic Real-time Decision support; Large-scale Dynamic System Knowledge Discovery

Renewable energy Applications
• Tidal Energy; connections to other renewable initiatives (solar, biomass, fuel cells), & to CHP and baseload stations
• Throughput (GB/s): 1-10
• Size (petabytes): 1-10
• Node number: >20000
• Operations: Structuring, Mining, Optimisation, RT decisions

Remote Sensing Applications
• Air Sensing, GUSTO; Geological, geohazard analysis
• Throughput (GB/s): 1-100
• Size (petabytes): 10-100
• Node number: >50000
• Operations: Image Registration, Visualisation, Predictive Modelling, RT decisions

Bio Chip Applications
• Protein-folding chips: SNP chips, Diff. Gene chips using LFII; Protein-based fluorescent micro arrays
• Throughput (GB/s): 1-1000
• Size (petabytes): 10-1000
• Node number: >10000
• Operations: Data Quality, Visualisation, Structuring, Clustering, Distributed Dynamic Knowledge Management

Large-scale urban air sensing applications
Each GUSTO air pollution system produces 1 kbit per second, or 10^10 bits per year. We expect to increase the number (from the present 2 systems) to over 20,000 over the next 3 years, to reach a total of 0.6 petabytes of data within the 3-year ramp-up.
[Figure: NO simulant, 6.7.2001, showing the GUSTO stations and a "You are here" marker]
The useful information comes from time-resolved correlations among remote stations, and with other environmental data sets.
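The GUSTO data-rate figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, 1 kbit is taken as 1000 bits, and the linear growth from 2 to 20,000 systems is an assumption, since the slide does not state the ramp profile.

```python
# Back-of-envelope check of the GUSTO data volumes quoted on the slide.
# Assumptions (not from the slide): 1 kbit = 1000 bits, a 365.25-day year,
# and a linear ramp in the number of installed systems.

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # about 3.16e7 seconds


def bits_per_year(rate_bits_per_s: float) -> float:
    """Bits produced in one year by a single system at a constant rate."""
    return rate_bits_per_s * SECONDS_PER_YEAR


def fleet_bits_linear_ramp(n_start: int, n_end: int, years: float,
                           rate_bits_per_s: float) -> float:
    """Total bits over `years` if the unit count grows linearly.

    With linear growth the time-averaged unit count is the mean of the
    start and end counts.
    """
    mean_units = (n_start + n_end) / 2
    return mean_units * bits_per_year(rate_bits_per_s) * years


RATE = 1_000  # 1 kbit/s per GUSTO system (from the slide)

per_unit = bits_per_year(RATE)                          # one system, one year
full_fleet_per_year = 20_000 * per_unit                 # 20,000 systems, one year
ramp_total = fleet_bits_linear_ramp(2, 20_000, 3, RATE)  # 3-year linear ramp

print(f"one system:        {per_unit:.3e} bits/year")
print(f"20,000 systems:    {full_fleet_per_year:.3e} bits/year")
print(f"3-year ramp total: {ramp_total:.3e} bits "
      f"({ramp_total / 8 / 1e15:.3f} petabytes)")
```

One system comes to roughly 3.2 × 10^10 bits per year, consistent with the order of magnitude on the slide. At 20,000 systems the yearly volume is about 0.63 × 10^15 bits, which matches the quoted 0.6 figure if it is read as petabits per year at full deployment; under the assumed linear ramp the 3-year total in bytes is smaller.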
Electrical grid
Renewables characterised by:
• large number of small units
• often in remote areas
• wireless connectivity
• fluctuating, unpredictable loading
As the total exceeds 12%, grid control becomes very difficult without an RT e-grid.
There is large potential in embedded generation renewable sources; they will dominate in new build (nuclear, hydro and carbon) power stations.
Decentralised power is the new paradigm:
• active management
• RT monitoring
• RT control
• minute to minute security
• pan network optimisation
This requires very high bandwidth RT remote station data acquisition, warehousing and analysis.

The IC Advantage
The IC infrastructure: microgrid for the testbed
[Network diagram labels: End devices; Cat 5 floor wiring; Floor switches; Building riser fibre; Building router switches; Core to building fibre; Core fibre; Core router switches; Core switches; Central Computing Facilities (workstation cluster, SMP, storage); wireless]
• More than 12000 end devices
• 10 Mb/s to 1 Gb/s to end devices
• 1 Gb/s between floors
• Access to disparate off-campus sites: IC hospitals, Wye College etc.
• 10 Gb/s to backbone
• 10 Gb/s between backbone router matrix and wireless capability
• Proposed firewall
• London MAN / JANET

ICPC Resource
• 150 Gflops processing
• >100 GB memory
• 5 TB of disk storage
£3m SRIF funding:
• Network upgrade
• +20 TB of disk storage
• 2x1 Gb/s to LMAN II (10 Gb/s scheduled 2004)
• +25 TB of tape storage
• 3 clusters (> 1 Tera Flops)

Particle Physics and Astronomy Research Council (PPARC)
ASTROGRID (http://www.astrogrid.ac.uk/)
• a ~£5M project aimed at building a datagrid for UK astronomy, which will form the UK contribution to a global Virtual Observatory

Particle Physics and Astronomy Research Council (PPARC)
GridPP (http://www.gridpp.ac.uk/)
• to develop the Grid technologies required to meet the LHC computing challenge
• collaboration with international grid developments in Europe and the US

EPSRC Testbeds (1)
• MyGrid: Personalised extensible environments for data-intensive in silico experiments in biology
• Distributed Aircraft Maintenance Environment
• RealityGrid: closely couple high performance computing, high throughput experiment and visualization

EPSRC Testbeds (2)
• GEODISE: Grid Enabled Optimisation and DesIgn Search for Engineering
• CombiChem: Combinatorial Chemistry Structure-Property Mapping
• Discovery Net: High Throughput Sensing