Session: 40382
Life Sciences: Data Revolution
Building Gene Expression Databases

Mahendra Navarange
Microarray Centre
MRC Clinical Sciences Centre and Imperial College, UK

Agenda
  What is Life Science?
  MiMiR: database for gene expression data
  Data acquisition process and data characteristics
  System requirements
  Design issues
  Code snippets

What is Life Sciences?
  Includes
    Biology
    Biotechnology
    Chemistry
    Pharmaceuticals
    Agriculture / Plant Science
    Environmental Sciences
    ????
  Objective
    Understand the molecular and evolutionary basis of living organisms

Focus Areas
  Genomics
    Human Genome Project
    Draft published in 2000
    Finished version on 14 April 2003
    Sequencing data doubles every year
  Transcriptomics
    Study of transcription (gene expression)
  Proteomics
    Study of translation (protein synthesis)
  Courtesy F. Hoffmann-La Roche Ltd.

Data…Data…Data
  Sanger Centre 5TB
  Celera ~ 100TB+ (2001)
  [Chart: data volume in TB (0 to 700) by year, 1999 to 2007]

Data Revolution in Life Sciences
  Impact of technology
    High throughput platforms (HTP)
      Robotics
      Miniaturisation
  Data driven science
    Datawarehousing technologies
    Data mining and visualisation software
  Life Sciences
  Information Technology

Databases
  Genomics
    Sanger
    NCBI
    TIGR
    KEGG
  Transcriptomics
    ArrayExpress
  Proteomics
    Protein Databank (PDB)
    SWISSPROT
  Entrez

Using Life Sciences Data
  Identify causes of genetic diseases
  Discover new drug compounds
  Personalised medicine
  Develop new diagnostics

Drug Discovery Pipeline
  [Diagram labels, in source order: Target Identification, Target Validation, HTP Screening, Hits, Leads, Clinical Leads, Trials, FDA]

Life Sciences: The Future
  “…..biology is changing from a purely laboratory-based science to an information based science.”
  Eric Lander, Director, Whitehead Institute MIT

Agenda
  What is Life Sciences?
  MiMiR: database for gene expression data
  Data acquisition process and data characteristics
  System requirements
  Design issues
  Code snippets

Transcriptomics
  Comparing gene expression across databases
  Collaborate to share expertise
  Benefits
    Diagnostics
    Screen target drug compounds
    Identify toxic side effects
    Screen patients for clinical trials

Workflow
  [Diagram labels: Literature, Experiment design, Further Analysis, GO, HTP, Local DB, NCBI, Data, Preliminary Analysis, Collaboration]

HTP Microarray Platform: Hardware
  Courtesy Affymetrix Inc., Dell Inc

Microarray Data Acquisition
  Courtesy Fisher Scientific
  Courtesy Affymetrix Inc.

Microarray Data
  High density microarray
    ~500,000 spots of ~18 µm size
    >20,000 genes
  Typical file size 45MB
  No. of files produced in a typical experiment: 10-20
  Courtesy Affymetrix Inc.

Life Sciences Data Explosion
  Data Characteristics
    Image data generated by HTP platforms, annotation by researchers
    Large volume and size
    Varied data types
  Datawarehousing challenges
    Non-summarisable
    High dimensionality
    Limited knowledge of underlying biological processes
    No standard industry data models or best practices

Agenda
  What is Life Sciences?
  MiMiR: database for gene expression data
  Data acquisition process and data characteristics
  System requirements
  Design issues
  Code snippets

System Requirements
  Seamless data integration
  Handle wide range of datatypes
  Processor intensive and I/O intensive
  Exponential growth in data storage
  Open architecture, collaboration

System Requirements
  Rapid changes: new databases, technologies and instruments
  Competitive pressures, quick response, low access times
  Plug and play capability
  Security

MIcroarray Data MIning Resource
  MiMiR: Microarray Datawarehouse
    ~250GB. Expected to double in next few months
    ~2500 images, over 1500 BioAssays
    52 tables, largest table 15GB
  Infrastructure
    Oracle 9i Release 1 on Windows 2000
    Dell PowerEdge Quad Processor, 2 GB memory, 400 GB hard disk
    1 TB NAS capacity

Requirements vs. Solutions
  Integrate different types of data sources
    Use of XML for data exchange
    Use of Oracle UltraSearch
  Efficient data retrieval
    Stringent response time standards on procedures
    Index-Organized Tables, Partitioning
  Security
    Firewall
    Single Sign-On servers (in progress)
  Rapid change management
    BC4J framework, JDeveloper
    Extreme programming, prototyping

MiMiR System Architecture
  [Architecture diagram; labels: Ext Ref, Images, Annotation, Blast, MiMiR, MAGE-ML, Spot Info, JDeveloper, XSQL, ArrayExpress, 9iAS, Admin, Application Server, XSU, XDK, Private, BC4J, JSP, JClient]

Oracle Products Used
  Oracle 9i Database Server/Client (Release 1)
    Partitioning
    Join indexing
  Oracle 9i JDeveloper (9.0.2)
  Oracle 9i Application Server (BC4J)
  Oracle XML features
    Oracle PL/SQL packages for XML
    Oracle XSQL publishing framework
    XDK (DOMParser and SAXParser)
    XSU
  Oracle Data Mining (Future)
  Oracle Collaboration Suite (Future)

Why Oracle?
  Readily scalable
  Manage wide variety of data types
  Integrated development tools
  Support XML and Java
  High performance middleware
  Secure collaboration

Agenda
  What is Life Sciences?
  MiMiR: database for gene expression data
  Data acquisition and profiling
  System requirements
  Design issues
  Code snippets

Oracle and XML: Design Issues
  Storage
    Storing XML in tables
    Storing XML in CLOBs
    Hybrid
  Generation
    XDK for Java, PL/SQL
    XSU
  Transformation
    XSL Stylesheet
    Views
  Processing
    XDK DOMParser
    XDK SAXParser
  Searching
    XPATH
    Oracle Text
  Publishing
    XSQL publishing framework
    XSL

Oracle and XML: XSQL Example

    <?xml version="1.0" encoding="windows-1252"?>
    <!--
     | Uncomment the following processing instruction and replace
     | the stylesheet name to transform output of your XSQL Page using XSLT
     <?xml-stylesheet type="text/xsl" href="YourStylesheet.xsl" ?>
    -->
    <?xml-stylesheet type="text/xsl" href="mimirArray.xsl"?>
    <xsql:query connection="micro" xmlns:xsql="urn:oracle-xsql">
      select * from array
    </xsql:query>

Oracle and XML: Design Issues

Agenda
  What is Life Sciences?
  MiMiR: database for gene expression data
  Data profiling
  System requirements
  Design issues
  Code snippets

An Example
  Creating XML from 500,000 records in the database

Solution 1
  Using the XSU Java API to get an XML DOM.

    import oracle.xml.parser.v2.XMLDocument;
    import oracle.xml.sql.query.OracleXMLQuery;

    // createConnection is the presenter's own helper; conn and out are declared elsewhere
    conn = createConnection.createConnection();
    String query = "SELECT * FROM IMAGE_QUANTITATION i "
                 + "WHERE QUANT_FILENAME = 'PMB2002011001Aaa'";
    OracleXMLQuery q1 = new OracleXMLQuery(conn, query);
    q1.keepCursorState(true);
    // Build the whole result set as an in-memory DOM, then print it
    XMLDocument xmlDoc = (XMLDocument) q1.getXMLDOM();
    xmlDoc.print(out);

Solution 2
  Using the XSU Java API to get an XML string.

    conn = createConnection.createConnection();
    String query = "SELECT * FROM IMAGE_QUANTITATION i "
                 + "WHERE QUANT_FILENAME = 'PMB2002011001Aaa'";
    OracleXMLQuery q1 = new OracleXMLQuery(conn, query);
    q1.keepCursorState(true);
    // XMLDocument xmlDoc = (XMLDocument) q1.getXMLDOM();
    // xmlDoc.print(out);
    // Get the result as a String instead of a DOM
    System.out.println(q1.getXMLString());

Solution 3
  Using the dbms_xmlquery package to get XML output from SQL.

    select dbms_xmlquery.getXML('select * from IMAGE_QUANTITATION where quant_filename=''PMB2002011001Aaa''') from dual;

    <?xml version = '1.0'?>
    <ROWSET>
      <ROW num="1">
        <IMAGE_ID>PMB2002011003Aaa</IMAGE_ID>
        <CHIP_TYPE>MG-U74Av2</CHIP_TYPE>
        <ELE_SET_NAME>AFFX-MurIL2_at</ELE_SET_NAME>
        <POSITIVE>2</POSITIVE>
        <NEGATIVE>5</NEGATIVE>
        <PAIRS>20</PAIRS>
        <PAIRS_USED>20</PAIRS_USED>
        <PAIRS_IN_AVG>19</PAIRS_IN_AVG>

Summary
  Life sciences research is generating enormous amounts of data using HTP platforms
  The data is non-summarisable, distributed and has varied data types
  Data integration and secure collaboration are key to success

MiMiR Acknowledgements
  Dr. Helen Causton
  Prof. Tim Aitman
  Dr. Laurence Game
  Vihar Wadekar
  Helen Figueira
  Helen Banks
  Nicola Cooley
  MGED Data Society (www.mged.org)

Session: 40382 Life Sciences: Data Revolution Building Gene Expression Databases
What Next: Opportunities for collaboration for development of Knowledge Management Systems for Drug Discovery
  Contact: [email protected]
  http://microarray.csc.mrc.ac.uk
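
Appendix note on the "Processing" design issue (XDK DOMParser vs. XDK SAXParser): for a result set on the scale of the 500,000-record example, a streaming SAX pass can read ROWSET/ROW output without holding the whole document in memory, whereas a DOM is built in full first. The following is a minimal sketch only. It assumes the JDK's built-in JAXP SAX parser in place of Oracle XDK so that it runs without Oracle libraries; the class name RowsetSaxSummary, the two-row sample, and the choice to total PAIRS_USED are all hypothetical and not part of MiMiR.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class RowsetSaxSummary {

    /** Streams over ROWSET/ROW XML, keeping only two running totals in memory. */
    static class RowHandler extends DefaultHandler {
        int rows = 0;
        int pairsUsed = 0;
        private boolean inPairsUsed = false;
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if (qName.equals("ROW")) {
                rows++;
            } else if (qName.equals("PAIRS_USED")) {
                inPairsUsed = true;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so accumulate until the end tag
            if (inPairsUsed) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("PAIRS_USED")) {
                pairsUsed += Integer.parseInt(text.toString().trim());
                inPairsUsed = false;
            }
        }
    }

    /** Returns {row count, sum of PAIRS_USED} for the given ROWSET document. */
    public static int[] summarise(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        RowHandler handler = new RowHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return new int[] { handler.rows, handler.pairsUsed };
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical two-row sample in the ROWSET/ROW shape shown in Solution 3
        String xml = "<?xml version='1.0'?><ROWSET>"
                + "<ROW num=\"1\"><PAIRS>20</PAIRS><PAIRS_USED>20</PAIRS_USED></ROW>"
                + "<ROW num=\"2\"><PAIRS>20</PAIRS><PAIRS_USED>18</PAIRS_USED></ROW>"
                + "</ROWSET>";
        int[] result = summarise(xml);
        System.out.println("rows=" + result[0] + " pairsUsed=" + result[1]);
    }
}
```

In practice the input would come from a stream rather than a String; the handler logic would stay the same, which is the point of the SAX option listed under Processing.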