Managing Gene Annotation Information
the search is over … one problem solved … another begins
observations from a foot soldier in the bio-information (r)evolution
Bill Farmerie, ICBR Genomics Group

Interdisciplinary Center for Biotechnology Research
* Established at the University of Florida in 1987 by the Florida Legislature
* Centralized organization of biomedical core facilities supporting biotechnology-based research

How did information management become my problem?
* 1998 GSAC, Miami Beach

Why should I care about this problem?
* Because my paycheck depends on it.
* Avoid fatal failure in the funding loop.
[Funding loop diagram. Labels: PI has $ for large gene-based project; Other PIs think this looks like a good idea; PI applies for new funding; Core Lab generates data; Downstream data management & analysis; PI writes papers, gives talks]

From Sequence to Function
* The genomic sequence identifies the 'parts'; the next trick is understanding gene function
* Post-genomic era = functional genomics
* Critical concept: genes of similar sequence may have similar functions
* Inferring function for a new gene begins with searching for its nearest neighbor (or homolog) of known function

BLAST
* Most common starting point for gene identification
* Similarity search of sequence repository (GenBank)
Output
* Calculated scores (bit score and e-value)
* Text string (definition line), ID Reference Tag
* Sequence alignment
Advantages
* Fast algorithm, very good at finding close homologs
* Cluster and Grid-enabled versions available
Disadvantages
* Not good at finding distant relatives

HMMER
* HMMER developed by Sean Eddy
* Uses Hidden Markov Models
* Searches unknown protein query sequence against a database of protein family models
* Statistical models constructed from alignment of conserved protein regions (Pfam)
Advantages
* Superior to BLAST for discovering more distant homology relations
* GRID enabled
Disadvantages
* More computationally intensive than BLAST

OK! Great!
Sequencing done. Homology searches complete.
But how will I deliver this information to scientists spread all over campus, and their worldwide collaborators?

Search for summarizing information that restores sanity
CTGGGTTCTGTTCGGGATCCCAGT
CACAGGGACAATGGCGCATTCATA
TGTCACTTCCTTTACCTGCCTGGA
GAGGTGTGGCCACAGACTCTGGTG
GCTGCGAACGGGGACTCTGACCCA
GTCGACTTTATCGCCTTGACGAAG
AACCAGATTGACGTTGTCGGAGTC
GGAACTCACCTGGTCACCTGTACG
ACTCAGCCGTCGCTGGGTTGCGTT
CTGACACGCGGCTCCTCGTGTGGA
GCCGAAACCCCGACAAAAGCGAAG
GAGAGAGTGAGTATGAGCAGGCGG

BlastQuest
A small idea with a big mission

BlastQuest Requirements
* Accessible to research groups at remote locations
* Privacy-constrained sharing of results among the scientists
* Selective browsing of BLAST homology search results
* Selective data filtering on statistical criteria: e-value or bit score
* Selective data grouping on criteria such as GI number, or a defined number of top-scoring results
* Ad hoc search capability on user-determined criteria: text terms, boolean logic

From a computational point of view BlastQuest is embarrassingly simple.
However, it solved our problem for information storage, selective retrieval, and distribution.
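
The filtering, grouping, and ad hoc search requirements listed above can be illustrated with a minimal Python sketch. This is not BlastQuest's code: the `Hit` record, its field names, and the three helper functions are hypothetical, chosen only to show what "filter on e-value or bit score", "keep a defined number of top-scoring results", and "text terms with boolean logic" might look like on an in-memory list of BLAST hits.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST hit (hypothetical record layout, for illustration only)."""
    query_id: str      # identifier of the query sequence
    gi: int            # GI number of the database match
    definition: str    # definition line of the match
    evalue: float      # expectation value (lower is better)
    bit_score: float   # bit score (higher is better)


def filter_hits(hits, max_evalue=1e-5, min_bit_score=0.0):
    """Keep hits that pass both statistical thresholds."""
    return [h for h in hits
            if h.evalue <= max_evalue and h.bit_score >= min_bit_score]


def top_hits_per_query(hits, n=1):
    """Group hits by query and keep the n highest bit scores in each group."""
    grouped = defaultdict(list)
    for h in hits:
        grouped[h.query_id].append(h)
    return {q: sorted(hs, key=lambda h: h.bit_score, reverse=True)[:n]
            for q, hs in grouped.items()}


def text_search(hits, all_terms=(), none_terms=()):
    """Case-insensitive search of definition lines: every term in all_terms
    must appear (AND) and no term in none_terms may appear (NOT)."""
    def matches(h):
        text = h.definition.lower()
        return (all(t.lower() in text for t in all_terms)
                and not any(t.lower() in text for t in none_terms))
    return [h for h in hits if matches(h)]


if __name__ == "__main__":
    # Made-up example data, not real BLAST output.
    demo = [
        Hit("contig1", 101, "protein kinase alpha", 1e-50, 200.0),
        Hit("contig1", 102, "hypothetical protein", 1e-3, 40.0),
        Hit("contig2", 103, "protein kinase beta", 1e-20, 120.0),
    ]
    for query, best in top_hits_per_query(filter_hits(demo), n=1).items():
        print(query, [h.gi for h in best])
```

In a database-backed system the same selections would more naturally be expressed as queries against the stored results; the in-memory version is used here only to keep the example self-contained.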
Overview of BlastQuest Architecture
[Architecture diagram. Labels: Web Browser (Client Side GUI); Web Server; BLAST XML document; Tier 3: Client Interface Module, XML Loader, ACE Loader; Assembly ACE file; Tier 2: SQL Constructor, JDBC; Tier 1: MySQL DBMS]

[Screenshots: Welcome to BlastQuest; Choose among client projects; Results Selection; Grouped Results; Ad Hoc Text Searching; Internal BLAST Searches; Viewing a Gene Ontology Tree]

KEGG Classification
* Kyoto Encyclopedia of Genes and Genomes
* “Wiring diagrams of life”

KEGG Protein Networks
* Metabolic pathways
* Regulatory pathways
* Molecular complexes
* Network-network relations
* Network-environment relations
[Diagram labels: Common to both; Unique to non-Unigene; Unique to Unigene]

Bacterial Genome Annotation Workbench
Another simple idea driven by necessity
[Screenshots: Start; Project Summary; Contig Browser; Contig summary; Physical map linked to annotation]

Simple problems. Simple solutions.
Why are these simple ideas important?

Human Genome Project
* HGP drove innovation in biotechnology
* 2 major technological benefits:
* Stimulated development of high-throughput methods
* Reliance on computational tools for data mining and visualization of biological information

The HGP and the cost of DNA sequencing
* “Finished” quality DNA sequence: a DNA base call is considered finished if the probability of base call error is less than 1 in 10,000, also known as phred > 40
* Contiguous DNA sequence of phred > 40 is usually achieved by multifold sequencing of the same region; typically 7-10X coverage
* 1985: $10 per finished base
* 2001: $1 per 10 finished bases

GenBank, August 22, 2005
* Public Collections of DNA and RNA Sequence Reach 100 Gigabases

Trends in the cost efficiency of DNA sequencing
Shendure, J., Mitra, R., Varma, C., and Church, G.M. (2004) “Advanced sequencing technologies: Methods and Goals” Nature Reviews Genetics 5:335

454 Life Sciences Corporation
* The first commercial, massively parallel DNA sequencing technology

454 Technology
* Cyclic-array sequencing on in vitro amplified DNA molecules
* Individual molecules must be amplified to give a detectable sequencing signal
* Instead of biological cloning, we amplify individual DNA fragments on solid state beads using PCR
* Instead of terminator-based sequencing, pyrosequencing is used to determine nucleotide order: “sequencing by synthesis”

454 Process Overview

The bottom line …
* Efficiency of DNA sequencing increased 100X
* Cost per finished base declined 10- to 30-fold

… so what happens next?
* The “democratization” of large-scale genomic biology
* Many projects are now possible that were once fiscally inviable
* We must deal with basic local data management and information issues or lose this opportunity

If you thought bioinformatics was important before …
* By terminator-based sequencing we at UF produce 60-70 Mbp per year
* By synthesis-based sequencing we produce 60-70 Mbp per day
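
The numbers quoted in these slides can be checked with a few lines of arithmetic. The sketch below assumes the standard phred definition, Q = -10 log10(P), where P is the probability of a base call error; the function names are hypothetical, and the inputs are taken from the slides (error below 1 in 10,000; $10 per finished base in 1985 versus $1 per 10 finished bases in 2001; 60-70 Mbp per year versus 60-70 Mbp per day).

```python
import math


def phred_from_error_prob(p):
    """Phred quality score for a base call error probability p."""
    return -10.0 * math.log10(p)


def error_prob_from_phred(q):
    """Base call error probability for a phred quality score q."""
    return 10.0 ** (-q / 10.0)


def fold_change(old_value, new_value):
    """How many times smaller new_value is than old_value."""
    return old_value / new_value


def daily_vs_yearly_speedup(mbp_per_day, mbp_per_year, days_per_year=365):
    """Ratio of yearly output at the daily rate to the old yearly output."""
    return (mbp_per_day * days_per_year) / mbp_per_year


if __name__ == "__main__":
    # Finished-quality threshold: error probability of 1 in 10,000.
    print(f"phred at 1/10,000 error: {phred_from_error_prob(1 / 10_000):.1f}")

    # Cost per finished base: $10 (1985) versus $1 per 10 bases (2001).
    print(f"cost decline 1985 to 2001: {fold_change(10.0, 1.0 / 10):.0f}-fold")

    # Same 60-70 Mbp, produced per day instead of per year.
    print(f"throughput ratio: {daily_vs_yearly_speedup(65, 65):.0f}x")
```

Because the same 60-70 Mbp figure appears for both methods, the throughput ratio reduces to the number of days in a year, assuming the instrument runs every day.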