Evolution by natural selection
DNA -> RNA -> Protein
Structure -> Function

What is bioinformatics?
Why is there bioinformatics?
Databases, the reagents
What does bioinformatics do?

Bioinformatics is about understanding how life works. It is a hypothesis-driven science.

Bioinformatics is about integrating biological themes with the help of computer tools and biological databases, and gaining new knowledge from this.

Acquisition, curation, and analysis of biological data
Hypothesis

• Lots of new sequences being added
• Automated sequencers
• Genome projects
• EST sequencing
• Microarray studies
• Proteomics
• Metagenomics
• WGS
• Patterns in datasets that can only be analyzed using computers

Genome information
DNA sequence
Gene expression
Protein expression
Protein structure
Genome mapping
Metabolic networks
Regulatory networks
Trait mapping
Gene function analysis
Scientific literature

"Biology is mere stamp-collecting"

1951      (Sanger & Tupper) 30 AAs of β-chain bovine insulin
1965      (Holley) nucleotide sequence of a yeast alanine tRNA
1970s     (various) various protein sequencing methods
1972      (Dayhoff) "Atlas of protein sequence and structure"
1977      (Sanger, Maxam & Gilbert) DNA sequencing
1980s     (Brenner and various others) automated sequencing
1980s     community databases
1987-92   genome sequencing projects
1992      (Venter) Expressed Sequence Tags and patents
1998      C. elegans complete genome
2001      Human genome
Present   widespread use of automated sequencers

[Figure: No. of databases per year, 1999-2011. Source: Nucleic Acids Research: Database issue]

[Chart: databases by category]
DNA Sequence Databases (9%)
RNA Sequence Databases (5%)
Protein Sequence Databases (13%)
Structure Databases (9%)
Genomics Databases (19%)
Metabolic and Signaling Pathways (8%)
Human and other Vertebrate Genomes (8%)
Human Genes & Diseases (9%)
Microarray Data & Gene Expression (5%)
Proteomics Resources (1%)
Other Molecular Biology Databases (3%)
Organelle Databases (2%)
Plant Databases (7%)
Immunological Databases (2%)
Segment counts shown on the chart (pairing with categories not recoverable): 20, 42, 26, 27, 106, 134, 72, 67, 189, 129, 126, 113, 119, 270

INTERNATIONAL NUCLEOTIDE SEQUENCE DATABASE COLLABORATION (INSDC)
1. GenBank at the National Center for Biotechnology Information (NCBI) of the National Institutes of Health (NIH) in Bethesda, USA
2. European Nucleotide Archive at the European Bioinformatics Institute (EBI) in Hinxton, England
3. DNA Database of Japan (DDBJ) at the National Institute of Genetics in Mishima, Japan

Ranks            Higher taxa    Genus   Species   Lower taxa     Total
Archaea                  108      127       699          199      1133
Bacteria                1144     2136     11591        11445     26316
Eukaryota              17812    55141    223387        19199    315535
  Fungi                 1281     3860     23888         1769     30794
  Metazoa              12985    35329    102285         8925    159524
  Viridiplantae         2141    13580     89422         7214    112357
Viruses                  519      358      8092        68135     77104
All taxa               19608    57770    249723        99013    426110
Source: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi

   Entries           Bases  Species                          Common name
19,301,988  15,881,839,899  Homo sapiens                     Human
 9,296,587   9,118,049,806  Mus musculus                     Mouse
 2,187,038   6,503,434,302  Rattus norvegicus                Rat
 2,199,144   5,381,235,474  Bos taurus                       Cow
 3,927,943   5,055,840,446  Zea mays                         Corn
 3,222,429   4,793,300,236  Sus scrofa                       Pig
 1,705,210   3,127,958,433  Danio rerio                      Zebrafish
   228,265   1,352,948,327  Strongylocentrotus purpuratus    Sea urchin
 1,343,269   1,251,053,810  Oryza sativa Japonica Group      Rice
 1,770,193   1,194,842,997  Nicotiana tabacum                Tobacco
 1,424,327   1,147,237,486  Xenopus (Silurana) tropicalis    Clawed frog
 2,321,188   1,138,511,865  Arabidopsis thaliana             Thale cress
 1,224,108   1,058,563,193  Drosophila melanogaster          Fruit fly
   214,236   1,003,309,475  Pan troglodytes                  Chimpanzee
 1,454,515     947,332,578  Canis lupus familiaris           Dog
   662,510     915,431,680  Vitis vinifera                   Grape
   811,571     896,784,038  Gallus gallus                    Chicken
 1,890,146     895,052,594  Glycine max                      Soybean
    82,708     828,906,407  Macaca mulatta                   Rhesus macaque
   738,474     778,132,243  Solanum lycopersicum             Tomato

Year  Sequence                        Number of base pairs
1971  First published DNA sequence                      12
1977  PhiX174                                        5,375
1982  Lambda                                        48,502
1992  Yeast Chromosome III                         316,613
1995  Haemophilus influenzae                     1,830,138
1996  Saccharomyces                             12,068,000
1998  C. elegans                                97,000,000
2000  D. melanogaster                          120,000,000
2001  H. sapiens (draft)                     2,600,000,000
2003  H. sapiens                             2,850,000,000

[Figure: Growth in complete genomes. Cochrane G et al. Nucl. Acids Res. 2011;39:D15-D18. © The Author(s) 2010. Published by Oxford University Press.]

Bioinformatics as an interdisciplinary science has to:
1. pick up, provide and apply the appropriate mathematical tools needed for tackling problems of systematic biology;
2. provide a suitable knowledge basis to specify the application of the developed tools;
3. develop appropriate algorithms and implement them as effective computer programs;
4. provide the required technical solutions for handling large amounts of biological data.

• Use of computational search and alignment techniques to compare a new genome against known genes
• Use of mathematical modeling techniques to identify common patterns, features and high-level functions
• Integrated approach that combines both

Purpose                        Software
Sequence assembly              Arachne, GAP4, AMOS
Pairwise sequence comparison   FASTA, BLAST
Sequence-profile comparison    PSI-BLAST
Multiple alignment             ClustalW
Phylogenetic analysis          PAUP, Phylip
Gene identification            Genscan, GeneMarkHMM, GRAIL, Genie, Glimmer
Analysis of repetitive DNAs    RepeatMasker, RepeatFinder, RECON
Protein sequence               Pfam, ProDom, COG
'Fingerprints'/motifs          PROSITE, PRINTS, BLOCKS
Microarray data analysis       GeneTraffic, GeneSpring, GCOS, Cluster, CaARRAY, BASE, Bioconductor
2-D gel analysis               SWISS-2DPAGE, Melanie, Flicker, PDQuest

Database/Tool    URL                                                      Use
PlantsDB         http://mips.gsf.de/projects/plants                       Similarities and dissimilarities, specific characteristics of individual plant genomes
POGs/PlantRBP    http://plantrbp.uoregon.edu/                             For cross-species comparison of genomes
AgBase           http://www.agbase.msstate.edu                            For functional analysis of genes
TIGR Plant TA    http://plantta.tigr.org                                  To generate a comprehensive resource of database-assembled and annotated gene transcripts
PathoPlant®      http://www.pathoplant.de                                 Plant–pathogen interactions and signal transduction reactions
PlantGDB         http://www.plantgdb.org/                                 Resources for comparative genomics
SGN              http://sgn.cornell.edu                                   Solanaceae genomics network
Sputnik          http://mips.gsf.de/proj/sputnik/                         EST clustering and annotation system
PopulusDB        http://www.populus.db.umu.se/                            Open resource for tree genomics
HARVEST          http://harvest.ucr.edu/                                  EST database viewing software
CR-EST           http://pgrc.ipk-gatersleben.de/cr-est/index.php          Crop EST database
VitisExpDB       http://cropdisease.ars.usda.gov/vitis_at/main-page.htm   Grape gene expression database

What is similar to my sequence?
• Searching gets harder as the databases get bigger, and quality changes
• Tools: BLAST and FASTA = time-saving heuristics (approximate methods)
• Statistics + informed judgment of the biologist

1. Sequence analysis
› Pairwise (Global & Local)
› Global: aligning sequence pairs in an end-to-end fashion
› Local: aligning the best-matching regions within a pair of sequences
› Multiple sequence alignment (MSA)

Ab initio: The gene looks like the average of many genes
› Genscan, GeneMark, GRAIL…
Similarity: The gene looks like a specific known gene
› Procrustes,…
Hybrid: A combination of both
› Genomescan (http://genes.mit.edu/genomescan/)

GENERIC STEPS INVOLVED IN EST ANALYSIS
[Figure: generic steps involved in EST analysis. Source: Briefings in Bioinformatics 2006. VOL 8(1) 6-21]

MINING FOR SSRs
TOOLS
1. MISA
2. WEBSAT
3. Microsatellite Repeat Finder
4. Perfect Microsatellite Repeat Finder
5. Tandem Repeats Finder
6. Repeat Finder
7. Etandem
8. Msatcommander
Source: TRENDS in Biotechnology 2005 Vol.23(1) 48-55

In silico SNP/indel identification
Tools
• AutoSNP
• QualitySNP
• HaploSNPer
• MAVIANT
• PolyBayes
• SNiPpER
Source: Genes, Genomes and Genomics - SPECIAL ISSUE: Tree and Forest Genetics (2010)

Tools
• Database: miRBase - http://www.mirbase.org/
• MiRAlign: http://bioinfo.au.tsinghua.edu.cn/miralign/
• miRanda: microRNA Target Detection
• Miracle: http://miracle.igib.res.in/miracle/
• RegRNA: http://regrna.mbc.nctu.edu.tw/
• miRTar: http://mirtar.mbc.nctu.edu.tw/
• miRU: Plant microRNA Potential Target Finder
• miRseek: http://220.227.138.213/mirnablast/mirnablast.php
Source: Asia Pac. J. Mol. Biol. Biotechnol., Vol. 15 (3), 2007

Can we predict the function of protein molecules from their sequence?
sequence -> structure -> function
Prediction of some simple 3-D structures (α-helix, β-sheet, membrane spanning, etc.)

COMBINED BIOINFORMATICS AND CHEMOINFORMATICS WORKFLOW
1. Sequence assembly.
2. Identification of target proteins.
3. BLASTp search against PDB to find homologous protein structures, to be used as templates (red) for protein homology modeling experiments.
4. Protein model structures (blue) can in turn be employed for docking. For docking experiments, ligand structures have to be converted into their 3D form.
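
The global and local pairwise modes listed under sequence analysis can be illustrated with a small dynamic-programming sketch. It is an illustration only and not how BLAST or FASTA work internally (those are heuristics). The function name `align_score` and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for this example. Global scoring follows the Needleman-Wunsch recurrence, local scoring the Smith-Waterman recurrence.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return the best pairwise alignment score of sequences a and b.

    local=False: global (end-to-end) alignment, Needleman-Wunsch.
    local=True:  local (best-matching region) alignment, Smith-Waterman.
    Scoring values are illustrative assumptions, not a standard matrix.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global mode: leading gaps are penalised along the first row/column.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0  # highest cell seen so far, used only in local mode
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                # Local mode: a negative score restarts the alignment at 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell

    # Global: score of aligning both full sequences; local: best region anywhere.
    return best if local else score[-1][-1]
```

With these assumed scores, comparing ACGT against TTACGTTT gives 4 in local mode, because the short sequence matches one region exactly, and -4 in global mode, because the four unmatched flanking bases must be paid for as gaps.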
Source: Genomics 89 (2007) 36–43

Mapping               Identifying the location of clones and markers on the chromosome by genetic linkage analysis and physical mapping
Sequencing            Assembling clone sequence reads into large (eventually complete) genome sequences
Gene discovery        Identifying coding regions in genomic DNA by database searching and other methods
Function assignment   Using database searches, pattern searches, protein family analysis and structure prediction to assign a function to each predicted gene
Data mining           Searching for relationships and correlations in the information
Genome comparison     Comparing different complete genomes to infer evolutionary history and genome rearrangements

• Development of automated sequencing techniques
• Joining the sequences of smaller fragments
• Prediction of promoters and protein-coding regions
• Identifying the enzyme function of new genes by comparison with evolutionarily close genomes
• Network of gene groups connected through the reactions catalyzed by the enzymes encoded in those gene groups
• Global modeling of chemical reactions in microbial cells

To identify transcription factors for protein-DNA interactions there are four major approaches:
• Microarray analysis of gene expression
• Statistical analysis of promoter regions of orthologous genes
• Global analysis of frequency patterns of dimers in the intergenic region
• Biochemical modeling at the atomic level

• We have only touched small parts of the elephant
• Trial and error (intelligently) is often your best tool
• Keep up with the main databases, and you'll have a pretty good idea of what is happening and available
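
The dimer-frequency approach named among the transcription-factor methods starts from a simple count of overlapping dinucleotides in a sequence. The following is a minimal sketch under stated assumptions: the input is a plain DNA string, the function name `dimer_frequencies` is hypothetical, and dimers containing anything other than A, C, G or T are skipped. Real analyses compare such frequencies against a background model, which is not shown here.

```python
from collections import Counter


def dimer_frequencies(seq):
    """Return relative frequencies of overlapping dinucleotides in seq.

    Dimers containing characters other than A, C, G, T (for example N)
    are ignored. Returns an empty dict if no valid dimer is found.
    """
    seq = seq.upper()
    valid = set("ACGT")
    counts = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if set(seq[i:i + 2]) <= valid
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {dimer: n / total for dimer, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical intergenic fragment used only to show the call.
    for dimer, freq in sorted(dimer_frequencies("ATATATGCGC").items()):
        print(dimer, round(freq, 3))
```

Counting overlapping windows (step of one base) is a design choice here; it means every adjacent pair of bases contributes once to the totals.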