Genome analysis and annotation, Part II
TIGR, The Institute for Genomic Research

Modeling a gene
[Figure: S. mansoni gene model view. Evidence tracks: PASA assemblies; S. japonicum EST alignments; Genewise alignments (predictions); nr protein alignments; Caenorhabditis sp. protein alignments; Brugia malayi protein alignments]

Attributes of individual annotated genes
[Screenshot of a component within Neomorphic’s annotation station: www.neomorphic.com]
Sequence Database Hits. Top: protein matches. Bottom: EST matches.
Gene Predictions
Annotated Gene. Top: editing panel. Bottom: final curation.
Splice site predictions. Red: acceptor sites. Blue: donor sites.
Not shown graphically: gene name, nucleotide and protein sequence, MW, pI, organellar targeting sequence, membrane spanning regions, other domains.

Assigning function to predicted gene products
The primary tool for assigning function is homology to well-characterized proteins.
[Figure: homology-based annotation chains among E. coli, H. influenzae and M. genitalium]
However, transitive annotation can lead to errors that propagate.

The modular nature of proteins can provide the basis for functional annotation
• Proteins may share features that give clues to their structure and/or function.
• A domain is a region of a protein that can adopt a particular three-dimensional structure. A group of proteins that share a domain is called a family. There are several databases of protein families, such as Pfam (http://www.sanger.ac.uk/Software/Pfam/).
• Motifs are short, conserved regions of proteins, typically consisting of a pattern of amino acids that characterizes a protein family (http://www.expasy.org/prosite/).
  EF-hand: D-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-x(2)-[DE]-[LIVMFYW]
• HMM domains can also be defined and used to group proteins into families.

Protein domain frequencies can yield insights into the biology of an organism
Top 20 PFAM domains in A. fumigatus (Afu), with counts in A. nidulans (Ana) and A. oryzae (Aoa)

Pfam ID  Domain                                                      Afu   Ana   Aoa
PF00400  WD domain, G-beta repeat                                    532   598   541
PF00023  Ankyrin repeat                                              368   633   430
PF00083  major facilitator superfamily protein                       166   281   219
PF00172  Fungal Zn(2)-Cys(6) binuclear cluster domain                146   179   211
PF00515  TPR Domain                                                  139   142   152
PF00096  Zinc finger, C2H2 type                                      124   113   142
PF04082  Fungal specific transcription factor domain                 110   163   159
PF00153  Mitochondrial carrier protein                               106   114   100
PF00069  Protein kinase domain                                       105   105   101
PF00005  ABC transporter                                              93   129    86
PF00076  RNA recognition motif. (a.k.a. RRM, RBD, or RNP domain)      93    93    99
PF00106  oxidoreductase, short chain dehydrogenase/reductase family   92   135   129
PF00271  Helicase conserved C-terminal domain                         73    80    69
PF00067  Cytochrome P450                                              63   134   102
PF00107  oxidoreductase, zinc-binding dehydrogenase family            61   107    80
PF00501  AMP-binding enzyme                                           61    77    83
PF00560  Leucine Rich Repeat                                          47    50    54
PF00550  Phosphopantetheine attachment site                           46    54    60
PF00036  EF hand                                                      10    52    50

Domain-based paralogous families can be generated
The domain content of the entire proteome can be computed:
- Take all the proteins from a genome
- HMM search against Pfam profiles
- Alignment search against homology-based domain alignments
- The search results are stored in the database in the form of domain-based alignments
- Organize the proteins into domain-based paralogous families
• Related families share one or more domains with other families
• Many putative novel domains are extensions of existing domains

Hidden Markov Models (HMMs)
Statistical representations of sequence patterns.
Seed:
  ACA---ATG
  TCAACTATC
  ACAC--AGC
  AGA---ATC
  ACCG--ATC
Model: [diagram not reproduced]
A query sequence is scored by how likely it is that the HMM would produce it.

Procedure for Preparing an HMM Seed
Inspect and edit a pairwise aligned group of gene products:
- Eliminate fragments
- Correct the alignment
- Remove sequence outside domain
- Eliminate redundancy
- BLAST, annotate and possibly expand the seed.
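As a rough illustration of how a seed alignment becomes a model that scores query sequences, the sketch below builds a heavily simplified profile HMM from the five-sequence seed shown on the HMM slide. The simplifications are assumptions of this sketch, not something the slides specify: gap-free columns are treated as match states, the gapped columns are pooled into a single insert state, there are no delete states, and probabilities are plain frequencies without pseudocounts. All function and variable names are illustrative.

```python
from collections import Counter

# Seed alignment copied from the slide.
SEED = [
    "ACA---ATG",
    "TCAACTATC",
    "ACAC--AGC",
    "AGA---ATC",
    "ACCG--ATC",
]

# Assumption: gap-free columns are match states; the gapped columns 3-5 are
# pooled into one insert state sitting between the third and fourth match state.
MATCH_COLS = [0, 1, 2, 6, 7, 8]
INSERT_COLS = [3, 4, 5]
N_LEFT = sum(1 for c in MATCH_COLS if c < INSERT_COLS[0])  # match states before the insert


def frequencies(symbols):
    """Maximum-likelihood emission probabilities (no pseudocounts)."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}


MATCH_EMIT = [frequencies([row[c] for row in SEED]) for c in MATCH_COLS]
INSERTED = [row[c] for row in SEED for c in INSERT_COLS if row[c] != "-"]
INSERT_EMIT = frequencies(INSERTED)

# Transition estimates: the fraction of seed rows that enter the insert state,
# and how often an inserted residue is followed by another inserted residue.
ROWS_WITH_INSERT = sum(1 for row in SEED if any(row[c] != "-" for c in INSERT_COLS))
P_ENTER = ROWS_WITH_INSERT / len(SEED)
P_LOOP = (len(INSERTED) - ROWS_WITH_INSERT) / len(INSERTED)


def sequence_probability(seq):
    """Probability that the simplified model emits `seq`.

    Without delete states the state path is fixed by the sequence length,
    so no forward algorithm is needed in this sketch.
    """
    n_insert = len(seq) - len(MATCH_COLS)
    if n_insert < 0:
        return 0.0  # shorter sequences would need delete states
    left = seq[:N_LEFT]
    middle = seq[N_LEFT:N_LEFT + n_insert]
    right = seq[N_LEFT + n_insert:]

    p = 1.0
    for emit, symbol in zip(MATCH_EMIT[:N_LEFT], left):
        p *= emit.get(symbol, 0.0)
    if n_insert == 0:
        p *= 1.0 - P_ENTER  # skip the insert state
    else:
        p *= P_ENTER * P_LOOP ** (n_insert - 1) * (1.0 - P_LOOP)
        for symbol in middle:
            p *= INSERT_EMIT.get(symbol, 0.0)
    for emit, symbol in zip(MATCH_EMIT[N_LEFT:], right):
        p *= emit.get(symbol, 0.0)
    return p


if __name__ == "__main__":
    for query in ("ACAATC", "ACACATC", "TGCTAGG"):
        print(query, sequence_probability(query))
```

With these estimates the query ACACATC gets a probability of about 0.047, while a sequence containing a residue never seen in a match column scores exactly zero. That zero is one reason production tools add pseudocounts and report log-odds scores against a background model rather than raw probabilities.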
[Figure panels: Homology-Based Alignment; HMM Seed; Trusted Hits]

What is Gene Ontology (GO)?
The Gene Ontology is a set of dynamic controlled vocabularies used to describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner (www.geneontology.org).

The Three Ontologies
Molecular function, biological process and cellular component are considered attributes of gene products.
(a) Biological Process: a biological objective that has more than one distinct step
(b) Molecular Function: what the gene product does; think ‘activity’
(c) Cellular Component: location in the cell (or smaller unit) or part of a complex

Assigning GO IDs
Each GO ID is qualified with an evidence code. Evidence codes are:
IMP: inferred from mutant phenotype
IGI: inferred from genetic interaction
IPI: inferred from physical interaction
IDA: inferred from direct assay
IEP: inferred from expression pattern
ISS: inferred from sequence or structural similarity
IEA: inferred from electronic annotation
IC: inferred by curator
TAS: traceable author statement
NAS: non-traceable author statement
ND: no biological data available
NR: no longer used

Evidence code categories:
• Experimental evidence
• Sequence similarity
• Calculated by algorithm
• Author statement

The “with/to” field
ISS, IPI and IGI require the accession of the similarity hit or the interacting entity.

Gene ontologies can help interpret large-scale datasets
K-means clustering using TIGR Multi-Experiment Viewer (TMEV)
[Figure: expression profiles for Cluster 4 and Cluster 10]
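The evidence-code rules from the "Assigning GO IDs" slide lend themselves to a simple automated check. The sketch below is a minimal illustration, not a TIGR or GO Consortium tool: the `GoAnnotation` record, the `check` function and the example gene and accession identifiers are all hypothetical. It assumes GO IDs have the form `GO:` followed by seven digits, accepts only the evidence codes listed on the slide, flags the retired NR code, and requires a "with" accession for ISS, IPI and IGI.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Evidence codes as listed on the slide.
KNOWN_CODES = {"IMP", "IGI", "IPI", "IDA", "IEP", "ISS",
               "IEA", "IC", "TAS", "NAS", "ND", "NR"}
RETIRED_CODES = {"NR"}                # listed on the slide as no longer used
NEEDS_WITH = {"ISS", "IPI", "IGI"}    # must name the similarity hit or interacting entity
GO_ID_PATTERN = re.compile(r"GO:\d{7}")


@dataclass
class GoAnnotation:
    """Hypothetical record for one GO assignment to one gene product."""
    gene: str
    go_id: str
    evidence: str
    with_accession: Optional[str] = None


def check(annotation: GoAnnotation) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if not GO_ID_PATTERN.fullmatch(annotation.go_id):
        problems.append(f"malformed GO ID: {annotation.go_id}")
    if annotation.evidence not in KNOWN_CODES:
        problems.append(f"unknown evidence code: {annotation.evidence}")
    elif annotation.evidence in RETIRED_CODES:
        problems.append(f"retired evidence code: {annotation.evidence}")
    if annotation.evidence in NEEDS_WITH and not annotation.with_accession:
        problems.append(f"{annotation.evidence} requires a 'with' accession")
    return problems


if __name__ == "__main__":
    # Identifiers below are made up for illustration.
    records = [
        GoAnnotation("gene_0001", "GO:0005509", "IDA"),
        GoAnnotation("gene_0002", "GO:0005509", "ISS"),
        GoAnnotation("gene_0002", "GO:0005509", "ISS", "ACC_0001"),
    ]
    for record in records:
        print(record.gene, record.evidence, check(record))
```

Returning a list of problems instead of raising on the first failure lets a curator see every issue with a record in a single pass.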