Sequence Annotation & Ontologies

The main question
- Sequence annotation => What does some sequence mean?

>Sequence of unknown function
AGGGAATAAAGGCTCAGGGACCGGCAGTTCTACTCTAGAGCCCACCAGCCTCTCAGAGCC
TCCGGTGACTGGCCTGTGTCTCCCCCTGGATGGACATGTGGACGGCGCTGCTCATCCTGC
AAGCCTTGTTGCTACCCTCCCTGGCTGATGGTGCCACCCCTGCCCTGCGCTTTGTAGCCG
TGGGTGACTGGGGAGGGGTCCCCAATGCCCCATTCCACACGGCCCGGGAAATGGCCAATG
CCAAGGAGATCGCTCGGACTGTGCAGATCCTGGGTGCAGACTTCATCCTGTCTCTAGGGG
ACAATTTTTACTTCACTGGTGTGCAAGACATCAATGACAAGAGGTTCCAGGAGACCTTTG
AGGACGTATTCTCTGACCGCTCCCTTCGCAAAGTGCCCTGGTACGTGCTAGCCGGAAACC
ATGACCACCTTGGCAATGTCTCTGCCCAGATTGCATACTCTAAGATCTCCAAGCGCTGGA
ACTTCCCCAGCCCTTTCTACCGCCTGCACTTCAAGATCCCACAGACCAATGTGTCTGTGG
CCATTTTTATGCTGGACACAGTGACACTATGTGGCAACTCAGATGACTTCCTCAGCCAGC

- Classical genetics teaches us that protein function and structure are ultimately linked: the 3D structure of a protein defines its function, and that structure is defined by the protein sequence
- Similar sequence => Similar structure => Similar function
- Thus the sequence of a gene carries information about what it does...

Sequence alignment
- Important part of bioinformatics
- Allows us to make the basic assumption that this sequence is similar to another... so they have something in common
- Starting point for about everything in bioinformatics...
- This gives us a potential name for the sequence…

Sequence alignment
Sequences producing significant alignments:                        Score (Bits)  E-Value
ref|NP_000688.2| arachidonate 12-lipoxygenase, 12S-type [Homo...   743           0.0
gb|AAA59523.1| 12-lipoxygenase [Homo sapiens] >gb|AAS00094.1|...   741           0.0
gb|AAA60056.1| 12-lipoxygenase [Homo sapiens]                      741           0.0
gb|AAA51533.1| arachidonate 12-lipoxygenase [Homo sapiens]         732           0.0
ref|XP_003810174.1| PREDICTED: arachidonate 12-lipoxygenase, ...   731           0.0
ref|XP_004058489.1| PREDICTED: arachidonate 12-lipoxygenase, ...   726           0.0
ref|XP_001104192.1| PREDICTED: arachidonate 12-lipoxygenase, ...   714           0.0
ref|XP_003912258.1| PREDICTED: arachidonate 12-lipoxygenase, ...   704           0.0
ref|XP_002806856.1| PREDICTED: LOW QUALITY PROTEIN: arachidon...   686           0.0
ref|XP_001503048.3| PREDICTED: arachidonate 12-lipoxygenase, ...   680           0.0
gb|EHH24438.1| Arachidonate 12-lipoxygenase, 12S-type, partia...   674           0.0
gb|EHH57648.1| Arachidonate 12-lipoxygenase, 12S-type, partia...   674           0.0
ref|XP_003929174.1| PREDICTED: arachidonate 12-lipoxygenase, ...   673           0.0
ref|XP_003791235.1| PREDICTED: arachidonate 12-lipoxygenase, ...   671           0.0
ref|XP_003274578.1| PREDICTED: arachidonate 12-lipoxygenase, ...   666           0.0

Other information besides the name…
- What other information can be found from the sequence or linked to it from various databases?
- 3-D structure of the protein
- Genomic context, neighbours
- Is it expressed? In what circumstances is it expressed?
- What functional domains the protein includes

What sequence does not tell
- However, function is not always self-evident from the gene sequence
- Some sequence positions are more important than others
  - Positions that form the active site of the protein are more critical
- We don't know the interactions
  - Protein-protein interactions create complexes, signalling pathways…
- Context
  - Function varies between different species
  - Function varies between different cell types
- Organisms / protein families with little research

Example
Blast results from Melitaea cinxia (the Glanville fritillary butterfly)

Sequences producing significant alignments:                          Score (bits)  E-Value
tr|Q7QF43|Q7QF43_ANOGA AGAP000295-PA OS=Anopheles gambiae GN=AGA...  155           1e-35
tr|B4JK90|B4JK90_DROGR GH12613 OS=Drosophila grimshawi GN=GH1261...  140           5e-31
tr|Q29H35|Q29H35_DROPS GA13573 OS=Drosophila pseudoobscura pseud...  136           8e-30
tr|B4N1J8|B4N1J8_DROWI GK16321 OS=Drosophila willistoni GN=GK163...  135           1e-29
tr|B4PXZ6|B4PXZ6_DROYA GE17857 OS=Drosophila yakuba GN=GE17857 P...  133           5e-29
tr|Q9VZ71|Q9VZ71_DROME CG15211, isoform A OS=Drosophila melanoga...  133           7e-29
tr|B4R7M7|B4R7M7_DROSI GD16987 OS=Drosophila simulans GN=GD16987...  132           9e-29
tr|B4IDT9|B4IDT9_DROSE GM11433 OS=Drosophila sechellia GN=GM1143...  132           9e-29
tr|B3NVE7|B3NVE7_DROER GG18369 OS=Drosophila erecta GN=GG18369 P...  132           9e-29
tr|B3MQD3|B3MQD3_DROAN GF20425 OS=Drosophila ananassae GN=GF2042...  132           9e-29
tr|B4L7Y1|B4L7Y1_DROMO GI11027 OS=Drosophila mojavensis GN=GI110...  132           1e-28
tr|Q16VM8|Q16VM8_AEDAE Putative uncharacterized protein (Fragmen...  131           3e-28
tr|B0WJ18|B0WJ18_CULQU Putative uncharacterized protein OS=Culex...  129           1e-27
tr|C1C2H5|C1C2H5_9MAXI Plasmolipin OS=Caligus clemensi GN=PLLP P...  104           3e-20
tr|Q1DGM2|Q1DGM2_AEDAE Putative uncharacterized protein (Fragmen...  100           8e-19
tr|C3ZW39|C3ZW39_BRAFL Putative uncharacterized protein OS=Branc...  74            6e-11
tr|A3KQ86|A3KQ86_DANRE Novel protein similar to vertebrate trans...  73            1e-10
tr|C3YLB7|C3YLB7_BRAFL Putative uncharacterized protein OS=Branc...  72            1e-10
sp|Q8CJ61|CKLF4_MOUSE CKLF-like MARVEL transmembrane domain-cont...  67            4e-09
tr|A4IFB9|A4IFB9_BOVIN CMTM4 protein OS=Bos taurus GN=CMTM4 PE=2...  67            8e-09
sp|P47987|PLLP_RAT Plasmolipin OS=Rattus norvegicus GN=Pllp PE=1...  66            1e-08
sp|Q9DCU2|PLLP_MOUSE Plasmolipin OS=Mus musculus GN=Pllp PE=2 SV=1   66            1e-08
tr|Q4SI17|Q4SI17_TETNG Chromosome 5 SCAF14581, whole genome shot...  65            2e-08
tr|B1H2E6|B1H2E6_XENTR LOC100145405 protein OS=Xenopus tropicali...  65            3e-08
tr|A7YYE5|A7YYE5_DANRE CKLF-like MARVEL transmembrane domain con...  65            3e-08
tr|C4Q9A0|C4Q9A0_SCHMA Marvel-containing potential lipid raft-as...  65            3e-08
tr|Q6DGM6|Q6DGM6_DANRE CKLF-like MARVEL transmembrane domain con...  64            5e-08
tr|C3ZW41|C3ZW41_BRAFL Putative uncharacterized protein (Fragmen...  64            5e-08
tr|C1BJ60|C1BJ60_OSMMO Plasmolipin OS=Osmerus mordax GN=PLLP PE=...  63            1e-07
sp|Q8IZR5|CKLF4_HUMAN CKLF-like MARVEL transmembrane domain-cont...  62            2e-07
…

Hit families highlighted on the slide: Plasmolipins; Chemokine-like factor superfamily members

Function found...
- You find out some function...
- Arachidonate 12-lipoxygenase 12S-type…
- Is this understandable??
- Simple googling can help you (Swissprot and Wikipedia pages)
- But what if you want to know a more general function for the gene (sequence) in question?
- Or you have a large number of genes and you would like to know what is common / specific for that set
=> Ontologies!

Ontologies
“An explicit formal specification of how to represent the objects, concepts, and other entities that are assumed to exist in some area of interest and the relationships that hold among them”
In plain terms:
- Ontologies represent categories that allow generalizations.
- Some categories are more detailed and some are broader: stronger vs. weaker generalization.
- Genes are mapped to these categories. The more detailed the information is, the more detailed the class is.

www.geneontology.org
The GO is a hierarchical structure for categorizing gene products in terms of their association with:
1. biological processes
2. cellular components
3. molecular functions
in a species-independent manner

Structure of Gene Ontology
- Hierarchical structure of linked nodes
- Smaller classes: child classes. Precise, detailed information
- Larger classes: parent classes. Broad, unspecific information
- Smaller classes belong to larger classes
- Viral protein biosynthesis => Protein biosynthesis => Biosynthesis
[Figure labels: root of hierarchical structure; starting node]

Structure of Gene Ontology
Directed Acyclic Graph (DAG)
- This means a tree-like structure where a child node can have links to many parent nodes.
- In other words: branches can split while going towards the root node.
- Viral protein biosynthesis is linked to protein biosynthesis and to viral genome expression.
[Figure labels: root of hierarchical structure; starting node]

Example of GO

What GO contains (27.3.2012)
36259 terms in total
22331 biological_process
2976 cellular_component
9323 molecular_function

Tools
http://amigo.geneontology.org/cgi-bin/amigo/go.cgi

GO Evidence Codes (http://www.geneontology.org/GO.evidence.shtml)
- IEA (Inferred from Electronic Annotation)
- TAS (Traceable Author Statement)
- Etc...
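
The DAG structure described in the "Structure of Gene Ontology" slides can be sketched in a few lines of Python. This is only an illustration, not how any GO tool is implemented: the term names stand in for real GO identifiers, and the two links up to "biological_process" are assumptions added so that the toy graph has a root. Only the links from "viral protein biosynthesis" and from "protein biosynthesis" to "biosynthesis" come from the slides.

```python
from collections import deque

# Child -> parents links of a toy ontology.
# Links marked "assumed" are NOT from the slides; they only give the graph a root.
PARENTS = {
    "viral protein biosynthesis": ["protein biosynthesis", "viral genome expression"],
    "protein biosynthesis": ["biosynthesis"],
    "biosynthesis": ["biological_process"],             # assumed link
    "viral genome expression": ["biological_process"],  # assumed link
    "biological_process": [],                           # root node
}


def ancestors(term, parents=PARENTS):
    """Return every broader class reachable from `term` via parent links."""
    found = set()
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current in found:
            continue  # already reached through another branch of the DAG
        found.add(current)
        queue.extend(parents.get(current, []))
    return found


if __name__ == "__main__":
    for name in sorted(ancestors("viral protein biosynthesis")):
        print(name)
```

Because a child can have several parents, the walk has to remember which nodes it has already visited; in a plain tree that check would be unnecessary. A gene annotated to the starting node is implicitly annotated to every class the function returns.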
Advantages of GO
- Cross-species comparison
  - Already used by several databases
- Comprehensive
  - Many terms per gene product
  - Many-to-many relationships possible
- Simplifies querying
  - Uses a restricted vocabulary developed by curators and annotators
- Use of evidence codes
  - How reliable is the given information?

Advantages of GO
Some GO terms for arachidonate 12-lipoxygenase, 12S-type:
- fatty acid oxidation
- cellular component movement
- negative regulation of apoptotic process
- positive regulation of cell adhesion
- ….
This information cannot be seen from the sequence name
GO classes continue to be updated even after the gene name is fixed

Advantages of GO: Reliability of information
- GO annotations include Evidence Codes: http://www.geneontology.org/GO.evidence.shtml
  - These represent how the information was obtained
  - An annotation can be traced back to its source
- Not all information is equally reliable: http://www.geneontology.org/GO.evidence.shtml#comp-assigned
  - The IEA abbreviation marks un-curated annotations
  - These usually refer to automated annotations generated from sequence similarity
- On many GO servers the user can select which evidence codes are accepted into the analysis

Advantages of GO with many genes
- Analysis of a large set of sequences for common features
- Viewing gene names & descriptions can be confusing

TAIR_locus   TAIR_description
AT5G10040    unknown protein
AT1G12805    nucleotide binding
AT4G10250    HSP20-like chaperones superfamily protein
AT1G54050    HSP20-like chaperones superfamily protein
AT4G33980    BEST Arabidopsis thaliana protein match is: cold regulated gene 27
AT2G17850    Rhodanese/Cell cycle control phosphatase superfamily protein
AT4G33980    BEST Arabidopsis thaliana protein match is: cold regulated gene 27
AT4G33070    Thiamine pyrophosphate dependent pyruvate decarboxylase family protein
AT4G33560    Wound-responsive family protein
AT1G19530    unknown protein
AT3G07150    unknown protein
ATMG00080    ribosomal protein L16
AT5G12020    17.6 kDa class II heat shock protein
AT1G16030    heat shock protein 70B
AT1G77120    alcohol dehydrogenase 1
AT5G12030    heat shock protein 17.6A
AT5G54165    unknown protein
AT2G14247    Expressed protein
AT3G20810    2-oxoglutarate (2OG) and Fe(II)-dependent oxygenase superfamily protein

Advantages of GO with many genes
- GO allows summarization of frequent features from the list

Summary
- Sequence annotation aims at finding informative similar sequences
- Information is collected in the sequence name and in Gene Ontology classes
- Gene Ontology provides supplementary information besides the gene name
- Gene Ontology also allows summarization of information from a larger set of genes
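
The summarization idea, combined with the evidence-code filtering mentioned earlier, can be sketched as follows. This is a minimal sketch under stated assumptions: the gene names and their annotations are invented for illustration (they are not taken from TAIR or GO), and `term_frequencies` is a hypothetical helper, not a function of any GO tool. The evidence codes IEA, TAS and IDA are real GO codes.

```python
from collections import Counter

# Invented (gene, GO term, evidence code) triples, for illustration only.
ANNOTATIONS = [
    ("gene1", "response to heat", "TAS"),
    ("gene2", "response to heat", "IDA"),
    ("gene2", "response to heat", "IEA"),  # same term, second evidence source
    ("gene3", "response to heat", "IEA"),
    ("gene3", "protein folding", "TAS"),
    ("gene1", "protein folding", "IEA"),
    ("gene4", "response to cold", "TAS"),
]

GENE_LIST = {"gene1", "gene2", "gene3"}  # the gene set we want to summarize


def term_frequencies(annotations, genes, excluded_codes=("IEA",)):
    """Count how many genes of `genes` carry each GO term.

    Annotations whose evidence code is in `excluded_codes` are ignored,
    and each (gene, term) pair is counted only once.
    """
    pairs = set()
    for gene, term, code in annotations:
        if gene in genes and code not in excluded_codes:
            pairs.add((gene, term))
    return Counter(term for _, term in pairs)


if __name__ == "__main__":
    print("Curated evidence only:")
    for term, n in term_frequencies(ANNOTATIONS, GENE_LIST).most_common():
        print(f"  {term}: {n} gene(s)")
    print("All evidence codes:")
    for term, n in term_frequencies(ANNOTATIONS, GENE_LIST, ()).most_common():
        print(f"  {term}: {n} gene(s)")
```

Counting terms like this gives only raw frequencies. Deciding whether a term is more frequent in the list than expected would also need the term's frequency in a background gene set, which this sketch leaves out.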