Sequence Annotation & Ontologies
The main question:
 Sequence annotation => What does a given sequence mean?
>Sequence of unknown function
AGGGAATAAAGGCTCAGGGACCGGCAGTTCTACTCTAGAGCCCACCAGCCTCTCAGAGCC
TCCGGTGACTGGCCTGTGTCTCCCCCTGGATGGACATGTGGACGGCGCTGCTCATCCTGC
AAGCCTTGTTGCTACCCTCCCTGGCTGATGGTGCCACCCCTGCCCTGCGCTTTGTAGCCG
TGGGTGACTGGGGAGGGGTCCCCAATGCCCCATTCCACACGGCCCGGGAAATGGCCAATG
CCAAGGAGATCGCTCGGACTGTGCAGATCCTGGGTGCAGACTTCATCCTGTCTCTAGGGG
ACAATTTTTACTTCACTGGTGTGCAAGACATCAATGACAAGAGGTTCCAGGAGACCTTTG
AGGACGTATTCTCTGACCGCTCCCTTCGCAAAGTGCCCTGGTACGTGCTAGCCGGAAACC
ATGACCACCTTGGCAATGTCTCTGCCCAGATTGCATACTCTAAGATCTCCAAGCGCTGGA
ACTTCCCCAGCCCTTTCTACCGCCTGCACTTCAAGATCCCACAGACCAATGTGTCTGTGG
CCATTTTTATGCTGGACACAGTGACACTATGTGGCAACTCAGATGACTTCCTCAGCCAGC
 Classical genetics teaches us that protein function and structure are intimately linked: the 3D structure of a protein defines its function, and that structure is defined by the protein sequence
 Similar sequence => Similar structure => Similar function
 Thus the sequence of a gene carries information about what the gene does...
Sequence alignment
 Important part of bioinformatics
 Allows us to make the basic assumption that this sequence is similar to another... So they have something in common
 Starting point for almost everything in bioinformatics...
 This gives us a potential name for the sequence…
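To make "this sequence is similar to another" concrete, here is a minimal sketch of a local alignment score in the style of Smith-Waterman. It is not what BLAST does internally (BLAST uses fast heuristics and substitution matrices); the function name and the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen only for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (Smith-Waterman style).

    Scoring values are illustrative assumptions, not BLAST defaults.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two sequences sharing the segment "ACGT" score as high as the segment itself.
print(local_alignment_score("GGACGTCC", "TTACGTAA"))
```

A higher score means a longer or cleaner shared segment; database search tools turn such scores into bit scores and E-values.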
Sequence alignment
Sequences producing significant alignments:                       Score (bits)  E-value
ref|NP_000688.2| arachidonate 12-lipoxygenase, 12S-type [Homo...  743  0.0
gb|AAA59523.1| 12-lipoxygenase [Homo sapiens] >gb|AAS00094.1|...  741  0.0
gb|AAA60056.1| 12-lipoxygenase [Homo sapiens]                     741  0.0
gb|AAA51533.1| arachidonate 12-lipoxygenase [Homo sapiens]        732  0.0
ref|XP_003810174.1| PREDICTED: arachidonate 12-lipoxygenase, ...  731  0.0
ref|XP_004058489.1| PREDICTED: arachidonate 12-lipoxygenase, ...  726  0.0
ref|XP_001104192.1| PREDICTED: arachidonate 12-lipoxygenase, ...  714  0.0
ref|XP_003912258.1| PREDICTED: arachidonate 12-lipoxygenase, ...  704  0.0
ref|XP_002806856.1| PREDICTED: LOW QUALITY PROTEIN: arachidon...  686  0.0
ref|XP_001503048.3| PREDICTED: arachidonate 12-lipoxygenase, ...  680  0.0
gb|EHH24438.1| Arachidonate 12-lipoxygenase, 12S-type, partia...  674  0.0
gb|EHH57648.1| Arachidonate 12-lipoxygenase, 12S-type, partia...  674  0.0
ref|XP_003929174.1| PREDICTED: arachidonate 12-lipoxygenase, ...  673  0.0
ref|XP_003791235.1| PREDICTED: arachidonate 12-lipoxygenase, ...  671  0.0
ref|XP_003274578.1| PREDICTED: arachidonate 12-lipoxygenase, ...  666  0.0
Other information besides the name…
 What other information can be found from the sequence or linked to it from various databases?
    3-D structure of the protein
    Genomic context, neighbours
    Is it expressed?
    Under what circumstances is it expressed?
    Which functional domains does the protein include?
What sequence does not tell
 However, function is not always self-evident from the gene sequence
 Some sequence positions are more important than others
    Positions that form the active site of the protein are more critical
 We don't know the interactions
    Protein-protein interactions create complexes, signalling pathways…
 Context
    Function varies between different species
    Function varies between different cell types
Organisms / protein families with little research
Example BLAST results from Melitaea cinxia
(the Glanville fritillary butterfly)
Sequences producing significant alignments:                          Score (bits)  E-value
tr|Q7QF43|Q7QF43_ANOGA AGAP000295-PA OS=Anopheles gambiae GN=AGA...  155  1e-35
tr|B4JK90|B4JK90_DROGR GH12613 OS=Drosophila grimshawi GN=GH1261...  140  5e-31
tr|Q29H35|Q29H35_DROPS GA13573 OS=Drosophila pseudoobscura pseud...  136  8e-30
tr|B4N1J8|B4N1J8_DROWI GK16321 OS=Drosophila willistoni GN=GK163...  135  1e-29
tr|B4PXZ6|B4PXZ6_DROYA GE17857 OS=Drosophila yakuba GN=GE17857 P...  133  5e-29
tr|Q9VZ71|Q9VZ71_DROME CG15211, isoform A OS=Drosophila melanoga...  133  7e-29
tr|B4R7M7|B4R7M7_DROSI GD16987 OS=Drosophila simulans GN=GD16987...  132  9e-29
tr|B4IDT9|B4IDT9_DROSE GM11433 OS=Drosophila sechellia GN=GM1143...  132  9e-29
tr|B3NVE7|B3NVE7_DROER GG18369 OS=Drosophila erecta GN=GG18369 P...  132  9e-29
tr|B3MQD3|B3MQD3_DROAN GF20425 OS=Drosophila ananassae GN=GF2042...  132  9e-29
tr|B4L7Y1|B4L7Y1_DROMO GI11027 OS=Drosophila mojavensis GN=GI110...  132  1e-28
tr|Q16VM8|Q16VM8_AEDAE Putative uncharacterized protein (Fragmen...  131  3e-28
tr|B0WJ18|B0WJ18_CULQU Putative uncharacterized protein OS=Culex...  129  1e-27
tr|C1C2H5|C1C2H5_9MAXI Plasmolipin OS=Caligus clemensi GN=PLLP P...  104  3e-20
tr|Q1DGM2|Q1DGM2_AEDAE Putative uncharacterized protein (Fragmen...  100  8e-19
tr|C3ZW39|C3ZW39_BRAFL Putative uncharacterized protein OS=Branc...   74  6e-11
tr|A3KQ86|A3KQ86_DANRE Novel protein similar to vertebrate trans...   73  1e-10
tr|C3YLB7|C3YLB7_BRAFL Putative uncharacterized protein OS=Branc...   72  1e-10
sp|Q8CJ61|CKLF4_MOUSE CKLF-like MARVEL transmembrane domain-cont...   67  4e-09
tr|A4IFB9|A4IFB9_BOVIN CMTM4 protein OS=Bos taurus GN=CMTM4 PE=2...   67  8e-09
sp|P47987|PLLP_RAT Plasmolipin OS=Rattus norvegicus GN=Pllp PE=1...   66  1e-08
sp|Q9DCU2|PLLP_MOUSE Plasmolipin OS=Mus musculus GN=Pllp PE=2 SV=1    66  1e-08
tr|Q4SI17|Q4SI17_TETNG Chromosome 5 SCAF14581, whole genome shot...   65  2e-08
tr|B1H2E6|B1H2E6_XENTR LOC100145405 protein OS=Xenopus tropicali...   65  3e-08
tr|A7YYE5|A7YYE5_DANRE CKLF-like MARVEL transmembrane domain con...   65  3e-08
tr|C4Q9A0|C4Q9A0_SCHMA Marvel-containing potential lipid raft-as...   65  3e-08
tr|Q6DGM6|Q6DGM6_DANRE CKLF-like MARVEL transmembrane domain con...   64  5e-08
tr|C3ZW41|C3ZW41_BRAFL Putative uncharacterized protein (Fragmen...   64  5e-08
tr|C1BJ60|C1BJ60_OSMMO Plasmolipin OS=Osmerus mordax GN=PLLP PE=...   63  1e-07
sp|Q8IZR5|CKLF4_HUMAN CKLF-like MARVEL transmembrane domain-cont...   62  2e-07
…
[Slide callouts on the hit list: Plasmolipins; Chemokine-like factor superfamily members]
Function found...
 You find out some function...
 Arachidonate 12-lipoxygenase 12S-type…
 Is this understandable??
 Simple googling can help you (Swiss-Prot and Wikipedia pages)
 But what if you want to know a more general function for the gene (sequence) in question?
 Or you have a large number of genes and you would like to know what is common / specific for that set
 => Ontologies!
Ontologies
 “An explicit formal specification of how to represent the objects, concepts, and other entities that are assumed to exist in some area of interest and the relationships that hold among them”
 In plain terms: ontologies represent categories that allow generalizations.
 Some categories are more detailed and some are broader. Stronger vs. weaker generalization.
 Genes are mapped to these categories. The more detailed the information, the more detailed the class.
www.geneontology.org
 The GO is a hierarchical structure for categorizing gene products in terms of their association with:
    1. biological processes
    2. cellular components
    3. molecular functions
  in a species-independent manner
Structure of Gene Ontology
 Hierarchical structure of linked nodes
 Smaller classes: child classes
    Precise, detailed information
 Larger classes: parent classes
    Broad, unspecific information
 Smaller classes belong to larger classes
    Viral protein biosynthesis => Protein biosynthesis => Biosynthesis
[Figure: GO hierarchy, labelled with the root of the hierarchical structure and the starting node]
Structure of Gene Ontology
 Directed Acyclic Graph (DAG)
    This means a tree-like structure where a child node can have links to many parent nodes.
    In other words: branches can split while going towards the root node.
    Example: viral protein biosynthesis is linked to protein biosynthesis and to viral genome expression.
[Figure: GO graph, labelled with the root of the hierarchical structure and the starting node]
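The multi-parent structure can be sketched as a small dictionary-based graph. The sketch below uses only the links named on the slides. The `PARENTS` table and the `ancestors` helper are illustrative assumptions, and "viral genome expression" is given no parents here simply because the slides do not name any.

```python
# Toy GO-like graph: each term maps to its direct parent terms.
# Only the links mentioned on the slides are included (an assumption of this sketch).
PARENTS = {
    "viral protein biosynthesis": ["protein biosynthesis", "viral genome expression"],
    "protein biosynthesis": ["biosynthesis"],
    "viral genome expression": [],
    "biosynthesis": [],
}


def ancestors(term, parents=PARENTS):
    """Return every broader term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:  # a term can be reached via several branches
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("viral protein biosynthesis")))
# ['biosynthesis', 'protein biosynthesis', 'viral genome expression']
```

Because a gene annotated to a detailed class also belongs to all of its ancestor classes, a lookup like this is what allows the same gene to be summarized at a broader level.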
Example of GO
What GO contains
 27.3.2012




36259 terms in total
22331 biological_process
2976 cellular_component
9323 molecular_function
 Tools

http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
 GO Evidence Codes
(http://www.geneontology.org/GO.evidence.shtml)



IEA
TAS
Etc...
Advantages of GO
 Cross species comparison
 Already used by several databases
 Comprehensive
 Many terms per gene product
 Many-to-many relationships possible
 Simplifies querying
 Uses a restricted vocabulary developed by curators and annotators
 Use of evidence codes
    How reliable is the given information?
Advantages of GO
 Some GO terms for Arachidonate 12-lipoxygenase, 12S-type:
    fatty acid oxidation
    cellular component movement
    negative regulation of apoptotic process
    positive regulation of cell adhesion ….
 This information cannot be seen from the sequence name
 GO classes can still be updated after the gene name is fixed
Advantages of GO: Reliability of information
 GO annotations include Evidence Codes
    http://www.geneontology.org/GO.evidence.shtml
 These represent how the information was obtained
 Annotation can be backtracked to its source
 Not all information is equally reliable
    http://www.geneontology.org/GO.evidence.shtml#comp-assigned
 The IEA code (Inferred from Electronic Annotation) marks un-curated annotations
    These usually refer to automated annotations generated from sequence similarity
 On many GO servers, the user can select which evidence codes are accepted in the analysis
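Selecting annotations by evidence code amounts to a simple filter. The sketch below assumes a made-up list of annotations: the gene names `geneA` and `geneB` and their term assignments are hypothetical. Only the codes IEA and TAS (Traceable Author Statement) come from the GO evidence code list.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    gene: str
    go_term: str
    evidence: str  # GO evidence code, e.g. "IEA" or "TAS"


# Hypothetical annotations, invented for this example.
ANNOTATIONS = [
    Annotation("geneA", "fatty acid oxidation", "TAS"),
    Annotation("geneA", "positive regulation of cell adhesion", "IEA"),
    Annotation("geneB", "fatty acid oxidation", "IEA"),
]


def filter_by_evidence(annotations, accepted):
    """Keep only the annotations whose evidence code is in `accepted`."""
    accepted = set(accepted)
    return [a for a in annotations if a.evidence in accepted]


# Accept only curated statements; the un-curated IEA annotations are dropped.
curated = filter_by_evidence(ANNOTATIONS, {"TAS"})
print([(a.gene, a.go_term) for a in curated])
```

Dropping IEA gives fewer but more trustworthy annotations; keeping it gives better coverage, which matters for organisms with little research.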
Advantages of GO with many genes
 Analysis of a large set of sequences for common features
 Viewing gene names & descriptions can be confusing

TAIR_locus  TAIR_description
AT5G10040   unknown protein
AT1G12805   nucleotide binding
AT4G10250   HSP20-like chaperones superfamily protein
AT1G54050   HSP20-like chaperones superfamily protein
AT4G33980   BEST Arabidopsis thaliana protein match is: cold regulated gene 27
AT2G17850   Rhodanese/Cell cycle control phosphatase superfamily protein
AT4G33980   BEST Arabidopsis thaliana protein match is: cold regulated gene 27
AT4G33070   Thiamine pyrophosphate dependent pyruvate decarboxylase family protein
AT4G33560   Wound-responsive family protein
AT1G19530   unknown protein
AT3G07150   unknown protein
ATMG00080   ribosomal protein L16
AT5G12020   17.6 kDa class II heat shock protein
AT1G16030   heat shock protein 70B
AT1G77120   alcohol dehydrogenase 1
AT5G12030   heat shock protein 17.6A
AT5G54165   unknown protein
AT2G14247   Expressed protein
AT3G20810   2-oxoglutarate (2OG) and Fe(II)-dependent oxygenase superfamily protein
Advantages of GO with many genes
 GO allows summarization of frequent features from the list
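Summarizing a gene list by GO class can be as simple as counting how many genes carry each term. The sketch below assumes a hypothetical gene-to-term mapping: the gene names and their term assignments are invented and are not the annotations of the TAIR loci listed in this section.

```python
from collections import Counter

# Hypothetical mapping from gene to its GO terms (invented for this example).
GENE_TO_GO = {
    "gene1": {"response to heat", "protein folding"},
    "gene2": {"response to heat"},
    "gene3": {"response to heat", "response to cold"},
    "gene4": set(),  # a gene with no annotation at all
}


def summarize_terms(gene_to_go):
    """Count, for every GO term, how many genes in the list are annotated to it."""
    counts = Counter()
    for terms in gene_to_go.values():
        counts.update(terms)
    return counts


summary = summarize_terms(GENE_TO_GO)
for term, n in summary.most_common():
    print(f"{n}/{len(GENE_TO_GO)} genes: {term}")
```

Raw counts like these only show what is frequent in the list. Enrichment tools typically go one step further and compare the counts against a background gene set to judge whether a term is over-represented.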
Summary
 Sequence annotation aims at finding informative similar sequences
 Information is collected in the sequence name and in Gene Ontology classes
 Gene Ontology provides supplementary information besides the gene name
 Gene Ontology also allows summarization of information from a larger set of genes