Advanced practical course in genome bioinformatics
DAY 6: Functional annotation
Petri Törönen
Earlier version: Patrik Koskinen
Three sections
• Background
• Methods
• Demonstration of tools
Outline of Background
• Goal
• How function can be defined
– Description
– Gene Ontology
• What can be used to predict function
• What I am omitting here
• Why these two matter
GOAL
• You have an unknown protein sequence
• It should be functionally annotated
– In wet lab (precise, slow)
– In silico (less precise, faster)
• With thousands of sequences, the wet lab is not an option
– Collective annotation of sequences (hundreds of scientists)
– Combined use of a few in-silico methods
Examples (Good, Bad)
• Good: >NP_001602.1 tartrate-resistant acid phosphatase type 5 precursor [Homo sapiens]
• Bad: >EFB24196.1 hypothetical protein PANDA_021498, partial [Ailuropoda melanoleuca] (GenBank: EFB24196.1)
How function is defined
• How can we describe a function for a gene?
How function is defined
• Functional description as human-readable text
• Linking gene to keywords (UniProt)
• Linking gene to Gene Ontology classes
• Linking gene to enzyme categories
• Linking gene to signalling pathways or biochemical pathways (KEGG)
• Linking domain to functional activity
Focus on description and on Gene Ontology
Human readable descriptions
• tartrate-resistant acid phosphatase type 5 precursor
[Homo sapiens]
Summary: This gene encodes an iron containing
glycoprotein which catalyzes the conversion of
orthophosphoric monoester to alcohol and
orthophosphate. It is the most basic of the acid
phosphatases and is the only form not inhibited by L(+)-tartrate. [provided by RefSeq, Aug 2008]
Gene Ontology (GO)
• GO is currently a popular standard in gene annotation
• GO consists of categories that represent gene function
• Creates a union of the genes in the same process
• Easy summary for genes with similar function
• Easier to predict than text descriptions
Gene Ontology (GO)
• 3 sub-parts: Biological Process, Molecular Function, Cellular Component
– Molecular Function => chemical activity
– Biological Process => biology, cellular process
– Cellular Component => location of the gene product in the cell
• Hierarchical structure
– Categories with very precise function
– Categories with less precise function
– Categories with very broad function
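The hierarchy means that a gene annotated to a precise category also belongs to every broader category above it. A minimal sketch of that upward propagation, assuming a simplified toy hierarchy (the parent links below are illustrative and do not reproduce the actual GO graph):

```python
# Toy hierarchy: child category -> list of broader parent categories.
# Simplified for illustration; the real GO graph has more intermediate levels.
PARENTS = {
    "acid phosphatase activity": ["phosphatase activity"],
    "phosphatase activity": ["hydrolase activity"],
    "hydrolase activity": ["catalytic activity"],
    "catalytic activity": [],
}


def with_ancestors(term, parents=PARENTS):
    """Return the term together with every broader term reachable from it."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


# A gene annotated to the most precise category is implicitly
# annotated to the less precise and the very broad ones as well.
implied = with_ancestors("acid phosphatase activity")
```

Walking the parent links with a stack also works when a category has several parents, which is the usual situation in GO.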
Gene Ontology
• tartrate-resistant acid phosphatase type 5 precursor
[Homo sapiens]
Advantages of GO
• Cross species comparison
– Already used by several databases
• Comprehensive
– GO covers all biological and chemical processes
– Many terms per gene product
• Simplifies querying
– Uses a restricted vocabulary developed by curators and annotators
• Use of evidence codes
– How reliable is the given information
Methods outline
• Stupid method
• What can be used to predict function
• Current in-silico methods
• Method comparisons
Stupid method
• Run BLAST search
• Take the first sequence hit
Stupid functional annotation
• Traditional way to go: nearest neighbour
[Figure: the query sequence in the search space, surrounded by a threshold (e.g. BLAST E-value 1e-5)]
Example BLAST results from Melitaea cinxia
(the Glanville fritillary butterfly)
Sequences producing significant alignments:                          Score (bits)  E-value
tr|Q7QF43|Q7QF43_ANOGA AGAP000295-PA OS=Anopheles gambiae GN=AGA...  155  1e-35
tr|B4JK90|B4JK90_DROGR GH12613 OS=Drosophila grimshawi GN=GH1261...  140  5e-31
tr|Q29H35|Q29H35_DROPS GA13573 OS=Drosophila pseudoobscura pseud...  136  8e-30
tr|B4N1J8|B4N1J8_DROWI GK16321 OS=Drosophila willistoni GN=GK163...  135  1e-29
tr|B4PXZ6|B4PXZ6_DROYA GE17857 OS=Drosophila yakuba GN=GE17857 P...  133  5e-29
tr|Q9VZ71|Q9VZ71_DROME CG15211, isoform A OS=Drosophila melanoga...  133  7e-29
tr|B4R7M7|B4R7M7_DROSI GD16987 OS=Drosophila simulans GN=GD16987...  132  9e-29
tr|B4IDT9|B4IDT9_DROSE GM11433 OS=Drosophila sechellia GN=GM1143...  132  9e-29
tr|B3NVE7|B3NVE7_DROER GG18369 OS=Drosophila erecta GN=GG18369 P...  132  9e-29
tr|B3MQD3|B3MQD3_DROAN GF20425 OS=Drosophila ananassae GN=GF2042...  132  9e-29
tr|B4L7Y1|B4L7Y1_DROMO GI11027 OS=Drosophila mojavensis GN=GI110...  132  1e-28
tr|Q16VM8|Q16VM8_AEDAE Putative uncharacterized protein (Fragmen...  131  3e-28
tr|B0WJ18|B0WJ18_CULQU Putative uncharacterized protein OS=Culex...  129  1e-27
tr|C1C2H5|C1C2H5_9MAXI Plasmolipin OS=Caligus clemensi GN=PLLP P...  104  3e-20
tr|Q1DGM2|Q1DGM2_AEDAE Putative uncharacterized protein (Fragmen...  100  8e-19
tr|C3ZW39|C3ZW39_BRAFL Putative uncharacterized protein OS=Branc...   74  6e-11
tr|A3KQ86|A3KQ86_DANRE Novel protein similar to vertebrate trans...   73  1e-10
tr|C3YLB7|C3YLB7_BRAFL Putative uncharacterized protein OS=Branc...   72  1e-10
sp|Q8CJ61|CKLF4_MOUSE CKLF-like MARVEL transmembrane domain-cont...   67  4e-09
tr|A4IFB9|A4IFB9_BOVIN CMTM4 protein OS=Bos taurus GN=CMTM4 PE=2...   67  8e-09
sp|P47987|PLLP_RAT Plasmolipin OS=Rattus norvegicus GN=Pllp PE=1...   66  1e-08
sp|Q9DCU2|PLLP_MOUSE Plasmolipin OS=Mus musculus GN=Pllp PE=2 SV=1    66  1e-08
tr|Q4SI17|Q4SI17_TETNG Chromosome 5 SCAF14581, whole genome shot...   65  2e-08
tr|B1H2E6|B1H2E6_XENTR LOC100145405 protein OS=Xenopus tropicali...   65  3e-08
tr|A7YYE5|A7YYE5_DANRE CKLF-like MARVEL transmembrane domain con...   65  3e-08
tr|C4Q9A0|C4Q9A0_SCHMA Marvel-containing potential lipid raft-as...   65  3e-08
tr|Q6DGM6|Q6DGM6_DANRE CKLF-like MARVEL transmembrane domain con...   64  5e-08
tr|C3ZW41|C3ZW41_BRAFL Putative uncharacterized protein (Fragmen...   64  5e-08
tr|C1BJ60|C1BJ60_OSMMO Plasmolipin OS=Osmerus mordax GN=PLLP PE=...   63  1e-07
sp|Q8IZR5|CKLF4_HUMAN CKLF-like MARVEL transmembrane domain-cont...   62  2e-07
…
Top hit: AGAP000295-PA
Is this an informative annotation to adopt?
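The nearest-neighbour approach is short enough to write out. A minimal sketch, assuming each hit is a (sequence id, description, E-value) tuple; the three hits are copied from the example result list in these slides, while the function name and the tuple layout are made up for illustration:

```python
def nearest_neighbour_annotation(hits, evalue_threshold=1e-5):
    """Copy the description of the best hit that passes the E-value threshold.

    `hits` is a list of (sequence_id, description, evalue) tuples.
    Returns None when no hit passes the threshold.
    """
    accepted = [hit for hit in hits if hit[2] <= evalue_threshold]
    if not accepted:
        return None
    best = min(accepted, key=lambda hit: hit[2])
    return best[1]


hits = [
    ("tr|Q7QF43|Q7QF43_ANOGA", "AGAP000295-PA", 1e-35),
    ("tr|C1C2H5|C1C2H5_9MAXI", "Plasmolipin", 3e-20),
    ("sp|Q8CJ61|CKLF4_MOUSE", "CKLF-like MARVEL transmembrane domain-cont...", 4e-09),
]
print(nearest_neighbour_annotation(hits))
```

For these hits the function returns the locus tag of the top hit, AGAP000295-PA, and ignores the more informative descriptions further down the list.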
Example BLAST results from Melitaea cinxia
(the Glanville fritillary butterfly)
[Same BLAST result list as in the previous example, with two groups of lower-ranked hits highlighted]
• Plasmolipins
• Chemokine-like factor superfamily members
Why is the traditional method stupid?
• Blind copying of the nearest-neighbour annotation creates errors in database annotations
Errors in public databases
Schnoes AM, Brown SD, Dodevski I, Babbitt PC, 2009. Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies. PLoS Computational Biology 5(12)
Errors in databases
• So not all annotations in the databases are correct
• What can be done?
Rational Functional annotation
• Collect some features for the analyzed sequence
• Compare these features to the features of known sequences
• Estimate the function based on the similarity with many sequences
What can be used to predict function
• Where can computer programs and researchers get information about the sequence?
Function Prediction: What can we use to predict
function
This lecture discusses sequence-based features!
• Sequence homology (BLAST result list)
• Phylogenetic tree of sequences
• Protein domains (PFAM domains)
• Short sequence patterns – motifs
• Sequence features
Function Prediction: What I am omitting
• Literature search
• Gene expression (in tissues, development...)
• Protein-protein interactions
• Genetic interactions
These are also important sources, but only if they are available.
Why the two previous groups matter
• Assume you are predicting genes for neuronal growth in a mammal
• Assume you are predicting enzyme functions
Sequence Homology Methods
• Do a BLAST search with a query sequence
• Collect GO classes for the genes in the BLAST result list
• Give a weight to each BLAST hit
– often -log(E-value)
• Combine the scores from the genes that belong to the same GO class
• Report the top-scoring / significant GO classes
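The steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the hit identifiers and GO class labels are hypothetical, the weight is -log10(E-value), and scores are combined by simple summation; the published tools use more elaborate weighting and statistics.

```python
import math
from collections import defaultdict


def score_go_classes(hits, go_annotations):
    """Rank GO classes by the summed -log10(E-value) of the hits carrying them.

    hits: list of (hit_id, evalue) pairs from a BLAST result list.
    go_annotations: dict mapping hit_id -> iterable of GO classes.
    """
    scores = defaultdict(float)
    for hit_id, evalue in hits:
        # BLAST can report 0.0 for very strong hits; clamp to keep log10 defined.
        weight = -math.log10(max(evalue, 1e-180))
        for go_class in go_annotations.get(hit_id, ()):
            scores[go_class] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical hits and annotations, for illustration only.
example_hits = [("hit_a", 1e-30), ("hit_b", 1e-20), ("hit_c", 1e-10)]
example_go = {"hit_a": ["GO:X"], "hit_b": ["GO:X", "GO:Y"], "hit_c": ["GO:Y"]}
ranked = score_go_classes(example_hits, example_go)
```

Here GO:X collects weights of about 30 + 20 and GO:Y about 20 + 10, so GO:X is reported first. Hits without any GO annotation simply contribute nothing.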
Example BLAST results re-visited
[Same BLAST result list as before; the hits with informative descriptions fall into two groups]
• Plasmolipins
• Chemokine-like factor superfamily members
Sequence Homology Methods
• Simple method
• Can fail to detect some similarities
• Programs
– BLAST2GO (http://www.blast2go.com/b2ghome)
– GOTCHA (http://www.compbio.dundee.ac.uk/gotcha/gotcha.php)
– ARGOT2 (http://www.medcomp.medicina.unipd.it/Argot2/form.php)
– PFP (http://kiharalab.org/web/pfp.php)
– PANNZER (http://ekhidna2.biocenter.helsinki.fi/sanspanz/)
Phylogenetic tree methods
• Create the pair-wise distances for the set of genes
• Do a hierarchical clustering of the genes
• Map the known GO functions to the cluster tree
• Look for unknown genes in a cluster with many genes from the same GO class
• Report the top-scoring / significant GO classes
• More => http://genome.cshlp.org/content/8/3/163.full
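A rough sketch of the clustering idea, not of a real phylogenetic method: it assumes precomputed pair-wise distances, replaces the tree by a single-linkage clustering cut at a fixed distance, and lets each unannotated gene inherit the most common GO class of its cluster. Gene names, distances, GO labels and the cutoff are all hypothetical.

```python
from collections import Counter


def single_linkage_clusters(genes, distance, cutoff):
    """Cut a single-linkage tree at `cutoff`: genes connected by a chain of
    pair-wise distances <= cutoff end up in the same cluster."""
    parent = {gene: gene for gene in genes}

    def find(gene):
        # Follow parent links to the cluster representative (with path halving).
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            if distance[frozenset((a, b))] <= cutoff:
                parent[find(a)] = find(b)

    clusters = {}
    for gene in genes:
        clusters.setdefault(find(gene), []).append(gene)
    return list(clusters.values())


def predict_from_clusters(clusters, known_go):
    """Give each unannotated gene the most common GO class of its cluster."""
    predictions = {}
    for cluster in clusters:
        counts = Counter(known_go[g] for g in cluster if g in known_go)
        if not counts:
            continue  # nothing known in this cluster, so nothing to transfer
        top_class = counts.most_common(1)[0][0]
        for gene in cluster:
            if gene not in known_go:
                predictions[gene] = top_class
    return predictions


# Hypothetical data: two tight groups, with the query close to g1 and g2.
genes = ["g1", "g2", "g3", "g4", "query"]
distance = {frozenset((a, b)): 0.9 for i, a in enumerate(genes) for b in genes[i + 1:]}
for pair in [("g1", "g2"), ("g1", "query"), ("g2", "query"), ("g3", "g4")]:
    distance[frozenset(pair)] = 0.2
known_go = {"g1": "GO:A", "g2": "GO:A", "g3": "GO:B", "g4": "GO:B"}

clusters = single_linkage_clusters(genes, distance, cutoff=0.5)
predictions = predict_from_clusters(clusters, known_go)
```

With these numbers the query falls into the cluster of g1 and g2 and receives GO:A. Dedicated tools work on a proper phylogeny and model how function changes along its branches, which this sketch does not attempt.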
Phylogenetic tree methods
• These should outperform sequence homology methods
• Require a set of related genes
• Often much heavier calculations
• Programs:
– SIFTER (http://genome.cshlp.org/content/early/2011/07/22/gr.104687.109)
Prediction with Protein domains
• Look up which protein domains the query protein contains (PFAM)
• Map the functions that are linked to the domains to your query sequence
– PFAM2GO
• Programs: InterProScan + PFAM2GO
• Drawbacks:
– The mapping is the same in plants, mammals and bacteria
– Many domains to specific function
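The domain-to-function mapping amounts to a table lookup. A minimal sketch, assuming the mapping is given as text lines of the form `Pfam:<accession> <name> > GO:<term name> ; <GO id>` with `!` marking comment lines; the sample lines are illustrative and should be checked against the current mapping file, and the query domains are made up:

```python
def parse_pfam2go(lines):
    """Parse mapping lines into {pfam_accession: [(go_id, go_name), ...]}."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        left, right = line.split(" > ", 1)
        pfam_id = left.split()[0].replace("Pfam:", "")
        go_name, go_id = [part.strip() for part in right.rsplit(";", 1)]
        if go_name.startswith("GO:"):
            go_name = go_name[len("GO:"):]
        mapping.setdefault(pfam_id, []).append((go_id, go_name))
    return mapping


def go_terms_for_domains(domains, mapping):
    """Collect the GO terms linked to the domains found in a query protein."""
    terms = []
    for domain in domains:
        for term in mapping.get(domain, []):
            if term not in terms:
                terms.append(term)
    return terms


# Illustrative lines in the assumed format.
sample_lines = [
    "! illustrative comment line",
    "Pfam:PF00001 7tm_1 > GO:G protein-coupled receptor activity ; GO:0004930",
    "Pfam:PF00001 7tm_1 > GO:G protein-coupled receptor signaling pathway ; GO:0007186",
]
mapping = parse_pfam2go(sample_lines)

# Hypothetical query: one mapped domain and one domain without a mapping.
terms = go_terms_for_domains(["PF00001", "PF99999"], mapping)
```

The lookup never consults the species of the query, which is exactly the first drawback listed above.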
Prediction with Protein domains
• Benefits:
– Can create an annotation from separate domains
– Similar sequences do not have to be in the database
• Programs: InterProScan
(http://www.ebi.ac.uk/InterProScan/)
• Drawbacks:
– The mapping is the same in plants, mammals and bacteria
– Many domains to specific function
Our contribution: PANNZER
• Uses the BLAST result list
• Adds taxonomic information
• Scores GO classes with a score that takes the frequency of the GO class in the sequence database into account
• The method is used to predict:
– GO classes
– Description line
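Why the background frequency matters can be shown with a toy score. The sketch below is not PANNZER's scoring function; it assumes a plain log2 ratio between the frequency of a GO class among the BLAST hits and its frequency in the whole sequence database, and all counts are invented.

```python
import math


def enrichment_score(n_hits_with_class, n_hits, n_db_with_class, n_db):
    """log2 ratio of the class frequency among the hits to its database frequency."""
    if n_hits_with_class == 0:
        return float("-inf")  # class not seen among the hits at all
    freq_hits = n_hits_with_class / n_hits
    freq_db = n_db_with_class / n_db
    return math.log2(freq_hits / freq_db)


# Both classes occur in 8 of 10 hits, but one is common in the database
# (half of all sequences) and the other is rare (one in a thousand).
common = enrichment_score(8, 10, 500_000, 1_000_000)
rare = enrichment_score(8, 10, 1_000, 1_000_000)
```

The rare class gets a much higher score than the common one although both appear equally often among the hits: a very broad class is expected in almost any result list and says little about the query.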
Our contribution: PANNZER
• Benefits:
– Taking the species taxonomy into account
– Improved use of statistics
• Drawbacks:
– Uses only sequence similarity
Method comparisons
• Many annotation methods are available
• All methods claim to be the best available
• Which methods are really the best?
Critical Assessment of Functional Annotation (CAFA)
• Select a set of unknown genes
• Ask research groups to predict GO terms
• After the deadline, start collecting new annotations for the genes
• Then evaluate the methods against the new annotations
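The evaluation step can be illustrated with the maximum F-measure over score thresholds (Fmax). This is a simplified sketch under stated assumptions: every prediction carries a score between 0 and 1, each protein in the benchmark has at least one true term, and the terms are compared as plain sets without propagating them up the GO hierarchy. Protein and term names are hypothetical.

```python
def f_max(predictions, truth, thresholds=None):
    """Maximum F-measure over score thresholds.

    predictions: {protein: {go_term: score in [0, 1]}}
    truth: {protein: non-empty set of experimentally confirmed go_terms}
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for threshold in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, score in scored.items() if score >= threshold}
            correct = len(predicted & true_terms)
            if predicted:
                # Precision is averaged only over proteins with a prediction.
                precisions.append(correct / len(predicted))
            # Recall is averaged over all benchmark proteins.
            recalls.append(correct / len(true_terms))
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = sum(recalls) / len(recalls)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Hypothetical benchmark with two proteins.
example_predictions = {"P1": {"A": 0.9, "B": 0.4}, "P2": {"C": 0.8}}
example_truth = {"P1": {"A"}, "P2": {"C", "D"}}
score = f_max(example_predictions, example_truth)
```

In this example the best trade-off is reached at thresholds above 0.4 and up to 0.8, where precision is 1 and recall is 0.75, giving an Fmax of 6/7.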
Critical Assessment of Functional Annotation (CAFA)
Radivojac et al. A large-scale evaluation of computational protein function prediction. Nature Methods. 2013 Jan 27. doi: 10.1038/nmeth.2340.
CAFA 1
• Most successful methods
– JonesUCL
• http://www.slideshare.net/idoerg/david-jones-afpcafa2011
– Argot2
• http://www.medcomp.medicina.unipd.it/Argot2/
– Pannzer
• http://ekhidna.biocenter.helsinki.fi/pannzer
Method demonstrations
• InterProScan
• Argot
• Pannzer
Demo sequences
• 5 sequences
• 3 from eukaryotes, 2 from prokaryotes
• All are annotated as unknown
• They can still be annotated based on the sequence
InterProScan
• http://www.ebi.ac.uk/interpro/search/sequence-search
• InterProScan is a metaserver that looks at many sequence features
• Many of these features can be used to annotate sequences
• You have to give one sequence at a time
InterProScan
Results
• http://www.ebi.ac.uk/interpro/sequencesearch/iprscan5S20161208-163056-0935-89512669-oy
• http://www.ebi.ac.uk/interpro/sequencesearch/iprscan5S20161208-163142-0955-53752333-oy
• http://www.ebi.ac.uk/interpro/sequencesearch/iprscan5S20161208-163259-0726-8535830-oy
• http://www.ebi.ac.uk/interpro/sequencesearch/iprscan5S20161208-163327-0153-62340367-oy
• http://www.ebi.ac.uk/interpro/sequencesearch/iprscan5S20161208-163413-0491-69491884-pg
Argot2
• http://www.medcomp.medicina.unipd.it/Argot2/
• Tool that processes BLAST output
• One of the best methods in CAFA 1
Argot2
• Results: http://www.medcomp.medicina.unipd.it/Argot2/getStatus.php?js=12196
PANNZER2
• http://ekhidna2.biocenter.helsinki.fi/sanspanz/
• Processes results similar to BLAST
• Predicts text and Gene Ontology classes
• Significantly faster than the other tools
Conclusion
• These methods are increasingly needed
• Some methods exist
• Unfortunately there is no clear evaluation
• Remember: these are predictions. There is no certain information until they are tested in the wet lab…