COMPARATIVE GENOMICS & METABOLISM
COMPARATIVE GENOMICS AND METABOLISM 2013
USING COMPARATIVE GENOMICS RESOURCES
Comparative genomics can:
● Find genes for functions (i.e., the function is known to exist, but the gene specifying it has not been
identified)
● Find functions for genes (i.e., the gene is known from the genome, but its function is not)
It operates on the ‘guilt by association’ principle – ‘Show me your friends and I’ll tell you who you are’ or
‘Birds of a feather flock together’
[Figure: types of genomic and post-genomic evidence for functional association. Panel labels: gene clustering; gene fusion; shared regulatory sites; phylogenetic occurrence; co-expression; protein-protein interactions; organelle proteomes; essentiality & other phenome data; structures.]
• STRING http://string.embl.de/
Functional relationships between proteins can often be inferred from genomic associations between the
genes that encode them: groups of genes involved in the same pathway tend to be close together
(clustered) in prokaryote genomes (often in operons), to be involved in gene-fusion events, and to show
similar species coverage.
STRING is a precomputed database for exploring functional relationships between proteins (clustering,
fusions, co-occurrence, etc.). STRING gives an integrated confidence score for the associations it predicts.
It is often the best database with which to begin a comparative genomics project.
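As a rough illustration of how evidence from several channels can be folded into one integrated score, the sketch below combines per-channel scores as 1 - product(1 - s). This is a simplification made for teaching purposes: it assumes the channels are independent, and it leaves out the prior-probability correction that STRING's own scoring is understood to apply. The channel names and values are hypothetical.

```python
from math import prod

def combined_score(channel_scores):
    """Combine per-channel confidence scores (each in [0, 1]).

    Assumes the channels are independent, so the probability that at
    least one channel reflects a true association is 1 - prod(1 - s).
    This is a simplified stand-in, not STRING's exact formula.
    """
    for name, score in channel_scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name!r} must be in [0, 1]")
    return 1.0 - prod(1.0 - s for s in channel_scores.values())

# Hypothetical scores for one protein pair (not taken from STRING).
evidence = {"neighborhood": 0.6, "fusion": 0.5, "cooccurrence": 0.2}
print(round(combined_score(evidence), 3))

# Turning a channel off (as in the 'Info & Parameters' box) lowers the score.
without_fusion = {k: v for k, v in evidence.items() if k != "fusion"}
print(round(combined_score(without_fusion), 3))
```

Several weak but independent lines of evidence can therefore add up to a strong combined score, which is why it is worth checking which channels contribute before trusting an association.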
Example 1 – Rediscovering Nudix enzyme FolQ in Lactococcus lactis: Enter via a protein
name, view associations among proteins with that name.
* Enter FolP (dihydropteroate synthase, EC 2.5.1.15), a key enzyme of pterin and folate synthesis
* Select Lactococcus lactis MG1363 from the organism list (results are similar but not identical using other
species) * Click Go!
* Displays ‘Evidence View’ - different line colors represent types of association (clustering on chromosome
(= Neighborhood), co-occurrence, co-expression, protein-protein interactions, etc)
* Spheres are gene products, can be dragged to disentangle networks * Filled spheres have protein
structures, all spheres are clickable for more information
* Click ‘Confidence View’ - stronger associations are represented by thicker lines
* Note strong associations between the set of folate synthesis enzymes – FolP, FolB, FolC, HPPK – & FolQ
(before FolQ was known it was included in the network as unknown protein YlgG)
* Can turn off each kind of evidence (e.g. Databases, Textmining) in ‘Info & Parameters’ box
* To see more interactions, Click ‘+ More’ button, or expand list in pull-down menu e.g. to 50 interactors
* In ‘Evidence View’ screen, click on bullets in the table for more information, e.g. HPPK (=FolK) bullet in
‘Neighborhood’ → shows linkage between FolP & FolK in diverse genomes.
Notes: STRING data come from many genomes, not just the one used to enter the system (L.
lactis in this case)
The more diverse the genomes, the more probable it is that the linkage represents a
functional relationship
* Gene fusions – for the HPPK protein, click on ‘Gene Fusion’ bullet → Shows HPPK is fused to FolP in
various organisms, including Arabidopsis
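The note above, that linkage seen in more diverse genomes is more likely to be functional, can be made concrete with a small sketch. It counts the number of distinct phyla, rather than the number of genomes, in which a gene pair is found clustered. The genome names, phylum assignments, and observations are hypothetical and serve only to show the bookkeeping.

```python
from collections import defaultdict

# Hypothetical observations: (genome, phylum, gene pairs found adjacent).
OBSERVATIONS = [
    ("genome_1", "Firmicutes", {("folP", "folK")}),
    ("genome_2", "Firmicutes", {("folP", "folK"), ("folP", "ylgG")}),
    ("genome_3", "Proteobacteria", {("folP", "folK")}),
    ("genome_4", "Actinobacteria", {("folK", "folP")}),
    ("genome_5", "Firmicutes", {("folP", "ylgG")}),
]

def phyla_per_pair(observations):
    """Map each gene pair to the set of phyla in which it is clustered."""
    phyla = defaultdict(set)
    for _genome, phylum, pairs in observations:
        for pair in pairs:
            # Sort so that (a, b) and (b, a) count as the same pair.
            phyla[tuple(sorted(pair))].add(phylum)
    return dict(phyla)

def rank_pairs(observations):
    """Rank pairs by number of distinct phyla, most diverse first."""
    counts = phyla_per_pair(observations)
    return sorted(counts.items(), key=lambda kv: (-len(kv[1]), kv[0]))

for pair, phyla in rank_pairs(OBSERVATIONS):
    print(pair, len(phyla), sorted(phyla))
```

In this toy data the second pair is seen in two genomes, but both belong to one phylum, so it ranks below the pair seen across three phyla.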
Example 2 – Predicting possible functions for the At4g26860 protein:
>At4g26860 [Arabidopsis thaliana]
MAAPAVEATVASALRSVILRARKAAEQVGRDPERVRVLPVSKTKPVSLIRQIYDAGHRCFGENYVQEIID
KAPQLPEDIEWHFVGHLQSNKAKTLLTGVPNLAMVHGVDGEKVANHLDRAVSNLGRHPLKVLVQVNTSGE
VSKSGIEPSSVVELARHVKHHCPNLVFSGLMTIGMPDYTSTPENFRTLSNCRADVCKALGMAEDQFELSM
GMSGDFELAIEMGSTNVRVGSTIFGPREYPKKTT
* Suppose that we are investigating proline biosynthesis. Using ATTED, we find that the expression of
At1g23310 GGT1, a peroxisomal glutamate:glyoxylate aminotransferase that is functionally linked to
proline synthesis, is very strongly correlated with expression of At4g26860, which encodes a protein of
unknown function in Arabidopsis (annotated ‘pyridoxal phosphate binding, alanine racemase family protein,
or putative proline synthetase associated protein’ in GenBank).
* TargetP and Predotar indicate plastid targeting of At4g26860
* BLASTp search of bacteria in GenBank → Conserved domain search indicates uncharacterized member
of the alanine racemase family (pyridoxal phosphate-containing) → Best hits (e-40 or better), e.g.
Geobacter sp. M21, Vibrio harveyi
* Go to STRING, BLAST, select Geobacter sp. M21 or Vibrio harveyi * Note extremely strong clustering
with the proline biosynthesis enzyme pyrroline-5-carboxylate reductase, ProC
* Note that pyrroline-5-carboxylate reductase is reported to be plastidial in plants, i.e. that it is in the same
subcellular compartment as At4g26860.
* Therefore one possible functional prediction is that At4g26860 participates in proline biosynthesis.
MetaCyc shows that one step in proline biosynthesis (the cyclization of glutamic acid γ-semialdehyde to
give Δ1-pyrroline-5-carboxylate) is considered to be spontaneous – could At4g26860 accelerate this
reaction?
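The co-expression step in this example rests on correlating expression profiles across many conditions. The sketch below computes a plain Pearson correlation between two profiles. ATTED's own co-expression measure may differ from this, and the expression values are hypothetical, not ATTED data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same number of samples")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)

# Hypothetical expression values across six conditions.
bait = [2.0, 3.5, 5.1, 4.0, 6.8, 7.9]       # stands in for At1g23310 (GGT1)
candidate = [1.1, 2.0, 3.2, 2.4, 4.1, 5.0]  # stands in for At4g26860
unrelated = [5.0, 4.8, 5.1, 5.2, 4.9, 5.0]  # a gene with a nearly flat profile

print(round(pearson(bait, candidate), 3))
print(round(pearson(bait, unrelated), 3))
```

A high correlation only nominates a candidate. The example still relies on the targeting, homology, and clustering evidence to turn the candidate into a functional prediction.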
Example 3 – Predicting possible functions for the At3g13050 protein:
>At3g13050 [Arabidopsis thaliana]
MADGNTRFTVDEALVAMGFGKFQIYVLAYAGMGWVAEAMEMMLLSFVGPAVQSLWNLSARQESLITSVVF
AGMLIGAYSWGIVSDKHGRRKGFIITAVVTFVAGFLSAFSPNYMWLIILRCLVGLGLGGGPVLASWYLEF
IPAPSRGTWMVVFSAFWTVGTIFEASLAWLVMPRLGWRWLLAFSSVPSSLLLLFYRWTSESPRYLILQGR
KAEALAILEKIARMNKTQLPPGVLSSELETELEENKNIPTENTHLLKAGESGEAVAVSKIVLKADKEPGF
SLLALLSPTLMKRTLLLWVVFFGNAFAYYGVVLLTTELNNSHNRCYPTEKQLRNSNDVNYRDVFIASFAE
FPGLLISAAMVDRLGRKASMASMLFTCCIFLLPLLSHQSPFITTVLLFGGRICISAAFTVVYIYAPEIYP
TAVRTTGVGVGSSVGRIGGILCPLVAVGLVHGCHQTIAVLLFEVVILVSGICVCLFPFETSGRDLTDSIS
ASKEPPSASV
* Suppose that we are investigating NAD(P) biosynthesis. Using ATTED, we find that the expression of
NADP synthesis enzyme At3g21070 NADK1 (a cytosolic isoform of NAD kinase) is positively correlated with
expression of At3g13050, which encodes a protein of unknown function in Arabidopsis (annotated
‘transporter-related’ in GenBank).
* TMHMM search shows that At3g13050 has multiple membrane-spanning domains
* BLASTp search of bacteria in GenBank → Conserved domain search indicates At3g13050 is a major
facilitator superfamily (MFS) transporter → Best hits (e-50 or better) include Deinococcus geothermalis,
Deinococcus deserti, and Deinococcus radiodurans
* Go to STRING, BLAST, select Deinococcus geothermalis * Note clustering with 3 enzymes of NAD
synthesis and 5 enzymes of thiamin synthesis
(Similar but not identical results with D. deserti or D. radiodurans. Note that it is important to try
more than one organism as an entry point to the system.)
* Click on Neighborhood bullets to see the organisms in which the clustering with NAD or thiamin synthesis
genes occurs
* Therefore a functional prediction is that At3g13050 transports NAD or thiamin, or a precursor of NAD or
thiamin
* At3g13050 and various bacterial homologs are now known to transport the NAD precursor nicotinic acid.
The Thermus thermophilus homolog is known to transport thiamin.
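The TMHMM step in this example asks whether a protein has membrane-spanning segments. TMHMM itself uses a hidden Markov model. The sketch below swaps in a much simpler technique, a Kyte-Doolittle sliding-window hydropathy scan, only to show the idea. The window of 19 residues and the threshold of 1.6 are commonly quoted values taken here as assumptions, and the test sequence is artificial.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every window of `window` residues."""
    # Unknown residues (e.g. X) score 0.0; this is an assumption.
    values = [KD.get(aa, 0.0) for aa in seq.upper()]
    if len(values) < window:
        return []
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def hydrophobic_segments(seq, window=19, threshold=1.6):
    """Return 1-based (start, end) ranges covered by runs of
    consecutive windows whose mean hydropathy is >= threshold."""
    profile = hydropathy_profile(seq, window)
    segments = []
    start = None
    for i, score in enumerate(profile):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The last passing window began at index i - 1.
            segments.append((start + 1, i - 1 + window))
            start = None
    if start is not None:
        segments.append((start + 1, len(profile) - 1 + window))
    return segments

# Artificial sequence: a 25-residue leucine stretch flanked by aspartates.
toy = "D" * 20 + "L" * 25 + "D" * 20
print(hydrophobic_segments(toy))
```

Because each reported range covers whole windows, it extends a few residues beyond the hydrophobic stretch itself. A dedicated predictor such as TMHMM is the better choice for real annotation.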
• The SEED http://pubseed.theseed.org/seedviewer.cgi
PubSEED (part of the SEED family of databases) is a versatile tool to investigate functional relationships
between genes. Unlike STRING, it is not rigidly precomputed; the user has more control. To explore SEED,
we will use the At3g13050 example above, and then use SEED to search for fusions to known enzymes.
Predicting possible functions for the At3g13050 protein:
* Go to PubSEED, Click ‘Navigate’ tab, select BLAST search * Paste At3g13050 sequence into box →
Select Deinococcus geothermalis (either entry) * Best hit 2e-61 * Click on link * Opens Annotation Overview
page (the ‘Facebook page’ for the gene) (protein is annotated ‘Niacin transporter NiaP’ – note that this is a
prediction) * Note links to KEGG, to Psi-BLAST etc
* The ‘Compare Regions’ tool displays the chromosome region around the D. geothermalis niaP query
gene, and those around the four closest homologs of the query gene
* Similar genes have the same color → Hover over to see annotations * Note that niaP is in a cluster of
NAD synthesis genes in D. geothermalis and its relatives
* Click on ‘Advanced’, expand number of regions to 400, relax both the cutoffs to 1e-10, click ‘Draw’
* Regions around homologs of the query gene are displayed from hundreds of genomes. The genes are
numbered in order of decreasing frequency of occurrence, 1 being the query gene, 2 being the most often
clustered, 3 being the next most often etc.
* Note NAD-related gene clusters also in Pyrobaculum islandicum (NAD kinase) and Thermotoga spp.
(Transcriptional repressor for NAD biosynthesis) (3/4 of way down page)
* Note thiamin-related gene clusters in Thermus thermophilus (1/3 way down page) and Pyrobaculum
islandicum (3/4 way down page; second cluster, next to NAD-related one)
* Phylogeny of selected bacterial and plant proteins (and Bacillus subtilis, Acinetobacter sp. NiaP, shown to
transport niacin) places the plant genes closest to a gene clustered with NAD genes:
Niacin Bacillus subtilis
Niacin Acinetobacter sp.
Niacin Pyrobaculum islandicum
Thiamine Thermus thermophilus
Choline Burkholderia xenovorans
Niacin Thermotoga maritima
At3g13050 Arabidopsis thaliana
Maize GRMZM2G066801
Niacin Deinococcus geothermalis
Thiamine Pyrobaculum islandicum
* Therefore, as with STRING, SEED prediction favors niacin transport, or transport of thiamin or its
precursors. Both these predictions have now been validated experimentally.
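The numbering described for the expanded Compare Regions display (1 for the query gene, 2 for the gene most often clustered with it, and so on) can be sketched as a simple frequency count. How PubSEED computes this internally is not stated here, so the code follows only the description in the text. The regions and gene family labels are hypothetical.

```python
from collections import Counter

def number_by_frequency(regions, query):
    """Number gene families by how often they share a region with the query.

    regions: list of lists of gene family labels, one list per genome region.
    Returns a dict mapping family -> number, with the query numbered 1.
    """
    counts = Counter()
    for region in regions:
        if query not in region:
            continue
        # Count each family at most once per region.
        counts.update(set(region) - {query})
    # Most frequent first; ties are broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    numbering = {query: 1}
    for number, (family, _count) in enumerate(ranked, start=2):
        numbering[family] = number
    return numbering

# Hypothetical regions around niaP homologs in four genomes.
REGIONS = [
    ["niaP", "nadA", "nadB"],
    ["niaP", "nadA", "thiE"],
    ["niaP", "nadA"],
    ["niaP", "thiE"],
]
print(number_by_frequency(REGIONS, "niaP"))
```

Low numbers other than 1 point to the families worth inspecting first, which in the worked example are the NAD and thiamin synthesis genes.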
The phylogenetic tree tool:
* First examine conserved domain of D. geothermalis NiaP protein (Major Facilitator Superfamily MFS –
very large, diverse – therefore trees have many branches) * Click on trees link
* Select first tree with the radio button, press Update * Scroll down noting coloring (blue = fusion to
N-terminus of MFS protein, red = fusion to C-terminus) * Click on identifier of Streptomyces
viridochromogenes (red, ~40% of way down tree) * CDD link * Note C-terminal extension (no conserved
domain) * Click on identifier of Achromobacter piechaudii (blue, ~80% of way down tree) * CDD link * Note
double MFS plus N-terminal aromatic hydroxylase domain * Psi-Blast search – single hit * Thus although
fusion is plausible, it could be a sequencing artifact
* Select last tree * Note cluster of pink-brown color, Streptomyces species (~80% of way down tree) * Click
on identifier of Streptomyces avermitilis A-4680 * CDD link * Note double MFS plus three carboxypeptidase
regulatory-like domains in tandem * Psi-Blast search * Many hits * Thus this fusion is real
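The reasoning used above (a fusion found in a single sequence may be a sequencing artifact, while one found in many is likely real) can be written down as a small rule. The threshold of two independent genomes is an illustrative assumption, not a published cutoff, and the protein and genome names are hypothetical.

```python
def fusion_support(hits, min_genomes=2):
    """Judge a fusion from the proteins in which it is observed.

    hits: iterable of (protein_id, genome) pairs that share the same
    fused domain architecture. min_genomes is an illustrative threshold.
    """
    genomes = {genome for _protein, genome in hits}
    if not genomes:
        return "no evidence"
    if len(genomes) < min_genomes:
        return "possible sequencing artifact"
    return "supported"

# Hypothetical Psi-Blast outcomes for two fusions.
single_hit = [("prot_A", "genome_1")]
many_hits = [
    ("prot_B", "genome_2"),
    ("prot_C", "genome_3"),
    ("prot_D", "genome_4"),
]
print(fusion_support(single_hit))
print(fusion_support(many_hits))
```

Counting genomes rather than proteins keeps duplicate entries for the same organism from inflating the support.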
Finding ‘unknown’ fusions to known metabolic enzymes:
* Go to PubSEED, Click ‘Navigate’ tab, select Curate Subsystems * Find Thiamin biosynthesis subsystem
(encoded by Rodionov) * Click on subsystem link, go to spreadsheet
* Go to yeast, Saccharomyces cerevisiae * Go to TPPK (thiamin pyrophosphokinase, last enzyme in yeast
pathway), click on link (no. 5891) * Note in Compare Regions viewer that Schizosaccharomyces pombe
homolog is much longer (suggestive of fusion)
* Click on trees link * Scroll down to S. pombe (blue, ~1/3 of way down) * Click on S. pombe identifier * CDD
link * Note N-terminal Nudix hydrolase fusion (Nudix family large, diverse)
* Psi-Blast search * Fusion present also in another Schizosaccharomyces species, therefore real
* Possible function of Nudix domain: Nudix family enzymes typically cleave pyrophosphate P-O-P bonds.
Product of TPPK is thiamin pyrophosphate, therefore fusion domain could be a combined synthesis –
breakdown enzyme
* Return to tree, below S. pombe, note blue color in Atopobium parvulum and two related genomes * Click
on A. parvulum identifier * CDD link * Note N-terminal HAD hydrolase fusion (HAD family large, diverse)
* Psi-Blast search confirms fusion present in several genomes, therefore real
* From Conserved Domain display, click on Search for similar domain architectures * Goes to CDART –
Conserved Domain Architecture Retrieval Tool * Displays proteins in NCBI database that belong to HAD
family (multiple pages) * Note that many HAD family members occur as fusions
* Another tool to explore fusions (similar to CDART) – pfam * To search pfam, capture the A. parvulum
protein sequence from the SEED Annotation Overview page
>fig|521095.6.peg.839 [Atopobium parvulum DSM 20469] [hypothetical protein]
MQVTGAIFDCDGTLVDSMCVWHNVFSAVLPKYGKTVDPDIFNRVEAVSLI
GGCQICVDELALPVTAETLYEEFCAYATDQYQHHVSIVPGAKEFLQELYD
AGIPLAVASSTPVREVRAALAAQGIEHLFKTVVSTEDVGGVDKVEPDVYL
EALRRLGTDKATTWVFEDAPFGAQTAQKAGFPVVALYNDHDGRDPVFMRE
HSNIFAHTYGELSLLRLCDYERPLTSAPSGEKPLEVLIVGGSPEAVSKTT
LSTCVQSADYLIAVDHGADACHVAGVVPQLALGDFDSASLETVTWLKEQQ
VPCMKFNADKYDTDLALALKSAEHEAIRRNSKLSLTVVSTSGGHLDHQLV
VLGLLAAWAKTGKAKVRVVENDFEMRFLAADQIDSWQLDASATGKKISLV
ALSEECEVSESGMRWNLNHEKFTLLGDDGISNIVEADGAWVKCEKGCLLVQLWN
* Go to pfam http://pfam.janelia.org/ * Select Sequence Search, paste in sequence, press Go * Result:
Hits on TPPK domain and HAD domain * Click on HAD link, then on Domain Organisation
* Displays various fusion arrangements of members of HAD family
* For step-by-step guide to finding fusions, download ‘Quickest path to fusions identification VdeC tutorial’
from class website
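The domain-architecture views used above (CDD, CDART, and the pfam Domain Organisation page) all come down to ordering the domains found in each protein and looking for a known enzyme domain next to something else. The sketch below does this for toy data. The protein labels, domain names, and coordinates are hypothetical, chosen to resemble the TPPK fusions described in the text.

```python
def architecture(domains):
    """Order domain hits (name, start, end) from N- to C-terminus."""
    return tuple(name for name, _start, _end in
                 sorted(domains, key=lambda d: d[1]))

def fusion_partners(proteins, target):
    """Find proteins in which `target` is fused to other domains.

    proteins: dict mapping protein id -> list of (name, start, end).
    Returns dict mapping protein id -> list of (partner, position),
    where position is relative to the first occurrence of the target.
    """
    result = {}
    for protein_id, domains in proteins.items():
        arch = architecture(domains)
        if target not in arch:
            continue
        idx = arch.index(target)
        partners = [(n, "N-terminal") for n in arch[:idx] if n != target]
        partners += [(n, "C-terminal") for n in arch[idx + 1:] if n != target]
        if partners:
            result[protein_id] = partners
    return result

# Hypothetical domain hits (coordinates are invented for illustration).
PROTEINS = {
    "yeast_like": [("TPK", 20, 300)],
    "S_pombe_like": [("Nudix", 10, 150), ("TPK", 200, 420)],
    "A_parvulum_like": [("TPK", 250, 450), ("HAD", 5, 200)],
}
for protein_id, partners in fusion_partners(PROTEINS, "TPK").items():
    print(protein_id, partners)
```

Unfused proteins drop out of the result, so what remains is the short list of candidate fusions to check by Psi-Blast.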
Finding fusions directly from CDD or Pfam:
* Capture sequence of a biochemically validated, unfused pathway enzyme (e.g. from a model organism
such as E. coli, B. subtilis, or yeast). Example: B. subtilis ThiC
>fig|224308.1.peg.878 [Bacillus subtilis subsp. subtilis str. 168] [Hydroxymethylpyrimidine phosphate
synthase ThiC]
MQNNSVQQANISIMSSFSGSKKVYVEGSSSDIQVPMREIALSPTTGSFGE
EENAPVRVYDTSGPYTDPEVTINIQEGLKPLRQIWITERGDVEEYEGRAI
KPEDNGYKKAKPNVSYPGLKRKPLRAKAGQNVTQMHYAKKGIITPEMEFI
AIREHVSPEFVRDEVASGRAIIPSNINHPESEPMIIGRNFHVKINANIGN
SAVTSSIEEEVEKMTWAIRWGADTMMDLSTGKDIHTTREWIIRNCPVPVG
TVPIYQALEKVNGVAEDLTWEIYRDTLIEQAEQGVDYFTIHAGVLLRYVP
LTAKRTTGIVSRGGAIMAQWCLAHHQESFLYTHFEEICEIMKMYDIAFSL
GDGLRPGSIADANDEAQFAELETLGELTQIAWKHDVQVMIEGPGHVPMHK
IKENVDKQMDICKEAPFYTLGPLTTDIAPGYDHITSAIGAAMIGWYGTAM
LCYVTPKEHLGLPNRDDVREGVITYKIAAHAADLAKGHPGAQIRDDALSK
ARFEFRWRDQFNLSLDPERALEYHDETLPAEGAKTAHFCSMCGPKFCSMR
ISQDIRDYAKKNDLSEAEAINKGLKEKAKEFVDTGSNLYQ
* Go to CDD search page http://www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi * Search with
ThiC sequence * Note fusion to TIM_phosphate_binding superfamily domain
* For more information click on [+] * Click on Domain details * Shows that TIM_phosphate_binding
superfamily member is thiamin monophosphate (TMP) synthase
* Go to pfam search page http://pfam.janelia.org/ * Search with ThiC sequence * Click on ThiC link, then
on Domain Organisation
* Note fusion to thiamin synthesis enzyme TMP synthase in multiple genomes * Also detects fusion to
second thiamin synthesis enzyme HMPP kinase (but only a single occurrence, therefore questionable)
* CDD and pfam are complementary
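The sequences captured from SEED in these exercises are FASTA records whose header carries an identifier, an organism, and an annotation in square brackets. The sketch below parses such records so they can be handed to other tools. The header layout is inferred from the two SEED records shown in this handout and is an assumption, not a documented format. Only the first 50 residues of ThiC are used, to keep the example short.

```python
import re

# Assumed header layout: >id [organism] [function]
HEADER = re.compile(r"^>(\S+)\s+\[(.+?)\]\s+\[(.+)\]\s*$")

def _make_record(header, chunks):
    match = HEADER.match(header)
    if match:
        seq_id, organism, function = match.groups()
    else:
        # Fall back to a plain FASTA header: first word is the id.
        parts = header[1:].split()
        seq_id = parts[0] if parts else ""
        organism = function = None
    return {"id": seq_id, "organism": organism,
            "function": function, "sequence": "".join(chunks)}

def parse_seed_fasta(text):
    """Parse FASTA text into a list of record dictionaries."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records

# Header from the ThiC example; sequence truncated to its first 50 residues.
EXAMPLE = """>fig|224308.1.peg.878 [Bacillus subtilis subsp. subtilis str. 168] [Hydroxymethylpyrimidine phosphate synthase ThiC]
MQNNSVQQANISIMSSFSGSKKVYV
EGSSSDIQVPMREIALSPTTGSFGE
"""
record = parse_seed_fasta(EXAMPLE)[0]
print(record["id"], "|", record["organism"], "|", record["function"])
print(len(record["sequence"]))
```

Joining the wrapped sequence lines into one string matters because the CDD and pfam search boxes expect the whole protein, not one line of it.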