Analysis of Membrane Proteins in
Metagenomics: Networks of correlated
environmental features and protein families
Prianka Patel, Thesis Defense
Yale University
Molecular Biophysics and Biochemistry
2.17.10
Projects
Analysis of Membrane Protein Structures
Bowie, James, Nature, 2005
[Figure: number of coevolving pairs vs. sequence separation]
Metagenomics of Ocean Microbes: Co-variation with Environment
[Figure label: Photosynthesis]
2
What is Metagenomics?
Traditional Genomics
1. Select organism and culture
2. Extract DNA and sequence
3. Assemble and annotate
Estimated that less than 1% of microbes can be cultured
Metagenomics
1. Collect sample from environment
2. Extract DNA and sequence
3. Assemble and annotate
Lose information about which gene belongs to which microbe
[Figure: example reads (atgctcgatctcg, atcgatctcgctg, atgccgatctaa) assembled into Contig 1, Contig 2, ...]
3
Comparative Metagenomics
[Figure: GC content per sample, with averages marked. Samples: Sargasso Sea 2, Sargasso Sea 3, Sargasso Sea 4, Whale 1 (bone), Whale 2 (bone), Whale 1 (microbial mat), Acid Mine Drainage, Minnesota farm soil]
Foerstner et al., EMBO Rep, 2005
GC content is shaped by environment
Very different environments: whale-bone associated, ocean, acid mine, soil
An amino acid change in Proteorhodopsin proteins is linked to the abundant wavelengths of light in the sample of origin
4
Comparative Metagenomics
Dinsdale et al., Nature, 2008
There are microbial pathways that discriminate between categorically different environments
Gianoulis et al., PNAS, 2009
There are microbial pathways that discriminate between similar environments
[Figure labels: invariant, variant, Photosynthesis]
5
Motivation
Membrane proteins interact with the
environment, transporting available
nutrients, sensing environmental
signals, and responding to changes
Engelman et al., Nature, 2005
Variation in membrane proteins across different environments may give insight into
the microbial adaptations that allow microbes to survive in specific habitats.
6
Sorcerer II Global Ocean Survey
Sorcerer II journey: August 2003 to January 2006
Sample approximately every 200 miles
Rusch, et al., PLOS Biology 2007
7
Sorcerer II Global Ocean Survey
Metadata: GPS coordinates, Sample Depth, Water Depth, Salinity, Temperature, Chlorophyll Content
Metagenomic Sequence:
* 0.1–0.8 μm size fraction (bacteria)
* 6.3 billion base pairs (7.7 million reads)
* Reads were assembled and genes annotated
The majority of samples are from open ocean, with a few estuaries and lakes
Each site has its own metadata
Assembly was done over all locations, but can be mapped back to a particular site
Rusch et al., PLoS Biology, 2007
8
Extracting environmental data using GPS Coordinates
GOS
Sample Depth: 1 meter
Water Depth: 32 meters
Chlorophyll: 4.0 µg/kg
Salinity: 31 psu
Temperature: 11 °C
Location: 41°5'28"N, 71°36'8"W
GPS coordinates allow us to extract information from other sources:
* World Ocean Atlas
* National Center for Ecological Analysis and Synthesis
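A minimal sketch of how a site's GPS location could be turned into a lookup key for a gridded data source. It converts the degrees/minutes/seconds location shown above into decimal degrees and then into a cell index on a 1-degree grid. The function names and the grid layout (row 0 at 90°S, column 0 at 180°W) are assumptions made for illustration, not the layout of any particular atlas file.

```python
import math
import re

def dms_to_decimal(dms):
    """Convert a string such as 41°5'28"N into signed decimal degrees."""
    match = re.fullmatch(r"""\s*(\d+)°(\d+)'(\d+(?:\.\d+)?)"([NSEW])\s*""", dms)
    if match is None:
        raise ValueError(f"unrecognized coordinate: {dms!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and West are negative by convention.
    return -value if hemisphere in "SW" else value

def one_degree_cell(lat, lon):
    """Return (row, col) of the 1-degree cell containing the point.

    Assumed layout: row 0 starts at 90°S, column 0 starts at 180°W.
    """
    row = min(int(math.floor(lat + 90)), 179)
    col = min(int(math.floor(lon + 180)), 359)
    return row, col

lat = dms_to_decimal("41°5'28\"N")
lon = dms_to_decimal("71°36'8\"W")
cell = one_degree_cell(lat, lon)
print(round(lat, 4), round(lon, 4), cell)
```

With an index like this, any per-cell value (phosphate, nitrate, and so on) can be read from a 180 x 360 array once the real file's orientation is known.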
9
World Ocean Atlas 2005
NOAA (National Oceanic and Atmospheric Administration) and
NODC (National Oceanographic Data Center)
[Figure: Annual Phosphate (µmol/l) at the surface]
* Cumulative annual data at the ocean surface
* Resolution is 1 degree latitude/longitude
Nutrient Features Extracted:
Phosphate
Silicate
Nitrate
Apparent Oxygen Utilization
Dissolved Oxygen
10
National Center for Ecological Analysis and Synthesis (NCEAS)
* Resolution is 1 km square
* The value of an activity at a particular location is determined by the type of ecosystem present:
Impact = ∑ (feature * ecosystem * impact weight)
Anthropogenic Features Extracted:
Ultraviolet radiation
Shipping
Pollution
Climate Change
Ocean Acidification
[Figure: example maps for Shipping and Climate Change]
Halpern et al., Science, 2008
11
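The impact sum above can be sketched as a small function. This is only an illustration of the formula as written on the slide: the feature names, ecosystem names, and weights below are hypothetical values, not numbers from the NCEAS data.

```python
def cumulative_impact(features, ecosystems, weights):
    """Sum feature intensity * ecosystem presence * impact weight over all pairs.

    features:   {feature name: intensity at this location}
    ecosystems: {ecosystem name: presence at this location, 0 or 1}
    weights:    {(feature name, ecosystem name): impact weight}
    """
    total = 0.0
    for feature, intensity in features.items():
        for ecosystem, presence in ecosystems.items():
            # Pairs without a listed weight contribute nothing.
            total += intensity * presence * weights.get((feature, ecosystem), 0.0)
    return total

# Hypothetical values for a single 1 km cell, for illustration only.
features = {"shipping": 0.5, "pollution": 0.2}
ecosystems = {"coral_reef": 1, "seagrass": 0}
weights = {
    ("shipping", "coral_reef"): 2.0,
    ("pollution", "coral_reef"): 1.0,
    ("shipping", "seagrass"): 3.0,
}
score = cumulative_impact(features, ecosystems, weights)
print(score)
```

Because absent ecosystems have presence 0, the same activity can score differently at two locations, which is the point made in the bullet above.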
Predicting membrane proteins in GOS data
[Figure: pipeline from Metagenomic Reads to Protein Clusters to Membrane Protein Clusters to 151 Families (Family 1, Family 2, ...), with steps labeled GOS Mapping, TMHMM, Filtering, COG]
- TMHMM (Transmembrane Hidden Markov Model): finds hydrophobic stretches of amino acids
- COG (Clusters of Orthologous Groups): orthologous groups of protein families
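To make "finds hydrophobic stretches" concrete, here is a much simpler stand-in for TMHMM: a sliding-window average of Kyte-Doolittle hydropathy values. This is not the hidden Markov model that TMHMM uses, and the window of 19 residues and threshold of 1.6 are commonly quoted Kyte-Doolittle settings assumed here, not parameters taken from the slides.

```python
# Kyte-Doolittle hydropathy index per amino acid (one-letter codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobic_stretches(sequence, window=19, threshold=1.6):
    """Return merged (start, end) ranges whose window mean hydropathy >= threshold."""
    stretches = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        mean = sum(KYTE_DOOLITTLE[aa] for aa in segment) / window
        if mean >= threshold:
            end = start + window
            if stretches and start <= stretches[-1][1]:
                # Overlapping or adjacent windows belong to the same stretch.
                stretches[-1] = (stretches[-1][0], end)
            else:
                stretches.append((start, end))
    return stretches

# Hypothetical sequence: charged flanks around a 20-residue leucine core.
candidate = "DEKRDEKRDE" + "L" * 20 + "DEKRDEKRDE"
found = hydrophobic_stretches(candidate)
print(found)
```

A protein with at least one such stretch would be a candidate membrane protein under this simplified rule; a real pipeline would rely on the TMHMM predictions instead.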
12
Predicting membrane proteins in GOS data
Columns: (1) # proteins; (2) # proteins with predicted membrane spanning regions; (3) % proteins with predicted membrane spanning regions; (4) # proteins in membrane protein clusters; (5) % proteins in membrane protein clusters; (6) # proteins in membrane protein clusters mapping to COG
Site Name | (1) | (2) | (3) | (4) | (5) | (6)
Sargasso Sea, Hydrostation S | 138,843 | 29,759 | 21.43% | 20,609 | 14.84% | 6,123
Gulf of Maine | 189,035 | 35,971 | 19.03% | 23,104 | 12.22% | 6,927
Browns Bank, Gulf of Maine | 85,295 | 17,089 | 20.04% | 11,161 | 13.09% | 2,961
Outside Halifax, Nova Scotia | 79,377 | 15,520 | 19.55% | 10,254 | 12.92% | 3,347
Northern Gulf of Maine | 77,342 | 15,219 | 19.68% | 9,728 | 12.58% | 3,008
Block Island, NY | 113,436 | 23,415 | 20.64% | 15,883 | 14.00% | 4,519
Cape May, NJ | 110,513 | 23,302 | 21.09% | 15,975 | 14.46% | 4,842
Off Nags Head, NC | 162,087 | 31,154 | 19.22% | 20,662 | 12.75% | 6,416
South of Charleston, SC | 196,814 | 41,993 | 21.34% | 28,353 | 14.41% | 7,692
Off Key West, FL | 197,474 | 43,081 | 21.82% | 29,545 | 14.96% | 8,596
Gulf of Mexico | 192,335 | 42,027 | 21.85% | 28,161 | 14.64% | 7,804
Yucatan Channel | 370,261 | 77,900 | 21.04% | 53,396 | 14.42% | 13,638
Rosario Bank | 211,933 | 44,876 | 21.17% | 30,712 | 14.49% | 8,003
Northeast of Colon | 213,938 | 45,447 | 21.24% | 31,007 | 14.49% | 8,959
Gulf of Panama | 193,265 | 41,558 | 21.50% | 27,985 | 14.48% | 7,923
250 miles from Panama City | 192,610 | 41,764 | 21.68% | 27,996 | 14.54% | 7,849
30 miles from Cocos Island | 191,568 | 39,302 | 20.52% | 26,850 | 14.02% | 6,337
134 miles NE of Galapagos | 158,620 | 34,221 | 21.57% | 23,367 | 14.73% | 6,280
Devil's Crown, Floreana Island | 320,755 | 68,068 | 21.22% | 46,680 | 14.55% | 11,988
Coastal Floreana | 283,545 | 60,268 | 21.26% | 40,864 | 14.41% | 10,389
North James Bay, Santigo Island | 208,852 | 45,408 | 21.74% | 31,078 | 14.88% | 8,083
Warm seep, Roca Redonda | 570,496 | 124,657 | 21.85% | 85,303 | 14.95% | 23,061
Upwelling, Fernandina Island | 659,780 | 140,276 | 21.26% | 97,806 | 14.82% | 26,658
North Seamore Island | 208,632 | 45,314 | 21.72% | 30,534 | 14.64% | 9,021
Wolf Island | 221,536 | 48,391 | 21.84% | 32,359 | 14.61% | 9,172
Cabo Marshall, Isabella Island | 114,654 | 24,891 | 21.71% | 17,077 | 14.89% | 4,137
Equatorial Pacific TAO Buoy | 107,066 | 22,882 | 21.37% | 15,599 | 14.57% | 4,016
201 miles from F. Polynesia | 98,014 | 20,789 | 21.21% | 14,385 | 14.68% | 3,540
Rangirora Atoll | 188,985 | 40,136 | 21.24% | 27,285 | 14.44% | 6,581
Sum | 6,057,061 | 1,284,678 | | 873,718 | | 237,870
22% of unique proteins in membrane protein clusters map to COG
13
What is the Relationship?
Membrane Protein Families ? Environmental Features
• Correlation of Sites based on environmental features or protein families
• Discriminative Partition Matching
• Canonical Correlation Analysis/Protein Features and Environmental Features Network
14
How Similar are the Sites to each other?
[Figure: site-by-site correlation matrix, color scale from -1 to 1]
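A site-by-site similarity matrix of this kind can be sketched with a plain Pearson correlation between site profiles. The choice of Pearson correlation and the three toy profiles below are assumptions for illustration; the slides only show that similarity ranges from -1 to 1.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def site_correlation_matrix(profiles):
    """Return {(site_a, site_b): correlation} for every ordered pair of sites."""
    names = list(profiles)
    return {(a, b): pearson(profiles[a], profiles[b]) for a in names for b in names}

# Hypothetical per-site profiles (e.g. feature values or family frequencies).
profiles = {
    "site_A": [1.0, 2.0, 3.0, 4.0],
    "site_B": [2.0, 4.0, 6.0, 8.0],
    "site_C": [4.0, 3.0, 2.0, 1.0],
}
matrix = site_correlation_matrix(profiles)
print(matrix[("site_A", "site_B")], matrix[("site_A", "site_C")])
```

Running the same function once on environmental feature profiles and once on protein family profiles gives two matrices that can be compared side by side.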
15
Species Distribution
• The 16S rRNA gene encodes a component of the small prokaryotic ribosomal subunit
• Bacteria with 16S rRNA gene sequences more than 97% similar are considered the same ‘species’
• 10,025 16S genes found and classified
Biers et al., Appl. Environ. Microbiol., 2009
[Figure label: 20% level, “phylum”]
16
This suggests that the observed membrane protein variation is more a function of the measured
environmental features than of phylogenetic diversity.
Method: For each site, we correlated its membrane protein family (MPF) frequency profile
distances with its environmental feature (EF) profile distances and with its 16S profile distances.
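The method above can be sketched as follows: for one site, compute its distances to every other site under each data type, then correlate the resulting distance profiles. Euclidean distance, Pearson correlation, and all values in the toy tables are assumptions for illustration; the slides do not state which distance or correlation measure was used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_profile(site, table):
    """Distances from one site to every other site, in sorted site order."""
    return [euclidean(table[site], table[other])
            for other in sorted(table) if other != site]

def profile_agreement(site, table_a, table_b):
    """Correlation between a site's distance profiles under two data types."""
    return pearson(distance_profile(site, table_a), distance_profile(site, table_b))

# Hypothetical per-site tables: EF, MPF frequencies, and a 16S-derived profile.
ef = {"s1": [0.0, 0.0], "s2": [1.0, 0.0], "s3": [3.0, 0.0], "s4": [6.0, 0.0]}
mpf = {"s1": [0.0, 0.0], "s2": [2.0, 0.0], "s3": [6.0, 0.0], "s4": [12.0, 0.0]}
rrna = {"s1": [5.0], "s2": [1.0], "s3": [4.0], "s4": [2.0]}

mpf_vs_ef = {site: profile_agreement(site, mpf, ef) for site in mpf}
mpf_vs_16s = {site: profile_agreement(site, mpf, rrna) for site in mpf}
print(mpf_vs_ef)
print(mpf_vs_16s)
```

In this toy data the MPF distances track the EF distances exactly, which mirrors the kind of pattern the slide's conclusion describes.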
17
Discriminative Partition Matching
Sites cluster into three distinct groups, and the groups are geographically separated.
Which membrane protein families are discriminating between these clusters?
We can partition the membrane protein family matrix by these site groupings, and then look for
significantly different distributions of protein families between the clusters.
18
Discriminative Partition Matching
First, we performed PCA on the membrane protein families matrix, and grouped the first
component scores by the environmental clustering.
This revealed that the Mid-Atlantic and Pacific were more similar to each other in terms of
membrane protein content, and these sites were grouped together.
Which families are discriminating between these two site sets? (t-test)
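A minimal sketch of the t-test step only (the PCA grouping is not shown). It computes a t statistic per family between two site sets and ranks families by its magnitude. The slides do not say which t-test variant was used; Welch's unequal-variance form is assumed here, and the family names, site names, and frequencies are hypothetical. Turning the statistic into a p-value requires a t distribution, for example via `scipy.stats.ttest_ind`.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error

def rank_families(frequencies, set_a, set_b):
    """frequencies: {family: {site: frequency}}; returns (family, t) sorted by |t|."""
    scores = {}
    for family, by_site in frequencies.items():
        scores[family] = welch_t(
            [by_site[site] for site in set_a],
            [by_site[site] for site in set_b],
        )
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical family frequencies at six sites split into two site sets.
north = ["n1", "n2", "n3"]
other = ["o1", "o2", "o3"]
frequencies = {
    "famX": {"n1": 0.9, "n2": 1.0, "n3": 1.1, "o1": 0.1, "o2": 0.2, "o3": 0.3},
    "famY": {"n1": 0.5, "n2": 0.6, "n3": 0.4, "o1": 0.5, "o2": 0.4, "o3": 0.6},
}
ranked = rank_families(frequencies, north, other)
print(ranked)
```

A positive statistic here means the family is more frequent in the first site set, so the sign indicates where a discriminating family is enriched.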
19
DPM results
• 30 families showed significant differences (p-value < 0.01) between the site sets
• Most were enriched in the North Atlantic (28/30)
• Higher pollution, chlorophyll, and possibly higher nutrients and cell abundance in the North Atlantic
Annotations on the enriched families:
- Microbes’ need to expel antimicrobials, by-products of metabolism, or environmental toxins
- Buffer against shifts in ocean solute concentrations
- Stabilization of DNA and RNA
- Chlorophyll content, again alluding to the increased pollutants, and possibly nutrient fluxes from land and rivers
- Exchanges ATP for ADP in mitochondria and obligate intracellular parasites; may be nucleotide/H+ transporters
20
Simultaneous Correlations of Environmental Features and Membrane Proteins
Canonical Correlation Analysis
Environmental Features ? Membrane Protein Families
We have addressed this question by:
1. Comparing site similarity based on these two sets of features
2. Finding particular discriminating families between environmental groupings
But we don’t know which particular features are associated with each other, and we know that
they are all likely interdependent. This motivates Canonical Correlation Analysis.
[Figure: example associations among Family 1, Family 2, Family 5 and Salinity, Pollution, Temp]
21
Canonical Correlation Analysis
- CCA allows us to take advantage of the continuity of the features and observe which features are
invariant or variant, and the type (positive, negative) of relationship between them.
- We correlate all the variables, protein families and environmental features, simultaneously.
- We have two sets of variables, X1 ... X15 (environmental features) and Y1 ... Y151 (membrane
protein families)
[Figure: Environmental Features matrix and Membrane Protein Families matrix]
We are looking for two vectors, a and b (sets of weights for X and Y), such that the correlation
between the weighted combinations of X and Y is maximized.
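The maximization described above is conventionally written as follows. The covariance symbols are introduced here and do not appear on the slides: \Sigma_{XX} and \Sigma_{YY} are the covariance matrices of the environmental features and the membrane protein families, and \Sigma_{XY} is their cross-covariance matrix.

```latex
\rho \;=\; \max_{a,\,b}\ \operatorname{corr}\!\left(a^{\top}X,\; b^{\top}Y\right)
\;=\; \max_{a,\,b}\ \frac{a^{\top}\Sigma_{XY}\,b}
{\sqrt{a^{\top}\Sigma_{XX}\,a}\;\sqrt{b^{\top}\Sigma_{YY}\,b}}
```

Here a has 15 entries (one weight per environmental feature) and b has 151 entries (one weight per membrane protein family).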
22
CCA results
We are defining a change of basis of the cross-covariance matrix.
We want the correlations between the projections of the variables, X and Y, onto the basis vectors to
be mutually maximized.
Eigenvalues: squared canonical correlations
Eigenvectors: normalized canonical correlation basis vectors
[Figure: correlation circle; legend: Environment, Family; circles labeled Correlation = 1 and Correlation = 0.3]
This plot shows the correlations in the first and second dimensions.
Correlation Circle: The closer the point is to the outer circle, the higher the correlation.
Variables projected in the same direction are correlated.
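The eigenvalue and eigenvector statements on this slide correspond to the standard CCA eigenproblems below. The notation is assumed here rather than taken from the slides: a and b are the weight vectors for X and Y, \Sigma_{XX} and \Sigma_{YY} are the within-set covariance matrices, \Sigma_{XY} = \Sigma_{YX}^{\top} is the cross-covariance matrix, and \rho is a canonical correlation.

```latex
\Sigma_{XX}^{-1}\,\Sigma_{XY}\,\Sigma_{YY}^{-1}\,\Sigma_{YX}\; a \;=\; \rho^{2}\, a
\qquad
\Sigma_{YY}^{-1}\,\Sigma_{YX}\,\Sigma_{XX}^{-1}\,\Sigma_{XY}\; b \;=\; \rho^{2}\, b
```

Each eigenvalue \rho^{2} is a squared canonical correlation, and the matching eigenvectors, once normalized, give the basis vectors for one CCA dimension.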
23
CCA results
107 variant membrane protein families
44 invariant membrane protein families
[Figure: CCA plot, Dimension 1 vs. Dimension 2, with points marked invariant or variant. Environmental features shown: Pollution, Climate change, Shipping, Dissolved O2, Water depth, Acidity, Chlorophyll, Sample Depth, UV, Temperature, Phosphate, Salinity, Nitrate, Silicate, App. O2 util.]
Difficult to see the strength and directionality of a relationship
Weights of the features are difficult to visualize and compare
There is no means of quantifying the variation between sets of features
24
Protein Families and Environmental Features Network (PEN)
$a \cdot b = \|a\| \, \|b\| \cos\theta$
Distance: Dot product between 1st and 2nd Dimension of CCA
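One way to read the dot-product distance is sketched below: each environmental feature and each protein family has a coordinate pair in the first two CCA dimensions, and a feature/family pair is linked when the dot product of the two vectors is large in magnitude, with the sign giving the direction of the relationship. The cutoff, the coordinates, and the node names are hypothetical; the slides do not state how edges were thresholded.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def build_network(features, families, cutoff):
    """Link feature/family pairs whose |dot product| reaches the cutoff.

    features, families: {name: (dimension 1, dimension 2)} coordinates.
    Returns a list of (feature, family, sign, score) edges.
    """
    edges = []
    for feature, feature_xy in features.items():
        for family, family_xy in families.items():
            score = dot(feature_xy, family_xy)
            if abs(score) >= cutoff:
                sign = "positive" if score > 0 else "negative"
                edges.append((feature, family, sign, score))
    return edges

# Hypothetical coordinates in the first two CCA dimensions.
features = {"Phosphate": (0.8, 0.1), "Pollution": (-0.1, 0.7)}
families = {"COG_A": (0.9, 0.0), "COG_B": (0.1, -0.8), "COG_C": (0.1, 0.1)}
edges = build_network(features, families, cutoff=0.3)
for edge in edges:
    print(edge)
```

Because the dot product combines vector lengths with the angle between them, points near the center of the correlation circle rarely form edges, while long vectors pointing in the same or opposite directions do.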
25
Protein Families and Environmental Features Network (PEN)
[Figure: network; labeled nodes include COG0598, Magnesium Transporter, and COG1176, Polyamine Transporter]
“Bi-modules”: groups of environmental features and membrane protein families that are associated
UV, dissolved oxygen, apparent oxygen utilization, sample depth, and water depth are not in the
network
26
Bi-module 1: Phosphate/Phosphate Transporters
Low phosphate: high-affinity phosphate transporters, which are induced during phosphate limitation
High phosphate: low-affinity inorganic phosphate ion transporters, which are constitutively expressed
27
Microbes modulate content in response to phosphate
Martiny et al. Env Microbiology, 2009
Phosphate Concentration related to phosphate
acquisition genes in Prochlorococcus
Van Mooy et al. Nature, 2009
Microbes modulate phospholipid content in
response to phosphate concentrations
28
Bi-module 2: Iron Transporters/Pollution/Shipping
Negative relationship between areas
of high ocean-based pollution and
shipping and transporters involved in
the uptake of iron
Pollution and Shipping may be a proxy for iron concentrations
29
Bi-module 2: Iron Transporters/Pollution/Shipping
Iron is usually limiting in oceans: High Nitrate-Nutrient/Low Chlorophyll regions
Delivery of iron to the ocean is usually by:
- terrestrial input
- fluvial (rivers) input
- upwelling from the ocean floor
- aeolian dust from land
Ridgwell A. J. (2002) Phil. Trans. R. Soc. Lond.
30
Bi-module 2: Iron Transporters/Pollution/Shipping
Pollution and Dust
N/C and Iron Transporters
- Negative correlation between COG4558 and COG0609 and dust/pollution values
(p-value < 0.01)
- Searching the BRENDA database for enzymes using iron as a cofactor reveals
that an increase in these two COGs is negatively correlated with the amount of
enzymes present that require iron.
31
Conclusions
New method (PEN) to visualize complex
relationships in metagenomic data using explicit
environmental variables
We show both known and intuitive relationships
between features and genomic content
CCA also reveals the invariant fraction of
environmental features and protein families
(highlights important cellular processes):
Chloride Channel, Type II secretion Proteins
(virulence)
Many variant ABC-type transporters (34/41):
suggests streamlining for optimization and energy
conservation
32
Much of Membrane Protein Space Remains
Uncharacterized
• 15% of predicted membrane proteins had no homology to GenBank
(e-value < 1e-10)
• We used short motifs (PROSITE) to characterize a small fraction of
these, including ABC transporters, GPCRs, lipocalins, and beta-lactamases
16% (29,384) were annotated
33
Intraribotype diversity and the definition of a ‘species’
16S analysis of GOS data reveals that
most sequences fall into 5 ribotypes
However, there were very few identical
sequences, suggesting that no two cells
have identical genome sequences
Eugene V Koonin Nat Biotechnology, 2007
This suggests that ocean microbes are rather adaptive to their environments
We observe diversity in membrane protein content and abundance, and show that it is a reflection
of different environmental conditions more than phylogenetic diversity (16S)
These are mostly oligotrophic (nutrient-poor) waters, and environmental conditions have likely
been fairly constant over many years; genomes are “streamlining”
34
Conclusions
• Microbes from ocean surface samples show diversity in membrane protein content
• Genotypic variation within similar natural populations occurs in response to environmental conditions
• Integration of Environmental Features using GPS coordinates
• Diversity in membrane proteins was shown to be a reflection of different environmental conditions more than phylogenetic diversity
• Integration of geospatial data can highlight unexpected trends, as anthropogenic factors seem to be reflected in microbial function
• Environmental clusters show differences in membrane protein content which reflect environmental conditions (pollution/efflux proteins)
• Developed (PEN) and adapted techniques to connect features of environment to specific protein families
35
Acknowledgements
Advisors: Donald Engelman and Mark Gerstein
Collaborators
Gerstein Lab:
Tara Gianoulis
Kevin Yip
Rob Bjornson
Committee Members:
Jim Bowie (UCLA)
Annette Molinaro
Lynne Regan
Mike Snyder
Administrative Staff:
Mary Backer
Ann Nicotra
Nessie Stewart
Yale University Biomedical
High Performance Computing Facility
NIH grant RR19895 which funded the instrumentation
Nicolas Carriero
Philip Kim
Jan Korbel
Sam Flores
Engelman Lab:
Damien Thevenin
Yale Map Collection:
Julia Rogers
Stacey Maples
Past and Present members of Engelman
and Gerstein Labs
36