Analysis of Membrane Proteins in Metagenomics: Networks of correlated environmental features and protein families
Prianka Patel, Thesis Defense
Yale University, Molecular Biophysics and Biochemistry
2.17.10

Projects
Analysis of Membrane Protein Structures (Bowie, James, Nature, 2005)
[Figure axes: # coevolving pairs vs. sequence separation]
Metagenomics of Ocean Microbes: Co-variation with Environment
[Figure label: Photosynthesis]

What is Metagenomics?
Traditional Genomics:
1. Select organism and culture
2. Extract DNA and sequence
3. Assemble and annotate
Estimated that less than 1% of microbes can be cultured
Metagenomics:
1. Collect sample from environment
2. Extract DNA and sequence
3. Assemble and annotate
Lose information about which gene belongs to which microbe
[Figure: example sequence reads assembled into Contig 1, Contig 2, ...]

Comparative Metagenomics
[Figure legend: Average; Sargasso Sea 2; Sargasso Sea 4; Sargasso Sea 3; Whale 1 (bone); Whale 2 (bone); Whale 1 (microbial mat); Acid mine drainage; Minnesota farm soil]
Foerstner et al., EMBO Rep, 2005
GC content is shaped by environment
Very different environments: whale bone associated, ocean, acid mine, soil
An amino acid change in Proteorhodopsin proteins is linked to abundant wavelengths in the sample of origin

Comparative Metagenomics
[Figure labels: invariant, variant, Photosynthesis]
Dinsdale et al., Nature 2008: There are microbial pathways that discriminate between categorically different environments
Gianoulis et al., PNAS 2009: There are microbial pathways that discriminate between similar environments

Motivation
Membrane proteins interact with the environment, transporting available nutrients, sensing environmental signals, and responding to changes
Engelman et al., Nature, 2005
Variation in membrane proteins across different environments may give insight into the microbial adaptations that allow microbes to survive in a specific habitat.

Sorcerer II Global Ocean Survey
Sorcerer II journey, August 2003 - January 2006
Sample approximately every 200 miles
Rusch et al., PLOS Biology 2007

Sorcerer II Global Ocean Survey
Metadata: GPS coordinates, Sample Depth, Water Depth, Salinity, Temperature, Chlorophyll Content
Metagenomic Sequence:
- 0.1–0.8 μm size fraction (bacteria)
- 6.3 billion base pairs (7.7 million reads)
- Reads were assembled and genes annotated
The majority of samples are from open ocean, with a few estuaries and lakes
Each site has its own metadata
Assembly was done over all locations, but can be mapped back to a particular site
Rusch et al., PLOS Biology 2007

Extracting environmental data using GPS Coordinates
GOS sample:
Sample Depth: 1 meter
Water Depth: 32 meters
Chlorophyll: 4.0 ug/kg
Salinity: 31 psu
Temperature: 11 °C
Location: 41°5'28"N, 71°36'8"W
GPS coordinates allow us to extract information from other sources:
* World Ocean Atlas
* National Center for Ecological Analysis and Synthesis

World Ocean Atlas 2005
NOAA (National Oceanic and Atmospheric Administration) and NODC (National Oceanographic Data Center)
[Figure: Annual Phosphate [umol/l] at the surface]
* Cumulative annual data at the ocean surface
* Resolution is 1 degree latitude/longitude
. . . no simple geometric shape matches the Earth
Nutrient Features Extracted: Phosphate, Silicate, Nitrate, Apparent Oxygen Utilization, Dissolved Oxygen

National Center for Ecological Analysis and Synthesis (NCEAS)
* Resolution is 1 km square
* The value of an activity at a particular location is determined by the type of ecosystem present:
  Impact = ∑ Features * Ecosystem * impact weight
[Figure panels: Shipping, Climate Change]
Anthropogenic Features Extracted: Ultraviolet radiation, Shipping, Pollution, Climate Change, Ocean Acidification
Halpern et al. (2008), Science

Predicting membrane proteins in GOS data
Pipeline: Metagenomic Reads -> (GOS Mapping) -> Protein Clusters -> (TMHMM Filtering) -> Membrane Protein Clusters -> (COG) -> Family 1, Family 2, ... (151 Families)
- TMHMM (Transmembrane Hidden Markov Model): finds hydrophobic stretches of amino acids
- COG (Clusters of Orthologous Groups): orthologous groups of protein families

Predicting membrane proteins in GOS data
Columns: Site Name | # proteins | # proteins with predicted membrane spanning regions | % proteins with predicted membrane spanning region | # proteins in membrane protein clusters | % proteins in membrane protein clusters | # proteins in membrane protein clusters mapping to COG
Sargasso Sea, Hydrostation S | 138,843 | 29,759 | 21.43% | 20,609 | 14.84% | 6,123
Gulf of Maine | 189,035 | 35,971 | 19.03% | 23,104 | 12.22% | 6,927
Browns Bank, Gulf of Maine | 85,295 | 17,089 | 20.04% | 11,161 | 13.09% | 2,961
Outside Halifax, Nova Scotia | 79,377 | 15,520 | 19.55% | 10,254 | 12.92% | 3,347
Northern Gulf of Maine | 77,342 | 15,219 | 19.68% | 9,728 | 12.58% | 3,008
Block Island, NY | 113,436 | 23,415 | 20.64% | 15,883 | 14.00% | 4,519
Cape May, NJ | 110,513 | 23,302 | 21.09% | 15,975 | 14.46% | 4,842
Off Nags Head, NC | 162,087 | 31,154 | 19.22% | 20,662 | 12.75% | 6,416
South of Charleston, SC | 196,814 | 41,993 | 21.34% | 28,353 | 14.41% | 7,692
Off Key West, FL | 197,474 | 43,081 | 21.82% | 29,545 | 14.96% | 8,596
Gulf of Mexico | 192,335 | 42,027 | 21.85% | 28,161 | 14.64% | 7,804
Yucatan Channel | 370,261 | 77,900 | 21.04% | 53,396 | 14.42% | 13,638
Rosario Bank | 211,933 | 44,876 | 21.17% | 30,712 | 14.49% | 8,003
Northeast of Colon | 213,938 | 45,447 | 21.24% | 31,007 | 14.49% | 8,959
Gulf of Panama | 193,265 | 41,558 | 21.50% | 27,985 | 14.48% | 7,923
250 miles from Panama City | 192,610 | 41,764 | 21.68% | 27,996 | 14.54% | 7,849
30 miles from Cocos Island | 191,568 | 39,302 | 20.52% | 26,850 | 14.02% | 6,337
134 miles NE of Galapagos | 158,620 | 34,221 | 21.57% | 23,367 | 14.73% | 6,280
Devil's Crown, Floreana Island | 320,755 | 68,068 | 21.22% | 46,680 | 14.55% | 11,988
Coastal Floreana | 283,545 | 60,268 | 21.26% | 40,864 | 14.41% | 10,389
North James Bay, Santigo Island | 208,852 | 45,408 | 21.74% | 31,078 | 14.88% | 8,083
Warm seep, Roca Redonda | 570,496 | 124,657 | 21.85% | 85,303 | 14.95% | 23,061
Upwelling, Fernandina Island | 659,780 | 140,276 | 21.26% | 97,806 | 14.82% | 26,658
North Seamore Island | 208,632 | 45,314 | 21.72% | 30,534 | 14.64% | 9,021
Wolf Island | 221,536 | 48,391 | 21.84% | 32,359 | 14.61% | 9,172
Cabo Marshall, Isabella Island | 114,654 | 24,891 | 21.71% | 17,077 | 14.89% | 4,137
Equatorial Pacific TAO Buoy | 107,066 | 22,882 | 21.37% | 15,599 | 14.57% | 4,016
201 miles from F. Polynesia | 98,014 | 20,789 | 21.21% | 14,385 | 14.68% | 3,540
Rangirora Atoll | 188,985 | 40,136 | 21.24% | 27,285 | 14.44% | 6,581
Sum | 6,057,061 | 1,284,678 | | 873,718 | | 237,870
22% of unique proteins in membrane protein clusters map to COG

What is the Relationship?
Membrane Protein Families ? Environmental Features
• Correlation of Sites based on environmental features or protein families
• Discriminative Partition Matching
• Canonical Correlation Analysis/Protein Features and Environmental Features Network

How Similar are the Sites to each other?
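A minimal sketch of how a site-by-site similarity matrix of this kind could be computed. Each site is represented by a profile (environmental features or membrane protein family frequencies), and pairs of sites are compared by correlation, giving values between -1 and 1. The use of Pearson correlation, the site names, and the profile values are all assumptions made for illustration; the slides do not state which correlation measure was used.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation between two equal-length numeric profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)

def site_similarity(profiles):
    """Pairwise correlations between sites as a nested dict."""
    names = list(profiles)
    return {s: {t: pearson(profiles[s], profiles[t]) for t in names}
            for s in names}

# Hypothetical standardized profiles (made-up values, not GOS data).
profiles = {
    "site_A": [0.9, -0.2, 1.1, 0.3],
    "site_B": [1.0, -0.1, 0.9, 0.4],
    "site_C": [-0.8, 0.5, -1.0, 0.1],
}
sim = site_similarity(profiles)
for site, row in sim.items():
    print(site, {other: round(r, 2) for other, r in row.items()})
```

Running the same function once on environmental profiles and once on protein family profiles would give two matrices that can be compared site by site.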
[Figure: site similarity; color scale from 1 to -1]

Species Distribution
• The 16S rRNA gene is a component of the small prokaryotic ribosomal subunit
• Bacteria with 16S rRNA gene sequences more similar than 97% are considered the same ‘species’
• 10,025 16S genes found and classified
Biers et al., App. Env. Microbiology, 2009
[Figure label: 20% level, “phylum”]

This suggests that the observed membrane protein variation is more a function of the measured environmental features than of phylogenetic diversity.
Method: For each site, we correlated its environmental feature (EF) profile distances with its membrane protein family (MPF) frequency profile distances and its 16S profile distances

Discriminative Partition Matching
Sites cluster into three distinct groups:
Which membrane protein families are discriminating between these clusters?
Groups are geographically separated:
We can partition the membrane protein family matrix by these site groupings, and then look for significantly different distributions of protein families between the clusters.

Discriminative Partition Matching
First, we performed PCA on the membrane protein families matrix, and grouped the first component scores by the environmental clustering
This revealed that the Mid-Atlantic and Pacific were more similar to each other in terms of membrane protein content, and these sites were grouped
Which families are discriminating between these two site-sets?
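As a rough illustration of the question above, the sketch below scores each family by a two-sample t statistic between two site sets and ranks families by its magnitude. It assumes Welch's unequal-variance form, which the slides do not specify, and all family names, site names, and frequencies are hypothetical. Only the statistic and its approximate degrees of freedom are computed; a p-value would need a t-distribution routine such as `scipy.stats.ttest_ind(x, y, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

def rank_families(freqs, group_a, group_b):
    """Rank families by |t| between two site sets.

    freqs maps family -> {site: frequency}.
    """
    scores = {}
    for family, by_site in freqs.items():
        x = [by_site[s] for s in group_a]
        y = [by_site[s] for s in group_b]
        scores[family] = welch_t(x, y)
    return sorted(scores.items(), key=lambda kv: abs(kv[1][0]), reverse=True)

# Hypothetical family frequencies per site (made-up values).
group_a = ["a1", "a2", "a3"]
group_b = ["b1", "b2", "b3"]
freqs = {
    "family_enriched": {"a1": 0.30, "a2": 0.32, "a3": 0.31,
                        "b1": 0.10, "b2": 0.12, "b3": 0.11},
    "family_flat": {"a1": 0.20, "a2": 0.22, "a3": 0.21,
                    "b1": 0.21, "b2": 0.20, "b3": 0.22},
}
ranked = rank_families(freqs, group_a, group_b)
for family, (t, df) in ranked:
    print(f"{family}: t={t:.2f}, df={df:.1f}")
```

With 151 families tested at once, some correction for multiple comparisons would normally be considered; the slides report only the p-value<0.01 cutoff.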
(T-test)

DPM results
• 30 families showed significant differences (p-value<0.01) between the site sets
• Most were enriched in the North Atlantic (28/30)
• Higher pollution, chlorophyll, and possibly higher nutrients and cell abundance in the North Atlantic
[Figure annotations:]
- Microbes’ need to expel antimicrobials, by-products of metabolism, or toxins
- Buffer against shifts in ocean solute concentrations
- Chlorophyll content, again alluding to the increased environmental pollutants, and possibly nutrient fluxes from land and rivers
- Stabilization of DNA and RNA
- Exchanges ATP for ADP in mitochondria and obligate intracellular parasites; may be nucleotide/H+ transporters

Simultaneous Correlations of Environmental Features and Membrane Proteins
Canonical Correlation Analysis
Environmental Features ? Membrane Protein Families
We have addressed this question by:
1. Comparing site similarity based on these two sets of features
2. Finding particular discriminating families between environmental groupings
But we don’t know what particular features are associated with each other, and we know that they are all likely interdependent: Canonical Correlation Analysis
[Figure labels: Family 1, Family 2, Family 5; Salinity, Pollution, Temp]

Canonical Correlation Analysis
- CCA allows us to take advantage of the continuity of the features and observe which features are invariant or variant, and the type (positive, negative) of relationship between them.
- We correlate all the variables, protein families and environmental features, simultaneously.
- We have two sets of variables, X1 ... X15 (environmental features) and Y1 ... Y151 (membrane protein families)
We are looking for two vectors, a and b (a set of weights for X and Y), such that the correlation between the projections Xa and Yb is maximized: $\rho = \max_{a,b} \operatorname{corr}(Xa, Yb)$

CCA results
We are defining a change of basis of the cross-covariance matrix
We want the correlations between the projections of the variables, X and Y, onto the basis vectors to be mutually maximized.
Eigenvalues: squared canonical correlations
Eigenvectors: normalized canonical correlation basis vectors
[Figure: correlation circle with Environment and Family points; rings at Correlation = 1 and Correlation = .3]
This plot shows the correlations in the first and second dimensions
Correlation Circle: The closer the point is to the outer circle, the higher the correlation
Variables projected in the same direction are correlated

CCA results
107 variant membrane protein families
44 invariant membrane protein families
[Figure: Dimension 1 vs. Dimension 2, regions marked invariant and variant. Environmental features plotted: Pollution, Climate change, Shipping, Dissolved O2, Water depth, Acidity, Chlorophyll, Sample Depth, UV, Temperature, Phosphate, Salinity, Nitrate, Silicate, App. O2 util.]
Difficult to see the strength and directionality of a relationship
Weights of the features are difficult to visualize and compare
There is no means of quantifying the variation between sets of features

Protein Families and Environmental Features Network (PEN)
$\cos\theta = \dfrac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$
Distance: Dot product between 1st and 2nd Dimension of CCA

Protein Families and Environmental Features Network (PEN)
[Figure labels: COG0598, Magnesium Transporter; COG1176, Polyamine Transporter]
“Bi-modules”: groups of environmental features and membrane protein families that are associated
UV, dissolved oxygen, apparent oxygen utilization, sample depth, and water depth are not in the network

Bi-module 1: Phosphate/Phosphate Transporters
Low Phosphate: high affinity phosphate transporters, which are induced during phosphate limitation
High Phosphate: low affinity inorganic phosphate ion transporters, which are constitutively expressed

Microbes modulate content in response to phosphate
Martiny et al., Env Microbiology, 2009: Phosphate concentration is related to phosphate acquisition genes in Prochlorococcus
Van Mooy et al., Nature, 2009: Microbes modulate phospholipid content in response to phosphate concentrations

Bi-module 2: Iron Transporters/Pollution/Shipping
Negative relationship between areas of high ocean-based pollution and shipping and transporters involved in the uptake of iron
Pollution and Shipping may be a proxy for iron concentrations

Bi-module 2: Iron Transporters/Pollution/Shipping
Iron is usually limiting in oceans: High Nitrate-Nutrient/Low Chlorophyll regions
Delivery of iron is usually by:
- terrestrial input
- fluvial (rivers) input
- upwelling from the ocean floor
- aeolian dust from land
Ridgwell A. J. (2002) Phil. Trans. R. Soc. Lond.
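The PEN distance defined above, the cosine of the angle between two variables' coordinates in the first two CCA dimensions, can be illustrated with a short sketch. The coordinates, the family and feature names, and the 0.9 cutoff for drawing an edge are all hypothetical; the slides do not state how edges were thresholded.

```python
from math import sqrt

def cosine(a, b):
    """Cosine of the angle between two coordinate vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pen_edges(families, features, threshold=0.9):
    """Link family-feature pairs with |cos| >= threshold (assumed cutoff).

    The sign of the cosine gives the direction of the relationship.
    """
    edges = []
    for family, a in families.items():
        for feature, b in features.items():
            c = cosine(a, b)
            if abs(c) >= threshold:
                edges.append((family, feature, c))
    return edges

# Hypothetical (dimension 1, dimension 2) coordinates, not results
# from the thesis.
families = {"family_1": (0.8, 0.1), "family_2": (-0.7, -0.2)}
features = {"phosphate": (0.9, 0.2), "temperature": (0.1, 0.9)}

edges = pen_edges(families, features)
for family, feature, c in edges:
    sign = "positive" if c > 0 else "negative"
    print(f"{family} -- {feature}: cos={c:.2f} ({sign})")
```

Because the cosine ignores vector length, a variable near the center of the correlation circle can still score highly; a real pipeline might also filter on how strongly each variable correlates with the canonical dimensions.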

Bi-module 2: Iron Transporters/Pollution/Shipping
[Figure panels: Pollution and Dust; N/C and Iron Transporters]
- Negative correlation between COG4558 and COG0609 and dust/pollution values (p-value<0.01)
- Searching the BRENDA database for enzymes using iron as a cofactor reveals that an increase in these two COGs is negatively correlated with the amount of enzymes present that require iron.

Conclusions
New method (PEN) to visualize complex relationships in metagenomic data using explicit environmental variables
We show both known and intuitive relationships between features and genomic content
CCA also reveals the invariant fraction of environmental features and protein families (highlights important cellular processes): Chloride Channel, Type II secretion Proteins (virulence)
Many variant ABC-type transporters (34/41): suggests streamlining for optimization and energy conservation

Much of Membrane Protein Space Remains Uncharacterized
• 15% of predicted membrane proteins had NO homology to Genbank (e-value<1e-10)
• We used short motifs (PROSITE) to characterize a small fraction of these, including ABC Transporters, GPCRs, Lipocalins, betalactamases
16% (29,384) were annotated

Intraribotype diversity and the definition of a ‘species’
16S analysis of GOS data reveals that most sequences fall into 5 ribotypes
However, there were very few identical sequences, suggesting that no two cells have identical genome sequences
Eugene V Koonin, Nat Biotechnology, 2007
This suggests that ocean microbes are rather adaptive to their environments
We observe diversity in membrane protein content and abundance, and show that it is a reflection of different environmental conditions more than phylogenetic diversity (16S)
These are mostly oligotrophic (nutrient poor) waters, and environmental conditions have likely been fairly constant over many years; genomes are “streamlining”

Conclusions
Microbes from ocean surface samples show diversity in membrane protein content
Genotypic variation within similar natural populations occurs in response to environmental conditions
Integration of Environmental Features using GPS coordinates
Diversity in membrane proteins was shown to be a reflection of different environmental conditions more than phylogenetic diversity
Integration of geospatial data can highlight unexpected trends, as anthropogenic factors seem to be reflected in microbial function
Environmental clusters show differences in membrane protein content which reflect environmental conditions (pollution/efflux proteins)
Developed (PEN) and adapted techniques to connect features of environment to specific protein families

Acknowledgements
Advisors: Donald Engelman and Mark Gerstein
Collaborators
Gerstein Lab: Tara Gianoulis, Kevin Yip, Rob Bjornson
Committee Members: Jim Bowie (UCLA), Annette Molinaro, Lynne Regan, Mike Snyder
Administrative Staff: Mary Backer, Ann Nicotra, Nessie Stewart
Yale University Biomedical High Performance Computing Facility; NIH grant RR19895, which funded the instrumentation
Nicolas Carriero, Philip Kim, Jan Korbel, Sam Flores
Engelman Lab: Damien Thevenin
Yale Map Collection: Julia Rogers, Stacey Maples
Past and Present members of Engelman and Gerstein Labs