Proc. Natl. Acad. Sci. USA 105: 21034-21038 (2008).

Discovery and revision of Arabidopsis genes by proteogenomics
Natalie E. Castellana, Samuel H. Payne, Zhouxin Shen, Mario Stanke,* Vineet Bafna, and Steven P. Briggs
University of California San Diego; *Institute for Microbiology and Genetics, Göttingen, Germany

Limitations of gene annotation
• Based on evidence of transcripts.
• Depends on gene finding/protein prediction algorithms.
• How do we define genes?
• Models suffer from errors in reading frame and exon definition.
• Rare transcripts? Noise?
• Arabidopsis is the best annotated plant genome, and other plant genomes are annotated relative to Arabidopsis.

Types of alternative splicing

What did Castellana et al. do to detect gene model errors?
• Isolated Arabidopsis proteins from different tissues.
• Analyzed tryptic peptides by tandem mass spectrometry.
• Determined sequences for 144,079 distinct peptides.
• Confirmed gene models for 40% (12,769) of annotated genes (assuming a gene total of 31,922).
• Found 18,024 novel peptides, suggesting 13% of the proteome was missing or incorrect.
• Added or corrected 1,473 genes/proteins, leaving 1 to 4% of protein-coding genes unidentified.

Proteins
• Protein extracts from four Arabidopsis organs (leaf, root, flower, silique) and the cell culture MM2d.
• Phosphoproteins from MM2d were enriched using TiO2.
• Sodium orthovanadate (Na3VO4) was used as a phosphatase inhibitor.
• Cysteines were reduced and alkylated.
• Proteins were digested with trypsin.
• Peptides were separated by high-resolution 3D-LC (RP1, SCX, RP2) in 45 runs, producing 144,079 tryptic peptides.

Mass Spectrometry (MS)
From Wikipedia.
Ionized molecules or molecule fragments are measured by their mass-to-charge ratios (m/z):
1) The components of the sample are ionized by an electron beam, which results in the formation of charged particles (ions).
2) The ions are directed into electric and/or magnetic fields.
3) The mass-to-charge ratio of the particles is computed from their motion as they transit through the electromagnetic fields.
4) The ions, which were sorted according to m/z in step 3, are detected.

Mass spectrometers consist of three modules:
1) An ion source, which converts gas-phase sample molecules into ions (or, in the case of electrospray ionization, moves ions that exist in solution into the gas phase).
2) A mass analyzer, which sorts the ions by their masses by applying electromagnetic fields.
3) A detector, which measures the value of an indicator quantity and thus provides data for calculating the abundance of each ion present.

A quadrupole time-of-flight hybrid tandem mass spectrometer.
• Multiple stages of mass analysis separation can be accomplished with MS steps separated in space or in time.
• In tandem mass spectrometry in space, the separation elements are physically separated.
• These elements can be sectors, transmission quadrupoles, or time-of-flight analyzers.
• ESI is electrospray ionization.
• MALDI is matrix-assisted laser desorption/ionization.

Work flow
Castellana N. E. et al. PNAS (2008) 105:21034-21038 ©2008 by National Academy of Sciences

Acquisition of Spectra
• Peptides were charged by electrospray ionization.
• Spectra were acquired by LTQ linear ion trap tandem mass spectrometry.
• 21 million spectra were acquired. Data are archived in Tranche (http://tranche.proteomecommons.org).
• Spectra were searched against three reference databases: TAIR 7, a six-frame translation of the genome, and ab initio gene predictions using AUGUSTUS and exon prediction.

Number of assigned spectra, distinct peptides, and proteins in different samples and organs. Baerenfaller et al. (2008) Science 320: 938-941.
Plant tissue               Spectra    Distinct peptides   Proteins   Avg. mol. mass (kD)
Differentiated organs      465,836    64,219              10,902     54.6
  Roots                     71,516    27,546               6,125     55.0
    Roots 10 days           38,476    20,301               5,159     55.7
    Roots 23 days           33,040    16,984               4,466     54.3
  Leaves                    80,186    20,417               4,853     57.5
    Cotyledons              39,419    13,628               3,665     58.2
    Juvenile leaves         40,767    14,437               3,892     57.8
  Flowers                  147,650    33,192               7,040     57.4
    Flower buds             54,588    19,467               5,104     58.5
    Open flowers            57,861    20,205               5,215     59.0
    Carpels                 35,201    13,393               3,946     56.7
  Siliques                  79,589    23,054               5,779     54.6
  Seeds                     86,895    13,901               3,789     54.7
Cell culture               324,345    49,842               8,698     57.3
  Dark                     149,051    34,551               6,547     59.7
  Light                    143,583    32,656               6,474     59.8
  Light; small              31,711    15,318               4,472     43.2
Total                      790,181    86,456              13,029     54.7
TAIR7                                                     27,029     45.9

• 65% of all peptides were detected in only one organ.
• 1.3% were identified in all organs.

Some Peptide Bookkeeping
Total peptides                                                144,079
Peptides in TAIR 7 annotation                                 126,055
Peptides not in TAIR                                          18,024
Peptides not in TAIR but uniquely located in the genome       16,348
New intergenic "clusters"                                     1,765 (genes)
Former noncoding pseudogenes                                  561 genes (31%)
Never recognized as genes before due to inadequate support    331 genes (20%)
Uniquely identified by peptides                               198 genes

Fig. S1. Discovery curve, showing the number of distinct peptides matching TAIR7 recovered as a function of the number of annotated spectra. The discovery curve is separated to show the contribution of each individual dataset.

Novel gene discovery
(A) A cluster of 13 uniquely located peptides that do not overlap a current gene model (Chr3). The prediction track shows the single-exon gene model produced by AUGUSTUS.
(B) The predicted sequence shows strong homology to a thylakoid lumen family protein (sp|P82658|TL19_ARATH). It also shows strong similarity to proteins in both grapevine (emb|CAO40861.1, a hypothetical gene) and rice (Os08g0504500, a cDNA-derived gene).
Castellana N. E. et al.
PNAS 2008;105:21034-21038 ©2008 by National Academy of Sciences

Intergenic Regions
• 64% of intergenic clusters overlap annotated pseudogenes or transposons.
• Annotated pseudogenes may be incorrectly truncated and have missing exons.
• Transposons may contain protein-coding genes unrelated to transposon activity (gene hitch-hiking).
• A large number (7,442) of small ORFs have been found as transcripts from intergenic regions.* 155 of these have predicted peptides.
*Hanada et al. (2007) Genome Research 17:632-640.

Peptides overlapping a predicted transposable element gene
Five peptides overlap an annotated transposable element gene. The inferred protein is 56% identical to a ubiquitin-like protease.
Castellana N. E. et al. PNAS 2008;105:21034-21038 ©2008 by National Academy of Sciences

Gene refinement: new exons, boundary changes, exon skipping, modified translation start and stop sites
• A majority are novel exons: 60% are within introns, and 40% are in UTRs.
• 26 cases may actually be a single exon.
• Exon extension and shortening are equally frequent.
• AUGUSTUS, using the peptide evidence, predicts altered transcripts in 695 genes.
• In 130 cases, peptide variation indicates new isoforms.

Refined Gene Model
Four novel peptides map to the 5' UTR and the first exon of a protein kinase.
Castellana N. E. et al. PNAS 2008;105:21034-21038 ©2008 by National Academy of Sciences

New gene models from identified peptides
Baerenfaller et al. (2008) Science 320: 938-941.

Take home lessons
• MS is a powerful adjunct to genomics and transcriptomics.
• It allows a more precise definition of coding genes.
• Proteomics is becoming more quantitative and less expensive.
• MS can provide absolute protein quantitation.
• MS is likely to play an increasing role in "omic" research.
• Proteomics people will want more respect.

References
• Katja Baerenfaller, Jonas Grossmann, Monica A.
Grobei, Roger Hull, Mattias Hirsch-Hoffman, Shaul Yalovsky, Phillip Zimmermann, Ueli Grossniklaus, Wilhelm Gruissem, Sacha (2008). Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics. Science 320: 938-941.
• Stephen Tanner, Zhouxin Shen, Julio Ng, Liliana Florea, Roderic Guigó, Steven Briggs, and Vineet Bafna (2007). Improving gene annotation using peptide mass spectrometry. Genome Res. 17: 231-239.
• Kousuke Hanada, Xu Zhang, Justin O. Borevitz, Wen-Hsiung Li, and Shin-Han Shiu (2007). A large number of novel coding small open reading frames in the intergenic regions of the Arabidopsis thaliana genome are transcribed and/or under purifying selection. Genome Res. 17: 632-640.
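
Supplementary sketch: six-frame translation
One of the three search databases named under Acquisition of Spectra is a six-frame translation of the genome. The Python below is a minimal sketch of that idea, not the pipeline used by Castellana et al. It translates a DNA string in all six reading frames with the standard genetic code and reports which frames contain a given peptide by exact string match, whereas the real analysis matched tandem mass spectra against the translated database. The function names (reverse_complement, translate, six_frame_translation, locate_peptide) and the example sequence are hypothetical and chosen only for illustration.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map (strand, offset) to the protein string for each of the six frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


def locate_peptide(peptide, dna):
    """List the frames whose translation contains the peptide (exact match)."""
    return [
        frame
        for frame, protein in six_frame_translation(dna).items()
        if peptide in protein
    ]


if __name__ == "__main__":
    # Toy "genome": two padding bases, a short coding stretch, and a stop codon.
    genome = "CC" + "ATGAAAACCGCTTATCGT" + "TAA"
    print(locate_peptide("MKTAYR", genome))
```

A peptide that is found in a translated frame but absent from the annotated proteome is the kind of evidence the deck calls a novel peptide. Stop codons appear as "*" in the translated strings, so a peptide can never match across a stop.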