Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
The genesis of new proteins and new biosyntheses Benjamin Pouvreau MSc (Plant Biology) This thesis is presented for the degree of Doctor of Philosophy of The University of Western Australia in 2016 School of Chemistry and Biochemistry, Faculty of Science Abstract The emergence of new proteins with novel functions is at the root of adaptive evolutionary innovation. Understanding how proteins are made and how new proteins evolve is of fundamental importance. Increasingly, it is clear that de novo protein evolution is frequent and important. The cyclic peptide SFTI-1 (Sunflower Trypsin Inhibitor 1) represents a shortcut to de novo protein evolution as it represents a protein that has evolved de novo within another protein instead of evolving fully de novo, i.e. from non-coding DNA. SFTI-1 biosynthesis and its evolutionary origin have been partly resolved. In this study I used synthetic genes and Arabidopsis transgenesis to provide more evidence for the requirement for asparaginyl endopeptidase (AEP) in processing SFTI-1 from its dual-product precursor PawS1 (Preproalbumin with SFTI-1). My data showed that SFTI-1 is not a by-product of its adjacent albumin. SFTI-1 can be processed from non-native locations within PawS1 as well as from unrelated protein precursors in an AEP-dependent fashion. In nature, the sequence for SFTI-1 emerged by expansion within an ancestral albumin gene, without altering the coding frame, creating a dual-product gene encoding a single protein precursor named PawS1 that is processed by enzymatic cleavage to produce the two distinct mature proteins. I termed neoproteinisation this type of de novo protein evolution and I believe it has been overlooked. To support this, I reveal a different family of similarly buried plant peptides (VBP for Vicilin Buried Peptides) that appear to have also evolved by neoproteinisation within ancestral vicilin precursors. VBPs are approximately 30 residues, possess a conserved CXXXC(X10-14)CXXXC motif and present as an alpha helical hairpin tethered by two disulfide bonds. VBPs appeared at the emergence of flowering plants and are now widespread in most flowering plants except Brassicales. This discovery confirms that neoproteinisation has occurred at least twice and could be a widespread and overlooked mechanism for de novo protein evolution. I Acknowledgement I thank the University of Western Australia for the financial support of my PhD by awarding me an International Postgraduate Research Scholarship (IPRS) and an Australian Postgraduate Award (APA). I thank my supervisor Josh Mylne, lecturer and research group leader at the School of Chemistry and Biochemistry for welcoming me in his research group and supervising me during this PhD. I thank the ARC Centre of Excellence in Plant Energy Biology for giving me access to plant growth facilities and mass-spectrometry facilities. I thank Aurélie Benfield for generating the aep triple mutant lines and a third of the transgenic lines used during my PhD. Special thanks to Ricarda for her magical talent with targeted proteomics. Special thanks to Owen for coping with my infinite number of questions on mass-spectrometry. I thank all the members of the Mylne lab for their support during my PhD. Special thanks to Mark for his computing tips. Special thanks to Amy for correcting many and many mistakes in this manuscript. Special thanks to Max for his enthusiasm and good mood. Special thanks to Julie for coping with me during my PhD thesis writing. I also thank all the administrative staff of UWA and the School of Chemistry and Biochemistry who helped me many times during my PhD. Special thanks to Adam and Nehal for their continuous good mood and help with my unusual product requests, such as the now famous Craig Balls. II Thesis declaration The research presented in this thesis is composed of my original work except where due reference has been made in the text. I have clearly stated the contribution by others to jointly-authored works that I have included in my thesis. This work was carried out in the School of Chemistry and Biochemistry, The University of Western Australia. The content of my thesis is the result of work I have carried out since the commencement of my candidature for the degree of Doctor of Philosophy of The University of Western Australia and does not include a substantial part of work that has been submitted to qualify for the award of any other degree or diploma in any university or other tertiary institution. The thesis is presented in separate Chapters, including three that will be submitted for publication. The bibliographical details of the work and where it appears in the thesis are as outlined below: Chapter 1: Pouvreau, B.: SFTI-1, an unusual plant cyclic peptide. Introduction Chapter 2: Pouvreau, B., Mylne, J.S.: The importance of seed storage protein maturation for germination. Author contribution: Pouvreau, B.: performed experiments (100%), wrote Chapter (100%); Mylne, J.S.: conceived the study (100%), edited Chapter. Chapter 3: Pouvreau, B., Mylne, J.S.: AEP function in dark-induced leaf senescence. Author contribution: Pouvreau, B.: performed experiments (100%), wrote Chapter (100%); Mylne, J.S.: conceived the study (100%), edited Chapter. III Chapter 4: Pouvreau, B., Mylne, J.S.: Using SFTI-1 to study protein trafficking defects. Author contribution: Pouvreau, B.: conceived the study (30%), performed experiments (100%), wrote Chapter (100%); Mylne, J.S.: conceived the study (70%), edited Chapter. Chapter 5: Pouvreau, B., Fenske, R., Mylne, J.S.: How readily will SFTI-1 hijack other sequences? (Main body of a manuscript ready for publication) Author contribution: Pouvreau, B.: conceived the study (50%), performed experiments (90%), wrote and edited manuscript (50%); Fenske, R.: performed experiments (10%); Mylne, J.S.: conceived the study (50%), wrote and edited manuscript (50%). Chapter 6: Pouvreau, B., Mylne, J.S.: Neoproteinisation: a new mechanism for de novo evolution. (Review ready for publication) Author contribution: Pouvreau, B.: wrote manuscript (100%); Mylne, J.S.: edited manuscript. Chapter 7: Pouvreau, B., Mylne, J.S.: Neoproteinisation in vicilin precursors. (Partly included in the discussion of a manuscript ready for publication) Author contribution: Pouvreau, B.: Conceived the study (50%), Performed experiments (100%), wrote manuscript (100%); Mylne, J.S.: Conceived the study (50%), edited manuscript. Chapter 8: Pouvreau, B., Mylne, J.S.: General discussion and future directions. (Conclusion to the thesis) Author contribution: Pouvreau, B.: wrote manuscript (100%); Mylne, J.S.: edited manuscript. IV Abbreviations In alphabetical order: 1‑D: One dimension 5d: Five days 9d: Nine days A. thaliana: Arabidopsis thaliana ACTH: Adrenocorticotropin ADH (or adh): Alcohol dehydrogenase AEP: Asparaginyl endopeptidase aep‑null: aep quadruple null mutant (aep1 aep2 aep3 aep4) AFGP: Antifreeze glycopeptides AGI: Reference gene model for the protein in the A. thaliana information resource (TAIR) database AP: Aspartic protease Asn: Asparagine AUC: Area under the curve C2: Original CXXXC(10-14)CXXXC motif peptide characterised in Cucurbita maxima C2a: First form of the C2 peptide from Cucurbita maxima C2b: Second form of the C2 peptide from Cucurbita maxima C2m: Potential C2 peptide homologue from Cucumis melo C2ps: Potential C2 peptide homologue from Pisum sativum C2pv: Potential C2 peptide homologue from Phaseolus vulgaris V C2s: Potential C2 peptide homologue from Cucumis sativus C2vf: Potential C2 peptide homologue from Vicia faba CaMV35S: Cauliflower mosaic virus 35s promoter ClaI: Caryophanon latum L. restriction enzyme 1 CO8: Cycloviolacin-O8 from Viola odorata Col‑0: Arabidopsis thaliana ecotype Columbia-0 CraS1: Chimeric construct with SFTI-GLDN inserted inside CRU1 CRU1: CRUCIFERIN A1 cSlo: Chicken homologue of the Drosophila Slowpoke gene CtA: Cyclotheonamide A Ctc: Cyclotide genes from Clitoria ternatea CtM: Cyclotide M from Clitoria ternatea C‑terminal: Carboxy-terminal cTI: Cyclic trypsin inhibitor from Momordica cochinchinensis CXXXC motif: amino acid motif consisting of two cysteine residues spaced by three other amino acid residues CXXXC(10-14)CXXXC motif: amino acid motif consisting of two CXXXC motifs spaced by 10 to 14 other amino acid residues DNA: Deoxyribonucleic acid cDNA: complementary DNA Dscam: Down syndrome cell adhesion molecule DTT: Dithiothreitol E. coli: Escherichia coli EDTA: Ethylenediaminetetraacetic acid VI ER: Endoplasmic reticulum ER‑signal: Peptide signal for ER targeting F(1 to 6): Filial generation of offspring of distinctly different parental types. Fg01: Arbitrary attributed mouse gene name fgf4: Fibroblast growth factor 4 gene Flt1: Vascular endothelial growth factor receptor 1 sFlt1: Alternatively spliced Flt1 sFlt1‑14: Alternatively spliced Flt1 that is human specific GABI: Generation of flanking sequence tags from T-DNA mutagenised A. thaliana G‑protein: Guanine nucleotide-binding protein GTPase: Guanosine triphosphate hydrolase HIV: Human immunodeficiency virus HPLC: High performance liquid chromatography Hsmar1: Homo sapiens mariner elements IDL: Individually darkened leaf JGW (or jgw): Jingwei KB: kalata B1 type cyclotides KLK4: human kallikrein‑related peptidase 4 LC: Liquid chromatography LEA: Late embryogenesis abundant LeaS1: Chimeric construct with SFTI-GLDN inserted inside a LEA precursor LINE‑1: Long interspersed element 1 LPH: Lipotropin VII lola: Longitudinal lacking LSU: large subunit Ltd: Limited (referring to private companies) MAG: Maigo (means a stray or lost child in Japanese) MALDI‑TOF: Matrix-assisted laser desorption/ionisation time-of-flight MES: 2-(N-morpholino)ethanesulfonic acid MiAMP2: Macadamia integrifolia antimicrobial peptide 2 MS media: Murashige-Skoog media MS: Mass spectrometry MS/MS: Tandem mass spectrometry MSH: Melanotropin NMR: Nuclear magnetic resonance NAHR: Non-allelic homologous recombination NHEJ: Non-homologous end joining NOS1: Nitric oxide synthetase 1 nos 3′: nopaline synthase 3′ UTR N‑terminal: NH2-terminal Oak: Oldenlandia affinis kalata gene OLE1: Oleosin 1 OleoS1: Chimeric construct with SFTI-GLDN inserted inside OLE1 precursor ORF: Open reading frame PA1: Pea albumin 1 pAC161: Plasmid artificial chromosome P1 version 161 VIII pAOV: Plasmid (antisense orientation vector) used to create the transformants generated during this PhD PAP85: Plastid lipid associated protein 85 PapS1: Chimeric construct with SFTI-GLDN inserted inside PAP85 precursor PawS1: Preproalbumin with Sunflower inhibitor 1 Paw[N65]S1: Chimeric construct with SFTI-GLDN removed from its native position and placed after asparagine 65 inside PawS1 Paw[N84]S1: Chimeric construct with SFTI-GLDN removed from its native position and placed after asparagine 84 inside PawS1 Paw[N96]S1: Chimeric construct with SFTI-GLDN removed from its native position and placed after asparagine 96 inside PawS1 Paw[N143]S1: Chimeric construct with SFTI-GLDN removed from its native position and placed after asparagine 143 inside PawS1 Paw[N146]S1: Chimeric construct with SFTI-GLDN removed from its native position and placed after asparagine 146 inside PawS1 Paw4SFTI: Chimeric construct with four tandem SFTI-GLDN variants placed at SFTI-GLDN native position inside PawS1 PawS1[Δ2‑21]: Chimeric construct with ER-signal removed from PawS1 PawS2: Name given to the sunflower PawS1 homologue pBI121: Plasmid binary version 121 PCR: Polymerase chain reaction pDAP101: Plasmid used to create some of the SAIL lines PDB: Protein Data Bank PhA: Petunia hybrida cyclotide A Phc1: Petunia hybrida cyclotide gene 1 IX pOLEOSIN:PawS1: Construct, expressing native PawS1 protein precursor under the control of a seed specific A. thaliana OLEOSIN promoter POMC: Proopiomelanocortin pROK2: Plasmid used to create some of the SALK lines pipsl: Phosphatidylinositol-4-phosphate 5-kinase and proteasome subunit like PV100: Protein vacuolar 100 QqQ: Triple quadrupole mass spectrometer Q‑TOF: Quadrupole time-of-flight mass spectrometer RACE: Rapid amplification of cDNA ends RNA: Ribonucleic acid mRNA: Messenger RNA SacI: Streptomyces achromogenes restriction enzyme 1 SAG12: Senescence associated gene 12 SAIL: Syngenta Arabidopsis insertion library SDS: Sodium dodecyl sulfate SDS‑PAGE: Sodium dodecyl sulfate - polyacrylamide gel electrophoresis SESA1 (or 4): SEED STORAGE ALBUMIN 1 (or 4) SesaS1: Chimeric construct with SFTI-GLDN inserted inside SESA1 SesaS2: Chimeric construct with SFTI-GLDN inserted inside SESA1 without internal propeptide. SET: Su(var)3-9 enhancer-of-zeste trithorax domain setmar: Fusion of SET and Hsmar1 DNA sequence SFTI‑1: Sunflower trypsin inhibitor 1 (cyclic peptide) 1(to 4)SFTI: Chimeric constructs with one to four tandem SFTI-GLDN variants placed at SFTI-GLDN native position inside a PawS1 variant with no albumin X SFTI‑1F12Y: SFTI‑1 variant with tyrosine 12 replaced by phenylalanine SFTI‑1I10V: SFTI‑1 variant with isoleucine 10 replaced by valine SFTI‑1K5R: SFTI‑1 variant with lysine 5 replaced by arginine SPAD: Minolta SPAD-502 leaf chlorophyll meter (SPAD is just a brand name) SSU: Small subunit SUMO: Small ubiquitin-like modifier T1 (or T2): First (or second) transgenic generation TAIR10: The A. thaliana information resource database (version10) tbc1d3: TBC (Tre-2/Bub2/Cdc16) 1 Domain Family, Member 3 gene TI: Trypsin inhibitors UniProtKB: Universal protein knowledgebase usp: Ubiquitin-specific protease gene VaA: Viola arvensis peptide A VEGF: Vascular endothelial growth factor Voc: Viola odorata cyclotide gene Vok: Viola odorata kalata gene VPE: Vacuolar processing enzyme (or AEP, legumain) VPS: Vacuolar protein sorting VSR: Vacuolar sorting receptor Ws‑2: Arabidopsis thaliana ecotype Wassilewskija-2 WT: Wild type XhoI: Xanthomonas holcicola restriction enzyme 1 XI XL: Large upstream exon from the gene for G-protein first subunit ALEX: Product of the alternative splicing of the XL exon ynd: yande gene ZAG1: Zea mays AGAMOUS 1 ZMM2: Zea mays MADS 2 (MADS is a conserved amino acid motif) XII Contents ABSTRACT I ACKNOWLEDGEMENT II THESIS DECLARATION III ABBREVIATIONS V CONTENTS XIII CHAPTER 1 1 1 2 SFTI-1, AN UNUSUAL PLANT CYCLIC PEPTIDE 1.1 Plant cyclic peptides 2 1.1.1 Plant cyclic peptide identification 2 1.1.2 Plant cyclic peptide biosynthesis 4 1.1.3 Plant cyclic peptide evolution 8 1.2 Discovery of the sunflower trypsin inhibitor SFTI-1 10 1.2.1 Identification of SFTI-1 10 1.2.2 Multiple applications for SFTI-1 properties 12 1.3 Discovery of SFTI-1 biosynthetic origins 14 1.3.1 Identification of PawS1 (Preproalbumin with SFTI-1) 14 1.3.2 The napin-type seed storage albumins 14 1.3.3 Seed albumins are matured by AEP 16 1.4 SFTI-1 is matured by the seed albumin processing machinery 19 1.4.1 SFTI-1 requires AEP for proper processing in transgenic lines 19 1.4.2 A recombinant AEP can perform SFTI-1 maturation 22 1.5 SFTI-1 as a model to study protein biosynthesis and evolution 23 1.5.1 SFTI-1 is an unusual buried cyclic peptide 23 1.5.2 SFTI-1 stepwise evolution 26 1.6 Thesis aim 28 XIII CHAPTER 2 2 29 THE IMPORTANCE OF SEED STORAGE PROTEIN MATURATION FOR GERMINATION 30 2.1 Introduction 30 2.2 Material and methods 32 2.2.1 Plant material 32 2.2.2 Plant growth conditions 32 2.2.3 Germination assays 32 2.2.4 Protein extraction 33 2.2.5 Protein 1-D gel electrophoresis 33 2.3 Results 34 2.3.1 Preliminary study 34 2.3.2 Germination rate is not influenced by media composition 35 2.3.3 Germination rate varies in a fixed genotype 37 2.3.4 Impaired seed storage protein maturation does not influence germination rate 37 2.4 Discussion 38 CHAPTER 3 39 3 40 AEP FUNCTION IN DARK-INDUCED LEAF SENESCENCE 3.1 Introduction 40 3.1.1 AtAEP3 expression is highest in senescing leaves 40 3.1.2 Senescence is a natural recycling plant mechanism 41 3.1.3 Dark-induced senescence as a tool to study senescence 42 3.2 Material and Methods 43 3.2.1 Plant material 43 3.2.2 Chlorophyll abundance measurement 43 3.2.3 Plant Growth Conditions, selection, crosses and seed collection 43 3.3 Results 44 3.3.1 Preliminary study 44 3.3.2 Natural variation for response to dark-induced senescence 46 3.3.3 AEP is not involved in dark-induced senescence 47 3.4 Discussion 49 XIV CHAPTER 4 50 4 51 USING SFTI-1 TO STUDY PROTEIN TRAFFICKING DEFECTS 4.1 Introduction 51 4.2 Material and Methods 54 4.2.1 Plant material and growth conditions 54 4.2.2 Plant genotyping 54 4.2.3 Plant crossing and selection 55 4.2.4 Segregation analysis 56 4.2.5 Seed peptide extraction 56 4.2.6 Acyclic SFTI-1 detection and quantification 57 4.2.7 Protein 1-D gel electrophoresis 58 4.3 Results 59 4.3.1 Generation of hybrid lines 59 4.3.2 SFTI-1 as a tool to monitor seed storage protein maturation 59 4.3.3 Seed storage protein trafficking pathways 61 4.4 Discussion 64 CHAPTER 5 65 5 66 HOW READILY WILL SFTI-1 HIJACK OTHER SEQUENCES? 5.1 Introduction 66 5.2 Material and methods 67 5.2.1 Synthetic genes for in planta expression 67 5.2.2 Plant transformation and growth conditions 68 5.2.3 Seed peptide extraction 68 5.2.4 SFTI-1 (cyclic and acyclic) detection and acyclic SFTI-1 quantification 69 5.2.5 Shotgun proteomics of A. thaliana seeds 69 5.2.6 Targeted proteomics 70 5.3 Results 71 5.3.1 The SFTI-1 sequence can emerge from non-native locations 71 5.3.2 The SFTI-1 sequence can recruit AEP 74 5.3.3 The SFTI-1 sequence can recruit AEP without an adjacent protein 77 5.4 Conclusion 78 XV CHAPTER 6 6 79 NEOPROTEINISATION: A NEW MECHANISM FOR DE NOVO EVOLUTION 80 6.1 Introduction 80 6.2 How do new proteins arise? 81 6.2.1 New protein evolution by DNA based gene duplication and divergence 81 6.2.2 New protein evolution through RNA based gene duplication and divergence 83 6.2.3 Creation of new proteins through gene or transcript fusion 85 6.2.4 Acquisition of de novo proteins by horizontal gene transfer 89 6.2.5 Emergence of de novo protein by new alternative gene coding 90 6.2.6 De novo protein evolution 91 6.2.7 Neoproteinisation, a new mechanism of divergence that creates de novo proteins 94 6.3 How to discover other neoproteinisation events? 6.3.1 Neoproteinisation creates unusual genes encoding multiple proteins 6.3.2 Leads for discovery of other neoproteinisation events 95 95 100 6.4 Conclusion 102 CHAPTER 7 103 7 104 NEOPROTEINISATION IN VICILIN PRECURSORS 7.1 Introduction 104 7.2 Material and methods 107 7.2.1 Vicilin alignment 107 7.2.2 Seed extract fractions preparation 107 7.2.3 MALDI-TOF-MS analysis 108 7.2.4 LC/MS analysis 108 7.3 Results 109 7.3.1 Several classes of vicilin precursors 109 7.3.2 Emergence of conserved CXXXC(10-14)CXXXC motifs in vicilin precursors 111 7.3.3 Are CXXXC(10-14)CXXXC motifs a neoproteinisation event? 114 7.3.4 Identification of CXXXC(10-14)CXXXC peptides 115 7.3.5 Identification of C2 homologues from closely related species 117 7.3.6 More widespread evidence for C2-like peptides 118 7.4 Conclusion 120 XVI CHAPTER 8 121 8 122 GENERAL DISCUSSION AND FUTURE DIRECTIONS 8.1 Seed storage protein trafficking 122 8.2 SFTI-1 maturation 123 8.3 SFTI-1 evolution, neoproteinisation 124 8.4 Neoproteinisation of vicilin precursors 125 CHAPTER 9 126 9 127 SUPPLEMENTARY INFORMATION 9.1 Gene sequences, relative to trafficking mutants (Chapter 4) 127 9.2 Oleosin promoter used for transgenic expression in A. thaliana 137 9.3 Synthetic gene sequences from Chapter 5 138 9.4 Chimeric protein precursors from Chapter 5 144 REFERENCES 147 XVII Chapter 1 SFTI-1, an unusual plant cyclic peptide An introduction 1 1 SFTI-1, an unusual plant cyclic peptide 1.1 Plant cyclic peptides 1.1.1 Plant cyclic peptide identification Within the last two decades, there has been significant advancement in research to identify circular proteins in higher organisms and in the development of synthetic cyclic proteins. In general, backbone cyclic peptides have several advantages over their linear counterparts. They are more thermodynamically stable than non cyclic peptides and because they have no free ends, they resist digestion by exopeptidases, making them much more stable than their linear counterpart. Cyclic peptides are also more efficient than non cyclic ones at creating binding interactions with target proteins because the cyclic structure leads to reduced entropic losses upon binding (Marx, Korsinczky et al. 2003). Cyclic peptides have various functions and are widespread. They have been isolated from several tissues of plants and mammals, and are also found in fungi and bacteria (Craik, Mylne et al. 2010). Although the biological function of cyclic peptides in plants often remains unknown, the initial interest in cyclic peptides from plants was driven by their diverse and often very potent biological activities. Their bioactivities include uterotonic, anti-HIV and neurotensin inhibitory activities, amongst others. Cyclic peptides were initially discovered through biochemical analysis of indigenous medicinal plants (Gran 1970; Gran 1973) and later during pharmaceutically oriented screening programs (Gustafson, Sowder et al. 1994; Witherup, Bogusky et al. 1994). A typical example is the plant cyclic peptide kalata B1 extracted from the leaves of the African plant Oldenlandia affinis. Kalata B1 was originally discovered as the active component in a native African medicine from the Congo region, made from Oldenlandia affinis boiled leaves and used by women to accelerate childbirth (Gran 1970; Gran 1973). The kalata B1 (founding member of the cyclotide family of plant cyclic peptide, see next section for a more detail definition of all families of plant cyclic peptides) sequence and structure was characterised in 1995, revealing its cyclic nature (Saether, Craik et al. 1995). 2 Kalata B1 was later also reported to have antimicrobial (Tam, Lu et al. 1999), haemolytic (Barry, Daly et al. 2003), anti-HIV (Daly, Gustafson et al. 2004) and insecticidal activities (Jennings, West et al. 2001). More recently, several cyclic peptides, and particularly cyclotides, have been reported to have insecticidal activity (Jennings, West et al. 2001; Jennings, Rosengren et al. 2005; Barbeta, Marshall et al. 2008). Many other activities also related to defence have been reported for cyclotides and other specific cyclic peptides, including antimicrobial, antifouling, cytotoxic, molluscicidal and nematocidal activities (Lindholm, Göransson et al. 2002; Göransson, Sjogren et al. 2004; Colgrave, Kotze et al. 2008; Plan, Saska et al. 2008; Colgrave, Kotze et al. 2009). However these studies lack native in planta evidence and the reported activities might not reflect the native functions and could be a result of their known property for interfering into differing degrees with membranes (Craik, Mylne et al. 2010). Nevertheless, all their biological properties and structural advantages make cyclic peptides promising scaffolds for the design of stable pharmaceuticals. Naturally occurring peptidic inhibitors of known structure are a promising starting point for the design of novel drugs. They can provide a relatively large surface area to enhance the potential of designing selective analogues. In addition, as the overall three dimensional structure of the original protein is preserved in the new variants; the interpretation of interactions at the atomic level is possible without further structural analysis. Although linear peptides are useful for some applications, these molecules tend to be unstable in vivo due to the hydrolytic activity of peptidases. Therefore, cyclic peptides are promising scaffolds for the design of stable drugs (Boy, Mier et al. 2010). Numerous examples of re-engineered plant cyclic peptides, and particularly cyclotides, have been reported, with potential applications in the treatment of many diseases including vascular disorders (Gunasekera, Foley et al. 2008; Chan, Gunasekera et al. 2011; Getz, Cheneval et al. 2013), allergic asthma (Sommerhoff, Avrutina et al. 2010), inflammatory pain (Wong, Rowlands et al. 2012), obesity (Eliasen, Daly et al. 2012), cancer (Ji, Majumder et al. 2013), HIV (Aboye, Ha et al. 2012) and other diseases related to the immune system (Wang, Gruber et al. 2014). For example, kalata B1 was re-engineered in targeted studies focusing on the development of therapeutics for the treatment of multiple sclerosis, an inflammatory disorder of the central nervous system, in which the immune system has a significant role (see next page). 3 Figure 1.1. Plant cyclic peptides. For each major class the size range is given, the typical structure (when known) is represented by canonical example of each group and presence of disulfide bonds is highlighted in yellow; (a) SFTI-1, PDB code 1SFI (Luckett, Garcia et al. 1999), (b) kalata B1, PDB code 1KAL (Saether, Craik et al. 1995), (c) cTI2, PDB code 1HA9 (Heitz, Hernandez et al. 2001), (d) Segetalin A (6 amino acid residues orbitide) chemical schematic (Condie, Nowak et al. 2011; Arnison, Bibb et al. 2013), (e) Adouetine X chemical schematic (Dias, Gressler et al. 2007; Trevisan, Maldaner et al. 2009). The treatments currently used usually involve repressing the patient immune system, thus creating the particularly strong side effect of not being able to have an immune response to pathogens infections. Peptide antigens are a promising alternative strategy, to modulate the immune system instead of shutting it down, but their therapeutic potential is often limited by their poor in vivo stability. The exceptional stability of some cyclic peptide provides another alternative, and the potential of re-engineered plant cyclic peptides in the treatment of complex diseases has already been demonstrated. In a recent study, three of the 29 amino acid residues of kalata B1 were replaced by a seven amino acid residues peptide corresponding to the active site of an immune-regulatory protein called myelin oligodendrocyte glycoprotein. Injection of this so-called grafted peptide reduced significantly both clinical and histological symptoms of multiple sclerosis in mouse system (Wang, Gruber et al. 2014). Moreover, the use of a naturally occurring peptide as scaffolds for the design of new therapeutic molecules also leaves open the possibility of using plants as “factories” for producing the modified inhibitors via a low cost route (Quimbar, Malik et al. 2013). The numerous plant cyclic peptides reported with insecticidal activities could also be used to develop alternative approaches to chemical pesticides, such as the production of transgenic plants that express naturally occurring plant cyclic peptides to protect themselves against pathogens (Craik, Mylne et al. 2010). 1.1.2 Plant cyclic peptide biosynthesis In order to engineer plants for the production of modified cyclic peptides, it is necessary to understand how plants make cyclic peptides. In this regard, plant cyclic peptides can be separated into five families according to how their specific structures and biosynthesis are known (figure 1.1). Some are thought to be non-gene-encoded and synthesised by a non-ribosomal peptide synthetase, but the regulation of the process is not well understood (Tan and Zhou 2006) and poorly studied. Some non-gene-encoded cyclic peptides have been reported in other organisms, from bacteria to human. They can be produced during other proteins degradation by enzymes that cut pieces off different proteins and ligate and cyclise them together, or by a non-ribosomal protein synthesis pathway with spontaneous cyclisation (Xu, Li et al. 2011). 4 Figure 1.2. Primary structures of cyclotides and their precursors. (a) Conserved primary structure of cyclotides, represented by kalata B1 (KB1). (b) Sequences of characteristic cyclotides for each plant family where cyclotides have been found, presented in an alignment highlighting their strong sequence conservation, in particular for the cysteine residues. Alignment with CLC Genomics (CLCbio), black for perfect homology, grey for strong homology cysteine residues highlighted in red. (c) Primary structure of the protein precursors for characteristic cyclotides. The Oak genes from Rubiaceae encode the kalata B (KB) cyclotides and can encode one to three cyclotides. In Violaceae kalata B1 is produced from a multiple precursor named Vok1 also containing two copies of the cyclotides VaA, but single precursors also exist such as Voc1 containing the cyclotide CO8. The truncated type of precursor from Solanaceae (Phc1) and the unusual Ctc type of dual precursor from Fabaceae are also presented. Sequences are from UniProtKB; Oak1 (P56254) containing kalata B1 (KB1), Oak4 (P58454) containing kalata B2 (KB2), Oak2 (P58455) containing kalata B3/B6 (KB3/6), Oak3 (P58457) containing kalata B7 (KB7), Voc1 (P58440) containing Cycloviolacin-O8 (CO8), Vok1 (Q5USN7) containing cyclotide VaA, Phc1 (B3EWH5) containing cyclotide PhA and Ctc1 (P86899) containing the cyclotide CtM. Precursor structure deduced from sequences and previously published data (Gruber, Elliott et al. 2008; Nguyen, Zhang et al. 2011; Poth, Mylne et al. 2012). The cyclic peptides of the orbitide family are mostly represented in Caryophyllaceae and close relatives. They typically consist of a single ring of five to twelve amino acid residues, that are thought to originate from gene-encoded linear precursors, in a two-step process involving serine-protease-like enzymes, but non-gene-encoded amino acids can also be added to the structure, and their biosynthesis remain to be fully elucidated (Condie, Nowak et al. 2011; Barber, Pujara et al. 2013). On the other hand, the members of the three other known families of plant cyclic peptides are thought to have their sequence totally gene-encoded and their biosynthesis have been described in more details and thus represent a more evident choice for transgenic applications. The third family of plant cyclic peptides, and the most broadly represented by far, is the cyclotide family, which members have been isolated from various species in the Rubiaceae (Craik 2001; Jennings, West et al. 2001), Violaceae (Dutton, Renda et al. 2004), Fabaceae (Nguyen, Zhang et al. 2011; Poth, Colgrave et al. 2011), and Solanaceae (Poth, Mylne et al. 2012) which are phylogenetically distant plant families. Cyclotides are disulfide-rich cyclic peptides from 29 to 31 amino acid residues and have a unique structure (figure 1.1 and 1.2) consisting of a distorted cyclised triple-stranded β-sheet, with a knotted arrangement of three disulfide bonds (Craik, Daly et al. 1999). This cyclic cystine-knot motif makes cyclotides particularly resistant to enzymatic degradation, but also to thermal or chemical degradation (Colgrave and Craik 2004). Their sequence and structure (figure 1.2) are highly conserved among species (Dutton, Renda et al. 2004). They are characterised by the first cyclotide to have been discovered, kalata B1. Kalata B1 was originally discovered in Oldenlandia affinis (Rubiaceae) leaves, but is also found in the phylogenetically distant species Viola odorata (Violaceae) (Gruber, Elliott et al. 2008). Cyclotides are typically encoded by a gene producing a larger protein precursor consisting of a signal-peptide (endoplasmic-reticulum targeting), a propeptide and one to three identical or slightly different cyclotide domains, which themselves consist of a cyclotide surrounded by two propeptides. However, little is known about the enzymes involved in releasing and cyclising cyclotides from their linear precursors. An enzyme called asparaginyl endopeptidase (AEP, also referred to as VPE, vacuolar processing enzyme or 5 legumain) was suggested to be involved in the removal of the C-terminal propeptide of cyclotides (Saska, Gillon et al. 2007) , but the enzymes involved in the removal of the two N-terminal propeptides are still unknown (Mylne, Chan et al. 2012). However, an unusual cyclotide precursor was recently described in a plant from the Fabaceae family. Ctc type cyclotide genes from Clitoria ternatea encode dual protein precursors that also contain an albumin precursor similar to PA1 (Pea albumin 1) seed storage albumin. Although no data were provided to support the existence of the albumin at the protein level, the Ctc cyclotides were found in most tissues of this plant and then characterised. They share very high sequence and structure similarity with most other plant cyclotides and process cytotoxic activities like many other plant cyclotides (Nguyen, Zhang et al. 2011). These cyclotides, characterised by the CtM cyclotide, only require cleavage by AEP at their C-terminus to be released from their precursor, because their N-terminus is unmasked by endoplasmic reticulum (ER) signal removal. 6 Figure 1.3. Structures of squash trypsin inhibitors and their precursors. (a) Conserved primary structure of cyclic squash trypsin inhibitors, represented by cTI2 from Momordica cochinchinensis (Cucurbitaceae). (b) Sequence of cyclic squash trypsin inhibitors from Momordica cochinchinensis for which the precursor and encoding gene has been identified (cTI - 1, 2, 4, 7 and 8), presented in an alignment (as for Figure 1.2) highlighting their strong sequence conservation between each other, but also with their linear counterparts (TI5 and TI6) and with other common linear (*) squash trypsin inhibitors from pumpkin (TI-I and TI-IV). Contrastingly their sequence strongly differs from cyclotides, exemplified here by kalata B1 (KB1), , apart from the structural cysteines and the initial Gly critical for cyclisation. (c) Primary structure of the known precursors of cyclic squash trypsin inhibitors, compared to a common cyclic squash trypsin inhibitor precursor. Sequences and precursor structure from previously published data (Mylne, Chan et al. 2012) excepted sequences of TI-I (P01074) and TI-IV (P07853) that are from UniProtKB. (d) Structure of cTI2, PDB code 1HA9 (Heitz, Hernandez et al. 2001) and TI-I, PDB code 2V1V (Zhukov, Jaroszewski et al. 2000). The terminal amino acid residues for acyclic TI-I and the additional “cyclic-loop” in cTI2 are labelled. The cyclic squash trypsin inhibitors, or macrocyclic knottins, from the Cucurbitaceae family, set apart and constitute a fourth family of plant cyclic peptide (Huang, Liu et al. 1993; Mylne, Chan et al. 2012). They share the same cystine-knot motif as cyclotides and for this reason they have been used similarly to cyclotides as drug-scaffolds, but they strongly diverge in sequence. In addition, whereas cyclotides can naturally have various activities and are found in phylogenetically dispersed plant families, cyclic squash trypsin inhibitors have only been found in the seeds of Momordica cochinchinensis (Cucurbitaceae), a tropical liana, and their native activity is limited to trypsin inhibition, although they have been re-engineered for diverse different applications (Thongyoo, Bonomelli et al. 2009). The cyclic squash trypsin inhibitors are matured from large precursors that also differ from traditional cyclotides precursors. Their precursors are constituted by tandem series of cyclic peptide sequences, separated by identical spacers and terminate with a sequence for an acyclic squash trypsin inhibitor of similar sequence and structure (figure 1.3). Contrary to their cyclic homologues, this type of acyclic squash trypsin inhibitor, such as the typical trypsin inhibitor one (TI-I) from pumpkin, is widespread in Cucurbitaceae, with almost identical sequence and structure compared to acyclic trypsin inhibitors from Momordica cochinchinensis (figure 1.3). The squash cyclic trypsin inhibitors are thought to be processed at both ends and cyclised by AEP, whereas the maturation of the acyclic counterpart remains unclear, but is independent of AEP. The identical structures and repetitive sequences of the cyclic squash trypsin inhibitors suggest an evolution by internal duplication of the same motif within an ancestral gene (Mylne, Chan et al. 2012). Sunflower trypsin inhibitor 1 (SFTI-1) is the first member of a newly discovered family of cyclic peptides, restricted to sunflower (Astereceae) and close relatives (termed Helianthae), and will be introduced in detail in the other parts of this Chapter. 7 Figure 1.4. Structures of PA1b , Ctc cyclotides and of their precursors. The sequences of some of the PA1 and Ctc known precursors are presented in separated alignments (as for Figure 1.2), associated with their schematic of their primary structure as previously described (Nguyen, Zhang et al. 2011). Sequences are from UniProtKB; PA1-1 (P62926), PA1-2 (P62927), PA1-3 (P62928), PA1-4 (P62929), PA1-5 (P62930), PA1- 6(P62931), Ctc1 (P86899), Ctc2 (G1CWI0), Ctc3 (G1CWH9), Ctc4 (G1CWH1). The structure of PA1b, PDB code 1P8B (Jouvensal, Quillien et al. 2003) and CtM, PDB code 2LAM (Poth, Colgrave et al. 2011) are presented. 1.1.3 Plant cyclic peptide evolution Although it is unclear how cyclotides with identical sequence and similar precursors originated in a few evolutionary dispersed plant families (Mylne, Chan et al. 2012), the discovery of the unusual Ctc gene family in Fabaceae brought a first clue. The Ctc gene family share many similarities with the PA1 gene family from pea, another Fabaceae (figure 1.4). The gene PA1 encodes a dual protein precursor that is cleaved into two different and independent proteins, an albumin, PA1a, and a linear cystine-knot peptide, PA1b (Higgins, Chandler et al. 1986). Several copies of the gene exist in the pea genome, with little sequence variation. The PA1a albumin is the most abundant low molecular weight seed storage albumin from pea, but it differs from most other low molecular weight seed albumins, such as napin-type. PA1a is monomeric whereas most napin-type albumins are structured into two subunits connected by disulfide bonds. However, to some extent, PA1a seems to share some moderate sequence similarity with a napin-type albumin sub-unit including a similar cysteine pattern that is highly conserved in most seed storage albumins (Higgins, Chandler et al. 1986). Contrastingly, whereas acyclic, PA1b shares high similarity in structure, but also in sequence, with CtM and most plant cyclotides and in particular conserves the cystine-knot motif (figure 1.4). PA1b is a highly potent insecticidal peptide used in several transgenic approaches to improve crop resistance to pests in alternatives to chemical pesticide (Gressent, Da Silva et al. 2011). The strong structure and sequence homology between the Ctc genes and PA1, described in the previous paragraph, suggest that they are related. In Clitoria ternatea, the lack of protein evidence for the PA1a-like albumin from Ctc precursor would suggest that this albumin might not be produced, whereas the Ctc cyclotides are found in most tissues. On the other hand, in pea the PA1a albumin is the most abundant seed-storage albumin and PA1b-type peptides are not cyclic peptides and are only found in seeds. Taken together this suggests that Ctc genes might have evolved from PA1-type genes in the Fabaceae lineage (Nguyen, Zhang et al. 2011). This also suggests that cyclisation and ubiquitous expression of the Ctc cyclotides was achieved by divergence. However, the fate of their adjacent albumin remains unknown, but 8 is likely to have diverged from the ancestral seed storage function in regard to the gene expression pattern. This hypothesis also implies a strong positive evolutionary advantage for favouring the knotted peptide production over the co-encoded albumin, which is quite likely due to the high potency of PA1b, Ctc peptides and many other cyclotides as insecticidal and cytotoxic molecules. Furthermore, the common redundancy of albumin genes in plants genomes allows some degree of liberty for divergence. In this scenario evolution achieving cyclisation of the favoured peptide would have constituted an undeniable advantage because cyclic peptides are traditionally much more stable than their linear counterparts. Additionally, achieving more ubiquitous production would have provided a serious advantage in regard to their activity. If this model of evolution is correct, then it is also possible that in other species such dual genes would have diverged even further and lost the albumin region to create the typical cyclotide precursors currently known. However the origin of this dual precursor itself remains unknown and it could likely also have appeared from the fusion of an ancestral cystine-knot peptide precursor with an ancestral albumin precursor. Much more investigations would be required before having a clear answer, but it seems that cyclic peptides and seed storage proteins could be evolutionary linked. Thus it appears that cyclotides might have evolved from linear counterparts also harbouring a cystine-knot motif. Interestingly, the cyclic squash trypsin inhibitors might also have evolved from linear counterparts. Indeed, their precursors are very likely to have evolved by internal duplication of the same motif, and are terminated by an acyclic variant. Because these acyclic variants are almost identical to other more traditional linear squash trypsin inhibitors (figure 1.3), it is likely that they constitute the ancestral version, and that atypical divergence and gene internal duplication in the squash Momordica cochinchinensis gave rise to the cyclic squash trypsin inhibitor precursors (Mylne, Chan et al. 2012). However, once again, more data are required to confirm this hypothesis. 9 Figure 1.5. Structure of SFTI-1. (a) The structure of SFTI-1 (cyan) was first elucidated in complex to bovine β-trypsin (grey), PDB code 1SFI, (Luckett, Garcia et al. 1999) and later the NMR solution structure (lighter), PDB code 1JBL, (Korsinczky, Schirra et al. 2001)), identified that the peptide did not differ greatly in conformation to the crystal structure in complex, apart from the orientation of the active lysine residue (in red) which is flexible; bonded cysteines are highlighted (yellow). (b) 8 Bowman-Birk inhibitors (PDB codes 1BBI, 1D6R, 1G9I, 1H34, 1MVZ, 1SMF, 1TAB, 2BBI, line format, magenta) overlaid to SFTI-1 (PDB code 1SFI, stick format, cyan) illustrating the structural homology of SFTI-1 with the active arm of the Bowman-Birk inhibitor family. (c) SFTI-1 (cyan) overlaid, in a zoom, to CtA (Cyclotheonamide A in blue) and the previous Bowman-Birk inhibitors. The α-ketrohomo-arginine of CtA overlays with reactive lysine of SFTI-1 (red) and the four other residues mimic one loop of SFTI-1 reasonably well, although CtA is not wholly peptidic and display less homology than SFTI-1 toward Bowman-Birk inhibitors, CtA PDB code (1TYN), (Ganesh, Tulinsky et al. 1996). 1.2 Discovery of the sunflower trypsin inhibitor SFTI-1 1.2.1 Identification of SFTI-1 In 1999 a screen for protease inhibitor activities in various seed extracts was conducted in Peter Shewry’s lab (Shewry 1999; Konarev, Anisimova et al. 2002; Konarev, Griffin et al. 2004). In extracts from sunflower seeds, the most potent trypsin inhibitor had a very low molecular weight, suggesting a peptidic nature, and this sample was labelled SFTI-1 for Sunflower Trypsin Inhibitor 1. The sequence and cyclic structure of SFTI-1 were only revealed after soaking trypsin crystals with SFTI-1 and analysing the structure of the complex between trypsin and SFTI-1 (Luckett, Garcia et al. 1999). Later structural analysis conducted on a synthesised SFTI-1 revealed that its structure in complex did not differ significantly from its solution structure, with the exception of the Lys5, (figure 1.5.a). This lysine residue was shown to be flexible in solution and to be responsible for the potent activity of SFTI-1 (Korsinczky, Schirra et al. 2001). SFTI-1 is a 14 residues cyclic peptide and adopts an anti-parallel β-sheet conformation stabilised by a single disulfide bond dividing the peptide into a nine residues loop region, the ‘reactive loop’, and a five residues turn, the ‘cyclic loop’ (figure 1.5.a). As its name suggests, SFTI-1 can inhibit trypsin, but also several other proteinases including cathepsin, elastase, chymotrypsin and thrombin (Luckett, Garcia et al. 1999). The catalytic mechanism of serine proteases such as trypsin, chymotrypsin and elastase has been characterised. The active site of these proteases contains conserved residues, a histidine, a serine and an aspartic acid. These three amino acid residues constitute a catalytic triad, each playing an essential role for peptide bond cleavage (Kraut 1977). The flexible lysine residue of SFTI-1 binds to trypsin by forming contacts with two residues of the trypsin binding pocket and one of the main chains of trypsin. This binding places the bond between the flexible lysine residue of SFTI-1 and the next residue close enough to the histidine and serine of the trypsin catalytic triad to allow hydrogen bonds formation. As a result SFTI-1 is locked within the 10 trypsin active site and inhibits trypsin without release, thus making SFTI-1 a potent (Ki 0.1 nM) inhibitor of trypsin (Luckett, Garcia et al. 1999). The structure of SFTI-1 mimics the functional arm of the Bowman-Birk protease inhibitors (figure 1.5.b). In particular the flexible lysine residue of SFTI-1 and its next residue perfectly overlay with the two residues of the functional arm of the Bowman-Birk inhibitors that is responsible for the binding with proteases (Luckett, Garcia et al. 1999). Bowman-Birk inhibitors are found in some legumes and cereals seeds and are serine proteinase inhibitors. Bowman-Birk inhibitors precursors are typically around 100 amino acid residues long and about 65 for their mature forms, which possess a complex structure typically stabilised by seven disulfide bonds. Bowman-Birk inhibitors traditionally have two active sites that can inhibit two proteinases simultaneously or independently. Most of them can inhibit trypsin with one reactive site and chymotrypsin with the second reactive site (Korsinczky, Schirra et al. 2001). The active loops of these inhibitors are characteristic for their ability to retain their conformation after binding to their target enzyme (Laskowski and Kato 1980; Bode and Huber 1992). Similarly, SFTI-1 retains its conformation when complexed with bovine β-trypsin (Luckett, Garcia et al. 1999). All these observations led to SFTI-1 being described as the smallest Bowman-Birk-type inhibitor, but SFTI-1 differs from this family of proteases inhibitors on several points. Although SFTI-1 structurally mimics the N-terminal trypsin inhibitory reactive site of Bowman-Birk-type inhibitors, SFTI-1 is much smaller (14 residues versus 65 on average). SFTI-1 also structurally differs from these inhibitors by possessing a cyclic backbone, and has an increased potency for inhibition. For example SFTI-1 is six times more potent than the soybean Bowman-Birk inhibitor (Luckett, Garcia et al. 1999), the most potent known inhibitor of this inhibitor family before the discovery of SFTI-1 (Voss, Ermler et al. 1996), and is twice more potent than cyclotheonamide, the most potent small serine protease inhibitor before SFTI-1 identification (Ganesh, Tulinsky et al. 1996). Cyclotheonamide is a cyclic pentapeptide, isolated from sponges, that also mimics the trypsin inhibitory arm of Bowman-Birk inhibitors, but with a weaker homology than SFTI-1 (figure 1.5.c). 11 Despite the extensive knowledge about its sequence, structure and biochemical properties, the biological role of SFTI-1 in sunflower seeds is still undetermined. It is known that SFTI-1 inhibits several proteinases, including many digestive enzymes and particularly trypsin and chymotrypsin (the two most common digestive enzymes of insects and mammals). Thus SFTI-1 might be produced as a component of the sunflower seeds constitutive defence against predation by gramnivores, such as seed-boring insects. In almost all kingdoms of life, cyclic peptides with defence functions have been characterised. For example the bacterial cyclic peptide subtilosin is antimicrobial (Babasaki, Takao et al. 1985; Blond, Péduzzi et al. 1999), and many other plants cyclotides have been shown to be insecticidal or antiviral (Jennings, West et al. 2001; Daly, Clark et al. 2006; Pelegrini, Quirino et al. 2007; Colgrave, Kotze et al. 2008). In addition seeds generally produce and accumulate many serine protease inhibitors as a part of their defence system against pathogens and pests (Tedford, Fletcher et al. 2001). Nevertheless, the biological relevance of SFTI-1 as a sunflower seed protease inhibitor remains to be demonstrated. 1.2.2 Multiple applications for SFTI-1 properties SFTI-1 is not only a very potent trypsin inhibitor, but it also has other interesting bioactivities. It can inhibit matriptase with a similar inhibition constant to that against bovine trypsin (Long, Lee et al. 2001; Yuan, Chen et al. 2011). Matriptase is a type II epithelial transmembrane serine protease that was initially purified from human milk, but it is also produced by normal and cancerous breast epithelial cells in culture. Recent studies have suggested that its mis-regulation can contribute to abnormal cell proliferation, mobility, and states of differentiation, and thereby can cause pathogenic states, such as cancer (Lin, Anders et al. 1999). These results revealed that SFTI-1 can also inhibit a protease of strong clinical importance with high specificity, launching several studies to develop therapeutics adapted for specific targets by using SFTI-1 as a backbone (Korsinczky, Schirra et al. 2001; Daly, Chen et al. 2006; Kawakami, Ohta et al. 2009; Swedberg, Nigon et al. 2009). For example, a modified SFTI-1 with three substitutions was generated. This modified SFTI-1 loses its ability to inhibit trypsin and can specifically and selectively block the activity of the human kallikrein-related peptidase 4 (KLK4), 12 a trypsin-like serine protease that can activate many tumorigenic and metastatic pathways causing prostate cancer. Currently, there are no KLK4-specific small-molecule inhibitors available for therapeutic development. This modified SFTI-1 was shown to be extremely robust and to have a high potency of inhibition for KLK4, making it an excellent candidate for further therapeutic development (Swedberg, Nigon et al. 2009). This illustrates the high potency of SFTI-1 as a scaffold for therapeutic molecules with high specificity. Other analogues of SFTI-1 were successfully designed to specifically inhibit the activity of chymotrypsin, leukocyte elastase (Pereira, Salgado et al. 2009), or cathepsin-G (Legowska, Debowski et al. 2009), mostly by replacing the Lys5 of the SFTI-1 active site with other residues to change its binding specificity. These consecutive successes also drew attention to SFTI-1 from biotechnology and pharmaceutical industries (Robinson, DeMarco et al. 2008; Pratley, Nauck et al. 2010). 13 Figure 1.6. PawS1 and SFTI-1 sequences and structures. (a) PawS1 sequence with ER signal (rose), SFTI-1 (aqua), PawS1 albumin small subunit (green) and large subunit (orange). Spacer regions or the GLDN tail are coloured black. (b) PawS1 protein precursor (after ER signal removal) hypothetical 3D peptide model in ribbon format. It was generated by homology modelling using EasyModeller4.0 (Kuntal, Aparoy et al. 2010), using all available napin-type albumin structures (1SM7A, 1S6DA, 1W2QA, 1PSYA, 2LVFA) as template for the albumin part of PawS1 and SFTI-1 three-dimensional structures as a template for the SFTI-1 part of PawS1. Disulfide bonds are indicated as sticks. Model evaluation data: ProSA-web (Wiederstein and Sippl 2007) Z-Score= -4.87, Procheck (Laskowski, MacArthur et al. 1993) Ramachandran parameters: Procheck Ramachandran Plot (Allowed= 90.8%, Additionally allowed= 8.5%, Generously Allowed= 0.7%, Disallowed region= 0.0%) and Procheck G-factors (Dihedrals= -0.11, Covalent= 0.26, Overall= 0.02). (c) SFTI-1 solution structure by nuclear magnetic resonance spectroscopy, (PDB code 1JBL, (Korsinczky, Schirra et al. 2001)), presented in stick format. The carbon backbone is represented in cyan with the disulfide bond in yellow, nitrogen in dark blue, and oxygen in red. 1.3 Discovery of SFTI-1 biosynthetic origins 1.3.1 Identification of PawS1 (Preproalbumin with SFTI-1) Despite the lack of a characterised function, all the remarkable bioactivities and potential therapeutic applications displayed by SFTI-1 have drawn attention to this low molecular mass protein. Thus an effort was made to identify the DNA sequence encoding SFTI-1. In 2011, Mylne et al. performed 3′ then 5′ RACE PCR using mature sunflower seed RNA and identified the SFTI-1 encoding sequence, which revealed something rather unexpected. The sequence encoding SFTI-1 is buried within PawS1 (Preproalbumin with SFTI-1), which is also a precursor for a sunflower napin-type seed storage albumin. The gene PawS1 encodes a protein precursor of 151 amino acid residues, composed of an ER-signal, SFTI-1 and two subunits that constitute a typical napin-type seed storage albumin precursor, separated by spacer propeptides (figure 1.6). Mylne et al. also found a second similar gene from sunflower, that they called PawS2. It also encodes a napin-type albumin and a small cyclic peptide and displays strong sequence similarities with PawS1 for the encoded albumin, but less for the small cyclic peptide. This second cyclic peptide was called SFT-L1, for ‘SFTI-Like 1’, but cannot inhibit trypsin and no bioactivity has been determined for it (Mylne, Colgrave et al. 2011). In this study, both SFTI-1 and the napin-type pro-albumin adjacent to it in PawS1 protein precursor (figure 1.6) were shown to require proteolytic cleavage for maturation (Mylne, Colgrave et al. 2011). 1.3.2 The napin-type seed storage albumins As described above, SFTI-1 is buried inside the N-terminal propeptide of a napin-type albumin precursor produced in sunflower seeds. So far this type of dual-precursor has only been found in sunflower and close relatives (Elliott, Delay et al. 2014). Angiosperms and other seed plants produce seeds to propagate and disperse themselves, but seeds are also the major plant organ consumed by humans and livestock and constitute their major source of protein (Shewry, Napier et al. 1995). Seed storage proteins can constitute over 50% of 14 total seed proteins (Shewry and Halford 2002) and therefore have been of major interest since the 18th century and were already extensively reviewed in 1909 (Osborne 1909). Plant seed storage proteins are accumulated as a source of nitrogen and sulphur for the germinating seedling (Shewry, Napier et al. 1995; Herman and Larkins 1999) and are classified into four categories based on which solvents they dissolve in most readily. In seed extracts albumins, globulins, prolamins, and glutelins can be dissolved respectively in water, saline solution, water/alcohol mixture, and acid or basic solution (Osborne 1909). Seed albumins are highly stable and constitute a major portion of seed storage proteins in dicots (Shewry and Pandya 1999). Seed albumins are found in most dicot seeds and are also referred as 2S albumins. Despite their obvious importance as a source of nutrients for the emerging seedling, various seed albumins have been reported to possess bactericidal (Maria-Neto, Honorato et al. 2011) and fungicidal (Terras, Schoofs et al. 1992; Ribeiro, Taveira et al. 2012; Freire, Vasconcelos et al. 2015) activities. Seed albumins have also often been of particular interest in selection programs due to their allergenic properties. For example the two principal albumin allergen from peanuts have been silenced (Zhou, Wang et al. 2013), and albumins with allergenic effect are found in many other seeds of commercial importance such as Brazil nut (Nordlee, Taylor et al. 1996), yellow mustard seeds (Menéndez-Arias, Moneo et al. 1988), castor bean seeds (Thorpe, Kemeny et al. 1988) and peanut (Lehmann, Schweimer et al. 2006). It appears that seed albumins represent a large and diverse group of seed proteins. Among this diversity of seed albumins, the seed storage albumins are the most widely represented in dicotyledonous plants seeds. The structure and properties of seed storage albumin are exemplified by napins of oilseed rape (Brassica napus), the first well studied seed storage albumins, with napin-type albumins being the most common type of seed storage albumins in dicotyledonous plants seeds (Shewry and Pandya 1999). Plants usually contain multiple napin-type albumins genes that are often intron-less. Napin-type albumins have a highly conserved pattern of cysteine residues and are often rich in glutamine. Each mature napin-type albumin is processed from a precursor with a conserved structure. A typical napin-type albumin precursor contains an ER-signal and a pro-domain which are both removed during the 15 maturation process. During proteolytic maturation of the napin-type albumin precursor, the napin-type albumin is often cleaved (by removal of an internal propeptide) into a heterodimer of small and large subunits with molecular weights of around 4 and 9 kDa respectively that remain connected by disulfide bonds (Ericson, Rödin et al. 1986). Mature napin-type albumins generally have eight conserved cysteine residues, two in the small subunit and six in the large subunit (Shewry, Napier et al. 1995; Shewry and Pandya 1999). However, the albumins from Fabaceae, such as the pea PA1a, are different and constitutea single unit with only two internal couples of bonded-cysteines and are encoded in a dual-precursor (Higgins, Chandler et al. 1986), as described above. 1.3.3 Seed albumins are matured by AEP AEP belongs to the superfamily of C13 peptidases and cleaves preferentially on the C-terminal side of asparagine or aspartic acid residues. The active site of AEP contains a conserved catalytic dyad, constituted of one histidine residue and one cysteine residue, which are both preceded by four conserved hydrophobic residues and separated by approximately 35 residues. In plants, the AEP active site also contains highly conserved residues, an arginine residue and a serine residue that play important roles in the binding of the targeted residue in the enzyme active site. Additionally, plant AEPs possess another highly conserved sequence of four residues at the C-terminus that is essential for their vacuolar targeting (Kuroyanagi, Nishimura et al. 2002). AEP function in plants has been deeply studied and characterised. The first report of AEP activity was in vacuolar extracts of maturing pumpkin seeds which could convert seed pro-globulin to mature globulin (Hara-Nishimura and Nishimura 1987). Soon after, AEP activity was demonstrated in extracts of castor bean (Ricinus communis) extracts (Hara-Nishimura, Inoue et al. 1991). AEP was then characterised as a cysteine protease that cleaves seed storage protein precursors, for their maturation, on the C-terminal side of asparagine residues preferentially or aspartic acid residues occasionally. (Hara-Nishimura, Inoue et al. 1991; Hara-Nishimura, Takeuchi et al. 1993; Sheldon, Keen et al. 1996; Hiraiwa, Kondo et al. 1997; Hiraiwa, Nishimura et al. 1999). AEP activity has also been studied in barley (Hordeum vulgare L.) cell walls (Linnestad, Doan et al. 1998), soybean (Glycine max) (Jung, Scott et al. 1998) and kidney 16 bean (Phaseolus vulgaris) (Rotari, Dando et al. 2001). However sequence analyses of several substrates of various AEPs showed a strong variability for the residues preceding the asparagine or aspartic acid residues targeted by AEP (Jung, Scott et al. 1998; Rotari, Dando et al. 2001). Other studies suggest that substrate conformation is particularly important for recognition by proteases and that most proteases usually recognize their target residues when they are placed at an extremity of a β-strand (Tyndall, Nall et al. 2005). Thus, the conditions that make a given asparagine or aspartic acid residue an AEP target remain to be understood. Further studies conducted on the model plant Arabidopsis thaliana (A. thaliana), proved that AEP is a key enzyme essential for the processing of all seed albumins and globulins (Shimada, Yamada et al. 2003; Gruis, Schulze et al. 2004), the major components of the seed storage proteins. A. thaliana has four AEP genes (Kinoshita, Nishimura et al. 1995; Gruis, Schulze et al. 2004; Nakaune, Yamada et al. 2005). These are AtAEP1 (At2g25940), AtAEP2 (At1g62710), AtAEP3 (At4g32940) and AtAEP4 (At3g20210). In A. thaliana, the AEP genes are expressed in many plant tissues, but each has specific expression and function. AtAEP2 is mainly expressed in seeds and is the most important for maturation of storage proteins in A. thaliana seeds. AtAEP1 and AtAEP3 are also expressed in seeds at lower level than AtAEP2 and can compensate for most of its function when AtAEP2 is knocked out, AtAEP3 being more effective at seed storage protein maturation than AtAEP1, although their activity could also be dependent upon AtAEP2 (Shimada, Yamada et al. 2003; Gruis, Schulze et al. 2004). AtAEP4 is expressed only in maturing flower and young seed coat and was shown to be involved in seed coat maturation by programmed cell death of specific cell layers. It is not present within seeds and is not involved in seed protein storage maturation (Nakaune, Yamada et al. 2005). It has been shown that A. thaliana aep1 aep2 aep3 triple mutants (expressing AEP4 only) and aep quadruple mutants (aep1 aep2 aep3 aep4), also called aep-null mutants, abnormally accumulate many precursors of storage proteins in their seeds. However, it has also been reported that under normal growth condition these triple and quadruple mutants have no morphological or 17 developmental differences from wild-type plants. A. thaliana aep-null lines were reported to germinate normally under standard conditions with no morphological or developmental difference from the wild-type (Kinoshita, Nishimura et al. 1995; Gruis, Selinger et al. 2002; Shimada, Yamada et al. 2003; Gruis, Schulze et al. 2004). This is surprising as in higher plants, seed storage proteins are deposited in protein storage vacuoles as a source of carbon, nitrogen, and sulphur during seed germination. Thus remobilisation of amino acids that compose these proteins is critical to germination and provide building blocks for rapid growth (Shewry, Napier et al. 1995; Herman and Larkins 1999). Because aep-null seeds don’t accumulate correctly matured seed storage proteins, one might have expected this nutritional source used by the developing seedlings to be less stable or harder to re-mobilise, and potentially result in germination problems. Thus I investigated the possibility of germination delay phenotypes for several aep mutants and I present my results in Chapter 2. Although AEP is the major seed storage protein processing enzyme (Shimada, Yamada et al. 2003; Gruis, Schulze et al. 2004), other functions have been reported for AEP, in different tissues than the seed. AtAEP3 is also expressed in stem and leaves and is up-regulated in the lytic vacuoles of vegetative tissues during senescence and under various stress conditions (Kinoshita, Yamada et al. 1999). Several studies suggest that AEP could have a role in programmed cell death (see review by Hatsugai, Yamada et al., 2015 for details). For example, AtAEP3 was reported to play an essential role in programmed cell death of pathogen infected leaves (Kuroyanagi, Yamada et al. 2005). However, despite the up-regulation of AtAEP3 in senescing leaves, to date no function has been reported for this gene that would imply a role in senescence. In Chapter 3, I present the results of my investigations for a function for A. thaliana AEPs in dark-induced senescence of individual leaves. 18 Figure 1.7. Acyclic and cyclic SFTI-1. (a) Acyclic SFTI-1 (green, PDB code 1JBN, (Korsinczky, Schirra et al. 2001)) and (b) cyclic SFTI-1 (cyan, PDB code 1JBL, (Korsinczky, Schirra et al. 2001)), shown in stick mode with the carbon backbone represented in green or cyan, disulfide bonds in yellow, nitrogen in dark blue and oxygen in red. Acyclic and cyclic SFTI-1 display close structural similarity even though the acyclic version has no peptide bond between SFTI-1 first glycine and last aspartic acid residues. In SFTI-1 acyclic version these two residues have their respective free termini maintained in close proximity. 1.4 SFTI-1 is matured by the seed albumin processing machinery 1.4.1 SFTI-1 requires AEP for proper processing in transgenic lines The role played by AEP in SFTI-1 processing at both peptide proto-termini was first suggested in Mylne et al. (2011). As SFTI-1 is buried in a precursor (PawS1) for a seed albumin, generally processed by AEP (Shimada, Yamada et al. 2003; Gruis, Schulze et al. 2004), and because, inside PawS1, SFTI-1 is preceded by an asparagine residue and ends with an aspartic acid residue, which are both AEP specific proteolysis sites (Hara-Nishimura, Inoue et al. 1991; Hiraiwa, Nishimura et al. 1999) AEP was likely to be the SFTI-1 processing enzyme. It was also suggested that AEP might perform SFTI-1 cyclisation. First, The napin-type albumin precursors form their disulfide bonds before AEP processing (Krebbers, Herdies et al. 1988; D'Hondt, Van Damme et al. 1993; HaraNishimura, Takeuchi et al. 1993; Müntz 1998), suggesting that SFTI-1 would also acquire a constrained acyclic structure prior to proteolytic maturation, thus already bringing its two proto-termini closer together. This can be observed in the solution structure of the acyclic version of SFTI-1 (PDB code 1JBN, (Korsinczky, Schirra et al. 2001)) that shows that the first and last residues of SFTI-1 remain in close proximity even when not linked to each other by a peptide bond (figure 1.7). Then, the proposed sequence of events for SFTI-1 maturation, based on the fact that AEP targets preferentially asparagine residues over aspartic acid residues (Hara-Nishimura, Inoue et al. 1991; Hiraiwa, Nishimura et al. 1999), would be that the first glycine residue of SFTI-1 is liberated before cleavage at its last asparagine residue. These features would create highly energetically favourable conditions for a cleavage-coupled cyclisation reaction. 19 Figure 1.8. SFTI-1 maturation in sunflower and in A. thaliana. MALDI-TOF-MS profiles of seed peptide extracts from sunflower as well as A. thaliana expressing a synthetic PawS1 in wild-type (WT) and aep-null backgrounds. A mass for acyclic SFTI-1 (1,531.82 Daltons) is the major form detectable in WT A. thaliana. Only mis-processed products are detectable in the aep-null background. In Mylne et al. 2011, an pOLEOSIN:PawS1 construct expressing native PawS1 under the control of a seed specific A. thaliana OLEOSIN promoter, was introduced in the genome of the model plant A. thaliana (Mylne, Colgrave et al. 2011). The availability of substantial mutant collections makes A. thaliana a great model for functional studies. A. thaliana has four AEP genes and several aep-null (quadruple null mutant) lines are available. In addition, no small peptides from A. thaliana have ever been described and A. thaliana seed peptide profile on MALDI-TOF-MS appears to lack any masses of significance in the mass range and experimental conditions used (figure 1.8). This provides a great opportunity to use A. thaliana as a genetic tool to study SFTI-1 maturation and the potential involvement of AEP in this process. Thus, the pOLEOSIN:PawS1 construct was introduced in a wild-type and in an aep-null background. Seeds peptide extracts of these transgenic lines were analysed by MALDI-TOF-MS. A mass (1,513.82 Daltons) identical to that of sunflower SFTI-1 as well as a more abundant mass corresponding to acyclic SFTI-1 (1,531.82 Daltons) can be detected in seeds of wild-type A. thaliana pOLEOSIN:PawS1 lines, but not in extracts of aep-null A. thaliana seeds (figure 1.8). Further analyses have demonstrated that the 1,513.82 Daltons and 1,531.82 Daltons masses produced in A. thaliana pOLEOSIN:PawS1 lines correspond to cyclic and acyclic SFTI-1 (Mylne, Colgrave et al. 2011). These data confirm that A. thaliana can process SFTI-1 correctly from PawS1, but is less efficient than sunflower at cyclisation (figure 1.8). Several lines of evidence were presented to support a role for AEP in SFTI-1 processing. First, no mass for cyclic or acyclic SFTI-1 was detectable in transgenic A. thaliana pOLEOSIN:PawS1 seeds from the aep-null genetic background, (figure1.8); but instead masses corresponding to SFTI-1 with additional or less amino acid residues at its N- and C- termini were detected, corresponding to the amino acid residues sequence in PawS1 and thus suggesting mis-processing of SFTI-1 (figure1.8). These observations are suggesting that the enzyme (or enzymes) responsible for accurate SFTI-1 cleavage was lacking in the aep-null genetic background. Then, additional studies revealed that mutations of the asparagine residue preceding the SFTI-1 N-terminus or the aspartic acid residue at the SFTI-1 C-terminus caused 20 mis-processing of SFTI-1 in a wild-type background (Mylne, Colgrave et al. 2011). This strongly supported the idea that AEP could be responsible for SFTI-1 processing. Finally, using targeted mass spectrometry, unprocessed PawS1 albumin and SFTI-1 were shown to abnormally accumulate in an aep-null genetic background. Several tryptic fragments from the mature PawS1 albumin large subunit were used as controls and the tryptic fragment SIPPICFPDGLDNPR, that contains SFTI-1 C-terminus and PawS1 albumin small subunit N-terminus, was selected to represent unprocessed PawS1. The tryptic fragment SIPPICFPDGLDNPR increased 160-fold in an aep-null background relative to wild-type (Mylne, Colgrave et al. 2011), attesting that SFTI-1 processing was strongly impaired in an aep-null background. In wild-type A. thaliana seeds, the processing of SFTI-1 is correct, but cyclisation is not efficient. The great majority of SFTI-1 is produced as an acyclic form in this context, with less than 5% of SFTI-1 produced as its native cyclic form estimated (Mylne, Colgrave et al. 2011). The sequencing of the 1,531.82 Daltons mass by MS/MS confirmed its acyclic nature and sequence. Three possibilities could explain these results. AEP is not responsible for SFTI-1 cyclisation at all, A. thaliana AEPs do not have efficient cyclisation capability, or an activity that cleaves at aspartic acid is too dominant and masks efficient macro cyclisation. The second scenario would suggest that sunflower AEP would have evolve unusual cyclisation abilities. Additionally parallel studies showed that other plant cyclic peptides were also processed and, in some cases, cyclised by AEP (Mylne, Chan et al. 2012). 21 Figure 1.9. Proposed model for PawS1 maturation. PawS1 structure was generated as described above (see figure 1.6). SFTI-1 (aqua), PawS1 albumin small subunit (green) and large subunit (orange), spacer regions (grey) GLDN tail (black). Napin-type albumins have a conserved proteolytic maturation process (Hara-Nishimura, Takeuchi et al. 1993; Hiraiwa, Kondo et al. 1997; Otegui, Herder et al. 2006) and the model for SFTI-1 maturation was proposed in 2015 (Bernath-Levin, Nelson et al. 2015). Seed storage protein precursors meet AEP in multi-vesicular bodies originating for the Golgi complex and targeted to the protein storage vacuole. (a) First AEP (pale yellow shape) cleaves the three asparagine residues preceding all mature domains of PawS1 (SFTI-1 and both PawS1 albumin subunits). (b) Then AEP performs the GLDN tail cleavage coupled to SFTI-1 cyclisation and the PawS1 albumin remaining cleavage is perform by another enzyme (brown shape) called aspartic protease (Hiraiwa, Kondo et al. 1997), to release (c) mature SFTI-1 and PawS1 albumin. 1.4.2 A recombinant AEP can perform SFTI-1 maturation In complement to the A. thaliana system approach described in the previous section, Bernath-Levin et al. (2015) developed an in vitro approach to investigate a potential role for AEP in SFTI-1 processing and cyclisation. In an in situ assay they confirmed that crude extract made from sunflower seeds can produce SFTI-1 (GRCTKSIPPICFPD) by cleavage and cyclisation of a synthetic SFTI-GLDN precursor (GRCTKSIPPICFPDGLDN). In these artificial conditions, the reaction appeared to be less efficient than in vivo as a high proportion of acyclic SFTI-1 was produced. However, the unexpected presence of acyclic SFTI-1 after incubation allowed an additional experiment in which they introduced heavy water to the in situ assay. By performing the in situ in 18 O-water, a 1.9 Dalton mass increase was consecutively observed for the acyclic version of SFTI-1 produced, but no mass shift at all for the cyclic SFTI-1 produced. The 1.9 Da mass shift was consistent with the incorporation of a heavy oxygen atom (from the heavy water) during hydrolysis. This confirmed the hypothesis that SFTI-1 cyclisation does not involve separate hydrolysis and ligation. It suggested that the macro cyclisation reaction is coupled energetically to the cleavage reaction at the last aspartic acid residue of SFTI-1 (Bernath-Levin, Nelson et al. 2015). Bernath-Levin et al. (2015) also confirmed that a recombinant jack bean AEP produced in E. coli could perform the coupled cleavage and cyclisation reaction and generate cyclic SFTI-1 in vitro. Of five plant AEPs tested, only jack bean AEP could perform the macro cyclisation reaction, suggesting that this activity might be restricted to some specific AEPs (Bernath-Levin, Nelson et al. 2015). In sunflower, which contains several AEPs, it is possible that one AEP might have diverged from the others and acquired unusual cyclisation capability. These data strongly support the idea that AEP is responsible for the maturation of both mature products of PawS1 precursor (see model in figure 1.9). Nevertheless, to date no AEP from sunflower with macro cyclisation capability has been discovered. 22 Figure 1.10. PawS1, an unusual napin precursor. Napins are parts of the 2S albumin class of seed storage proteins. Napin precursor structure (as shown above) is usually highly conserved. They have a signal peptide targeting them to the endoplasmic reticulum (ER, pink) and two subunits linked by disulfide bonds, the small subunit (SSU, green) and a the large subunit (LSU, orange). Napins are matured into a heterodimer by proteolytic removal of propeptide regions (black lines). PawS1 is a unique napin precursor because it evolved to also contain a precursor for a small cyclic peptide, SFTI-1 (blue) and its tail (dark). This tail is not in the mature SFTI-1, but is essential for its maturation. 1.5 SFTI-1 as a model to study protein biosynthesis and evolution 1.5.1 SFTI-1 is an unusual buried cyclic peptide In many regards, SFTI-1 is a unique type of plant cyclic peptide. First its structure greatly differs from the traditional cystine-knot motif found in many plant cyclic peptides and other acyclic trypsin inhibitory peptides from plants, and consists of a twisted cyclic loop, with a single disulfide bond (Mylne, Colgrave et al. 2011). SFTI-1, like most plant cyclic peptides described so far, is encoded by a gene. However, unlike SFTI-1, most plant cyclic peptides are encoded by genes encoding a single protein or variants of the motif in multi-copies (Mylne, Chan et al. 2012), suggesting that they have undergone extensive gene internal duplication of the cyclic peptide domain. The protein precursor for SFTI-1 varies from all other cyclic peptide precursors. It is encoded by the gene PawS1, which unlike other genes encoding cyclic peptides precursors, is a dual protein gene that encodes a traditional napin-type albumin precursor adjacent to a completely different type of protein (a small cyclic peptide). PawS1 and PawS2 are the first members to have been discovered (Mylne, Colgrave et al. 2011) of a family of genes restricted to sunflower and its close relatives and encoding a dual precursor for a napin-type albumin and a cyclic peptide (Elliott, Delay et al. 2014). These genes also differ from the Ctc gene families, which also encode dual precursors which could appear similar as they also encode a cyclic peptide and an albumin. However, this albumin appears to be a non-functional version (Nguyen, Zhang et al. 2011) of the PA1a albumin, which is different from the napin type albumins (Higgins, Chandler et al. 1986). In addition Ctc genes contain a small intron inside the sequence encoding the ER signal, which is unusual to napin-type albumin genes which are traditionally intron-less, but quite common for cyclotide genes (Nguyen, Zhang et al. 2011). Thus the originality of SFTI-1 precursor raises many questions, but also provides unique study opportunities. For example, SFTI-1 is produced from the dual precursor PawS1, which also encodes a napin-type albumin, the most common type of seed storage albumin. However PawS1 dual type of napin and additional peptide precursors is restricted to 23 sunflower and its close relatives, and most napin precursors in other plants produce only napin (figure 1.10). This originality provides an interesting opportunity to address current limitations to techniques available for the study of napin and seed storage protein in general. The study of seed storage protein maturation and trafficking, for example, particularly suffers from the limitations of current techniques applied for protein quantification from tissues samples. Many proteomics laboratories have developed protein quantification methods based on mass spectrometry detection of tryptic peptides, thus requiring digestion of the protein samples prior to detection. However, these methods are poorly suited to studies of protein maturation, because it is difficult to define whether tryptic peptides are derived from the mature or immature form, and these methods cannot allow identification of native (non-digested with trypsin or other enzymes) proteins (Bantscheff, Lemeer et al. 2012). Other methods can allow identification of native proteins, and thus could be used for detection of seed storage proteins, but these methods are not suited for quantification (Karas and Hillenkamp 1988; Fenn, Mann et al. 1989). Contrary to larger proteins, SFTI-1, as many small peptides, is easily and precisely detectable in its native form using mass spectrometry (Mylne, Colgrave et al. 2011; Elliott, Delay et al. 2014). Because of their small size, small peptides are also cheap and easy to synthesise, in order to be used as controls necessary for any quantification technique. This allows the development of simple quantitative mass spectrometry methods that precisely quantify such small peptides. In Chapter 4, I present a proof of concept application for such a quantification assay, in the study of seed storage protein trafficking. Indeed, PawS1 is an original seed storage albumin precursor that co-produces the small cyclic peptide SFTI-1 along with PawS1 albumin upon maturation. In transgenic A. thaliana producing PawS1, the cyclisation of SFTI-1 is inefficient and the major form of SFTI-1 is acyclic (Mylne, Colgrave et al. 2011). Because acyclic SFTI-1 and PawS1 albumin are produced from the same protein precursor, processed by the same enzyme and both accumulate in dry seeds, their respective quantity in dry seeds should be linked, as well as their fate. Thus in dry seeds of wild type and mutants affected in seed storage protein maturation, the variation of acyclic SFTI-1 quantity should be proportional to the variation of total mature albumin quantity. So I propose to use acyclic SFTI-1 quantification as a tool, to provide an alternative method to 24 compare this variation in several mutants of the seed storage protein trafficking pathway, and gather more understanding about the complex regulation of the fundamental process of protein trafficking, which is intimately associated with protein maturation. The manner by which SFTI-1 is produced from the dual precursor PawS1 also raises questions about its biosynthesis. Its position within the precursor PawS1, in the N-terminal propeptide that is traditionally removed during napin-type albumin maturation, and its requirement for maturation by AEP, the central seed storage protein maturation enzyme, suggest that SFTI-1 could be produced opportunistically as a by-product of its adjacent albumin maturation. In Chapter 5, I will present a study based on the use of synthetic genes in which the sequence for SFTI-1 is put in different contexts. I introduced these genes into A. thaliana and then using the same quantification method described in Chapter 4, I compared acyclic SFTI-1 quantity in the seeds of each transgenic line. By using this method, I provide new data supporting that AEP is responsible for SFTI-1 release from its protein precursor and demonstrating that SFTI-1 has the ability to recruit AEP for its maturation, and is not a by-product of its adjacent albumin maturation. Previous work also showed that perturbing the double cysteine bond or that single alanine substitutions within SFTI-1 sequence in PawS1 preproalbumin (e.g. R37A, T39A, P43A, P44A, I45A, F47A) abolish in vivo acyclic SFTI-1 expression (Mylne, Colgrave et al. 2011; Elliott, Delay et al. 2014). This implies that not just any sequence will emerge from such chimeric proteins and raises questions about how such an efficient sequence has evolved. Contrary to other plant cyclic peptides, SFTI-1 evolution has partially been elucidated, and, as I will elaborate later, it depicts a quite unusual path for protein evolution in general that has been taken in sunflower and close relatives toward the production of cyclic protease inhibitors. 25 Figure 1.11. Model for SFTI-1 evolution. PawS1 is the first member of a recently discovered family of napin-type albumin precursors with buried peptides that evolved within the Asteraceae family over 40 million years ago. SFTI-1 is the first member of a family of buried peptides and appears to have evolved de novo in a napin-type albumin gene (that became PawS1), via a genetic expansion that did not alter the napin-type albumin coding frame and allowed two proteins to be matured from it instead of one. (a) First the ancestral processing site between the N-term propeptide and the first subunit of the ancestral albumin expanded and (b) in the daisy family further expansion led to the creation of an extra processed peptide compared to the original precursor. (c) Later, after acquisition of the double bounded cysteines and the complete SFTI-1 type processing sites, allowing cyclisation, the peptide became stable. (d) Further events led to evolve a mimic of a Bowman-Birk inhibitor loop in this peptide, and (e) in sunflower close relatives only it evolved trypsin inhibitory activity in sunflower close relatives only and (f) diverged in SFTI-1 in sunflower. AEP processing site are indicated on SFTI-1 precursor. During maturation SFTI-1-GLDN is first cut out by the strong AEP preference for asparagine (red Arrows). SFTI-1-GLDN is then processed by an AEP that specifically recognises Asp14 on SFTI-1 and simultaneously cuts (blue arrow) and ligates it with SFTI-1 Gly1. 1.5.2 SFTI-1 stepwise evolution SFTI-1 is the first member of a recently discovered family of buried peptides that evolved within the daisy (Asteraceae) family over forty million years ago (Elliott, Delay et al. 2014). SFTI-1 appears to have evolved via a genetic expansion in an ancestral napin-type albumin gene, without altering the mature napin-type albumin coding frame, and thus allowed the two proteins to be matured from one precursor (Elliott, Delay et al. 2014). In figure 1.11, I present a model for SFTI-1 evolution, based on the original model and data from Elliott, Delay et al. (2014). In this model, I illustrate, what I believe represents the first example of, the natural and stepwise de novo evolution of a coding sequence for a new independent protein within a pre-existing gene without altering the reading frame for the ancestral protein. In Chapter 6, I compare this type of evolutionary mechanism to other established evolutionary mechanisms and illustrate how this peptide emerged by a mechanism that has never been reported previously. I call this mechanism neoproteinisation and highlight the differences between previously described evolution mechanisms and neoproteinisation. One characteristic of neoproteinisation, which I also highlight in Chapter 6, is to create unusual genes encoding multiples protein (more than just one protein). SFTI-1 differs from the so-called cryptides or crypteins (Autelitano, Rajic et al. 2006) because SFTI-1 is not a breakdown product of a functional protein, rather it emerges independently from a biologically inert ‘spacer’ or pro-region of a proalbumin that is removed during maturation into albumin. PawS1 also differs from most polyprotein genes as it encodes two proteins differing in size, structure, sequence and function. SFTI-1 is a 14 residues macrocycle and inhibits several proteinases (Luckett, Garcia et al. 1999), many of which are digestive enzymes, implicating a role protecting seeds from gramnivores. By contrast, PawS1 albumin is a linear heterodimer of a 24 residues small subunit connected by disulfide bonds to a 67 residues large subunit forming a bundle of five α-helices, folded in a right-handed super-helix (see figure 1.9). Mature PawS1 albumin contributes to a pool of sunflower seed albumins (Jayasena, Franke et al. 2016) that are catabolised during germination to provide a nitrogen and sulphur source for the developing seedling. 26 Based on this characteristic, I looked in the literature for other similar examples. I believe that, similarly to newly discovered evolution mechanisms, neoproteinisation has been overlooked, and that many other examples exist. In Chapter 7, I present one potential example that I found and investigated that led to the discovery of what I am sure will be a new family of small cysteine-bounded peptides buried in larger protein precursor, namely seed storage vicilins, that evolved by neoproteinisation. 27 1.6 Thesis aim The overarching goal of this thesis was to use the original cyclic peptide SFTI-1 as a model to study the maturation and evolution of proteins, and to explore eventual new functions for AEP. As the several studies that I conducted toward that goal relate to different subjects, I present their results in separate Chapters. The enzyme proposed to be responsible for SFTI-1 maturation is AEP. Despite a wide expression range across many plant tissues, no function has been confirmed for AEP apart from seed protein maturation. So another goal of the thesis was to investigate other potential functions for AEP in plants. In Chapter 2, I present the results of my investigation of AEP-deficiency implications on in seed germination. Then, in Chapter 3, I present the results of my investigation of the potential role for AEP in senescence. The unusual nature of the PawS1 precursor also provided an opportunity to study protein biosynthesis and evolution. In Chapter 4, I propose using SFTI-1 as a tool to estimate seed storage protein production. I present an example of application in protein trafficking studies, and provide insights into the complex regulation of protein trafficking and maturation. In Chapter 5, I present the study of SFTI-1 production from various chimeric protein precursors. I provide more evidence supporting AEP requirement for SFTI-1 maturation from its protein precursor. I also demonstrate that SFTI-1 and its trailing GLDN sequence have the ability to recruit AEP for precise cleavage from alternative host proteins, and that SFTI-1 is not a mere by-product of its adjacent albumin maturation. In Chapter 6, I compare and highlight the differences between previously described protein evolution mechanisms and neoproteinisation. I also illustrate how genes which have undergone neoproteinisation differ from other genes encoding multiple proteins. The SFTI-1 family is the first example of a neoproteinisation event, but in Chapter 7, I introduce a new family of small proteins, also buried in larger protein precursors that are potentially widespread among most flowering plant families. The evidence that I gathered suggest that this other peptide family also evolved by neoproteinisation, and bolsters the notion that neoproteinisation has been overlooked in the past and that many more cases of neoproteinisation are waiting to be discovered. 28 Chapter 2 The importance of seed storage protein maturation for germination 29 Figure 2.1. One-dimensional SDS-PAGE analysis of A. thaliana seed protein content in various aep backgrounds. Protein precursors (pro) of albumins and globulins are indicated, as well as alpha and beta subunits of globulins. 2 The importance of seed storage protein maturation for germination 2.1 Introduction A. thaliana aep-null lines lack mature seed storage proteins and accumulate mis-processed seed storage proteins precursors compared to wild-type (Kinoshita, Nishimura et al. 1995; Gruis, Selinger et al. 2002; Shimada, Yamada et al. 2003; Gruis, Schulze et al. 2004). Triple mutant lines aep2-3 aep3-1 aep4-1 and aep1-1 aep2-3 aep3-1 have a similar seed profile to the aep-null, whereas aep1-1 aep3-1 aep4-1 and aep1-1 aep2-3 aep4-1 have a similar seed profile to wild-type (figure 2.1). This mis-processing of seed storage proteins could have an impact on seed physiology during germination and the remobilisation of seed storage proteins. If the mis-processing of seed storage proteins leads to germination defects, then the aep-null line and the triple mutant lines aep2-3 aep3-1 aep4-1 and aep1-1 aep2-3 aep3-1 should exhibit a delay in germination compared to wild-type, whereas one would expect aep1-1 aep3-1 aep4-1 and aep1-1 aep2-3 aep4-1 to germinate similarly to wild-type. No abnormal germination was reported for aep-null lines, but in the previous studies a problem with germination could have been missed if performed in standard laboratory conditions. These experiments were not detailed and were described as ”data not shown”, but were likely to be done on nutrient-rich media. Most commonly, germination studies on A. thaliana are conducted on traditional Murashige-Skoog media, often complemented with sucrose or glucose. These nutrients help the plants to develop in these artificial conditions. During the early germination process, the seed naturally only uses its reserves to develop its first roots before it can use nutrients from the environment (Shewry, Napier et al. 1995; Herman and Larkins 1999). Thus the artificial presence of high concentrations of soluble nutrients, such as sugar or nitrogen commonly supplemented in germination media, could have complemented germination defects during the previous studies. 30 As no data were provided to support the report that aep-null seeds germinate as wild-type, in this study I wanted to investigate aep-null seeds germination abilities in more detail. To provide a detailed analysis of a potential germination defect due to an altered seed storage protein composition, I conducted germination tests on the four possible combinations of aep triple mutants and aep-null lines of A. thaliana on traditional Murashige-Skoog (MS) media. In order to test potential supplementation effects due to the germination media I also tested variations of the traditional Murashige-Skoog media, but not complemented with sugar and eventually depleted in one type of the major nutrients usually used for seed germination medium (nitrogen, salts and vitamins). 31 2.2 Material and methods 2.2.1 Plant material The wild-type plants used in this study were Arabidopsis thaliana ecotype Columbia (Col-0). The A. thaliana aep-null corresponding to the aep quadruple mutant aep1-1 aep2-3 aep3-1 aep4-1 αvpe-1 βvpe-3 γvpe-1 δvpe-1) expressing (also no AEP, the known as triple mutant aep1-1 aep2-3 aep3-1, as well as the aep4-1 single mutant were generously provided by Professor Ikuko Hara-Nishimura from Graduate School of Science, Kyoto University, Japan. This aep-null line has a mixed genetic background of Columbia (Col-0) and Wassilewskija (Ws-2). The A. thaliana triple mutants aep2-3 aep3-1 aep4-1, aep1-1 aep3-1 aep4-1 and aep1-1 aep2-3 aep4-1, each expressing only one AEP, were previously generated in my supervisor’s research group from crosses between aep1-1 aep2-3 aep3-1 aep4-1 and the aep4-1 single mutant. All these triple mutants also have a mixed genetic background of Columbia (Col-0) and Wassilewskija (Ws-2). A second aep quadruple mutant, or aep-null, aep1-3 aep2-5 aep3-1 aep4-1 (also known as αvpe-3 βvpe-5 γvpe-1 δvpe-1) with a Columbia (Col-0) genetic background only (CS67918) was obtained from the Arabidopsis Biological Resource Center. 2.2.2 Plant growth conditions For each assays, seeds originated from parents grown side-by-side under identical conditions. Parent seeds were sown on soil, stratified for 3 days at 4°C in the dark and grown at 22°C with 16 h of light per 24 h days, with fluorescent light of 150 μmol photons m −2 s −1 at 22°C and 65% relative humidity. Parent plants were allowed to self-fertilise and produce seeds. Seeds were harvested independently for each parent plant after complete natural senescence. 2.2.3 Germination assays For seed germination assays, precisely 100 seeds per replicate of each line were surface-sterilised with chlorine 32 gas. Seeds were placed in a microcentrifuge tube, on a rack, inside a desiccator jar where a beaker with 100 mL bleach was also placed. Just before closing the desiccator jar 3 mL concentrated HCl was added to the bleach to allow chlorine gas to form, and sterilisation was conducted for 3 h. Surface-sterilised seeds were then sown, individually for each batch of 100 seeds, on normal or modified sterile Murashige-Skoog 1% (w/v) agar media (Austratec Pty. Ltd.) plates, eventually depleted in salt or nitrogen, and then stratified and grown in the same conditions described above. For each assay the whole set of plates were randomised for position under light. Seed germination rate, established by radicle protrusion, was then quantified every day for up to 13 days. 2.2.4 Protein extraction Total protein was extracted from whole seeds of each tested genotype according to Gruis, Selinger et al. (2002). Total protein was extracted from precisely 100 seeds, crushed in liquid nitrogen and then resuspended in 200 μL of 2% (w/v) SDS, 160 mM DTT, and 50 mM Tris-HCl, pH 6.8. Immediately after addition of the buffer, samples were vortexed, incubated for 5 minutes in a dry bath incubator set at 100°C and centrifuged for 10 minutes at 20,800 x g. Supernatants were collected and vortexed before 20 μL of protein samples were adjusted with a SDS-PAGE sample buffer to final concentrations of 62.5 mM Tris-HCl pH 6.8, 2.5% (w/v) SDS, 0.002% (w/v) bromophenol blue, 5% (v/v) β-mercaptoethanol, and 10% (v/v) glycerol. 2.2.5 Protein 1-D gel electrophoresis Samples adjusted with SDS-PAGE sample buffer were loaded on a precast Bolt™ 4-12% Bis-Tris Plus Gel, fifteen wells (Thermo Fisher Scientific). 1-D gel electrophoresis was performed for 35 min at 165 V in SDS-PAGE running buffer: 50 mM MES (2-[N-morpholino]ethanesulfonic acid), 50 mM Tris base, 1 mM EDTA, 0.01% (w/v) SDS, using the compatible Mini Gel Tank (Thermo Fisher Scientific). Gels were stained with 0.5% (w/v) Coomassie Brilliant Blue G-250 (made in 50% (v/v) methanol, 10% (v/v) acetic acid), and later destained in 40% (v/v) methanol with 10% (v/v) acetic acid. 33 Figure 2.2. A. thaliana germination rate in various aep backgrounds, assay 1. Germination assays were conducted on 100 seeds per genotype and per plate. For each genotype, 5 times 100 surface sterilised seeds were placed on five identical germination media plates. Germination was assessed by radicle protrusion counted every day at the same time from the original time of transfer to the growth chamber. 2.3 Results 2.3.1 Preliminary study In a preliminary experiment, I conducted germination tests on Col-0, an aep-null line (aep1-1 aep2-3 aep3-1 aep4-1) and each of the four corresponding combinations of aep triple mutants of A. thaliana, on normal MS media. For each genotype I used seeds from one plant, and did five replicates, grown for 13 days. The number of germinated seeds was determined for each plate every day from the second day after being placed under the light, and averaged per genotype (figure 2.2). Each genotype presented a low variability for germination rate from one plate to another as indicated by the standard deviation. During this assay, the aep-null demonstrated a significant (P < 1.10-4) delay in germination compared to Col-0, the wild-type control (figure 2.2). However, parts of these preliminary results were intriguing. Among all the other triple mutants, only aep1-1 aep2-3 aep4-1 exhibited an intermediary germination rate pattern, between the aep-null and the wild-type and the three other triple mutants geminated as wild-type (figure 2.2). However, if the delay in germination observed for this aep-null line was due to misprocessing of seed storage proteins, aep2-3 aep3-1 aep4-1 similar and results should have been aep1-1 aep2-3 aep3-1, but obtained not for for aep1-1 aep2-3 aep4-1 that should germinate like wild-type. Thus despite the apparent delay in germination for the aep-null seeds compared to wild-type, my first results did not provide further understanding of the mechanisms underlying this difference. 34 Figure 2.3. A. thaliana germination rate in various aep backgrounds, assay 2. Germination assays were conducted on 100 seeds per genotype. For each plant, 100 surface sterilised seeds were placed on five different germination media, with five replicates per media. Germination was assessed by radicle protrusion counting every day at the same time than original transfer to growth chamber. 2.3.2 Germination rate is not influenced by media composition In a second experiment, I assessed the possibility that the outcome of the germination assay could be biased by the composition of the germination media. Using a similar experimental set-up as previously described, the six different genotypes were sown on duplicates of five different media plates. The traditional MS media used for growing plants on plates, in particular A. thaliana, is mostly constituted of salts and nitrogen. So to test if these components could help an impaired line to germinate as wild-type by complementing its deficiency, I conducted tests on normal MS media, on MS media with a half dose of salts or nitrogen, and on MS media completely depleted in salts or nitrogen. The 60 total plates were randomised for position under light and grown for 4 days. Based on my previous results, I established that most wild-type seeds would germinate in 4 days (figure 2.2). The number of germinated seeds was established for each plate every day after being placed under the light, and averaged per genotype (figure 2.3). For each genotype, the results obtained on each duplicate of media were similar. Surprisingly, for each genotype no difference was noticeable with respect to rate of germination on any type of media, depleted or not in nitrogen or salt (data not shown). Thus, all results obtained per genotype were averaged regardless of the germination media used. The very low (less than 10%) standard deviation at each time point and for each line attest that the different germination medium had no influence (P > 0.05) on germination rates of these lines (figure 2.3). Interestingly germination rates differed from one genotype to another, regardless of the germination media. Most of the genotypes exhibited a quite different germination pattern than during the preliminary experiment. More remarkably the wild-type and the aep2-3 aep3-1 aep4-1 and aep1-1 aep2-3 aep4-1 lines were the most delayed in germination, whereas the aep-null line was the fastest to germinate during this assay. 35 Whereas intriguing, these results nevertheless demonstrate that the nutrient composition of the media does not influence the germination rate of a given line, and that an aep-null line can germinate normally even on nitrogen or salt depleted media. At this stage I realised that my experiments could be biased. Though I had replicates, during both germination tests I used a single independent line per genotype, but seeds from a different parent plant were being used from one assay to the other. It was then possible that the differences observed for the same genotype between the two assays were due to internal phenotypic variation in this genotype. 36 Figure 2.4. A. thaliana germination rate in various aep backgrounds, assay 3. Germination assays were conducted on seeds from eight different plants per genotype, all harvested at the same time and stored at 25⁰C for at least 4 weeks. For each plant, 100 surface sterilised seeds were plated on germination media. Germination was assessed by radicle protrusion counting every day at the same time than original transfer to growth chamber. (a) Germination rates of seedlings of eight different plants, either Col-0 or aep-null. (b) Averaged germination rates of aep-null seeds and wild-type ones. 2.3.3 Germination rate varies in a fixed genotype To test the hypothesis that germination rate could be variable from one plant seedling to another with the same genotype, I conducted a third assay on Col-0 and aep-null. Based on my previous results suggesting no correlation between nutrient composition of the media and germination rate, I conducted the next germination assays on traditional MS media. For each genotype, eight individual lines were sown on media plates. Germination rates were established up to 4 days after light induction. Additionally, to prevent any phenotypic variability that could be linked to the mixed genetic background of the previously used mutants, this new assay was conducted on a different aep-null line (aep1-3 aep2-5 aep3-1 aep4-1) with a pure Col-0 genetic background. As expected each genotype exhibited a great internal variability for their germination rate (figure 2.4a). These data suggest that regardless of the seed protein content, other factors independent from genetic information have a strong effect upon germination rate (please see discussion on next page). 2.3.4 Impaired seed storage protein maturation does not influence germination rate Results obtained for all lines of each genotype were averaged and compared. Overall, the aep-null line had no statistically significant difference in germination rate compared to the wild-type line plant (figure 2.4b). This result indicates the impaired seed storage protein maturation that can be observed in aep-null seeds and in seeds of some aep triple mutants does not influence seed germination. My findings concur with these of the previous authors (Kinoshita, Nishimura et al. 1995; Gruis, Selinger et al. 2002; Shimada, Yamada et al. 2003; Gruis, Schulze et al. 2004). 37 2.4 Discussion In this study, I investigated the germination rates of several aep mutants and confirmed the previous undocumented reports attesting that aep mutants germinate normally compared to wild-type. My results also revealed a great variability for the germination rate in A. thaliana, unrelated to the genetic information, which was not previously reported in the literature to my knowledge. In previous studies Col-0 germination rate was much more uniform (Wang, Chen et al. 2016; Waterworth, Footitt et al. 2016; Yan, Wu et al. 2014). This suggests that in addition to the genetic control of germination, other factors might also be involved in germination control and generate germination rate fluctuation inside a fixed genotype lineage. For example, an increasing number of studies have shown that epigenetic regulations modifies gene transcription during seed development and germination and thus could contribute to the control of seed germination (Nakabayashi, Okamoto et al. 2005; Wolny, Braszewska-Zalewska et al. 2014). In addition, recent reports demonstrated that new epigenetic variants are generated at a higher rate than genetic variants in homogeneous A. thaliana population during several generations (Becker, Hagmann et al. 2011; Schmitz, Schultz et al. 2011) and several studies also showed that epigenetic variations can be influenced by the environment and inherited at least to the immediate progeny (Boyko, Blevins et al. 2010; Kathiria, Sidler et al. 2010). Thus these observations suggest that epigenetic variations between plants of a population with a fixed genotype can lead to variability in germination control of the progeny plants, resulting in variable germination rates for seeds originating from different sister-plants. This mechanism has ecological significance because it allows a fixed genetic population to have phenotypic variability for germination rate, preventing synchronised germination of a whole population which could represent a great disadvantage compared to time-spaced germination in various environmental conditions. 38 Chapter 3 AEP function in dark-induced leaf senescence 39 Figure 3.1. Relative expression of AEPs in A. thaliana. Data based on mRNA microarray experiments from Gene Expression Map of A. thaliana Development (Schmid, Davison et al. 2005), and the Nambara lab for the imbibed and dry seed stages (Nakabayashi, Okamoto et al. 2005). Data are normalised by the Affymetrix® GeneChip® Operating Software method, target value of 100. Most tissues were sampled in triplicate. Data downloaded from the ‘‘Electronic Fluorescent Pictograph’’ Browser by B. Vinegar (Winter, Vinegar et al. 2007), top cartoons drawn by J. Allis and N. Provart. 3 AEP function in dark-induced leaf senescence 3.1 Introduction 3.1.1 AtAEP3 expression is highest in senescing leaves As introduced in Chapter 1, AEP function in plants has been intensively studied and characterised. AEP is a cysteine protease that cleaves seed storage protein precursors on the C-terminal side of an asparagine or an acid aspartic residue for their maturation (Hara-Nishimura, Inoue et al. 1991; Hara-Nishimura, Takeuchi et al. 1993; Sheldon, Keen et al. 1996; Hiraiwa, Nishimura et al. 1999). Four homologues of AEP have then been discovered in A. thaliana. The A. thaliana AEP genes are expressed in many plant tissues, but each AtAEP has specific expression (figure 3.1) and probably a different function. In A. thaliana, the role of AEP in maturation of seed storage proteins has been confirmed and fully detailed with AtAEP2 (mainly expressed in seeds) being the most important for this function, but AtAEP3 can compensate (as well as AtAEP1 to a less extent) for this function if AtAEP2 is knocked out (Kinoshita, Nishimura et al. 1995; Gruis, Selinger et al. 2002; Shimada, Yamada et al. 2003; Gruis, Schulze et al. 2004). AtAEP3 is mainly expressed (as well as AtAEP1 more moderately) in stem and leaves (figure 3.1). AtAEP3 is also involved in programmed cell death of pathogen infected leaves (Kuroyanagi, Yamada et al. 2005) and is up-regulated in leaves during senescence and under various stress conditions (Kinoshita, Yamada et al. 1999). Despite the fact that AtAEP3 expression is highest in senescing leaves, a role for AEPs in senescence has not been reported. 40 3.1.2 Senescence is a natural recycling plant mechanism Senescence is the final stage of leaf development and of every plant organ, resulting in the death of single cells, tissues or whole plants. Plant senescence is best known from the public in the form of autumnal leaf senescence. During autumn in temperate regions of the world, deciduous trees develop colours of great aesthetic value, when leaves change colour from green to yellow, orange, red and eventually brown, before they die and fall from the tree. These colour changes are due to the preferential degradation of specific pigments, such as chlorophylls, over others like carotenoids or anthocyanins (Gan and Amasino 1997; Keskitalo, Bergquist et al. 2005). This transformation is just the visible part of a much more complex process. During senescence most nutrients contained in leaves are re-distributed to other parts of the plant before the leaves die (Leopold 1961) to prepare for the winter. This mechanism allows deciduous trees to retain valuable nutrients, which are often present in limiting amounts in the soil, instead of losing them when the leaves die and detach (Vitousek and Howarth 1991). Senescence also occurs in annual plants, and generally the whole plant turns from green to golden as the seeds ripens before harvest (in the case of crops) or the plant simply dies (Buchanan-Wollaston, Earl et al. 2003). Finally, senescence can also occur in response to pathogens or environmental stresses. These external factors can lead a plant to favour some organs over others where senescence is consequently induced (Quirino, Normanly et al. 1999; Weaver and Amasino 2001; Hickman, Hill et al. 2013). During senescence, the visible transformation is accompanied by active metabolic changes that result in the degradation of the concerned organ’s components accumulated during development, such as chlorophylls, proteins, carbohydrates, lipids, and nucleic acids. The nutrients are then redirected to other parts of the plant, where they can be recycled for the production of new vegetative, reproductive or storage organs (Watanabe, Balazadeh et al. 2013). Many proteases are involved in senescence, particularly for protein degradation (Roberts, Caputo et al. 2012), but no report suggests that this function has ever been investigated for the protease AEP. 41 3.1.3 Dark-induced senescence as a tool to study senescence In 2001, Weaver and Amasino developed an assay to induce senescence on individual leaves by covering a few individual leaves per plant with aluminium foil for several days. This assay mimics a natural event, the coverage of individual leaves by other leaves, organs or other objects. Upon removal, covered leaves from wild-type plants would exhibit a typical leaf depigmentation and desiccation phenotype of senescence. Biochemical analysis revealed that chlorophyll and total protein remobilisation had occurred and that several markers of senescence such as SAG12 were induced (Weaver and Amasino 2001). Indeed, senescence appears to be a logical biological response to recycle the contents of a covered leaf that is less efficient for photosynthesis than uncovered leaves and to use these remobilised contents as nutrients to build new leaves. This assay was used in several studies to identify effectors and inhibitors of senescence such as the A. thaliana mitochondrial enzyme NOS1 involved in nitric oxide synthesis. NOS1 inhibits senescence by reducing the production of reactive oxygen species and oxidation of proteins and lipids that naturally occur during senescence (Guo and Crawford 2005). It was also used to analyse the specific genes regulation and hormone signalling (van der Graaff, Schwacke et al. 2006) that occur in senescing leaves and to provide detailed analysis of the chlorophyll breakdown pathway (Pruzinska, Tanner et al. 2005), chloroplast autophagy (Wada, Ishida et al. 2009) and the early disruption of the microtubule network (Keech, Pesquet et al. 2010) accompanying leaf senescence in A. thaliana. In most of these studies, senescence was induced on individual leaves of 3 to 4 weeks old plants if grown under long day conditions or of 8 weeks old plants if grown in short days condition. Senescence induction of 5 days was enough to induce an easily noticeable phenotype and was conducted for up to 9 days, which was sufficient to result in total leaf desiccation and protein remobilisation, mimicking the natural death by senescence of an aging leave (Keech, Pesquet et al. 2010). To investigate in detail a possible role for AEP proteases in senescing leaves, I artificially induced leaf senescence in aep mutant plants using dark treatment on single leaves. I used several aep mutants to see if they would exhibit abnormal responses to this senescence induction. 42 3.2 Material and Methods 3.2.1 Plant material The wild-type plants used in this study were Arabidopsis thaliana ecotype Columbia (Col-0) and Wassilewskija (Ws-2). The A. thaliana mutant lines used here were previously described in 2.2.1. 3.2.2 Chlorophyll abundance measurement During the senescence induction assays, at various time points (from 1 to 9 days of darkness), chlorophyll content was estimated from covered and control (uncovered) leaves using a chlorophyll meter (or SPAD meter), a small portable spectrometer that allowed non-destructive measurement. 3.2.3 Plant Growth Conditions, selection, crosses and seed collection All plant genotypes were propagated as in 2.2.2. For induction of senescence assays, plants were sown on soil and grown at 22°C under short-day conditions (8 h light per 24 h day) with fluorescent light of 150 μmol photons m −2 s −1 at 22°C and 65% relative humidity. Leaves from 7 or 8 week-old plants were individually covered with aluminium foil for up to 9 days. I chose to dark-induce senescence on leaf 8 for all tested plants. In this way, all individually darkened leaves (IDLs) were at the same physiological state, which minimised risks of experimental artefacts. For the same reason, I used leaves 7 and 9 from the same plant as non-induced controls, because they are of similar physiological state as leaf 8 and constitute better controls than a non-covered leaf 8 from a sister plant, which could be in a different physiological state due to various external factors. 43 a b Figure 3.2. Chlorophyll content variation in 5 days IDLs (8 weeks old plants). For each line, 16 plants were grown for 8 weeks before dark-induction of senescence on IDLs. (a) For each line, an IDL is presented after 5 days of dark-induction, associated with a corresponding control uncovered leaf from the same plant. After 5 days of induction, the IDLs were uncovered and chlorophyll content was estimated by a SPAD index, as well as for two control uncovered leaves per plant. (b) Results are summarised as average values obtained for each line. 3.3 Results 3.3.1 Preliminary study In a preliminary experiment, I used the aep quadruple null line used in previous studies (Mylne, Colgrave et al. 2011), and the related aep triple mutants. This aep-null line had a mixed Col-0/Ws-2 genetic background, with two mutated AEP genes originally isolated in a Columbia ecotype (Col-0), and the two others from a Wassilewskija ecotype (Ws-2). As a result the various triple mutants also had a mixed genetic background. Thus I used both Col-0 and Ws-2 ecotypes as controls during this analysis. Because these two ecotypes have different flowering times, all lines were cultivated under short day conditions to delay flowering and allow a similar vegetative growth for all genetic background. After 8 weeks of vegetative growth, selected leaves were individually covered with aluminium foil to induce senescence. Each individually darkened leaf (IDL) was uncovered after 5 or 9 days (5d or 9d). After 5 days of induction only the wild-type Col-0 genetic background IDLs exhibited a clear, but light, yellowing phenotype characteristic of senescence compared to their respective uncovered control leaves (figure 3.2). All the mutants and the wild-type Ws-2 genetic background appeared to be much less sensitive to the dark treatment, their respective IDLs only exhibiting a slight lightening compared to their respective controls. To quantify this colour change mostly due to chlorophyll remobilisation, I estimated leaf chlorophyll content using a chlorophyll meter (or SPAD meter). The chlorophyll meter gives values (called SPAD index) exponentially linked to the tested leaf chlorophyll contents (Blackmer and Schepers 1995; Markwell, Osterman et al. 1995; Uddling, Gelang-Alfredsson et al. 2007). The resulting values (figure 3.2) reflect the phenotypes described above. A strong decrease in chlorophyll content can be observed in the wild-type Col-0 genetic background IDLs compared to their control, as expected for a wild-type plant. 44 a b Figure 3.3. Chlorophyll content variation in 9 days IDLs (8 weeks old plants). For each line, 16 plants were grown for 8 weeks before dark-induction of senescence on IDLs. (a) For each line, an IDL is presented after 9 days of dark-induction, associated with a corresponding control uncovered leaf from the same plant. After 9 days of induction the IDLs were uncovered and chlorophyll content was estimated by a SPAD index, as well as for two control uncovered leaves per plant. (b) Results are summarised as average values obtained for each line. This decrease is much less obvious for the other lines and particularly for the aep-null line and the Ws-2 wild-type control. I also observed that the uncovered Ws-2 wild-type leaves were on average less green than the other lines and this was reflected by the SPAD measurement. After 9 days of dark treatment, the induction of senescence was much more clearly identifiable. The Col-0 and Ws-2 ecotype exhibited a clear senescence phenotype for their IDLs as expected, but much more severe for the Col-0 phenotype than for Ws-2 (figure 3.3). By contrast the aep-null and aep1-1 aep2-3 aep3-1 lines exhibited a stay-green phenotype in response to the dark treatment. IDLs from the three other triple mutant backgrounds also exhibited a yellowing phenotype similar to wild-type. Overall these results would indicate that AEP is involved in senescence, as aep-null does not respond to induction of senescence. Furthermore the results obtained with the triple mutants suggest that the three AEP1, AEP2, AEP3 can fulfil this function, but not the seed-coat specific AEP4. These results correlate with the expression pattern reported for these respective AEPs. As for the previous induction time, after 9 days of induction the chlorophyll contents of IDLs and control leaves was estimated using a chlorophyll meter. The values obtained indicated a clear and strong chlorophyll loss for the wild-type Col-0 genetic background IDLs (figure 3.3). The aep-null and the aep1-1 aep2-3 aep3-1 lines IDLs also showed lower chlorophyll contents than their respective controls, but the decrease was much less important than for Col-0, suggesting a delay in senescence induction. Intermediary results were obtained for the three other triple mutants. Unfortunately during this assay, all the wild-type Ws-2 plants had a light-green phenotype compared to all the other lines tested (figure 3.3) and also more variability in their response to the darkening treatment, as reflected by the chlorophyll contents measurements (figure 3.3). Thus I could not obtain statistically significant results from this analysis whereas the aep-null and the aep1-1 aep2-3 aep3-1 lines seemed to be delayed for senescence induction compared to wild-type Col-0 ecotype line. 45 Figure 3.4. Leaf phenotype of 9 days IDLs (7 weeks old plants). For each line an IDL is presented after 9 days of dark-induction, associated with a corresponding control uncovered leaf from the same plant. 3.3.2 Natural variation for response to dark-induced senescence I repeated the same experiment previously described in 3.3.1. In the previous trial, I encountered an unexpected coloration difference between wild-type leaves of Ws-2 and Col-0 genetic background after 9 days of senescence induction. It appeared that the wild-type Ws-2 plants would start flowering at approximately 9 weeks under these specific growth conditions. As a result, the probable stress generated by flowering induction under short day conditions might have induced a more generalised, but partial, remobilisation response in wild-type Ws-2 plants, thus interfering with this assay. However, for the plants induced for 5 days, this difference was not noticeable, indicating that a change occurred in the Ws-2 plants during this time lapse, probably linked to flowering induction. To prevent this problem, in this assay the plants were grown for 7 weeks instead of 8 in the previous assay. After 9 days of senescence induction, IDLs were uncovered for analysis. As for the previous experiment, the wild-type Col-0 plants as well as the aep2-3 aep3-1 aep4-1, aep1-1 aep3-1 aep4-1 and aep1-1 aep2-3 aep4-1 plants all exhibited a typical yellowing phenotype indicating advanced senescence (figure 3.4). Surprisingly, this time aep-null and aep1-1 aep2-3 aep3-1 IDLs did not exhibit the stay green phenotype observed during the first trial. Instead they seemed to have also undergo senescence, but to a less severe extent, suggesting a delay in senescence. Unfortunately, the wild-type Ws-2 plants also had a similar reaction, suggesting a possible genetic background dependent response. This strongly suggests that Col-0 and Ws-2 ecotypes have very different senescence responses. 46 Figure 3.5. Chlorophyll content kinetic in IDLs (7 weeks old plants). For each line, 16 plants were grown for 7 weeks before IDLs senescence induction. After up to 9 days of induction the IDLs were uncovered and chlorophyll content was estimated by a SPAD index. Results are summarised as average values obtained for each line. (a) Col-0 and Ws-2 display different chlorophyll remobilisation profiles, with some aep mutants displaying intermediate profiles. (b) Most aep mutants exhibit a similar response than wild-type Col-0 to senescence induction. To further investigate this possibility a second set of plants were induced and IDLs were uncovered after 3, 6 or 9 days to evaluate their chlorophyll contents with a chlorophyll meter. For all lines the chlorophyll contents of IDLs decreased, establishing a specific remobilisation pattern for each line (figure 3.5). Interestingly, wild-type Col-0 and Ws-2 genetic background present a different remobilisation pattern during dark-induced senescence, suggesting a different regulation of senescence in these two ecotypes. Most triple mutants had a wild-type However, remobilisation the pattern pattern, correlating the similar of chlorophyll remobilisation visual changes. for aep-null and aep1-1 aep2-3 aep3-1, that exhibit a delay in yellowing compared to Col-0, but similar to Ws-2, seem to be combinations of Col-0 and Ws-2 chlorophyll remobilisation pattern. This strongly suggested that the genetic background variation introduced a bias in this study, and for this reason, it was impossible to get statistically clear results, even if aep-null and aep1-1 aep2-3 aep3-1 chlorophyll remobilisation pattern clear suggest a delay in senescence induction compared to Col-0 in this assay, which confirms the phenotypic observations. Nevertheless, whether the senescence induction delay observed for aep-null and aep1-1 aep2-3 aep3-1 is due to their genetic background or is actually due to the lack of enzyme is still to be determined. 3.3.3 AEP is not involved in dark-induced senescence As previously said, Col-0 and Ws-2 A. thaliana ecotypes have very different flowering time, Col-0 is a late-flowering ecotype and Ws-2 is rather an early-flowering ecotype, indicating different regulations of flowering initiation. Though flowering is known to induce senescence, interactions between flowering regulation and dark induced senescence regulation are poorly understood or studied. However, a recent study suggested the signalling pathways involved in the regulation of flowering and dark-induced senescence might be connected (Breeze, Harrison et al. 2011). 47 Figure 3.6. Leaf phenotype of wild-type IDLs. Eight Col-0 plants were grown for 7 weeks before individual leaf coverage for dark induced senescence assay. For each plant, IDLs after 1, 3, 5 or 7 days of senescence induction are presented associated with a corresponding non-covered control leaf. It appears that further analysis of the natural variability that exists between the Col-0 and Ws-2 ecotypes for senescence could bring more insight on the regulation of initiation and inhibition of senescence. However this was not the subject of this study aiming to identify the function of AEP in senescing leaves, and rather represented a serious issue, that I will explain now. The aep-null line I was using for this study had a mixed genetic background of the two A. thaliana ecotypes, Col-0 and Ws-2. Thus, in addition to the AEP mutations, these mutants also have a different allelic combination of genes. Because Col-0 and Ws-2 demonstrate different senescence responses it is likely that they possess different alleles of important and unidentified senescence regulators. The resulting mix of these regulators in the aep-null line, and in the aep triple mutant lines too, makes it difficult to compare these mutants to their wild-type parents for AEP specific responses to dark induced senescence. However aep-null and aep1-1 aep2-3 aep3-1 lines repeatedly exhibited a different response to the induction compared to both Col-0 and Ws-2 wild-type lines, suggesting that AEP might be required for dark-induced senescence. To elucidate this possible involvement of AEP in senescence and to counter the unexpected outcome of my previous assays, I ordered a new aep-null line from the Arabidopsis Biological Resource Center (Ohio, USA). This aep quadruple mutant is exclusively resulting from the original crossings of Col-0 ecotype single aep mutant alleles, and thus has a fixed Col-0 genetic background. I repeated the previous experiment with this new 100% Columbia aep-null line, and a wild-type Col-0 line. This assay revealed that this mutant has no abnormal phenotype related to dark-induced senescence (figures 3.6 and 3.7). It is clear here that beside some small internal variability, the aep-null line (figure 3.7) exhibits the same response to senescence induction than the wild-type line (figure 3.6). Chlorophyll content profiles (not shown) were almost identical and presented no significant difference (P > 0.05). These results imply that the differences that I observed in my previous assays between aep-null and aep1-1 aep2-3 aep3-1 lines were more likely to be due to the mixed genetic background of the mutants and not to the aep mutations. This also suggests convincingly that AEP has no involvement in dark induced leaf senescence. 48 Figure 3.7. Leaf phenotype of aep-null IDLs. Eight aep-null (in a pure Col-0 background) plants were grown for 7 weeks before individual leaf coverage for dark induced senescence assay. For each plant, IDLs after 1, 3, 5 or 7 days of senescence induction are presented associated with a corresponding non-covered control leaf. 3.4 Discussion In this study, I investigated the possible involvement of A. thaliana AEPs in senescence. Using genetic knock-out lines I showed that plants totally deficient in AEP would respond similarly to wild-type plants to dark-induced senescence. Although several proteases have been shown to be crucial for protein remobilisation during senescence (Roberts, Caputo et al. 2012), and A. thaliana AEP1 and AEP3 are preferentially expressed in leaves, particularly senescing leaves (Kinoshita, Yamada et al. 1999; Yamada, Shimada et al. 2005), my results suggest that A. thaliana AEPs are not essential to achieve dark induced leaf senescence. Thus, an AEP function in senescing leaves remains unknown, and the reason for high expression of AEP3 in senescing leaves remains unexplained. However, AEP could still be involved in some aspect of leaf senescence, and contribute to the degradation of some of the re-mobilised protein content, but much finer and exhaustive analysis than the one I conducted here would be required to investigate such hypothesis, such as investigating the degradation of leaf proteins during senescence relative to the presence of AEP in the leaves or not. However, AEP is surely not a major protease required for senescence, and gathering more knowledge of AEP function during senescence won’t help understanding how plants resist senescence, which is the main goal of most studies on leaf senescence. 49 Chapter 4 Using SFTI-1 to study protein trafficking defects A proof-of-concept study 50 Figure 4.1. PawS1 albumin and SFTI-1 biosynthesis. 1 PawS1 gene (blue helix) is transcribed into mRNA (red curve). 2 mRNA exit the nucleus via nuclear pores and is recognised by ribosome for translation into protein (black). 3 Translation continues on the ER, where the protein is integrated, the peptide signal is cut (purple) and disulfide bonds form. 4 Protein precursors are secreted from the ER to the Golgi complex. MAG2, located on the ER, and MAG4, located on the Golgi complex, are involved in this process. 5 In the Golgi complex, VSR1 recognises and binds the C-terminal region of seed storage protein precursors. 6 VSR1 is also involved in seed storage protein precursors sorting to multivesicular bodies via Golgi vesicles. 7 In the multivesicular bodies, seed storage protein precursors are sequentially processed as they traffic to the protein storage vacuole. 8 AEP cuts PawS1 at the three C-term extremities of the acid aspartic residues preceding SFTI-1, the small and the large PawS1 albumin subunits. 9 Napin-type pro-albumins are further processed by an aspartic peptidase. 10 SFTI-GLDN is simultaneously cleaved and cyclised by AEP into SFTI-1. 11 After delivery of mature seed storage proteins in the protein storage vacuole, VSR1 is recycled by the retromer complex (which contains MAG1, VPS35 and other proteins) back to the Golgi complex. 4 Using SFTI-1 to study protein trafficking defects 4.1 Introduction As described previously SFTI-1 is encoded by a multiprotein gene also encoding a napin-type albumin, called PawS1. The full details of SFTI-1 biosynthesis are still to be confirmed, but the processing pathway for napin-type albumins is well characterised and allows us to infer a robust description of PawS1 maturation, which is as follows (figure 4.1). After translation, the PawS1 protein precursor is transported to the endoplasmic reticulum. Upon entry in the endoplasmic reticulum, the PawS1 pre-pro-protein (151 amino acid residues) is cleaved and the ER-signal peptide is removed to release PawS1 pro-protein (130 amino acid residues). As most seed storage precursors, napin-type albumin precursors are then transported to the Golgi complex where they accumulate in specific buds that give rise to electron-dense vesicles released in the cellular matrix (Hillmer, Movafeghi et al. 2001). In the same way, protease precursors involved in the maturation of seed storage proteins are also released from the Golgi complex in separate specialised vesicles. These two types of vesicles fuse with each other to give rise to multivesicular bodies (Otegui, Herder et al. 2006). Then, as they traffic to protein storage vacuoles in multivesicular bodies, napin-type albumin precursors are processed by AEP and aspartic peptidase, to release the disulfide-bonds-linked-two-subunit napin-type albumins (Hara-Nishimura, Takeuchi et al. 1993; Otegui, Herder et al. 2006), and SFTI-1 in the case of PawS1 (Mylne, Colgrave et al. 2011; Bernath-Levin, Nelson et al. 2015). AEP cleaves PawS1 at three points to release the two albumin subunits and SFTI-1-GLDN. Then, the GLDN trailing sequence of SFTI-1 is cut by AEP and SFTI-1 is simultaneously cyclised by a head to tail ligation between the first and the last residues of SFTI-1 mature sequence (Mylne, Colgrave et al. 2011; Bernath-Levin, Nelson et al. 2015). The internal propeptide separating the two albumin subunits is then traditionally removed from the small subunit by the action of aspartic peptidase (Hiraiwa, Kondo et al. 1997) 51 Protein biosynthesis and maturation have been extensively studied and seed storage proteins have often been used as models to study this process due to their abundance and major contribution to the worldwide food production (Shewry, Napier et al. 1995). The trafficking of seed storage protein during maturation is a very complex process, and although the key events happening during this process are well understood (Otegui, Herder et al. 2006), few regulators have been characterised, mostly in A. thaliana. The A. thaliana mutant lines mag2 and mag4 are altered in their transport of seed storage protein precursors from the endoplasmic reticulum to the Golgi complex. Their seed cells develop a large number of abnormal structures containing storage protein precursors and these precursors accumulate in dry seeds (Li, Shimada et al. 2006; Takahashi, Tamura et al. 2010). MAG2 is a peripheral membrane protein of the ER involved in the transport of seed storage protein precursors between the ER and the Golgi complex in plants (Li, Shimada et al. 2006). MAG4 was shown to be localised on the Golgi complex in A. thaliana root and cultured protoplast cells and is involved in the transport of seed storage protein precursors from the ER to the Golgi complex in plants (Takahashi, Tamura et al. 2010). However, the mode of action of these two proteins remains unknown and their role in seed storage protein trafficking is still poorly understood. The A. thaliana various vps35 double and triple mutants, as well as the mutant lines mag1 and vsr1, are altered in sorting proteins from the Golgi complex to protein storage vacuoles in seeds. They abnormally accumulate the precursors of storage proteins in their seeds as well as mis-sort storage proteins and partially secrete them out of the seed cells (Shimada, Fuji et al. 2003; Shimada, Koumoto et al. 2006; Yamazaki, Shimada et al. 2008). VSR1 has been shown to bind the C-terminal peptides of seed storage protein precursors, which suggests that it acts as a specific receptor of seed storage proteins and targets them from the Golgi complex to the protein storage vacuoles (Shimada, Fuji et al. 2003). VPS35 (three copies in A. thaliana) and MAG1 (also called VPS29) are part of the retromer complex and were shown to interact together and to co-localise in pre-vacuolar compartments. Thus MAG1 and VPS35 are thought to interact in the recycling of VSR1 from the pre-vacuolar compartments back to the Golgi complex (Shimada, Koumoto et al. 2006; Yamazaki, Shimada et al. 52 2008). Nevertheless, as for MAG2 and MAG4, the function of MAG1 and VPS35 in seed storage protein trafficking remain largely unclear. Although all of the mutants in these studies displayed seed storage protein processing defects, there was no quantitative data regarding the levels of mature or misprocessed seed storage proteins in the mutants. Indeed, in contrast to the relative simplicity in observing the accumulation of the seed storage proteins precursors in these mutants (Shimada, Fuji et al. 2003; Li, Shimada et al. 2006; Shimada, Koumoto et al. 2006; Yamazaki, Shimada et al. 2008; Takahashi, Tamura et al. 2010), no quantification methods are currently available (Karas and Hillenkamp 1988; Fenn, Mann et al. 1989; Bantscheff, Lemeer et al. 2012) that can allow to compare, relatively simply, but precisely, the amount of mature seed storage protein versus the amount of unprocessed seed storage proteins present inside the seeds. Thus, alternative methods to assess the accumulation of mature seed storage proteins in seeds from these mutants would be of great value and could help to clarify the functions of effectors of the trafficking pathways of seed storage proteins. The biosynthetic output of the seed storage protein precursor PawS1 gives an unusual opportunity to address this problem. In this study, I developed an assay for acyclic SFTI-1 relative quantification, and monitored acyclic SFTI-1 production in seeds of transgenic A. thaliana producing PawS1 in a wild-type background and in various seed storage protein trafficking mutants. I used acyclic SFTI-1 relative quantification as a tool to evaluate the relative quantity of mature seed storage protein in these lines. Based on these results, I propose additional explanations for the previously described phenotypes of the several trafficking mutants used in this study. 53 Forward primer Sequence Reverse primer Sequence Associated Product genotype length when + F1 F3 F5 CGTTGGCTCGCTTGTTATTT GGATAAAAGTTTCGGCGTCA TTTTTGACTGTTGTAGGAAATCC R2 R4 R6 TCGCCAACATCGTTGTAATG AAACTCCAGAAACCCGCGGCTGAG ATATTGACCATCATACTCATTGC MAG1 mag1-1 mag1-2 884 620 460 F7 F9 AGCACGATCCACAGGAAGTT AATAATCGAACGATCCAAATCAGT R8 R6 ACCAGAGCTGGTTAGTTCAATTTC ATATTGACCATCATACTCATTGC MAG2 mag2-3 750 430 F10 F14 F12 F15 GCCTTATAGGTAATCCGGAACG GGTGTATGCCAGGCTTTGTT TTCCTGACAGAGCTTCTTGTTG TAGCATCTGAATTTCATAACCAATCTCGATACAC R11 R11 R13 R13 GAACTGCTGCGAAGAAGATTGT GAACTGCTGCGAAGAAGATTGT CCATTCCCTGATGAGAGAGC CCATTCCCTGATGAGAGAGC MAG4 mag4-1 MAG4 mag4-2 506 513 655 400 F17 F6 TGATTTGTACCACCTGCGAGT ATATTGACCATCATACTCATTGC R18 R18 AGCATGTTACTAGTGATGCCTAC AGCATGTTACTAGTGATGCCTAC VSR1 vsr1-2 585 450 F19 F23 AACCAATTGCATACGAATTTTTCACCC TGGTTCACGTAGTGGGCCATCG R20 R20 GCGTAGTTGCAAAGAATGATTCAGCAG GCGTAGTTGCAAAGAATGATTCAGCAG VPS35b vps35b-1 862 500 F21 F21 CGTTTACCCAACTACAGATTATTTTCA CGTTTACCCAACTACAGATTATTTTCA R22 R23 TTCAGCAGTTCGACATATAGAGACACC TGGTTCACGTAGTGGGCCATCG VPS35c vps35c-1 1370 950 Figure 4.2. A. thaliana MAG1, MAG2, MAG4, VSR1, VPS35b and VPS35c genes. Filled boxes, exons; solid lines, introns; thin lines, untranslated region. The position of each mutation and T-DNA insertion is indicated, but the T-DNA is not shown to scale. Orange, pAC161; dark-purple, pBI121; light-purple, pDAP101; green, pROK2. The position of each primer used for genotyping is indicated with a red arrow (see table for primer sequence). See full genes sequences in supplemental figures 9.1. 4.2 Material and Methods 4.2.1 Plant material and growth conditions The wild-type plants used in this study were Arabidopsis thaliana ecotype Columbia (Col-0). The A. thaliana homozygous (single insertion locus) pOLEOSIN:PawS1 transgenic Col-0 lines used in this study, producing a PawS1 protein precursor, were generated for a previous study (Mylne, Colgrave et al. 2011). The mutant lines mag1-1 (T-DNA insertion line, see (Shimada, Koumoto et al. 2006)), mag1-2 (GABI_G037C03), mag2-2 (complete gene deletion mutant, see (Li, Shimada et al. 2006)), mag2-3 (GABI_391C01), mag4-1 (partial gene deletion mutant, see Takahashi, Tamura et al. 2010), mag4-2 (SAIL_1281_G08), vsr1-2 (GABI_503A05) and the double mutant vps35b-1c-1 (SALK_039689 crossed with SALK_099735) were generously provided by Professor Ikuko Hara-Nishimura from the Graduate School of Science, Kyoto University, Japan. All plants were grown as in 2.2.2. 4.2.2 Plant genotyping For further plants crosses, plants were genotyped to select homozygous individuals. Genomic DNA of mag1-1, mag1-2, mag2-2, mag2-3, mag4-1, mag4-2, vsr1-2 single mutants and vps35 b-1c-1 double mutant plants was extracted as described in Edwards, Johnstone et al. ( 1991). Homozygous and double homozygous plants were selected using polymerase chain reaction (PCR) and specific primers (figure 4.2). Later in the project, the same method was used to select what I call hybrid lines, homozygous for one of these mutations and carrying a transgenic pOLEOSIN:PawS1 insertion (selected by another means, see next paragraph). 54 4.2.3 Plant crossing and selection Plants homozygous for the mutations mag1-1, mag1-2, mag2-2, mag2-3, mag4-1, mag4-2, vsr1-2 and double homozygous plants vps35b-1c-1 were crossed with a pOLEOSIN:PawS1 transgenic A. thaliana homozygous line. To prevent the eventuality of the pOLEOSIN:PawS1 transgene being linked to one of the mutant loci (thus resulting in unsuccessful crosses), two different pOLEOSIN:PawS1 transgenic A. thaliana lines were used. The homozygous mutant lines were used as female parents and the pOLEOSIN:PawS1 transgenic as male parents to simplify the first selection stages. The selection marker used in pOLEOSIN:PawS1 construct confers plant resistance to glufosinate ammonium, thus the successful crosses can easily be selected by Basta® herbicide selection. The F1 (first filial generation) plants resulting from the crosses were sown and grown on soil under the same condition previously described. At the cotyledon stage, F1 plants were selected for the presence of pOLEOSIN:PawS1 by Basta® spray at a final concentration of 0.4 g/L glufosinate ammonium (Bayer Cropscience Pty Ltd). Surviving F1 plants were allowed to grow and self-fertilise. F2 (second filial generation) plants were selected for the presence of pOLEOSIN:PawS1 with Basta® herbicide as the previous generation and then surviving plants homozygous for desired mutations were selected by genotyping as described above. From this stage, all the plants selected were originating from the same pOLEOSIN:PawS1 transgenic line with which all crosses were successful. With most trafficking mutations being deleterious when homozygotes, the targeted F2 genotype were grown to generate a larger F3 generation of hybrids for further analyses. F3 plants were also selected for the presence of pOLEOSIN:PawS1 with Basta® herbicide and the survivors allowed to self-fertilize. 55 4.2.4 Segregation analysis Further selection steps were conducted to obtain hybrids also homozygous for the pOLEOSIN:PawS1 transgene. Seeds of F3 plants were surface-sterilised and then sown on sterile Murashige-Skoog media, complemented with 40 μL of Basta® herbicide per litre of media, and grown in the same conditions previously described in 2.2.2. Segregation of Basta® resistance was determined a week later by the ratio [living plants / total plants]. Seeds of F3 plants 75% resistant to Basta® herbicide were considered as issued from an heterozygous plant for the pOLEOSIN:PawS1 transgene and seeds of F3 plants 100% resistant to Basta® herbicide were considered as issued from an homozygous plant for the pOLEOSIN:PawS1 transgene. Selected F4 plants were genotyped as described above to confirm the homozygous stage of the desired mutation, and then allowed to self-fertilise. F5 plants were grown as previously described to multiply the few desired hybrids obtained and F6 seeds were collected for further analyses. 4.2.5 Seed peptide extraction Peptides were extracted from 100 A. thaliana F6 seeds, for ten plants of each hybrid generated. Seeds were placed in a 2.0 mL plastic tube with a 3/16" stainless steel ball, frozen in liquid nitrogen and shaken for 1 min at an oscillation frequency of 30 Hertz in a Retsch mixer mill MM301. Tissue powder was resuspended in 0.25 mL methanol and 0.25 mL chloroform. For phase separation, 0.1 mL of 0.05% (v/v) trifluoroacetic acid was added and mixed for 2 min at room temperature before centrifugation for 5 min at 20000 x g at room temperature. The aqueous phase was retained and a 50-fold dilution from this extract was made with 50% (v/v) acetonitrile 0.1% (v/v) trifluoroacetic acid before analysis by MALDI-TOF-MS. 56 4.2.6 Acyclic SFTI-1 detection and quantification The matrix used for MALDI-TOF-MS was prepared by adding an excess of α-cyano-4-hydroxycinnamic acid matrix (Fluka) to 90% (v/v) acetonitrile 0.1% (v/v) trifluoroacetic acid, followed by sonication in a water bath for 15 min and then centrifugation for 5 min at 10,000 x g. The supernatant was diluted 10-fold in acetonitrile to make the final matrix solution used in MALDI-TOF-MS. For acyclic SFTI-1 detection in each seed peptide extract, the previously diluted aqueous phase was mixed in equal proportion with the final matrix solution, and 2 μL of this mix were spotted onto MTP AnchorChip™ 384 plate. Dried spots were analysed with an UltraFlex III MALDI-TOF/TOF mass spectrometer (Bruker Daltonics) at 30% laser intensity. For each spot, ten spectra, each corresponding to 500 shots, were accumulated to generate the final spectrum used for acyclic SFTI-1 detection. Spectra were analysed using FlexControl software (Bruker Daltonics). To quantitate acyclic SFTI-1, I adopted the widely accepted assumption that peptide abundance directly correlates with the integrated area under the curve (AUC) of precursor ion spectra (Fenyö and Beavis 2008; Benk and Roesli 2012). The standard I selected for acyclic SFTI-1 quantification is a synthetic acyclic SFTI-1 mimic with a diselenide bond. Diselenide bonds are excellent structural mimics of disulfide bonds and the dominant Se78 and Se80 isotopes give this peptide a distinctive isotopic pattern and higher molecular mass than natural acyclic SFTI-1, thus making this diselenide version of acyclic SFTI-1 a suitable structural analogue for use as an standard (Duncan, Roder et al. 2008). Therefore, acyclic SFTI-1 quantity in a sample can be directly deduced by comparing AUC for the mass corresponding to acyclic SFTI-1 (1531.8 Daltons) relative to AUC for the mass corresponding to the exogenously spiked acyclic SFTI-1 with a diselenide bond (1627.7 Daltons). To introduce the standard, the aqueous phase of each diluted sample was mixed in equal proportion with the standard also diluted in 50% (v/v) acetonitrile 0.1% (v/v) trifluoroacetic acid to a concentration of 2 pM. This concentration was previously determined to be enough for quantification, but not to suppress the acyclic SFTI-1 signal in the sample. This mix was then diluted to 50% (v/v) with the matrix solution and the same MALDI-TOF-MS analysis as for acyclic SFTI-1 detection was conducted. 57 Areas of the peaks at 1531.8 Daltons and 1627.7 Daltons corresponding respectively to the acyclic SFTI-1 ion and the control synthetic peptide ion were calculated with the provided Bruker software. For each seed peptide extract, the relative acyclic SFTI-1 quantity was then obtained by the ratio of AUC (1531.8 Daltons) to AUC (1627.7 Daltons). This MALDI-TOF-MS analysis is similar to the one previously described in Bernath-Levin et al. (2015). For each plant tested (10 per hybrid), four technical repeats were performed, corresponding to four independent analyses of the same seed peptide extract. Final results were normalised relative to the parental transgenic line expressing PawS1 in a wild-type genetic background. 4.2.7 Protein 1-D gel electrophoresis To check the seed protein contents of the different mutants, 12 μL of the same seed extracts previously used for MALDI-TOF-MS analyses were adjusted with a SDS-PAGE sample buffer and analysed by protein 1 D gel electrophoresis as described already in 2.2.5. All replicates previously used for MALDI-TOF-MS analyses were tested with apparent identical results (data not shown). For figure 4.3, all replicates for each line were combined in equal proportion. For each line, 12 μL of this combined sample were adjusted with a SDS-PAGE sample buffer and analysed by protein 1 D gel electrophoresis as previously described. 58 4.3 Results 4.3.1 Generation of hybrid lines In this study, I generated what I refer to as hybrid A. thaliana lines, homozygous for a pOLEOSIN:PawS1 transgene and for one of the mag1-1, mag1-2, mag2-2, mag2-3, mag4-1, mag4-2, vsr1-2 or vps35b-1c-1 mutations. These mutations are responsible for abnormal accumulation and mis-trafficking of seed storage protein precursors. All hybrids were obtained from the same pOLEOSIN:PawS1 single locus insertion line to ascertain identical PawS1 production in all hybrids. Successful hybrid lines were selected using PCR to select homozygous lines for the mag1-1, mag1-2, mag2-2, mag2-3, mag4-1, mag4-2, vsr1-2 or vps35b-1c-1 mutations, and using segregation analyses of the selective marker (Basta resistance) to select homozygous lines for the pOLEOSIN:PawS1 transgene. 4.3.2 SFTI-1 as a tool to monitor seed storage protein maturation All the mutants used in this study were reported to abnormally accumulate seed storage protein precursors. Whereas no quantitative data were produced, comparison of protein profiles suggests that mag2-2 and mag2-3 seeds might accumulate similar levels of mature albumin than wild-type (Li, Shimada et al. 2006), as well as mag4-1 (Takahashi, Tamura et al. 2010) and vps35b-1c-1 dry seeds (Yamazaki, Shimada et al. 2008). On the contrary, immunoblot analyses suggest that mag4-2 seeds produces more mature albumin than wild-type or mag4-1 seeds (Takahashi, Tamura et al. 2010). Similarly comparison of protein profiles suggests that, compared to wild-type, albumin accumulation is increased in mag1-1 seeds and decreased in mag1-2 seeds (Shimada, Koumoto et al. 2006). Only the study of vsr1-2 did not allow any inference toward levels of seed albumin production (Shimada, Fuji et al. 2003). Although these mutants all accumulated unprocessed seed storage proteins, many appear to have wild-type levels or even higher levels of properly matured seed storage proteins, with no obvious correlation with the proposed function of mutated genes. Surprisingly this was not commented on in these manuscripts. 59 Figure 4.3. Protein profile and acyclic SFTI-1 (acSFTI-1) quantity in A. thaliana seeds of various backgrounds. Soluble proteins were extracted from 100 seeds of wild-type and various trafficking mutants. Relative production of acyclic SFTI-1 in 100 seeds based on MALDI-TOF-MS analysis of seed peptide extracts. Stars indicate significant difference with wild type. To understand the processes deregulated in the various mutants I developed a simple method for precise relative quantification of acyclic SFTI-1. I introduced the PawS1 expression construct in various mutant backgrounds and used this semi quantitative method to compare acyclic SFTI-1 production level in the seed of the resulting lines. Because acyclic SFTI-1 production is bound to the PawS1 albumin production, relative variation of acyclic SFTI-1 quantity in the seeds should reflect relative variation of the level of properly matured storage protein accumulated in the seeds (figure 4.3). The level of acyclic SFTI-1 produced in seeds of hybrids derived from mag2-2, mag2-3, mag4-1 and vps35b-1c-1 did not differ significantly (P > 0.05) from the level obtained for seeds of the pOLEOSIN:PawS1 transgenic Col-0 line. On the contrary, compared to the pOLEOSIN:PawS1 transgenic Col-0 line, acyclic SFTI-1 production was significantly (P < 0.0001) reduced to 40% in the mag1-2 derived hybrid and to less than 20% in the vsr1-2 derived hybrid, and increased to 170% or 220% respectively in the hybrids derived from the mag4-2 and mag1-1. My results correlate with the relative quantity variations for mature albumins in these mutant seeds noticeable on a protein gel (figure 4.3). The 1-D protein gel is not a quantitative method, but for each sample, the same quantity of seeds and volumes were used so that the variations observed on gel reflect the variations between seeds contents of the different mutants. Similar procedures were used for the MALDI-TOF-MS analyses. The relative quantity variations observed on figure 4.3 for acyclic SFTI-1 seed contents seems to correlate with the relative quantity variations of mature albumins and globulins in these same seeds that can be deduced from figure 4.3. Overall acyclic SFTI-1 quantity seems to be a good indicator of mature albumins and globulins quantity in the various mutant seeds, and thus validate the concept of SFTI-1 as a tool to evaluate mature seed storage quantity in seeds. I propose to use this acyclic SFTI-1 simple quantification method to infer the levels of mature albumins and globulins accumulating in lines transformed with PawS1 and to provide more understanding about seed storage protein trafficking pathways. 60 4.3.3 Seed storage protein trafficking pathways The results presented above suggest that in mag2 seeds, despite abnormal accumulation of seed storage protein precursors, similar amounts of mature seed storage proteins are produced. The mag2-2 mutation is a complete gene deletion, thus mag2-2 seeds don’t produce any MAG2 protein, and the mag2-3 mutation is a complete loss of function. MAG2 was shown to be involved in seed storage protein precursor trafficking from the ER to the Golgi complex (Li, Shimada et al. 2006). To explain the presence of mature seed storage proteins in mag2 seeds, the authors suggested that a MAG2 homologue might compensate for its function, or that an independent pathway might exist. However, these two possibilities would not explain similar amounts of mature seed storage proteins in mag2 seeds than in wild-type seeds, as their precursors are partly mis-sorted to abnormal intracellular structures where they remain unprocessed. In agreement with the previous conclusions, my results suggest that in mag2-2 another effector can effectively compensate for MAG2 function, but also that similar amounts of protein precursors are correctly trafficked and matured in the seeds. In this case, the accumulation of precursors in the same mutant seeds implies that more precursors were produced to compensate for their partial mis-trafficking and that the MAG2 homologue or the other pathway compensating for MAG2 loss of function is less effective. Overall, these results reveal that in mag2 seeds an unknown mechanism can sense the accumulation of mis-trafficked seed storage protein precursors in abnormal structures and triggers an overproduction of seed storage protein precursors to compensate for their inefficient trafficking. MAG4 was also shown to be involved in trafficking seed storage protein precursors from the ER to the Golgi complex (Takahashi, Tamura et al. 2010), with mag4 seed cells also accumulating abnormal structures containing seed storage protein precursors. The mutant mag4-2 was shown to have complete loss of function for MAG4 whereas this loss was only partial for mag4-1, with mag4-2 plants exhibiting a much stronger phenotype than mag4-1 plants. As previously described, mag4-2 is a null mutant that contains a T-DNA insertion in MAG4 gene and the mag4-1 deletion in MAG4 gene results in the production of a truncated MAG4 protein. As a result, the MAG4 transcription is undetectable in mag4-2 while it is similar to wild type in the mag4-1 mutant (Takahashi, 61 Tamura et al. 2010). The presence of mature seed storage proteins in mag4 seeds confirmed the potential existence of other effectors than MAG2 and MAG4 or of a Golgi-independent pathway. In addition, my results suggest that the accumulation of seed storage protein precursors in the mag4 seeds is correlated with production of mature seed storage proteins in quantities similar or greater than wild-type respectively for mag4-1 and mag4-2. Overall, these results support my previous deduction and indicate that the accumulation of seed storage protein precursors in abnormal structures is detected and proportionally compensated in mag4 seeds. VSR1 is a receptor originally localised on the internal Golgi surface and that selectively binds C-termini of seed storage protein precursors. The precursors are thereby gathered in vesicles produced by the Golgi apparatus and targeted to the protein storage vacuole. During their transport to the protein storage vacuole, these vesicles merge with other vesicles containing processing proteases to form multivesicular bodies and allow seed storage protein maturation to start (Shimada, Fuji et al. 2003; Otegui, Herder et al. 2006). Accordingly, vsr1 seeds abnormally accumulate seed storage protein precursors, and immunogold analyses, with antibodies that cannot differentiate mature proteins or protein precursors, showed that some globulins and albumins are abnormally secreted from seed cells in vsr1 lines (Shimada, Fuji et al. 2003). My results also suggest that the abnormal accumulation of seed storage protein precursors is accompanied with a reduction in mature seed storage proteins production. Combined with the previous studies, these conclusions imply that the lack of VSR1 results in the secretion of some seed storage protein precursors outside the cell, thus never coming into contact with their processing proteases. This ultimately results in less mature seed storage proteins being produced. In contrast to the previous mag2 and mag4 mutants, this reduction is not compensated. This confirms the theory that a compensation mechanism can detect the accumulation of seed storage protein precursors in the cell, which is not activated in the vsr1 seeds where these precursors are secreted out of the cell. 62 MAG1 (or VPS29) and VPS35 are components of the retromer complex, which is responsible for retrograde transport of proteins (Coudreuse, Roel et al. 2006; Oliviusson, Heinzerling et al. 2006; George, Leahy et al. 2007; Nielsen, Gustafsen et al. 2007). The mag1 and vps35 seeds accumulate seed storage protein precursors, both in abnormal intracellular or extracellular structures (Shimada, Koumoto et al. 2006; Yamazaki, Shimada et al. 2008). The authors concluded that the retromer complex was involved in VSR1 recycling (Shimada, Koumoto et al. 2006; Yamazaki, Shimada et al. 2008), but this would not explain the accumulation of seed storage protein precursors inside the cell, or why mag1 and vps35 lines exhibit a strong dwarf phenotype and pleiotropic phenotypes. Rather these results imply that the retromer complex is involved in retrograde transport of proteins at large and that mis-sorting of seed storage proteins in retromer components mutants is just a side-effect of the deregulation of protein transport in general. This correlates with my results indicating that in the partial loss of function mutants vps35b-1c-1 and mag1-1 the accumulation of seed storage protein precursors is compensated to produce similar or even larger amounts of mature seed storage proteins as wild-type, similarly to mag2 and mag4 lines. This suggest that other components of the seed storage protein trafficking pathway (potentially MAG2 and MAG4) might also be partially impaired in the vps35b-1c-1 and mag1-1 lines, resulting in the previously reported accumulation of seed storage protein precursors inside the cell, and that similarly to mag2 and mag4 lines this accumulation is detected and compensated. This mechanism might be deregulated as well in the mag1-1 lines, resulting in the over production of mature seed storage protein suggested by my results. In contrast, when the loss of function is complete (such as for the mutant mag1-2), my results indicate that the mis-sorting of seed storage protein precursors cannot be compensated by another mechanism, supporting the idea that not only is VSR1 recycling greatly impaired in mag1-2, but protein transport in general, thus greatly impairing protein production and maturation at large. 63 4.4 Discussion In this study, I developed a tool to monitor mature seed albumin production. The specificities of PawS1 albumin precursor, suggested it could be used as a tool for estimating albumin abundance in seeds. It contains the sequence for the small cyclic peptide SFTI-1 and for an albumin, both of which are processed by AEP for their maturation. This suggests that SFTI-1 and its adjacent albumin fate should be linked and their respective quantity in seeds should be proportional. I developed a method to compare acyclic SFTI-1 quantity in seeds of A. thaliana lines. Then, I created several hybrids, each expressing transgenic PawS1 protein precursor in a different seed storage protein trafficking mutant contexts and compared their acyclic SFTI-1 quantity in mature seeds, allowing us to infer on mature seed storage protein production in each mutant background. This method allowed us to gain a greater understanding of the modifications that occur in various protein trafficking mutants. I elucidated in more detail these mutants phenotypes and, more importantly, I identified the existence of a mechanism (that remains to be characterised) that can detect when seed storage protein precursors accumulate inside the cells of these mutants, but can’t when they are secreted outside of the cells. This mechanism can compensate for the detected accumulation of the precursors and consequently results in similar or greater production of mature seed storage proteins compared to wild-type. This suggests that this compensation mechanism acts at the translational or at the transcriptional level to increase the production of protein precursors and thus compensate the defects resulting in maturation of only some of these precursors. More detailed analyses of the transcript levels for seed storage proteins and of the production of their protein precursor during seed development in the various mutants are required to provide a better understanding of this proposed sensing mechanism. Additionally, the generation of mutated populations from these mutants might enable the characterisation of effectors of this compensation mechanism in targeted screens. 64 Chapter 5 How readily will SFTI-1 hijack other sequences? SFTI-1 is not a passive by-product of its adjacent albumin maturation 65 Figure 5.1. PawS1 and SFTI-1 processing. (a) PawS1 sequence with ER signal (rose), SFTI-1 (aqua), PawS1 albumin small subunit (green) and large subunit (orange). Spacer regions or the GLDN tail are coloured black. Asparagine residues are numbered. (b) PawS1 (after ER signal removal) hypothetical 3D peptide model in ribbon format. It was generated by homology modelling using EasyModeller4.0 (Kuntal, Aparoy et al. 2010) and using all available napin-type albumin structures (1SM7A, 1S6DA, 1W2QA, 1PSYA, 2LVFA) as template for the albumin part of PawS1 and SFTI-1 three-dimensional structures as a template for the SFTI-1 part of PawS1. Model evaluation data: ProSA-web (Wiederstein and Sippl 2007) Z-Score= -4.87, Procheck (Laskowski, MacArthur et al. 1993) Ramachandran parameters: Procheck Ramachandran Plot (Allowed= 90.8%, Additionally allowed= 8.5%, Generously Allowed= 0.7%, Disallowed region= 0.0%) and Procheck G-factors (Dihedrals= -0.11, Covalent= 0.26, Overall= 0.02). Disulfide bonds are indicated as sticks. Asparagine residues are numbered and their side chains shown in line format. (c) MALDI profiles of seed peptide extracts from sunflower as well as A. thaliana expressing a synthetic PawS1 in wild-type (WT) and aep-null backgrounds. A mass for acyclic SFTI-1 (1,531.82 Daltons) is the major form detectable in WT A. thaliana. Only mis-processed PawS1 derivatives are detectable in the aep-null background. 5 How readily will SFTI-1 hijack other sequences? 5.1 Introduction As introduced in Chapter 1, SFTI-1 is a small macrocyclic peptide from sunflower seeds that has potent trypsin inhibitory activity (Luckett, Garcia et al. 1999). The gene encoding SFTI-1 (figure 5.1) was named Preproalbumin with SFTI-1 (PawS1) as it also encodes a napin-type seed storage albumin (Mylne, Colgrave et al. 2011). Using transgenic A. thaliana wild-type and AEP knockout lines, it was shown that both SFTI-1 and its adjacent albumin require the protease AEP to be processed from PawS1 dual precursor (Mylne, Colgrave et al. 2011) and SFTI-1 biosynthesis has been reconstituted in situ with sunflower seed extracts as well as in vitro with a recombinant AEP (Bernath-Levin, Nelson et al. 2015). Processing by AEP is one obvious requirement for SFTI-1 maturation, but little is known about the ability of this peptide sequence to emerge from its protein precursor successfully. In this study, I investigated whether SFTI-1 really overcomes its adjacent albumin for maturation or whether SFTI-1 is a passive proteinaceous passenger produced as a result of PawS1 albumin maturation. To examine how readily SFTI-1 would emerge from other proteins in vivo, I chose to work in the genetic model A. thaliana. It was previously shown (Mylne, Colgrave et al. 2011) that a synthetic PawS1 gene coupled to an A. thaliana seed-specific promoter can produce SFTI-1 in vivo, in A. thaliana seeds, by accurate AEP-dependant cleavages at the asparagine preceding SFTI-1 and at the terminal acid aspartic residue of SFTI-1 (figure 5.1). The macrocyclisation reaction is inefficient in A. thaliana so the majority of product is acyclic SFTI-1. Our goal was to monitor the ability for the SFTI-1 sequence to emerge from host sequences rather than to efficiently macrocyclise, so in this study I focussed solely on the mass for acyclic SFTI-1. I created synthetic genes with the DNA sequence for SFTI-1 (and a short trailing sequence) at different locations within PawS1 or in genes for unrelated seed proteins. Then, I used a MALDI-TOF-MS based method to compare the quantity of SFTI-1 cleaved from these synthetic precursors, in the seeds of the various transgenic lines created, and to get more insight on SFTI-1 maturation. 66 5.2 Material and methods 5.2.1 Synthetic genes for in planta expression Synthetic genes (GENEART®) were designed with a ClaI restriction site preceding the start ATG and a SacI restriction site after the stop codon (supplemental figures 9.3 and 9.4) to allow sub-cloning into a binary vector. Most synthetic genes (except 3SFTI, 2SFTI, 1SFTI and PawS1[∆2-21]) were ordered this way. A synthetic PawS1 gene was also generated using the same technology to use as a control in this study, because the other synthetic variants of PawS1 generated were codon optimised for expression in A. thaliana. From original plasmids containing the ordered synthetic genes, variants were generated by PCR. The PCR were conducted using phosphorylated primers surrounding a region to delete in an original plasmid. The whole plasmid was amplified by PCR, at the exception of the small region surrounded by the couple of primers. The resulting truncated and linearized plasmid was then re-circularised thanks to ligation of the two phosphorylated free ends. Synthetic genes encoding 3SFTI, 2SFTI and 1SFTI were generated this way from the one encoding 4SFTI, using the primer couples BP37 (5'-GTT ATC AAG ACC ATC TGG ATA GCA-3') and BP40 (5'-TGA TGA GAG CTC ATT AAT TAA-3'), BP38 (5'-ATT ATC CAA ACC ATC TGG GAA ACA-3') and BP40 or BP39 (5'-GTT ATC GAG TCC ATC AGG GAA-3') and BP40 respectively to remove one, two or three SFTI-GLDN domains. To remove the ER signal and generate the artificial gene encoding PawS1[∆2-21], the PawS1 synthetic gene was used as template in a PCR reaction with primers BP51 (5'-GGA TAC AAG ACC TCT ATC TC-3') and BP52 (5'-CAT TGT ATC GAT TGG CGC GC-3'). To create binary constructs for plant transformation, the vector pAOV (Mylne and Botella 1998) was cut with XhoI/SacI to remove its CaMV35S promoter, and then the OLEOSIN promoter with XhoI/ClaI restriction sites and a synthetic gene digested with ClaI and SacI were triple ligated with the open promoter-less pAOV. Upon ligation, the synthetic gene is then preceded by the OLEOSIN promoter and followed by a Nopaline synthase terminator (nos3’) included in pAOV. 67 5.2.2 Plant transformation and growth conditions All plants were grown as in 2.2.2. Constructs carrying the synthetic genes, E. coli resistance to tetracycline and plant resistance to glufosinate ammonium, were tri-parental mated into Agrobacterium tumefaciens strain LBA4404 and used to transform A. thaliana wild-type Columbia (Col-0) or an aep1-1 aep2-3 aep3-1 aep4-1 (also known as αvpe-1 βvpe-3 γvpe-1 δvpe-1) quadruple mutant (Nakaune, Yamada et al. 2005) by floral dip (Bechtold, Ellis et al. 1993). Seeds resulting from the parental transformation were collected and sown on soil for selection of the first transgenic generation (T1) with Basta® herbicide with a final concentration of 0.4 g/L glufosinate ammonium (Bayer Cropscience Pty Ltd). The segregation ratio of resistant to susceptible seedlings in the second transgenic generation (T2) was used to select single locus transgene insertion lines. 5.2.3 Seed peptide extraction Peptides were extracted from fifty T2 seeds from each single locus transgenic line generated (or from one seed in the case of sunflower), as in 4.2.5. Seeds were placed in a 2.0 mL plastic tube with a 3/16" stainless steel ball, frozen in liquid nitrogen and shaken in a mixer mill (Retsch MM301). Tissue powder was resuspended in 0.25 mL methanol and 0.25 mL chloroform. For phase separation 0.1 mL of 0.05% (v/v) trifluoroacetic acid was added and mixed for 2 min at room temperature before centrifugation for 5 min at 20000 x g. The aqueous phase was retained and diluted 50-fold in 50% (v/v) acetonitrile 0.1% (v/v) trifluoroacetic acid for analysis by MALDI-TOF-MS. 68 5.2.4 SFTI-1 (cyclic and acyclic) detection and acyclic SFTI-1 quantification To detect SFTI-1 and quantify acyclic SFTI-1 in each seed peptide extract the same procedure as described in 4.2.6 was conducted. For each synthetic gene I created and introduced in A. thaliana, T2 seeds of a minimum of eight independent transgenic lines were tested and a minimum of four technical repeats per line were performed. 5.2.5 Shotgun proteomics of A. thaliana seeds To determine the abundant seed proteins in wild-type A. thaliana (Col-0 ecotype), proteins were extracted from 240 seeds in six independent biological replicates and digested with trypsin as described (Mylne, Colgrave et al. 2011). For each sample, 8 µg of digested proteins were injected into an Agilent 6550 Q-TOF using HPLC via Chip Cube interface with a C18 trapping/analytical Polaris chip for protein binding, and a 45 min gradient of 10% to 30% (v/v) acetonitrile in 0.1% (v/v) formic acid for elution. Settings used were positive ion mode, eight mass spectrometry (MS) scans at 250 to 1,400 mass-to-charge ratio per second, maximum of eight precursors fragmented per cycle with an absolute threshold of 5,000 counts, scan speed varied according to abundance, and charge state selection set to +2 and +3. Spectra were searched against TAIR10 protein sequences using Mascot 2.3.02 (Matrix Science) to infer the forty most abundant proteins in dry A. thaliana seeds (table 5.1). 69 5.2.6 Targeted proteomics For each construct, proteins were extracted from 240 seeds (30 seeds for eight independent transformation events with acyclic SFTI-1 production) were pooled). Proteins were extracted and digested with trypsin as previously described (Mylne, Colgrave et al. 2011). Based on a theoretical digest of each of the protein, possible peptide transitions were determined using Skyline, version 1.1.0.2905 (MacLean, Tomazela et al. 2010). These were then optimised following an algorithm specific for Agilent 6400 series instruments. For PawS1 albumin, I selected the fragment QLQQGQGGQQQVQQMVK (from the large subunit) as it is the only fragment unaffected by any of the artificial SFTI-GLDN insertion. As an internal control I targeted the tryptic fragment QEEPVCVCPTLR from SESA4, an endogenous A. thaliana napin-type albumin. For each fragment, I selected at least three transitions for one precursor ion. For identification of the selected daughter ions, 2 µg of digested protein were injected using an Agilent Technologies 1200 series nano/capillary LC system on an Agilent 6490 QqQ mass spectrometer with an HPLC Chip Cube interface with C18 trapping/analytical Polaris chip, and eluted with a 15.5 min and 3% to 100% (v/v) acetonitrile gradients in 0.1% (v/v) formic acid. The mass spectrometer was run in positive ion mode, with a drying gas temperature of 365°C and flow rate of 5 litres per min; for each transition, the fragmentor was set to 130 and dwell time was 5 ms. As reference for the expected fragmentation pattern and retention times for PawS1 albumin tryptic fragment, I performed an identical proteomic analysis on recombinant 6-His tagged PawS1 albumin produced as previously described (Mylne, Colgrave et al. 2011). 70 5.3 Results 5.3.1 The SFTI-1 sequence can emerge from non-native locations In PawS1, the sequence for SFTI-1 and its trailing GLDN sequence (GRCTKSIPPICFPDGLDN, called SFTI-GLDN hereafter) immediately precede the sequence of mature PawS1 albumin. This location avoids interfering with the adjacent albumin. The GLDN sequence trailing SFTI-1 is conserved within the PawS1 family of protein precursors and is essential for SFTI-1 maturation (Mylne, Colgrave et al. 2011; Elliott, Delay et al. 2014). Although AEP only cleaves at three asparagine (Asn) residues, there are seven in PawS1 (figure 5.1): Asn35 precedes SFTI-1, Asn53 ends SFTI-GLDN and precedes PawS1 albumin small subunit, Asn65 is in the middle of PawS1 albumin small subunit, Asn84 precedes PawS1 albumin large subunit whereas the remaining three (Asn96, Asn143, Asn146) are within PawS1 albumin large subunit and are interspersed with conserved cysteine residues that form disulfide bonds and impart the five helical structure conserved in all known structures for napin-type seed storage albumin (Mylne, Hara-Nishimura et al. 2014). I forced SFTI-1 and PawS1 albumin to compete for processing by creating PawS1 variants with SFTI-GLDN placed directly after one of the five alternative existing asparagine residues in and around what become the small and large PawS1 albumin subunits during maturation (figure 5.1). The SFTI-GLDN sequence was removed from its native position and inserted at Asn65, Asn84, Asn96, Asn143, Asn146 in the constructs Paw[N65]S1, Paw[N84]S1, Paw[N96]S1, Paw[N143]S1, and Paw[N146]S1 respectively. Of these five asparagine residues, only Asn84 is targeted by AEP in native PawS1. I also generated PawS1[Δ2-21], a variant lacking the N-terminal ER signal sequence. Each synthetic open reading frame was placed in a binary vector under control of the seed-specific OLEOSIN promoter (Parmenter, Boothe et al. 1995) and transformed into A. thaliana wild-type plants by Agrobacterium-mediated in planta transformation. 71 Figure 5.2. Acyclic SFTI-1 production from non-native locations in PawS1. (left) Synthetic precursor proteins with ER signal peptide (pink), SFTI-1 (cyan), GLDN (black), PawS1 albumin small subunit (green), PawS1 albumin large subunit (orange) and propeptides (line). (middle) acyclic SFTI-1 relative quantity based on MALDI-TOF-MS analysis of seed peptide extracts from transgenic A. thaliana wild-type, stars indicate significant difference compared to the PawS1 line (grey). (right) MALDI-TOF-MS spectra of seed peptide extracts. Each spectrum typically displays only the 1531.82 Daltons mass for acyclic SFTI-1, excepted for lines expressing the chimeric precursors with no ER signal, that produce no peptide. For full sequences see supplemental figures 9.3 and 9.4. To determine whether acyclic SFTI-1 could emerge from these non-native locations within PawS1, seed extracts from transgenic lines were analysed by matrix-assisted laser desorption-ionisation mass spectrometry (MALD-TOF-MS). Although the objective was to look for the presence of a mass for acyclic SFTI-1 (1,531 Daltons), full spectra from 900 Daltons to 3,000 Daltons were acquired, to look for mis-processing events. A mass consistent with acyclic SFTI-1 was detectable for all five SFTI-GLDN-insertion constructs and there were no additional masses, which would have indicated mis-processing. No acyclic SFTI-1 was detectable when the ER signal was removed from PawS1 (figure 5.2). To compare acyclic SFTI-1 production from each construct, MALDI-TOF-MS spectra were acquired for at least nine independent, single-locus lines. Relative quantitation from spectra peak areas was enabled by including a spiked standard identical to acyclic SFTI-1, but possessing a diselenide bond instead of disulfide. This gave the spiked standard comparable ionisation, but a higher mass (1627.7 Daltons) and distinct isotopic peak pattern. I found that the acyclic SFTI-1 levels produced when the SFTI-GLDN sequence was within PawS1 albumin small subunit (Paw[N65]S1) or at PawS1 albumin large subunit release point (Paw[N84]S1) did not differ significantly from the levels produced by native PawS1. Acyclic SFTI-1 continued to emerge when it was buried in the large albumin subunit (Paw[N96]S1, Paw[N143]S1, and Paw[N146]S1), but in significantly (P <0.01) reduced levels, about half that when at its native location (figure 5.2). Overall these data suggest that acyclic SFTI-1 can emerge successfully from multiple locations within PawS1 provided this protein precursor has an ER signal. The ability for acyclic SFTI-1 to consistently emerge from within PawS1 albumin suggests that its disulfide bond is formed correctly and that AEP is able to process acyclic SFTI-1 from all tested locations. It is also interesting to note here that the artificial presence of the SFTI-GLDN sequence downstream asparagine residues not usually targeted by AEP converted them into AEP processing targets. 72 Figure 5.3. PawS1 albumin production from PawS1 variants. Targeted proteomic analysis of seed protein extracts from A. thaliana lines, each expressing one of the PawS1 derivate synthetic constructs. PawS1 (transgenic) and SESA4 (native) albumins presence were each determined by detection of one specific tryptic fragment, both selected for being the most easily and reproducibly detectable in preliminary analysis. The first column shows the retention time and relative intensity of the detected fragments. The two next column shows MS/MS fractionation of respectively the PawS1 albumin fragment first and the SESA4 one then, when being detected. Within PawS1, SFTI-1 and PawS1 albumin are neighbours, but when the SFTI-GLDN sequence is artificially buried in the mature PawS1 albumin sequence, acyclic SFTI-1 continues to emerge. To find out if unprocessed or truncated PawS1 albumin persists in these lines, its presence was determined using targeted LC-MS/MS with endogenous A. thaliana SEED STORAGE PROTEIN 4 (SESA4) as an internal control. Using this targeted proteomics approach I found no evidence for mature PawS1 albumin in transgenic lines expressing constructs in which PawS1 albumin was interrupted by SFTI-GLDN (figure 5.3). The endogenous control SESA4 was detected in all transgenic lines and PawS1 albumin was detectable in lines expressing unaltered PawS1. Combined with acyclic SFTI-1 quantification, these data confirm that although acyclic SFTI-1 and PawS1 albumin both emerge from PawS1 when they reside side-by-side, when SFTI-GLDN resides within the mature PawS1 albumin domain, acyclic SFTI-1 will keep emerging to the detriment of PawS1 albumin. 73 AGI At5g44120 At4g28520 At1g03880 At4g27150 Function/identity Cruciferin, 12S globulin, seed storage Cruciferin, 12S globulin, seed storage Cruciferin, 12S globulin, seed storage Napin type 2S albumin, seed storage At4g27160 Napin type 2S albumin, seed storage At4g27170 Napin type 2S albumin, seed storage At5g66400 At3g22640 Dehydrin family, stress induced RmlC-type cupin, vicilin, 7S seed storage Lipase/lipooxygenase, PLAT/LH2 family Napin type 2S albumin, seed storage Polyketide cyclase/dehydrase Lipid transfer 1-cysteine peroxiredoxin antioxidant Napin type 2S albumin, seed storage At4g39730 At5g54740 At1g14950 At2g38530 At1g48130 At4g27140 At1g03890 At3g15670 At2g40170 At5g12030 At5g40420 At1g52690 At4g25140 At4g39260 At3g46230 At1g05510 At3g02480 At5g42980 At5g45690 At3g17520 At5g06760 At2g42560 At3g51810 At4g38740 At3g53040 At5g01300 At2g36640 At3g05020 At3g50980 At3g15260 RmlC-type cupin, vicilin, 7S seed storage Late embryogenesis abundant Late embryogenesis abundant Heat shock, chaperone activity Oleosin, seed lipid accumulation Late embryogenesis abundant Oleosin, seed lipid accumulation Cold induction, circadian rhythm related Heat shock protein Oil body associated and ABA responsive Late embryogenesis abundant Thioredoxin (H-TYPE) Unknow n function Late embryogenesis abundant Late embryogenesis abundant Late embryogenesis abundant Stress induced protein Rotamase cyclophilin Late embryogenesis abundant Phosphatidylethanolamine-binding Late embryogenesis abundant Acyl carrier Unknow n function Protein phosphatase 2C family Nam e CRU1, CRA1 CRU3, CRC CRU2, CRB SESA2, AT2S2 SESA3, AT2S3 SESA4, AT2S4 RAB18, ATDI8 PAP85 PLAT1 SESA5 LTP2, CDF3 PER1 SESA1, AT2S1 LEA6, EM6 HSP17.6 OLE2, OLEO2 LEA7 OLE1, OLEO1 CCR1, GRP8 HSP17.4 OBAP1A TRX3, TRXH3 LEA-LIKE LEA4-5 GEA1, EM1 ROC1 PEBP ECP63 ACP, ACP1 XERO1 aa 472 524 455 170 Hits 263 413 189 89 frag 9 20 10 5 cov. 31.05 55.07 33.40 36.98 Q 29.72 21.20 18.85 16.59 164 78 6 33.05 13.43 166 67 7 37.87 9.57 186 486 25 87 3 12 25.12 24.58 8.71 7.28 181 165 155 118 216 164 13 32 31 17 38 13 2 6 6 3 8 3 11.38 25.23 40.77 34.78 36.27 18.08 6.00 5.76 5.34 5.20 4.83 4.28 451 32 8 17.22 4.20 225 92 156 199 169 173 169 156 241 31 15 8 8 12 12 5 11 29 8 4 2 2 4 3 2 3 9 40.13 37.88 10.30 12.77 26.43 20.00 11.00 18.80 37.20 4.00 3.91 3.90 3.54 3.48 3.41 3.38 3.37 3.31 68 118 247 298 158 635 152 172 479 162 448 137 128 289 12 6 15 11 8 28 10 5 10 5 9 3 2 4 4 2 5 4 3 11 4 3 5 2 5 2 1 4 54.88 20.05 25.16 14.90 28.25 22.25 30.70 23.60 10.50 12.48 11.86 20.43 16.78 13.56 3.29 3.27 3.08 2.74 2.56 2.49 2.19 2.08 2.00 2.00 1.77 1.55 1.29 1.17 Table 5.1. Shotgun proteomic analysis of A. thaliana Col-0 dry seeds. Presented results are averaged on analysis of six independent biological replicates. Only proteins identified in at least five of six replicates with more than 10% coverage are presented. Proteins are ordered according to their relative abundance in the samples. AGI: reference gene model for the protein in the A. thaliana information resource (TAIR) database. Function/identity: summary of functional information available for the protein on TAIR and the Universal Protein Resource. Name: list of the most common names given to the protein. aa: number of amino acid in the protein. Hits: total number of peptide fragments identified and matching to the protein sequence. frag: total number of unique peptide fragments identified and matching to the protein sequence. cov: Percentage of the total protein precursor sequence matching with identified peptide fragments. Q: relative quantity of proteins (in 1 µg A. thaliana Col-0 ecotype dry seeds total protein extract) was roughly estimated based on average number of hits per unique sequences identified by shotgun proteomics. 5.3.2 The SFTI-1 sequence can recruit AEP The above results show that acyclic SFTI-1 can emerge from non-native locations and triumphs over PawS1 albumin when forced into competition for proper processing. To determine whether acyclic SFTI-1 can emerge successfully from other protein precursors I created synthetic chimeric genes by fusing the DNA sequence for SFTI-GLDN to genes encoding different and abundant A. thaliana seed proteins. To determine a set of such proteins from wild-type A. thaliana seeds, I used shotgun proteomics, and selected five (table 5.1). The proteins were selected to include similar and different processing routes to PawS1 albumin. The most similar was SEED STORAGE ALBUMIN 1 (At4g27140, SESA1) which, like PawS1 albumin, is napin-type seed storage albumin. The second was CRUCIFERIN A1 (At5g44120, CRU1), which is a different type of seed storage protein referred to as cruciferin or more broadly as globulin (Fujiwara, Nambara et al. 2002). Globulins are other major AEP-processed seed storage proteins and mutations in trafficking or processing machinery affect albumins and globulins processing similarly (Gruis, Schulze et al. 2004; Takagi, Renna et al. 2013). The third, Plastid-lipid-Associated Protein 85 (At3g22640, PAP85), is a vicilin-like seed storage protein (Rozwadowski, Yang et al. 2008), another type of seed storage protein. Vicilin genes and vicilin precursor maturation are largely poorly described and understood (Shewry, Napier et al. 1995; Ribeiro, Monteiro et al. 2014), but at least one vicilin precursor has been shown to be processed by AEP for maturation (Yamada, Shimada et al. 1999). The fourth was a Late Embryogenesis Abundant protein (At3g17520, LEA-like) which is unrelated to albumin, vicilin and globulin storage proteins. LEA is ER targeted (Shih, Hoekstra et al. 2008) , but is not thought to require proteolytic maturation. The fifth protein chosen was an oleosin (At4g25140, OLE1), which differs from all other proteins chosen as it has no ER signal and so does not traffic via the secretory pathway, but is synthesised on the ER (Hsieh and Huang 2004). 74 Figure 5.4. Acyclic SFTI-1 production from chimeric precursors. (left) Chimeric precursor proteins with ER signal peptide (pink), SFTI-1 (cyan), GLDN (black), albumin small subunit (green), albumin large subunit (orange), other protein units (brown) and propeptides (line). (middle) acyclic SFTI-1 relative quantity based on MALDI-TOF-MS analysis of seed peptide extracts from transgenic A. thaliana wild-type (grey) or the aep-null background (black, aep), . stars indicate significant difference compared to the PawS1 lines (right) MALDI-TOF-MS spectra of seed peptide extracts. Each spectrum typically displays only the 1531.82 Daltons mass for acyclic SFTI-1, excepted for lines expressing the chimeric precursors with no ER signal, that produce no peptide. The only spectrum showing an unexpected mass was for the CraS1 construct, due to minor acyclic SFTI-1 misprocessing. For full sequences see supplemental figures 9.3 and 9.4. The first synthetic gene I designed (SesaS1) encodes a chimeric protein precursor with SFTI-GLDN inserted inside the precursor for SESA1, as a mimic of PawS1 and the second (SesaS2) is a variant with no N-terminal propeptide encoding sequence, placing SFTI-GLDN immediately adjacent to the predicted ER signal (see figure 5.4). Previous PawS1 deletion analysis showed the sequence between the ER signal and SFTI-GLDN could be removed without preventing acyclic SFTI-1 production (Mylne, Colgrave et al. 2011). For the three next chimeras (CraS1, PapS1 and LeaS1), SFTI-GLDN is inserted immediately after each ER signal such that ER signal cleavage would liberate the first amino acid residue of SFTI-1. For all chimeric constructs except SesaS1, proto-N-terminal processing of SFTI-1 is done by ER signal removal instead of AEP cleavage so only a single cleavage is required at Asp14 on SFTI-1 for acyclic SFTI-1 to be successfully released. For the oleosin chimera (OleoS1), as there is no ER signal, SFTI-GLDN was inserted immediately after the OLE1 start methionine. All chimeric constructs were stably transformed into A. thaliana under control of the OLEOSIN promoter (Parmenter, Boothe et al. 1995) and seeds of multiple transgenic lines analysed by MALDI-TOF-MS. A mass for acyclic SFTI-1 (1531.8 Daltons) was detectable in seed peptide extracts for all constructs except the OLE1 chimera (figure 5.4). Quantification for each line (relative to the PawS1 line) with a spiked standard (1627.7 Daltons) found the level of acyclic SFTI-1 produced was not significantly different (P = 0.66) when SFTI-GLDN was buried in a similar location within SESA1 (SesaS1 construct), although removal of the spacer regions between the ER signal and SFTI-1 (SesaS2 construct) reduced acyclic SFTI-1 production significantly (P < 0.01). Acyclic SFTI-1 levels in lines expressing CraS1, PapS1 or LeaS1 lines were significantly (P <0.013) reduced to respectively 60%, 40% and 60% of PawS1 lines. A mis-processed product was only detectable in the CraS1 lines (figure 5.4), indicating precise acyclic SFTI-1 excision overall. The extra mass in CraS1 lines was consistent with the absence of the first glycine residue of acyclic SFTI-1, possibly due to some flexibility for ER signal removal. 75 Overall, these results demonstrate that acyclic SFTI-1 is robust and can emerge successfully and precisely from a range of A. thaliana seed proteins including some that do not require AEP for processing, and are probably subject to different trafficking. The only construct that failed to produce acyclic SFTI-1 was OleoS1, which lacks an ER signal. Combined with previous PawS1 mutagenesis studies (Mylne, Colgrave et al. 2011; Elliott, Delay et al. 2014), our results imply the only prerequisites for successful acyclic SFTI-1 maturation is that the protein precursor it resides within is delivered to the secretory system and that acyclic SFTI-1 is flanked by the appropriate AEP-target residues for release. To investigate whether acyclic SFTI-1 emerges from chimeric proteins in an AEP-dependent fashion, the same suite of constructs were introduced into an A. thaliana quadruple knockout line lacking AEP. This aep-null line has no obvious phenotype apart from misprocessing of seed storage proteins (Shimada, Yamada et al. 2003; Gruis, Schulze et al. 2004). In all cases, acyclic SFTI-1 production dropped or was abolished, demonstrating that AEP is required for the processing of acyclic SFTI-1 from the chimeric constructs (figure 5.4). A mass for acyclic SFTI-1 (1531.8 Daltons) was detected only for the lines expressing the chimeric PapS1 and LeaS1 precursors, which implies non-AEP proteases might process acyclic SFTI-1 from the PapS1 and LeaS1 chimeric protein precursors. The acyclic SFTI-1 level from PapS1 and LeaS1 in aep-null background are significantly reduced (P < 0.01) by more than 50% relative to wild-type background (figure 5.4), suggesting that AEP is still involved in acyclic SFTI-1 production from PapS1 and LeaS1. Overall, these data illustrate that, provided it is attached to an ER targeted protein, acyclic SFTI-1 can emerge with precision from non-native protein contexts and recruits AEP to do so. 76 Figure 5.5. Acyclic SFTI-1 production from albumin-less PawS1 variants. (left) Synthetic precursor proteins with ER signal peptide (pink), SFTI-1 (cyan), GLDN (black), PawS1 albumin small subunit (green) and large subunit (orange) and propeptides (line). The SFTI-1 variant in each construct are indicated as followed. R corresponds to SFTI-1K5R (acyclic mass 1559.82 Daltons), Y corresponds to SFTI-1F12Y (acyclic mass 1547.82 Daltons) and V corresponds to SFTI-1I10V (acyclic mass 1517.82 Daltons). (middle) acyclic SFTI-1 relative quantity based on MALDI-TOF-MS analysis of seed peptide extracts from single insertion lines of transgenic A. thaliana with wild-type (grey) or the aep-null (black, aep) genetic background, stars indicate significant difference compared to the PawS1 lines. (right) MALDI-TOF-MS spectra of seed peptide extracts. Spectrum for the PawS1 line is not shown. Each spectrum displays the expected masses for the single amino acid residue variants of acyclic SFTI-1 Spectra for the albumin-less variants are from multiple insertion lines, to have a high enough signal on the picture. For full sequences see supplemental figures 9.3 and 9.4. 5.3.3 The SFTI-1 sequence can recruit AEP without an adjacent protein The ability of acyclic SFTI-1 to emerge from various locations and various proteins implies that the SFTI-GLDN sequence has the ability to overtake adjacent proteins processing and recruit AEP to proteins (or protein locations) not naturally targeted by AEP. To ascertain whether SFTI-1 requires an adjacent protein I created versions of PawS1 without the sequence for PawS1 albumin. To guard against the possibility that a single SFTI-GLDN might traffic or be processed improperly due to its short length, I created four PawS1 variants encoding one to four tandem SFTI-GLDN sequences with no adjacent albumin (respectively called 1SFTI to 4SFTI), each varying by a single amino acid residue so as to be distinguishable based on mass. I used previous results (Mylne, Colgrave et al. 2011) to select mutations that would not impair processing. I also used a variant PawS1 that encodes the four tandem SFTI-GLDN and the adjacent PawS1 albumin (called Paw4SFTI) as a positive control (figure 5.5). This Paw4SFTI construct mimics natural variants of PawS1 such as the 4-cassette PawS1 gene from Galinsoga quadriradiata (Elliott, Delay et al. 2014). I transformed these chimeric genes into A. thaliana and could detect masses for all expected acyclic SFTI-1 variants in all seed peptide extracts (figure 5.5). Acyclic SFTI-1 production from albumin-less precursors was only 5% to 10% of the level when embedded next to PawS1 albumin (figure 5.5). This result shows that although acyclic SFTI-1 does not have an absolute requirement for an adjacent protein, its production benefits from being co-matured with an adjacent protein. I also wanted to test whether AEP was responsible for acyclic SFTI-1 production from this type of precursors. For this reason, the synthetic genes encoding the chimeric protein precursors containing four tandems SFTI-GLDN and either an adjacent PawS1 albumin or no albumin (namely Paw4SFTI and 4SFTI) were also transformed into the aep-null background. I performed the same MALDI-TOF-MS analysis on these transformed lines and acyclic SFTI-1 production was not detectable in their seed extracts (figure 5.5), demonstrating that AEP is required for the proper processing of acyclic SFTI-1 and the three acyclic SFTI-1 variants from these constructs. Overall, these results demonstrate the SFTI-GLDN sequence alone is capable of recruiting AEP for its processing. 77 5.4 Conclusion In the native context, SFTI-1 and its adjacent albumin are co-translated as a single precursor and cleaved into their independent and mature forms by AEP. Here, by making chimeric constructs I showed acyclic SFTI-1 can also emerge from alternative locations within its native protein precursor, to the detriment of PawS1 albumin. The SFTI-1 sequence also emerged with precision from chimeric protein precursors by recruiting AEP anew to the artificial precursors deriving from various other types of seed protein precursors than albumins. I also found that acyclic SFTI-1 can be produced from an albumin-less PawS1 variant, confirming that the sequence for SFTI-1 alone can recruit AEP for processing from within its protein precursor. The production of acyclic SFTI-1 was not dependent upon the ancillary albumin, but was greatly reduced without it. In addition, this result alone raises questions about the relationship between SFTI-1 and its adjacent albumin and why do they remain linked, which will be discussed in the general discussion to this thesis. The dominance of SFTI-1 over its host protein, the ease with which it emerges from other proteins and its ability to recruit AEP for its maturation demonstrate that SFTI-1 is not just a passive by-product of PawS1 albumin production. Overall, I demonstrate with a large variety of constructs that the SFTI-1 precursor sequence (SFTI-GLDN) can recruit AEP for its processing. Previous studies conducted in my research group on SFTI-1 evolution sequence of events, illustrated an unusual evolution mechanism. SFTI-1 coding sequence evolved de novo by sequential expansion in a napin-type albumin gene, without modifying the coding frame or sequence for the mature ancestral protein (Elliott, Delay et al. 2014). Such a mechanism bypasses many of the impediments that other de novo evolving coding sequences generally face before encoding a stable, new, functional protein. This mechanism seems to differ from any other evolutionary mechanism previously described; which will be discussed in the next Chapter. Combined together, these data suggest that this original route for production of new bioactivities is remarkably efficient. 78 Chapter 6 Neoproteinisation: a new mechanism for de novo evolution A review 79 6 Neoproteinisation: a new mechanism for de novo evolution 6.1 Introduction The emergence of new proteins with potentially new bioactivities is essential for the evolution of any living organism and for the emergence of new species, and also contributes to increase the complexity of living organism proteomes. Many evolutionary mechanisms leading to the emergence of new proteins have been described, most of them based on remodelling existing material, and fewer based on de novo creation (Kaessmann 2010). In a recent study, the evolution of a small protein called SFTI-1 was described. The DNA sequence encoding SFTI-1 evolved de novo, through successive steps of expansion inside a napin-type albumin gene and without disturbing the sequence or the coding frame for the ancestral napin-type albumin (Elliott, Delay et al. 2014). This novel evolutionary mechanism has not previously been reported. In this review, I illustrate what differentiates this new mechanism from other previously described evolutionary mechanisms, and propose to term it neoproteinisation. Neoproteinisation occurs when a genetic expansion in an ancestral gene, that does not alter the coding frame or the sequence for the ancestral protein, leads to the creation of a new protein sequence that is co-produced with the ancestral protein in a single protein precursor cleaved during proteolytic maturation, thus allowing two proteins instead of one to be produced from a single gene. I also suggested that, similarly to de novo gene evolution, which had been overlooked until a decade ago, neoproteinisation might have been overlooked until now. A consequence of neoproteinisation is the creation of multi-protein-coding genes. I will also illustrate how new genes generated after neoproteinisation vary from most multi-protein-coding genes, and how understanding these differences can help reveal other potential cases of neoproteinisation. 80 6.2 How do new proteins arise? Evolutionary mechanisms can create new proteins in living organisms by various ways. The most commonly described mechanisms of duplication and divergence produce new protein-coding genes from ancestral gene templates in the lineage of a given organism (Kaessmann 2010). A new protein can also emerge de novo in a lineage after transfer of genetic material between different organisms encoding a protein anew to the receiving organism (Zhou and Wang 2008). New proteins completely anew to a given organism can also emerge de novo by the evolution of new coding DNA from ancestral non-coding DNA (Long, VanKuren et al. 2013; Light, Basile et al. 2014). 6.2.1 New protein evolution by DNA based gene duplication and divergence Gene duplication is the most commonly and extensively studied mechanism for new protein evolution and is very frequent in all prokaryotes (Romero and Palacios 1997) and eukaryotes (Zhou, Zhang et al. 2008). After duplication, one of the two duplicate gene copies can undergo neofunctionalisation (figure 6.1) and subsequently encode for a new protein, when the other copy carries on the ancestral function (Ohno 1970). Alternatively, the eventual multiple functions of an ancestral gene can be divided between duplicate genes in a process called subfunctionalisation (Force, Lynch et al. 1999), and thus give birth to two new protein-coding genes (figure 6.1). Extensive studies have described a large quantity of examples for both models in many organisms, strongly confirming their relevance (Zhang, Zhang et al. 2002; Long, Betrán et al. 2003). A quite interesting example of neofunctionalisation was illustrated in a leaf eating monkey. It was shown that adaptive evolution in Asian and African leaf-eating monkeys led to the convergent duplication and divergence of the pancreatic ribonuclease gene. In both organisms, different substitutions in the gene copies led to convergent neofunctionalisation and profound changes of the diverging copies, including a lower optimal pH more adapted to leaf digestion and stronger target specificity than the parental ribonuclease. The study showed that the resulting inability to digest double stranded RNA, a potentially quite important function for antiviral defence, was directly due to the 81 adaptive evolution to a lower optimal pH. Thus, the pancreatic ribonucleases adaptive evolution would probably have never occurred without prior duplication (Zhang 2006). This illustrates how duplication and divergence can be an efficient solution to adapt existing proteins to new selective constraints, as the adaptation to a leaf-only diet here. Evidence for subfunctionalisation events is often quite poor. For example the floral organ identity gene AGAMOUS, broadly expressed in Arabidopsis carpels and stamens, has two orthologues in maize, ZAG1 and ZMM2, which are differentially expressed in carpels and stamens. ZAG1 and ZMM2 seem to have undergone subfunctionalisation, with ZAG1 being more carpel specific, and ZMM2 being more stamen specific. Unfortunately, the potential changes in the regulatory regions of ZAG1 and ZMM2 that could explain the differential expression pattern compare to the Arabidopsis orthologue remain to be identified (Force, Lynch et al. 1999). Surprisingly, the molecular events underlying gene duplication are less well known. A gene duplicate can originate from small-scale events, such as the duplication of a gene fragment, a whole gene region, or a chromosomal segment containing several genes. These events can emerge through non-allelic homologous recombination (NAHR) through the mediation of repetitive elements (Roth, Porter et al. 1985; Bailey, Liu et al. 2003) or as a result of recombination errors (non-homologous end joining, NHEJ) (Kidd, Cooper et al. 2008; Yang, Arguello et al. 2008) Nevertheless, current knowledge on these mechanisms is rather speculative due to the small number of studied organisms (Zhou and Wang 2008). A new gene can also result from the duplication of a whole genome through various polyploidisation mechanisms (Semon and Wolfe 2007; Conant and Wolfe 2008; Van de Peer, Maere et al. 2009). Whereas single-gene duplication has been shown to have a great impact on new protein origination, and the development of new specific traits (Maere, De Bodt et al. 2005), whole genome duplication seem to have much less contributed to the creation of new activities or functions in animal models such as yeast (Hakes, Pinney et al. 2007) and human (Kaessmann 2010). 82 Figure 6.1. New protein evolution through duplication and divergence. The gene information can get duplicated at two levels. On the DNA level (top panel) a gene can duplicate. The two resulting duplicates can then diverge. Through neofunctionalisation, one of the two duplicates conserves the ancestral sequence and function and the other accumulates mutations until eventually encoding a different protein. Modification of the promoter region only can already change the expression pattern and confer a new function to the duplicate. The two duplicates can also co-diverge and subsequently subfunctionalise to both partly keep the ancestral sequence and function. The two resulting proteins are then necessary to fulfil the ancestral function. On the RNA level (bottom panel) a gene can be retroduplicated inside the genome. This retrocopy lacks regulatory regions, and often requires evolving a new promoter to be able to produce a protein. Thus the newly evolved retrogene can, already at this stage, have a different function than its ‘parental gene’ due to eventual expression pattern modifications. Then the retrocopy can also diverge similarly than already described for a duplicate gene. Contrastingly, whole genome duplication has been of significant importance for plant evolution. Indeed, the emergence of the regulatory genes necessary for seed and flower development was a result of two successive whole genome duplication events, in ancestral lineages, that happened shortly before the emergence of seed plants and flowering plants respectively (Jiao, Wickett et al. 2011). These major innovations have certainly contributed to the success of seed plants (also called gymnosperms) and flowering plants (also called angiosperms) that now represent the vast majority of land plants. This illustrates how whole genome duplication, in the case of plants, has significantly contributed to the creation of new important functions. 6.2.2 New protein evolution through RNA based gene duplication and divergence As outlined above, DNA-mediated gene duplication followed by divergence is the most commonly described mechanism giving rise to new proteins and is still considered as the main contributor to evolution. Another duplication mechanism, known as retroposition or retroduplication, can give rise to new gene copies (Kaessmann, Vinckenbosch et al. 2009). Retroduplication occurs when a ‘‘parent’’ gene is transcribed and matured into a messenger RNA (mRNA), and is then reversed transcribed into a complementary DNA copy and inserted into a new location in the genome (figure 6.1), thus creating a new gene copy (Brosius 1991). The specific enzymes (in particular the reverse transcriptase) required for retroposition are encoded by retrotransposons (or retrotransposable elements). Different retrotransposons exist in different species, for example LINE-1 in mammals, so that the number and specificities of the retrotransposable elements present in a genome are directly linked to the retroduplication occurrence (Kaessmann, Vinckenbosch et al. 2009). In plants, this mechanism is less broadly studied and thus the retrotransposons involved in retroduplication in plants remain to be characterised, but a recent study revealed that approximately 1% of all A. thaliana protein-coding genes were retrogenes, as opposed to 3.4% in human (Abdelsamad and Pecinka 2014). Unlike common gene duplicates, the resulting new retrocopy usually lack parental introns and rarely inherit the parental promoter, thus retaining only a part of the original information. For some time, retroduplicated genes were 83 considered to be pseudogenes because of these defects (Petrov, Lozovskaya et al. 1996; Mighell, Smith et al. 2000). Nevertheless, since the late 1980s (McCarrey and Thomas 1987), individual functional retrocopies (so-called retrogenes) were found. Many retrogenes have been reported in various genomes of plants (Drouin and Dover 1990; Charlesworth, Liu et al. 1998; Zhang, Wu et al. 2005; Wang, Zheng et al. 2006), fruit flies (Betran and Long 2003; Zhou, Zhang et al. 2008), and mammals (Marques, Dupanloup et al. 2005; Potrzebowski, Vinckenbosch et al. 2008). It appeared that retrogenes can acquire regulatory sequences (figure 6.1) from various sources, allowing them to be transcribed (a prerequisite for gene functionality), and evolve new expression patterns (Kaessmann, Vinckenbosch et al. 2009). However, the mechanisms underlying the acquisition of new regulatory sequences seems to be biased, as retrogenes frequently evolve testis-related gene expression patterns, such as 53% of all identified Drosophila retrogenes and 42% of all identified human retrogenes (Vinckenbosch, Dupanloup et al. 2006; Bai, Casola et al. 2007) for example. Similarly, most recently discovered retrogenes in A. thaliana had a strong preferential expression pattern for male gametes (Abdelsamad and Pecinka 2014). The acquisition of new expression patterns can lead retrogenes to evolve new functions. However, studies that elucidate the functions of these newly originated genes are limited. In 2009, Kaessmann et al. revealed that, in primates, retrogenes evolving new spatial expression patterns not only adapt to new functions, but also have contributed to hominoid brain evolution (Kaessmann, Vinckenbosch et al. 2009). A more striking example of new retrogene formation demonstrates how important a retroduplication event may be. In the domestic dog, chondrodysplasia, a short-legged phenotype common to several dog breeds is due only to a recently acquired retrogene. This retrocopy (fgf4), derived from a growth factor gene, encodes the same protein as its parental gene, but with a different expression pattern leading to the drastic modification of a morphological trait (Parker, VonHoldt et al. 2009). In addition to a new set of regulatory regions, a new retrocopy can also undergo modifications leading it to encode a completely new protein, thus creating a new function (figure 6.1). For example, in the human genome several transcribed retrogenes have acquired new exons de novo (Vinckenbosch, Dupanloup et al. 84 2006). More drastic modifications can also occur. For example, an unusual case of a new protein encoded by a retrogene was reported. Fg01, a mouse retrocopy of a ribosomal protein gene, encodes a protein completely different to that encoded by its parental gene because it is transcribed from the reverse strand. This new protein has an entirely new function directly linked to resistance against the Alzheimer’s disease (Zhang, Liu et al. 2009), illustrating that retroduplication can lead to profound functional innovations. 6.2.3 Creation of new proteins through gene or transcript fusion The idea that fusion mechanisms would be a great source of new proteins goes back to the discovery of introns (Gilbert 1978). It was thought that reshuffling exons of different transcription units would generate new chimeric proteins (Long, Betrán et al. 2003). Various mechanisms can lead to creation of chimeric proteins and fusion can happen either at the gene level or at the transcript level (Zhou and Wang 2008; Li, Zhao et al. 2009). Fusion mechanisms are by definition generating new functions as long as the resulting chimeric transcripts are subsequently translated into stable proteins. As introduced above, new transcriptional units can be created by combining exons from different genes or transcriptional units. This can happen at the transcript level, by a mechanism called intergenic splicing or trans-splicing, during RNA processing, when different pre-mRNAs are integrally or partially fused together (figure 6.2). Individual cases of trans-splicing were first reported in the late 1980s in protists, and then in all living kingdom (Bonen 1993). With the advent of the genomics era, genome-wide searches led to the discovery of a large number of chimeric RNAs in human (Akiva, Toporik et al. 2006), yeast (Li, Zhao et al. 2009), fruit fly (McManus, Duff et al. 2010) and rice (Zhang, Guo et al. 2010). However, these approaches are based on reverse transcription and PCR followed by sequencing, so might create artefacts and generate artificial chimeric cDNAs (McManus, Duff et al. 2010), suggesting that reports of trans-splicing should be treated with caution. 85 Figure 6.2. New protein evolution through fusion mechanisms. Fusion mechanisms can happen either at the DNA or at the RNA level. Two different genes can get fused, after duplication or retroduplication (top panel), resulting in a chimeric gene encoding a new protein. During RNA splicing (bottom panel), the splicing machinery can eventually exchange blocks from two different pre-mRNA (from two different genes or from the same gene) and create chimeric mRNA. This chimeric mRNA encodes a new protein, and can also be integrated into the genome through retroduplication. Nevertheless the large number of confirmed trans-splicing events suggests that this mechanism is universal and potentially creates novel functions by increasing the complexity of the transcriptome. Such generated chimeric proteins can have essential functions. For example, in fruit fly an entire family of transcription factors, regulating the nervous system development, is generated through trans-splicing from a single complex locus called lola (Horiuchi, Giniger et al. 2003). It is constituted of 32 exons on the same DNA strand. In 5’, each of the first four exons has an alternative transcription initiation site; the next four exons are common to all splicing variants, and the last 24 exons are alternatively spliced to the constant exons. More than 80 splicing variants can be generated this way. So far, 20 Lola isoforms have been identified; different isoforms specifying different aspects of axon growth and guidance. At least half of these protein isoforms are generated through trans-splicing, meaning that different parts from different pre-mRNAs variants generated from the lola locus are subsequently re-arranged together to generate more variants. Such complexity contributes to achieve a sophisticated expression pattern required for fine regulations of developmental processes. Additionally, transcription-induced chimeric mRNAs can eventually be integrated in the genome by retroposition (figure 6.2). During human evolution, a transcript chimera of two adjacent genes encoding respectively a specific type of kinase and a proteasome subunit became a fixed gene. This newly emerged retrogene, called pipsl, landed in a completely different chromosomic region, is regulated differently to its parental gene and encodes a testis-specific ubiquitin-binding protein unique to humans (Babushok, Ohshima et al. 2007). This example illustrates the importance of trans-splicing for organismal evolution. New proteins can also be generated and integrated in a more persistent fashion in an organism via gene fusion (figure 6.2). In this case, new chimeric genes arise from adjacent, complete or partial, gene copies. These duplicates can be provided from the previously described gene duplication, retroduplication and eventually by retrotransposon transcripts carrying with them a copy of their downstream adjacent genomic sequence. As these new genes are made from copies of DNA sequences, the function of the parental sequence is preserved. 86 Gene duplication creates multiple genes copies in a given genome, and further recombination errors can lead adjacent copies of related or completely different genes to be fused together. The functional importance of gene fusion in early evolution has been extensively illustrated (Patthy 1999), but fusion events of gene duplicates have also been reported in relatively recently evolved species. Several such chimeric genes have been identified in fruit fly (Zhou, Zhang et al. 2008), for example the new testes-specific expression chimeric gene Hun (Arguello, Chen et al. 2006). In humans, the usp6 oncogene, encoding an ubiquitin-specific protease, arose after partial duplication and fusion of two quite different genes, usp32 and tbc1d3. They are both quite ubiquitously expressed in human tissues, but usp32 is a unique ancient gene, highly conserved in animals, encoding a protein with an ubiquitin-protease domain and tbc1d3 is a newly evolved human specific gene, with multiple non-functional copies derived from duplication. After partial duplication in a other part of the genome, the fusion of the functional domain of tbc1d3 (Rab GTPase signalling function) and of the ubiquitin-protease domain of usp32 created usp6, a testis specifically expressed gene (Paulding, Ruvolo et al. 2003). New genes with testis-specific expression patterns are often supposed to be involved in the emergence of reproductive barriers. Overall, this specific case of gene fusion illustrated how new chimeric genes may have contributed to human speciation, and organism evolution in general. Gene fusion can also happen after retroduplication, as this mechanism copies coding sequences to new genomic locations. Retrocopies can land inside another gene (figure 6.2), resulting in the formation of a new chimeric gene, (Vinckenbosch, Dupanloup et al. 2006). Assuming that the gene where the retrocopy landed has duplicates, so that its original function is conserved and/or that the fusion is beneficial, this new chimeric gene might be retained through evolution. A typical example of this is the first young gene ever characterised; jingwei (jgw) in Drosophila (Long and Langley 1993). This gene, specific to the yakuba Drosophila lineage emerged after retroposition of the alcohol dehydrogenase gene (adh) inside a partial duplicate of the gene yande (ynd) and successive fusion of the adh retrocopy with the regulatory elements and the three first exons of the ynd duplicate. The new JGW chimeric protein is also an alcohol dehydrogenase, but has different targets in the pheromone and hormone metabolism compared to the conserved parental ADH, representing a 87 positive selective advantage, which has been kept by natural selection (Zhang, Dean et al. 2004). Similar fusion-after-retroduplication events have also been reported in mammals (Sayah, Sokolskaja et al. 2004; Brennan, Kozyrev et al. 2008) and this process is extremely important in plants (Wang, Zheng et al. 2006; Zhu, Zhang et al. 2009). During plant evolution many mitochondrial genes have been retrocopied into nuclear genes and fused to create new chimeras. In most cases, the nuclear parental genes provided signal sequences allowing the mitochondrial functions to be maintained by targeting the mitochondrion-derived protein back into the mitochondria (Liu, Zhuang et al. 2009). Retrotransposons can also contribute to the creation of new chimeric proteins. Another fusion mechanism was discovered in human (Moran, DeBerardinis et al. 1999) and later called retrotransposon-mediated transduction. During the naturally occurring retrotransposons duplication events, the RNA transcription machinery of a retrotransposon can accidently continue its transcription after the retrotransposon polyadenylation signal down to an alternative polyadenylation site in adjacent DNA. The resulting retrotransposon transcript will carry an extra copy of genomic DNA, and after retrotransposition this extra piece of DNA as well as parts of the retrotransposon can be subjected to similar fusion mechanisms as described above and eventually lead to the creation of a new gene encoding a new functional protein. Few cases have been reported, but a striking example is the gene setmar. During primate evolution, an Hsmar1 transposon got inserted downstream the set gene, and led to the creation of the chimeric gene setmar, after fusion of the transposase coding sequence with the set gene. The resulting chimeric protein exhibits similar histone methyltransferase activity as its SET parent, but has an extra region of yet unknown function (Cordaux, Udit et al. 2006). Overall, the large number of described events and the diversity of fusion mechanisms illustrate the importance of chimeric proteins for evolution. For instance more than 30% of all new genes in Drosophila were shown to be chimeric genes (Zhou, Zhang et al. 2008). 88 6.2.4 Acquisition of de novo proteins by horizontal gene transfer Horizontal gene transfer (or lateral gene transfer) is a non-sexual mechanism by which an organism incorporates genetic information from another organism (figure 6.3). This mechanism can result in the production of a new protein in a lineage, which occurs frequently in bacterial (Beiko, Harlow et al. 2005) and microbial eukaryote evolution (Gladyshev, Meselson et al. 2008). In plants, horizontal gene transfer is common between organelles to nuclei (Keeling and Palmer 2008) or between hosts and plant parasites (Yoshida, Maruyama et al. 2010). A recent study showed that during its evolution, the mitochondrial genome of the angiosperm Amborella trichopoda acquired the entire mitochondrial genome of two mosses in addition to whole-genome transfers three from green algae (Taylor, Rice et al. 2015). Horizontal gene transfer has been more rarely documented in animals. One example of bacteria-to-animal horizontal gene transfer was reported in Drosophila (Dunning Hotopp, Clark et al. 2007), but none of the genes have been proven to have a function. More recently, more than 100 human genes with important functions, involved in metabolism or immune responses for example, have been reported as foreign genes because their sequences share more identity with genes from non-animal organism than with other animals (Crisp, Boschetti et al. 2015). However, evidence that they were acquired through horizontal gene transfer is usually lacking, and they might have been lost in most lineages and conserved only in human or could also have arisen by other evolutionary mechanisms. To summarise, except these events of endosymbiosis only a few validated cases of horizontal gene transfer have been reported in eukaryotic lineages with no confirmation of a new protein being consecutively produced in the receiving organism (Gladyshev, Meselson et al. 2008; Pace, Gilbert et al. 2008). Therefore, the contribution of horizontal gene transfer to the origin of new protein-coding genes in multicellular eukaryotes remains to be confirmed and its role and mechanism remain largely uncharacterised. 89 Figure 6.3. Rapid acquisition or emergence of de novo protein. A new protein can appear in a given organism after horizontal transfer of genetic information from another organism (top panel). Alternatively in a given organism a gene can evolve a new translation initiation site de novo and thus encode two protein instead of one, the ancestral one and an additional new protein. Eventually this gene can further evolve to lose the ancestral translation initiation site and then only encode the new protein (bottom panel). 6.2.5 Emergence of de novo protein by new alternative gene coding As described above, various duplication and fusion mechanisms can contribute in some rare cases to the creation of completely new proteins. The complexity of the fusion or of the divergence after duplication can eventually lead to the emergence of a completely new protein, with a unique sequence in the given organism where it occurred and with a different function to it or its parental sequences. However, duplication and fusion mechanisms generally lead to variations of existing protein domains or re-shuffle them to create new assembly of domains, and so are less likely to generate evolutionary innovations. Another, originally less broadly observed, mechanism termed frameshifting or overprinting can lead to the creation of new proteins from existing genes. It was first thought to only concern micro-organisms. It was shown in bacteria (Ohno 1984; Delaye, Deluna et al. 2008), but mostly in viruses (Xue, Jones et al. 2003; Carter, Daugherty et al. 2013; Pavesi, Magiorkinis et al. 2013), that mutations leading to the creation of alternative open reading frames in a pre-existing gene can lead to the emergence of a new protein sequence, with often conservation of the pre-existing frame, allowing the production of two totally distinct proteins from a single mRNA (figure 6.3). An alternative open reading frame coding generates a completely different protein sequence due to the different codon usage and represents a much greater potential for innovation than the duplication or fusion of pre-existing genes. For instance, the emergence of alternative open reading frames (ORFs) represents obvious advantages for small genome organisms where duplication and fusion are limited, and it has greatly contributed to virus evolution. It has been extensively studied and reviewed for hepatitis B virus in particular (Torresi 2002; Zaaijer, van Hemert et al. 2007; Zhang, Chen et al. 2010; Torres, Fernandez et al. 2013). In 2001, for the first time in mammals, a case of overlapping reading frame was found (Klemke, Kehlenbach et al. 2001). The mammalian gene encoding the first subunit of the G-protein, involved in signal transduction particularly in neuroendocrine cells, has an alternative translation initiation site 32 bases downstream of the first one, thus the two ORFs have different frame. The first exon (called XL) originally encodes for the first domain (called XL domain) of the G-protein subunit 1. The presence of this alternative translation initiation site 90 and also of an alternative stop codon right at the end of the XL exon, lead to the production of another protein called ALEX from the same mRNA, but encoded only by the XL exon, thus much smaller than the original G-protein subunit 1, and totally different in sequence due to the different frames. ALEX interacts with the XL domain of the G-protein to achieve fine developmental and neurological functions. This locus and its dual protein expression has been conserved from mouse to human, with a rate of mutation much more important than average, especially between primates and human, suggesting involvement in species-specific neurological differences (Nekrutenko, Wadhawan et al. 2005). Such a striking example drew more attention on this mechanism and its relevance in eukaryotic evolution too. Further studies whose goal was systematic detection of de novo alternative gene coding events in mouse and human genomes revealed hundreds of alternative open reading frames leading to the production of newly expressed stable proteins (Okamura, Feuk et al. 2006; Vanderperre, Lucier et al. 2013), mainly being species-specific. This illustrates the importance of this mechanism for the creation of novel functions, and particularly for speciation (Wu and Sharp 2013). 6.2.6 De novo protein evolution Other mechanisms, more recently discovered, can lead to the creation of new proteins ‘from scratch’, therefore called de novo evolved protein. Similarly to alternative gene coding or horizontal gene transfer, de novo evolution can lead to the creation of completely new proteins with new sequences and functions, anew to the concerned organism. By contrast to new proteins created by other evolutionary mechanisms, de novo evolved proteins are encoded by a genomic sequence with no link to a parent coding sequence, representing true orphan genes that exist only in closely related species, at least at the time of its emergence. Before the omic era it was widely accepted that “The probability that a functional protein would appear de novo by random association of amino acids is practically zero” (Jacob 1977). In fact, it seems now that de novo evolution of protein is quite common, but specific cases are complicated to identify, as it requires complete assembly of genome or transcriptome sequences of numerous related species. This kind of data is still difficult to put together 91 nowadays, but is becoming increasingly achievable. The evolution of de novo genes is emerging as an important source of new proteins in addition to the familiar ‘parental gene’ model common to the traditional and previously described mechanisms (Long, Betrán et al. 2003; Bornberg-Bauer, Huylmans et al. 2010; Long, VanKuren et al. 2013). Cases of de novo genes have now been reported in all living kingdoms. More precisely, striking examples with important functions have been described in bacteria (Yang and Huang 2011), yeast (Cai, Zhao et al. 2008; Li, Dong et al. 2010), plants (Xiao, Liu et al. 2009), Drosophila (Begun, Lindfors et al. 2007), mouse (Heinen, Staubach et al. 2009) and human (Li, Zhang et al. 2010; Wu, Irwin et al. 2011). Comparative genomics and transcriptomics approaches have then revealed a plethora of de novo evolved transcripts (Carvunis, Rolland et al. 2012; Neme and Tautz 2013; Neme and Tautz 2014; Zhao, Saelao et al. 2014) illustrating that de novo evolution is a significant source of new proteins and new functions (Tautz 2014). The contribution of de novo genes to organism evolution and speciation is now accepted. For example, it was recently showed that de novo genes play complex roles in the human brain and that most of them are expressed in the evolutionarily newest part of human brain, the neocortex. Remarkably, these genes arose soon after its morphological origin and their expression pattern suggest a function in early brain development (Zhang, Landback et al. 2011). This demonstrates how de novo genes have contributed to the emergence of new brain functions during human evolution. 92 Figure 6.4. De novo protein evolution through de novo gene evolution. De novo genes evolve from non-coding DNA. A non-coding DNA region can become coding and acquire a transcription initiation site and regulating sequences through several mutations and insertions. More changes are required before the newly evolved gene can encode a stable protein. Recent studies have begun to reveal mechanisms that have enabled de novo gene evolution (Xie, Zhang et al. 2012; Reinhardt, Wanjiru et al. 2013). De novo genes evolve from non-coding DNA (figure 6.4). First several mutations and insertions lead this non-coding DNA to become transcriptionally active. This notably includes emergence of a new transcription initiation site and acquisition of regulating sequences. At this point of its evolution the emerging gene only produces a non-coding RNA (figure 6.4). Further mutations then allow the new gene to encode a protein, a long process during which the regulation of its transcription keeps evolving too. Alternatively, a dormant open reading frame can evolve new regulatory elements and become transcribed to create a new coding gene (figure 6.4). Subsequent challenges awaiting a de novo evolved gene before encoding a stable protein include successful trafficking of the encoded protein precursor, folding, processing and storage of the matured protein (figure 6.4). The complexity of the overall process of de novo gene creation led to the general assumption that such new genes would have a short intronless coding sequence. However, it has been shown that de novo genes are a great source of long coding RNAs (Ruiz-Orera, Messeguer et al. 2014) and that de novo genes can evolve introns before they become an ORF or acquire some after (Yang and Huang 2011). In addition, many orphans genes (present in only one species) were identified in the past as having important functions, suggesting their involvement in the evolution of adaptive traits (Domazet-Loso and Tautz 2003; Khalturin, Hemmrich et al. 2009) and recent studies suggest that an important part of these orphans genes evolved de novo (Toll-Riera, Bosch et al. 2009; Tautz and Domazet-Lošo 2011). Overall, de novo gene evolution, a mechanism first widely neglected, then long underestimated, is now emerging has one of the most important evolutionary mechanisms for the creation of novel functions, and numerous examples will continue to be discovered. Another rarely considered route to de novo protein evolution is for a new protein to evolve within an already existing functional protein precursor. De novo genes emerge from previously non-functional genomic sequence, unrelated to any pre-existing functional genes. However, in an organism, a sequence encoding for a new independent protein, unrelated to any pre-existing sequence in this organism, can also emerge de novo inside a pre-existing gene, through a mechanism termed neoproteinisation. 93 Figure 6.5. De novo protein evolution through neoproteinisation. Neoproteinisation occurs when a genetic expansion in an ancestral gene, that does not alter the coding frame or the sequence for the ancestral protein, leads to the creation of a new protein sequence that is co-produced with the ancestral protein in a single protein precursor cleaved during proteolytic maturation, allowing two proteins instead of one to be produced from a single gene. To ensure transcription of the newly evolved sequence and for the ancestral protein precursor to be maintained, the expansion has to happen inside the ORF, but in a region encoding a propeptide removed during the ancestral protein proteolytic maturation. At some point in its evolution such expansion can acquire the ability to be processed from the protein precursor and to remain stable, and then continue to evolve new bioactivities. 6.2.7 Neoproteinisation, a new mechanism of divergence that creates de novo proteins SFTI-1 is the first member of a recently discovered family of buried peptides that evolved within the daisy (Asteraceae) family over forty million years ago. The SFTI-1 coding sequence appears to have evolved via a genetic expansion within a napin-type albumin gene, without modifying the ancestral napin-type albumin coding frame, allowing two different proteins to be proteolytically matured from one precursor (Elliott, Delay et al. 2014) by the same endoprotease. This evolutionary mechanism that I call neoproteinisation has never been fully defined as an evolutionary mechanism (figure 6.5). Several points differentiate neoproteinisation from other de novo evolution mechanisms. Insertions and mutations in a gene can create an alternative splicing site (Sela, Itin et al. 2008) or an alternative coding frame (Klemke, Kehlenbach et al. 2001; Nekrutenko, Wadhawan et al. 2005) allowing a gene to produce a new additional protein. Nevertheless, most de novo evolved proteins described are encoded by de novo evolved genes (Zhou and Wang 2008; Kaessmann 2010; Light, Basile et al. 2014). Although initially considered rare or unusual (Jacob 1977), de novo gene evolution is now thought to be common, and although protein evidence for these de novo genes is often lacking, some have been shown to have acquired important functions (Heinen, Staubach et al. 2009; Chen, Zhang et al. 2010; Wu, Irwin et al. 2011; Reinhardt, Wanjiru et al. 2013). De novo gene evolution usually involves a silent or non-coding genome region acquiring regulatory sequences and a transcription initiation site, thus becoming transcriptionally active pseudogene. By further evolving a translatable open reading frame, the newly evolved gene encodes a protein, but this still must have proper processing, folding, and trafficking to be a stable component of the proteome. When the new gene provides a selective advantage, it is retained (Xie, Zhang et al. 2012; Reinhardt, Wanjiru et al. 2013). 94 Figure 6.6. Alternative splicing. All the different types of alternative splicing events that can be encountered by a pre-mRNA are represented. They generate alternative transcripts producing distinct proteins. Alternative sequences are indicated by beams. The most extreme possibility of alternative splicing, the use of alternative translation initiation sites in different frames, leads to the production of a completely different protein sequences (represented in different colours). By contrast, neoproteinisation is not preceded by de novo gene evolution and doesn’t require modification of the coding frame. The DNA sequence encoding SFTI-1 emerged de novo within an ancestral transcriptionally active ORF and did not change the coding frame or the sequence of the ancestral encoded protein. Additionally, SFTI-1 uses the trafficking and processing machinery of its adjacent protein (Mylne, Colgrave et al. 2011; Elliott, Delay et al. 2014). Thus, neoproteinisation might be an unlooked for and effective route for de novo protein evolution as it bypasses many of the evolutionary constraints de novo genes face to evolve stable proteins. 6.3 How to discover other neoproteinisation events? In order to discover other neoproteinisation events, it is first necessary to understand what differentiate neoproteinisation from other previously characterised evolutionary mechanisms. 6.3.1 Neoproteinisation creates unusual genes encoding multiple proteins Neoproteinisation is an evolutionary mechanism that leads to the creation of a new gene (from an ancestral one) encoding the ancestral protein and an additional de novo evolved protein. There are various ways for a gene to encode more than just one protein, including mechanisms of alternative splicing and the numerous possible post-translational modifications. Alternative splicing is a mechanism regulated by histone modifications (Luco, Pan et al. 2010) by which a single gene can produce different mRNAs (figure 6.6) by the use of alternative exons, transcription initiation sites or termination sites or by the retention of introns (Blencowe 2006). Most eukaryotic genes are alternatively spliced to generate two or more alternative mRNA, individually translated into different proteins, increasing the complexity of an organism’s proteome or of a cell-specific proteome (Nilsen and Graveley 2010). In mammals, the gene for Flt1, a vascular endothelial growth factor (VEGF) transmembrane receptor, is alternatively spliced to generate sFlt1, a truncated receptor lacking the transmembrane and intracellular domains. This soluble 95 receptor acts as a VEGF inhibitor, because it has the same binding domain as Flt1. In humans only, another splicing variant exists, called sFlt1-14, differing by 75 amino acid residues with sFlt1 on the N-terminal end. VEGF is a key factor in mammals for embryo development and adult neovascularisation. Whereas Flt1 and sFlt1 are broadly expressed and contribute to the transmission and regulation of the VEGF signal, sFlt1-14 appears to be more placenta-specific and associated with preeclampsia, a human-specific disease affecting about 5% of pregnancies, whereby sFlt1-14 becomes the prevalent placental produce of Flt1, resulting in the loss of Flt1 function. (Sela, Itin et al. 2008; Souders, Maynard et al. 2015). This example alone illustrates how alternative splicing contributes to establish fine regulations, signal transmission regulation in this case, and possibly speciation, and how its dis-regulation can lead to severe diseases. Other examples of alternative splicing are even more striking, illustrating how alternative splicing can greatly contributes to increase the proteome complexity and can achieve complex processes. A chicken specific gene called cSlo encodes, through alternative splicing, 576 different proteins, each preferentially expressed in the various hair cells of the avian cochlea to specifically react to different sound frequencies and allow hearing (Black 1998). In the fruit fly, the gene Dscam encodes a nervous system cell-surface protein necessary for axon targeting and is alternatively spliced to create up to more than 38,000 potential variants. Although the exact number of expressed variants is unknown, it has been shown that specific cells express unique combinations of up to 50 variants, potentially creating a form of unique cell identity (Neves, Zucker et al. 2004). Another type of post-transcriptional change, known as RNA editing, by which RNA undergo discrete changes, can change the genomic information. However, RNA editing, common in mitochondria, is rare for nuclear genes with very few examples reported in animals, and thus is regarded as a minor contributor to organismal protein diversity (Brennicke, Marchfelder et al. 1999). 96 Figure 6.7. Protein post-translational events. After translation several events can generate multiple proteins (in blue font). The direct product of translation is often a protein precursor that requires post-translational modifications to become functional. These post-translational modifications can include methylation, acetylation, hydroxylation, phosphorylation, nitrosylation, sulfation, or sugar, lipid or small protein attachment (a) A given protein precursor can eventually undergo different post-translational modifications to have different functions or to modulate its function, depending on various factors such as cell type for example, thus the gene encoding this protein precursor produce multiple different proteins. (b) The most common post-translational modification is proteolytic cleavage, commonly used for signal sequence removal and maturation of multi-units proteins from a single precursor (in light blue font). (c) Proteolytic cleavage can also mature multiprotein precursors by cleaving them into two or more proteins of identical, related or different sequence and function. (d) Protein degradation can sometimes result in the production of cryptic peptides with biological activities (in darker blue font). Another way for a gene to encode for multiple proteins is through post translational modifications. The primary product of translation often goes through modification processes such as sugar, lipid or small protein (e.g. ubiquitin, SUMO) attachment on specific residues or amino-acid modifications, including methylation, acetylation, hydroxylation, phosphorylation, nitrosylation or sulfation (Mann and Jensen 2003), thus increasing the protein diversity (figure 6.7a). However, probably the most important set of modifications occurring during protein maturation is generated by the action of proteases (figure 6.7b). The process of protein maturation by proteolytic cleavage is widely studied, and is required for the maturation of many proteins (Rogers and Overall 2013). It is particularly important for the maturation of multi-chain proteins originally produced as a single-chain precursor, such as insulin (Steiner, Cunningham et al. 1967; Chance, Ellis et al. 1968; Rubenstein, Clark et al. 1969). Another type of precursors, named polyproteins, depends on proteases for maturation. Polyproteins are encoded by genes producing a single mRNA that is translated into a single protein precursor, but this precursor is cleaved into multiple proteins during its maturation (figure 6.7c). It is particularly important for viruses, as many viral genomes encode just one polyprotein, which is cleaved into multiple functional proteins (Kazachkov Yu, Svitkin Yu et al. 1983; Carrington and Dougherty 1988; Korant 1990). Polyproteins have also been reported in all types of organism such as bacteria (Thony-Meyer, Bock et al. 1992), yeast (Singh, Chen et al. 1983; Gessert, Kim et al. 1994), amphibians (Hoffmann, Bach et al. 1983), fish (Hsiao, Cheng et al. 1990), insects (Settle, Green et al. 1995) and plants (Pearce, Moura et al. 2001) where it is particularly frequent for seed storage proteins (Shewry, Napier et al. 1995). In animals, most reported polyproteins produce hormones or neurotransmitters such as the mollusc bag cell neurotransmitters, all cleaved from a single precursor (Rothman, Mayeri et al. 1983), or the mammalian hormone dual precursors of arginine-vasopressin and neurophysin II (Land, Schutz et al. 1982) or of oxytocin and neurophysin I (Land, Grez et al. 1983). 97 Most known polyprotein precursors generate proteins with identical or related sequences and functions (see (Douglass, Civelli et al. 1984) for review). These precursors contain tandem repeats of the same or similar proteins, further cleaved into independent functional proteins. For example in plant seeds, storage albumin precursors can contain two identical or related copies of albumin (Shewry and Pandya 1999) and a fish gene encoding antifreeze glycopeptides (AFGP) produce a single polyprotein precursor including 44 tandem repeats of AFGP8 and two of AFGP7 all separated by a three amino acid spacers, further cleaved to produce the 46 proteins (Hsiao, Cheng et al. 1990). More rarely, polyprotein genes can also encodes multiple proteins with unrelated sequences and unique biochemical actions such as the two neurophysin containing precursor described above. The most extensively studied example is most probably the human proopiomelanocortin (or POMC) prohormone precursor. The POMC precursor can be first cleaved into y-melanotropin (y-MSH), adrenocorticotropin (ACTH) and b-lipotropin (b-LPH), three hormones with unrelated sequences and structures and unique functions (Mains, Eipper et al. 1977; Spiess, Mount et al. 1982). ACTH is primarily a positive regulator of cortisol production and release, y-MSH induces melanin production and b-LPH stimulates lipolysis. Further cleavages can occur and cut these proteins into smaller components involved in various biological responses. ACTH can be cleaved to release another melanotropin (a-MSH) and b-LPH is also a precursor for the endorphins (endogenous opioid neuropeptides) and b-MSH, a third melanotropin (Ueki, Someya et al. 2007). The several proteins derived from POMC are preferentially produced in a tissue-specific fashion thanks to cell-specific production of the processing proteases (Autelitano, Rajic et al. 2006), illustrating how a single gene can produce a unique translational product, but still encodes multiple proteins. Because the secondary POMC derivatives are cleavage products of functional proteins, thus representing additional bioactive polypeptides hidden in bigger functional proteins, they are sometimes referred as cryptides or crypteins. Later, many other examples of so-called cryptic peptides have been described and reviewed (Autelitano, Rajic et al. 2006; Pimenta and Lebrun 2007; Ueki, Someya et al. 2007; Samir and Link 2011). The extracellular matrix is supposed to be the greatest contributor of cryptic fragments (Schenk and Quaranta 2003), such as endostatin (Yamaguchi, Anand-Apte et al. 1999), restin (Sasaki, 98 Larsson et al. 1999), tumstatin (Maeshima, Sudhakar et al. 2002), canstatin (Kamphaus, Colorado et al. 2000) and arresten (Colorado, Torre et al. 2000), but all are degradation products of collagen chains and all are also reported to act as modulators of angiogenesis and tumour cell growth. Unlike POMC derivatives or more traditional protein encoded by polyprotein genes, cryptides are degradation products of functional proteins (figure 6.7d) and not the product of the complex and regulated maturation of a polyprotein precursor. In addition, the finding that peptides derived from larger functional proteins can affect biological processes does not prove that these peptides are true endogenous signalling molecules if their production and eventual secretion are not regulated (Gelman and Fricker 2010), which is still largely unknown. Therefore, most cryptic peptides reported so far, cannot be considered as additional proteins in a given proteome, but rather represent a new ‘ome’, the cryptome (Pimenta and Lebrun 2007). PawS1 represents an unusual polyprotein precursor. SFTI-1 is not a cryptic peptide because it is produced from an immature protein precursor, along with the other protein (a napin-type albumin). In this aspect, PawS1 is a traditional polyprotein precursor, but unlike most of them, PawS1 is cleaved into two proteins of unrelated sequence, size and function. Additionally, unlike these rarely described polyprotein genes encoding unrelated proteins, PawS1 belongs to a family of usually highly conserved genes, the plant napin-type albumin genes. The PawS1 family of napin-type albumin genes is restricted to only sunflower and its close relatives and has been shown to have evolved de novo from an ancestral napin-type albumin gene to encode an additional protein (Elliott, Delay et al. 2014), through a newly described mechanism I propose to term neoproteinisation. 99 6.3.2 Leads for discovery of other neoproteinisation events The proteolytic cleavage of large precursors to mature or ‘free’ functional proteins can be regarded as the most important post-translational modification mechanism due to its irreversibility and because it is ubiquitous to most organisms (Rogers and Overall 2013). This maturation processes generates by-products, the spacers and other propeptides, surrounding the different chains or proteins included in the larger precursors. Once again, it is important to note that such peptides are not to be considered as ‘cryptic’ because they are produced during maturation of an inactive protein precursor, and not by the degradation of a functional protein. Unfortunately, these peptides have been historically considered as non-functional or even junk. To the contrary, the emphasis created by the cryptome debate described above has brought to light this so-called junk which then revealed that it could potentially be a great reserve of a plethora of uncharacterised bioactive peptides. One of the best examples is the human C-peptide produced during maturation of insulin. The precursor for insulin was first described in 1968 (Chance, Ellis et al. 1968) along with the sequence of the peptide connecting the two insulin chains. This peptide was later called the C-peptide and in 1969 it was already reported to be secreted from pancreas to blood along with insulin (Rubenstein, Clark et al. 1969) and proposed to have biological significance. Surprisingly this study seems to have been long underestimated. More than two decades later beneficial effects of C-peptide on diabetic patients only started to be reported (Johansson, Kernell et al. 1993), illustrating the potential of this neglected peptide. During the last decade several studies have elucidated the role of C-peptide in reducing diabetes effects, showing that the C-peptide affects the expression of specific genes to antagonise endothelial dysfunction (Luppi, Cifarelli et al. 2008) and prevents renal hypertrophy by inducing arteriole dilatation and repressing sodium reabsorption (Nordquist, Brown et al. 2009; Nordquist and Wahren 2009). The C-peptide is now regarded as an endogenous hormone, but no C-peptide receptors have been reported so far, so its detailed mode of action is still unclear. 100 Interestingly the fact that the C-peptide was a small peptide lacking interspecies conservation, both in chain length and amino acid composition (excepted for the conserved cleavage sites), were the major reasons leading to it being neglected as having a function. No assumption should be taken regarding the size of a protein (as one says “size doesn’t matter”). The past assumption that small proteins (less than 100 amino acid residues) are less likely to have important functions is no longer widely accepted. Several studies have demonstrated how small peptides, even very small peptides, can have incredibly important functions (see (Su, Ling et al. 2013) for review) such as the eleven amino acid residues protein TAL from Drosophila influencing development by controlling gene expression and tissue folding (Galindo, Pueyo et al. 2007). Similarly to the C-peptide, the N-terminal propeptide of napin-type albumin precursors also displays low conservation (excepted for the highly conserved AEP cleavage site) relative to the more highly conserved napin-type albumin domain. Previous studies showed that, in some napin-type albumin genes from the daisy family of plants only, this propeptide evolved de novo to contain a new bioactive peptide, such as SFTI-1 in sunflower (Elliott, Delay et al. 2014). Therefore, the characteristics that led to human C-peptide function to be overlooked might exactly be what we should look for to discover new bioactivities. The investigation of widely conserved protein families that require proteolytic cleavages for their maturation, such as napin-type albumins or insulins, could reveal a multitude of new bioactive peptides. Furthermore, studying the detailed evolution of such peptides, like the previous work done for SFTI-1 evolution, might also reveal other cases of neoproteinisation. For example, in the next Chapter, I present a study of vicilin precursors (another type of seed storage protein) suggesting that in many plants species, vicilin precursors can possess additional domains, cleaved from the precursors by proteases during maturation, creating stable peptides with functions unrelated to seed storage proteins, such as digestive proteases inhibition and cytotoxic activities. 101 6.4 Conclusion The SFTI-1 family is the first family of proteins to have been shown to have evolved by neoproteinisation, a process by which a new protein-coding sequence evolve de novo inside an ancestral gene, without modifying the coding frame or sequence for the mature ancestral protein. Next generation sequencing has already greatly helped change the old vision of evolution. It was long-considered that de novo gene evolution was almost impossible and probable examples were neglected rather than being studied. Nowadays, the evolution of de novo genes is emerging as an important source of new proteins in addition to the familiar ‘parental gene’ model and cases have now been reported in all living kingdoms. I believe that, similarly, many other cases of neoproteinisation might have been overlooked in the past, and the use of modern approaches such as next generation sequencing allowing whole genome and/or transcriptome sequencing for a growing number of species will help reveal more neoproteinisation events. This mechanism appears to be remarkably efficient in many regards. First, neoproteinisation seems to create particularly efficient new bioactivities. Indeed, as previously said, neoproteinisation led to the creation of SFTI-1, the most potent trypsin inhibitor. SFTI-1 is also the second smallest trypsin inhibitor, which can also be regarded as a form of efficiency in terms of size to achieve a same function, compared to all the larger trypsin inhibitors. My previous results also demonstrated that SFTI-1 was a remarkably efficient peptide sequence for maturation. The AEP processing sites of SFTI-1 evolved prior the emergence of the trypsin inhibitory activity (Elliott, Delay et al. 2014), suggesting a strong dependency toward the processing enzyme for the evolution of SFTI-1. In addition to seed storage protein maturation, AEP was already demonstrated to be involved in the processing of many peptides. Overall, this suggests that AEP processed proteins might represent potential “hot-beds” of neoproteinisation events. Seed storage proteins have long precursors that require maturation by AEP for their maturation and for this reason could be of particular interest for the identification of other events of neoproteinisation. In the next Chapter, I present evidence supporting that the vicilin type of seed storage proteins precursors might also have evolved additional peptides by neoproteinisation. 102 Chapter 7 Neoproteinisation in vicilin precursors 103 Figure 7.1. C2 and similar peptides. The sequences of the C2 peptide from pumpkin, Luffin P1 from a pumpkin relative and the three MiAMP2 peptides from macadamia nuts are presented and aligned. The NMR structure of native Luffin P1, (Li, Yang et al. 2003). Luffin P1 shares 78% sequence identity with C2. 7 Neoproteinisation in vicilin precursors 7.1 Introduction If neoproteinisation was a universal mechanism, there should be other examples. Following the recommendations I presented in the previous Chapter, I searched published datasets for other cases of neoproteinisation. An unusual peptide from pumpkin (Cucurbita maxima) seeds called C2 was shown to emerge from the N-terminal domain of a precursor protein that also encoded a very different type of seed storage protein called vicilin (Yamada, Shimada et al. 1999). The precursor was called PV100 and is matured by AEP into a 50 kDa vicilin, a 5 kDa trypsin inhibitory peptide called C2, and three other peptides from 4-5 kDa with cytotoxic activities (Yamada, Shimada et al. 1999). The C2 peptide has a pair of CXXXC motifs connected by disulfide bonds, and its C-terminus presents two possible AEP cleavage sites. The two forms of C2, C2a (5615 Da) and C2b (5829 Da), only differ by two amino acid residues (see figure 7.1 for sequence). The C2 peptide is a trypsin inhibitor and the CXXXC motifs were proposed to be the active sites by homology with other trypsin inhibitors where conserved interconnected CXXXC motifs have been characterised as the active sites (Yamada, Shimada et al. 1999). Similarly, another research group isolated, from macadamia nuts (Macadamia integrifolia), three similar peptides named MiAMP2 (b, c, and d) and which display antimicrobial activities (Marcus, Green et al. 1999). These three peptides are also encoded in a single gene encoding a polyprotein precursor that also contains a typical vicilin domain. This polyprotein precursor was reported to share lot of similarities with PV100. These MiAMP2 (b, c, and d) peptides present a pair of CXXXC motifs like the C2 peptide from pumpkin and they possess a similar sequence (figure 7.1). They are also spaced by AEP target sites, as the similar peptides from the PV100 precursor, and possess an AEP target site right before their N-terminus. However, their mature sequence suggests they would receive additional processing after AEP processing, which was demonstrated for MiAMP2c. The C-terminal propeptide that separate MiAMP2c from the AEP cleavage site prior MiAMP2d was also sequenced in the immature form of MiAMP2c (Marcus, Green et al. 1999). A few years later a 104 close homologue of the C2 peptide was identified in a pumpkin relative (Luffa cylindrica), during a screen for ribosome inactivating peptide (Li, Yang et al. 2003). However, in this study only partial DNA sequence was obtained from the gene encoding this peptide, and thus the full length protein precursor was unknown and the homology with C2 was not noticed. Despite the lack of full length gene sequence, the double CXXXC motifs, as well as the overall sequence conservation (figure 7.1) implies that Luffin P1 is a C2 homologue and is derived from a homologue of PV100. Luffin P1 was further characterised and shown to display anti-HIV activities and its structure was solved by NMR (Ng, Yang et al. 2011). It confirmed the cysteine bonding (figure 7.1) previously assumed for the C2 peptide, based on an unrelated peptide with two CXXXC motifs, as also suggested for the MiAMP2 peptides. Overall, the sequence of C2 and MiAMP2 peptides is particularly conserved and they define a family of peptides buried in vicilin precursors with various defence-related functions. Additionally the C2 peptide and its homologues are proposed to be processed entirely or partially by AEP. PV100 is an AEP-matured polyprotein precursor producing a seed storage vicilin plus additional peptides; this makes PV100 excitingly similar to PawS1 and SFTI-1 biosynthesis. It suggests that, similarly to the emergence of SFTI-1 in an ancestral albumin precursor, a family of C2-type peptides could have evolved by neoproteinisation of an ancestral vicilin precursor. Vicilins constitute another family of low molecular weight seed storage proteins, also referred as 7S globulins. They vary greatly in sequence and structure to other seed storage proteins and are the subject of heavy post-translational modifications (Shewry, Napier et al. 1995), but their proteolytic maturation has been poorly described apart from PV100. Thus, it appears that seed storage protein precursors would be of particular interest to look for neoproteinisation events. Additionally, the MiAMP2 precursor contains four repeats of double CXXXC motifs, including three that have been characterised at the protein level, and PV100 contains two double CXXXC motifs, C1 and C2, but only C2 was characterised at the protein level. This strongly suggests internal expansion in the N-terminal part of the vicilin precursors in addition to neoproteinisation. Finally, the phylogenetic distance between Cucurbita maxima and Macadamia integrifolia, where C2 and MiAMP2 were respectively discovered, suggests that this family of doubleCXXXC-motif peptides is widespread and constitute a much more ancient 105 neoproteinisation event than SFTI-1, which is confined to the Asteroideae subfamily of the Asterales plant order (ca 42 Mya, Jayasena et al. unpublished). To roughly date the neoproteinisation of vicilin precursors, I investigated published datasets to identify homologues of PV100 and MiAMP2 precursors in the entire plant kingdom. I generated an alignment of most known vicilin precursors described so far. I characterised different classes of vicilin precursors, and in particular one that contains additional buried peptides similar to these present in PV100 and MiAMP2 protein precursor. This class of vicilin precursor contain several repeats of conserved CXXXC(10-14)CXXXC motifs widespread in the plant kingdom, confirming that the C2 peptides and its homologues represent a large and ancient family of peptides buried in vicilin precursors. Then, using mass spectrometry, I identified several masses for potential C2 peptide homologues in close relatives. Although more data will be necessary to establish the detailed evolutionary story of these vicilin buried peptides, they represent a second family of buried peptides that have evolved by neoproteinisation. 106 7.2 Material and methods 7.2.1 Vicilin alignment NCBI and UniProtKB databases were searched for protein sequences annotated as vicilin and for protein sequences similar to PV100. All results were gathered (983 unique entries) and only unique and full ORF sequences (452 sequences) were retained. Then, these sequences were further filtered to eliminate, in a given species, multiple entries of the same sequences with minor variations possibly due either to sequencing errors or recent gene duplication. Such sequences would bring little information and could introduce a bias in the alignment. In this way, the sequence number was reduced to 291 (including PV100 and MiAMP2 precursors). These sequences were aligned with CLC Genomics (CLCbio) and additional modifications to the alignment were made manually to correct obvious alignment errors creating unnecessary large gaps when only one or a few amino acid residues from one protein precursor would not align with the 290 other protein precursors. 7.2.2 Seed extract fractions preparation Seeds of all plant species studied in this study were frozen in liquid nitrogen and ground to a fine powder with a mortar and pestle. The meal was covered with an excess of hexane, mixed for 5 min before centrifugation for 5 min at 20000 x g at room temperature, and the upper hexane/oil phase discarded. The meal was covered with hexane a second time and the above process repeated. The now defatted meal was allowed to dry and then resuspended in 1:1 methanol:chloroform and mixed for 5 min. To separate the phases, the mixture was then diluted to 0.01% (v/v) trifluoroacetic acid with a 0.05% (v/v) trifluoroacetic acid solution and was then mixed for 2 min before centrifugation for 5 min at 20000 x g at room temperature. The aqueous phase was desiccated and resuspended in 5% (v/v) acetonitrile 0.1% (v/v) trifluoroacetic acid. Proteins were separated on a C18 column (Macrospin columns, 50-450 μL, The Nest Group Inc) with a 5% to 75% (v/v) acetonitrile linear gradient with 5% increments and fractions were collected at each elution points. After desiccation, these fractions were resuspended 50% (v/v) acetonitrile for 107 MALDI-TOF-MS analysis or in water for subsequent trypsin digestion as previously described (Mylne, Colgrave et al. 2011) and LC-MS/MS analysis. 7.2.3 MALDI-TOF-MS analysis The matrix used for MALDI-TOF-MS was prepared by adding an excess of α-cyano-4-hydroxycinnamic acid matrix (Fluka) to 90% (v/v) acetonitrile 0.1% (v/v) trifluoroacetic acid, followed by sonication in a water bath for 15 min and then centrifugation for 5 min at 10,000 x g. The supernatant was diluted 10-fold in acetonitrile to make the final matrix solution used in MALDI-TOF-MS. For each seed extract fractions, the previously resuspended fraction was mixed in equal proportion with the final matrix solution, and 2 μL of this mix were spotted onto MTP AnchorChip™ 384 plate. Dried spots were analysed with an UltraFlex III MALDI-TOF/TOF mass spectrometer (Bruker Daltonics) at 30% laser intensity. 7.2.4 LC/MS analysis For each sample, 8 µg of digested proteins were injected into an Agilent 6550 Q-TOF and processed as described in 5.2.5. Spectra were searched against the PV100 sequence using Mascot 2.3.02 (Matrix Science). 108 Figure 7.2. Alignment of 291 vicilin precursors. The result is presented in the coloured top alignment with a consensus schema of plant vicilin precursors above. The colour code is random with one colour specifically attributed to one letter. The three major groups are presented, with group I constituting the ancestral form, and two different events leading to the emergence of either group II or III. The group IV is an exception inside group III, see explanations in the text. The sequence of the characterised C2 and MiAMP2 (b, c, d) peptides are marked by a star. The bottom alignments 1 and 2 are a zoom-in of the boxes marked in the alignment. In these, shown in black and white, cysteine residues are highlighted in red to accent the CXXXC(10-14)CXXXC motifs. An additional zoom 3 is presented, in colour again to highlight the sequence homology and contain sequences supported by mass-spectrometry as shown by black boxes. The colour code is the same as the alignment above. Zoom 1 and 2 will be presented in larger resolution in the following part (7.3.2). 7.3 Results 7.3.1 Several classes of vicilin precursors As introduced above, I generated an alignment of 291 vicilin precursors published sequences. They represent data from 100 species, spread over 22 plant orders and covering most of the plant kingdom from ferns (Polypodiales) to angiosperms. On the alignment presented in figure 7.2, I have outlined what I see to be four classes of vicilin precursors, and they are represented in most families of higher plants. Class I vicilin precursors contains only an ER signal and the vicilin domain, which is highly conserved throughout the plant kingdom. The class I is likely to represent the ancestral form of vicilin precursors (see later comment). What I call the class II vicilin precursors have an additional C-terminal domain rich in arginine and glutamine residues. The class III vicilin precursors, including PV100 and MiAMP2 precursors, has two additional N-terminal domains, a cysteine-rich domain with conserved CXXXC(10-14)CXXXC motifs and a domain rich in arginine and glutamine residues. These two domains are constituted of repeats of the same motifs, and the number of repeats is variable between species and inside each species ranging from one to eight repeats. The characterised C2 peptide and three MiAMP2 peptides are marked. Their dispersed positions on the alignment suggest that potentially many other plants could also produce homologues of these peptides. Finally, the class IV vicilin precursors have only an additional N-terminal domain rich in arginine and glutamine residues, but this class appears restricted to legumes (Fabales). Class IV align within class III due to high sequence homology between their vicilin domains and these of other legume class III vicilin precursors. 109 Figure 7.3. Emergence of new classes of vicilin precursors. Results from the alignment of 291 vicilin precursors summarised on a recently published phylogenetic tree (Ruhfel, Gitzendanner et al. 2014). Species highlighted in blue were represented on the alignment, as well as species highlighted in red for which, additionally, evidence of CXXXC(10-14)CXXXC motifs peptides were found. The presence of the various classes of vicilin precursors is overlaid on a recently published phylogenetic tree of green plants (Viridiplantae) from algae to angiosperms (Ruhfel, Gitzendanner et al. 2014) and presented in figure 7.3. The plants orders represented in the alignment are highlighted on the phylogenetic tree. This figure suggests an ancient evolutionary story for vicilin precursors. The simple class I vicilin precursors contain only a vicilin domain and are the most broadly distributed among all plants, from Polypodiales to Brassicales (figure 7.3), which is consistent with a hypothesis that class I precursors represent the ancestral form of vicilin precursors. Then the two other classes of vicilin precursors, class II and III (figure 7.2) are represented in most flowering plants, from Amborellales to Malvales, but not in species from the order Brassicales, the last plant group on the phylogenetic tree (figure 7.3), that only have class I and II. Pinales, the non-flowering plant order the most closely related to flowering plant orders, only have class I vicilin precursors, which is supported by data from five different species, thus giving some relative confidence that the class II and of vicilin precursors emerged after the transition between Pinales and Amborellales. Interestingly, the transition between these two is unequivocally considered to constitute the emergence of the flowering plants. Brassicales contain the class II of vicilin precursor, but lack the class III. The absence of this class of precursors in Brassicales is supported by data from ten different species including several model species, which allow strong confidence to this statement. The class IV seems to be a variant inside class III. Indeed, these precursors are represented only in a few species of legumes (Fabales), which could suggests incomplete sequencing. However, the Fabales constituted the larger data set and presented the most diversity of vicilin precursors, and this variant class could also represent an intermediate evolutionary state that was fixed only in a few legumes. As said above, their sequence is highly similar to other class III vicilin precursor from the Fabales, apart for the CXXXC(10-14)CXXXC motifs. 110 7.3.2 Emergence of conserved CXXXC(10-14)CXXXC motifs in vicilin precursors From these results, an interesting evolutionary story for vicilin precursors can be proposed. Non-flowering plants possess only class I vicilin precursors, suggesting that class I represents the ancestral form of vicilin precursors, whereas flowering plants possess up to four different classes of vicilin precursors, including this ancestral form. It appears that the vicilin precursors diverged by an internal genetic expansion after the emergence of the flowering plants. This was most probably preceded by gene duplication, as many flowering plants species possess classes I, II and III vicilin precursors. This is consistent with the reports of whole genome duplication events shortly before the emergence flowering plants (Jiao, Wickett et al. 2011). Two different set of genetic events resulted in either the C-terminal or the N-terminal region of the ancestral class I vicilin precursors to expand, creating two new and distinct classes of vicilin precursors, classes II and III. These modifications seem to have been conserved in most flowering plant. We can note the exception of class III not being represented in Brassicales, suggesting that this class of precursor might have been lost after Brassicales emergence. We can also note, according to the different degree of sequence identity on the alignment, that the expansion on the C-terminal side of vicilin precursors is relatively conserved in size and sequence among flowering plants, suggesting that class II vicilin precursors haven’t dramatically diverged after the emergence of flowering plants, whereas the expansion on the N-terminal side of vicilin precursors, which gave birth to variable numbers of CXXXC(10-14)CXXXC conserved motifs between flowering plant species, suggests that class III vicilin precursors continued to diverge after the emergence of flowering plants. These CXXXC(10-14)CXXXC conserved motifs appear to be functional as confirmed by the characterised C2 and three MiAMP2 peptides. The broad representation of class III vicilin precursors among flowering plants suggests that potentially many other plants could also produce vicilin-buried peptides. 111 Populus trichocarpa Populus euphratica Ricinus communis Jatropha curcas Ricinus communis Jatropha curcas Vitis vinifera Vitis vinifera Solanum tuberosum Solanum lycopersicum Nicotiana sylvestris Nicotiana tomentosiformis Nicotiana tomentosiformis Nicotiana tomentosiformis Nicotiana sylvestris Solanum lycopersicum Solanum tuberosum Solanum lycopersicum Erythranthe guttata Erythranthe guttata Sesamum indicum Erythranthe guttata Sesamum indicum Beta vulgaris Nelumbo nucifera Macadamia integrifolia Oryza glaberrima Oryza sativa Oryza sativa Oryza glumipatula Oryza meridionalis Oryza sativa Oryza brachyantha Leersia perrieri Zea mays Zea mays Zea mays Zea mays Sorghum bicolor Setaria italica Zea mays Triticum aestivum Triticum aestivum Triticum aestivum Triticum aestivum Triticum aestivum Triticum urartu Hordeum vulgare Musa acuminata Musa acuminata Musa acuminata Musa acuminata Musa acuminata Musa acuminata Elaeis guineensis Elaeis guineensis Phoenix dactylifera Elaeis guineensis Elaeis guineensis Amborella trichopoda FMVDPAKKKPGQCLEECQRQ EGGKQKSLCRLRCQEKYEREPGRE EEGNM EEKE E FTVDPAKKKLGQCLEECQRQ AGGEQKSLCSLRCQEKYERGPGRE EVGNM EEEE E STTDP EKRLRECQRQCERQ EG QQRTLCRRRCQESYERERERE EEGGR GEREHGRE AERRVGQCQSQCERQ KG EERALCRFRCQQKFGRDEGEH SWG EEK EKSDDPRKQYERCLEICERQEGR QKQQCKRRCYTQYEEQQKEWEEREHGGGGGNSE TET SEDPRVQYERCQELCDRQPER QKSQCRGRCERQFEEKQKEREERGSD RGT NTDPEKQLQQCQKQCER Q REQQKEQCRRQCQDTYERQRGEEGKGGGGQSGEKDE QGRQ VDPEKQLQQCRKQCELLQ PGRQREQCHQECQEKYDQQQLEEG ESDLKQCKHQCKVLQQIEEY QREKCYEICEKLYHGEKQEKEE INF YEGRE ESDLKQCRHHCKVLQHIEEY QREKCYDICEKLYHSEKQEKEEETNS YEGRE DPELKQCKHQCKALQQIGEH QRKECYEMCER YHSEKQSKEEDTSTFL PYRGRE DPELKRCKHQCKALQQIGEH QRKECYEMCEK YDSEKQSKEEDTSTFF PYRGKE NQGRDPQRIFRECQQRCQQQEQ GRQQQRQCQQRCEEEFRREQEQ QRR GQETGEERENPQ NQGRDPQRIFRECQQRCQQQEQ GRQQQRQCQQRCEEEFRREQEQ QRR GQETGEERENPQ NQGRDPQRRFRVCQQRCQQQEQ GRQQQRRCQQRCEEEFRREQEQ QGR GQETGEERENPQ RLQECQRRCQSEQQG Q RLQECQQRCQQEYQRE K GQHQGET NPQ KFQECQRRCQSEEQG Q QLQECQQRCQQEYQRE K GQ QVET NPE SQCDDQNPQGQHPEE KFRECQQHCERQEQ PEQEYQECRQECRRQSREGGSR QERCEERCQEQRREREREQPGR REGGGGMYMYEGRE PEQEYQECRQECRRQSREGGSRRGTS QERCEERCQEQRREREREQPGRREGGGNM YEGR PEKQYQQCRLQCRRQGEGGGFSR EHCERRREEKYREQQGR EGGRGEM YEGRE REDEPAAQRFKQCQSRCGKQEQG Q QRQYCQQKCQWEYEKQK REEEQQHGGGRGGGDPTN RRDPE QQYKQCQSRCAREERG E QRQYCQQKCQWEYERQK REQGREQGGGGGS TN APRRDPEKAYRECRERCQEVEEG RREQQVCESECEQRRER IRRDDEIVGRRSV CQQLCDQ GRGEQ EKQLCRQGCEQRHRQQQKEQD ERQREHC NKRDPQQREYEDCRRRCEQQ E PRQQHQCQLRCREQQRQHGRGGDMMN RRRETSLRRCLQRCE QDRPPYERARCVQECKDQQ QQQ QERRREHGGHDDDRRD RDR RRRETSLRRCLQRCE QDRPPYERARCVQECKDQQ QQQ QERRREHGGHDDDRRD RDR RRRETSLRRCLQRCE QDRPPYERARCVQECKDQQ QQQ QERRREHGGHDDDRRD RDR RRRETSLRRCLQRCE QDRPPYERARCVQECKDQQ QQQ QERRREHGGHDDDRRD RDR RRRETSLRRCLQRCE EDRPLYERARCVQECKDQQ QQQ QERRREHGRHDDDRRD RDR RRRETSLRRCLQRCE QDRPPYERARCVQECKDQQ QQQ QERRREHGGHDDDRRD RDR RRGETSLGRCLQRCE EDRPRYERARCVQECKEQQ QQQ EQERRREHGRHDDDRNG RDR RRGETSLGRCMQRCE EDRPSYERARCLQQCKEQQRQQQQ EEERRREHGRHDDDRSSSRDR NHHHHGGHKSGQCVRRCE DRPWHQRPRCLEQCREEEREKRQ ERS RHEAD DR NHHHHGGHKSGQCVRRCE DRPWHQRPRCLEQCREEEREKRQ ERS RHEAD DR NHHHHGGHKSGRCVRRCE DRPWHQRPRCLEQCREEEREKRQ ERS RHEAD DR NHHHHGGHKSGRCVRRCE DRPWHQRPRCLEQCREEEREKRQ ERS RHEAD DR HDGH RCARRCE DRPWHQRARCVEQCREEEERERQ QQEERGRGDRHE H DR REPWRCARRCE DRPRHERAQCVQECREEERERGR RDELGRRG DR NHHHHGGHKSGRCVRRCE DRPWHQRPRCLEQCREEEREKRQ ERS RHEAD DR EEDRRGGHSLQQCVQRCQ QDRPRYSHARCVQECREDQQQHGRHEQEEQ EEDRRGGRSLQQCVQRCH QDRPRYSHARCVQECRDEQQQHGRHEQEEQ EDDRRGGHSLQQCVQRCQ QDRPRYSHARCVQECRDDQQQHGRHEQEEQ EEDRRGGRSLQRCVQRCQ QDRPRYSHARCVQECRDDQQQHGRHEQEEQ EEDRRGGRSLQQCVQRCQ Q DRPRYSHARCVQECRDDQQQHGRHEQEEQ EEDRRGGRSLQQCVQRCQ QDRPRYSHAR EDDRRGGHSLQQCVQRCR QERPRYSHARCVQECRDDQQQHGRHEQEEEQGRGRGWHGEGEREEE DPEKKR CAMECRGIPE QQQRKLCVHRCLDDSSEQESG ECRGIPE QQQRKLCVHRCLDHSGEQESG DPEKKR CVMECRGIPE QQQRKLC DPEKTR CVMECRGTSE RE AGREQEREEEAEG DRAKKR CHMECRGTPE GRRRKECVRQCLDHSGREREHGEVAEG H EDPERRYRQCRQECTQRAQD KRQLQQCEHRCEEEYKEQRHREGN EEPEKRLEECRRECREQAE RRERRECEKRCEEEYKE HRGRSKDKEEGEEGRGEKRRESD EEPEKRLEECRRECREQAE RRERRECEKRCEEEYKE HRGRSKDKEEGEEGRGEKRRESD HKGEDPEKRLEECRRECREQAEGRE QRECEKRCEEECKE RRGESKEEEKREEEKGEKRRGSD GKREDPRKQLEECRRECQEKQEGRR QQRECEQRCEEEYRK HRGGNKEEEEWEEGQEKKRREGH RKVEECLSKCKESFFYHQDRQLLEKCEQICFDKYDGSH ERERQERQCKSECERSREREREVR QCKQECEERYRRQK Figure 7.4. Details of zoom 1 in the alignment on figure 7.2. Colour scheme is identical to figure 7.2 and cysteine residues are highlighted in red. Class III vicilin precursors appear to have two distinct domains. One domain is particularly rich in arginine and glutamine residues, thus sharing some similarities with the Arg/Glu-rich C-terminal domain of class II vicilin precursors. The second domain contains the repeated CXXXC(10-14)CXXXC motifs. This expansion event gave rise to PV100 and MiAMP2-type precursors. However, the two N-terminal domains seem to be linked, especially in early emerging flowering species, as it can be seen on the zoom of the alignment, presented right after (figure 7.4). In figure 7.4, are shown the sequences from “zoom 1” of the alignment, displaying the conserved CXXXC(10-14)CXXXC motifs typical from the C2 peptide, and potentially several homologues. These sequences are of plants from the orders Amborellales to Malpighiales (from bottom to top), that represent the earliest emerged flowering plant orders from my data-set. The sequence of the MiAMP2d is marked by highlighting the name Macadamia integrifolia. First, the double CXXXC motifs as well as their relative distance (10-14 amino acid residues) are highly conserved among all species of displayed. However, the rest of the sequence is poorly conserved between species, but conserved within a species. This suggests that although the expansion that gave rise to the CXXXC(10-14)CXXXC motifs is conserved, this expansion has subsequently diverged within plant species. This is also supported by the various numbers of repeats of the motifs inside the domains. Another interesting observation from figure 7.4 is that CXXXC(10-14)CXXXC motifs are still quite rich in arginine and glutamine. We previously observed that in legumes a class IV of vicilin precursors contain only the N-terminal domain rich in arginine and glutamine residues. Thus, it seems that the expansion on the N-terminal side of vicilin precursors, that gave rise to PV100 and other class III vicilin precursors, first expanded and incorporated several additional arginine and glutamine residues in particular (probably similarly than the C-terminal side of class II). Then, whereas in class II the C-terminal expansion was fixed and relatively conserved among species, in class III the N-terminal expansion continued to diverge and in most species ended in the emergence of the CXXXC(10-14)CXXXC motifs, most probably by neoproteinisation. 112 Gossypium raimondii Gossypium arboreum Gossypium arboreum Gossypium hirsutum Gossypium raimondii Gossypium raimondii Gossypium hirsutum Gossypium arboreum Theobroma cacao Anacardium occidentale Pistacia vera Citrus sinensis Citrus clementina Citrus clementina Eucalyptus grandis Eucalyptus grandis Pyrus x bretschneideri Malus domestica Pyrus x bretschneideri Prunus persica Prunus mume Prunus persica Prunus mume Fragaria vesca Morus notabilis Morus notabilis Ficus pumila Prunus persica Prunus mume Pyrus x bretschneideri Corylus avellana Carya illinoinensis Juglans regia Juglans nigra Cucurbita maxima Cucumis melo Cucumis sativus Cucumis melo Cucumis sativus Phaseolus vulgaris Pisum sativum Vicia faba Glycine max Glycine max Glycine max Glycine max Glycine max Medicago truncatula Cicer arietinum Arachis hypogaea Arachis duranensis Arachis hypogaea Arachis hypogaea Arachis hypogaea Glycine soja Glycine max Glycine max Glycine max Glycine max Glycine soja Phaseolus vulgaris Cicer arietinum Cicer arietinum Cicer arietinum HPGGE LSECQSRC V GLAGEVRQLCLFRCQQKWRRDYASRWEEEGKK QSKE HPGDG LSECRSRC A GLAGEVRQLCLFRCQQKWRRDYASRWEEEGKK QSKE SQRQFQECQQHCHQQEQ RPERKQQCVRECRERYQENPWRREREE SQRQFQECQQHCHQQEQ RPERKQQCVRECRERYQENPWRREREE SQRQFQECQQHCHQQEQ RPEKKQQCVRECREKYQENPWRGEREE PDKRFKECQQRCQWQEQ RPERKQQCVKECREQYQEDPWKGERENKW PDKQFKECQQRCQWQEQ RPERKQQCVKECREQYQEDPWKGERENKW PDKRFKECQHRCQRQEQ GPERKQQCVKECREQYQEDPGKGERENKW LQRQYQQCQGRCQEQQQ GQREQQQCQRKCWEQYKEQE RGEHEN Y STHEPAEKHLSQCMRQCERQE GGQQKQLCRFRCQERYKKER GQHNYK REDD STHEPGEKRLSQCMKQCERQD GGQQKQLCRFRCQEKYKKER REHSYS RDEE NPGEPAEKHLRQCQRECDRQEE GGQQRALCRFRCQEKYRREKEGEGGQHN T QEEE NPGEPAEKHLRQCQRECDRQEE GGQQRALCRFRCQEKYRREKEGEGGQHN T QEEE NHHRDPKWQHEQCLKQCERRES GEQQQQQCKSWCEKHRQKGQRRREKEGKFNPSSNW TTTEDPQQQFQHCQQRCERQQG REPQRRQCLQECERQFQEQQR GK RKQLKECQKQCERRQQ KGRSREGCQQRCVEAYGRKGEDQPRNRRGKEDQEKERGR DPELKQCKHQCKHQRGFEEW QKKQCERKC EDYIKQKREQEEQRG DPELKQCKHQCKHQRG RGNRRT SGGGGKEEEEEEEGG EGGSFYI DPELKQCQHQCKHQRGFDER QKQQCERKC EDYTGE DPELKQCRHQCEHQQGFDSK QREQCEQGC DKYIKQKREEEKHRR KSEGGGS DPELKQCRHQCEHQQGFDSK QREQCEQGC DKYIKQKREEEKHRG KSGGGGS DHPQDAREEYFYCSQSCGTSEEPEQ CETECRERFDEQLKKEA DHPQDAREEYFYCSQSCGTSEDPEQ CETECRKRFDEQLRKEA DPGLRQCMHQCQHQEGYDEQ QKQICEQRC DDYIKQKRQEERQRR RGGGDTS DPELKQCRHQCKHQRGFEEE QRQQCEQRC EEYIREKKHREREEE G TASRDPEEKFRRCQQRCQEQQHV QERRRECQQECERKYREEERR ERRRRDDDDD E TIHDDPEEKLRRCQKRCQEQQHV QERRRECEQECQKKYREEEHR GGRRRDGDEDVDD QQFEHEKHQYKQCKRGCENQARDD FQKEQCKQQCTQQMSQQYGQQQEGSTGGGGLNQ QQFEHEKHQYKQCKRGCEKQALDDD QKEQCKQQCTQQMSQQYGQQQEGSTGGGGPNQ KQFEQEKQQFKQCKKGCEKQAQDDD QKKQCKLQCAQQGQQQQG GSTGGVHLNQY DPELKKCKHKCRDERQFDEQ QRRDGKQIC EEKARERQQEEGNSSEESYGKEQ QDPQQQYHRCQRRCQTQEQS PERQRQCQQRCERQYKEQQGR EWG PDQASPRR QDPQQQYHRCQRRCQIQEQS PERQRQCQQRCERQYKEQQGR ERG PE ASPRR QDPQQQYHRCQRRCQIQEQS PERQRQCQQRCERQYKEQQGR ERG PE ASPRR NQRGSPRAEYEVCRLRCQVAER GVEQQRKCEQVCEERLREREQGRGEDVDEVERRDPEWER NQE VTEICRQWCQVMEPHGGEQQQRRCRQECEERLRDQKQG EDVED KWRDPERER NQK ETEICRQWCQVMKPQGGEEQ RRCQQECEERLRDQEQG EDVED KWRDPERER DPELKQCKHQCKVQRQFDEQ QKRDCERSCDEYYKMKKEKGRN DPELKQCKHQCKVQRQFDEQ QKRDCERSCDEYYKMKKEKGRN TKVVG DPELKTCKHQCLQQQKYSED DKLICLRSCDEYHGLKHEREKQIEESSEKQIHEHEG EKDPELTTCKDQCDMQRQYDEE DKRICMERCDDYIKKKQERQKHKEHEEEEEQEQEED EKDPELTTCKDKCDLQGQYDEE DKRICMEKCEDYVRKKQERQKHKEHEKEEHEEEEEN TEVEE DPELVTCKHQCQQQRQYTES DKRTCLQQCDS MKQEREKQVEEETRE TEVEE DPELVTCKHQCQQQRQYTES DKRTCLQQCDS MKQEREKQVEE ETRE TEVEE DPELVTCKHQCQQQRQYTES DKRTCLQQCDS MKQEREKQVEE ETRE TEVEEEDPELVTCKHQCQQPQQYTEG DKRVCLQRCDRYHRMKQEREKQIQESRERETRE TEVEEEDPELVTCKHQCQQQQQYTEG DKRVCLQSCDRYHRMKQEREKQIQESRERETRE ETDPELKTCIHQCKQQRQYDE DKGICMDKCEDYHRMKQEREKRQHQ ERDPEIKTCKHQCRQQLQYSEE DKNVCMEECDEYHRMKKEREKRK Y RKTENPCAQRCLQSCQQE PDDLKQKACESRCTKLEYDPRCVYD TGATNQRHPPG Y RKTENPCAQRCLQSCQQE PDDLKQKACESRCTKLEYDPRCVYD TGATNQRHPPG Y RKTENPCAQRCLQSCQQE PDDLKQKACESRCTKLEYDPRCVYD TGATNQRHPPG YQKKTENPCAQRCLQSCQQE PDDLKQKACESRCTKLEYDPRCVYDPRGHTGTTNQRSPPG YQKKTENPCAQRCLQSCQQE PDDLKQKACESRCTKLEYDPRCVYDPRGHTGTTNQRSPPG WEKENPKHNKCLQSCNSE RDSYRNQACHARCNLLKV EKEE CEEGEIPRPR WEKENPKHNKCLQSCNSE RDSYRNQACHARCNLLKV EKEE CEEGEIPRPR WEKQNPKHNKCLQSCNSE RDSYRNQACHARCNLLKV EKEEECEEGEIPRPR WEKQNPSHNKCLRSCNSE KDSYRNQACHARCNLLKV EEEEECEEGQIPRPR QVKPPSHHECLQSCESE TDPYKRKACKLSCKFVPEHGEQQHEEEEREREHPQ WEKQNPSHNKCLRSCNSE KDSYRNQACHARCNLLKV EEEEECEEGQIPRPR TEKP SRRKCVKICESE KDLSRGQTCQLRCKHVSEPGGK EEEEREIPE QGNPCIEKCLQRYNKVHETFKHHECRSHCEKEEEEE HQGNPCLKKCMQRYNKVNEAFK RREEEE QYEQGNPCLKKCMQRYNKVNEAFK RREEEE Figure 7.5. Details of zoom 2 in the alignment on figure 7.2. Colours are identical to figure 7.2, and cysteine residues are highlighted in red. In Fabales only, some of these class III precursors seem to have not led to the emergence of the CXXXC(10-14)CXXXC motifs or they could also have further evolved and even regressed to lose the N-terminal cysteine-rich domain and only maintain the N-terminal domain rich in arginine and glutamine residues. This constitutes class IV vicilin precursors. From Fabales to Malvales, the potential homologues of the C2 peptide, all containing a CXXXC(10-14)CXXXC motifs, display much more sequence conservation between species (figure 7.5). On the figure 7.5, pumpkin (Cucurbita maxima) is highlighted. The corresponding precursor fragment displayed corresponds to the PV100 fragment from which C2 emerges. All the other sequences on this figure, as well as the ones from the figure 7.4, and also from the high-copy-number of multiCXXXC(10-14)CXXXC-motifs from some precursors that can be seen on the alignment figure 7.2, constitute potential homologues of the C2 and MiAMP2 peptides. 113 7.3.3 Did CXXXC(10-14)CXXXC motifs arose by neoproteinisation? Overall, the expansion within the vicilin precursors that created class III vicilin precursors is consistent with neoproteinisation. This example potentially constitutes a much earlier event of neoproteinisation than the event that created the PawS1 precursor and accordingly SFTI-1, evidenced by class III vicilin precursors being distributed widely within the plant kingdom. Among class III vicilin precursors, that possess the CXXXC(10-14)CXXXC motifs, PV100 has been shown to produce multiple stable peptides with potent bioactivities from its extra N-terminal domain, including the C2 peptide exhibiting CXXXC motifs and trypsin inhibitory activity. The C2 peptide from PV100 seems to have homology with many other plant vicilin precursors. Homologues have already been charactered in a close pumpkin relative (Luffin P1) and in the more phylogenetically distant species Macadamia integrifolia (MiAMP2). All these potential homologues share a lot of sequence similarities, and particularly the CXXXC(10-14)CXXXC motifs, which are most likely to be structurally important and potentially constitute the active sites, are very highly conserved. Such a broad conservation of this cysteine-rich domain among the plant kingdom, with the highly conserved CXXXC(10-14)CXXXC motifs in particular, suggests that these peptides would be produced by the plants and functional. Thus, to provide further evidence of this potential new family of buried peptide I searched many of the plants species containing the class III vicilin precursors for the presence of CXXXC(10-14)CXXXC motifs peptides in their seeds. 114 Figure 7.6. MALDI-TOF-MS and LC-MS/MS evidence for C2. Soluble proteins from pumpkin seeds (Cucurbita maxima) were extracted and fractionated. (a) The mass spectra acquired by MALDI-TOF-MS of the fraction (25% acetonitrile elution point) containing both forms of the C2 peptide is presented, with C2a (5615 Daltons) and C2b (5829 Daltons). (b) Each fraction was then reduced, alkylated and digested with trypsin before being loaded in a LC-MS/MS. Resulting spectra were searched against a pumpkin protein sequences database using Mascot 2.3.02 (Matrix Science). Results for the 25% acetonitrile elution point of the pumpkin soluble protein extract are presented here. 7.3.4 Identification of CXXXC(10-14)CXXXC peptides To validate potential C2 homologues, I first investigated methods for extracting and characterise them, using the C2 peptide as a model, making the assumption that C2 homologues should have the same or at least similar characteristics to C2 in terms of extraction methods. Using MALDI-TOF-MS for detection, I developed a sample preparation to obtain a good signal for both forms of the C2 peptide (figure 7.6a), as it possesses two possible C-terminal cleavage sites. The method that I previously used to successfully extract SFTI-1, only allowed a poor signal for the targeted masses in this case. I found that I required a second purification step, based on solvent gradient separation on a C18 column to separate the different components of the samples into several fractions. Doing so I could obtain a fraction where I could detect, with a relatively good signal, masses corresponding to the theoretical average masses of the two forms of the C2 peptides and also corresponding to the masses previously observed (Yamada, Shimada et al. 1999). Then I tried to obtain the sequence of the identified masses. It appeared that direct MS/MS on the MALDI-TOF was not conclusive, probably due to the native form of the peptide. However, any attempt to alter the structure of the targeted peptide resulted in very complex mass spectra above analysing, despite the apparent relative purity of the sample from the MS spectra obtained on the MALDI. Thus, it appeared that the samples I used for MALDI based detection were still quite complex and contained many proteins not detectable in their native form. As an alternative, this same protein extract fraction was reduced, alkylated and digested with trypsin and then, using LC-MS/MS, searched for fragments corresponding to C2 digested fragments (figure 7.6b). I was able to resolve most of the C2 peptide sequence using this method. 115 Although I was able to detect C2 fragments, I could also detect some fragments from the rest of the vicilin precursor (as well as many other seed protein) in the same sample, making it difficult to be certain the detected fragments were coming from the precursor or the matured form. Thus, I could not confirm by this method the existence of the peptides as independent entities. This second method unfortunately could not be used on native peptide of this size (about 5 kDa) due to the particular instrument settings. This machine was set to work with small tryptic fragment for higher resolution, to the detriment of the massrange detection, but this is the most common proteomic approach nowadays. This is routinely used in my research centre. Adjusting such settings would have required attributing one machine to my project only for months, which was not an option due to the limited number of such expensive equipment and the high number of users in our research centre. At this point, I had two different methods to efficiently detect my peptides of interest. The first one allowed detection of native peptide mass, but didn’t allow to sequence the detected masses. The second one allowed to sequence only fragments of the targeted peptides, but not their native forms, thus requiring prior purification to ascertain the origin of the detected fragments not being from the larger protein precursors. I investigated other extraction and purification methods, and used these detection methods to confirm the presence of the targeted peptides in the samples. However, all the other purification or extraction methods that I investigated resulted in the loss of the targeted peptide, or at least it could not be detected anymore by the methods previously described. For these reasons, only MALDI-TOF-MS based detection was used for the identification of potential C2 homologues in other species. 116 Figure 7.7. C2 homologues in cucurbits. (a) Sequence and theoretical masses of C2 and homologues from cucumber (Cucumis sativus) and melon (Cucumis melo). The calculated masses account for the presence of two disulfide bonds and a post-translational modification changing the first glutamine at the N-term into pyroglutamate, as described for C2 (Yamada, Shimada et al. 1999). Cysteine residues are highlighted in red and putative AEP cleavage site are highlighted in yellow. Colour code is identical to figure 7.1. (b) MALDI-TOF-MS analysis of cucumber (Cucumis sativus) and melon (Cucumis melo) fractionated seed extracts. The mass spectra of the cucumber seed extract fraction containing C2s (4317 Daltons) and of the melon seed extract fraction containing C2m (4453 Daltons) are presented. 7.3.5 Identification of C2 homologues from closely related species In the vicilin alignment (figure 7.2) two homologues of PV100 were identified in the close pumpkin relatives melon (Cucumis melo) and cucumber (Cucumis sativa), which are other members of the plant order Cucurbitales that pumpkin (Cucurbita maxima) belongs to. These PV100 homologues also contain sequences for C2 peptide homologues, which were named C2m for the C2 peptide homologue from Cucumis melo and C2s for the C2 peptide homologue from Cucumis sativus. The sequences of C2 and of the potential C2m and C2s peptides, as well as their theoretical masses are presented in figure 7.7a. Using the MALDI-TOF-MS based method developed for the C2 peptide detection, I identified a mass in a Cucumis sativus seed extract fraction, corresponding to C2s (figure 7.7b), and another mass in Cucumis melo fractionated seed extract, corresponding to C2m (figure 7.7b). The masses identified correspond to the theoretical masses predicted (with less than 0.5 Da difference), allowing strong confidence that they correspond to the C2 peptide homologues. 117 Figure 7.8. C2 homologues in legumes (Fabales). The calculated masses account for the presence of two disulfide bonds. Cysteine residues are highlighted in red and putative AEP cleavage site are highlighted in yellow. Colour code is identical to figure 7.1. The mass spectrum of (a) the broad bean (Vicia faba) fraction containing C2vf (5954 Daltons), (b) of the common bean (Phaseolus vulgaris) fraction containing C2ps (5943 Daltons) and of (c) the pea (Pisum sativum) fraction containing C2pv (5440 Daltons) are presented. 7.3.6 More widespread evidence for C2-like peptides I similarly investigated seeds of many other plant species, represented in the alignment in figure 7.2, for presence of potential C2 homologues. In three species, all from the Fabales order, I found masses corresponding to theoretical sequences of C2 homologues (figure 7.8). The masses identified suggest trimming of a few amino acid residues on the C-terminal sides of the peptides as described for the other peptides produced from PV100 (Yamada, Shimada et al. 1999) and for the MiAMP2 peptides (Marcus, Green et al. 1999). The sequence of these homologues, extracted from the alignment in figure 7.2, as well as their theoretical mature sequences and masses are presented in figure 7.8. Only the mass identified in broad bean (Vicia faba) seed extract is a perfect match with the theoretical mass for the corresponding C2 homologues called C2vf (figure 7.8). The slight mass difference (4 Daltons) between the mass detected in common bean (Phaseolus vulgaris) seed extract and the theoretical mass for the corresponding C2 homologues called C2pv, as well as the irregular peak of average mass of 5943.18 Daltons detected in pea (Pisum sativum) seed extract matching the theoretical mass for the corresponding C2 homologues called C2ps (figure 7.8), suggest that these peptides might be subjected more post-translational modifications. Unfortunately, as it can be seen on the generated spectra, I almost reach the detection limits of the instrument. Indeed, the resolution on the generated spectra is quite poor, as shown by the large basis of the peaks that usually covers over 200 Da and by the lack of isotopic resolution (cannot been seen on the presented images), which was not a problem at all when I was working on smaller masses, such as the work done with acyclic SFTI-1 (1531 Daltons). This method already presented limitations due to the fact that the precise cleavage sites of the potential C2 homologues are unknown. For some of them, it was predictable that they could also be matured by AEP such as C2, and when potential AEP cleavage sites were present in the precursor sequence, such assumption was made as in C2 homologues from other Cucurbitaceae for example. These close relatives already harbor an additional AEP cleavage site prior to the two other conserved as in C2, and only this shorter form was detected. Other C2 peptides relatives further diverge, such as the ones from 118 Phaseolus vulgaris, Pisum sativum and Vicia faba. Due to the absence of potential AEP cleavage site or to a multitude of them in their precursors, no clear assumption could be made and only a few additional eventual homologues were then identified with relative confidence. However, the fact that no homologues of the C2 peptide were identified in many of the investigated species doesn’t mean they do not exist, but are rather due to limitations of the method. For example the C2-like peptide masses from macadamia nuts couldn’t be detected (not shown). This could be due to the particular composition of these seeds not being compatible with my extraction method. This could also have been the case for more seeds than expected and thus could explain the identification of potential C2 homologues in a limited number of tested species only. For these reasons other extraction and/or identification methods should be investigated to resolve this problem. This is when my PhD reached its end and I couldn’t further investigate this very interesting topic, which will be continued in my research group. 119 7.4 Conclusion In this study I described three main classes of vicilin precursors suggesting that expansion on both the N- and C-terminal sides of the vicilin domain within vicilin precursors has been extensive. During the emergence of flowering plants, which was preceded by whole genome duplication events (Jiao, Wickett et al. 2011), I believe either the N- or C- termini of some vicilin precursors expanded, seemingly independently as no vicilin precursor has both an N- and C-terminal expansion. The C-terminal side of some of the duplicates appears to have rapidly accumulated additional amino acid residues during evolution and was also rapidly fixed in the plant kingdom. This is demonstrated by the overall homogeneity of class II vicilin precursors in most flowering plants. By contrast the expansion on the N-terminal side vicilin precursors appears to have evolved more gradually. This gave rise to class III vicilin precursors which particularly contain a variable number of CXXXC(10-14)CXXXC motifs at their N-terminal side. The class IV vicilin precursors suggest an intermediary evolution state, which would have been fixed in a few legumes only. The CXXXC(10-14)CXXXC motif appears to have evolved through the gradual expansion of vicilin precursors N-termini, similarly to SFTI-1 that evolved through the gradual expansion of PawS1 albumin precursor. For this reason, the emergence of the CXXXC(10-14)CXXXC motif suggests neoproteinisation. However to support that these CXXXC(10-14)CXXXC motifs correspond to additional peptides processed from the class III vicilin precursors, it is essential to isolate stable CXXXC(10-14)CXXXC peptides that correspond to precursor sequences. In this study I investigated several plant species and found evidence for homologues of the previously characterised C2 peptide from pumpkin, in the pumpkin relatives Cucumis melo and Cucumis sativa, and in seeds from the more distantly related Fabales plant order, Phaseolus vulgaris, Pisum sativum and Vicia faba. The C2 peptide and its homologues represent a large and ancient family of peptides buried in vicilin precursors that remains to be fully characterised. Although more data will be necessary to establish the evolutionary story of these vicilin buried peptides, they represent a potential second family of buried peptides that would have evolved by neoproteinisation. 120 Chapter 8 General discussion and future directions 121 8 General discussion and future directions 8.1 Seed storage protein trafficking During my PhD candidature, I investigated the potential of the protein precursor PawS1 to be used as a tool to study seed storage protein maturation. The trafficking of protein precursors is of particular importance for their maturation. My investigations, presented in Chapter 4, revealed that deregulation of trafficking pathways can have an effect on the quantity of total seed storage precursor being produced. Previous work on trafficking mutants had usually been targeted at one or two mutants and used protein gels standardised for total protein content to show that there was mis-processing of seed storage protein precursors in the mutant seeds. These studies did not provide data on the proportion of mature seed storage protein present in the mutant seeds. My results suggest that although trafficking mutants all accumulate mis-processed seed storage protein precursors, the location of mis-processed seed storage protein precursors would correlate to different level of correctly matured seed storage proteins. Trafficking mutants that accumulate mis-processed seed storage protein precursors inside their cells would over produce or still produce normal quantities of correctly matured seed storage proteins, whereas trafficking mutants that export mis-processed seed storage protein precursors outside their cells would produce less correctly matured seed storage proteins in total. I propose the existence of an unknown mechanism that senses the accumulation of mis-processed seed storage protein precursor inside the cell and triggers an overproduction of seed storage protein precursor to compensate for their mis-processing. This suggests that the regulation of protein maturation and gene expression could be linked. This hypothesis could be validated by studying gene expression of seed storage protein precursors during seed development of mutants impaired in seed storage protein trafficking. This project will probably lead to further collaborations with Ikuko HaraNishimura’s lab, which identified the mutants that I used in my study. 122 8.2 SFTI-1 maturation In Chapter 5, I used several chimeric constructs to provide more evidence of AEP requirement for SFTI-1 maturation and demonstrated the efficiency of the SFTI-GLDN sequence at being processed by AEP in various protein contexts. My results suggested that SFTI-1 can be processed without an adjacent protein. For the synthetic constructs without adjacent protein, the production of acyclic SFTI-1 was 5% of what it was for the wild type protein with an adjacent albumin. However, this might be due to the absence of targeting sequence for proper trafficking. Indeed, it is known that seed storage proteins have a vacuolartargeting sequence in their C-termini (recognise by VSR1). Additionally seed storage protein precursors are matured during their traffic between the Golgi apparatus and the vacuole. Thus, it is quite possible that the low production of acyclic SFTI-1 from the albumin-depleted PawS1 variants is only due to inefficient targeting. This could be tested in my research group with a further set of chimeric constructs. Nevertheless, it appears that SFTI-1 could be matured from its precursor without an adjacent protein. This is particularly interesting in regards to the current knowledge of cyclic peptide evolution. This result suggests that the precursor for SFTI-1 (PawS1) could evolve to lose its albumin domain and still be matured into SFTI-1. Parallel studies in my research group demonstrated that the contribution of PawS1 to the overall albumin content of sunflower seed was minor (Jayasena, Franke et al. 2016), and thus there is no particular evolutionary constraint into keeping this particular albumin. The current analysis of more than 100 transcriptomes of sunflower close relatives in my research group (Jayasena et al. unpublished) also suggests that genes that encode such truncated precursors exist and are transcribed in some species, but to date no protein level evidence have been produced for these precursors. The parallel with cyclotides is astonishing. Cyclotides are another type of cyclic peptides also thought to be partially matured by AEP and generally encoded in precursors consisting of one to three cyclotides in tandem repeat. The existence of dual precursors has also been reported for cyclotides, in the species Clitoria ternatea. In these plants, cyclotides extracted from the leaves appeared to be encoded in dual precursors also encoding an albumin domain (different to the 123 napin type). Thus, although it hasn’t been supported by additional data yet, it can be hypothesised that cyclotides also could have evolved from a dual precursor and diverged to lose the adjacent ancestral protein. This subject is of particular interest to several research groups and should be resolved in the near future. 8.3 SFTI-1 evolution, neoproteinisation My research group is particularly interested in SFTI-1 evolution. Recent and current work, conducted in my group, attests that SFTI-1 evolved by an unusual mechanism. During my PhD candidature, my insight led to the realisation that the evolutionary mechanism that led to the creation of SFTI-1, previously decrypted in details by my colleagues mostly by the use of next generation sequencing, differs to previously described de novo evolutionary mechanisms. After a thorough comparison with these other mechanisms, I named this mechanism neoproteinisation. Next generation sequencing has already greatly helped change perceptions of evolution. For a long time it was considered that de novo gene evolution was almost impossible and probable examples were neglected rather than being studied. Nowadays, the evolution of de novo genes is emerging as an important source of new proteins in addition to the familiar ‘parental gene’ model and cases have now been reported in all living kingdoms (Tautz 2014). Neoproteinisation is another alternative to the traditional ‘parental gene’ model and I believe it has been overlooked, similarly than de novo gene evolution in the past. Indeed, my results suggest that this mechanism is particularly effective to create new functions and many other events might have occurred during evolution. Seed storage proteins were targets of choice for the identification of other neoproteinisation events and I consecutively identified another potential occurrence of neoproteinisation with the prototypic example being a vicilin precursor called PV100. Thus, in an attempt to elucidate this particular precursor evolution I started to study vicilin precursors more generally, which revealed quite an unexpected outcome. 124 8.4 Neoproteinisation of vicilin precursors I established that expansion has been rampant in vicilin precursors, and that neoproteinisation might have given rise to a second family of peptides different from SFTI-1 and possessing a conserved CXXXC(10-14)CXXXC motif. However, to prove that these peptides with CXXXC(10-14)CXXXC motifs emerge from the vicilin precursors as I expected, I needed to prove their existence as independent protein entities. Thus, I developed a sample preparation to clearly detect C2 and some of its homologues using MALDI-TOF-MS. Using this method, I found mass evidence for homologues of the previously characterised C2 peptide from pumpkin, in the pumpkin relatives Cucumis melo and Cucumis sativa, and in other seeds from the phylogenetically more distant Fabales plant order, Phaseolus vulgaris, Pisum sativum and Vicia faba. Additionally to my work, another homologue of the C2 peptide was fully characterised in a close pumpkin relative. My group is currently attempting to recover the gene encoding this peptide (Luffin P1). Thus, in addition to my results, it seems that C2 peptide homologues are common in Cucurbitaceae and maybe also in Fabaceae. Overall my results suggest that class III vicilin precursors could be a second example of neoproteinisation event. This class of vicilin precursor, which I discovered is represented in most flowering plant species and, contains additional buried peptides with a conserved CXXXC(10-14)CXXXC motif. So far the existence of the CXXXC(10-14)CXXXC containing peptides as independent mature entities had only been confirmed in two phylogenetically distant species (pumpkin and macadamia nuts), but I provide evidence for homologues in two more pumpkin relatives and three legumes. This work has created a desire to identify and characterise this new family of peptides. This potential second and more ancient event of neoproteinisation suggests that neoproteinisation could be a widespread mechanism for protein evolution. In this regard, this work constitutes the foundation studies for the characterisation of neoproteinisation. 125 Chapter 9 Supplementary information 126 9 Supplementary information 9.1 Gene sequences, relative to trafficking mutants (Chapter 4) MAG1 gggaaaacaaaggatcttccatgggcgatcgctgagaagatccgcaagaacaaaaacatcaaattccttga aatcaggaatcttattcgatagtctatttgagcccttgatctggtaagctctgtctgatcttctaaagtttcgctttttaa cttttggggaaatcatccacctttgtatcaagtttctggcaatttcttctctgtagttagaatcattcaataagcccttta atctcttcgacgcagctagtgttgaattacatctgggttttgtctactttgtgtcaaaaccatgcttttatgtttgattttcg cgaagtatgatttaggttcgattttagtgaaagaaagacttaacagttgtagttaactgaaactatgcgatatctga ttacgttggtgttatgattttattggaagggttttggaaatgattccacattgtattgtagatctcattgacttttggtttttg atctgtttctcaatccagattctttatactttagtttaacaagtgacaaactttgatctttgattggttcaccaacgtcttt gaaggttttttatttggtgaaagttttgaaATGGTGCTGGTATTGGCATTGGGGGATCTCCA TGTACCCCATAGAGCGGCTGATCTACCTCCTAAGTTCAAATCAATGCTTGT TCCTGGGAAGATTCAACACATCATCTGCACTGGAAATCTCTGCATCAAAgta gtgttttttttccttcaaattttcaacttttacatctctagtatgctttattttttttcgttaatgtcgttgtgggatcattctgtctg atgtaactgtttctgggatatttacttatgaagcagaaatgaggattcagaaggagggtagactctaatttcaagc tttatatgatgcttttctctgatgcaacctagtctatgtttatcctgtattgggttgtttagaactattttcttacttggttgtgt aagccttaccagaaatttgataaggagactcaatagtgaattatgtgatctactcggagcttgttctatatctgctg gattgtgtagatttgaattcttaccttgtagtgtgagagtcgtgtaaaatgggaagaggaattccattacattttggtt tgatcagttttagttttagatttgcttgtgaccatggaattgtaaagtgattaagtcagcatcgatgagtcatttatgag aagccaatgatgttcttttagttggaagatatgttaagaagtttgtaaggtgtttagttgaaaacccatcttagaga aactgcagacctgatttatttgtatgctgcaaagtcttatatggttcatgtttgtggtaagcttaacatatttttgactgtt gtagGAAATCCATGACTACTTGAAGACTATCTGCCCTGATTTGCATATAGTTC GAGGGGAGTTTGACGAAGATGCGCGATACCCTGAGAATAAAACCTTGACT ATTGGGCAATTCAAGCTGGGATTGTGTCATGGCCACCAGgttcagtttcccttctttgt gcatttcatcatgatctttacgagttagacgctagcatttgcagcatacgaataatccctcgtatcatacatcatctg tttatcagaattctcagactagaaaaccaaattttgcttgcctttacttcattgctttgttggtgtgtgtcccttacccagt gagtttcagcgttggctcgcttgttattttgcatgataatttaaacaggctttgaatatatat[mag1.2_insertio n]cagatggaatattatctttctacccttttactttgtcatatggtaaacagtgataaaatagtttttgtctagatgcctc taatctgcacatactttttctatatatgataagaactgtattggagattagtaaacgggaaaggtctaggataaaa gtttcggcgtcaatgattttcattttcaatgtcaatgttttagactaaaactatgaaagagaataagagcaatggtta gttaagttatataacttatttgcagtggtatgaaaacgagcttctctatttgtggtacagGTGATCCCATGG GGTGATCTAGACTCACTCGCCATGCTTCAGAGACAACTGGGCGTCGACAT TCTCGTAACAGGCCATACCCACCAGTTCACAGCGTACAAACACGAGGGAG GAGTGGTGATAAACCCGGGCTCTGCAACCGGAGCTTACAGCAGCATAAAC CAAGACGTTAACCCAAGCTTTGTCCTTATGGACATCGATGGTTTCCGAGCT GTAGTCTATGTCTATGAGCTAATCGATGGAGAAGTTAAAGTCGACAAGATC GAGTTCAAGAAGCCCCCTACTACCAGCTCTGGTCCGTAGttccttaaaaaaaaact taaaccgttttatgttttcgtctgaaatcactcctaaaca[mag1.1_insertion]tttagaactttcttgtgaaa aataatatttttaacaatttgcaaaacgaaccattgagaataaacaacttttaacaacttcctgcaaaagtcgaat atttttattggcatttgcaattttgcattacaacgatgttggcgagtttattcggaaatattaagccctaataacccga ccccacccgacccgacccgacccga 127 MAG2 ttaaagattttgctcgatttggggatccatgattgattgaatttggagattttgctcgacgtcactttgttttttttttgttttttt ttcgctttgttgttatagaaaacaataacgttttgttgttatagaaagcaataatgttttgaattaacgggtatgttaat aaaaaattaattttgttaacggaatctgttaatctacttaacatcatgtctaatatgaccattagcaaatattttgtattt attaagaaagtcacataagtttgatatttattagaaaaatcaacataaatttgatatttatttcgtttcgttggatgttg agaatttaaacaaatgttttcaaataaaagttggagattttagtgataattttccgataaaagaatgtcatttctttga gaatgccattacagttttaatttacttgatatgttaaaaacatttttgtgggcttaaccagggcccattaatccgagtt accaaaactcgcaactcgttaacactccgagtcattgactcagtatcttcttccaacgaaatttcaatttttcattgtt cttgatgaaactgaggccacgacgaatcgcaaATGGAGGCGATCAAACCACTTCCACAG GTCTCGAGCTTCTCCGCTTCGGTATTTAGCTTTCTCGATGGCAGATTCAAA GAATCCACGGATCTCTCTCATTCTACGGGTCTGGTTTCCGAGTTACAGACG GAGATTTCCGAATTGGATCAGAGATTGGCCGGATTAAATCGCCAGCTCGA GTCAGGTCTCGCTGCTTACGCTTCGTTTTCCGATCGCGTCGGTGGTCTATT TTTCGAGGTTAATGCCAAATTGGCTGATCTCTCTTCTTCTACCTCCGTTACT CGCTCCGCGTCAGgttcgtatgatttcatcacgattcttctggatgtagattatgttcttgaaatgttgtagtc aaatatgtaaattcagtcctggtgatggaagattgaatatattcccgttataatttgagaatttgataattgtttaattc attcagATAGCGGAAAGGAGGAGGAGGCGACGGAGCATGTAGCCGGAGAGG ATCTTCCTTCATTAGCAAAGGAAGTAGCACAAGTTGAGTCTGTGCGTGCTT ATGCTGgtatgtttggtgttaaggcgttttgtatttgatggctgctttctgataaagactagctctattttatgatttg gtgttatcgatgttgatcgggattagttagtcccaaagttggagttaccaaccttctttcttagtttggtcactgtggtat tcaacattttaagatctttctctatgtttcagAGACTGCACTAAAACTTGACACTTTAGTTGGT GATATTGAGGATGCTGTGATGTCTTCTTTGAATAAAAACTTAAGAACATCTC GGTCAAGTGGTTTTGAAgtaagcacactgtcttcagcttagactgttccgtagttttagccagtgtgtgt ggttcaatgttctaagtttttactgatttcagGAAGTGCGTCTACATGCTATTAAAACACTTAA AACGACAGAAGAGATACTGAGTTCAGTAGCAAAGAGACATCCTCGGTGGG CACGTCTTGTTTCTGCTGTTGATCATAGAGTTGATAGAGCTTTAGCTATGAT GAGACCTCAGGCAATTGCTGATTACAGAGCGTTGCTTTCTTCTCTTCGATG GCCACCTCAGCTTTCTACACTAACTTCGGCAAGTCTTGATTCAAAGTCAGA AAATGTTCAAAATCCGCTTTTCAACATGGAAGGGAGCCTCAAAAGTCAATA CTGTGGAAGTTTTCATGCCTTGTGTAGCCTGCAGGGGTTACAGTTGCAAAG AAAGTCGCGGCAGCTTGGGATCCATAAGGGAGAAAATGTTCTTTTCCACCA GCCACTCTGGGCTATTGAAGAGCTGGTCAACCCTCTGACAGTTGCATCTC AGCGACATTTTACAAAGTGGAGTGAAAAGCCAGAATTCATTTTTGCCCTTG TGTATAAAATCACAAGGGACTATGTCGATTCTATGGATGAGTTGTTACAAC CGCTTGTGGATGAAGCAAAACTAGCTGGGTACAGTTGCCGAGAAGAGTGG GTTTCGGCTATGGTAAGCTCACTGTCTTTGTACTTGGTGAAAGAGATCTTT CCTATATATGTTGGTCAGCTAGACGAAGCAAATGAAACTGATCTTCGTTCT GAGGCTAAGGTCTCGTGGCTCCATCTCATTGACCTGATGATCTCCTTTGAT AAGCGAGTTCAGTCTTTGGTATCACAATCAGGAATACTTTCGCTTCAAGAA GATGGGAATCTTTTGAGAATTTCCTCTCTCTCAGTTTTCTGTGACAGACCTG ATTGGCTTGATTTATGGGCAGAGATAGAGCTAGATGAGAGGCTCGTCAAAT TCAAGGAGGAGATTGATAACGACAGAAATTGGACAGCGAAGGTCCAAGAC GAACTCATCTCCAGTTCCAACGTTTACAGACCACCAATCATTTCCAGCATC TTTCTACAGCATTTGTCATCAATAATCGAACGATCCAAATCAGTGCCGGCC TTATATTTGAGGGCTAGGTTCCTGAGACTGGCAGCCTCACCAACAATTCAT AAGTTCTTGGATTGCCTCCTTCTCAGGTGCCAAGACGCCGACGGACTAAC TGCATTAACTGAAAATAACGATCTAATCAAGGTCTCGAACTCTATTAATGCT GGTCACTACATTGAATCTGTCTTAGAAGAATGGTCTGAGGATGTCTTTTTC CTTGAAATGGGAACTGGACAGCACGATCCACAGGAAGTTCCAGGACTGGA GAACTTTACTGAACCTTCTGAAGGTATTTTCGGAGAAGAATTTGAAAAGTTG GAGAAGTTCCGG[mag2.3_insertion]CTAGAGTGGATAAACAAATTGTCGGT GGTGATCTTGAGAGGCTTTGATGCTCGAATCCGAGAATACATAAAAAACAG 128 AAAGCAATGGCAGGAGAAGAAAGACAAAGAATGGACGGTGTCAAGGGCAC TAGTTGGGGCTCTAGACTACTTGCAAGGAAAAACGTCTATAATAGAAGAAA ATCTAAACAAAGCAGACTTTACCGCTATGTGGAGAACTCTAGCCTCAGAGA TAGACAAGTTGTTCTTCAACAGCATCTTGATGGCGAACGTGAAGTTTACCA ATGATGGAGTCGAAAGGTTAAAAGTAGACATGGAGGTTCTATATGGGGTTT TCCGGACGTGGTGTGTTAGACCCGAAGGTTTCTTTCCTAAACTAAGTGAGG GGCTTACGCTTCTGAAGATGGAAGAGAAGCAAGTGAAGGACGGTCTGAGT AGAGGCGATAAGTGGCTACGTGAGAATAGAATTCGATATTTGAGTGAAGC CGAAGCCAAGAAGGTAGCGAAGAGTAGAGTGTTCTCTTAGttagatatttttcagca aatcaatatctcatacgaaggtacattcataggtgagatcatttcatttatttgtggattaagatataacacacaac tttggatttttaattgaaattgaactaaccagctctggttctgtttcgtcttaaattaatgattcatgagttaatagagaa acaaaacaaaa MAG4 tgaaaccaaacaaaaatttcaaaataatcaaatggatactaaattttagaaccgaaaaaactgataaccgaa caccgatgactaaagtctatgagcataacacaatccacaaaaaataaacctccacagttttactggctatgcc ataaaagcccactggacctaaatgggcccatattcaagcatcctgtaatctcacaagagctagttatctggtgc gggcatgagtataatacgttatcaatcttcctcttcgaccattaaagaagaaaaaaaacaatatcatcgtgacg aatcgaacggtttatcagctcttctgactttgcttgaatagatcggtcaggtgaaaaacacgagagactctgtga agaagagatcgctgcgatctatcacggttccattgcctggaggattcttcttccttgttttgatcttgttcttatcaacat agtctgtgtaagctgttcgattctcgccattttcaatttctgttttccctctctgtgataaatttcggagttctagggttttgt agtttgattctccgtttttgtactggtttgtgttgttgctatttcaattttggttgaaaatttcttgatttttgttgtttcaggtgatt ttattggctataacgagatATGGATTTGGCATCCCGTTATAAGgtgaatatctttcttcgattgtga gcttaatttcagctgtcaggttttcagggtgattaaattctgttttggatcgaatgcatgtttccttcttgtttccgtgttttc gaatagctgtatttgttgagaggagtctgttttgcatttgctcggtgttgattccaatatgaattttagtagtttctcaatg gtgtagaaagagtgactggtagtttactaatcatgaggatggcagagagctttatactatatttccatggcatttta acagcagcattattaaagtttcccggtgtttttaacttgttttacgaacttttacagGGGGTTGTTGGAATG GTTTTTGGCGACAATCAATCTTCTAATGAAGACAGgtacattttcttctcttttctttgccatg agtcgggagatataggctatcatggctcgtaagtatgtcgagaaatgaagtaatgaggattggatttttatttggt gtatttttgtagTTACATTCAACGCCTGCTGGATCGTATAAGTAATGGTACCTTGC CTGACGATAGGAGAACTGCTATAGTAGAACTTCAGTCTGTTGTCGCAGAAA GTAATGCTGCTCAATTAGCTTTTGGAGCAGCCGgtatgtgtctttctgttttatgttatgtgtct gattgattgtgagataaaatctaaagttacaatttacattctgtcatttcagGATTTCCTGTGATAGTGG GCATCCTAAAGGATCAACGAGATGATCTTGAAATGgtgctgtacccttattttttttttccat atttcttttgttctctctcagtttgtgcatagtgctgtttgtccttacgtcatagcacaatctcttgtcctcactttatggaaa atgtggtaaaaatcgttgaagcaaagtttgttaaagatggattttttttttttgctccaaacaaagtagttttgggtga gtagcatttgtaattctacttttcgtagtccttgctaatagtgtagcgatgtataattcagaatcagaataggtccaca agtccaggggaaaatcaattggaaaggaaaatgttgtaaatttccttcagtatggattctggtctagtgtagattct taaccatgaaagttgcacttataaatctcttgttgttttactttgggattatccattgaatttcttgttaatgttaccgttttg tgcttttcttcaacttatatttttgttgactgtccaggttcgaggtgccttagacttataattttgttgactgtccagGTT CGAGGTGCTTTAGAAACTCTTCTTGGCGCTCTAACGCCAATCGATCATGCA AGGGCACAGAAGACTGAAGTCCAGGCAGCTCTTATGAACTCCGACTTGCT CTCCCGTGAAGCTGAAAATATCACTCTTCTTTTGAGTTTGCTGgttagagtttactc aatattttcgaaatagaaaaagttattttttagttttgtttgttttttatgtgtgcatcctatggttaattttcctggtggtatg acagGAGGAAGAAGACTTTTATGTTCGATATTACACTCTTCAGATTTTGACAG CTCTGCTTATGAACTCTCAAAACAGgtatgtatcttaggatgtactatctctgtgtcatttttctatct gtatgctatagagactgttgctgactgtctcttcagGTTGCAGGAAGCTATTTTGACCACTCCT CGCGGTATTACTCGGCTCATGGATATGCTAATGGATAGAGAGgtaaataattaatt 129 atttattggatatgtatgatctttccattgtttacgtaagactctgacctctttaacctcctgtcaacctcaaataggag gttatgagacggtaaactctggaagggaacaatatcatgtttgtaattttttttttggtgtatgccttatagGTAAT CCGGAACG(mag4.1_deletion)AGGCTTTGTTGCTTCTTACGCACTTGACTC GTGAAGCAGAGgtgagacctcctggttatatttgtaactgttccttttcttctgccttgagccgagggcagcc attgatcgatcaattttacatttgtagGAGATCCAAAAAATTGTTGTCTTTGAAGGTGCTTT TGAAAAGATTTTTAGCATCATTAAGGAGGAAGGGGGGTCAGATGGCGATG TGGTTGTGCAGgtgaccagttgctcttttaaaaatatttatagcacttgtaatctgttaaaaatttggcatata aaaaggataaatgaagaacccattctcctatttaggcctaaacattattggcttttacatttccatatgtcagcata ctttgggattgcctgcttatgctaaacttcaaattccttgcccgtgcacttgcactaatcttgttctttatatacagGA CTGTCTGGAGTTGTTGAACAATCTTCTTCGCAGCAGTTCATCTAACCAGgtttt gaagctttgtttctggtcttgatttttctgattaatttactagatccaacatatgctaatatttgaacttatacagATAT TACTAAGGGAAACCATGGGTTTTGAACCTATCATTTCAATTCTAAAACTGCG AGGGATTACATATAAGTTCACCCAGCAAAAGgttagatcatagagtgcttctcccattatgg atttattttaccagcttgtgcaattttgttaatctttgtccttctgaatgaacaaattcccagtagACTGTTAATC TGCTTAGTGCACTGGAGACCATCAATATGCTGATCATGGGACGTGCAGAC ACGGAGCCTGGAAAAGATTCAAACAAGCTGGCAAATCGAACAGTTTTAGTT CAGgtttgttatgatattcaaatgttatgattccttattgtggacaggatcgaatatgtttcttattctatcatatttgatc caataacagAAAAAACTGCTAGACTACCTACTGATGTTGGGTGTTGAAAGCCA ATGGGCGCCTGTTGCTGTTCGCTGCATGgtaagcaaatcatctttattcgaaatcttcttcctt agccccttccctttttttgttagtattcatatttcattgtatatcgaaatttcattcctgatttgcgcaactgtcaataagtt ggaatagtttcattcagattctagaattcagctgttgaacattgtttctaggttgaccatgggaactgatatacttac agtgatctattagcagtttttttctagagattgttgcactttctttttagccaattgctgaacagttgtctgaagcaagta ttttattgatgtttttttctaactggtcttttgcattgtgattctttgctgctagACATTTAAATGCATTGGAGA TCTAATTGATGGACATCCTAAGAACCGTGATATCCTTGCTAGCAAAGTTCTT GGTGAAGATCGGCAAGTAGAACCTGCACTCAATTCTATCCTCCGGATCATT TTACAGACGTCCAGTATTCAAGAGTTTGTTGCAGCTGATTACGTTTTCAAGA CATTCTGTGAGgtttgaaactttttttcctatagatttttagttcatagtctaatgtgatattgcttcttctaatata gtatgtgattggatctttatctagaatctcattcactaaacgttttcagAAAAATACTGAGGGTCAAAC AATGCTAGCTTCGACTTTAATACCTCAACCACATCCCACATCCCGAGATCA TCTAGAAGATGATGTTCACATGTCATTTGGAAGgttccatcttcttttctttcttctgcggctttc gcagtgttgttttgaagggatctcactcttcattctcattgcgagtttgcatggtcctccaactttgacctacttctcttgt attaaaacttatccaactttttttctatatgtgtcagTATGTTGTTACGAGGCCTTTGTTCTGGTG AAGCTGATGGTGATCTTGAGgtaaaattgtttctctctgttggatattaatttaaaagcattccggaa cttacagttatttttttacttcttaaaaacagACATGTTGCAGAGCTGCAAGCATTCTTTCTCA TGTTGTCAAGGACAATCTTCGGTGCAAGGAAAAGgtaacaaagtcatcactttaaactt attttctgtacgcaatatttttgtttttcctgcattgactgactttttgggaaactcaatagGCTTTGAAAATCG TACTTGAATCACCGATGCCGTCCATGGGAACTCCAGAGCCTCTCTTTCAGA GGATTGTCAGATACCTGGCTGTTGCTTCTTCCATGAAAAGCAAAGAGAAAT CAAGCACACTGGGGAAATCATATATTCAACAGATCATTTTAAAACTGCTGG TGACATGGACTGTTGATTGTCCGACTGCAGTTCAGTGCTTTTTAGATTCCC GCCACCACCTAACGTTTTTGCTCGAGCTGGTAACAGACCCAGCTGCAACA GTTTGCATAAGGGGATTGGCGTCTATTCTATTGGGAGAATGCGTCATTTAC AATAAATCAATTGAGAATGGTAAAGATGCTTTTTCTGTAGTTGATGCTGTGG GCCAGAAGATGGGGCTGACATCATACTTCTCAAAGTTTGAAGAGATGCAG AACAGCTTCATCTTCTCTCCCTCCAAGAAACCTCCACAGGGTTATAAACCG TTGACAAGAACACCCACGCCGAGTGAAGCCGAGATTAATGAGGTGGATGA GGTAGATGAAATGGTTAAAGGGAATGAGGATCACCCAATGCTCTTATCTTT GTTTGATGCTAGTTTCATAGGCTTAGTGAAGAGCTTGGAAGGTAACATCAG AGAAAGAATCGTGGATGTGTATAGTCGGCCAAAAAGTGAGGTGGCTGTAG TACCAGCAGATTTGGAGCAGAAGAGCGGAGAAAATGAGAAAGACTACATC AACCGCTTGAAAGCCTTTATAGAGAAACAGTGCTCGGAGATTCAGgtatgtccc tcttcagaactgtctatgctactgagtataatactgttagcatatctgaaaatatatactattttggtgttgattgcaatt 130 aaaagaagacaaatcttgagattcatcacaagtaacatgaaatttaaaaatagatgaaaagttcagggggttc aactaaaaatatgggtattaggaagcattgcttattgctgaaacggaatataggaagcactaagttttggctaag atgggactatttagaattggtcttatgttctcagctggatatcaaacaatattcagacaataggtctctgaattgatg tgataaatatatgttgaggttaaccgttcaaaaaagaaatctcaacatcagttgttcctgacagagcttcttgttgtt gtgacctctgtgttttacagAACCTTCTTGCTCGAAACGCTGCTTTGGCAGAAGACGT AGCTAGCTCTGGCAGGAACGAGCAATCTCAAGGATCAGAGCAAAGAGCAA GTACAGTAATGGACAAAGTTCAGATGGAAAGCATCAGACGAGAACTCCAA GAAACATCTCAGCGTTTAGAGACAGTGAAGGCTGAAAAAGCCAAGATCGA GTCTGAGGCATCAAGTAACAAGAACATGGCTGCAAAACTTGAGTTCGACCT AAAAAGCTTATCGGATGCATACAACAGTCTAGAACAGGCGAATTACCATCT TGAGCAGGAAGTGAAATCCTTGAAAGGAGGAGAGAGTCCCATGCAGTTTC CTGATATAGAAGCCATCAAGGAAGAAGTGAGAAAAGAAGCTCAGAAAGAG AGCGAAGACGAGTTAAATGACTTGTTAGTATGTTTGGGACAAGAGGAAAGC A[mag4.2_insertion]AAGTCGAGAAACTAAGCGCGAAGCTGATAGAATTAGG AGTTGATGTTGATAAGCTTTTAGAGGATATTGGAGACGAATCAGAAGCTCA AGCGGAATCTGAAGAAGACTGAccattgaacaagttgtagagattttatatcctctaaaatgctct ctcatcagggaatggaacttgtacagtcttgggtgatttatttcgatatttgacaattctctttgcttaatcttttagttat aactctttaatttatctaccaaacttcttagattgcttgaaacatacacatcttaacaatgaaacacaatgtggtttta agttactgcaggttgttatgcttttcatattttgttgaaacaatcaaacaagtactcagttcaattagtgttt VPS35b aaactacagacaacacattaaacagaaaacaacattggaagaccaaacaggtggtatccatgttgcagag aaaactatttataatttgatttggttcgattttaatcaaaatgcaaatcaaaaatataataaaataacaaagtagtt acaatattttatattaattgatatgtttattcataatacataaaatataaaattgaaataaattttgtttatccggttcggt tgagttccaatatagcagaaacgattgcggttaaaattctggattttattcggtttgatgaaatagaatatggaccg ttagatctaaagttaagtagaaacaacgaacggtggatagcgaggaaaacgtacacgtgtgaagcatgaca taatttttttcccatccaaagactttcacaaggtcttttttttgaaacctgaaaagtgagaaacccagagatcgaga ttgagatttagagaagacgaagatcagaagATGATCGCAGACGGATCAGAAGATGAAGA GAAATGGCTCGCCGCCGGTGCTGCTGCTTTCAAGCAAAACGCATTTTACAT GCAACGCGCTATTgtaaattctgcttttcccttcacttcttcctttcgtttctctgtgactattttcgttgaaattg agaaatttccaaatttgtacttttcattgatgtgtagaattgaaggaaccctaattttggttttgggaaaaaacttgtct tgtttttggttaattagGACTCGAATAATCTGAAAGATGCTCTCAAGTATTCGGCTCA GATGCTAAGCGAGTTACGGACTTCGAAGCTATCACCTCACAAGTACTATGA TCTATgtgagttgtttccgtcttctgatctaccttgatcatatttatttcgctcaatcccttaattgacttaatagggttt tgtttgtgatctgatttgatggtttttagATATGAGAGCTTTTGATGAATTGAGGAAACTTGA GATTTTCTTTATGGAAGAAACTCGTCGTGGCTGCTCAGTCATTGAACTCTAT GAGCTGGTTCAGCATGCTGGTAACATATTACCGCGTTTgtaagttactctcatacaat tccctttcgctgtctgttcctatgttttggcaagtttttttttatcgatcgtatcttcgtaaagaatcattttggtctttcttagtt caaattcatgttatcttaaagatggtttatttgattactacttgatgttgctagcatttcttctgatatttagtctgttcacttc tctagctaaagtaatatttgttcattttattttcccttccctagGTATCTCCTATGTACAGCAGGATCC GTGTATATCAAAACCAAGGAAGCTCCTGCCAAGGAAATTCTTAAAGATCTT GTTGAGATGTGCCGTGGGATTCAGCATCCTCTACGTGGTCTCTTCTTAAGA AGTTACCTTGCGCAGATTAGTCGAGATAAATTACCTGACATTGGTTCTGAG TATGAAGGgtaagcatgtagacactattgttgggtttcttcatagcagctgggactcccatggggttcttgctt gtgcttaaattatcttcttagacaaagatccctagtttctttcgacttcatatagaatgttggctattaattaaaatcattt tgattcatctagttggtgactcgttttcaaatctctacagAGATGCTGATACAGTCATAGATGCTG TGGAGTTTGTACTACTGAACTTTACTGAGATGAATAAACTCTGGGTCAGAA TGCAACATCAGgtgagttatgatttaagattgttatacacccttctactcattttggaaacgagcttgccaga 131 atgtttgcacttcatggagatttaattggactatgagagcatgattctgatcatgcaggtgttgtattgcttgtgaatc attaccctttgatgatcctgagttgaagcttcctatcgggctttctggtctgcagGGACCTGCTCGAGAA AAGGAGAGACGGGAGAAAGAGAGGGGCGAGCTTCGTGACCTTgtaatatgactt tctttcagctggtaacagtgaactgaagtgccgaccgtttgttttgtactgatggaagcgtacgttttccctcacttat cagGTTGGAAAGAACCTTCACGTGCTGAGTCAGTTAGAAGGTGTGGACCTT GATATGTACAGAGATACAGTTCTTCCTAGAGTCTTAGAGCAGgtaaaatattatat cctgtaaacacacatacacttgcaaacttaacttattccctaaccaaaaactaatgtttttctgatgatttctcagA TTGTGAACTGCAGAGATGAGATTGCCCAATATTACCTAATAGACTGTATAAT TCAAGTTTTTCCTGACGAGTATCACTTGCAGACTCTAGATGTACTTCTTGG GGCGTGTCCTCAACTTCAGgtagtctgagtagctctctcgtagttagcattatgctctgctcctaatca tctaacttgattgcaactgcactgtcggggtttttggtgtcagtggttcactaacctcctgtttttttcactttctctttggttt tcttctcagGCATCAGTTGACATCATGACAGTGCTTTCCCGTTTAATGGAGAGG CTGTCAAATTATGCTGCCTTAAATGCGGAAGTATTACCTTATTTCCTGCAAG TGGAAGCTTTCTCAAAGTTGAATAATGCAATTGGAAAGgtaacattacttgtaggctgt gagtcctgttacacttttgtttctgtatgaaagtatgaaagtaatgacagcttagaaatcttgttgactggacttgtta gcatcacgaaagacatggagaaggcccaactatttttttgcttgacatttcttttttaagtgaaactcttcactttaga cgattccgtcaacatctagaaagatgcattgttctttatctgttggatgtatccgctgcaaagcgtgaccctaatta acacagttctgcatgtgttcacttcttaaattatgtggttttgacgccttttttcccaatgaattttttttcttaagtgggaat actatgatgttgcagGTGATAGAAGCACAAGAAGACATGCCTATTCTGAGTGCAGT AACCCTATATTCCTCCCTTCTCAAGTTTACTCTTCACGTTCACCCTGATCGG CTTGATTATGCGGACCAAGTGTTGgtcagttttcttatagcgactgtatttacctattgttctcttctta actcaagtttaaatactccgtcttctgtcaatgttgcttagtggcctatcgatctcattatttagattatcttgtcattctat tcacattgctgagacattgctaaatttgtgatacttgtgtcaaaagatgccaatctttttatataaaaaaaaaaaca gatttcaggttcatctttgtttatcattcttgtagcggatattggttcaggaatttatgttttattctgttatatgtatgtgctttt ttatacatgtttatggtacgctacatttgcagaaaacttcctactacgtatcttatacgaattttttgctggaatagggt acattgggtgaaagagtttgaatttaatgttcttcactaactcttcccagGGATCATGTGTTAAGCAAC TGTCCGGAAAAGGAAAGATTGATGACACTCGTGCAACAAAGGAGCTTGTC TCGCTTTTAAGTGCTCCCTTAGAGAAGTATAATGATGTTGTCACCGCCCTT AAACTAACTAACTATCCCCTCGTGGTGGAGTACCTTGATACCGAAACAAAG AGAATAATGGCTACTGTTATAGTTCGAAGCATTATGAAAAACAATACTCTTA TTACTACAGCGGAGAAGgtaggatacctttatgccattaccctttgggttgaactatttgtgtggtaaaa ttgaatttctttgtcttgttttactttggtagGTTGAAGCATTGTTTGAACTGATTAAAGGAATTA TCAACGATTTGGATGAGCCGCAAGGTCTTGAGgttgttataagcattttgagatactttctat cttgtacttaatatcctctcaatcacttgtgaccccaatattaagttgtattttgacttctgtacatatttttctacgctctta gGTTGATGAAGATGATTTTCAGGAGGAGCAGAATTCTGTTGCGCTTCTCAT TCATATGTTATATAATGATGACCCAGAAGAGATGTTTAAGgtgcattgagctcttgca cattcagattcattaatgtgtgtcatatctttatgctgactttaggctttctagttattggcttacattaatatttttaatgttt aatggcatttattcccagttgcctttattctttgtttgtaattttcaactgagacatagattatacttatccccaggagtc atcatttccttcaaaattctattaaggttttggctttcagtgtttgaagctgtttaattgaatttcagATAGTCAAT GTCCTGAAGAAGCATTTCCTGACAGGAGGGCCAAAGCGCTTAAAATTCAC CATTCCTCCCCTTGTTGTTTCTACTCTAAAGgtttgatttcttaagcgaagttattctttaaagc ccctcataatttatgttgagcttaatttgcgatgacatggtttggttgtttctatttatgtgccataccgcatagaagagt atctgatatagtttgttaattttgttcttcttcaaattctgtaatttttgttgtctgttttgtttaaattgtagCTAATCAGG CGATTGCCAGTGGAAGGAGACAATCCTTTTGGAAAAGAGGCTTCCGTTACT GCTACTAAAATATTCCAATTTCTAAATCAGgtaagtttcataacttttcactgccttgtgtgttca gatctgtttttaggtagaacaatgtgtctcagagctacattacttttttaaaacttagccaaagttagttcatacaagc tgattgacaacgttgaaaattctcaatcaacagtttgtagatattagctaaagagactgctgttattgctctgtctcc actacagtcatcagttcttgtatttattgttggatttttttgcttgttgcagATTATCGAAGCGCTACCTAA TGTTCCATCACCTGACCTGGCATTCCGGTTGTACTTGCAATGTGCTGAGgta aaactacgtacctgtttatttagtatgcacttgctcggtaaaatgctcacagatctttcctatttcatcagGCTGC GGATAAGTGTGATGAAGAACCAATTGCATACGAATTTTTCACCCAGGCATA CATCTTATACGAAGAAGAAATTTCGgttagttatgttttcgcggacgttcatatgtctcgtgtctaa 132 accaaagcaagtaggtgagtttaactctctctcgtgcgtcaaaaacttgccttttgaacagGACTCAAAG GCCCAGGTGACTGCGTTACAACTTATAATTGGAACTCTGCAGAGGATGCAA GTATTTGGTGTTGAGAATAGAGATACATTAACGCACAAGGCTACGGGGgtatt tcatcatgatatgataattagtctcttaccatattcaattgtctctgcttaaaggctgacaagggaaaacttatactct tgcagTATGCAGCGAAACTTCTAAAGAAACCTGATCAATGTCGAGCTGTTTAT GCCTG[vps35b.1_insertion]TTCTCATTTGTTCTGGCTGGAAGATCGTGAGA CCATACAAGATGGAGAAAGgtaatactttcttctgtgcgatagaacacagtgtgtttaaaatagctca ttaggtttcttacttgaaatcagGGTTCTACTTTGTCTGAAACGAGCGCTTAAAATTGCA AATTCGGCTCAACAAGTGGCCAACACAGCTCGGGGTAGTACAGGGTCTGT TACCCTCTTCATCGAGATACTAAACAAgtaagcgctttttattgtttgttttacttctcattttcccag atcagcaagcagcattatgtaatgtgaattatggcagGTACCTCTATTTCTATGAGAAAGGGG TTCCACAGATAACAGTTGAATCAGTAGAAAGCCTGATAAAACTGATCAAGA ACGAAGAATCGATGCCCTCTGATCCATCTGCTGAATCATTCTTTGCAACTA CGCTTGAGTTCATGGAGTTCCAAAAGCAGAAAGAGGGTGCCATTGGTGAG AGATACCAGGCGATCAAAGTATAGaatctcatgttggtcaaaagaagtcggtttgtttttattaaa aaaaaaaaaagctttgctttgagtcgtttgagaatcggaaagggactgatattgtgttgcttgctgctcgaaatgt aaacccgggtttaggtctagttttattcctcggattttcaaggcggttcatggtctgtaacttgtggagtttgatgtgtg tttatacgaatctttagtattctgagtttaacattggatattacatattagataatgtccttcatgaaaaaagtattcattt tatgtattctgttgaatgattattgacgcataaaaacattttcaataccattgtataacgcataattaacaacctttgt gaggcttcagccttcaaatatatcaatagaaattcagtttacgaaatcaacttaaaaaattttgaaaaaataata aaatttaaaaaaaaaacaaattgtaggttaaatttttttaaaaacaaaataaaacagaaaagaaacgagaaa aaaaa VPS35c aatgtgaggaaaaagggtaaatcaaattctaaatcatgtgatctatttacgtaatttaaattaccagcttactcaa aagtcaaaattgaagttgtgagagataagactaatgtagcctaatgactatgaaatgagttatgtcaccatctaa tttaaagttaaagctgataatgaagaaagaacaaatttaaatgaggagagtttcaattcgcgaagatctcaga aaattcggtcaaaattgagacttctctctgtagagtaaagtagtgaggaacatagaATGATCGCCGAC GACGATGAGAAGTGGCTTGCCGCCGCAATCGCCGCCGTTAAGCAAAATGC TTTCTATATGCAGCGCGCTATCgtaccactccttcccctctcctctcttcttcatcaatcaattttccat tcgggatcttagaaccctaaatcctgcattgttttctaacgaatctccaatttatcctcgcagGACTCCAACA ATCTCAAAGACGCTCTAAAGTTTTCTGCTCAAATGCTCAGCGAGCTTCGTA CTTCCAAGCTTTCTCCTCATAAATACTATGAACTCTgtgagtcgactcgttcgttacccg ctcttcctctattcgtagctcttgtgtgtgattttcatttccttactgcgatctgatttgttcagATATGCGAGTTTT TAATGAATTGGGGACTTTGGAGATATTCTTCAAGGAAGAGACGGGACGTG GTTGCTCCATCGCAGAGCTCTACGAGTTAGTTCAGCATGCCGGTAACATTT TGCCAAGATTgtaagtttcttttgttcttgttgttcagaatacgttgtcatttgcacactctgattttaactttgtag atactcatcaattgagtttcagGTATCTTTTATGTACCATAGGATCTGTATATATCAAAT CCAAGGATGTTACTGCTACGGATATACTTAAGGATCTTGTTGAAATGTGTC GAGCTGTTCAACATCCCTTGCGTGGTCTTTTCTTGAGAAGTTATCTTGCTC AAGTTACTAGGGATAAGTTACCCAGCATTGGTTCTGACTTGGAAGGgtaaggc cacacgacctgctgttagatgattttgtttaagctatgaaacacatgttaagaggcaagtgttttgcttttctctccag AGATGGTGATGCACACATGAATGCTTTAGAGTTCGTGCTTCAAAATTTTAC CGAGATGAACAAATTATGGGTTCGAATGCAACATCAGgttggttggaaatgctaatga tatactgctctcatatccagcatgtatgtattgaatatgtttgcgtactgtgagagtccaatgtctggtcatgctgctat tttactctttgtgattttttcctactgagctgaacctgagttgagactttcttgtatccagGGACCTAGTCGGG AGAAAGAAAAACGTGAGAAAGAAAGGAATGAACTACGTGATCTTgtaatagcatt aacttgtacctgattcttctcatctagaaatttgcgagctatggttcttattattttctgaaagacacttttttcttctcttta 133 ctgcttgattagGTTGGTAAGAACCTTCATGTACTGAGTCAGTTAGAAGGTGTTG ACCTTGGAATCTACAGAGATACTGTTCTTCCTAGAATACTGGAGCAGgtctggt ttttctctccgtttagatagtttcagcatccaaccttccgtctacatgttgtcaactattgtctcactatgtcttaaaagG TTGTCAACTGTAAGGATGAACTTGCACAGTGCTACCTTATGGATTGCATTAT TCAAGTCTTTCCTGATGATTTTCACCTTCAAACTCTCGATGTATTATTGGGA GCTTGTCCACAACTTCAGgtacttttttatgtaaccgaccacttcttttttatatgcatttttctttttaaaaa attagactgcttataaacctcatcccatcagaacatgtttctcctatttttctttttgtcattttttgttccagCCGTCT GTTGATATCAAGACGGTTCTTTCGGGTCTAATGGAAAGGCTTTCGAACTAT GCTGCATCCAGTGTGGAAGCATTGCCTAATTTCCTGCAAGTAGAAGCCTTT TCAAAGTTAAATTATGCCATTGGGAAGgtaaatcttgcatctacacatgaagtcaaaagtgat aacttaaattgacacttgaatgtaagtttatttattaggggtgctagtattagaaacagctgaaaaatctagcttaat ccccattgtgttgtgatatttttgtgtaatttttgttgccctactttcctctgattgtatcacattgataatctattgtatttata aacagctgatcgagtccacatctattatgattcagGTTGTAGAAGCACAGGCGGACTTGCCG GCCGCAGCATCTGTAACATTGTATTTGTTTCTTCTAAAATTCACTCTCCATG TCTATTCCGATCGTCTGGACTATGTGGACCAAGTTTTGgtataacgctttctgaggat gtataagcgtgttaagtttattcatctgtagtacttttttttaatgctttaagtagttctggttgttgctaatagacattgttgt ctctcttttttcctttagGGATCATGTGTTACTCAACTATCTGCCACAGGAAAGCTTTG TGATGACAAGGCAGCAAAACAGATCGTCGCATTTCTTAGCGCTCCATTGGA AAAGTATAATAATGTAGTAACTATCTTGAAGCTTACCAACTATCCTCTAGTA ATGGAATACCTTGACCGTGAGACAAACAAAGCAATGGCAATCATTCTAGTT CAGAGCGTTTTTAAAAATAATACTCACATTGCTACTGCTGATGAGgtgtccgtttc cttcctcgatttagtttcttcttttagcattctctattgatgtggagtttccttttattgtttgttacagGTCGATGCAT TGTTTGAACTGGCAAAGGGTCTCATGAAGGATTTTGACGGAACAATTGATG ATGAGgttctagaaccctacccttgttcgtcaattcttattgcatttgtgttcaaagtattcttgtcactcgaaagag ttgtttttctttagataacttacgctttcaaacgggaaactatgttgggttctagATTGATGAAGAAGATTT CCAGGAAGAGCAGAACCTTGTTGCACGGCTTGTTAACAAGCTCTACATTGA TGATCCAGAAGAAATGTCTAAGgtgtgtattataattagaggcgttactatcaatgtggcctgtgta cgattcttaattcgtttacccaactacagATTATTTTCACTGTAAGGAAGCACATCGTGGCT GGAGGACCTAAGCGTTTGCCTTTAACTATACCACCCCTTGTCTTTTCCGCG CTCAAGgtttgtattttttactgatcactttttcaccccatatccaacttggaaaagtttttcaatttaaaatttggga ctgttgtgtcgtttccagttttgttatggatttggtagacattgtgtttgaatttggaactccgcacattgtattgatatgct tgtgcagTTAATTCGGCGACTGCGAGGTGGAGATGAAAACCCATTTGGAGATG ATGCATCAGCAACACCGAAGAGAATTCTACAGCTGTTAAGCGAGgcaagtgta catacatcttgttgttagaaaatagtgaggacagtttagttgccttgcaaacatcattggctttgtcccagtgaaac attgtatttcgttttgccagaatttctatgatgccagtcgattttgaatctcaatttttacagACGGTGGAAGTT CTATCTGATGTTTCCGCACCTGATTTGGCATTGCGCCTCTACTTGCAGTGT GCTCAGgtatatgatcatttggactttggtgtatccactatcaaaaccaacaaaagatatagtgcatgagctg aaattttcttttattgaatctttagGCAGCAAATAACTGTGAGTTAGAAACAGTTGCTTATG AGTTCTTCACAAAGGCATATCTTCTGTATGAAGAAGAAATTTCAgtaagtgaattt cttaggaggctacatgcaactttctgcaagaaagaccttgacgtggctataaattttgggtgttttagGATTCA AAGGCACAAGTAACAGCACTACGTTTAATAATAGGGACACTCCAGAGGATG CGTGTATTTAATG[vps35c.1_insertion]TTGAGAACAGAGATACTCTGACTC ACAAGGCGACAGGGgtatctttcctgacaaaccagttgagtaagtgtgtgctacactgaatactaaca aggctaaattatgttgttgtagTATTCTGCAAGACTCCTAAGGAAGCCAGATCAGTGTA GAGCTGTTTATGAATGTGCGCATCTCTTCTGGGCTGACGAATGCGAGAAC CTGAAAGATGGAGAGAGgtatgtacacatatgaatgtatactcttaacatactaaatctcaaagttaa gattggtagctgtacgcaactaataaggggagaaaaattgaggcatattgagagtttgttgggggtttttgggat accagGGTTGTGCTTTGCCTGAAGAGGGCACAAAGGATTGCAGATGCTGTT CAACAAATGGCCAACGCATCCCGTGGTACCAGTAGCACTGGGTCGGTGTC TCTATATGTCGAACTGCTGAACAAgtaagtttctctttaagttgcatctatttagtacctactaccta cgctgcaaaatgcttgcaatcgagtactttcccagttgtcactgctttctcttaatggttttacgagtcaaatgcagG TATCTCTACTTCCTGGAGAAAGGAAACCAACAAGTAACAGGCGATACAATC 134 AAGAGCTTAGCTGAATTAATCAAAAGCGAGACAAAGAAGGTTGAGTCAGG CGCAGAGCCATTCATAAATAGTACTCTACGTTACATCGAGTTTCAGAGACA GCAAGAAGATGGTGGCATGAATGAGAAATACGAGAAAATCAAAATGGAAT GGTTTGAATGAatttatttcttgcattctgttaggcaaatcatcaataatcaatggcttctcgtcttgttgcaga agcatgaagagtaatcttgagaaataaaacttgtatcattacttttcatttcctagttacactctctaccaataaaca tgtttgcaaagtttaaaacagtaacatacaaaatatggtcagctgaagaaatggtgggaatgaaaactgattca aaacatgtggtgtttgtgtggaattatgcataggactgaatcgta VSR1 taccaatcttttgaatagttatttcacttgagctctgatcaacattttttatggtatggtaaaaaatccacctctcagat aaagcatcacatcagatgaaacacgaattttctttttcagtaccgtcagattaaaagattgcccacacgtgtttcc cgtttactcttcttagattcttattcaatttgtacaaactctttttttacatatccacgtttccggaaacgacaaagaag aacaattcaataaagtgataaacccttctccattcctaaattaaaataattaattaactcttaacattcatgaatctc taatcctaatcaccattaataatttatccattagattaaatcttctttcttctcttccttaacttagaagactcttaaatcta ataactctgttcacttccttcacacgaatcaaaaacaaaaagcttcgtacatttcgcctagctttcttcgttctaggg tttcggatcttacttaattacccaaaacccttgaatccattactctgtgtgagcttaaatctcttaccttttatcactttca atcgattgggtattaagttgtaaaggaaggaacttggaatcATGAAGCTTGGGCTTTTCACTCT CTCGTTTCTTCTGATCTTGAATCTAGCAATGGGTAGATTCGTTGTTGAGAA GAACAATCTCAAAGTTACATCACCTGATTCGATCAAAGGTATTTACGAATGT GCCATTGGTAATTTCGGAGTTCCTCAATACGGTGGTACTTTAGTCGGCACC GTCGTCTATCCTAAATCCAATCAGAAAGCTTGTAAAAGCTACTCCGATTTC GATATCTCCTTCAAATCCAAACCTGGACGATTACCAACTTTTGTCCTTATCG ATCGTGGAGgttcgatttcaaagttcttacctttttagttgttcccttaattgggtattcaaaagatctgtttttgatt caattttggaactgaattttgttttcagATTGTTACTTCACTTTGAAAGCATGGATAGCTCA ACAAGCTGGAGCAGCAGCGATACTTGTAGCTGATAGTAAAGCTGAGCCAT TGATTACAATGGATACACCTGAAGAAGATAAATCTGATGCGGATTATCTTCA GAACATTACCATTCCTTCTGCTCTCATTACTAAAACATTGGGAGACAGTATA AAGTCTGCTCTTTCCGGTGGTGATATGGTTAACATGAAGCTAGATTGGACT GAGTCGGTTCCACATCCTGATGAGCGTGTAGAGTATGAACTGTGGACTAAT AGCAATGACGAGTGTGGGAAGAAGTGTGATACTCAGATTGAGTTTCTCAAG AATTTTAAAGGAGCTGCTCAGATTCTTGAGAAAGGTGGGCATACTCAGTTC ACGCCACATTATATTACTTGGTACTGTCCTGAAGCGTTTACGTTGAGTAAA CAGTGTAAGTCTCAGTGCATCAACCATGGAAGGTATTGTGCGCCTGATCCT GAGCAGGATTTTACGAAAGGGTATGATGGAAAGGATGTTGTTGTTCAGAAT CTACGTCAGGCTTGTGTCTACAGAGTGATGAATGATACCGGTAAGCCGTG GGTCTGGTGGGACTATGTGACTGACTTTGCTATCCGGTGTCCAATGAAGG AGAAGAAGTACACCAAGGAATGCGCAGATGGAATTATTAAGTCCCTTGgtga gtataatgcttcattacttcgatggtatttgctcctgtggtatcatttaattaacatattacctgtggctagaaattagtct agaaactgaatagtattttcagatatcagctcatccataatatagaacaaccaattgattgatttaaatgagctta ccttgtatggccttcttgaatccttatattcaccaagtggatattttgttactttcaaatttgtgcatttcatcctaaattttc tgactatcagctatatatgcattcagGCATTGATCTCAAAAAGGTGGACAAGTGTATCGG AGACCCTGAAGCAGATGTGGAGAACCCAGTTCTTAAAGCAGAGCAGGAGT CACAGgtttgcctttttgtaactaattggctcttacacgtcttcccaagatcttgttggttttaagttgccatatattct ccattcctcgaggtgttcttaaaatgagtctacttcaatttcagATAGGCAAAGGTTCCCGTGGAG ACGTGACTATACTCCCAACTCTTGTCGTGAACAACAGACAATACAGAGgtatt ggggaaatgttgacttgtcagtttttcctccctttctacagcatctttcttatgtttttgctaattgaattatctttcatgctttt agGTAAATTGGAAAAGGGGGCAGTGCTTAAGGCTATGTGTTCGGGTTTTCA AGAGTCAACGGAACCAGCTATCTGTCTTACTGAAGgtactattatccacgacatttgca 135 actctattttttttaacagaaaccacagttctcttgataggccaatatatattttcaaaacttgcattagagctatcctt aattctttatctattgttttatatcatgggagagaaggatagttttcagacatgcttctgatttttgcagATTTGGA AACTAATGAATGTTTGGAAAACAACGGTGGATGCTGGCAAGACAAAGCTG CCAACATTACTGCATGCAGGgtatcatttgggcatgctattcttttcattttctaacttttggttttgttgttg ctactatgactgattactcttatttggtatgaatcggcattttagGATACTTTTAGGGGAAGATTGTG TGAGTGCCCTACTGTTCAAGGTGTTAAATTCGTTGGTGACGGTTACACTCA CTGCAAAGgtaaattattttttatttgtgtagcttaacatgttttgtgtggaacaagatacagaacaatgccata ccctctgcttagcttctgtattccacaagtatttgttactggtagtcaaacaatactctgatttgtcccttgtcgatatttt cttggaatcaacttagatggtctatttgttgatttgtaccacctgcgagtataatatcaagattgataaataaaattgt cacaatcattgattaaattttactttgaaaaagtttacacttgaagaaattcgttgatgtgactgagtggaaacattc taatgttttttatgcagttctactgaattttcccctttgcgtatatcataagttgtcttac[vsr1.2_insertion]tata atctttgttaatgtgcattttctgtatccatatgttatttcacttggcagCCTCTGGAGCTTTGCATTGTG GTATCAACAATGGAGGATGCTGGAGAGAATCCCGAGGTGGCTTCACTTAC TCTGCTTGCGTAgtaagtatttcgcttctatgtcaatgtgtttgaacaatcatcttcttaacctaaaataaaa cttctttcgttcttgaaatggatttttcagGATGATCATTCAAAGGATTGCAAATGCCCACTT GGGTTCAAGGGCGATGGAGTGAAGAACTGTGAAGgtacttgctaagtattatttacaaa atatatgcatgtggtctacttgcatatataaagagctctatgtcgatttatttagtaggcatcactagtaacatgctat tagcctattgtttatatgcatgcatgtcggtctagaatatttagtgcagatacacaattatcagtggaattttttattgta ctcaatcaaagaagtgagatcattgttaaacgtcaaattcaaattcataatcatattcttttgattgaatgtgttagA TGTGGACGAGTGCAAAGAAAAAACGGTGTGCCAGTGCCCAGAGTGTAAAT GTAAAAACACTTGGGGAAGTTATGAATGCAGCTGCAGCAACGGTTTGCTTT ACATGCGTGAGCACGACACTTGCATAGgtaaaagaaatagatcttccttcctcctagcccca tacttaatttggagtcattaatcagactttctgcttttgaatctggtgaatgtgtagGTTCAGGCAAAGTTG GAACCACAAAACTCAGCTGGAGCTTTTTGTGGATCCTTATAATCGGGGTGG GTGTTGCAGGTCTTTCTGGATATGCAGTCTACAAATACAGAATCAGGgtatac gatgttctttactctggtcttgttatcatattaaaacataggaagattttgcggtctagatagagatgctaagtaaca attgaaattatgcagAGTTACATGGATGCGGAAATTAGAGGGATCATGGCACAGT ACATGCCATTGGAAAGTCAACCACCCAACACAAGTGGTCACCATATGGATA TATGAgtgaagttatcaacacaagccaagagactctctctctttctatgagtagctttaagtttccatgtaaatg aatcaatggacttgatacaactctagagaaaagaaaaaagccacaaacagaaataagtgtacctgcactatt tgtaagaaatgtggatttgcagatggaatctcctttttgtctgcatattactcttacaatgtaaaatagacataactg gatgaataatgagtgttacattatttactgttgaaactaacttaagatccataagaagaaaaacaataaactttga ttgagagattg 136 9.2 Oleosin promoter used for transgenic expression in A. thaliana aactcgAGGAACTCTCTGGTAAGCTAGCTCCACTCCCCAGAAACAACCGGCG CCAAATTGCGCGAATTGCTGACCTGAAGACGGAACATCATCGTCGGGTCC TTGGGCGATTGCGGCGGAAGATGGGTCAGCTTGGGCTTGAGGACGAGAC CCGAATCCGAGTCTGTTGAAAAGGTTGTTCATTGGGGATTTGTATACGGAG ATTGGTCGTCGAGAGGTTTGAGGGAAAGGACAAATGGGTTTGGCTCTGGA GAAAGAGAGTGCGGCTTTAGAGAGAGAATTGAGAGGTTTAGAGAGAGATG CGGCGGCGATGAGCGGAGGAGAGACGACGAGGACCTGCATTATCAAAGC AGTGACGTGGTGAAATTTGGAACTTTTAAGAGGCAGATAGATTTATTATTTG TATCCATTTTCTTCATTGTTCTAGAATGTCGCGGAACAAATTTTAAAACTAAA TCCTAAATTTTTCTAATTTTGTTGCCAATAGTGGATATGTGGGCCGTATAGA AGGAATCTATTGAAGGCCCAAACCCATACTGACGAGCCCAAAGGTTCGTTT TGCGTTTTATGTTTCGGTTCGATGCCAACGCCACATTCTGAGCTAGGCAAA AAACAAACGTGTCTTTGAATAGACTCCTCTCGTTAACACATGCAGCGGCTG CATGGTGACGCCATTAACACGTGGCCTACAATTGCATGATGTCTCCATTGA CACGTGACTTCTCGTCTCCTTTCTTAATATATCTAACAAACACTCCTACCTC TTCCAAAATATATACACATCTTTTTGATCAATCTCTCATTCAAAATCTCATTC TCTCTAGTAAACAAGAACAAAAAAatcgat 137 9.3 Synthetic gene sequences from Chapter 5 PawS1 atcgatATGGCTAAGCTCATCATCCTCGTGGTGCTTGCTATCCTCGCTTTCGT TGAGGTTTCAGTGTCTGGATACAAGACCTCTATCTCTACCATCACCATCGA GGATAACGGAAGGTGTACCAAGTCTATCCCTCCTATCTGCTTCCCTGATGG ACTCGATAACCCTAGAGGATGCCAGATCAGAATCCAGCAGCTTAACCATTG CCAGATGCACCTCACCTCATTCGATTACAAGCTCAGGATGGCTGTTGAGAA CCCTAAGCAACAGCAGCACCTTTCACTTTGCTGCAACCAGCTTCAAGAGGT TGAGAAGCAGTGTCAGTGCGAGGCTATCAAGCAAGTTGTTGAGCAAGCTC AGAAGCAGCTTCAACAAGGACAAGGTGGACAACAGCAGGTTCAGCAGATG GTGAAGAAAGCTCAGATGCTCCCTAACCAGTGCAACCTCCAATGCTCTATC TGATGAgagctc PawS[N65]S1 atcgatATGGCTAAGCTCATCATCCTCGTGGTGCTTGCTATCCTCGCTTTCGT TGAGGTTTCAGTGTCTGGATACAAGACCTCTATCTCTACCATCACCATCGA GGATAACCCTAGGGGATGCCAGATCAGAATCCAGCAGCTTAACGGAAGGT GTACCAAGTCTATCCCTCCTATCTGCTTCCCTGATGGACTCGATAACCATT GCCAGATGCACCTCACCTCATTCGATTACAAGCTCAGGATGGCTGTTGAG AACCCTAAGCAACAGCAGCACCTTTCACTTTGCTGCAACCAGCTTCAAGAG GTTGAGAAGCAGTGTCAGTGCGAGGCTATCAAGCAAGTTGTTGAGCAAGC TCAGAAGCAGCTTCAACAAGGACAAGGTGGACAACAGCAGGTTCAGCAGA TGGTGAAGAAAGCTCAGATGCTCCCTAACCAGTGCAACCTCCAATGCTCTA TCTGATGAgagctc PawS[N84]S1 atcgatATGGCTAAGCTCATCATCCTCGTGGTGCTTGCTATCCTCGCTTTCGT TGAGGTTTCAGTGTCTGGATACAAGACCTCTATCTCTACCATCACCATCGA GGATAACCCTAGGGGATGCCAGATCAGAATCCAGCAGTTGAACCATTGCC AGATGCACCTCACCTCATTCGATTACAAGCTCAGGATGGCTGTTGAGAACG GAAGGTGTACCAAGTCTATCCCTCCTATCTGCTTCCCTGATGGACTCGATA ACCCAAAGCAACAGCAGCACCTTTCACTCTGCTGCAACCAGCTTCAAGAG GTTGAGAAGCAGTGTCAGTGCGAGGCTATCAAGCAAGTTGTTGAGCAAGC TCAGAAGCAGCTTCAACAAGGACAAGGTGGACAACAGCAGGTTCAGCAGA TGGTGAAGAAAGCTCAGATGCTCCCTAACCAGTGCAACCTCCAATGCTCTA TCTGATGAgagctc PawS[N96]S1 atcgatATGGCTAAGCTCATCATCCTCGTGGTGCTTGCTATCCTCGCTTTCGT TGAGGTTTCAGTGTCTGGATACAAGACCTCTATCTCTACCATCACCATCGA GGATAACCCTAGGGGATGCCAGATCAGAATCCAGCAGTTGAACCATTGCC AGATGCACCTCACCTCATTCGATTACAAGCTCAGGATGGCTGTTGAGAACC CTAAGCAACAGCAGCACCTTTCACTTTGCTGCAACGGAAGGTGTACCAAGT CTATCCCTCCTATCTGCTTCCCTGATGGACTCGATAACCAGCTCCAAGAGG TTGAGAAGCAGTGTCAGTGTGAGGCTATCAAGCAAGTGGTTGAGCAAGCT CAGAAGCAGCTTCAACAAGGACAAGGTGGACAACAGCAGGTTCAGCAGAT GGTGAAGAAAGCTCAGATGCTCCCTAACCAGTGCAACCTCCAATGCTCTAT CTGATGAgagctc 138 PawS[N143]S1 atcgatATGGCTAAGCTCATCATCCTCGTGGTGCTTGCTATCCTCGCTTTCGT TGAGGTTTCAGTGTCTGGATACAAGACCTCTATCTCTACCATCACCATCGA GGATAACCCTAGGGGATGCCAGATCAGAATCCAGCAGTTGAACCATTGCC AGATGCACCTCACCTCATTCGATTACAAGCTCAGGATGGCTGTTGAGAACC CTAAGCAACAGCAGCACCTTTCACTTTGCTGCAACCAGCTTCAAGAGGTTG AGAAGCAGTGTCAGTGCGAGGCTATCAAGCAAGTTGTTGAGCAAGCTCAG AAGCAGCTTCAACAAGGACAAGGTGGACAACAGCAGGTTCAGCAGATGGT GAAGAAAGCTCAGATGCTCCCTAACGGAAGGTGTACCAAGTCTATCCCTC CTATCTGCTTCCCTGATGGACTCGATAACCAGTGCAACCTCCAGTGCTCTA TCTGATGAgagctc PawS[N146]S1 atcgatATGGCTAAGCTCATCATCCTCGTGGTGCTTGCTATCCTCGCTTTCGT TGAGGTTTCAGTGTCTGGATACAAGACCTCTATCTCTACCATCACCATCGA GGATAACCCTAGGGGATGCCAGATCAGAATCCAGCAGTTGAACCATTGCC AGATGCACCTCACCTCATTCGATTACAAGCTCAGGATGGCTGTTGAGAACC CTAAGCAACAGCAGCACCTTTCACTTTGCTGCAACCAGCTTCAAGAGGTTG AGAAGCAGTGTCAGTGCGAGGCTATCAAGCAAGTTGTTGAGCAAGCTCAG AAGCAGCTTCAACAAGGACAAGGTGGACAACAGCAGGTTCAGCAGATGGT GAAGAAAGCTCAGATGCTCCCTAACCAGTGCAACGGAAGATGTACCAAGT CTATCCCTCCTATCTGCTTCCCTGATGGACTCGATAACCTCCAGTGCTCTA TCTGATGAgagctc PawS1[Δ2 21] atcgatATGGGATACAAGACCTCTATCTCTACCATCACCATCGAGGATAACGG AAGGTGTACCAAGTCTATCCCTCCTATCTGCTTCCCTGATGGACTCGATAA CCCTAGAGGATGCCAGATCAGAATCCAGCAGCTTAACCATTGCCAGATGC ACCTCACCTCATTCGATTACAAGCTCAGGATGGCTGTTGAGAACCCTAAGC AACAGCAGCACCTTTCACTTTGCTGCAACCAGCTTCAAGAGGTTGAGAAGC AGTGTCAGTGCGAGGCTATCAAGCAAGTTGTTGAGCAAGCTCAGAAGCAG CTTCAACAAGGACAAGGTGGACAACAGCAGGTTCAGCAGATGGTGAAGAA AGCTCAGATGCTCCCTAACCAGTGCAACCTCCAATGCTCTATCTGATGAga gctc SesaS1 atcgatATGGCTAACAAGCTTTTCCTCGTGTGCGCTGCTCTTGCTCTTTGCTT CCTTCTCACCAACGCTTCTATCTACAGGACCGTGGTTGAGTTCGAAGAGGA TGATGCTACCAACGGAAGGTGTACCAAGTCTATCCCTCCTATCTGCTTCCC TGATGGACTCGATAACCCTATCGGACCTAAGATGAGGAAGTGCAGAAAAG AGTTCCAGAAAGAGCAGCACCTCAGAGCTTGTCAGCAGCTTATGCTTCAAC AGGCTAGACAGGGAAGGTCTGATGAGTTCGATTTCGAGGATGATATGGAA AACCCTCAGGGACAGCAGCAAGAACAGCAACTTTTCCAGCAGTGCTGCAA CGAGCTTAGGCAAGAGGAACCTGATTGCGTTTGCCCTACTCTTAAGCAGG CTGCTAAGGCTGTTAGACTCCAAGGACAACATCAGCCTATGCAGGTTAGG AAGATCTACCAGACTGCTAAGCACCTCCCTAACGTGTGCGATATCCCTCAA GTTGATGTGTGCCCTTTCAATATCCCTTCTTTCCCATCTTTCTACTGATGAg agctc 139 SesaS2 atcgatATGGCTAACAAGCTTTTCCTCGTGTGCGCTGCTCTTGCTCTTTGCTT CCTTCTTACCAACGCTGGAAGGTGTACCAAGTCTATCCCTCCTATCTGCTT CCCTGATGGACTCGATAACCCTATCGGACCTAAGATGAGGAAGTGCAGAA AAGAGTTCCAGAAAGAGCAGCACCTCAGAGCTTGTCAGCAGCTTATGCTT CAACAGGCTAGACAGGGAAGGTCTGATGAGTTCGATTTCGAGGATGATAT GGAAAACCCTCAGGGACAGCAGCAAGAACAGCAACTTTTCCAGCAGTGCT GCAACGAGCTTAGGCAAGAGGAACCTGATTGCGTTTGCCCTACTCTTAAG CAGGCTGCTAAGGCTGTTAGACTCCAAGGACAACATCAGCCTATGCAGGT TAGGAAGATCTACCAGACTGCTAAGCACCTCCCTAACGTGTGCGATATCCC TCAAGTTGATGTGTGCCCTTTCAATATCCCTTCTTTCCCATCTTTCTACTGA TGAgagctc CraS1 atcgatATGGCTAGGGTGTCATCTCTCCTCAGTTTCTGCCTTACTCTCCTCAT CCTCTTCCACGGATATGCTGCTGGAAGGTGTACCAAGTCTATCCCTCCTAT CTGCTTCCCTGATGGACTCGATAACCAGCAAGGACAACAAGGACAGCAGT TCCCTAACGAGTGTCAGCTTGATCAGCTTAACGCTCTCGAGCCTTCTCACG TTTTGAAGTCTGAGGCTGGAAGAATCGAGGTGTGGGATCATCATGCTCCT CAGCTTAGATGCTCTGGTGTGTCTTTCGCTAGATATATCATCGAGTCTAAG GGACTCTACCTCCCTTCATTCTTCAACACCGCTAAGCTCTCATTCGTGGCT AAGGGAAGAGGACTTATGGGAAAGGTTATCCCTGGATGCGCTGAGACTTT CCAGGATTCTTCTGAGTTCCAGCCTAGGTTCGAGGGACAAGGACAATCTC AGAGGTTCAGGGATATGCACCAGAAGGTTGAGCACATCAGGTCTGGTGAT ACAATCGCTACTACTCCTGGTGTGGCTCAGTGGTTCTACAACGATGGACAA GAGCCTCTCGTGATCGTGTCTGTTTTCGATCTCGCTTCTCACCAGAACCAG CTCGATAGAAACCCTAGGCCTTTCTACCTCGCTGGAAACAACCCTCAAGGA CAGGTTTGGCTTCAGGGAAGAGAGCAACAGCCTCAAAAGAACATCTTCAA CGGATTCGGACCTGAGGTGATCGCTCAGGCTCTTAAGATTGATCTCCAGA CTGCTCAGCAGCTTCAGAACCAGGATGATAACAGGGGTAACATCGTGAGA GTGCAGGGACCTTTCGGAGTTATCAGACCTCCACTCAGAGGACAAAGGCC TCAAGAAGAAGAAGAGGAAGAAGGTAGACACGGAAGGCATGGAAACGGA CTCGAAGAGACTATCTGCTCTGCTAGGTGCACCGATAACCTCGATGATCCT TCTAGGGCTGATGTGTACAAGCCTCAGCTCGGTTACATCTCTACCCTCAAC TCTTACGATCTCCCTATCCTCAGATTCATCAGGCTCAGTGCTCTCAGGGGT TCTATCAGACAGAACGCTATGGTTCTCCCTCAGTGGAACGCTAACGCTAAT GCTATCCTCTACGTGACCGATGGTGAGGCTCAAATCCAGATCGTGAACGA TAACGGAAACAGGGTGTTCGATGGACAGGTGTCACAGGGACAACTTATCG CTGTTCCTCAGGGATTCTCTGTGGTGAAGAGAGCTACCTCTAACAGGTTCC AGTGGGTTGAGTTCAAGACCAACGCAAACGCTCAGATCAACACCCTCGCT GGTAGAACTTCTGTGCTTAGAGGACTCCCTTTGGAGGTGATCACTAACGG ATTCCAGATCTCTCCTGAAGAGGCTAGAAGGGTGAAGTTCAACACTCTCGA GACTACCCTCACCCACTCATCTGGACCTGCTTCATATGGTAGACCTAGAGT GGCTGCTGCTTGATGAgagctc 140 PapS1 atcgatATGGCTATCACCAACAAGTTGATTATCACCCTCTTGCTCCTCATCTCT ATCGCTGTGGTGCATTCTGGAAGGTGTACCAAGTCTATCCCTCCTATCTGC TTCCCTGATGGACTCGATAACCTCTCATTCAGGGTTGAGATTGATGAGTTC GAGCCTCCTCAACAGGGTGAACAAGAAGGACCTAGAAGAAGGCCTGGTG GTGGATCTGGTGAAGGATGGGAAGAAGAGTCTACCAACCACCCTTACCAC TTCAGAAAGAGATCTTTCTCAGATTGGTTCCAGTCTAAAGAGGGATTCGTT AGGGTTCTCCCTAAGTTTACCAAGCACGCTCCTGCTTTGTTCAGGGGAATC GAGAACTACAGGTTCTCACTCGTTGAGATGGAACCTACCACCTTCTTCGTG CCTCATCACCTTGATGCTGATGCTGTGTTCATCGTGCTTCAGGGAAAGGGT GTGATCGAGTTCGTGACCGATAAGACCAAAGAGTCTTTCCACATCACCAAG GGTGATGTTGTGAGGATCCCTTCTGGTGTGACCAACTTCATCACTAACACC AACCAGACCGTGCCTCTTAGACTCGCTCAAATCACTGTGCCTGTGAACAAC CCTGGAAACTACAAGGATTACTTCCCTGCTGCTTCTCAGTTCCAGCAGTCT TACTTCAACGGATTCACCAAAGAGGTGCTCTCAACCTCTTTCAACGTGCCA GAGGAACTTCTCGGTAGACTCGTGACCAGGTCAAAAGAGATTGGACAGGG AATTATCAGAAGGATCTCTCCTGATCAGATCAAAGAGCTTGCTGAGCACGC TACCTCTCCATCTAACAAGCACAAGGCTAAGAAAGAGAAAGAAGAGGATAA GGATCTCAGGACCCTCTGGACTCCTTTCAACCTTTTCGCTATTGATCCTAT CTACTCTAACGATTTCGGACACTTCCACGAGGCTCACCCTAAGAACTACAA CCAGCTTCAGGATCTCCATATCGCTGCTGCTTGGGCTAATATGACTCAGG GATCTCTTTTCCTCCCTCACTTCAACTCTAAGACCACCTTCGTGACCTTCGT TGAGAACGGATGTGCTAGGTTCGAGATGGCTACCCCTTACAAGTTCCAAA GGGGTCAACAACAATGGCCTGGACAGGGTCAAGAGGAAGAGGAAGATAT GTCTGAGAACGTGCACAAGGTGGTGTCTAGAGTGTGCAAGGGTGAGGTGT TCATTGTGCCTGCTGGTCACCCTTTCACCATCTTGTCTCAGGATCAGGATT TCATTGCTGTGGGATTCGGAATCTACGCTACCAACTCAAAGAGGACCTTCC TCGCTGGTGAAGAGAACCTTCTCTCTAACCTTAACCCAGCTGCTACCAGG GTTACCTTCGGAGTTGGATCTAAGGTTGCAGAGAAGCTTTTCACCTCTCAG AACTACTCTTACTTCGCTCCTACCTCTAGATCTCAGCAGCAGATTCCAGAG AAGCACAAGCCATCTTTCCAGTCAATCCTCGATTTCGCTGGATTCTGATGA gagctc LeaS1 atcgatATGGGACTCGAGAGGAAGGTGTACGGACTCGTTATGGTTTCTCTCG TGCTCATGGCTATCGCTACCATGTCATCTGTTCAGGCTGGTAGGTGTACCA AGTCTATCCCTCCTATCTGCTTCCCTGATGGACTCGATAACACCATCGAAG AAGAGGCTGCTAAGGATGAGTCTTGGACCGATTGGGCTAAAGAGAAGATC GGACTCAAGCACGAGGATAACATCCAGCCTACTCACACTACTACCACCGT GCAAGATGATGCTTGGAGGGCTTCTCAAAAGGCTGAGGATGCTAAAGAGG CAGCAAAGAGGAAGGCTGAAGAAGCTGTTGGAGCTGCAAAAGAGAAGGC AGGATCTGCTTACGAGACTGCTAAGTCTAAGGTGGAAGAGGGACTCGCTT CTGTGAAGGATAAGGCTTCTCAGTCTTACGATTCTGCTGGACAGGTGAAG GATGATGTGTCTCACAAGTCTAAGCAGGTTAAGGATTCTCTCAGTGGTGAT GAGAACGATGAGAGTTGGACCGGATGGGCAAAAGAAAAAATTGGTATTAA GAACGAGGATATCAACTCTCCTAACCTCGGAGAGACTGTGTCTGAGAAGG CAAAAGAGGCTAAAGAAGCAGCTAAAAGAAAAGCTGGTGATGCTAAAGAA AAGCTCGCTGAGACTGTGGAAACCGCAAAAGAAAAGGCTTCTGATATGAC CTCTGCTGCTAAAGAGAAAGCAGAGAAGCTCAAAGAAGAAGCAGAGAGGG AATCTAAGTCAGCAAAAGAGAAAATCAAAGAGTCTTACGAAACCGCTAAGA GTAAGGCTGATGAGACACTCGAGTCAGCTAAAGATAAGGCTAGTCAGAGT 141 TACGATAGTGCTGCTAGGAAGTCAGAGGAAGCTAAGGATACCGTTTCTCA CAAGAGTAAGAGGGTGAAAGAATCTCTCACCGATGATGATGCTGAGCTTT GATGAgagctc OleoS1 atcgatATGGGAAGGTGTACCAAGTCTATCCCTCCTATCTGCTTCCCTGATGG ACTCGATAACCCTGCTGATACTGCTAGAGGAACCCACCACGATATCATCG GAAGAGATCAGTACCCTATGATGGGAAGGGATAGGGATCAATACCAGATG TCTGGAAGGGGATCTGATTACTCTAAGTCTAGGCAGATCGCTAAGGCTGC TACTGCTGTTACAGCTGGTGGATCTCTTCTCGTGCTCTCTTCTCTTACTCTC GTGGGAACTGTGATCGCTCTTACTGTTGCTACTCCTCTCCTCGTGATCTTC TCTCCTATCCTTGTGCCTGCTCTTATCACCGTGGCTTTGCTCATTACCGGA TTCCTCTCATCTGGTGGATTCGGAATCGCTGCAATCACCGTGTTCTCTTGG ATCTACAAGTACGCTACTGGTGAGCACCCTCAGGGTTCTGATAAGCTCGAT TCTGCTAGGATGAAGCTCGGATCTAAGGCTCAGGATCTTAAGGATAGGGC TCAGTACTACGGACAGCAGCATACTGGTGGTGAGCATGATAGAGATAGAA CCAGAGGTGGACAGCACACTACCTGATGAgagctc Paw4SFTI atcgatATGGCTAAGCTCATCATCCTCGTGGTGCTTGCTATCCTCGCTTTCGT TGAGGTTTCAGTGTCTGGATACAAGACCTCTATCTCTACCATCACCATCGA GGATAACGGAAGGTGTACCAAGTCTATCCCTCCTATCTGCTTCCCTGATGG ACTCGATAACGGTAGATGTACTAGATCAATTCCACCAATCTGTTTCCCAGA TGGTTTGGATAATGGTAGGTGTACAAAGAGTATCCCACCAATTTGCTATCC AGATGGTCTTGATAACGGAAGATGTACAAAAAGTATCCCTCCTGTGTGTTT TCCTGATGGATTGGATAACCCTAGGGGATGCCAGATCAGAATCCAGCAGC TTAACCATTGCCAGATGCACCTCACCTCATTCGATTACAAGCTCAGGATGG CTGTTGAGAACCCTAAGCAACAGCAGCACCTTTCACTTTGCTGCAACCAGC TTCAAGAGGTTGAGAAGCAGTGTCAGTGCGAGGCTATCAAGCAAGTTGTT GAGCAAGCTCAGAAGCAGCTTCAACAAGGACAAGGTGGACAACAGCAGGT TCAGCAGATGGTGAAGAAAGCTCAGATGCTCCCTAACCAGTGCAACCTCC AGTGCTCTATCTGATGAgagctc 4SFTI atcgatATGGCTAAGCTCATCATCCTCGTGGTGCTTGCTATCCTCGCTTTCGT TGAGGTTTCAGTGTCTGGATACAAGACCTCTATCTCTACCATCACCATCGA GGATAACGGAAGGTGTACCAAGTCTATCCCTCCTATCTGCTTCCCTGATGG ACTCGATAACGGTAGATGTACTAGATCAATTCCACCAATCTGTTTCCCAGA TGGTTTGGATAATGGTAGGTGTACAAAGAGTATCCCACCAATTTGCTATCC AGATGGTCTTGATAACGGAAGATGTACAAAAAGTATCCCTCCTGTGTGTTT TCCTGATGGATTGGATAACTGATGAgagctc 3SFTI atcgatATGGCTAAGCTCATCATCCTCGTGGTGCTTGCTATCCTCGCTTTCGT TGAGGTTTCAGTGTCTGGATACAAGACCTCTATCTCTACCATCACCATCGA GGATAACGGAAGGTGTACCAAGTCTATCCCTCCTATCTGCTTCCCTGATGG ACTCGATAACGGTAGATGTACTAGATCAATTCCACCAATCTGTTTCCCAGA TGGTTTGGATAATGGTAGGTGTACAAAGAGTATCCCACCAATTTGCTATCC AGATGGTCTTGATAACTGATGAgagctc 142 2SFTI atcgatATGGCTAAGCTCATCATCCTCGTGGTGCTTGCTATCCTCGCTTTCGT TGAGGTTTCAGTGTCTGGATACAAGACCTCTATCTCTACCATCACCATCGA GGATAACGGAAGGTGTACCAAGTCTATCCCTCCTATCTGCTTCCCTGATGG ACTCGATAACGGTAGATGTACTAGATCAATTCCACCAATCTGTTTCCCAGA TGGTTTGGATAATTGATGAgagctc 1SFTI atcgatATGGCTAAGCTCATCATCCTCGTGGTGCTTGCTATCCTCGCTTTCGT TGAGGTTTCAGTGTCTGGATACAAGACCTCTATCTCTACCATCACCATCGA GGATAACGGAAGGTGTACCAAGTCTATCCCTCCTATCTGCTTCCCTGATGG ACTCGATAACTGATGAgagctc 143 9.4 Chimeric protein precursors from Chapter 5 The above sequences are amino acid sequences of a synthetic PawS1, PawS1 derivatives with SFTI-GLDN inserted after alternative asparagine sites, synthetic proteins that combine an abundant seed protein with SFTI-GLDN inserted after the ER signal (or after start Met for Oleosin), a synthetic PawS1 with four tandem SFTI-GLDN variants and a suite of derivatives lacking PawS1 albumin and including from four down to one SFTI-GLDN (or variants). In this last cases, each SFTI-1 variant sequence has one alternative residue that alters the peptide mass and that does not interfere with peptide processing. Sequences are colour coded to denote ER signal (rose), SFTI-1 (aqua), PawS1 albumin small subunit (green) and large subunit (orange) or other seed protein (brown). Spacer regions or the GLDN tail are coloured black. Mutations to the usual sequence of SFTI-1 are shown in red. PawS1 [151 residues] MAKLIILVVLAILAFVEVSVSGYKTSISTITIEDNGRCTKSIPPICFPDGLDNPRGCQI RIQQLNHCQMHLTSFDYKLRMAVENPKQQQHLSLCCNQLQEVEKQCQCEAIKQVVEQAQ KQLQQGQGGQQQVQQMVKKAQMLPNQCNLQCSI PawS[N65]S1 [151 residues] MAKLIILVVLAILAFVEVSVSGYKTSISTITIEDNPRGCQIRIQQLNGRCTKSIPPICF PDGLDNHCQMHLTSFDYKLRMAVENPKQQQHLSLCCNQLQEVEKQCQCEAIKQVVEQAQ KQLQQGQGGQQQVQQMVKKAQMLPNQCNLQCSI PawS[N84]S1 [151 residues] MAKLIILVVLAILAFVEVSVSGYKTSISTITIEDNPRGCQIRIQQLNHCQMHLTSFDYK LRMAVENGRCTKSIPPICFPDGLDNPKQQQHLSLCCNQLQEVEKQCQCEAIKQVVEQAQ KQLQQGQGGQQQVQQMVKKAQMLPNQCNLQCSI PawS[N96]S1 [151 residues] MAKLIILVVLAILAFVEVSVSGYKTSISTITIEDNPRGCQIRIQQLNHCQMHLTSFDYK LRMAVENPKQQQHLSLCCNGRCTKSIPPICFPDGLDNQLQEVEKQCQCEAIKQVVEQAQ KQLQQGQGGQQQVQQMVKKAQMLPNQCNLQCSI PawS[N143]S1 [151 residues] MAKLIILVVLAILAFVEVSVSGYKTSISTITIEDNPRGCQIRIQQLNHCQMHLTSFDYK LRMAVENPKQQQHLSLCCNQLQEVEKQCQCEAIKQVVEQAQKQLQQGQGGQQQVQQMVK KAQMLPNGRCTKSIPPICFPDGLDNQCNLQCSI PawS[N146]S1 [151 residues] MAKLIILVVLAILAFVEVSVSGYKTSISTITIEDNPRGCQIRIQQLNHCQMHLTSFDYK LRMAVENPKQQQHLSLCCNQLQEVEKQCQCEAIKQVVEQAQKQLQQGQGGQQQVQQMVK KAQMLPNQCNGRCTKSIPPICFPDGLDNLQCSI 144 PawS1[Δ2-21] [131 residues] MGYKTSISTITIEDNGRCTKSIPPICFPDGLDNPRGCQIRIQQLNHCQMHLTSFDYKLR MAVENPKQQQHLSLCCNQLQEVEKQCQCEAIKQVVEQAQKQLQQGQGGQQQVQQMVKKA QMLPNQCNLQCSI SesaS1 [182 residues] MANKLFLVCAALALCFLLTNASIYRTVVEFEEDDATNGRCTKSIPPICFPDGLDNPIGP KMRKCRKEFQKEQHLRACQQLMLQQARQGRSDEFDFEDDMENPQGQQQEQQLFQQCCNE LRQEEPDCVCPTLKQAAKAVRLQGQHQPMQVRKIYQTAKHLPNVCDIPQVDVCPFNIPS FPSFY SesaS2 [166 residues] MANKLFLVCAALALCFLLTNAGRCTKSIPPICFPDGLDNPIGPKMRKCRKEFQKEQHLR ACQQLMLQQARQGRSDEFDFEDDMENPQGQQQEQQLFQQCCNELRQEEPDCVCPTLKQA AKAVRLQGQHQPMQVRKIYQTAKHLPNVCDIPQVDVCPFNIPSFPSFY CraS1 [490 residues] MARVSSLLSFCLTLLILFHGYAAGRCTKSIPPICFPDGLDNQQGQQGQQFPNECQLDQL NALEPSHVLKSEAGRIEVWDHHAPQLRCSGVSFARYIIESKGLYLPSFFNTAKLSFVAK GRGLMGKVIPGCAETFQDSSEFQPRFEGQGQSQRFRDMHQKVEHIRSGDTIATTPGVAQ WFYNDGQEPLVIVSVFDLASHQNQLDRNPRPFYLAGNNPQGQVWLQGREQQPQKNIFNG FGPEVIAQALKIDLQTAQQLQNQDDNRGNIVRVQGPFGVIRPPLRGQRPQEEEEEEGRH GRHGNGLEETICSARCTDNLDDPSRADVYKPQLGYISTLNSYDLPILRFIRLSALRGSI RQNAMVLPQWNANANAILYVTDGEAQIQIVNDNGNRVFDGQVSQGQLIAVPQGFSVVKR ATSNRFQWVEFKTNANAQINTLAGRTSVLRGLPLEVITNGFQISPEEARRVKFNTLETT LTHSSGPASYGRPRVAAA PapS1 [504 residues] MAITNKLIITLLLLISIAVVHSGRCTKSIPPICFPDGLDNLSFRVEIDEFEPPQQGEQE GPRRRPGGGSGEGWEEESTNHPYHFRKRSFSDWFQSKEGFVRVLPKFTKHAPALFRGIE NYRFSLVEMEPTTFFVPHHLDADAVFIVLQGKGVIEFVTDKTKESFHITKGDVVRIPSG VTNFITNTNQTVPLRLAQITVPVNNPGNYKDYFPAASQFQQSYFNGFTKEVLSTSFNVP EELLGRLVTRSKEIGQGIIRRISPDQIKELAEHATSPSNKHKAKKEKEEDKDLRTLWTP FNLFAIDPIYSNDFGHFHEAHPKNYNQLQDLHIAAAWANMTQGSLFLPHFNSKTTFVTF VENGCARFEMATPYKFQRGQQQWPGQGQEEEEDMSENVHKVVSRVCKGEVFIVPAGHPF TILSQDQDFIAVGFGIYATNSKRTFLAGEENLLSNLNPAATRVTFGVGSKVAEKLFTSQ NYSYFAPTSRSQQQIPEKHKPSFQSILDFAGF LeaS1 [316 residues] MGLERKVYGLVMVSLVLMAIATMSSVQAGRCTKSIPPICFPDGLDNTIEEEAAKDESWT DWAKEKIGLKHEDNIQPTHTTTTVQDDAWRASQKAEDAKEAAKRKAEEAVGAAKEKAGS AYETAKSKVEEGLASVKDKASQSYDSAGQVKDDVSHKSKQVKDSLSGDENDESWTGWAK EKIGIKNEDINSPNLGETVSEKAKEAKEAAKRKAGDAKEKLAETVETAKEKASDMTSAA KEKAEKLKEEAERESKSAKEKIKESYETAKSKADETLESAKDKASQSYDSAARKSEEAK DTVSHKSKRVKESLTDDDAEL OleoS1 [192 residues] MGRCTKSIPPICFPDGLDNPADTARGTHHDIIGRDQYPMMGRDRDQYQMSGRGSDYSKS RQIAKAATAVTAGGSLLVLSSLTLVGTVIALTVATPLLVIFSPILVPALITVALLITGF LSSGGFGIAAITVFSWIYKYATGEHPQGSDKLDSARMKLGSKAQDLKDRAQYYGQQHTG GEHDRDRTRGGQHTT 145 Paw4SFTI [205 residues] MAKLIILVVLAILAFVEVSVSGYKTSISTITIEDNGRCTKSIPPICFPDGLDNGRCTRS IPPICFPDGLDNGRCTKSIPPICYPDGLDNGRCTKSIPPVCFPDGLDNPRGCQIRIQQL NHCQMHLTSFDYKLRMAVENPKQQQHLSLCCNQLQEVEKQCQCEAIKQVVEQAQKQLQQ GQGGQQQVQQMVKKAQMLPNQCNLQCSI 4SFTI [107 residues] MAKLIILVVLAILAFVEVSVSGYKTSISTITIEDNGRCTKSIPPICFPDGLDNGRCTRS IPPICFPDGLDNGRCTKSIPPICYPDGLDNGRCTKSIPPVCFPDGLDN 3SFTI [89 residues] MAKLIILVVLAILAFVEVSVSGYKTSISTITIEDNGRCTKSIPPICFPDGLDNGRCTRS IPPICFPDGLDNGRCTKSIPPICYPDGLDN 2SFTI [71 residues] MAKLIILVVLAILAFVEVSVSGYKTSISTITIEDNGRCTKSIPPICFPDGLDNGRCTRS IPPICFPDGLDN 1SFTI [53 residues] MAKLIILVVLAILAFVEVSVSGYKTSISTITIEDNGRCTKSIPPICFPDGLDN 146 References Abdelsamad, A. and A. Pecinka (2014). "Pollen-specific activation of Arabidopsis retrogenes is associated with global transcriptional reprogramming." Plant Cell 26(8): 3299-3313. Aboye, T. L., H. Ha, S. Majumder, F. Christ, Z. Debyser, A. Shekhtman, N. Neamati and J. A. Camarero (2012). "Design of a novel cyclotide-based CXCR4 antagonist with anti-human immunodeficiency virus (HIV)-1 activity." Journal of Medicinal Chemistry 55(23): 10729-10734. Akiva, P., A. Toporik, S. Edelheit, Y. Peretz, A. Diber, R. Shemesh, A. Novik and R. Sorek (2006). "Transcription-mediated gene fusion in the human genome." Genome Research 16(1): 30-36. Arguello, J. R., Y. Chen, S. Yang, W. Wang and M. Long (2006). "Origination of an X-linked testes chimeric gene by illegitimate recombination in Drosophila." PLoS Genetics 2(5): e77. Arnison, P. G., M. J. Bibb, G. Bierbaum, A. A. Bowers, T. S. Bugni, G. Bulaj, J. A. Camarero, D. J. Campopiano, G. L. Challis, J. Clardy, P. D. Cotter, D. J. Craik, M. Dawson, E. Dittmann, S. Donadio, P. C. Dorrestein, K.-D. Entian, M. A. Fischbach, J. S. Garavelli, U. Goransson, C. W. Gruber, D. H. Haft, T. K. Hemscheidt, C. Hertweck, C. Hill, A. R. Horswill, M. Jaspars, W. L. Kelly, J. P. Klinman, O. P. Kuipers, A. J. Link, W. Liu, M. A. Marahiel, D. A. Mitchell, G. N. Moll, B. S. Moore, R. Muller, S. K. Nair, I. F. Nes, G. E. Norris, B. M. Olivera, H. Onaka, M. L. Patchett, J. Piel, M. J. T. Reaney, S. Rebuffat, R. P. Ross, H.-G. Sahl, E. W. Schmidt, M. E. Selsted, K. Severinov, B. Shen, K. Sivonen, L. Smith, T. Stein, R. D. Sussmuth, J. R. Tagg, G.-L. Tang, A. W. Truman, J. C. Vederas, C. T. Walsh, J. D. Walton, S. C. Wenzel, J. M. Willey and W. A. van der Donk (2013). "Ribosomally synthesized and post-translationally modified peptide natural products: overview and recommendations for a universal nomenclature." Natural Product Reports 30(1): 108-160. Autelitano, D. J., A. Rajic, A. I. Smith, M. C. Berndt, L. L. Ilag and M. Vadas (2006). "The cryptome: a subset of the proteome, comprising cryptic peptides with distinct bioactivities." Drug Discovery Today 11(7-8): 306-314. Babasaki, K., T. Takao, Y. Shimonishi and K. Kurahashi (1985). "Subtilosin A, a New Antibiotic Peptide Produced by Bacillus subtilis 168: Isolation, Structural Analysis, and Biogenesis." Journal of Biochemistry 98(3): 585-603. Babushok, D. V., K. Ohshima, E. M. Ostertag, X. Chen, Y. Wang, P. K. Mandal, N. Okada, C. S. Abrams and H. H. Kazazian, Jr. (2007). "A novel testis ubiquitin-binding protein gene arose by exon shuffling in hominoids." Genome Research 17(8): 1129-1138. Bai, Y., C. Casola, C. Feschotte and E. Betrán (2007). "Comparative genomics reveals a constant rate of origination and convergent acquisition of functional retrogenes in Drosophila." Genome Biology 8(1): R11-R11. 147 Bailey, J. A., G. Liu and E. E. Eichler (2003). "An Alu Transposition Model for the Origin and Expansion of Human Segmental Duplications." American Journal of Human Genetics 73(4): 823-834. Bantscheff, M., S. Lemeer, M. M. Savitski and B. Kuster (2012). "Quantitative mass spectrometry in proteomics: critical review update from 2007 to the present." Analytical and Bioanalytical Chemistry 404(4): 939-965. Barber, C. J. S., P. T. Pujara, D. W. Reed, S. Chiwocha, H. Zhang and P. S. Covello (2013). "The two-step biosynthesis of cyclic peptides from linear precursors in a member of the plant family Caryophyllaceae involves cyclization by a serine protease-like enzyme." Journal of Biological Chemistry 288(18): 12500-12510. Barbeta, B. L., A. T. Marshall, A. D. Gillon, D. J. Craik and M. A. Anderson (2008). "Plant cyclotides disrupt epithelial cells in the midgut of lepidopteran larvae." Proceedings of the National Academy of Sciences of the United States of America 105(4): 1221-1225. Barry, D. G., N. L. Daly, R. J. Clark, L. Sando and D. J. Craik (2003). "Linearization of a naturally occurring circular protein maintains structure but eliminates hemolytic activity." Biochemistry 42(22): 6688-6695. Bechtold, N., J. Ellis and G. Pelletier (1993). "In planta Agrobacterium-mediated gene transfer by infiltration of adult Arabidopsis thaliana plants." Comptes rendus de l'Académie des sciences. Série III, Sciences de la vie 316: 11941199. Becker, C., J. Hagmann, J. Muller, D. Koenig, O. Stegle, K. Borgwardt and D. Weigel (2011). "Spontaneous epigenetic variation in the Arabidopsis thaliana methylome." Nature 480(7376): 245-249. Begun, D. J., H. A. Lindfors, A. D. Kern and C. D. Jones (2007). "Evidence for de novo evolution of testis-expressed genes in the Drosophila yakuba/Drosophila erecta clade." Genetics 176(2): 1131-1137. Beiko, R. G., T. J. Harlow and M. A. Ragan (2005). "Highways of gene sharing in prokaryotes." Proceedings of the National Academy of Sciences of the United States of America 102(40): 14332-14337. Benk, A. S. and C. Roesli (2012). "Label-free quantification using MALDI mass spectrometry: considerations and perspectives." Analytical and Bioanalytical Chemistry 404(4): 1039-1056. Bernath-Levin, K., C. Nelson, Alysha G. Elliott, Achala S. Jayasena, A. H. Millar, David J. Craik and Joshua S. Mylne (2015). "Peptide macrocyclization by a bifunctional endoprotease." Chemistry & Biology 22(5): 571-582. Betran, E. and M. Long (2003). "Dntf-2r, a young Drosophila retroposed gene with specific male expression under positive Darwinian selection." Genetics 164(3): 977-988. Black, D. L. (1998). "Splicing in the inner ear: a familiar tune, but what are the instruments?" Neuron 20(2): 165-168. Blackmer, T. M. and J. S. Schepers (1995). "Use of a chlorophyll meter to monitor nitrogen status and schedule fertigation for corn." Journal of Production Agriculture 8(1). 148 Blencowe, B. J. (2006). "Alternative splicing: new insights from global analyses." Cell 126(1): 37-47. Blond, A., J. Péduzzi, C. Goulard, M. J. Chiuchiolo, M. Barthélémy, Y. Prigent, R. A. Salomón, R. N. Farías, F. Moreno and S. Rebuffat (1999). "The cyclic structure of microcin J25, a 21-residue peptide antibiotic from Escherichia coli." European Journal of Biochemistry 259(3): 747-756. Bode, W. and R. Huber (1992). "Natural protein proteinase inhibitors and their interaction with proteinases." European Journal of Biochemistry 204(2): 433451. Bonen, L. (1993). "Trans-splicing of pre-mRNA in plants, animals, and protists." FASEB Journal 7(1): 40-46. Bornberg-Bauer, E., A. K. Huylmans and T. Sikosek (2010). "How do new proteins arise?" Current Opinion in Structural Biology 20(3): 390-396. Boy, R. G., W. Mier, E. M. Nothelfer, A. Altmann, M. Eisenhut, H. Kolmar, M. Tomaszowski, S. Kramer and U. Haberkorn (2010). "Sunflower trypsin inhibitor 1 derivatives as molecular scaffolds for the development of novel peptidic radiopharmaceuticals." Mol Imaging Biol 12(4): 377-385. Boyko, A., T. Blevins, Y. Yao, A. Golubov, A. Bilichak, Y. Ilnytskyy, J. Hollander, F. Meins, Jr. and I. Kovalchuk (2010). "Transgenerational adaptation of Arabidopsis to stress requires DNA methylation and the function of Dicer-like proteins." PLoS ONE 5(3): e9514. Breeze, E., E. Harrison, S. McHattie, L. Hughes, R. Hickman, C. Hill, S. Kiddle, Y. S. Kim, C. A. Penfold, D. Jenkins, C. Zhang, K. Morris, C. Jenner, S. Jackson, B. Thomas, A. Tabrett, R. Legaie, J. D. Moore, D. L. Wild, S. Ott, D. Rand, J. Beynon, K. Denby, A. Mead and V. Buchanan-Wollaston (2011). "High-resolution temporal profiling of transcripts during Arabidopsis leaf senescence reveals a distinct chronology of processes and regulation." Plant Cell 23(3): 873-894. Brennan, G., Y. Kozyrev and S. L. Hu (2008). "TRIMCyp expression in Old World primates Macaca nemestrina and Macaca fascicularis." Proceedings of the National Academy of Sciences of the United States of America 105(9): 3569-3574. Brennicke, A., A. Marchfelder and S. Binder (1999). "RNA editing." FEMS Microbiology Reviews 23(3): 297-316. Brosius, J. (1991). "Retroposons: seeds of evolution." Science 251(4995): 753. Buchanan-Wollaston, V., S. Earl, E. Harrison, E. Mathas, S. Navabpour, T. Page and D. Pink (2003). "The molecular analysis of leaf senescence--a genomics approach." Plant Biotechnology Journal 1(1): 3-22. Cai, J., R. Zhao, H. Jiang and W. Wang (2008). "De novo origination of a new protein-coding gene in Saccharomyces cerevisiae." Genetics 179(1): 487-496. Carrington, J. C. and W. G. Dougherty (1988). "A viral cleavage site cassette: identification of amino acid sequences required for tobacco etch virus polyprotein processing." Proceedings of the National Academy of Sciences of the United States of America 85(10): 3391-3395. 149 Carter, J. J., M. D. Daugherty, X. Qi, A. Bheda-Malge, G. C. Wipf, K. Robinson, A. Roman, H. S. Malik and D. A. Galloway (2013). "Identification of an overprinting gene in Merkel cell polyomavirus provides evolutionary insight into the birth of viral genes." Proceedings of the National Academy of Sciences of the United States of America 110(31): 12744-12749. Carvunis, A.-R., T. Rolland, I. Wapinski, M. A. Calderwood, M. A. Yildirim, N. Simonis, B. Charloteaux, C. A. Hidalgo, J. Barbette, B. Santhanam, G. A. Brar, J. S. Weissman, A. Regev, N. Thierry-Mieg, M. E. Cusick and M. Vidal (2012). "Proto-genes and de novo gene birth." Nature 487(7407): 370-374. Chan, L. Y., S. Gunasekera, S. T. Henriques, N. F. Worth, S.-J. Le, R. J. Clark, J. H. Campbell, D. J. Craik and N. L. Daly (2011). "Engineering pro-angiogenic peptides using stable disulfide-rich cyclic scaffolds." Blood 118(25): 6709-6717. Chance, R. E., R. M. Ellis and W. W. Bromer (1968). "Porcine proinsulin: characterization and amino acid sequence." Science 161(3837): 165-167. Charlesworth, D., F. L. Liu and L. Zhang (1998). "The evolution of the alcohol dehydrogenase gene family by loss of introns in plants of the genus Leavenworthia (Brassicaceae)." Molecular Biology and Evolution 15(5): 552559. Chen, S., Y. E. Zhang and M. Long (2010). "New genes in Drosophila quickly become essential." Science 330(6011): 1682-1685. Colgrave, M. L. and D. J. Craik (2004). "Thermal, chemical, and enzymatic stability of the cyclotide kalata B1: the importance of the cyclic cystine knot." Biochemistry 43(20): 5965-5975. Colgrave, M. L., A. C. Kotze, Y. H. Huang, J. O'Grady, S. M. Simonsen and D. J. Craik (2008). "Cyclotides: natural, circular plant peptides that possess significant activity against gastrointestinal nematode parasites of sheep." Biochemistry 47(20): 5581-5589. Colgrave, M. L., A. C. Kotze, S. Kopp, J. S. McCarthy, G. T. Coleman and D. J. Craik (2009). "Anthelmintic activity of cyclotides: In vitro studies with canine and human hookworms." Acta Trop 109(2): 163-166. Colorado, P. C., A. Torre, G. Kamphaus, Y. Maeshima, H. Hopfer, K. Takahashi, R. Volk, E. D. Zamborsky, S. Herman, P. K. Sarkar, M. B. Ericksen, M. Dhanabal, M. Simons, M. Post, D. W. Kufe, R. R. Weichselbaum, V. P. Sukhatme and R. Kalluri (2000). "Anti-angiogenic cues from vascular basement membrane collagen." Cancer Research 60(9): 2520-2526. Conant, G. C. and K. H. Wolfe (2008). "Turning a hobby into a job: how duplicated genes find new functions." Nature Reviews Genetics 9(12): 938-950. Condie, J. A., G. Nowak, D. W. Reed, J. John Balsevich, M. J. T. Reaney, P. G. Arnison and P. S. Covello (2011). "The biosynthesis of Caryophyllaceae-like cyclic peptides in Saponaria vaccaria L. from DNA-encoded precursors." The Plant Journal 67: 682–690. Cordaux, R., S. Udit, M. A. Batzer and C. Feschotte (2006). "Birth of a chimeric primate gene by capture of the transposase gene from a mobile element." Proceedings of the National Academy of Sciences of the United States of America 103(21): 8101-8106. 150 Coudreuse, D. Y., G. Roel, M. C. Betist, O. Destree and H. C. Korswagen (2006). "Wnt gradient formation requires retromer function in Wnt-producing cells." Science 312(5775): 921-924. Craik, D. J. (2001). "Plant cyclotides: circular, knotted peptide toxins." Toxicon 39(12): 1809-1813. Craik, D. J., N. L. Daly, T. Bond and C. Waine (1999). "Plant cyclotides: A unique family of cyclic and knotted proteins that defines the cyclic cystine knot structural motif." J. Mol. Biol. 294(5): 1327-1336. Craik, D. J., J. S. Mylne and N. L. Daly (2010). "Cyclotides: macrocyclic peptides with applications in drug design and agriculture." Cellular and Molecular Life Sciences 67(1): 9-16. Crisp, A., C. Boschetti, M. Perry, A. Tunnacliffe and G. Micklem (2015). "Expression of multiple horizontally acquired genes is a hallmark of both vertebrate and invertebrate genomes." Genome Biol 16(50): 015-0607. D'Hondt, K., J. Van Damme, C. Van Den Bossche, S. Leejeerajumnean, R. De Rycke, J. Derksen, J. Vandekerckhove and E. Krebbers (1993). "Studies of the role of the propeptides of the Arabidopsis thaliana 2S albumin." Plant Physiology 102(2): 425-433. Daly, N. L., Y.-K. Chen, F. M. Foley, P. S. Bansal, R. Bharathi, R. J. Clark, C. P. Sommerhoff and D. J. Craik (2006). "The absolute structural requirement for a proline in the P3'-position of Bowman-Birk protease inhibitors is surmounted in the minimized SFTI-1 scaffold." Journal of Biological Chemistry 281(33): 2366823675. Daly, N. L., R. J. Clark, M. R. Plan and D. J. Craik (2006). "Kalata B8, a novel antiviral circular protein, exhibits conformational flexibility in the cystine knot motif." Biochemical Journal 393(3): 619-626. Daly, N. L., K. R. Gustafson and D. J. Craik (2004). "The role of the cyclic peptide backbone in the anti-HIV activity of the cyclotide kalata B1." FEBS Lett. 574(1-3): 69-72. Delaye, L., A. Deluna, A. Lazcano and A. Becerra (2008). "The origin of a novel gene through overprinting in Escherichia coli." BMC Evolutionary Biology 8: 31. Dias, G. C., V. Gressler, S. C. Hoenzel, U. F. Silva, Dalcol, II and A. F. Morel (2007). "Constituents of the roots of Melochia chamaedrys." Phytochemistry 68(5): 668-672. Domazet-Loso, T. and D. Tautz (2003). "An evolutionary analysis of orphan genes in Drosophila." Genome Research 13(10): 2213-2219. Douglass, J., O. Civelli and E. Herbert (1984). "Polyprotein gene expression: generation of diversity of neuroendocrine peptides." Annual Review of Biochemistry 53: 665-715. Drouin, G. and G. A. Dover (1990). "Independent gene evolution in the potato actin gene family demonstrated by phylogenetic procedures for resolving gene conversions and the phylogeny of angiosperm actin genes." Journal of Molecular Evolution 31(2): 132-150. Duncan, M. W., H. Roder and S. W. Hunsucker (2008). "Quantitative matrixassisted laser desorption/ionization mass spectrometry." Brief Funct Genomic Proteomic 7(5): 355-370. 151 Dunning Hotopp, J. C., M. E. Clark, D. C. Oliveira, J. M. Foster, P. Fischer, M. C. Munoz Torres, J. D. Giebel, N. Kumar, N. Ishmael, S. Wang, J. Ingram, R. V. Nene, J. Shepard, J. Tomkins, S. Richards, D. J. Spiro, E. Ghedin, B. E. Slatko, H. Tettelin and J. H. Werren (2007). "Widespread lateral gene transfer from intracellular bacteria to multicellular eukaryotes." Science 317(5845): 17531756. Dutton, J. L., R. F. Renda, C. Waine, R. J. Clark, N. L. Daly, C. V. Jennings, M. A. Anderson and D. J. Craik (2004). "Conserved structural and sequence elements implicated in the processing of gene-encoded circular proteins." J. Biol. Chem. 279(45): 46858-46867. Edwards, K., C. Johnstone and C. Thompson (1991). "A simple and rapid method for the preparation of plant genomic DNA for PCR analysis." Nucleic Acids Research 19(6): 1349. Eliasen, R., N. L. Daly, B. S. Wulff, T. L. Andresen, K. W. Conde-Frieboes and D. J. Craik (2012). "Design, synthesis, structural and functional characterization of novel melanocortin agonists based on the cyclotide kalata B1." Journal of Biological Chemistry 287(48): 40493-40501. Elliott, A. G., C. Delay, H. Liu, Z. Phua, K. J. Rosengren, A. H. Benfield, J. L. Panero, M. L. Colgrave, A. S. Jayasena, K. M. Dunse, M. A. Anderson, E. E. Schilling, D. Ortiz-Barrientos, D. J. Craik and J. S. Mylne (2014). "Evolutionary origins of a bioactive peptide buried within preproalbumin." Plant Cell 26(3): 981-995. Ericson, M. L., J. Rödin, M. Lenman, K. Glimelius, L. G. Josefsson and L. Rask (1986). "Structure of the rapeseed 1.7 S storage protein, napin, and its precursor." Journal of Biological Chemistry 261(31): 14576-14581. Fenn, J., M. Mann, C. Meng, S. Wong and C. Whitehouse (1989). "Electrospray ionization for mass spectrometry of large biomolecules." Science 246(4926): 6471. Fenyö, D. and R. C. Beavis (2008). "Informatics development: challenges and solutions for MALDI mass spectrometry." Mass Spectrom Rev 27(1): 1-19. Force, A., M. Lynch, F. B. Pickett, A. Amores, Y. L. Yan and J. Postlethwait (1999). "Preservation of duplicate genes by complementary, degenerative mutations." Genetics 151(4): 1531-1545. Freire, J. E. C., I. M. Vasconcelos, F. B. M. B. Moreno, A. B. Batista, M. D. P. Lobo, M. L. Pereira, J. P. M. S. Lima, R. V. M. Almeida, A. J. S. Sousa, A. C. O. Monteiro-Moreira, J. T. A. Oliveira and T. B. Grangeiro (2015). "Mo-CBP3, an Antifungal chitin-binding protein from Moringa oleifera seeds, is a member of the 2S albumin family." PLoS ONE 10(3): e0119871. Fujiwara, T., E. Nambara, K. Yamagishi, D. B. Goto and S. Naito (2002). "Storage proteins." Arabidopsis Book 1(10): 30. Galindo, M. I., J. I. Pueyo, S. Fouix, S. A. Bishop and J. P. Couso (2007). "Peptides encoded by short ORFs control development and define a new eukaryotic gene family." PLoS Biology 5(5). Gan, S. and R. M. Amasino (1997). "Making sense of senescence (molecular genetic regulation and manipulation of leaf senescence)." Plant Physiology 113(2): 313-319. 152 Ganesh, V., A. Tulinsky, A. Y. Lee and J. Clardy (1996). "Comparison of the structures of the cyclotheonamide a complexes of human α-thrombin and bovine β-trypsin." Protein Science 5(5): 825-835. Gelman, J. S. and L. D. Fricker (2010). "Hemopressin and other bioactive peptides from cytosolic proteins: are these non-classical neuropeptides?" Aaps J 12(3): 279-289. George, A., H. Leahy, J. Zhou and P. J. Morin (2007). "The vacuolar-ATPase inhibitor bafilomycin and mutant VPS35 inhibit canonical Wnt signaling." Neurobiology of Disease 26(1): 125-133. Gessert, S. F., J. H. Kim, F. E. Nargang and R. L. Weiss (1994). "A polyprotein precursor of two mitochondrial enzymes in Neurospora crassa. Gene structure and precursor processing." Journal of Biological Chemistry 269(11): 8189-8203. Getz, J. A., O. Cheneval, D. J. Craik and P. S. Daugherty (2013). "Design of a cyclotide antagonist of neuropilin-1 and -2 that potently inhibits endothelial cell migration." ACS Chem Biol 8(6): 1147-1154. Gilbert, W. (1978). "Why genes in pieces?" Nature 271(5645): 501. Gladyshev, E. A., M. Meselson and I. R. Arkhipova (2008). "Massive horizontal gene transfer in bdelloid rotifers." Science 320(5880): 1210-1213. Göransson, U., M. Sjogren, E. Svangard, P. Claeson and L. Bohlin (2004). "Reversible Antifouling Effect of the Cyclotide Cycloviolacin O2 against Barnacles." J Nat Prod 67(8): 1287-1290. Gran, L. (1970). "An oxytocic principle found in Oldenlandia affinis DC." Medd. Nor. Farm. Selsk. 12: 173-180. Gran, L. (1973). "Oxytocic principles of Oldenlandia affinis." Lloydia 36: 174178. Gressent, F., P. Da Silva, V. Eyraud, L. Karaki and C. Royer (2011). "Pea Albumin 1 subunit b (PA1b), a promising bioinsecticide of plant origin." Toxins 3(12): 1502-1517. Gruber, C. W., A. G. Elliott, D. C. Ireland, P. G. Delprete, S. Dessein, U. Göransson, M. Trabi, C. K. Wang, A. B. Kinghorn, E. Robbrecht and D. J. Craik (2008). "Distribution and evolution of circular miniproteins in flowering plants." Plant Cell 20(9): 2471-2483. Gruis, D., J. Schulze and R. Jung (2004). "Storage protein accumulation in the absence of the vacuolar processing enzyme family of cysteine proteases." Plant Cell 16(1): 270-290. Gruis, D. F., D. A. Selinger, J. M. Curran and R. Jung (2002). "Redundant proteolytic mechanisms process seed storage proteins in the absence of seedtype members of the vacuolar processing enzyme family of cysteine proteases." Plant Cell 14(11): 2863-2882. Gunasekera, S., F. M. Foley, R. J. Clark, L. Sando, L. J. Fabri, D. J. Craik and N. L. Daly (2008). "Engineering stabilized vascular endothelial growth factor-A antagonists: synthesis, structural characterization, and bioactivity of grafted analogues of cyclotides." J Med Chem 51(24): 7697-7704. Guo, F. Q. and N. M. Crawford (2005). "Arabidopsis nitric oxide synthase1 is targeted to mitochondria and protects against oxidative damage and darkinduced senescence." Plant Cell 17(12): 3436-3450. 153 Gustafson, K. R., R. C. I. Sowder, L. E. Henderson, I. C. Parsons, Y. Kashman, J. H. I. Cardellina, J. B. McMahon, R. W. J. Buckheit, L. K. Pannell and M. R. Boyd (1994). "Circulins A and B: Novel HIV-inhibitory macrocyclic peptides from the tropical tree Chassalia parvifolia." J. Am. Chem. Soc. 116: 9337-9338. Hakes, L., J. W. Pinney, S. C. Lovell, S. G. Oliver and D. L. Robertson (2007). "All duplicates are not equal: the difference between small-scale and genome duplication." Genome Biology 8(10): R209-R209. Hara-Nishimura, I., K. Inoue and M. Nishimura (1991). "A unique vacuolar processing enzyme responsible for conversion of several proprotein precursors into the mature forms." FEBS Letters 294(1-2): 89-93. Hara-Nishimura, I. and M. Nishimura (1987). "Proglobulin processing enzyme in vacuoles isolated from developing pumpkin cotyledons." Plant Physiology 85(2): 440-445. Hara-Nishimura, I., Y. Takeuchi, K. Inoue and M. Nishimura (1993). "Vesicle transport and processing of the precursor to 2S albumin in pumpkin." Plant J. 4: 793-800. Hara-Nishimura, I., Y. Takeuchi and M. Nishimura (1993). "Molecular characterization of a vacuolar processing enzyme related to a putative cysteine proteinase of Schistosoma mansoni." Plant Cell 5(11): 1651-1659. Hatsugai, N., K. Yamada, S. Goto-Yamada, I. Hara-Nishimura (2015). "Vacuolar processing enzyme in plant programmed cell death." Frontier in Plant Science 6(234). Heinen, T. J. A. J., F. Staubach, D. Häming and D. Tautz (2009). "Emergence of a new gene from an intergenic region." Current Biology 19(18): 1527-1531. Heitz, A., J. F. Hernandez, J. Gagnon, T. T. Hong, T. T. Pham, T. M. Nguyen, D. Le-Nguyen and L. Chiche (2001). "Solution structure of the squash trypsin inhibitor MCoTI-II. A new family for cyclic knottins." Biochemistry 40(27): 79737983. Herman, E. and B. Larkins (1999). "Protein storage bodies and vacuoles." The Plant Cell 11(4): 601-614. Hickman, R., C. Hill, C. A. Penfold, E. Breeze, L. Bowden, J. D. Moore, P. Zhang, A. Jackson, E. Cooke, F. Bewicke-Copley, A. Mead, J. Beynon, D. L. Wild, K. J. Denby, S. Ott and V. Buchanan-Wollaston (2013). "A local regulatory network around three NAC transcription factors in stress responses and senescence in Arabidopsis leaves." Plant Journal 75(1): 26-39. Higgins, T. J., P. M. Chandler, P. J. Randall, D. Spencer, L. R. Beach, R. J. Blagrove, A. A. Kortt and A. S. Inglis (1986). "Gene structure, protein structure, and regulation of the synthesis of a sulfur-rich protein in pea seeds." Journal of Biological Chemistry 261(24): 11124-11130. Hillmer, S., A. Movafeghi, D. G. Robinson and G. Hinz (2001). "Vacuolar storage proteins are sorted in the cis-cisternae of the pea cotyledon golgi apparatus." The Journal of Cell Biology 152(1): 41-50. Hiraiwa, N., M. Kondo, M. Nishimura and I. Hara-Nishimura (1997). "An aspartic endopeptidase is involved in the breakdown of propeptides of storage proteins in protein-storage vacuoles of plants." European Journal of Biochemistry 246(1): 133-141. 154 Hiraiwa, N., M. Nishimura and I. Hara-Nishimura (1999). "Vacuolar processing enzyme is self-catalytically activated by sequential removal of the C-terminal and N-terminal propeptides." FEBS Letters 447(2-3): 213-216. Hoffmann, W., T. C. Bach, H. Seliger and G. Kreil (1983). "Biosynthesis of caerulein in the skin of Xenopus laevis: partial sequences of precursors as deduced from cDNA clones." Embo J 2(1): 111-114. Horiuchi, T., E. Giniger and T. Aigaki (2003). "Alternative trans-splicing of constant and variable exons of a Drosophila axon guidance gene, lola." Genes & Development 17(20): 2496-2501. Hsiao, K. C., C. H. Cheng, I. E. Fernandes, H. W. Detrich and A. L. DeVries (1990). "An antifreeze glycopeptide gene from the antarctic cod Notothenia coriiceps neglecta encodes a polyprotein of high peptide copy number." Proceedings of the National Academy of Sciences of the United States of America 87(23): 9265-9269. Hsieh, K. and A. H. C. Huang (2004). "Endoplasmic reticulum, oleosins, and oils in seeds and tapetum cells." Plant physiology 136(3): 3427-3434. Huang, Q., S. Liu and Y. Tang (1993). "Refined 1.6 A resolution crystal structure of the complex formed between porcine beta-trypsin and MCTI-A, a trypsin inhibitor of the squash family. Detailed comparison with bovine betatrypsin and its complex." J. Mol. Biol. 229(4): 1022-1036. Jacob, F. (1977). "Evolution and tinkering." Science 196(4295): 1161-1166. Jayasena, A. S., B. Franke, K. J. Rosengren and J. S. Mylne (2016). "A tripartite ‘omic analysis identifies the major sunflower seed albumins." Theoretical and Applied Genetics 129: 613-629. Jennings, C., J. West, C. Waine, D. Craik and M. Anderson (2001). "Biosynthesis and insecticidal properties of plant cyclotides: the cyclic knotted proteins from Oldenlandia affinis." Proceedings of the National Academy of Sciences of the United States of America 98(19): 10614-10619. Jennings, C. V., K. J. Rosengren, N. L. Daly, M. Scanlon, C. Waine, D. G. Norman, M. A. Anderson "Isolation, solution structure, and insecticidal activity protein with a twist: do Mobius strips exist in nature?" 860. Plan, J. Stevens, M. J. and D. J. Craik (2005). of kalata B2, a circular Biochemistry 44(3): 851- Ji, Y., S. Majumder, M. Millard, R. Borra, T. Bi, A. Y. Elnagar, N. Neamati, A. Shekhtman and J. A. Camarero (2013). "In vivo activation of the p53 tumor suppressor pathway by an engineered cyclotide." Journal of the American Chemical Society 135(31): 11623-11633. Jiao, Y., N. J. Wickett, S. Ayyampalayam, A. S. Chanderbali, L. Landherr, P. E. Ralph, L. P. Tomsho, Y. Hu, H. Liang, P. S. Soltis, D. E. Soltis, S. W. Clifton, S. E. Schlarbaum, S. C. Schuster, H. Ma, J. Leebens-Mack and C. W. dePamphilis (2011). "Ancestral polyploidy in seed plants and angiosperms." Nature 473(7345): 97-100. Johansson, B. L., A. Kernell, S. Sjoberg and J. Wahren (1993). "Influence of combined C-peptide and insulin administration on renal function and metabolic control in diabetes type 1." Journal of Clinical Endocrinology & Metabolism 77(4): 976-981. 155 Jouvensal, L., L. Quillien, E. Ferrasson, Y. Rahbe, J. Gueguen and F. Vovelle (2003). "PA1b, an insecticidal protein extracted from pea seeds (Pisum sativum): 1H-2-D NMR study and molecular modeling." Biochemistry 42(41): 11915-11923. Jung, R., M. P. Scott, Y.-W. Nam, T. W. Beaman, R. Bassüner, I. Saalbach, K. Müntz and N. C. Nielsen (1998). "The Role of Proteolysis in the Processing and Assembly of 11S Seed Globulins." The Plant Cell Online 10(3): 343-357. Kaessmann, H. (2010). "Origins, evolution, and phenotypic impact of new genes." Genome Research 20(10): 1313-1326. Kaessmann, H., N. Vinckenbosch and M. Long (2009). "RNA-based gene duplication: mechanistic and evolutionary insights." Nature reviews. Genetics 10(1): 19-31. Kamphaus, G. D., P. C. Colorado, D. J. Panka, H. Hopfer, R. Ramchandran, A. Torre, Y. Maeshima, J. W. Mier, V. P. Sukhatme and R. Kalluri (2000). "Canstatin, a novel matrix-derived inhibitor of angiogenesis and tumor growth." Journal of Biological Chemistry 275(2): 1209-1215. Karas, M. and F. Hillenkamp (1988). "Laser desorption ionization of proteins with molecular masses exceeding 10,000 daltons." Analytical Chemistry 60(20): 2299-2301. Kathiria, P., C. Sidler, A. Golubov, M. Kalischuk, L. M. Kawchuk and I. Kovalchuk (2010). "Tobacco mosaic virus infection results in an increase in recombination frequency and resistance to viral, bacterial, and fungal pathogens in the progeny of infected tobacco plants." Plant Physiology 153(4): 1859-1870. Kawakami, T., A. Ohta, M. Ohuchi, H. Ashigai, H. Murakami and H. Suga (2009). "Diverse backbone-cyclized peptides via codon reprogramming." Nature Chemical Biology 5(12): 888-890. Kazachkov Yu, A., V. Svitkin Yu and V. I. Agol (1983). "The position of polypeptide G on the encephalomyocarditis virus polyprotein cleavage map." FEBS Letters 154(1): 161-165. Keech, O., E. Pesquet, L. Gutierrez, A. Ahad, C. Bellini, S. M. Smith and P. Gardestrom (2010). "Leaf senescence is accompanied by an early disruption of the microtubule network in Arabidopsis." Plant Physiology 154(4): 1710-1720. Keeling, P. J. and J. D. Palmer (2008). "Horizontal gene transfer in eukaryotic evolution." Nature Reviews Genetics 9(8): 605-618. Keskitalo, J., G. Bergquist, P. Gardestrom and S. Jansson (2005). "A cellular timetable of autumn senescence." Plant Physiology 139(4): 1635-1648. Khalturin, K., G. Hemmrich, S. Fraune, R. Augustin and T. C. Bosch (2009). "More than just orphans: are taxonomically-restricted genes important in evolution?" Trends in Genetics 25(9): 404-413. 156 Kidd, J. M., G. M. Cooper, W. F. Donahue, H. S. Hayden, N. Sampas, T. Graves, N. Hansen, B. Teague, C. Alkan, F. Antonacci, E. Haugen, T. Zerr, N. A. Yamada, P. Tsang, T. L. Newman, E. Tuzun, Z. Cheng, H. M. Ebling, N. Tusneem, R. David, W. Gillett, K. A. Phelps, M. Weaver, D. Saranga, A. Brand, W. Tao, E. Gustafson, K. McKernan, L. Chen, M. Malig, J. D. Smith, J. M. Korn, S. A. McCarroll, D. A. Altshuler, D. A. Peiffer, M. Dorschner, J. Stamatoyannopoulos, D. Schwartz, D. A. Nickerson, J. C. Mullikin, R. K. Wilson, L. Bruhn, M. V. Olson, R. Kaul, D. R. Smith and E. E. Eichler (2008). "Mapping and sequencing of structural variation from eight human genomes." Nature 453(7191): 56-64. Kinoshita, T., M. Nishimura and I. Hara-Nishimura (1995). "Homologues of a vacuolar processing enzyme that are expressed in different organs in Arabidopsis thaliana." Plant Mol Biol 29(1): 81-89. Kinoshita, T., K. Yamada, N. Hiraiwa, M. Kondo, M. Nishimura and I. HaraNishimura (1999). "Vacuolar processing enzyme is up-regulated in the lytic vacuoles of vegetative tissues during senescence and under various stressed conditions." Plant J 19(1): 43-53. Klemke, M., R. H. Kehlenbach and W. B. Huttner (2001). "Two overlapping reading frames in a single exon encode interacting proteins--a novel way of gene usage." EMBO J 20(14): 3849-3860. Konarev, A. V., I. N. Anisimova, V. A. Gavrilova, T. E. Vachrusheva, G. Y. Konechnaya, M. Lewis and P. R. Shewry (2002). "Serine proteinase inhibitors in the Compositae: distribution, polymorphism and properties." Phytochemistry 59(3): 279-291. Konarev, A. V., J. Griffin, G. Y. Konechnaya and P. R. Shewry (2004). "The distribution of serine proteinase inhibitors in seeds of the Asteridae." Phytochemistry 65(22): 3003-3020. Korant, B. D. (1990). "Strategies to inhibit viral polyprotein cleavages." Annals of the New York Academy of Sciences 616: 252-257. Korsinczky, M. L., H. J. Schirra, K. J. Rosengren, J. West, B. A. Condie, L. Otvos, M. A. Anderson and D. J. Craik (2001). "Solution structures by 1H NMR of the novel cyclic trypsin inhibitor SFTI-1 from sunflower seeds and an acyclic permutant." Journal of Molecular Biology 311(3): 579-591. Kraut, J. (1977). "Serine Proteases: Structure and Mechanism of Catalysis." Annual Review of Biochemistry 46(1): 331-358. Krebbers, E., L. Herdies, A. De Clercq, J. Seurinck, J. Leemans, J. Van Damme, M. Segura, G. Gheysen, M. Van Montagu and J. Vandekerckhove (1988). "Determination of the processing sites of an Arabidopsis 2S albumin and characterization of the complete gene family." Plant Physiology 87(4): 859866. Kuntal, B.K., P. Aparoy, P. Reddanna (2010) "EasyModeller: A graphical interface to MODELLER." BioMed Central Research Notes 3(226). Kuroyanagi, M., M. Nishimura and I. Hara-Nishimura (2002). "Activation of Arabidopsis Vacuolar Processing Enzyme by Self-Catalytic Removal of an Auto-Inhibitory Domain of the C-Terminal Propeptide." Plant Cell Physiol. 43(2): 143-151. 157 Kuroyanagi, M., K. Yamada, N. Hatsugai, M. Kondo, M. Nishimura and I. HaraNishimura (2005). "Vacuolar processing enzyme is essential for mycotoxininduced cell death in Arabidopsis thaliana." Journal of Biological Chemistry 280(38): 32914-32920. Land, H., M. Grez, S. Ruppert, H. Schmale, M. Rehbein, D. Richter and G. Schutz (1983). "Deduced amino acid sequence from the bovine oxytocinneurophysin I precursor cDNA." Nature 302(5906): 342-344. Land, H., G. Schutz, H. Schmale and D. Richter (1982). "Nucleotide sequence of cloned cDNA encoding bovine arginine vasopressin-neurophysin II precursor." Nature 295(5847): 299-303. Laskowski, M., Jr. and I. Kato (1980). "Protein inhibitors of proteinases." Annu Rev Biochem 49: 593-626. Laskowski, R.A., M.W. MacArthur, D.S. Moss, J.M. Thornton (1993). "PROCHECK - a program to check the stereochemical quality of protein structures." Journal of Applied Crystallography 26:283-291. Legowska, A., D. Debowski, A. Lesner, M. Wysocka and K. Rolka (2009). "Introduction of non-natural amino acid residues into the substrate-specific P1 position of trypsin inhibitor SFTI-1 yields potent chymotrypsin and cathepsin G inhibitors." Bioorganic & Medicinal Chemistry 17(9): 3302-3307. Lehmann, K., K. Schweimer, G. Reese, S. Randow, M. Suhr, W.-M. Becker, S. Vieths and P. Rösch (2006). "Structure and stability of 2S albumin-type peanut allergens: implications for the severity of peanut allergic reactions." Biochemical Journal 395(3): 463-472. Leopold, A. C. (1961). "Senescence in plant development: the death of plants or plant parts may be of positive ecological or physiological value." Science 134(3492): 1727-1732. Li, C. Y., Y. Zhang, Z. Wang, C. Cao, P. W. Zhang, S. J. Lu, X. M. Li, Q. Yu, X. Zheng, Q. Du, G. R. Uhl, Q. R. Liu and L. Wei (2010). "A human-specific de novo protein-coding gene associated with human brain functions." PLoS Computational Biology 6(3): e1000734. Li, D., Y. Dong, Y. Jiang, H. Jiang, J. Cai and W. Wang (2010). "A de novo originated gene depresses budding yeast mating pathway and is repressed by the protein encoded by its antisense strand." Cell Research 20(4): 408-420. Li, F., X.-x. Yang, H.-c. Xia, R. Zeng, W.-g. Hu, Z. Li and Z.-c. Zhang (2003). "Purification and characterization of Luffin P1, a ribosome-inactivating peptide from the seeds of Luffa cylindrica." Peptides 24(6): 799-805. Li, L., T. Shimada, H. Takahashi, H. Ueda, Y. Fukao, M. Kondo, M. Nishimura and I. Hara-Nishimura (2006). "MAIGO2 is involved in exit of seed storage proteins from the endoplasmic reticulum in Arabidopsis thaliana." Plant Cell 18(12): 3535-3547. Li, X., L. Zhao, H. Jiang and W. Wang (2009). "Short homologous sequences are strongly associated with the generation of chimeric RNAs in eukaryotes." Journal of Molecular Evolution 68(1): 56-65. Light, S., W. Basile and A. Elofsson (2014). "Orphans and new gene origination, a structural and evolutionary perspective." Current Opinion in Structural Biology 26: 73-83. 158 Lin, C. Y., J. Anders, M. Johnson, Q. A. Sang and R. B. Dickson (1999). "Molecular cloning of cDNA for matriptase, a matrix-degrading serine protease with trypsin-like activity." J Biol Chem 274(26): 18231-18236. Lindholm, P., U. Göransson, S. Johansson, P. Claeson, J. Gulbo, R. Larsson, L. Bohlin and A. Backlund (2002). "Cyclotides: A novel type of cytotoxic agents." Mol. Cancer Ther. 1: 365-369. Linnestad, C., D. N. P. Doan, R. C. Brown, B. E. Lemmon, D. J. Meyer, R. Jung and O.-A. Olsen (1998). "Nucellain, a Barley Homolog of the Dicot VacuolarProcessing Protease, Is Localized in Nucellar Cell Walls." Plant Physiology 118(4): 1169-1180. Liu, S. L., Y. Zhuang, P. Zhang and K. L. Adams (2009). "Comparative analysis of structural diversity and sequence evolution in plant mitochondrial genes transferred to the nucleus." Molecular Biology and Evolution 26(4): 875-891. Long, M., E. Betrán, K. Thornton and W. Wang (2003). "The origin of new genes: glimpses from the young and old." Nat Rev Genet 4(11): 865-875. Long, M. and C. H. Langley (1993). "Natural selection and the origin of jingwei, a chimeric processed functional gene in Drosophila." Science 260(5104): 91-95. Long, M., N. W. VanKuren, S. Chen and M. D. Vibranovski (2013). "New gene evolution: little did we know." Annual Review of Genetics 47: 307-333. Long, Y. Q., S. L. Lee, C. Y. Lin, I. J. Enyedy, S. Wang, P. Li, R. B. Dickson and P. P. Roller (2001). "Synthesis and evaluation of the sunflower derived trypsin inhibitor as a potent inhibitor of the type II transmembrane serine protease, matriptase." Bioorganic & Medicinal Chemistry Letters 11(18): 2515-2519. Luckett, S., R. S. Garcia, J. J. Barker, A. V. Konarev, P. R. Shewry, A. R. Clarke and R. L. Brady (1999). "High-resolution structure of a potent, cyclic proteinase inhibitor from sunflower seeds." Journal of Molecular Biology 290(2): 525-533. Luco, R. F., Q. Pan, K. Tominaga, B. J. Blencowe, O. M. Pereira-Smith and T. Misteli (2010). "Regulation of alternative splicing by histone modifications." Science 327(5968): 996-1000. Luppi, P., V. Cifarelli, H. Tse, J. Piganelli and M. Trucco (2008). "Human Cpeptide antagonises high glucose-induced endothelial dysfunction through the nuclear factor-kappaB pathway." Diabetologia 51(8): 1534-1543. MacLean, B., D. M. Tomazela, N. Shulman, M. Chambers, G. L. Finney, B. Frewen, R. Kern, D. L. Tabb, D. C. Liebler and M. J. MacCoss (2010). "Skyline: an open source document editor for creating and analyzing targeted proteomics experiments." Bioinformatics 26(7): 966-968. Maere, S., S. De Bodt, J. Raes, T. Casneuf, M. Van Montagu, M. Kuiper and Y. Van de Peer (2005). "Modeling gene and genome duplications in eukaryotes." Proceedings of the National Academy of Sciences of the United States of America 102(15): 5454-5459. Maeshima, Y., A. Sudhakar, J. C. Lively, K. Ueki, S. Kharbanda, C. R. Kahn, N. Sonenberg, R. O. Hynes and R. Kalluri (2002). "Tumstatin, an endothelial cellspecific inhibitor of protein synthesis." Science 295(5552): 140-143. Mains, R. E., B. A. Eipper and N. Ling (1977). "Common precursor to corticotropins and endorphins." Proceedings of the National Academy of Sciences of the United States of America 74(7): 3014-3018. 159 Mann, M. and O. N. Jensen (2003). "Proteomic analysis of post-translational modifications." Nat Biotech 21(3): 255-261. Marcus, J. P., J. L. Green, K. C. Goulter and J. M. Manners (1999). "A family of antimicrobial peptides is produced by processing of a 7S globulin protein in Macadamia integrifolia kernels." Plant Journal 19(6): 699-710. Maria-Neto, S., R. V. Honorato, F. T. Costa, R. G. Almeida, D. S. Amaro, J. T. A. Oliveira, I. M. Vasconcelos and O. L. Franco (2011). "Bactericidal activity identified in 2S albumin from sesame seeds and in silico studies of structure– function relations." The Protein Journal 30(5): 340-350. Markwell, J., J. C. Osterman and J. L. Mitchell (1995). "Calibration of the Minolta SPAD-502 leaf chlorophyll meter." Photosynthesis Research 46(3): 467472. Marques, A. C., I. Dupanloup, N. Vinckenbosch, A. Reymond and H. Kaessmann (2005). "Emergence of young human genes after a burst of retroposition in primates." PLoS Biology 3(11): e357. Marx, U. C., M. L. Korsinczky, H. J. Schirra, A. Jones, B. Condie, L. Otvos, Jr. and D. J. Craik (2003). "Enzymatic cyclization of a potent Bowman-Birk protease inhibitor, sunflower trypsin inhibitor-1, and solution structure of an acyclic precursor peptide." J. Biol. Chem. 278(24): 21782-21789. McCarrey, J. R. and K. Thomas (1987). "Human testis-specific PGK gene lacks introns and possesses characteristics of a processed gene." Nature 326(6112): 501-505. McManus, C. J., M. O. Duff, J. Eipper-Mains and B. R. Graveley (2010). "Global analysis of trans-splicing in Drosophila." Proceedings of the National Academy of Sciences of the United States of America 107(29): 12975-12979. Menéndez-Arias, L., I. Moneo, J. DomÍNguez and R. RodrÍGuez (1988). "Primary structure of the major allergen of yellow mustard (Sinapis alba L.) seed, Sin a I." European Journal of Biochemistry 177(1): 159-166. Mighell, A. J., N. R. Smith, P. A. Robinson and A. F. Markham (2000). "Vertebrate pseudogenes." FEBS Letters 468(2-3): 109-114. Moran, J. V., R. J. DeBerardinis and H. H. Kazazian, Jr. (1999). "Exon shuffling by L1 retrotransposition." Science 283(5407): 1530-1534. Müntz, K. (1998). "Deposition of storage proteins." Plant Molecular Biology 38(1): 77-99. Mylne, J. S. and J. R. Botella (1998). "Binary vectors for sense and antisense expression of Arabidopsis ESTs." Plant Molecular Biology Reporter 16(3): 257262. Mylne, J. S., L. Y. Chan, A. H. Chanson, N. L. Daly, H. Schaefer, T. L. Bailey, P. Nguyencong, L. Cascales and D. J. Craik (2012). "Cyclic peptides arising by evolutionary parallelism via asparaginyl-endopeptidase-mediated biosynthesis." Plant Cell 24: 2765-2778. Mylne, J. S., M. L. Colgrave, N. L. Daly, A. H. Chanson, A. G. Elliott, E. J. McCallum, A. Jones and D. J. Craik (2011). "Albumins and their processing machinery are hijacked for cyclic peptides in sunflower." Nature Chemical Biology 7(5): 257-259. 160 Mylne, J. S., I. Hara-Nishimura and K. J. Rosengren (2014). "Seed storage albumins: biosynthesis, trafficking and structures." Functional Plant Biology 41(7): 671-677. Nakabayashi, K., M. Okamoto, T. Koshiba, Y. Kamiya and E. Nambara (2005). "Genome-wide profiling of stored mRNA in Arabidopsis thaliana seed germination: epigenetic and genetic regulation of transcription in seed." Plant Journal 41(5): 697-709. Nakaune, S., K. Yamada, M. Kondo, T. Kato, S. Tabata, M. Nishimura and I. Hara-Nishimura (2005). "A Vacuolar Processing Enzyme, {delta}VPE, Is Involved in Seed Coat Formation at the Early Stage of Seed Development." Plant Cell 17(3): 876-887. Nekrutenko, A., S. Wadhawan, P. Goetting-Minesky and K. D. Makova (2005). "Oscillating evolution of a mammalian locus with overlapping reading frames: an XLalphas/ALEX relay." PLoS Genetics 1(2): e18. Neme, R. and D. Tautz (2013). "Phylogenetic patterns of emergence of new genes support a model of frequent de novo evolution." BMC Genomics 14(1): 117. Neme, R. and D. Tautz (2014). "Evolution: dynamics of de novo gene emergence." Current Biology 24(6): R238-240. Neves, G., J. Zucker, M. Daly and A. Chess (2004). "Stochastic yet biased expression of multiple Dscam splice variants by individual cells." Nature Genetics 36(3): 240-246. Ng, Y.-M., Y. Yang, K.-H. Sze, X. Zhang, Y.-T. Zheng and P.-C. Shaw (2011). "Structural characterization and anti-HIV-1 activities of arginine/glutamate-rich polypeptide Luffin P1 from the seeds of sponge gourd (Luffa cylindrica)." Journal of Structural Biology 174(1): 164-172. Nguyen, G. K., S. Zhang, N. T. Nguyen, P. Q. Nguyen, M. S. Chiu, A. Hardjojo and J. P. Tam (2011). "Discovery and characterization of novel cyclotides originated from chimeric precursors consisting of albumin-1 chain a and cyclotide domains in the Fabaceae family." The Journal of Biological Chemistry 286(27): 24275-24287. Nielsen, M. S., C. Gustafsen, P. Madsen, J. R. Nyengaard, G. Hermey, O. Bakke, M. Mari, P. Schu, R. Pohlmann, A. Dennes and C. M. Petersen (2007). "Sorting by the cytoplasmic domain of the amyloid precursor protein binding receptor SorLA." Molecular and Cellular Biology 27(19): 6842-6851. Nilsen, T. W. and B. R. Graveley (2010). "Expansion of the eukaryotic proteome by alternative splicing." Nature 463(7280): 457-463. Nordlee, J. A., S. L. Taylor, J. A. Townsend, L. A. Thomas and R. K. Bush (1996). "Identification of a Brazil-Nut allergen in transgenic soybeans." New England Journal of Medicine 334(11): 688-692. Nordquist, L., R. Brown, A. Fasching, P. Persson and F. Palm (2009). "Proinsulin C-peptide reduces diabetes-induced glomerular hyperfiltration via efferent arteriole dilation and inhibition of tubular sodium reabsorption." American Journal of Physiology - Renal Physiology 297(5): 9. Nordquist, L. and J. Wahren (2009). "C-Peptide: the missing link in diabetic nephropathy?" Rev Diabet Stud 6(3): 203-210. 161 Ohno, S. (1970). Evolution by gene duplication, Springer-Verlag Berlin Heidelberg. Ohno, S. (1984). "Birth of a unique enzyme from an alternative reading frame of the preexisted, internally repetitious coding sequence." Proceedings of the National Academy of Sciences of the United States of America 81(8): 24212425. Okamura, K., L. Feuk, T. Marques-Bonet, A. Navarro and S. W. Scherer (2006). "Frequent appearance of novel protein-coding sequences by frameshift translation." Genomics 88(6): 690-697. Oliviusson, P., O. Heinzerling, S. Hillmer, G. Hinz, Y. C. Tse, L. Jiang and D. G. Robinson (2006). "Plant retromer, localized to the prevacuolar compartment and microvesicles in Arabidopsis, may interact with vacuolar sorting receptors." Plant Cell 18(5): 1239-1252. Osborne, T. B. (1909). The vegetable proteins. London, New York Longmans, Green. Otegui, M. S., R. Herder, J. Schulze, R. Jung and L. A. Staehelin (2006). "The proteolytic processing of seed storage proteins in Arabidopsis embryo cells starts in the multivesicular bodies." Plant Cell 18(10): 2567-2581. Pace, J. K., 2nd, C. Gilbert, M. S. Clark and C. Feschotte (2008). "Repeated horizontal transfer of a DNA transposon in mammals and other tetrapods." Proceedings of the National Academy of Sciences of the United States of America 105(44): 17023-17028. Parker, H. G., B. M. VonHoldt, P. Quignon, E. H. Margulies, S. Shao, D. S. Mosher, T. C. Spady, A. Elkahloun, M. Cargill, P. G. Jones, C. L. Maslen, G. M. Acland, N. B. Sutter, K. Kuroki, C. D. Bustamante, R. K. Wayne and E. A. Ostrander (2009). "An Expressed Fgf4 Retrogene Is Associated with BreedDefining Chondrodysplasia in Domestic Dogs." Science (New York, N.Y.) 325(5943): 995-998. Parmenter, D. L., J. G. Boothe, G. J. van Rooijen, E. C. Yeung and M. M. Moloney (1995). "Production of biologically active hirudin in plant seeds using oleosin partitioning." Plant Molecular Biology 29(6): 1167-1180. Patthy, L. (1999). "Genome evolution and the evolution of exon-shuffling--a review." Gene 238(1): 103-114. Paulding, C. A., M. Ruvolo and D. A. Haber (2003). "The Tre2 (USP6) oncogene is a hominoid-specific gene." Proceedings of the National Academy of Sciences of the United States of America 100(5): 2507-2511. Pavesi, A., G. Magiorkinis and D. G. Karlin (2013). "Viral proteins originated de novo by overprinting can be identified by codon usage: application to the “gene nursery” of Deltaretroviruses." PLoS Computational Biology 9(8): e1003162. Pearce, G., D. S. Moura, J. Stratmann and C. A. Ryan (2001). "Production of multiple plant hormones from a single polyprotein precursor." Nature 411(6839): 817-820. Pelegrini, P. B., B. F. Quirino and O. L. Franco (2007). "Plant cyclotides: an unusual class of defense compounds." Peptides 28(7): 1475-1481. 162 Pereira, H. J. V., M. C. O. Salgado and E. B. Oliveira (2009). "Immobilized analogues of sunflower trypsin inhibitor-1 constitute a versatile group of affinity sorbents for selective isolation of serine proteases." Journal of Chromatography B 877(22): 2039-2044. Petrov, D. A., E. R. Lozovskaya and D. L. Hartl (1996). "High intrinsic rate of DNA loss in Drosophila." Nature 384(6607): 346-349. Pimenta, D. C. and I. Lebrun (2007). "Cryptides: buried secrets in proteins." Peptides 28(12): 2403-2410. Plan, M. R., I. Saska, A. G. Cagauan and D. J. Craik (2008). "Backbone cyclised peptides from plants show molluscicidal activity against the rice pest Pomacea canaliculata (golden apple snail)." J Agric Food Chem 56(13): 52375241. Poth, A. G., M. L. Colgrave, R. E. Lyons, N. L. Daly and D. J. Craik (2011). "Discovery of an unusual biosynthetic origin for circular proteins in legumes." Proceedings of the National Academy of Sciences of the United States of America 108(25): 10127-10132. Poth, A. G., M. L. Colgrave, R. Philip, B. Kerenga, N. L. Daly, M. A. Anderson and D. J. Craik (2011). "Discovery of cyclotides in the Fabaceae plant family provides new insights into the cyclization, evolution, and distribution of circular proteins." ACS Chemical Biology on-line 20 Jan 2011: DOI:10.1021/cb100388j. Poth, A. G., J. S. Mylne, J. Grassl, R. E. Lyons, A. H. Millar, M. L. Colgrave and D. J. Craik (2012). "Cyclotides associate with leaf vasculature and are the products of a novel precursor in Petunia (Solanaceae)." Journal of Biological Chemistry 287(32): 27033-27046. Potrzebowski, L., N. Vinckenbosch, A. C. Marques, F. Chalmel, B. Jégou and H. Kaessmann (2008). "Chromosomal Gene Movements Reflect the Recent Origin and Biology of Therian Sex Chromosomes." PLoS Biology 6(4): e80. Pratley, R. E., M. Nauck, T. Bailey, E. Montanya, R. Cuddihy, S. Filetti, A. B. Thomsen, R. E. Søndergaard and M. Davies (2010). "Liraglutide versus sitagliptin for patients with type 2 diabetes who did not have adequate glycaemic control with metformin: a 26-week, randomised, parallel-group, openlabel trial." The Lancet 375(9724): 1447-1456. Pruzinska, A., G. Tanner, S. Aubry, I. Anders, S. Moser, T. Muller, K. H. Ongania, B. Krautler, J. Y. Youn, S. J. Liljegren and S. Hortensteiner (2005). "Chlorophyll breakdown in senescent Arabidopsis leaves. Characterization of chlorophyll catabolites and of chlorophyll catabolic enzymes involved in the degreening reaction." Plant Physiology 139(1): 52-63. Quimbar, P., U. Malik, C. P. Sommerhoff, Q. Kaas, L. Y. Chan, Y. H. Huang, M. Grundhuber, K. Dunse, D. J. Craik, M. A. Anderson and N. L. Daly (2013). "High-affinity cyclic peptide matriptase inhibitors." J Biol Chem 288(19): 1388513896. Quirino, B. F., J. Normanly and R. M. Amasino (1999). "Diverse range of gene activity during Arabidopsis thaliana leaf senescence includes pathogenindependent induction of defense-related genes." Plant Molecular Biology 40(2): 267-278. 163 Reinhardt, J. A., B. M. Wanjiru, A. T. Brant, P. Saelao, D. J. Begun and C. D. Jones (2013). "De novo ORFs in Drosophila are important to organismal fitness and evolved rapidly from previously non-coding sequences." PLOS Genetics 9(10): e1003860. Ribeiro, A. C., S. V. Monteiro, B. M. Carrapico and R. B. Ferreira (2014). "Are vicilins another major class of legume lectins?" Molecules 19(12): 20350-20373. Ribeiro, S. F. F., G. B. Taveira, A. O. Carvalho, G. B. Dias, M. Da Cunha, C. Santa-Catarina, R. Rodrigues and V. M. Gomes (2012). "Antifungal and other biological activities of two 2S albumin-homologous proteins against pathogenic fungi." The Protein Journal 31(1): 59-67. Roberts, I. N., C. Caputo, M. V. Criado and C. Funk (2012). "Senescenceassociated proteases in plants." Physiologia Plantarum 145(1): 130-139. Robinson, J. A., S. DeMarco, F. Gombert, K. Moehle and D. Obrecht (2008). "The design, structures and therapeutic potential of protein epitope mimetics." Drug Discovery Today 13(21–22): 944-951. Rogers, L. D. and C. M. Overall (2013). "Proteolytic post-translational modification of proteins: proteomic tools and methodology." Molecular & Cellular Proteomics 12(12): 3532-3542. Romero, D. and R. Palacios (1997). "Gene amplification and genomic plasticity in prokaryotes." Annual Review of Genetics 31: 91-111. Rotari, V. I., P. M. Dando and A. J. Barrett (2001). Legumain forms from plants and animals differ in their specificity. Biological Chemistry. 382: 953. Roth, D. B., T. N. Porter and J. H. Wilson (1985). "Mechanisms of nonhomologous recombination in mammalian cells." Molecular and Cellular Biology 5(10): 2599-2607. Rothman, B. S., E. Mayeri, R. O. Brown, P. M. Yuan and J. E. Shively (1983). "Primary structure and neuronal effects of alpha-bag cell peptide, a second candidate neurotransmitter encoded by a single gene in bag cell neurons of Aplysia." Proceedings of the National Academy of Sciences of the United States of America 80(18): 5753-5757. Rozwadowski, K., W. Yang and S. Kagale (2008). "Homologous recombinationmediated cloning and manipulation of genomic DNA regions using Gateway and recombineering systems." BMC Biotechnol 8(88): 1472-6750. Rubenstein, A. H., J. L. Clark, F. Melani and D. F. Steiner (1969). "Secretion of proinsulin C-peptide by pancreatic [beta] cells and its circulation in blood." Nature 224(5220): 697-699. Ruhfel, B. R., M. A. Gitzendanner, P. S. Soltis, D. E. Soltis and J. G. Burleigh (2014). "From algae to angiosperms-inferring the phylogeny of green plants (Viridiplantae) from 360 plastid genomes." BMC Evolutionary Biology 14(23): 1471-2148. Ruiz-Orera, J., X. Messeguer, J. A. Subirana and M. M. Alba (2014). "Long noncoding RNAs as a source of new peptides." Elife 3: e03523. Saether, O., D. J. Craik, I. D. Campbell, K. Sletten, J. Juul and D. G. Norman (1995). "Elucidation of the primary and three-dimensional structure of the uterotonic polypeptide kalata B1." Biochemistry 34: 4147-4158. 164 Samir, P. and A. J. Link (2011). "Analyzing the cryptome: uncovering secret sequences." Aaps J 13(2): 152-158. Sasaki, T., H. Larsson, J. Kreuger, M. Salmivirta, L. Claesson-Welsh, U. Lindahl, E. Hohenester and R. Timpl (1999). "Structural basis and potential role of heparin/heparan sulfate binding to the angiogenesis inhibitor endostatin." The EMBO Journal 18(22): 6240-6248. Saska, I., A. D. Gillon, N. Hatsugai, R. G. Dietzgen, I. Hara-Nishimura, M. A. Anderson and D. J. Craik (2007). "An asparaginyl endopeptidase mediates in vivo protein backbone cyclisation." Journal of Biological Chemistry 282(40): 29721-29728. Sayah, D. M., E. Sokolskaja, L. Berthoux and J. Luban (2004). "Cyclophilin A retrotransposition into TRIM5 explains owl monkey resistance to HIV-1." Nature 430(6999): 569-573. Schenk, S. and V. Quaranta (2003). "Tales from the crypt[ic] sites of the extracellular matrix." Trends in Cell Biology 13(7): 366-375. Schmid, M., T. S. Davison, S. R. Henz, U. J. Pape, M. Demar, M. Vingron, B. Scholkopf, D. Weigel and J. U. Lohmann (2005). "A gene expression map of Arabidopsis thaliana development." Nature Genetics 37(5): 501-506. Schmitz, R. J., M. D. Schultz, M. G. Lewsey, R. C. O'Malley, M. A. Urich, O. Libiger, N. J. Schork and J. R. Ecker (2011). "Transgenerational epigenetic instability is a source of novel methylation variants." Science 334(6054): 369373. Sela, S., A. Itin, S. Natanson-Yaron, C. Greenfield, D. Goldman-Wohl, S. Yagel and E. Keshet (2008). "A novel human-specific soluble vascular endothelial growth factor receptor 1: cell-type-specific splicing and implications to vascular endothelial growth factor homeostasis and preeclampsia." Circulation Research 102(12): 1566-1574. Semon, M. and K. H. Wolfe (2007). "Consequences of genome duplication." Current Opinion in Genetics & Development 17(6): 505-512. Settle, S. H., Jr., M. M. Green and K. C. Burtis (1995). "The silver gene of Drosophila melanogaster encodes multiple carboxypeptidases similar to mammalian prohormone-processing enzymes." Proceedings of the National Academy of Sciences of the United States of America 92(21): 9470-9474. Sheldon, P. S., J. N. Keen and D. J. Bowles (1996). "Post-translational peptide bond formation during concanavalin A processing in vitro." Biochemical Journal 320: 865-870. Shewry, P. (1999). Enzyme Inhibitors of Seeds: Types and Properties. Seed Proteins. P. Shewry and R. Casey. Dordrecht, Kluwer: 587-615. Shewry, P. and M. Pandya (1999). The 2S Albumin Storage Proteins. Seed Proteins. P. Shewry and R. Casey. Dordrecht, Kluwer: 563-586. Shewry, P. R. and N. G. Halford (2002). "Cereal seed storage proteins: structures, properties and role in grain utilization." Journal of Experimental Botany 53(370): 947-958. Shewry, P. R., J. A. Napier and A. S. Tatham (1995). "Seed storage proteins: structures and biosynthesis." Plant Cell 7: 945-956. 165 Shih, M.-d., F. A. Hoekstra and Y.-I. C. Hsing (2008). Late embryogenesis abundant proteins, Academic Press. Shimada, T., K. Fuji, K. Tamura, M. Kondo, M. Nishimura and I. Hara-Nishimura (2003). "Vacuolar sorting receptor for seed storage proteins in Arabidopsis thaliana." Proc Natl Acad Sci U S A 100(26): 16095-16100. Shimada, T., Y. Koumoto, L. Li, M. Yamazaki, M. Kondo, M. Nishimura and I. Hara-Nishimura (2006). "AtVPS29, a putative component of a retromer complex, is required for the efficient sorting of seed storage proteins." Plant Cell Physiol 47(9): 1187-1194. Shimada, T., K. Yamada, M. Kataoka, S. Nakaune, Y. Koumoto, M. Kuroyanagi, S. Tabata, T. Kato, K. Shinozaki, M. Seki, M. Kobayashi, M. Kondo, M. Nishimura and I. Hara-Nishimura (2003). "Vacuolar processing enzymes are essential for proper processing of seed storage proteins in Arabidopsis thaliana." Journal of Biological Chemistry 278(34): 32292-32299. Singh, A., E. Y. Chen, J. M. Lugovoy, C. N. Chang, R. A. Hitzeman and P. H. Seeburg (1983). "Saccharomyces cerevisiae contains two discrete genes coding for the α-factor pheromone." Nucleic Acids Research 11(12): 4049-4063. Sommerhoff, C. P., O. Avrutina, H. U. Schmoldt, D. Gabrijelcic-Geiger, U. Diederichsen and H. Kolmar (2010). "Engineered cystine knot miniproteins as potent inhibitors of human mast cell tryptase beta." Journal of Molecular Biology 395(1): 167-175. Souders, C. A., S. E. Maynard, J. Yan, Y. Wang, N. K. Boatright, J. Sedan, D. Balyozian, P. S. Cheslock, D. C. Molrine and T. A. Simas (2015). "Circulating levels of sFlt1 splice variants as predictive parkers for the development of preeclampsia." International Journal of Molecular Sciences 16(6): 12436-12453. Spiess, J., C. D. Mount, W. E. Nicholson and D. N. Orth (1982). "NH2-terminal amino acid sequence and peptide mapping of purified human beta-lipotropin: comparison with previously proposed sequences." Proceedings of the National Academy of Sciences of the United States of America 79(16): 5071-5075. Steiner, D. F., D. Cunningham, L. Spigelman and B. Aten (1967). "Insulin biosynthesis: evidence for a precursor." Science 157(3789): 697-700. Su, M., Y. Ling, J. Yu, J. Wu and J. Xiao (2013). "Small proteins: untapped area of potential biological importance." Front Genet 4(286): 00286. Swedberg, J. E., L. V. Nigon, J. C. Reid, S. J. de Veer, C. M. Walpole, C. R. Stephens, T. P. Walsh, T. K. Takayama, J. D. Hooper, J. A. Clements, A. M. Buckle and J. M. Harris (2009). "Substrate-guided design of a potent and selective kallikrein-related peptidase inhibitor for kallikrein 4." Chemistry & Biology 16(6): 633-643. Takagi, J., L. Renna, H. Takahashi, Y. Koumoto, K. Tamura, G. Stefano, Y. Fukao, M. Kondo, M. Nishimura, T. Shimada, F. Brandizzi and I. HaraNishimura (2013). "MAIGO5 functions in protein export from Golgi-associated endoplasmic reticulum exit sites in Arabidopsis." Plant Cell 25: 4658-4675. Takahashi, H., K. Tamura, J. Takagi, Y. Koumoto, I. Hara-Nishimura and T. Shimada (2010). "MAG4/Atp115 is a Golgi-localized tethering factor that mediates efficient anterograde transport in Arabidopsis." Plant and Cell Physiology 51(10): 1777-1787. 166 Tam, J. P., Y. A. Lu, J. L. Yang and K. W. Chiu (1999). "An unusual structural motif of antimicrobial peptides containing end-to-end macrocycle and cystineknot disulfides." Proceedings of the National Academy of Sciences of the United States of America 96(16): 8913-8918. Tan, N.-H. and J. Zhou (2006). "Plant cyclopeptides." Chemical Reviews 106(3): 840-895. Tautz, D. (2014). "The discovery of de novo gene evolution." Perspectives in Biology and Medicine 57(1): 149-161. Tautz, D. and T. Domazet-Lošo (2011). "The evolutionary origin of orphan genes." Nature Reviews Genetics 12(10): 692-702. Taylor, Z. N., D. W. Rice and J. D. Palmer (2015). "The complete moss mitochondrial genome in the angiosperm Amborella is a chimera derived from two moss whole-genome transfers." PLoS ONE 10(11). Tedford, H. W., J. I. Fletcher and G. F. King (2001). "Functional Significance of the β-Hairpin in the Insecticidal Neurotoxin ω-Atracotoxin-Hv1a." Journal of Biological Chemistry 276(28): 26568-26576. Terras, F. R. G., H. M. E. Schoofs, M. F. C. De Bolle, F. Van Leuven, S. B. Rees, J. Vanderleyden, B. P. A. Cammue and W. F. Broekaert (1992). "Analysis of two novel classes of plant antifungal proteins from radish (Raphanus sativus L.) seeds." J. Biol. Chem. 267: 15301-15309. Thongyoo, P., C. Bonomelli, R. J. Leatherbarrow and E. W. Tate (2009). "Potent inhibitors of β-tryptase and human leukocyte elastase based on the MCoTI-II scaffold." Journal of Medicinal Chemistry 52(20): 6197-6200. Thony-Meyer, L., A. Bock and H. Hennecke (1992). "Prokaryotic polyprotein precursors." FEBS Letters 307(1): 62-65. Thorpe, S. C., D. M. Kemeny, R. C. Panzani, B. McGurl and M. Lord (1988). "Allergy to castor bean: II. Identification of the major allergens in castor bean seeds." Journal of Allergy and Clinical Immunology 82(1): 67-72. Toll-Riera, M., N. Bosch, N. Bellora, R. Castelo, L. Armengol, X. Estivill and M. M. Alba (2009). "Origin of primate orphan genes: a comparative genomics approach." Molecular Biology and Evolution 26(3): 603-612. Torres, C., M. D. Fernandez, D. M. Flichman, R. H. Campos and V. A. Mbayed (2013). "Influence of overlapping genes on the evolution of human hepatitis B virus." Virology 441(1): 40-48. Torresi, J. (2002). "The virological and clinical significance of mutations in the overlapping envelope and polymerase genes of hepatitis B virus." J Clin Virol 25(2): 97-106. Trevisan, G., G. Maldaner, N. A. Velloso, S. Sant'Anna Gda, V. Ilha, C. Velho Gewehr Cde, M. A. Rubin, A. F. Morel and J. Ferreira (2009). "Antinociceptive effects of 14-membered cyclopeptide alkaloids." J Nat Prod 72(4): 608-612. Tyndall, J. D. A., T. Nall and D. P. Fairlie (2005). "Proteases universally recognize beta strands in their active sites." Chemical Reviews 105(3): 9731000. Uddling, J., J. Gelang-Alfredsson, K. Piikki and H. Pleijel (2007). "Evaluating the relationship between leaf chlorophyll concentration and SPAD-502 chlorophyll meter readings." Photosynthesis Research 91(1): 37-46. 167 Ueki, N., K. Someya, Y. Matsuo, K. Wakamatsu and H. Mukai (2007). "Cryptides: functional cryptic peptides hidden in protein structures." Biopolymers 88(2): 190-198. Van de Peer, Y., S. Maere and A. Meyer (2009). "The evolutionary significance of ancient genome duplications." Nature Reviews Genetics 10(10): 725-732. van der Graaff, E., R. Schwacke, A. Schneider, M. Desimone, U. I. Flugge and R. Kunze (2006). "Transcription analysis of Arabidopsis membrane transporters and hormone pathways during developmental and induced leaf senescence." Plant Physiology 141(2): 776-792. Vanderperre, B., J. F. Lucier, Vanderperre, M. Wisztorski, M. "Direct detection of alternative human significantly expands the C. Bissonnette, J. Motard, G. Tremblay, S. Salzet, F. M. Boisvert and X. Roucou (2013). open reading frames translation products in proteome." PLoS ONE 8(8): e70698. Vinckenbosch, N., I. Dupanloup and H. Kaessmann (2006). "Evolutionary fate of retroposed gene copies in the human genome." Proceedings of the National Academy of Sciences of the United States of America 103(9): 3220-3225. Vitousek, P. M. and R. W. Howarth (1991). "Nitrogen limitation on land and in the sea: how can it occur?" Biogeochemistry 13(2): 87-115. Voss, R. H., U. Ermler, L.-O. Essen, G. Wenzl, Y.-M. Kim and P. Flecker (1996). "Crystal Structure of the Bifunctional Soybean Bowman-Birk Inhibitor at 0.28-nm Resolution." European Journal of Biochemistry 242(1): 122-131. Wada, S., H. Ishida, M. Izumi, K. Yoshimoto, Y. Ohsumi, T. Mae and A. Makino (2009). "Autophagy plays a role in chloroplast degradation during senescence in individually darkened leaves." Plant Physiology 149(2): 885-893. Wang, Z., F. Chen, X. Li, H. Cao, M. Ding, C. Zhang, J. Zuo, C. Xu, J. Xu, X. Deng, Y. Xiang, W.J.J. Soppe, and Y. Liu (2016). "Arabidopsis seed germination speed is controlled by SNL histone deacetylase-binding factormediated regulation of AUX1." Nature Communications 7(13412). Wang, C. K., C. W. Gruber, M. Cemazar, C. Siatskas, P. Tagore, N. Payne, G. Sun, S. Wang, C. C. Bernard and D. J. Craik (2014). "Molecular grafting onto a stable framework yields novel cyclic peptides for the treatment of multiple sclerosis." ACS Chem Biol 9(1): 156-163. Wang, W., H. Zheng, C. Fan, J. Li, J. Shi, Z. Cai, G. Zhang, D. Liu, J. Zhang, S. Vang, Z. Lu, G. K.-S. Wong, M. Long and J. Wang (2006). "High Rate of Chimeric Gene Origination by Retroposition in Plant Genomes." The Plant Cell 18(8): 1791-1802. Watanabe, M., S. Balazadeh, T. Tohge, A. Erban, P. Giavalisco, J. Kopka, B. Mueller-Roeber, A. R. Fernie and R. Hoefgen (2013). "Comprehensive dissection of spatiotemporal metabolic shifts in primary, secondary, and lipid metabolism during developmental senescence in Arabidopsis." Plant Physiology 162(3): 1290-1310. Waterworth, W.M., S. Footitt, C.M. Bray, W.E. Finch-Savage, C.E. West (2016). "DNA damage checkpoint kinase ATM regulates germination and maintains genome stability in seeds." Proceedings of the National Academy of Sciences of the United States of America 113(34): 9647-9652. 168 Weaver, L. M. and R. M. Amasino (2001). "Senescence is induced in individually darkened Arabidopsis leaves, but inhibited in whole darkened plants." Plant Physiol 127(3): 876-886. Wiederstein, M. and M.J. Sippl, (2007) "ProSA-web: interactive web service for the recognition of errors in three-dimensional structures of proteins." Nucleic Acids Research 35:W407-W410. Winter, D., B. Vinegar, H. Nahal, R. Ammar, G. V. Wilson and N. J. Provart (2007). "An "electronic fluorescent pictograph" browser for exploring and analyzing large-scale biological data sets." PLoS ONE 2(8): e718. Witherup, K. M., M. J. Bogusky, P. S. Anderson, H. Ramjit, R. W. Ransom, T. Wood and M. Sardana (1994). "Cyclopsychotride A, A biologically active, 31residue cyclic peptide isolated from Psychotria longipes." J. Nat. Prod. 57: 1619-1625. Wolny, E., A. Braszewska-Zalewska and R. Hasterok (2014). "Spatial distribution of epigenetic modifications in Brachypodium distachyon embryos during seed maturation and germination." PLoS ONE 9(7). Wong, C. T., D. K. Rowlands, C. H. Wong, T. W. Lo, G. K. Nguyen, H. Y. Li and J. P. Tam (2012). "Orally active peptidic bradykinin B1 receptor antagonists engineered from a cyclotide scaffold for inflammatory pain treatment." Angew Chem Int Ed Engl 51(23): 5620-5624. Wu, D.-D., D. M. Irwin and Y.-P. Zhang (2011). "De novo origin of human protein-coding genes." PLoS Genetics 7(11): e1002379. Wu, X. and P. A. Sharp (2013). "Divergent transcription: a driving force for new gene origination?" Cell 155(5): 990-996. Xiao, W., H. Liu, Y. Li, X. Li, C. Xu, M. Long and S. Wang (2009). "A rice gene of de novo origin negatively regulates pathogen-induced defense response." PLoS ONE 4(2): e4603. Xie, C., Y. E. Zhang, J. Y. Chen, C. J. Liu, W. Z. Zhou, Y. Li, M. Zhang, R. Zhang, L. Wei and C. Y. Li (2012). "Hominoid-specific de novo protein-coding genes originating from long non-coding RNAs." PLoS Genetics 8(9): e1002942. Xu, W., L. Li, L. Du and N. Tan (2011). "Various mechanisms in cyclopeptide production from precursors synthesized independently of non-ribosomal peptide synthetases." Acta Biochimica et Biophysica Sinica 43(10): 757-762. Xue, S. A., M. D. Jones, Q. L. Lu, J. M. Middeldorp and B. E. Griffin (2003). "Genetic diversity: frameshift mechanisms alter coding of a gene (Epstein-Barr virus LF3 gene) that contains multiple 102-base-pair direct sequence repeats." Molecular and Cellular Biology 23(6): 2192-2201. Yamada, K., T. Shimada, M. Kondo, M. Nishimura and I. Hara-Nishimura (1999). "Multiple functional proteins are produced by cleaving Asn-Gln bonds of a single precursor by Vacuolar Processing Enzyme." Journal of Biological Chemistry 274(4): 2563-2570. Yamada, K., T. Shimada, M. Nishimura and I. Hara-Nishimura (2005). "A VPE family supporting various vacuolar functions in plants." Physiologia Plantarum 123(4): 369-375. 169 Yamaguchi, N., B. Anand-Apte, M. Lee, T. Sasaki, N. Fukai, R. Shapiro, I. Que, C. Lowik, R. Timpl and B. R. Olsen (1999). "Endostatin inhibits VEGF-induced endothelial cell migration and tumor growth independently of zinc binding." The EMBO Journal 18(16): 4414-4423. Yamazaki, M., T. Shimada, H. Takahashi, K. Tamura, M. Kondo, M. Nishimura and I. Hara-Nishimura (2008). "Arabidopsis VPS35, a retromer component, is required for vacuolar protein sorting and involved in plant growth and leaf senescence." Plant Cell Physiol 49(2): 142-156. Yan, A., M. Wu, L. Yan, R. Hu, I. Ali, Y. Gan (2014). "AtEXP2 is involved in seed germination and abiotic stress response in Arabidopsis." PLoS One 9(1): 1–10. Yang, S., J. R. Arguello, X. Li, Y. Ding, Q. Zhou, Y. Chen, Y. Zhang, R. Zhao, F. Brunet, L. Peng, M. Long and W. Wang (2008). "Repetitive element-mediated recombination as a mechanism for new gene origination in Drosophila." PLoS Genetics 4(1): e3. Yang, Z. and J. Huang (2011). "De novo origin of new genes with introns in Plasmodium vivax." FEBS Letters 585(4): 641-644. Yoshida, S., S. Maruyama, H. Nozaki and K. Shirasu (2010). "Horizontal gene transfer by the parasitic plant Striga hermonthica." Science 328(5982): 1128. Yuan, C., L. Chen, E. Meehan, N. Daly, D. Craik, M. Huang and J. Ngo (2011). "Structure of catalytic domain of Matriptase in complex with Sunflower trypsin inhibitor-1." BMC Structural Biology 11(1): 30. Zaaijer, H. L., F. J. van Hemert, M. H. Koppelman and V. V. Lukashov (2007). "Independent evolution of overlapping polymerase and surface protein genes of hepatitis B virus." Journal of General Virology 88(Pt 8): 2137-2143. Zhang, D., J. Chen, L. Deng, Q. Mao, J. Zheng, J. Wu, C. Zeng and Y. Li (2010). "Evolutionary selection associated with the multi-function of overlapping genes in the hepatitis B virus." Infection Genetics and Evolution 10(1): 84-88. Zhang, G., G. Guo, X. Hu, Y. Zhang, Q. Li, R. Li, R. Zhuang, Z. Lu, Z. He, X. Fang, L. Chen, W. Tian, Y. Tao, K. Kristiansen, X. Zhang, S. Li, H. Yang and J. Wang (2010). "Deep RNA sequencing at single base-pair resolution reveals high complexity of the rice transcriptome." Genome Research 20(5): 646-654. Zhang, J. (2006). "Parallel adaptive origins of digestive RNases in Asian and African leaf monkeys." Nature Genetics 38(7): 819-823. Zhang, J., A. M. Dean, F. Brunet and M. Long (2004). "Evolving protein functional diversity in new genes of Drosophila." Proceedings of the National Academy of Sciences of the United States of America 101(46): 16246-16250. Zhang, J., Y. P. Zhang and H. F. Rosenberg (2002). "Adaptive evolution of a duplicated pancreatic ribonuclease gene in a leaf-eating monkey." Nature Genetics 30(4): 411-415. Zhang, Y.-w., S. Liu, X. Zhang, W.-B. Li, Y. Chen, X. Huang, L. Sun, W. Luo, W. J. Netzer, R. Threadgill, G. Wiegand, R. Wang, S. N. Cohen, P. Greengard, F.F. Liao, L. Li and H. Xu (2009). "A Functional Mouse Retroposed Gene Fg01 Reduces Alzheimer’s β-Amyloid Levels and Tau Phosphorylation." Neuron 64(3): 10.1016/j.neuron.2009.1008.1036. 170 Zhang, Y., Y. Wu, Y. Liu and B. Han (2005). "Computational Identification of 69 Retroposons in Arabidopsis." Plant physiology 138(2): 935-948. Zhang, Y. E., P. Landback, M. D. Vibranovski and M. Long (2011). "Accelerated recruitment of new brain development genes into the human genome." PLoS Biology 9(10): e1001179. Zhao, L., P. Saelao, C. D. Jones and D. J. Begun (2014). "Origin and spread of de novo genes in Drosophila melanogaster populations." Science 343(6172): 769-772. Zhou, Q. and W. Wang (2008). "On the origin and evolution of new genes--a genomic and experimental perspective." J Genet Genomics 35(11): 639-648. Zhou, Q., G. Zhang, Y. Zhang, S. Xu, R. Zhao, Z. Zhan, X. Li, Y. Ding, S. Yang and W. Wang (2008). "On the origin of new genes in Drosophila." Genome Research 18(9): 1446-1455. Zhou, Y., J.-s. Wang, X.-j. Yang, D.-h. Lin, Y.-f. Gao, Y.-j. Su, S. Yang, Y.-j. Zhang and J.-j. Zheng (2013). "Peanut allergy, allergen composition, and methods of reducing allergenicity: A review." International Journal of Food Science & Technology 2013: 1-8. Zhu, Z., Y. Zhang and M. Long (2009). "Extensive structural renovation of retrogenes in the evolution of the Populus genome." Plant physiology 151(4): 1943-1951. Zhukov, I., L. Jaroszewski and A. Bierzynski (2000). "Conservative mutation Met8 --> Leu affects the folding process and structural stability of squash trypsin inhibitor CMTI-I." Protein Sci 9(2): 273-279. 171