Download Integration of RNA and protein expression profiles to

IntegrationofRNAandprotein expressionprofilestostudyhuman cells FridaDanielsson ©FridaDanielsson Stockholm2016 RoyalInstituteofTechnology(KTH) SchoolofBiotechnology ScienceforLifeLaboratory SE-17165Solna Sweden Printedat UniversitetsserviceUS-AB DrottningKristinasVäg53B SE-11428Stockholm Cover image: The BUB1B mitotic checkpoint serine/threonine kinase B localizedtokinetochoresduringchromosomalsegregationinmitosis,stained withtheantibodyHPA008419inatransformedBJcell. AkademiskavhandlingsommedtillståndavKTHiStockholmframläggesför granskning för avläggande av teknisk doktorsexamen fredagen den 16 December2016kl.13.00IRockefellersalen,KarolinskaInstitutet,Solna. ISBN978-91-7729-209-8 TRITA-BIOReport2016-22 ISSN16542312 2 Abstract Cellular life is highly complex. In order to expand our understanding of the workings of human cells, in particular in the context of health and disease, detailed knowledge about the underlying molecular systems is needed. The unifying theme of this thesis concerns the use of data derived from sequencing of RNA, both within the field of transcriptomics itself and as a guide for further studies at the level of protein expression. In paper I, we showed that publicly available RNA-seq datasets are consistent across different studies, requiring only light processing for the data to cluster according to biological, rather than technical characteristics. This suggests that RNA-seq has developed into a reliable and highly reproducible technology,andthattheincreasingamountofpubliclyavailableRNA-seqdata constitutes a valuable resource for meta-analyses. In paper II, we explored theabilitytoextrapolateproteinconcentrationsbytheuseofRNAexpression levels. We showed that mRNA and corresponding steady-state protein concentrations correlate well by introducing a gene-specific RNA-to-protein conversion factor that is stable across various cell types and tissues. The resultsfromthisstudyindicatetheutilityofRNA-seqalsowithinthefieldof proteomics. The second part of the thesis starts with a paper in which we used transcriptomics to guide subsequent protein studies of the molecular mechanismsunderlyingmalignanttransformation.InpaperIII,weapplieda transcriptomics approach to a cell model for defined steps of malignant transformation, and identified several genes with interesting expression patterns whose corresponding proteins were further analyzed with subcellularspatialresolution.Severaloftheseproteinswerefurtherstudied inclinicaltumorsamples,confirmingthatthiscellmodelprovidesarelevant systemforstudyingcancermechanisms.InpaperIV,wecontinuedtoexplore thetranscriptionallandscapeinthesamecellmodelundermoderatehypoxic conditions. Toconclude,thisthesisdemonstratestheusefulnessofRNA-seqdata, from a transcriptomics perspective and beyond; to guide in analyses of protein expression, with the ultimate goal to unravel the complexity of the humancell,fromaholisticpointofview. Keywords:RNA-seq,Transcriptomics,Proteomics,Malignanttransformation, Cancer,Functionalenrichment i Populärvetenskapligsammanfattning Dinkroppäruppbyggdavövertrettiosjutusenmiljarderlevandeceller.Varje enskildcellutgörettegetlitetuniversumdärmiljonermolekylerinteragerari olika funktioner som tillsammans skapar cellens identitet. Den unika kombinationenavmolekylersomverkarinneidinacelleravgörtillexempel omduärfriskellersjuk,vadsomgörennjuretillennjureochvilkenfärgdina ögon har. Den repertoar av molekyler som finns i cellerna finns förprogrammerad i dina gener som består av DNA. Sammansättningen av DNAärstabilochväldigtlikmellanindivider,mentrotsdettaärmänniskor, biologiskt sett, väldigt olika. Det beror på att DNA översätts till olika RNAmolekyler och proteiner. Proteinerna är cellens viktigaste byggstenar, de deltarimångalivsviktigafunktioner. Flödet av information från DNA, till RNA, och vidare till protein kallas den Centrala dogmen inom molekylärbiologi och formulerades på 1950-talet. Tänk dig cellen som en liten fabrik. Varje gång något ska byggas i fabriken hämtasenritningiettbibliotek.DettabibliotekkanliknasviddittDNA.Den specifika ritning som tas fram ur biblioteket är då ditt RNA, och själva produktensomsedantillverkasutifrånritningenärettprotein,detsomsedan kommer till användning. Den kompletta sammansättningen av RNAmolekyleriencellkallastranskriptom.Eftersomtranskriptometkanvariera mycketovertid,ochunderolikaförhållandenutgördetenutmärktkällatill informationomcellerstillstånd.Genomattstuderatranskriptometkanvifå ett indirekt mått på vilka proteiner som finns närvarande och bättre förstå cellersfunktioner. För att ta reda på hur mycket RNA som har skapats från varje gen används RNA-sekvensering.Ilaboratorieröverhelavärldenståravancerademaskiner och arbetar dygnet runt med att läsa av RNA-sekvenser. Den stora utmaningenärintelängreattkunnasekvenseraallaRNA-molekyleriettenda experiment, utan snarare att kunna tolka den data som kommer ut ur maskinen på ett meningsfullt sätt. Data från RNA-sekvenseringsexperiment laddas ofta upp i publika databaser. På så vis främjas ett öppet klimat där forskarekangeochtaavvarandrasdata.Förattdettaskafungerasåbrasom möjligt är det viktigt att vi kan lita på innehållet i databaserna. I den första studien i denna avhandling undersökte vi pålitligheten hos publika dataset. Genom att studera publika RNA-data från olika studier kunde vi konstatera att dessa dataset var mer lika ett annat dataset med liknande biologiskt ursprung än ett annat dataset från samma laboratorium, men av annan biologisk härkomst. Resultaten visade att publika RNA-data är relativt tillförlitliga. ii Nu kanske du undrar varför forskare är så intresserade av RNA när det faktiskt är proteinerna som är cellernas viktigaste och identitetsskapande molekyler. Eftersom RNA är lättare och mindre tidskrävande att analysera i stor skala, jämfört med proteiner, är det smidigt att använda RNA som en indikatorförhurcellersuttryckavproteinserut.Denandrastudienidenna avhandlinghandlaromrelationenmellanRNAochproteiner.Vikomframtill att varje gen har ett bestämt förhållande mellan dess koncentration av RNA ochprotein,somärkonstantiolikatyperavcellerochvävnader. Iavhandlingenstredjestudielätviossguidasavtranskriptometföratthitta intressantaproteinervarsrollärviktigiutvecklingavcancer.Härstuderade vitranskriptometiettmodellsystembeståendecellerunderfyraolikastegi cancerutveckling. På så sätt kunde vi identifiera gener med intressanta uttrycksprofiler under fyra stadier, från det normala tillståndet till ett fullt utvecklat och aggressivt cancertillstånd. Med hjälp av antikroppar kunde vi sedananalyseramotsvarandeproteiner,ochdetekteraförändringariuttryck på en mycket detaljerad nivå. Vi identifierade ett flertal potentiellt viktiga markörer för cancerutveckling och validerade resultaten genom att studera samma proteiner i tumörprover. I den fjärde studien studerade vi hur transkriptometförändrasunderolikagraderavcancerutvecklingtillföljdav låg syretillförsel. Vi utsatte samma cancer-modell för odling i 3 % syrehalt och identifierade ett antal intressanta gener vars uttryck påverkades avden förändrade syrenivån, till exempel gener involverade i celldelning och fettmetabolismen.DenhäravhandlingendemonstrerarhuranalyseravRNA kan användas som en vägvisare för att identifiera relevanta proteiner att fokuseravidarepå.Förattökavårförståelsefördekomplexaprocessersom pågåricellensuniversum,behövsfortsattastudieravhurproteineruttrycks, förändrasochinteragerar. iii Thesisdefense This thesis will be defended December 16th 2016, at 13.00, for the degreeofDoctorofPhilosophy(PhD)ininBiotechnology. Location:Rockefellersalen,Nobelsväg11,Solna. Respondent FridaDanielssongraduatedasaMasterofScienceandEngineeringfromKTH School of Biotechnology in 2011 and pursued her PhD studies in the Cell ProfilingGroupatthedepartmentofProteomicsandNanobiotechnology. FacultyOpponent ChristineVogel,AssistantprofessorinBiologyattheDepartmentofBiology, CenterforGenomicsandSystemsBiology,NewYorkUniversity,NY Evaluationcommittee CarolinaWählby,ProfessorinquantitativemicroscopyatUppsalaUniversity, Uppsala Carsten Daub, Assistant Professor in Bioinformatics, heading the Clinical Transcriptomics research at the Department of Biosciences and Nutrition at KarolinskaInstitute,Solna Malin Andersson, Assistant professor at the Department of Pharmaceutical Biosciences,UppsalaUniversity,Uppsala ChairmanoftheThesisDefense Stefan Ståhl, Professor in Molecular Biotechnology at KTH School of Biotechnology,Stockholm MainSupervisor Emma Lundberg, Associate professor in Cell Biology Proteomics at KTH and headingtheCellProfilingGroupinTheHumanProteinAtlas,Stockholm Co-supervisors Mikael Huss, Associate professor in bioinformatics at KTH and academic consulting data scientist/bioinformatician in the SciLifeLab long-term bioinformaticssupportunit,Stockholm Mathias Uhlén, Professor in Microbiology at KTH School of Biotechnology, Stockholm iv Listofpublications This thesis includes the following publications, which are referred to in the textbythecorrespondingRomannumerals(I-IV)andfoundattheendofthe thesis.Allarticlesarereproducedwithpermissionfromtheirpublishers. I Frida Danielsson, Tojo James, David Gomez-Cabrero, Mikael Huss (2015). AssessingtheconsistencyofpublichumantissueRNA-seqdatasets. BriefingsinBioinformatics,16(6),2015,941–949. DOI:10.1093/bib/bbv017 II Fredrik Edfors, Frida Danielsson, Björn M Hallström, Lukas Käll, EmmaLundberg,FredrikPontén,BjörnForsströmandMathiasUhlén (2016). Gene-specificcorrelationofRNAandproteinlevelsinhumancellsand tissues. MolecularSystemsBiology(2016)12:883. DOI:10.15252/msb.20167144 III Frida Danielsson, Marie Skogs, Mikael Huss, Elton Rexhepaj, Gillian O’Hurley,DanielKlevebring,FredrikPontén,AnnicaK.B.Gad,Mathias Uhlén,EmmaLundberg(2013). Majority of differentially expressed genes are down-regulated during malignanttransformationinafour-stagemodel. ProceedingsoftheNationalAcademyofSciences110(17)·April2013 DOI:10.1073/pnas.1216436110 IV Frida Danielsson, Erik Fasterius, Kemal Sanli, Cheng Zhang, Adil Mardinoglu, Cristina Al-Khalili, Mikael Huss, Mathias Uhlén, Emma Lundberg Transcriptome profiling of a cell line model for malignant transformationinresponsetomoderatehypoxia.(2016) Manuscript v Respondent’scontributionstotheincludedpapers I II III IV vi Main responsibility for bioinformatics analyses and co-responsible authorduringmanuscriptwriting. Main responsibility for cell cultivation, transcriptomics experiments andrelateddataanalysis. Main responsibility for experimental planning, laboratory performance and data visualization. Co-responsible author during manuscriptwriting. Main responsibility for experimental planning, performance, data analysisandmainresponsibleauthorduringmanuscriptwriting. Abbreviations AQUA CAGE cDNA CLSM CSF cyTOF cycIF DDA DIA DNA EB EBI ELISA ENCODE ESI EST FPKM FRAP FRET GEO GO GTEx HPA hTERT HTS ICAT IF IHC IRES iTRAQ IWGAV LC MALDI MRM mRNA MS MS/MS MS1 MS2 Absolutequantification capanalysisofgeneexpression complementaryDNA confocallaserscanningmicroscopy cerebrospinalfluid cytometrybytime-of-flight cyclicimmunofluorescence data-dependentacquisition data-independentacquisition deoxyribonucleicacid exabyte EuropeanBioinformaticsInstitute enzyme-linkedimmunosorbentassay EncyclopediaofDNAElements electrosprayionization expressedsequencetag fragmentsperkilobaseoftranscriptpermillionreadsmapped fluorescencerecoveryafterphotobleaching Försterresonanceenergytransfer GeneExpressionOmnibus GeneOntology Genotype-TissueExpression HumanProteinAtlas telomerasereversetranscriptase high-throughputSequencing isotope-codedaffinitytag immunofluorescence immunohistochemistry internalribosomeentrysite isobarictagsforrelativeandabsolutequantitation InternationalWorkingGroupforAntibodyValidation liquidchromatography matrix-assistedlaserdesorptionionization multiplereactionmonitoring messengerRNA massspectrometry tandemmassspectrometry firststageofMS/MS secondstageofMS/MS vii NCBINationalCenterforBiotechnologyInformation NIH NationalInstituteofHealth ORF openreadingframe PB petabyte PCR polychainreaction PFA paraformaldehyde Poly-A poly-adenylated PRIDE ThePRoteomicsIDEntificationsdatabase PRM parallelreactionmonitoring PTM post-translationalmodification RIA radioimmunoassay RIN RNAIntegrityNumber RNA ribonucleicacid RNA-seq RNAsequencing RPF ribosomeprotectedfragments SAM sequencealignmentmap SILAC stableisotopelabelingwithaminoacidsincellculture SOP standardoperatingprocedures SRA sequencereadarchive SRM selectedreactionmonitoring SV40 Simianvirus40 TMT tandemmasstags TPM transcriptspermillion viii Contents Abstract i Populärvetenskapligsammanfattning ii Thesisdefense iv Listofpublications v Respondent’scontributionstotheincludedpapers vi Abbreviations vii 1.Thehumancell Themacromoleculesofthecell Compartmentalizationofbiologicalprocesses Thecentraldogmaofmolecularbiology Evolutionandcancer Celllines Malignanttransformation TheBJmodel 1 2 4 5 5 6 7 8 2.Thehumantranscriptome RNAsequencing RNAextraction,enrichment,librarypreparationandsequencing Mappingandquantifyingtranscriptomes Differentialgeneexpression Challengesandconsiderations 9 9 10 12 14 15 3.Studyingthehumanproteome MS-basedproteomics 17 18 19 21 22 22 24 25 Quantitativemassspectrometry Targetedmassspectrometry Affinity-basedproteomics Immunofluorescence UsingRNAlevelsasanindirectmeasureofproteinexpression Challengesandconsiderations 4.Dataanalysisandfunctionalinterpretation Publicdataresources Transcriptomicsdatabases Proteomicsdatabases TheHumanProteinAtlas GeneOntology Functionalenrichmentanalysis Challengesandconsiderations 27 27 27 28 29 29 30 31 5.Aimsofthethesis 33 6.Presentinvestigation AssessingtheconsistencyamongpublicRNA-seqdata(PaperI) Usingtranscriptomicstoguideproteomics(PaperII) 34 35 36 Combining transcriptomics and protein profiling to study malignant transformation(paperIIIandIV) 38 Concludingobservationsandfutureperspectives 39 Acknowledgements 41 Bibliography 43 2 1.Thehumancell Withinourbodiesareover37trillioncells,anenormousnumberthatisfar beyondthenumberofstarsinourgalaxy1,2.Thecellisoftendescribedasthe smallest living entity, the basic unit of all living organisms that harbors the most fundamental properties of life; that is to grow, reproduce, sense and respond to the environment, and to maintain a chemical composition that favorsitspersistence. Thestructureofthecellwasfirstdescribedinthe17thcenturybytwofellows oftheRoyalSociety,RobertHookeandAntonievanLeeuwenhoek,whoboth developedopticalinstrumentationtostudymicroorganisms.In1665,Robert Hookepublishedthebook“Micrographia”withdetaileddrawingsofvarious matters based on his observations, using a first generation compound microscope. In this groundbreaking book the term “cell” was coined, as he describedtheunitsofasliceofdeadcorkobservedunderhislenses.Theuse of this word stems from the Latin word “cella”, meaning small room. The Dutch businessman and scientist Antonie van Leeuwenhoek perfected the making of single lens microscopes, enabling him to observe bacteria, sperm cellsandbloodcells3,4.AlthoughHookeandLeeuwenhoeksharethehonorof developing the first optical instrumentation required for the observation of cells,ittookalmostanothertwocenturiesuntilthecurrentviewofacellas the basic unit for a living organism emerged, the so called cell theory. This theorywasofficiallypostulatedin1838-1839asaresultofconclusionsmade bythebotanistMatthiasJacobSchleidenandzoologistTheodorSchwann4,5. The great fascination of the human cell has attracted substantial attention amongresearcherseversinceitwasfirstdiscovered,continuouslyexpanding theunderstandingofthecellularlandscapeanditsdiversity.Consideringits original notion as a small room, current knowledge rather supports a view, where opening the door to that small room will lead us to a bustling metropolis, harboring millions of molecules that interact and orchestrate a myriad of biological functions through complex networks. Cellular life is a balancing act. In order for this highly complex system to function, a high degreeofregulationandcontrolisrequired.Tounderstandcellularfunction, and in particular in the context of health and disease, detailed knowledge aboutthecellularsystemisneeded,fromaholisticpointofview. 1 Themacromoleculesofthecell Themostabundantchemicalcompoundfoundwithincellsiswater,takingup over70%ofthetotalcellmass.Therestofthecellvolumecontainsinorganic ions and larger organic molecules. The organic molecules in the cell are generallyclassifiedintolipids,carbohydrates,nucleicacidsandproteins.The latter three are commonly referred to as the macromolecules of the cell. Macromolecules are large molecules (> 5kD) that are formed by the joining (polymerization)ofhundreds,oreventhousands,ofsmallerprecursors,and occupyabout80-90%ofthecellsdryweight6. DNA,ordeoxyribonucleicacid,isthebasicunitofgeneticmaterial.Itcarries information encoded in sequences of nucleotides that are made of a fivecarbon sugar (deoxyribose), a phosphate group and one of the four bases adenine(A),cytosine(C),guanine(G)andthymine(T).TheDNAmoleculeis formed by the joining of two strands that are held together by hydrogen bondingbetweenthebasesinapairwisemanner,whereAalwaysbindstoT, and C always binds to G. This double-strand molecule is spatially arranged into a helix-like structure that provides the robustness needed for its most prominentfunction,toreplicateandtransfergeneticinformationduringcell division. The story of DNA began in the 19th century when Gregor Mendel performedgroundbreakingexperimentsonpeaplantsthatrevealedthebasic principles of hereditary transmission7. Around the same time, the Swiss scientist Friedrich Miescher was examining leukocytes, in which he discovered a precipitate of unknown character in the nuclei of the cells. He concluded that this precipitate must be a “multi-based acid” and named it Nuclein8,9.AlthoughtheseearlyfindingspioneeredtheresearchonDNA,they are often overshadowed by the works of James Watson and Francis Crick, whoalmost100yearslatersolvedthestructureoftheDNAmolecule,which laidthegroundforacontinuouslyevolvingdevelopmentofmethodologiesto studygeneticmaterial10.Ineukaryoticcells,mostDNAispresentinthenuclei but a small part is also found in mitochondria. Packed around protein complexes in the chromosomes, it coordinates the cellular distribution of other functional molecules through the expression of genes. The term gene wasfirstusedalready1909whentheworkofMendelgainednewattention;it was then referred to as a “discrete unit of heredity”. The definition of this termhasbeenwidelydebatedoverthelastcentury,alongwithnewadvances inthefieldofgenomics,butamorerecentdefinitionstates“ageneisaunion of genomic sequences encoding a coherent set of potentially overlapping functionalproducts”11. 2 RNA, or ribonucleic acid, is the outcome of transcription, where regions of DNAarecopiedintocomplementaryRNAmoleculescalledtranscripts.RNAis structurally related to DNA with a few exceptions. Like the DNA molecule, RNAisalsobuiltupbyfournucleotides,butwiththemajordifferencesthat itsfive-carbonsugarisriboseandthatitcontainsthebaseUracil(U)instead of thymine (T). In addition, RNA appears in most cases as a single stranded moleculeanddisplaysalowermolecularweightthanDNA.ElliotVolkinand Lawrence Astrachan discovered the RNA molecule in 1956 and soon thereafter, its intermediary role as a messenger between DNA and protein wasdiscoveredbyseveralindependentresearchers12-15.Itwaslongbelieved that RNA most of all serves as a messenger bridging between DNA and protein, a view that has gradually changed with increased understanding of the repertoire of different RNA subtypes that carry out various biological functions.Over60%ofthecellulargenomeistranscribedintoRNA.Outofthe total RNA, 85-90% constitutes ribosomal RNA, 5% is mRNA that codes for protein,andtherestisothernon-codingRNAthathavegottenmoreattention in recent years, proven to display important regulatory functions 16-18. In higher eukaryotes, mRNA transcripts are processed into different isoforms via the process of alternative splicing. Over 90% of all protein-coding genes containing multiple exons are subject to alternative splicing, thereby expanding both the cellular transcriptome and proteome tremendously, withouttheneedforadditionalgenes17,19. Proteinswerefirstdescribedinthe18thcenturybyresearchersworkingwith plants,wheatglutenandalbuminfrombloodandeggwhites20,21.TheSwedish chemistJönsJacobBerzeliusisacknowledgedforcoiningthetermProtein,by suggestingtheusageoftheterminalettertohisfellowscientistMulderwho thenusedtheterminapublicationfrom183822.ThewordproteinisofGreek origin and means “of primary importance” or “of the first rank”, a name it certainlydeservesasalmostallprocessesinthecellinvolvetheactivityand interactionsofproteins.Comparedtoothercellularcomponents,proteinsare relativelylargemolecules,formedbysequencesofaminoacidsthatareheld together by peptide bonds. The sequence of amino acids is dictated by the sequence of the corresponding nucleotides that undergo translation, and humancellscontaingeneticrecipesfor20differentaminoacids.Asequence ofaminoacidsiscalledapeptide.Thesearrangeintofunctionalproteinsby folding into three-dimensional structures. After the process of translation, when proteins are formed, they often undergo further modifications, called post-translational modifications (PTMs). PTMs are commonly mediated by enzymatic activity and have significant impact over the protein’s functional activity.Theycanoccuratanytimeduringaprotein’slifecycleandtwoofthe mostcommonPTMsareglycosylationandphosphorylation. 3 Compartmentalizationofbiologicalprocesses In biology, compartmentalization of different functions is evident across several scales. The spatial partitioning of biological functions results in a hierarchy of specialized and robust systems, from individual organs in our bodies, to organelles within the single cell23. At the cellular level, compartmentalization is a fundamental property of the eukaryotic cell and manifests itself through the organization into cellular organelles. The molecular landscape of the cell is rich and complex. A cell is densely populatedbymillionsofproteinmoleculesthatinteractindifferentwaysto carry out designated functions. Organization into specialized membrane bound organelles enables independent molecular interactions to occur simultaneously,withoutcrosstalkandunderoptimalchemicalconditions.For example the processes of protein translation and degradation can occur in parallel24. Moving deeper into the hierarchy of biological compartmentalization, even organelles are subjected to subcompartmentalization into more refined structures, for example different bodiesandspecklesfoundwithinthenucleus.Compartmentalizationhasalso been proposed as a cellular strategy for passive noise filtering, by enabling molecularprocessestooccurinseparatecompartmentsfromwhichnoise,in termsofirrelevantmolecules,arepreventedtoenter25. The Golgi apparatus Nucleus Microtubules Nucleolus Vesicles Cytoplasm Mitochondria Actin filaments Endoplasmic reticulum Plasma membrane Figure 1. A schematic illustration of the human cell and selected subcellular compartments. Illustrated by Annica Åberg. 4 Thecentraldogmaofmolecularbiology OneofthekeystonesofscienceingeneralistheCentralDogmaofMolecular biology; a schematic description of how genetic information is carried over between molecules, from DNA to RNA in a process called transcription and from RNA to protein in a process called translation. The central dogma was first elaborated by Francis Crick in 1956 and soon thereafter published as part of a Society for Experimental Biology symposium in 195826,27. Subsequent research on RNA cancer viruses led to the discovery of RNA reverse transcriptase that enables synthesis of DNA from RNA28,29. These findingsleadsomeresearcherstoquestiontheaccuracyofthecentraldogma, whichwassoonthereaftercommentedbyCrickinanattempttoeradicatethe widespread misconceptions30,31. The Central Dogma is still often misunderstood. In its original form, the Central Dogma states, “Once information has passed into protein, it cannot get out again”, a take-home messagethat,whenproperlyunderstood,stillholds. Transcription DNA Replication RNA Protein Translation Figure2.Thecentraldogmaofmolecularbiology. Evolutionandcancer A fundamental theory of biology is evolution, the explanation for a set of observations that was published by Charles Darwin in his groundbreaking bookOntheOriginofSpeciesin185932.Simplyput,theevolutiontheorystates that gradual changes continuously occur in the genetic composition of organisms,drivenbyanaturalselectionforadvantageouscharacteristicsand increased reproduction. Evolution is typically thought of as something that actsonwholepopulationsoverlongstretchesoftime.However,evolutionary processes are also evident within individual bodies, exemplified by the 5 process of malignant transformation when a single cell acquires cancerous mutations that are passed on through replication. This can be considered a typeofmicroevolutionactingwithinthelifetimeofanindividual,byselection acting on cancerous mutations that provide growth advantages. The microcosmsofevolutionthatwecallcancershaveproventohaveextremely harmful impact on modern society. According to the World Health Organization,cancerisamongtheleadingcausesofdeathworldwideandthe annualincidenceisexpectedtorisefrom14milliontoover20millioncases withinthenexttwodecades33.Cancerisabroaddiagnosisconsistingofmore than 200 different cancer types, each affecting a certain cell type and pathway. Celllines The fact that the majority of all living human cells are embedded in tissues makesthemdifficulttoattainforexperimentalpurposes.Thestudyofliving cellsthereforerequiressystemswherecellscanbedirectlyaccessedandkept undercontrolledconditions.Apopularapproachistheuseofcelllines;those arecellssuitableforcultivationinvitrothatserveasmodelsystemsfortheir cell type or tissue of origin. Cell lines are often derived from tumor tissue where they have already acquired the ability to replicate for an unlimited number of cycles. Cell lines can also be of primary origin with a limited lifespan, or of primary origin but further immortalized by the expression of telomerase34. The use of cell lines has several advantages, for example the possibility to obtain almost unlimited amount of sample material, which enables users to try out and optimize different experimental approaches withoutwastingprecioussamples.Celllinesalsoprovideconsistentsamples ofapurepopulation,whichcanbeusedtoreproduceresultsacrossdifferent laboratories. In addition, cell lines are relatively cheap and can be stored in freezers for long periods of time. Even though cell lines are tremendously useful,theycanonlyservetoapproximatethecharacteristicsofcellspresent within the more complex in vivo environment. Long-term use of cell lines bearstheriskofgenotypicdriftthatmakethecellsdivergegeneticallyfrom theiroriginalsource,byundergoingselectionforcellswithincreasedcapacity toproliferateunderthegivencircumstances35,36.Thefirsthumancelllinewas establishedin1951fromcervicalcancercellsandnamedHeLaafteritsdonor Henrietta Lacks. What Henrietta Lacks never knew was that over 20 tons of hercellsweretobecultivatedinworldwidelaboratoriesafterherdeath.The storyofHenriettaLacks,theHeLacelllineanditssurroundingethicalissues are discussed in the book “The immortal life of Henrietta Lacks” that was publishedin201037. 6 Malignanttransformation Malignant transformation refers to the process when primary cells become cancerous. This process involves changes in the genome that lead to disruption of regulatory circuits that together result in acquirement of new cellular characteristics. Malignant transformation is a complex molecular process. Despite an almost incalculable body of literature, a complete mechanisticunderstandingofthisprocessstillremainselusive. In the year 2000 a review paper entitled The Hallmarks of Cancer was published38.Inthispaper,theauthorsaimedtoidentifythecommonfeatures anddisruptedpathwaysthatarerequiredtoreachacancerousphenotype.By reviewingthelatterhalfofthe20thcentury’sliterature,detailedinformation about malignant transformation was compiled into one organizing principle foritsunderlyingbiology.Intheoriginalversionofthispaper,thehallmarks of cancer encompassed six essential cellular alterations that collectively dictate malignant transformation; Sustaining proliferative signaling, Evading growth suppressors, Resisting cell death, Enabling replicative immortality, InducingangiogenesisandActivatinginvasionandmetastasis. Ten years after the Hallmarks of Cancer first came out, its sequel “The Hallmarks of Cancer: Next Generation” was published39. This time, the concept was complemented with two emerging hallmarks, Reprogramming energy metabolism and Evading Immune destruction. In addition, two enabling characteristics were defined; Tumor-promoting inflammation and Genome instability and mutation, and the paper also covered discussion abouttheroleofthesurroundingtumormicroenvironment.Thegeneralized conceptforthemolecularbiologybehindmalignanttransformationprovided bythesetwopapershasheavilydominatedthefieldofcancerresearchever since. 38 Figure3.ThesixoriginalHallmarksofcancer.FigureadaptedfromHanahanetal. 7 TheBJmodel Malignant transformation is a multistep process. To reveal the underlying mechanisms behind this complex progression, a model system enabling the study of defined steps of transformation is advantageous. Traditionally, primarymurinecellshavebeengeneticallymodifiedtoundergotumorigenic conversion and serve as models for malignant transformation. Compared to murine cells, human cells are more challenging to transform, mainly due to differences in telomere biology. In contrast to murine cells, human primary cells progressively lose telomeric DNA when passaged in vitro cultivation, leadingtocellularsenescencethatischaracterizedbyadrasticproliferation decline.Severalattemptstotransformhumancellshavepointedtothelackof telomerase expression as a significant barrier to reach malignancy. In 1999, Weinberg and co-workers created a model system consisting of four human fibroblast cell lines by stepwise additions of genetic alterations. Following initialexpressionofthecatalyticsubunitoftelomerasereversetranscriptase (hTERT) in primary cells, the cells were further transformed by sequential introductionoftheSimianVirus40(SV40)Large-Toncogeneandoncogenic H-Ras. With this combination of genetic alterations, the cells gained a fully transformed phenotype, which was also confirmed invivo. By expression of hTERT, cells become immortal and are able to replicate for an indefinite number of cycles40. The second alteration introduced in this model is the Large-T antigen, expressed by the SV40 virus. Large-T contributes to the transformationofcellsbyinactivationofthewell-studiedtumorsuppressors p53 and Rb41. The final alteration in this model system is introduction of a mutatedversionofH-ras(G12V)havingoneGlycineexchangedtoValinethat makesthisGTPasekeptconstantlyinitsactiveform.Eversincethecreation of the BJ model for malignant transformation, the same combination of hTERT, Large-T and oncogenic H-Ras has been used by numerous other researchersandprovedtosuccessfullytransformavarietyofprimaryhuman celltypes42-46. 8 2.Thehumantranscriptome The word transcriptome refers to the full set of transcribed RNA molecules within a cell, tissue or organism at a given time point, but it can also be assignedtothefullsetoftranscriptsbelongingtoaspecificsub-populationof RNA, for example protein coding mRNA, regulatory RNA, ribosomal RNA, smallRNAortRNA.Incontrasttothegenome,whichischaracterizedbyits stability over different cells within an organism, the transcriptome varies greatlyanddisplaysahighdegreeofsensitivitytobothinternalchangessuch as developmental stage and circadian rhythm, and external changes such as stress and changes in the environment. The plastic nature of the transcriptomehasmadeitappealingtostudyinthecontextofdisease,owing toitspotentialtoserveasaproxyforcellularidentityanddiversity. An apparent strategy to reveal the active parts of the genome is to explore what genes are transcribed, to what quantity, and how the levels of RNA transcriptsvarybetweendifferentpopulations,overtimeandoverdifferent physiologicalconditions.Inthepast,thestudyoftranscriptomeswasmainly performed using hybridization-based methods where RNA samples were added to pre-designed cDNA probes for complementary hybridization. A major limit with these methods is dependence on an annotated genome, problemswithhighbackgroundsignalsduetocross-hybridization,alimited dynamic range due to oversaturation of signals and complicated normalization methods47. Owing to a successful era of development within thefieldofsequencingtechnology,High-ThroughputSequencing(HTS)based approachestotheanalysisofthetranscriptomeemergedinthe1990´s.These aremostcommonlyusedtoday.TheessenceofRNAsequencing(RNA-seq)is to count reads generated from different regions in the genome in order to reveal the transcriptional abundance of each and every analyzed genomic region. RNAsequencing Mining the literature, the term RNA sequencing appears for the first time in 2008intwo,todate,wellcitedpapers48,49.However,withoutusingtheterm RNA sequencing, its underlying concept had already been published before, forexamplesequencingofexpressedsequencetags(EST)thatwasapopular approach to analyze transcripts by sequencing of cDNA50. In these early innovativepapers,researcherswerecapableofanticipatingsequencingbased approachestosubsequentlyreplacethepreviousmethodsasstateoftheart technology for the study of transcriptomes51-54. Today, sequencing experiments are often employed at specialized service facilities, like the National Genomics Infrastructure at SciLifeLab in Stockholm. In this way, 9 researcherscangetaccesstoacompetitiveinfrastructureequippedwiththe latesttechnology,accompaniedwithtechnicalsupportbytrainedspecialists. The standard workflow of an RNA-seq experiment can be divided into two parts, first a laboratory part covering RNA extraction, enrichment, library preparationandtheactualsequencing.Thesecondpartisthebioinformatics part which depends on computers to process, analyze and interpret the outputfromthesequencingmachine.Thistypicallyincludesqualitycontrols, identificationoforiginwithinareferencegenome(orassemblyintocontigs), quantification of transcript abundance and often analysis of differential expressionacrossdifferentconditions.Denovotranscriptomeassemblyisan importantprocedure,especiallyintheanalysisoforganismsforwhichthere is a lack of reference genome annotation, however it is beyond the frame of thisthesis. ThemostcommonlyusedcommercialplatformsforRNA-seqarebasedonthe conversionofRNAintocomplementarycDNA,throughreversetranscription before sequencing takes place. The major reason for introducing this additional step, instead of performing direct sequencing of RNA, is to retain stability of the molecules undergoing sequencing. RNases, i.e. enzymes that degrade RNA, are ubiquitous in nature, which makes RNA highly unstable outside the intracellular environment. In addition, DNases are easier to inactivate compared to RNases, Polymerase Chain Reaction (PCR) amplificationismoresuitableforDNAthanRNA,andRNA-seqheavilyrelyon HTSprotocolsthatwereinitiallydevelopedforDNAsequencing.Despitethis trend, several recent studies claim direct sequencing of RNA to be more suitableandefficient55. RNA extraction, enrichment, library preparation and sequencing The initial step of the RNA-seq experiment is to extract RNA from cells, whether the biological sample being analyzed is a tissue sample or a pellet derived from invitrocultivation. As the majority of total RNA within cells is ribosomalrRNA,thatismostoftenconsidereduninformative,thenextstepis to enrich for the RNA subpopulation of interest. For studies of the proteincodingpartsofthegenome,mRNAistypicallyenrichedbytakingadvantage ofthepolyadenylated(poly-A)tagsatthe3’endthatcanbecapturedbythe addition of poly-T oligomers. An alternative method is to use a negative enrichment approach, where rRNA is instead depleted using rRNA-specific probes,howeverthatwillleavemorethanjustmRNAleftinthesample.After purification of the desired type of RNA, purity and intactness are usually measured.Thisistypicallyperformedwithanelectrophoresis-basedmethod where the length distribution of the RNA transcripts is compared to the ribosomal complexes, resulting in a calculated RNA Integrity Number (RIN 10 value) where 1 represents complete degradation and 10 represents pure intactness56. When RNA samples of desired concentration and quality are ready,itistimetobuildasequencinglibrary,whichisthecollectionofcDNA moleculesthataretobesequenced. Buildingasequencinglibraryconventionallystartswithfragmentationofthe RNA into appropriate size prior to cDNA synthesis. To ensure that the RNA fragments are intact without overhangs, end-repair is usually performed. Reverse transcriptase and random oligo (dNTPs) primers are first added to build an initial single-stranded cDNA library (first strand synthesis) that is further converted into double-stranded cDNA by the addition of DNA polymerase. To enable parallel sequencing of more than just one sample without the loss of information about sample origin, indexing primers are added to the ends of the cDNA. Before amplification with PCR, adaptors are alsoaddedtoenablehybridizationtosurface-boundprimersontheflowcells whereamplificationtakesplace(Figure4).Afterlibrarypreparationitistime for the real showpiece of the experiment, the actual sequencing when the orderofthebasesaredetermined.Astheworkcoveredinthisthesisdidnot involveanylaboratoryworkrelatedtotheprocessofRNA-seq,exceptforthe preparationofRNAsamplesandmeasurementsofintactness,morein-depth contentcoveringaspectsoftheactualsequencingexperimentisthereforeleft outandfocusisdirectedtothepost-sequencingpartoftheanalyses,covered inthenextsection. 5´ AAAA TTTT Figure 4.Thebasicworkflowof RNA-seqmRNAlibrarypreparation.mRNAisfirst isolated, followed by fragmentation, addition of primers, first strand synthesis, secondstrandsynthesis,adaptorligationandPCRamplification. 11 Mappingandquantifyingtranscriptomes ThebioinformaticspartoftheRNA-seqexperimentbeginsatthelevelofraw sequence data in the form of a FASTQ file. The FASTQ file contains information for all sequenced reads organized into sections covering a sequenceidentifier,thenucleotidesequence,aqualityscore identifieranda quality score assigned to each base. With the FASTQ file in hand, an initial stepistypicallytotakeadvantageofthequalityscorestointerprettheoverall qualityofthesequencerun,followedbyfilteringtoremovelowqualityreads. Inthenextphaseoftheanalysis,readsarealignedtoareferencegenomein order to identify their origin of transcription. Reference genomes are commonly stored in a text-based file format called FASTA and can easily be downloaded online from the ENSEMBL consortium or the UCSC genome browser57,58.SincethebeginningoftheRNA-seqera,awidelyusedprogram foralignmentofRNA-seqreadshasbeenTopHat,whichispartofanRNA-seq pipeline named Tuxedo that offers a collection of open-source tools for comprehensiveRNA-seqdataanalysis59.TopHatmakesuseoftheshortread aligner Bowtie, initially developed for alignment of DNA sequences, which first indexes the reference genome before aligning the reads60. After initial alignmentusingBowtie,Tophatchopsallunmappedreadsintosmallerpieces that are realigned using the same principle. At present TopHat is largely superseded by the faster and more recently released program HISAT2, developedbythesameauthors61.AnotherpopulartoolforalignmentofRNAseq reads is Star, which uses a fundamentally different approach62. Due to time-efficientalignmentthatrequireslessresourceintermsofdataclusters, HISAT2 and Star have gained wide popularity in the RNA-seq community. Read alignment generates output in the form of a Sequence Alignment Map (SAM) file which contains relevant information about the alignments in a TAB-delimited text format. As SAM files are relatively large, they are often compressedintoabinaryformatcalledBAMthatrequireslessspace. Asstatedearlier,theultimategoalofRNA-seqistoquantifyexpressionlevels bycountingthenumberofsequencedreadsthatmapbacktoadefinedregion of interest, and report this in terms that are as closely proportional to the relative molar RNA concentrations as possible. To perform relative RNA abundanceestimation,atleasttwofactorsmustbeconsideredandcorrected for.First,thesequencingdepth,thatisthetotalnumberofreadsinasample thataremappedtoanyregionwithinthegenome.Thelongerthesequencing process continues, the more reads will be generated resulting in increased readcounts.Secondly,readcountsarebiasedbythelengthofthetranscripts. As longer transcripts produce more reads, they display an increased probability of having reads mapping back to them. One of the simplest and most widely used normalization strategies, introduced early in the RNA-seq eraistheFPKMunit,whichstandsfor“FragmentsperKilobaseoftranscript 12 permillionreadsmapped”.Ifyousequenceonemillionfragments,theFPKM valueforacertainfeatureistheexpectednumberoffragmentsidentifiedper thousand bases in that feature. Another popular normalization approach, suggested to eliminate bias inherited in the FPKM measure, is to report relativeRNAabundanceinTranscriptsPerMillion(TPM)63.TheTPMunitis similartoFPKMinthesensethatitalsonormalizesfortargetlengthandread depth. The major difference between these normalization methods is that TPMprovidesameasurementoftheproportionoftranscriptsassociatedtoa certainfeatureinthepoolofRNA.SinceTPMnormalizesforthedifferencein transcriptcompositionbysimplyprovidingafractionofthetotalexpression, itissuggestedtobebettersuitableforcomparisonacrosssamples,asthesum ofallTPMvalueswillbethesameacrosssamples64.FPKMvaluescaneasily beconvertedtoTPMthroughsimplecalculations,andviceversa65. FPKM and TPM values are widely used as relative measurements for RNA abundance. However, comparisons of FPKM or TPM values for the same transcriptorgeneacrossdifferentsamplesareproblematic.Thisstemsfrom thefactthatthereaddepth,whichboththesenormalizationmethodsaccount for, is dependent on the sample’s overall read composition. Imagine a transcriptXthatispresentatequalamountsintwosamplesAandB.Given thatallothertranscriptsareequallyabundant,ifanothertranscriptYistwice asabundantinsampleAcomparedtoB,sampleAwillcontainmorereadsin total.TheFPKMorTPMvaluefortranscriptXwillthereforebelessinsample A,comparedtosampleB. In 2013-2014, a new and fundamentally different concept for abundance estimation of sequence reads emerged. This concept challenged the widely acceptedideathatprecisereference-guidedalignments(atleastaspreviously interpreted) are necessary for quantification of RNA-seq reads. The concept of a read-alignment was now redefined to only include information about whichtargetsequenceareadoriginatesfrom,withoutfurtherspecifications ofexactpositiononthebaselevel.Thisnewconceptwasnamed“lightweight mapping” and first brought about in the program Sailfish that was further developedintoSalmon66.Similarideasonalignment-freequantificationhave also been explored by others, for example “quasi-mapping” used in RapMap and“pseudo-alignment"usedinKallisto67.InKallisto,claimedbyitscreators to be “near optimal in speed and accuracy”, a new method for rapid string matching was also introduced, by indexing the reference genome with a De Brujin Graph to which k-mers are matched. Several programs have been developed that perform counting of reads per feature, for example featureCounts and HTSeq-count68,69. Read count values generated by such programscanbeusedasinputinseveralprogramsforanalysisofdifferential expression,coveredinthenextsection. 13 Differentialgeneexpression RNA-seq experiments typically aim to reveal changes in gene expression, by measuring the relative transcriptional abundance across biologically interesting conditions. While normalized expression levels such as FPKM or TPMvaluesareintendedtoreportrelativeabundancewithinsamples,more sophisticatedstatisticalmethodsaregenerallyneededforcomparisonsacross samples. In such analyses, statistical testing is performed to reveal whether anobserveddifferenceinRNAabundanceissignificant,thatis,greaterthan onewouldexpectduetojustnaturaltechnicalandbiologicalvariation. Numerous tools have been developed for the analysis of differential expression. These tools typically make assumptions about the underlying distributionofreadcounts,whicharethenusedtomodelthedatainorderto test for differential expression. As RNA-seq read counts are positive integer values,thenormaldistributionisnotapplicable.ThePoissondistributionhas previouslybeenproposedasareasonablemodelforRNA-seqdata,however asasingleparameterdistributionwherethemeanisequaltothevariance,it has been shown to be too restrictive, predicting a smaller variation than is actually present due to biological variation. The negative binominal distribution, first used in the program edgeR, is the most widely used distribution today, as it allows for greater variability between biological replicatesthanthePoissondistributioncanaccountfor. InthecontemporaryfieldofRNA-seq,twodifferenttraditionshaveevolvedin parallel and no consensus seems to be reached regarding whether RNA abundanceestimationanddifferentialexpressionanalysisshouldbepursued atthelevelofisoformsoratthelevelofgenes.Ononesidearedevelopersand dedicatedusersofprogramssuchasedgeRandDESEq.Theseareprograms that operate on the gene level and use raw read counts as input. Assuming thattechnicalreplicatesareunnecessaryastheyrevealaPoissondistribution, they focus their algorithms on modeling the biological variability. The other schoolofthought,mainlyfrontedbytheprofessorandscientificbloggerLior Pachter, claims that abundance estimation and analysis of differential expressionrathershouldbeperformedattheresolutionofisoforms,realized inprogramssuchasCuffdiff2,Salmon,Sailfish,RSEM,Kallistoandsleuth70-74. Tobridgethegapbetweenthesetwoschools,aprogramcalledtximportwas recently released. As default, tximport imports abundance estimates, counts and feature lengths on transcript-level and outputs corresponding variables onthegenelevel75. In the work covered by this thesis, the tools DESeq and EdgeR have been used76-78. The output from these programs consists of a table with results fromthestatisticaltestingassignedtoeachgeneproductanalyzed.Thefinal stepistoapplyacutoffforthep-value,aftermultipletestingcorrections,to 14 obtain a list of genes or transcripts that are significantly differentially expressed.AcutoffforFoldChangeiscommonlyalsoapplied. Challengesandconsiderations The RNA-seq experiment is a multi-step process that involves several laboratory steps where bias can be introduced, such as extraction, fragmentation, amplification and sequencing. The fact that RNA-seq experimentsoftenarecarriedoutatspecializedfacilitiesservestominimize theseproblems.Withconsistenthandlingofsamplesthroughdevelopmentof Standard Operating Procedures (SOPs), data quality can be optimized. Sequencingfacilitiescommonlyconsulttheircustomersinchoosingthebest suitable methods and reagents, and make sure that samples are of proper quality and quantity before the sequencing experiment starts. As previously described, RNA is sensitive to degradation and for RNA-seq purposes, it is important to retain a clean working environment where the risk of degradationisminimized.WhenRNA-seqisemployedonmRNAderivedfrom millions of cells in a sample, the output data is limited to the estimation of mean behaviors. Cell-to-cell variation due to differences in cell cycle stage, differentiationoranyothervariationwillbeaveragedoutanditisimpossible to resolve the intra-population distribution of transcripts derived from a certain genomic region. To reach beyond this mean-based characterization, sequencing the transcriptome of single cells has gained more attention in recentyears79. To improve accuracy of results, replication is important. This enables estimation of the variability within conditions. In RNA-seq, technical replicatesareoftenclaimedtobeunnecessary,sincetheRNA-seqtechnology has proven to be robust enough to make technical variation negligible. Biologicalreplicatesaremoreprioritized.Theuseofatleastthreebiological replicates seems to be somewhat of a gold standard, for cell lines however two replicates are often enough, indicated by genome-wide correlations of over 0.99. When working with cell lines, an interesting question is what to consider a biological replicate. A common strategy is to regard early splits, twosubpopulationsfromthesameoriginalpopulation,togrowinparallelfor anumberofpopulationdoublings. As mentioned, there is an ongoing debate whether Differential Expression Analysis should be performed at the gene level or at the level of individual transcripts (splice isoforms). Performing such analyses on transcript level evokes a few questions regarding what scenario should be viewed as differential expression. Are two transcripts differentially expressed even though their corresponding gene shows a zero net difference between the comparedsamples?Aregenesdifferentiallyexpressedbetweensamplesifat 15 least one corresponding transcript is differentially expressed? To answer such questions requires some degree of biological insight and relate to a common issue in todays, more computer-dependent field of biology, where people are often forced into a role of being both biologists and computer scientistsatonce. With today’s use of high-throughput technologies, the greatest challenge of biologydoesnotlieinthegenerationofdata,butratherintheinterpretation ofthem.Inordertounderstandbiologicalprocessesbytheuseofadvanced data analysis, statistical principles must be well tested and evaluated to obtain the most robust and efficient algorithms possible. In the field of transcriptomics, an explosion in number of different strategies and tools characterizes the recent years. A great challenge for researchers trained in biologyistoguidethemselvesthroughthisjungleofdifferenttools.Whena new method is published, authors often claim superiority over previous methods, which can lead to somewhat contradictory information regarding which method to choose. With the lack of standard pipelines for analysis, developers are free to choose validation methods in favor of their own developedtools,ascenariodescribedasthe“self-assessmenttrap”80. RNA-seqisapowerfultoolfordeterminingwhatgenesareexpressedandto what extent across samples, information that can be used to predict and model cellular behavior. It is important though to understand that this techniqueonlycapturesanearlystepinalongsequenceofregulatoryevents that collectively dictate cellular function and phenotype. Many functions depend on proteins that are activated only in the presence of certain interaction partners, and PTMs and subcellular localization are also important factors to consider when studying cellular behavior. By studying proteins, we can confirm what has been seen on the transcriptome level, or provideadditional,moredetailed,informationaboutcellularfunction. 16 3.Studyingthehumanproteome Thefullsetofproteinsatacertaintimepointwithinadefinedsystem,suchas an organism, an organ, a cell type or a cellular compartment, is called proteome. The word “Proteome” has been around since the late 90’s, first used by the Australian student Marc Wilkins and heavily inspired by the terms“genome”and“genomics”81.Analysisoftheproteomecanbedrivenby either a protein-centric approach or a gene-centric approach82. Proteincentric studies are exploratory and aim is to detect all proteins that are present in a sample mixture, given the limitations of the technology being used. In proteome analyses with a gene-centric approach, genomic annotationsareusedtoguidetheexplorationofproteins. Alongside the decline in number of predicted protein-coding genes seen in recent years, knowledge about the structural diversity of the human proteome continues to strengthen. The complexity of the human proteomic landscape extends well beyond just the number of distinct corresponding genes and rather manifests itself through tremendous protein diversity. Genetic variation, somatic recombination, alternative mRNA splicing and PTMs are important factors that contribute to the presence of protein isoforms. Rapid technology development have allowed for systematic exploration of the human proteome in various diverse systems, such as different cell lines and tissues. One interesting finding from these studies is that the collection of detected proteome constituents is surprisingly similar across diverse biological systems83-88. The ubiquitous protein expression observed in these studies point to the importance of abundance and organization, and not only the absence or presence of proteins, to dictate cellular identity and function. In the past, measurements of protein abundance have primarily relied on the use of affinity reagents, such as quantitative western blot and Enzyme-linked immunosorbent assay (ELISA)89,90. In the 1970’s, two-dimensional gel electrophoresis was developed and provided new opportunities to study up to thousands of proteins in a single experiment, preferably proteins of relatively high abundance91. In the late 1990’s, technological advances in the field of mass spectrometry(MS)ledtoafundamentalchangeinexperimentalpractice,and since then quantitative proteome analyses are mainly dominated by MSbasedapproaches. Eventhoughthefieldofproteomicsisstilllaggingbehindtranscriptomicsin termsofcoverage,large-scalesystematicexplorationofthehumanproteome istodaypossible,detectingthousandsofproteinsinasinglerun.In2014,two drafts of the human proteome were published, based on MS experiments in combinationwithcrowdsourcingofdata.ThetwostudiesreportedMS-based evidence for 84% and 92% of the annotated proteome92,93. Following these twopublicationswasadebatethatledtoreanalysisoftheresultsusingother 17 principles for false discovery rate estimation, generating a somewhat lower coverage94-96.In2015,thefirstdraftofthehumantissue-basedproteomewas published,usingantibodiestargeting90%oftheprotein-codinggenome97. MS-basedproteomics MSenablesaccurateandhighthroughputidentificationandquantificationof proteinsfromcomplexmixtures.TheroleofMSwithinthefieldofproteomics began in the late 1980’s when two methods were developed that made it possible to ionize proteins and peptides under conditions that were mild enough to keep the molecules intact, Electrospray Ionization (ESI) and Matrix-Assisted Laser Desorption Ionization (MALDI)98,99. The success of these two ionization methods pave the way for a rapid technology developmentandtheproteomicsfieldhaseversincebeenheavilydominated bytechnologiesthatarebasedonvariousMSprinciplesandapplications. In simple terms, the technique is based on ionization of protein fragments that are directed into a high vacuum system called a mass analyzer, where theybecomeseparatedaccordingtotheirmass-to-chargeratio(m/z)before detection. This generates a mass spectrum in which relative ion intensity is plottedagainstm/z.MSexperimentsaretypicallyperformedwithabottomupapproach(alsoknownasshotgunproteomics)involvingsequence-specific proteolyticdigestionofproteinsintomultiplepeptidespriortotheanalysis. The analysis of peptides instead of intact proteins is advantageous since peptidesareeasiertoionizecomparedtofull-lengthproteins.Italsoenables ionization of insoluble proteins that would otherwise be challenging to analyzewithMS,andidentificationofproteinswithmodificationsthatwould makeidentificationbyknownmassdifficult100.Followingdigestion,peptides areusuallyseparatedbasedonchemicalcharacteristicsonachromatographic device directly coupled to the MS instrumentation. In this way, peptide ions can be produced within a defined range at a time, which increases the sensitivity.ThestandardMSinstrumentationcanbedividedintothreeparts, anionizationsource,amassanalyzerandadetector101.Severaldifferentmass analyzersareusedtoday,suchasquadrupole,TOFs,orbitrapandquadrupole iontraps100. A common approach is to combine mass spectrometers into a tandem system equipped with a mass filter102. In such a setup, peptide ions that enter the mass spectrometer together are first separated in one of the mass analyzers (MS1) to generate a precursor ion spectrum that represents differentionizedpeptides.Afewindividualpeptideionsarethenselectedto undergo further fragmentation by collision with an inert gas, such as argon, heliumornitrogen.Inasecondmassanalyzer,thefragmentions(oftencalled productions)areseparatedaccordingtotheirmass(MS2),toyieldfragment ion spectra (Figure 5). This procedure will be repeated throughout the chromatographic run in a defined set of cycles100. Today, identification of peptides is usually performed at the MS2 level, by search algorithms that 18 comparepeakintensitiesintheexperimentaldatatotheoreticalspectrathat are found in databases such as SEQUEST, Andromeda, Mascot and X!Tandem103. By combining information about the precursor peptide mass (revealed in MS1) with information from the matched fragment ion spectra generatedinMS2,identificationofpeptidescanbeachieved. MS1 Spectra MS2 Spectra Sample Fragmentation MS1 MS2 Figure 5. Overview of a typical MS/MS setup. The workflow includes ionization, peptideionseparation,fragmentation,fragmentionseparationanddetection. Quantitativemassspectrometry As the signal intensities of peptide ions are not directly proportional to abundance, MS analysis is not inherently quantitative. Proteomic research commonlystrivestoreachquantitativeinformationaboutproteinexpression andseveralstrategiesforquantitativeMShaveemergedthroughoutthelast decade. Quantitative MS can be performed either in a relative manner, by comparingproteinabundancebetweentwoormoresamples;orinabsolute mannertoyieldabsoluteproteincopynumberspercellorsample,commonly achieved by the addition of an internal standard of known concentration. Both relative and absolute quantification can also be performed with labelfree strategies, by analysis of precursor signal intensities or by spectral counting ApopularstrategyinquantitativeMSislabelingofproteinsorpeptideswith heavy isotopes. This produces isotopologues that are identical molecules except for differences in isotope composition. Due to similar chemical properties,theyco-eluteduringliquidchromatography(LC)andentertheMS instrumentation together, but due to different isotope composition they will bedifferentiatedintheMSanalysis,identifiedviaamassshiftinthespectra. To minimize bias that can be introduced at several steps during the 19 subsequent MS experiment, early mixing of heavy and light samples is desirable. A number of different strategies for protein quantification using differentlabelingstrategieshavebeendeveloped104. Apopularmethodforrelativequantificationofproteinsismetaboliclabeling in vivo with Stable Isotope Labeling of Cells (SILAC)105. In SILAC, heavy arginine and lysine are added to cell cultivation medium and metabolically incorporatedintothecellsproteomebythecellsownmachineryforprotein synthesis. Subsequently, heavy and light cell lysates are mixed followed by trypsin digestion and LC-MS/MS analysis. During trypsin digestion, proteins are cleaved at the C-terminus of Arginine and Lysine, hence making all cleaved peptides having at least one heavy amino acid. As heavy and light samples are combined before sample preparation begins, the level of laboratorybiasinSILACisrelativelylow. In contrast to cell lines that are easy to cultivate for efficient metabolic labeling in vivo, many biological samples require alternative methods for isotope labeling, for example when working with clinical blood samples or tissue samples. In such cases, chemical or enzymatic labeling with isotopic atomsorisotope-codedtagscanbeused.Examplesoflabelingwithisotopic atoms include enzymatic labeling with heavy oxygen during proteolytic digestion with trypsin and labeling of primary amino groups in digested peptideswithdeuterium.Onemajorbottleneckwithmethodsthatlabelwith deuterium is that it produces a small shift in over-all chemical properties of the peptides so that co-elusion during chromatography is not longer achieved106,107. Stable isotope dimethylation with formaldehyde in heavy water however keeps the molecules chemically intact and is better suitable forchromatographicpurposes108. Labeling with isotope-coded tags is a commonly used approach in quantitative proteomics. One of the earliest methods based on isotopic labeling for protein quantification, developed already in the late 90s, is Isotope Coded Affinity Tags (ICAT)109. In this method, free thiol groups on cysteines are labeled with ICATs prior to digestion. The ICAT tags are comprisedofasulfhydryl-reactivecrosslinkinggroup,aheavyorlightlinker and a biotin molecule. The method takes advantage of the high affinity between biotin and avidin/streptavidin, and enables purification of labeled peptidesbyimmobilizationoveravidinorstreptavidinbeforeheavyandlight samplesarecombinedandanalyzed. Today, isobaric tags are also popular for labeling of peptides after protein digestion.Isobarictagsarecomposedofamassreporter(theactualtag)anda balancingmassnormalizer.Sinceisobarictagshavebothidenticalnetmasses and chemical properties, isotopologues can co-elute together. During fragmentation, the tags are cleaved and their reporter ions are used to distinguish them from each other, based on differences in mass that enable 20 relativeintensitiestobeanalyzedbetweensamplesinoneMS/MSspectrum. Labelingwithisobarictagsistypicallyperformedatprimaryamineresidues. Severaldifferenttypesofisobarictagsapplicableformultiplexedquantitative analyses are available, Tandem Mass Tags (TMT) and Isobaric Tags for RelativeandAbsoluteQuantification(iTRAQ)arepopular110,111 In contrast to the above-mentioned methods for relative quantification of proteins, absolute quantification aims to reveal absolute concentrations of proteins across samples. This is most commonly achieved by spiking in known concentrations of heavy-labeled standards before protein digestion, followed by comparison of achieved signal intensities.112 Popular standards include Absolute Quantification (AQUA) peptides and QconCAT octamers113,114 Full-length protein standards can also be spiked-in, which generateshighlyaccuratequantificationasthespikedinproteinwillgenerate identicalpeptidesupontrypsindigestionasitsendogenouscounterpart.115-117 A major drawback with the use of full-length protein standards is that they are challenging to produce. Recently, a strategy for absolute quantification was established, using heavy labeled protein fragments of 25-150 amino acids,calledQPrESTs118.ThismethodisdescribedinmoredetailinpaperII. Targetedmassspectrometry In discovery proteomics, where the aim is to detect and quantify as many proteinsaspossible,giventhecapacityoftheinstrumentation,theanalysisis oftenperformedbymeansofDataDependentAcquisition(DDA).Thismeans that the instrument is programmed to choose ions of highest intensity for furtherfragmentationintheMS2stage.Intargetedproteomicsstudies,where theaimistoobtainresultsforadefinedsubsetofproteins,precursorionsare instead chosen beforehand, independent on the data that is generated throughout the experiment. This targeted, Data Independent Acquisition (DIA) approach, enables reproducible and sensitive results as proteins irrelevantinthestudy,thatcouldpotentiallymaskproteinsofinterest,canbe filtered out from the analysis119,120. A popular method used in targeted proteomics is Selected Reaction Monitoring (SRM, also termed Multiple ReactionMonitoring(MRM))121,122.InSRM,atriplequadrupoleinstrumentis used consisting of three quadrupole mass analyzers and two stages of mass filtering. A precursor ion of interest is preselected in the first quadrupole following fragmentation by collision with gas in the second quadrupole. InsteadofobtainingafullMS/MSscan(asinconventionallytandemMS),only asmallnumberofsequence-specificfragmentionsareanalyzedinthethird quadrupoleandpassedonfordetection.Theprecursor/productionpairsare calledtransitions.Oncetheprocedureisestablishedandoptimized,multiple samples can be analyzed over a relative short period of time, generating highly sensitive and reproducible results. The SRM technology has gained 21 wide popularity within the field of drug discovery, as it allows for unbiased quantification of low-abundant proteins across large sample cohorts. An alternative method to SRM is Parallel Reaction Monitoring (PRM), which enables multiple peptides to be quantified in parallel by the use of hybrid Quadrupole-Orbitrap instruments that combines precursor ion selection by the quadrupole with high resolution accurate (HRAM) detection of all fragmentionsbytheOrbitrap123. Affinity-basedproteomics Affinity is a measure of binding strength between molecules. In proteomics, several methods make use of affinity reagents that can selectively bind proteinsinamixtureandtherebyenabledetectionandquantification. The most commonly used affinity reagents are antibodies. Antibodies are producedbyBcellsintheimmunesystemandarecharacterizedbyastriking diversityasaresultofgeneticrearrangementsduringBcelldevelopment,in combination with affinity maturation after binding of target antigen. Antibodies as research reagents were first used in the 1960s when Rosalyn Yalow and Solomon Berson developed the first radioimmunoassay (RIA)124. In this method, antibodies were used to detect and quantify molecules by competitive binding between isotope-labeled and unlabeled target. In the 1970s, Peter Perlmann and Eva Engvall developed the enzyme-linked immunosorbentassay(ELISA)whereenzyme-coupledantibodiesareusedto detectproteins90. Immunofluorescence Immunofluorescence (IF) technology relies on the binding between fluorescently labeled antibodies and their target proteins. When combined withahigh-resolutionimagingtechnique,IFenablestheidentificationandin situ visualization of selected targets. IF can be applied in proteomics, to provide a window to the physiology of cells and tissues by enabling researchers to reveal the spatial distribution of endogenous protein expression. The spatial information provided by IF analysis is an important advantage that differentiates this method from complementary technologies such as MS-based applications. As the detection is based on local protein concentrationsratherthanthebulkabundance,ahighsensitivityisachieved. Forexample,low-abundantproteinsthatarelocalizedtosmallorganelleslike thecentrosomecanbedetected.IFismostoftenusedforqualitativeanalysis of protein expression, but it can also be used for quantitative purposes, by comparison of relative intensities or more advanced techniques such as Förster resonance energy transfer (FRET) and fluorescence recovery after 22 photobleaching (FRAP)125,126. IF experiments can be performed either in a direct or indirect manner. In direct IF, a single antibody is used, that is conjugated to a fluorophore for detection. In indirect IF, two antibodies are used,oneprimaryantibodyintendedtotargettheproteinofinterestandone secondarylabeledantibodythattargetstheprimaryantibodyandisusedfor detection.ThefollowingtextconsidersonlyindirectIFexperimentsusingin vitrocultivatedcells. An IF experiment starts with fixation and permeabilization of the cells containingthetargetprotein.Fixationisperformedtoretaintheintracellular architecture during permeabilization, which is required for antibodies to enter the cell and reach their targets. A commonly used approach is to first cross-link the proteins by adding paraformaldehyde (PFA) that reacts with nitrogen molecules on the proteins to form bridges that retain molecules in place. Following cross-linking, the membrane is permeabilized, aided by saponinsornon-ionicdetergentssuchasTween20orTritonX-100.Fixation and permeabilization can also be performed simultaneously by the use of organic solvents such as acetone, ethanol or methanol. Sample preparation including fixation and permeabilization requires careful optimization of experimental protocols, customized for the protein targets of interest. Chemical conditions must be mild enough to keep the internal organization intact,andstrongenoughtoprovideaccesstoallintracellularcompartments. Well-optimized fixation and permeabilization is important to minimize the risk for artifacts. For example, dehydration with alcohols works well for analysisofthecytoskeletonbutislesssuitableforsolubleproteins.Toolong incubation with PFA can lead to blocking of antibody epitopes and mild detergents such as saponins are less effective when studying mitochondrial proteins since the inner mitochondrial membrane contains relatively little cholesterol127. When cells are fixed and permeabilized, the sample is incubated with the primary antibody followed by washing and incubation withthesecondaryantibody.Toenablesubsequentlocalizationoftheprotein in relation to the inner architecture of the cell, co-localization is commonly performed,thatisthesimultaneoususeofadditionalaffinityreagentsordyes that are added together with the primary antibody. For example antibodies targetingthecytoskeletonofthecellsandsmallchemicalssuchasDAPIthat bindtotheDNAinthenucleusandenablesvisualizationofthecellnucleiasa reference.FollowingIFstaining,targetedproteinsarevisualizedbydetecting antibodies through their conjugation to a fluorophore. In fluorescence microscopy the sample is illuminated with a certain wavelength (excitation) followed by emission of light with a longer wavelength. The difference in wavelength between emitted and absorbed light is known as Stokes shift, a phenomenon that is fundamental to the sensitivity and low background of IF128. By filtering, only the emitted light is allowed to reach the detector. In confocal laser scanning microscopy (CLSM) higher resolution is obtained by illuminatingthesamplewithahighpowerlaserbeamofdistinctwavelength and the emitted light beam path is gated so that only the emission from the 23 focal plane reaches the detector. When using multiple fluorophores it is important to minimize spectral overlap, which can be achieved by narrow bandpassfiltersandsequentialscans. UsingRNAlevelsasanindirectmeasureofproteinexpression The cellular abundances of mRNA and protein depend on the fundamental processesoftranscription,mRNAstorage,translation,mRNAdegradationand proteindegradation. MethodsformonitoringtherateoftranscriptionincludelabelingofRNAwith 4-thiouridine(4sU)priortomicroarrayorRNA-seqanalysis,therebyenabling the user to distinguish newly transcribed RNA from the pre-existing population129.Tominimizebiasduetodegradationduringlabeling,thetime of the 4sU labeling can be kept as short as 10 minutes. In addition to 4sU labeling,alternativemethodslikeRATE-seqandSnapShot-seqwhichispurely computational, are also available130,131. After transcription, mRNA molecules are processed before they exit the nucleus to enter the cytosol where translation occurs. mRNA transcripts that display equal levels of expression can generate different amounts of protein products due to differences in translation rates and protein degradation. Translation rates are partly influencedbythemRNAsequencethatdictatescodoncomposition,upstream OpenReadingFrames(ORFs)andribosomeentrysites(IRES)132.Inaddition, translation rates can be modulated by the availability of ribosomes and the binding of regulatory proteins and non-coding RNA molecules to the mRNA transcripts133.Thepost-transcriptionalprocessingandtransportofmRNAto thecytosolarecontrolsthatleadtoadelayinproteintranslation.Changesin transcriptabundancewillthereforeinfluenceproteinabundancealwayswith a certain degree of temporal delay. Several studies have showed that the temporaldelayinproteinsynthesisisprotein-specific,andthatmorecareful considerationofthisdelaywouldincreasethecorrelationbetweenmRNAand protein abundances134. Protein kinetics, that is protein synthesis and turnover, is typically monitored using puls-labeling of newly synthesized proteins with heavy isotopes or ribosome- or polysome profiling135,136. In ribosomeprofiling,ribosome-protectedfragmentsareisolatedindependently ofwhetherthemRNAisactivelytranscribedornot,followedbyRNA-seqto revealinformationaboutthetranscriptlocationsoccupiedbyribosomes137.In polysomeprofiling,mRNAmoleculesareseparatedbythenumberofbound ribosomes, which enables analysis of transcripts that are considered “efficiently translated”. A commonly used cutoff for the identification of “heavypolysomes”istheoccupancyofatleastthreeribosomes138. 24 Even though the existence of a general correlation between mRNA and protein levels is still being debated, several studies from the last decade conclude that the level of mRNA variation largely explains steady-state differencesbetweencorrespondingproteinlevels134,139,140.Aslivingsystems, cells are continuously exposed to changes in their molecular compositions, mainlyduetodynamicprocessessuchasproliferationanddifferentiation.In moststudies,steadystateconditionsaredefinedascircumstanceswherethe averagemRNAandproteinlevelsremainmoreorlessconstantoverseveral ours. Challengesandconsiderations Thecellularproteomeishighlycomplex.IsoformsandPTMscontributetoa huge diversity of proteoforms and on top of that is a tremendous dynamic rangeofproteinconcentrations.Inquantitativestudies,thedynamicrangeof proteinexpressionischallenging,especiallyinsamplesderivedfromvarious body fluids such as plasma and cerebrospinal fluid (CSF) where proteins display concentrations that can vary up to more than ten orders of magnitude141. In comparison to transcriptomics studies where PCR enables amplification prior to sequencing, proteins are measured in their original concentrations. To maximize sequence coverage across the full range of concentrations, various methods are employed to reduce sample complexity, either by enrichmentofinterestingtargetsordepletionofhigh-abundantproteins.This is commonly achieved by gel electrophoresis, liquid chromatography or the useofaffinityreagents.InIF,animportantadvantageistheabilitytodetect localconcentrationsduetothespatialresolution.Proteinsthatdisplayahigh localconcentrationatsmallerdefinedareas,suchasthecentrosome,canbe identifiedeventhoughtheirrelativecellularconcentrationisverylow. The major challenge in affinity-based proteomics is to ensure quality of the affinity reagents that are used. Antibodies are among the most commonly used research reagents today, but they suffer extensively from crossreactivity and therefore require careful consideration regarding specificity, that is the ability to discriminate between molecules that compete for the samebindingsite.Toestablishastandardizedschemeforantibodyvalidation, a committee of international scientists from various fields was formed, The International Working Group for Antibody Validation (IWGAV). Recently, IWGAV presented a proposal of a set of standard guidelines for antibody validation, built upon five pillars: Genetic strategies, Orthogonal strategies, Independent Antibody Strategies, Expression of tagged proteins and Immunocapture followed by mass spectrometry142. In antibody-based proteomics,challengesremaintoreachthesamelevelofmultiplexingascan be obtained in MS-based approaches. Simultaneous detection of more than 25 one target protein using multiple antibodies requires the availability of secondary antibodies that are specific to antibodies from only one type of host species. In addition, fluorophores that are conjugated to secondary antibodiesmustbewellresolvedtominimizespectralbleed-throughthatcan lead to false detection of co-localization. A typical limit for robust IF experimentsin96-and384-wellplatesisthreetofourchannels,butseveral attemptshavebeencarriedouttoincreasethemultiplicityofIF,forexample infra-red shifted fluorophores, quantum dots, bar coding and cyclic immunofluorescence (cycIF)143. In mass cytometry (cyTOF), antibodies are conjugated to rare metals, enabling over 30 different antibodies to be used simultaneouslywithoutproblemswithspectraloverlap144. 26 4.Dataanalysisandfunctionalinterpretation Publicdataresources Accumulationofinformationisacceleratingfast.Thescaleofbiologicaldata produced in the world’s labs is now well beyond petabyte (PB) and even Exabyte (EB)145,146. Open-access of interpretable results contributes immenselytotheabilitytousethislargeamountofinformationinmeaningful waysandtakefulladvantageofit.Severalresourcesareavailableproviding userswithbothrawdataandpre-computedgeneexpressionlevels. Transcriptomicsdatabases The amount of data generated in transcriptomics studies has grown into an explosion of information that continues to grow. Today, many scientific journals require transcriptomics data to be uploaded into public data repositories before acceptance for publication. This enables open-access of large amounts of transcriptomics data that enhance reproducibility and maximizes usefulness by enabling meta-analyses and the ability to use already existing data in new scientific contexts. The Gene Expression Omnibus(GEO),hostedbytheNationalCenterforBiotechnologyInformation (NCBI), is a public data repository for both array- and sequence-based transcriptomics data submitted by the research community147. For RNA-seq data, processed data files can be uploaded to GEO, and raw sequence files submittedtoNCBIsSequenceReadArchive(SRA)database.ArrayExpressand the Gene Expression Atlas, both hosted by the European Bioinformatics Institute(EBI)areequivalentEuropeaninitiatives,alsoallowingresearchers tosubmitandbrowsedata148,149. Inadditiontodatabasesthatcollectsequencingdatafromthewholeresearch community, several initiatives have also been launched, aiming to provide comprehensive maps of the human transcriptome across various cells and tissues, including for example The Genotype-Tissue Expression (GTEx) project, the Human Protein Atlas (HPA) and Fantom5150-152. The GTEx consortium is based at the Broad Institute and supported by the National Institute of Health (NIH). The GTEx project was initiated with the aim to providecomprehensivedatasetsthatcanbeusedtostudygeneexpressionin relation to genetic variation. At the GTEx web portal, users can query gene expression profiles for genotyped tissue samples derived from over 1,600 post-mortem samples, covering 43 different tissue types. The HPA project also encompasses large amounts of RNA-seq data, integrated in the imagebased protein database, covering 32 different tissue types derived from sampling of 95 individuals. The HPA RNA-seq data covers both internally generatedRNA-seqdata,aswellasimporteddatafromtheGTExconsortium. 27 TheFantom5(FunctionalAnalysisofMammalianGenomes5)consortiumisa large project based at RIKEN in Japan, mainly focused on mapping the transcriptional promoter landscape, which is important for transcriptional regulation by transcription factors and enhancers. Data generated by the FANTOM5 project relies on the in-house developed Cap Analysis of Gene Expression (CAGE) technology. CAGE is based on the capture of capped 5´start sites of mRNA molecules that are further sequenced with the Heliscope sequencer without the need for PCR amplification. The FANTOM5 project provides RNA expression profiles, reported as TPM values for 975 human samples including cancer cell lines, primary cells and various tissues153. Several studies have been performed with the aim to investigate the over-all consistency among public sources of RNA-seq data. The results from these studies show that there is a high genome-wide correlation betweendatafromsimilartissuesgeneratedatdifferentlabsusingdifferent technologies154-156. Proteomicsdatabases In the mid 1990s, several computer-driven and EST-based studies were published aiming to estimate the number of human protein-coding genes, resulting in estimates between 50,000-100,000 genes157,158. In the following years, the estimates were further condensed, and when the human genome sequence was first announced in 2001, the estimated number of protein coding genes was smaller than ever expected, revealing 30,000-40,000 accordingtothepublicconsortiumand38,588(26,588withstrongevidence plus an additional 12,000) according to the competing private consortium159,160.Throughtherecentyears,ashrinkingsetofprotein-coding geneshasbeenacontinuouslydevelopingtrend.In2012,theEncyclopediaof DNA Elements (ENCODE) project, aiming to catalogue all functional parts of the genome, identified 20,687 protein-coding genes and according to the latest Ensembl annotation update, there are 20,441 protein-coding human genestodate161. The UniProtKB/Swissprot is a manually curated database of protein sequences and functional information. The latest release comprises 20,197 human proteins, of which 14,876 have been experimentally shown to exist1. The annotation of the human proteome in UniProtKB is an ongoing project, continuouslystrivingtoimprovequalityofthedata.Currently,589proteins aremarkedasuncertainduetolackofevidencefortheirexistence,morethan half of these are predicted to be translation products from putative pseudo genes.Theirexistenceisthereforeamatterofdebateandexplainspartlythe lack of complete overlap between the databases162. A commonly used databaseforproteomicsdatageneratedwithMSmethodsisthePRoteomics 1basedonasearch2016-09-12 28 IDEntifications (PRIDE) database that is a public repository for proteomics dataincludingproteinpeptideidentifications,PTMsandspectralevidence163. TheHumanProteinAtlas HPA is a research program aimed to systematically characterize the human proteome in cell lines and tissues of both normal and cancerous origin164. Since the launch of the project in 2003, the HPA continues its exploration through the proteomic landscape, by continuously gathering data into a publicly available knowledge database. The HPA project operates from a gene-centricpointofviewandreliesmainlyontheuseofin-houseproduced antibodies that are used in immunohistochemistry (IHC) and IF applications97,165. Organized into a Cell Atlas, a Pathology Atlas and a Tissue Atlas,proteinexpressionispresentedinhigh-resolutionimagesaccompanied with transcriptomics data derived both from internal RNA-seq experiments andexternaldatafromtheGTExconsortium.Asantibodyreagentscarrythe risk of unspecific interactions with multiple targets, their usage requires carefulvalidationofspecificityandperformanceacrossthedifferentassaysin whichtheyareused.AllantibodiesthatareusedintheHPAareextensively validatedwiththeresultsintegratedontheatlas.MajormilestonesintheHPA project include the release of the Tissue Atlas in 2015, with the major conclusion that protein expression is ubiquitous over diverse tissue types, with only a few genes showing tissue-specific expression, specifically enriched in brain and testis97. In the end of 2016, the Cell Atlas will be released, containing subcellular localization data for > 12,000 human proteins166. GeneOntology To gain meaningful biological insight from large-scale biological data, typically from a list of transcripts, genes or proteins of potential interest, functionalannotationsareuseful.Severaldatabasesareavailablethatcollect functional annotations for genes and gene products, and provide interactive web-basedsearchfunctions.Thesegenerallycontaininformationoncellular localization, biological functions, and pathway involvement and interaction partners. TheGeneOntology(GO)isalargecollaborativeprojectthatwasinitiatedin 1998 with the aim to develop a controlled vocabulary for biological characteristics of gene products167. In addition, GO also provides several publiclyavailabletoolsforaccessandinterpretationoftheontologydata.GO usesbothmanualandautomaticapproachestocollectannotationsfromthe literatureandtranslateitintoGOtermsthatarefurtherorganizedintoatree 29 structure. GO terms are accompanied by a description, a digital identifier (accession) and an evidence code that denotes the type of evidence upon whichtheannotationisbased.TheGOdataisseparatedintothreeunrelated domains; Cellular Compartment, Biological Process and Molecular Function. Each of these domains is organized into a semi-hierarchical system of GO terms that are connected by edges that represent their relationship. The systemstartsfromtherootdomain,andcontinuesdownwardsthroughmore specialized child terms. However there is no strict hierarchy and GO terms canhavemultipleparentsaswellasmultipledescendants.EachGOtermcan beassociatedwithanynumberofgeneproducts,andeachgeneproductcan be associated with any number of GO terms. Since related GO terms that appear at different depths in the hierarchy often share associated gene products, analyses based on GO terms often suffers from redundancy. To overcome this problem, several attempts have been made to improve GO analysis, for example by considering the correlations among terms in the hierarchy and trimming strategies to reduce redundancy among GO terms168,169. Functionalenrichmentanalysis For proper interpretation of gene lists derived from RNA-seq experiments, annotations for individual gene products are usually not sufficient per se. What has become almost a standard routine among researchers with large genelistsintheirhandsistousestatisticalmethodstofindannotationsthat areover-representedintheirdatasets.Thebasisforsuchanalysesistolook for the proportion of genes associated with a certain annotation within the dataset and compare this to the proportion of genes associated with that annotation in the background dataset, for example the full set of proteincodinggenesinthegenome.Numeroustoolsareavailableforsuchanalyses, both as packages suitable for statistical programming environments, as well as web-based tools, for example TopGO, enrichGO, Gorilla, the Database for Annotation, Visualization and Integrated Discovery (DAVID) and FunRich170174.Functionalenrichmentanalysescanbebasedonanytypeofannotation,a popularapproachistomakeuseoftheGOsystemofclassifications. 30 Challengesandconsiderations The direct access to pre-processed gene expression values at several web portalsisconvenientinmanyaspects.Forexample,researchersthatarenot performing RNA-seq experiments themselves can access expression levels withoutgoingthroughtheextensivedataprocessingthatdownloadingofraw datawouldrequire.Italsoenablesresearcherstoqueryonlyafewgenesof interest, in a tissue-specific context, instead of having to deal with large genome-wide datasets. However, the usage of these expression levels, often reported in terms of FPKM or TPM values should be considered with some caution. In theory, these values are not comparable over different samples, due to inherent sample dependence (as discussed in chapter 2), and more sophisticated statistical methods are required to establish whether a gene/transcript is significantly differentially expressed. Due to the comprehensive collections of samples in the databases described above, it wouldbepracticallyinconvenienttoperformpairwiseanalysesofdifferential expressionacrossthousandsofsamples.Itiseasiertosimplypresentthedata in terms of abundance levels. Even though comparisons of RNA data needs statisticsintheory,directcomparisonsofFPKMorTPMvalueshaveproven toworkrelativelywellinpractice.Forusersthatarenotfamiliarenoughwith RNA-seq data to take this into consideration, comparison of reported expressionlevelscanhoweverbemisleading.SmalldifferencesinFPKMand TPM values bear the risk of being overrated, leading to downstream bias in theinterpretations. Asbiologicaldatacontinuestoexpand,notonlyintermsofquantitybutalso in terms of various new data sources, achieving a standardized annotation system is emerging as a major challenge. Functional annotations should, at best,enabletheiruserstobothcontributeandextractinformationinarobust and feasible manner. To achieve a uniform system with consistent formats and languages, large-scale adoption is required by the whole research community.GOwasinitiatedwiththeaimtoestablishastandardizedsystem, and is widely used today. However, the GO system displays a lot of redundancy by allowing GO terms to have multiple parents and child terms with overlapping genes linked to them. An additional drawback with GO is that the numerical portion of GO terms has no inherent meaning related to the hierarchy, which makes them less useful from a computational point of view.TheIDsofGOtermsareassignedinnumericalorderbytheeditingtool, therebyonlyindicatingtheorderinwhichtheywerecreated.Overthepast, the global research community has collectively focused attention on a relatively small fraction of the proteome. A recent bibliometric analysis revealed that this fraction has remained relatively stable over the past 20 years175. This skewed focus is considerable in the context of functional analyses. As well-studied proteins are richer in annotations, they are more likelytoappearinqueryresultsthusaccumulatinginfutureanalyses. 31 Thedevelopmentofhigh-throughputtechnologieshasreshapedbiologyinto aquantitativeandcomputationallyintenseareaofresearch.Researcherscan now perform large-scale and quantitative analyses of whole genomes and transcriptomes,andalargepartofproteomes.Thesetechnologicaladvances have led to a promising starting point for modeling the cell and its inner workings. In order to gain complete understanding of the complex circuitry underlying cell identity and diversity, understanding of multi-level gene expression is important, which requires integration of heterogeneous data sources. In addition to that, spatial resolution of expression is crucial in the overallunderstandingandcharacterizationofthehumanproteome. 32 5.Aimsofthethesis While the previous chapters are dedicated to the surrounding field of researchthatIamactingwithin,thefollowingtwochaptersaimtodiscussthe scientific contribution of myself and co-workers over the past years. This contributionisgatheredinthefollowingcollectionofpapers: I The aim of this study was to explore the consistency among a set of RNA-seq datasets generated at different laboratories using different strategies. By comparison of mRNA expression levels for ostensibly similar tissues we showed that publicly available RNA-seq data is relatively consistent, requiring only light processing for samples to clusteraccordingtotissuetyperatherthancorrespondinglaboratory. II In this study we investigated the quantitative relationship between mRNA and protein levels under steady-state conditions in various humancelltypesandtissues.BycombiningRNA-seqtechnologywith an in-house developed approach for targeted mass spectrometry, we showedthatthereisagene-specificconversionfactorbetweenmRNA andproteinlevelsthatcanbeusedtopredictproteinabundanceina steady-statecontext. III TheaimofthisstudywastouseRNA-seqtoguideanalysisofprotein expression in a four-step cell model for malignant transformation, calledtheBJmodel.RNA-seqfollowedbyproteinprofilingwasapplied on this model, enabling us to scrutinize changes in gene expression acrossdefinedstepsoftumorigenesis.Weconcludedthatthemajority of differentially expressed genes were downregulated within the model, and the combined transcriptomics and protein profiling approach enabled the identification of several proteins with interesting expression profiles that were further validated in patient tumorsamples. IV Tumor tissues often contain local areas deprived of oxygen where malignant cells are still able to sustain rapid proliferation. In this study, we used a transcriptomics approach to explore differential responses to decreased oxygen in the four stages of the BJ model separately. The results show that even though the cell lines in this modelarehighlysimilar,theyrespondstrikinglydifferenttomoderate hypoxia.TheintroductionofSV40Large-Tintroducesmajormolecular changes with impact on the cells ability to adapt to lowered oxygen concentration. This shows the potential of the BJ model to also undergo studies of perturbed scenarios in relation to malignant transformation. 33 6.Presentinvestigation While the work covered in this thesis is concerned with multiple biological questions,thereisoneunifyingaspect;assessingtheuseofdataderivedfrom sequencingofmRNAofhumancellsandtissues.ThedynamicnatureofmRNA makes RNA-seq an attractive method to generate a snapshot of the transcriptional landscape at a particular moment of interest. RNA-seq has becomewidelypopularandnewdatasetsarecontinuallybeingproducedand uploadedinrepositoriesforpublicuse.Thepublicavailabilityopensupnew windowsforanalysisoftranscriptomes,enablingresearcherstoexpandtheir databyincorporationofexternallygenerateddatasetintotheirownprojects. This has great potential in adding power to analyses, but requires careful considerationregardingreproducibilityofresults.Toassesstheusefulnessof data generated in different labs, using different equipment, strategies and data processing pipelines, we analyzed the consistency of a number of publiclyavailableRNA-seqdatasets(paperI). Asassertedbythecentraldogmaofmolecularbiology,mRNAisamessenger carrying information that is ultimately used for proteins to be synthesized. TheincreasingnumberofstudiesthatuseRNA-seqtocharacterizebiological functions commonly assume that differences in mRNA abundance would be reflected on the protein level where the “real” action takes place. To investigatethedegreeofcorrelationbetweenmRNAandproteinabundance, wecombinedRNA-seqanalysiswithatargetedmassspectrometryapproach for absolute protein quantification (paper II). Here, we observed a genespecific relation between mRNA and protein levels, conserved across the panel of cell lines and tissues included in the study. The results from this study indicate great promise for transcriptomics to guide and focus proteomicsstudies. One example of a study where information derived by RNA-seq was successfullyusedtoguideproteinanalysisispresentedinpaperIII.Thefield of cancer research constitutes one of the main branches in contemporary molecular biology, but the molecular mechanisms underlying tumor developmentarestillfarfromcompletelyunderstood.Intheworkpresented inpaperIII,weperformedthefirstglobalanalysisoftranscriptomechanges within a four-step model for malignant transformation (see chapter 1 for moredetailsonthemodelsystem).AsbeingpartoftheHumanProteinAtlas project, we have access to a unique library of antibodies towards all human proteins.Thisenabledustoperformfollow-upstudiesoninterestingproteins selectedbyanalysisoftranscriptomicsdata.Intheworkpresentedinpaper III,wefoundseveralinterestingproteincandidateswhoseexpressioninthe cell line model was further evaluated in tumor tissue samples of different malignantstages.Weconcludedthatthecelllinemodel,despiteitsinherent limitations,servesasagoodmodeltostudygeneexpressionchangesduring 34 malignanttransformation.Wethereforecontinuedtoexposethismodeltoa perturbedenvironmentwithdecreasedoxygensupply,presentedintheform ofamanuscript(paper IV).Tumortissuesoftencontainlocalareasthatare less oxygenated due to uncontrolled oncogene-driven proliferation of the presenttumorcells.Animportantcharacteristicoftransformedcancercellsis theircapabilitytorewiretheirenergymetabolismandcontinuetoproliferate independently of oxygen. This aggressive phenotype is known to be associated with increased resistance to radiotherapy as well as increased metastatic potential176. To continue our study of the BJ model for malignant transformation, we set out to explore the existence of such metabolic flexibility, from a transcriptomics perspective. Our results show that the BJ modelcanbedividedintotwogroups,onepremalignantphenotypeincluding the primary and immortalized cells and one postmalignant phenotype in which the introduction of the SV40 give rise to major metabolic changes mainly involving upregulation of lipid, carbohydrate and amino acid metabolic pathways. In the end of this thesis are the articles and a manuscript,containingdetaileddescriptionsoftheworkanditsfindings. AssessingtheconsistencyamongpublicRNA-seqdata(PaperI) Aboutthreeyearsago,oneofmysupervisorswasinvolvedinacollaborative study aiming to describe the baseline transcriptome of human skeletal muscle. Browsing the web, he stumbled upon a blog post that immediately caughthisattention.Ayoungmanhadbecomeaself-taughtbioinformatician by exploring RNA-seq data after his wife had been diagnosed with a genetic disease. In this blog-post, he described his excitement over the increasing availability of public RNA-seq data, especially the Illumina Body Map from which he had downloaded BAM files and quantified gene expression in a numberofdifferenttissuetypes177. “Often I find myself needing a quick reference on how highly expressed a particular gene is across different tissues, either to browse visually or to combinewithotherdatainananalysis.“178 SincemysupervisorwasworkingwithRNA-seqdataformusclesamples,he was well familiar with the gene TTN, encoding the third most abundant protein(Titin)inhumanmuscleafterMyosinandActin179.Therefore,hegot surprised when he noticed that this gene was reported to have FPKM=0 in dataformuscletissuepresentedontheblog.Hestartedtodigdeeperintothis finding and performed reanalysis of the same data using different program parameters. This resulted in about 2.7 million aligned reads to the genomic regionoccupiedbyTTN,correspondingtoanFPKMvalueofabout200.The reasonforthisdiscrepancywasadefaultlimitonthenumberofalignments perlocus,setbytheprogramCufflinks. 35 This was the starting point for the project that is presented inpaper I. In a timewhenRNA-seqdataisheavilybeingusedbypeoplewithabroadrange ofknowledgeitismoreimportantthanevertosecuretheusefulnessofdata andcontrolforbias.Wethereforesetouttoassesstheconsistencyofpublicly available RNA-seq data on a global level. Our focus subsequently turned towards global expression analyses. Instead of looking at individual gene expression values, we analyzed the variability among the different samples and explored the impact on a number of factors related to the data generation. If samples clustered according to tissue type rather than laboratory of origin, or some other factors related to experimental parameters,weconsideredthemconsistentandusefulformeta-analyses.For example,onecouldusedatageneratedbyotherstoidentifytissueoriginofan unknownsample,anduseexternaldatatoexpandthenumberofreplicatesin onesownexperiment.Weinitiallyanalyzedtheconsistencyamongfoursets of pre-computed gene expression levels (FPKM and RPKM values) for the human tissue types brain, heart, and kidney. These datasets had several differencesintheparametersrelatedtothegenerationofthem.Ouraimwith this study was not to propose a gold standard protocol that would keep datasets more consistent between labs, but rather to allow parameters to vary and explore the possibility to process the data in a way that would maximizetheirusefulness.Ourmajortake-homemessagefromthisstudywas that public RNA-seq datasets are relatively consistent, requiring only light pre-processing for samples to cluster according to their tissue of origin. We found that reprocessing of raw data did not bring about any significant improvements,exceptforallowingthecompletesetofgenestobeincluded. In contrast to the blog post that initiated this project, luckily we did not encounteranyproblemswithmissingoutonimportanttissue-specificgenes whileanalyzingthepublisheddata.Thisprojectispresentedinmoredetailin paperI. Usingtranscriptomicstoguideproteomics(PaperII) Sinceproteinsarecommonlyattributedthemajorfunctionalrolewithincells, sequencing of mRNA is based on the assumption that a difference in mRNA abundance reflects different phenotypes via differences in corresponding protein levels. To investigate the accuracy of this assumption, translation research has evolved into a field of its own, aimed at establishing the link between mRNA and protein under various scenarios134,180. During the last decade, numerous studies have been conducted aiming to reveal the quantitative relationship between mRNA and corresponding protein products. One way to approach the question whether a correlation between mRNAlevelsandproteinlevelsexististosimplycorrelatemRNAandprotein expression levels for numerous different genes. This can be done for both 36 relativeandabsolutevaluesofexpression.Anotherapproach,whichwasused inthestudypresentedinpaper II,istocomparetherelativeabundancesof onegeneatatime,acrossvarioussamples.Anabsoluteconsensushasnotyet been reached regarding the extent to which mRNA levels dictate protein levels.However,mostresearchersagreethattheRNA-to-protein(RTP)ratios dependonmultiplerelatedprocessesandcanvarygreatlybetweengenesand gene categories and even display pathological disturbances181. Given this multitudeandheterogeneityofinteractionsthatcontributetotheRTPratios, a more straightforward approach is to investigate this relationship for each gene individually. In paper II, a targeted mass spectrometry assay was developed, allowing absolute quantification of protein molecules per cell. In this method, isotope-labeled recombinant protein fragments of known amountswereusedasinternalstandards,enablingabsolutequantificationof 55proteinswiththePRMtechnology.Thepanelof55proteinswasquantified in9celllinesand11tissuesamples.Normalizationagainstnumberofcellsis straightforward for cell lines as there are convenient methods for cell counting. For tissue samples however it is more challenging, due to the heterogeneouscompositionoftissuesandtheextracellularmatrixcontaining a vast amount of protein. An essential part of this study was therefore to develop a strategy for normalization against cell number in the tissue samples.ThetargetedPRMapproachwasdevelopedforthefourcorehistone subunits H2A, H2B, H3 and H4, previously shown to be evenly distributed alongthechromosomes182.Inthisway,thenumberofcellswithineachtissue sample could be calculated. The results presented in this paper show that there is a gene-specific ratio between absolute concentration of protein and thecorrespondingmRNAlevels(presentedasnormalizedTPMvalues)thatis conservedacrossthevariouscelltypesandtissuesincludedinthestudy.The RTPratiosevaluatedinthisstudyshowedhighvariability,rangingfrom200 for the transcription factor MYBL2 to 220,000 for the serine protease inhibitor SERPINB1. The diversity observed in this relatively small set of 55 genes indicates the importance of gene-specific characteristics and posttranscriptional regulation to determine protein concentrations. How thesedifferencesappearbetweengenesandgenecategoriesonaglobalscale remainsforthefuturetobeevaluated.Ourstudysuggeststhateachgenehas its own inherent ratio between mRNA and protein, owing to multiple regulatory factors that collectively control the abundance of these related molecules.ThisfindingsuggeststhatcorrelationbetweenmRNAandprotein levels should be assessed for each gene individually, rather than being generalizedonaglobalscale,andthattranscriptomicsisapowerfultoolwith greatpotentialtoguideproteomicsstudiesintherightdirections. 37 Combining transcriptomics and protein profiling to study malignanttransformation(paperIIIandIV) Malignanttransformationisacomplexprocessinvolvingmultiplemolecular events that subsequently lead to a cancerous phenotype183. To characterize the molecular changes underlying malignant transformation, we combined transcriptomics and immunofluorescence protein profiling applied on a cell line model for malignant transformation. The cell line model is of fibroblast origin and was created in order to mimic the natural steps of malignant transformation. This was achieved by the sequential additions of three geneticalterations(seechapter1formoredetails).Transcriptomicprofiling ofthefourstepsinthemodelindicateddifferentialexpressionofabout6%of theprotein-codinggenome,inatleastonestepgoingfromtheprimarystate totheimmortalizedstate,viathetransformedstatetothefinalmetastasizing state. Among the differentially expressed genes, a majority (80%) showed downregulation over the course of increased malignancy. Downregulation was shown to mainly affect genes encoding proteins expressed on surface areas of the cell while upregulation was mainly affecting genes involved in proliferation. The unique repository of in-house produced antibodies in the HumanProteinAtlasenabledsubsequentanalysisofproteinswithinteresting expression profiles throughout the model, identified by the transcriptomics analysis184. Confocal imaging of immunofluorescence-stained cells enabled proteinstobevisualizedwithspatialresolutiononasinglecelllevel.Inthis way, subcellular translocation could be observed and upregulation on the mRNA level was not only reflected in increased staining intensity in the IF imagesbutalsoinincreasednumberofcellsexpressingacertainprotein,for exampleinthecaseofAnnilin. The BJ model has several limitations, for example by being a cell line model thatisadaptedtoaninvitroenvironmentandbeingoffibroblastoriginwhich isnotthemosttypicalcelltypetoundergomalignanttransformationinvivo. However, genome-wide transcriptomics analysis of this model enabled the identificationofseveralinterestingcandidateswithpotentialimplicationsin cancer.Someoftheseproteinswerefurtherevaluatedincohortsofprostate and colon tumor tissues of matched malignancy grade. Here, the observed differential expression of ANLN, ANPEP, ANXA1 and BDH1 in the BJ model was confirmed on a larger scale, covering almost a hundred clinical tumor samples. The expansion of this study to also include in vivo samples is important and shows that the BJ model is a relevant model system for the studyofmalignanttransformationandincreasedproliferativecapability.Asa follow-up study, we exposed this model to an environment containing 3% oxygen. This is interesting from the perspective of tumor hypoxia, a 38 commonly encountered scenario in solid tumors where local areas are deprivedofoxygen,somethingtumorcells,butnotnormalcells,cansurvive. In these hypoxic microenvironments, tumor cells can still proliferate due to rewiring of molecular functions such as energy metabolism, increased vascularization and metastatic behavior. Another interesting aspect of cultivation in 3% oxygen is that it resembles the in vivo conditions better, compared to the relative extreme level of oxygen concentration provided in theatmosphericinvitrocultivationenvironment.Thefindingsfromthisstudy indicate once again that the most dramatic change of identity occurs as a resultoftheSV40transformationwithinthemodelsystem.Notonlydidwe observe a higher number of genes differentially regulated in the SV40transformed stage, the differentially expressed genes shared with the previous stage were in most cases regulated in the opposite direction. Analysis of metabolic pathways involving differentially expressed genes showed that several genes involved in carbohydrate, lipid and amino acid metabolic pathways are upregulated in the transformed cells in response to the altered oxygen supply. An important consideration is that the BJ model was originally created in atmospheric oxygen condition, limiting the possibilitytostudythehypoxicphenotypeindirectrelationtotheprocessof malignant transformation itself. Still, we demonstrate that, even though the cell lines of this isogenic model are highly similar from a global view, the three genetic introductions have large impact on the cell’s response to a perturbedenvironment.Thedifferentiallyexpressedgenesthatweidentified in paper III were again identified in the data generate in this study, in the atmosphericcondition,addingadditionalsupporttoourpreviousfindings. Concludingobservationsandfutureperspectives I find myself in the middle of an exciting era. Technology development is movingfast.Onlyadecadeago,researchinggeneexpressionbyexamination of RNA transcript content was limited to the use of hybridization-based methods, covering only a subset of the cellular transcriptome. Today, highthroughput sequencing has taken over and continues to evolve through applicationsonthesinglecelllevelandevenwithspatialresolution185,186.In the first paper covered by this thesis, we concluded that publicly available RNA-seqdatasetsarefairlyreproducibleonagenerallevel.Thisfindinghas alsobeenconfirmedmorerecentlybyothers,concludingthattheGTExdata aswellasFantom5datashowglobalconsistencywiththedatageneratedby the Human Protein Atlas, even when different transcriptomics methods are employed155,187. These studies collectively show that transcriptomics has developed into a highly reliable technology, indicating that the increasing body of available datasets will continue to add strength to future metaanalyses.TherobustnessofmodernRNA-seqtechnologyisnotonlyvaluable 39 within the field of transcriptomics itself. As we showed in paper II, its potentialexpandsbeyondthisandcanpossiblybeusedtoguideexploration oftheproteome.Wehaveshowedthatthereisagene-specificratiobetween mRNA and protein abundances, indicating that RNA-seq can be used to extrapolatetheamountofproteininthecell.Itisimportanttonotethatthisis only applicable to steady-state conditions and not valid in scenarios where the cellular environment has been altered. However, as experiments are commonlyperformedduringsteady-stateconditions,gene-specificRTPratios can till be useful, especially if the identification of RTP ratios is further expanded to involve the entire protein-coding genome. This is theoretically possible considering the large repository of recombinant protein fragments generated within the Human Protein Atlas. In order to gain complete mechanistic understanding of the regulation of gene expression, especially with regard to the great diversity seen among RTP ratios, more detailed analyses should however be considered. Neither should we forget that translationalmechanismsoftenaredysregulatedinthecontextofdisease,for example by repression through binding of microRNAs to mRNA transcripts188,189.ThegreatpotentialofRNA-seqtoguideinanalysesaiming to understand complex biological systems is well exemplified in the study presented in paper III. With a transcriptomics view of the cells at different stages of cancer development, we could identify a subset of interesting proteins that were further studied in detail with spatial resolution on a subcellularlevel.Wehaverapidlymovedintoatimewheretechnologyallows for systematic exploration of entire genomes and transcriptomes with high accuracy. The field of proteomics is not yet capable of reaching the same coverage as the “-omics” fields concerned with nucleotides, as no universal techniqueavailableiscapableofquantifyingtheentirehumanproteomeina single experiment96. The immense size and complexity of the human proteome presents a great challenge for the future. On the other hand, a multitudeofspecificbiologicalaspectsrelatedtotheproteomiclandscapeare perhaps more important for understanding the workings of the cell. Rather than just focusing on global quantification, large efforts are likely to be directed towards studies aimed at quantifying protein activity in terms of PTMs, protein-protein interactions, spatial distribution as well as stoichiometry of protein complexes with implications in cell signaling. I speculate that advances over the last decade within the field of transcriptomicswillcontributevastlytoguideandfocusproteomicsstudies. Inthefuture,onemaycometofocusmoreonmeasuringasubsetofproteins isolated in a specific biological context, with high accuracy. Integration of multiplestrategiesdevelopedoutofasystemsbiologyframeworkisneeded to reach the ultimate goal of understanding the enormous complexity of the cell. 40 Acknowledgements Hej, vad kul att du läser min avhandling! Kanske har du just läst de övriga sidorna och funderat på det här med mRNA, transcriptomics och proteiner, ellersåhardu-ganskatroligt-ivrigtbläddratdigdirekthit,tilldenviktigaste sidan i hela boken. Ni är många som har varit med mig under de här åren - inspirerat, skrattat, lyssnat, diskuterat och gjort det här möjligt. TACK alla handledare,kollegor,samarbetspartners,vännerochfamilj: Emma–duanarintehurstoltjagäröveratthafåttvaradindoktorand.Din kvickhet, förmåga att se möjligheter i nästan precis vad som helst och få kompliceradegrejerattbliheltsjälvklaraharvaritsåinspirerande!Duären stjärnapåallasättochjagkommeralltidattbäramedmigkänslanefterett mötemeddig,dåingentingkännsroligareochmermeningsfulltänattfåhålla påmedforskning. Mikael–vilkenturförmigattjaglyckadeshaffadigtidigtsombihandledare innannågonannantogdig.Duharvaritheltfantastisk!Tackföralldintid,för alltdulärtmigombioinformatik,förallapodcast-tipsochtrevligakaffemöten påMellqvists. Mathias-fördinotroligaentusiasm,attdustartadeHPAochförattdulärde migviktenavattkunnasammanfattasinforskningienendamening. TillhelagruppenCellProfiling,niharvaritfantastiskaallihop,onecellisno cell!Mikaela,DianaochLovisa–minaSanFranSisters,förattvialltidstöttar varandraochförattnibaraärsåroliga,smartaochunderbara!Utanerhade den här boken aldrig blivit klar. Charlotte – min forsknings-storasyster när jagbörjadesomdoktorand,duharvaritenförebildpåmångasätt,bådeioch utanför jobbet. Marie – min första handledare under exjobbet, tack för alla roligautflykter,konstigafrukterochfördinrakaattityd.Trotsattdulämnade oss för Dalarna i mars känns du fortfarande som en helt självklar del av gruppen.AnnaochJenny–förattniärsåunderbaraatthaomkringsig,attfå jobbamederärlyx.Hammou–förbästahärjetiPhillyochfördinoslagbara kollpåvärldsläget.Martin–förgottbrödochetttryggtlugnsomharvarittill väldigt stor hjälp under alla år. Christian und Peter – danke für all die schönenzeit,Siesindgenial.Lasse–fördinoändligakunskap,nyfikenhetoch hjälpsamhet, du kan ju typ... allt? Devin – for all the fun and for always helping out making my transitions from spoken to written text more rapid and direct. Casper – för trevligt prat och bra svar på frågor om maskininlärning mm. Ulla och Annica – drottningarna av psykoplasman, ni harvaritsaknade. 41 Annica Gad-förettfintsamarbeteochvärdefullasynpunkterfrånenriktig biologsperspektiv.ErikFasterius–försmartafrågorochtrevligamötenom RNA-seqmmochalltstödunderslutspurten!ChengandKemal–formaking nicefiguresandforsmoothcollaboration.ToveBochFredrikE–föralltni lärtmigomMS. Sanna-minäldstavänpåjobbet,detharvaritsåsköntatthadignäraunder avhandlingsarbetet och vad kul det var att vi fick tid att hänga ordentligt i Grekland!kännermigalltidlitelugnareefteratthapratatmeddig. Arash och Fredrik-förbästa(bad)sällskapetochförattniräddademittliv underdenvildaforsränningeniBrixen!Tony-förattdugavlabbetliteflärd och lärde mig att inte rynka ögonbrynen allt för mycket. Beata – för många trevligaluncherochfikor.Ina–förattduinspirerarochintegrerarkonsten ochyoganivårforskningvärld.Leo–forbeingsuchasocialanimal,fornever missing out on cakes and for fantastic guided tours in Vienna. Tove A, Per, Åsa, HPA IT, Björn H, Figge, EvelinaochallaandraHPA:are–tackförallt trevligt! Cecilia Williams – för värdefulla synpunkter och granskningen av minavhandling.PerKraulis-förmångabrafrågorunderpresentationerjag haft genom åren, du är en av SciLifes riktiga hjältar! Claudia - för bra förklaringar av MS-principer. Ola – för bra guidning inom translationsfältet. Ulf–förattdulyserupphelaplan2ochgördettillensåmyckettrevligare arbetsplats,vilkenturattdukomtillbakafrånOmaha!Johannes och Axel– BAM! har varit superkul att hänga med er och ordna seminarier, hoppas ni fortsätterisammastil.AllaprofessorerochPIssomgörplan2såbäst:Peter ochJochen,Cecilia,Jan,Jacob,BoochOla. Ulrika, Maia, Kimi, Ronald, Björn, Cissi, Maria, David, Tea, Julia, MunGwan,Elin,Eni,Phil,Maria,Lucia,Mahya,Julie,FahmiochLauramfl-för den fina stämningen på bästa planet. Alla roliga typer på plan 3: Mickan, Erik,Anders,Fredrik,Nemo,Sverkermfl.Seppen-assarochAlbaNova-folk: SaraK,Tove,Tarek,SaraL,Micke,Jenny,Cristina,Per-Åke,Stefan. Allaminavännerutanförforskningsvärldensomfårmigattkommaihågvad livets byggstenar egentligen är. Och såklart mamma, pappa, min älskade systerKajsa,Tobias,IvarochIrma,förertgränslösastödialltjaggör,niär de allra bästa! Faster Ulla för evig inspiration och fint stöd från det färdigdisputerade perspektivet. Stephen för att du tog dig tid att över telefon trasslamiglossurlångakrångligameningarpåengelska.FarmorLisbethför ettunderbartavbrottiskrivandetmedbesökietthöstvackertVilhelminahos dig. Lena, Lasse, Martin, Johanna, Alvar, Lovisa och Gethy, jag är så tacksamoverattfåvaraendelaverfamiljockså.Viktor–fördinvärmeoch nyfikenhet, ditt oändliga tålamod med alla sena kvällar då jag suttit fastklistradviddatorn,allanyavärldarduöppnaruppförmigochförattdu barafunnitsdär! 42 Bibliography 1. Bianconi, E. et al. An estimation of the number of cells in the human body. Ann. Hum. Biol. 40, 463–471 (2013). Morrison, D., Wolff, S., Fraknoi, A., Pasachoff, J. M. & Pinsky, L. S. Abell's Exploration of the Universe, and Astronomy: From the Earth to the Universe. Physics Today 49, 70–70 (1996). Gest, H. The discovery of microorganisms by Robert Hooke and Antoni Van Leeuwenhoek, fellows of the Royal Society. Notes and records of the Royal Society of London 58, 187–201 (2004). Fleming, D., Needham, J., Grant, E. & Roger, J. The DSB: A Review Symposium The Dictionary of Scientific Biography. Charles Coulston Gillispie. Isis 71, 633–652 (1980). Mikroskopische Üntersuchungen über die Übereinstimmung in der Struktur und dem Wachstume der Tiere und Pflanzen. Nature 86, 41–41 (1911). Cooper, G. M. & Hausman, R. E. The Cell. Burndy Library,, Mendel, G. & Punnett, R. C. Versuche über Pflanzen-Hybriden /. (Im Verlage des Vereines,, 1866). doi:10.5962/bhl.title.61004 Dahm, R. Discovering DNA: Friedrich Miescher and the early years of nucleic acid research. Human genetics 122, 565–581 (2008). Choudhuri, S. The Path from Nuclein to Human Genome: A Brief History of DNA with a Note on Human Genome Sequencing and Its Impact on Future Research in Biology. Bulletin of Science, Technology and Society 23, 360–367 (2003). WATSON, J. D. & CRICK, F. H. Genetical implications of the structure of deoxyribonucleic acid. Nature 171, 964–967 (1953). Gerstein, M. B. et al. What is a gene, post-ENCODE? History and updated definition. Genome Res. 17, 669–681 (2007). ASTRACHAN, L. & VOLKIN, E. The absence of ribonucleic acid in bacteriophage T2r+. Virology 2, 594–598 (1956). JACOB, F. & MONOD, J. Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol. 3, 318–356 (1961). BRENNER, S., JACOB, F. & MESELSON, M. An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature 190, 576–581 (1961). GROS, F. et al. Unstable ribonucleic acid revealed by pulse labelling of Escherichia coli. Nature 190, 581–585 (1961). Djebali, S. et al. Landscape of transcription in human cells. Nature 489, 101–108 (2012). Wang, X. Next-Generation Sequencing Data Analysis. (CRC 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 43 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 44 Press, 2016). doi:10.1201/b19532 Morris, K. V. & Mattick, J. S. The rise of regulatory RNA. Nat. Rev. Genet. 15, 423–437 (2014). Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008). Tan, S. C. & Yiap, B. C. DNA, RNA, and Protein Extraction: The Past and The Present. Journal of Biomedicine and Biotechnology 2009, 1–10 (2009). Perrett, D. From ‘protein’ to the beginnings of clinical proteomics. Proteomics Clin Appl 1, 720–738 (2007). VICKERY, H. B. The origin of the word protein. Yale J Biol Med 22, 387–393 (1950). Kitano, H. Biological robustness. Nat. Rev. Genet. 5, 826–837 (2004). Lowe, M. & Barr, F. A. Inheritance and biogenesis of organelles in the secretory pathway. Nature Reviews Molecular Cell Biology 8, 429–439 (2007). Stoeger, T., Battich, N. & Pelkmans, L. Passive Noise Filtering by Cellular Compartmentalization. Cell 164, 1151–1161 (2016). Nucleic Acids and Protein Synthesis. Nature 177, 602–603 (1956). CRICK, F. H. On protein synthesis. Symp. Soc. Exp. Biol. 12, 138–163 (1958). Baltimore, D. RNA-dependent DNA polymerase in virions of RNA tumour viruses. Nature 226, 1209–1211 (1970). Temin, H. M. & Mizutani, S. RNA-dependent DNA polymerase in virions of Rous sarcoma virus. Nature 226, 1211–1213 (1970). Central dogma reversed. Nature 226, 1198–1199 (1970). Crick, F. Central dogma of molecular biology. Nature 227, 561– 563 (1970). Burndy Library,, Darwin, C., Edmonds & Remnants, Henry Sotheran Ltd. On the origin of species by means of natural selection, or, The preservation of favoured races in the struggle for life /. (John Murray ...,, 1859). doi:10.5962/bhl.title.59991 World Cancer Report 2014. International Agency for Research on Cancer Maqsood, M. I., Matin, M. M., Bahrami, A. R. & Ghasroldasht, M. M. Immortality of cell lines: challenges and advantages of establishment. Cell Biol. Int. 37, 1038–1045 (2013). Burdall, S. E., Hanby, A. M., Lansdown, M. R. J. & Speirs, V. Breast cancer cell lines: friend or foe? Breast Cancer Res. 5, 89–95 (2003). Osborne, C. K., Hobbs, K. & Trent, J. M. Biological differences among MCF-7 human breast cancer cell lines from different laboratories. Breast Cancer Research and Treatment 9, 111– 121 (1987). Moore, J. M. The Immortal Life of Henrietta Lacks. By Rebecca Skloot . 2010. Crown Publishing. (ISBN 9781400052172). 384 pages. Hardcover. $26.00. The American Biology Teacher 72, 586–587 (2010). Hanahan, D. & Weinberg, R. A. The hallmarks of cancer. Cell 100, 57–70 (2000). Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: the next generation. Cell 144, 646–674 (2011). Harrington, L. et al. Human telomerase contains evolutionarily conserved catalytic and structural subunits. Genes & Development 11, 3109–3115 (1997). Pütz, S. M., Vogiatzi, F., Stiewe, T. & Sickmann, A. Malignant transformation in a defined genetic background: proteome changes displayed by 2D-PAGE. Mol. Cancer 9, 254 (2010). Rich, J. N. et al. A genetically tractable model of human glioma formation. Cancer Res. 61, 3556–3560 (2001). Yu, J., Boyapati, A. & Rundell, K. Critical Role for SV40 Small-t Antigen in Human Cell Transformation. Virology 290, 192–198 (2001). Elenbaas, B. et al. Human breast cancer cells generated by oncogenic transformation of primary mammary epithelial cells. Genes & Development 15, 50–65 (2001). MacKenzie, K. L. et al. Multiple stages of malignant transformation of human endothelial cells modelled by coexpression of telomerase reverse transcriptase, SV40 T antigen and oncogenic N-ras. Oncogene 21, 4200–4211 (2002). Lundberg, A. S. et al. Immortalization and transformation of primary human airway epithelial cells by gene transfer. Oncogene 21, 4577–4586 (2002). Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63 (2009). Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNASeq. Nat. Methods 5, 621–628 (2008). Nagalakshmi, U. et al. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320, 1344–1349 (2008). Adams, M. D. et al. Complementary DNA sequencing: expressed sequence tags and human genome project. Science 252, 1651–1656 (1991). Bainbridge, M. N. et al. Analysis of the prostate cancer cell line LNCaP transcriptome using a sequencing-by-synthesis approach. BMC Genomics 7, 246 (2006). Cheung, F. et al. Sequencing Medicago truncatula expressed sequenced tags using 454 Life Sciences technology. BMC Genomics 7, 272 (2006). Emrich, S. J., Barbazuk, W. B., Li, L. & Schnable, P. S. Gene discovery and annotation using LCM-454 transcriptome 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 45 54. 55. 56. 57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69. 70. 46 sequencing. Genome Res. 17, 69–73 (2007). Weber, A. P. M. Discovering New Biology through Sequencing of RNA. Plant Physiol. 169, 1524–1531 (2015). Garalde, D. R. et al. Highly parallel direct RNA sequencing on an array of nanopores. bioRxiv 068809 (Cold Spring Harbor Labs Journals, 2016). doi:10.1101/068809 Schroeder, A. et al. The RIN: an RNA integrity number for assigning integrity values to RNA measurements. BMC Mol. Biol. 7, 3 (2006). Aken, B. L. et al. The Ensembl gene annotation system. Database (Oxford) 2016, baw093 (2016). Rosenbloom, K. R. et al. The UCSC Genome Browser database: 2015 update. Nucleic Acids Res. 43, D670–81 (2015). Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7, 562–578 (2012). Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009). Pertea, M., Kim, D., Pertea, G. M., Leek, J. T. & Salzberg, S. L. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protoc 11, 1650–1667 (2016). Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013). Li, B., Ruotti, V., Stewart, R. M., Thomson, J. A. & Dewey, C. N. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26, 493–500 (2010). Wagner, G. P., Kin, K. & Lynch, V. J. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 131, 281–285 (2012). Pachter, L. Models for transcript quantification from RNA-Seq. (2011). Patro, R., Mount, S. M. & Kingsford, C. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol. 32, 462–464 (2014). Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525– 527 (2016). Liao, Y., Smyth, G. K. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014). Anders, S., Pyl, P. T. & Huber, W. HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015). Trapnell, C. et al. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat. Biotechnol. 31, 46–53 (2013). Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides accurate, fast, and bias-aware transcript expression estimates using dual-phase inference. bioRxiv 021592 (2016). doi:10.1101/021592 Patro, R., Mount, S. M. & Kingsford, C. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol. 32, 462–464 (2014). Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011). Pimentel, H. J., Bray, N., Puente, S., Melsted, P. & Pachter, L. Differential analysis of RNA-Seq incorporating quantification uncertainty. bioRxiv 058164 (2016). doi:10.1101/058164 Soneson, C., Love, M. I. & Robinson, M. D. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res 4, 1521 (2015). Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014). Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010). McCarthy, D. J., Chen, Y. & Smyth, G. K. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 40, 4288–4297 (2012). Kanter, I. & Kalisky, T. Single cell transcriptomics: methods and applications. Front Oncol 5, 53 (2015). Norel, R., Rice, J. J. & Stolovitzky, G. The self-assessment trap: can we all be better than average? Molecular Systems Biology 7, 537–537 (2011). Wilkins, M. R. et al. From Proteins to Proteomes: Large Scale Protein Identification by Two-Dimensional Electrophoresis and Arnino Acid Analysis. Nat. Biotechnol. 14, 61–65 (1996). Rabilloud, T., Hochstrasser, D. & Simpson, R. J. Is a genecentric human proteome project the best way for proteomics to serve biology? Proteomics 10, 3067–3072 (2010). Ponten, F. et al. A global view of protein expression in human cells, tissues, and organs. Molecular Systems Biology 5, 337 (2009). Lundberg, E. et al. Defining the transcriptome and proteome in three functionally different human cell lines. Molecular Systems Biology 6, 450 (2010). Geiger, T., Wehner, A., Schaab, C., Cox, J. & Mann, M. Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins. Mol. Cell Proteomics 11, M111.014050–M111.014050 (2012). 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83. 84. 85. 47 86. 87. 88. 89. 90. 91. 92. 93. 94. 95. 96. 97. 98. 99. 100. 101. 102. 48 Azimifar, S. B., Nagaraj, N., Cox, J. & Mann, M. Cell-TypeResolved Quantitative Proteomics of Murine Liver. Cell Metabolism 20, 1076–1087 (2014). Richards, A. L., Merrill, A. E. & Coon, J. J. Proteome sequencing goes deep. Current Opinion in Chemical Biology 24, 11–17 (2015). Sharma, K. et al. Cell type– and brain region–resolved mouse brain proteome. Nature Neuroscience 18, 1819–1831 (2015). Towbin, H., Staehelin, T. & Gordon, J. Electrophoretic transfer of proteins from polyacrylamide gels to nitrocellulose sheets: procedure and some applications. Proceedings of the National Academy of Sciences 76, 4350–4354 (1979). Engvall, E. & Perlmann, P. Enzyme-linked immunosorbent assay (ELISA). Quantitative assay of immunoglobulin G. Immunochemistry 8, 871–874 (1971). O'Farrell, P. H. High resolution two-dimensional electrophoresis of proteins. J. Biol. Chem. 250, 4007–4021 (1975). Kim, M.-S. et al. A draft map of the human proteome. Nature 509, 575–581 (2014). Wilhelm, M. et al. Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587 (2014). Savitski, M. M., Wilhelm, M., Hahne, H., Kuster, B. & Bantscheff, M. A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets. Mol. Cell Proteomics 14, 2394–2404 (2015). Ezkurdia, I., Vázquez, J., Valencia, A. & Tress, M. Analyzing the first drafts of the human proteome. J. Proteome Res. 13, 3854– 3855 (2014). Omenn, G. S. et al. Metrics for the Human Proteome Project 2016: Progress on Identifying and Characterizing the Human Proteome, Including Post-Translational Modifications. J. Proteome Res. acs.jproteome.6b00511 (2016). doi:10.1021/acs.jproteome.6b00511 Uhlén, M. et al. Proteomics. Tissue-based map of the human proteome. Science 347, 1260419–1260419 (2015). Fenn, J. B., Mann, M., Meng, C. K., Wong, S. F. & Whitehouse, C. M. Electrospray ionization for mass spectrometry of large biomolecules. Science 246, 64–71 (1989). Karas, M. & Hillenkamp, F. Laser desorption ionization of proteins with molecular masses exceeding 10,000 daltons. Analytical Chemistry 60, 2299–2301 (1988). Steen, H. & Mann, M. The ABC‘s (and XYZ’s) of peptide sequencing. Nature Reviews Molecular Cell Biology 5, 699–711 (2004). Aebersold, R. & Mann, M. Mass spectrometry-based proteomics. Nature 422, 198–207 (2003). Payne, A. H. & Glish, G. L. in Biological Mass Spectrometry 402, 103. 104. 105. 106. 107. 108. 109. 110. 111. 112. 113. 114. 115. 116. 109–148 (Elsevier, 2005). Perkins, D. N., Pappin, D. J., Creasy, D. M. & Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567 (1999). Bantscheff, M., Lemeer, S., Savitski, M. M. & Kuster, B. Quantitative mass spectrometry in proteomics: critical review update from 2007 to the present. Anal Bioanal Chem 404, 939– 965 (2012). Ong, S.-E. et al. Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Molecular & Cellular Proteomics 1, 376– 386 (2002). Mirgorodskaya, O. A. et al. Quantitation of peptides and proteins by matrix-assisted laser desorption/ionization mass spectrometry using (18)O-labeled internal standards. Rapid Communications in Mass Spectrometry 14, 1226–1232 (2000). Chakraborty, A. & Regnier, F. E. Global internal standard technology for comparative proteomics. J Chromatogr A 949, 173–184 (2002). Hsu, J.-L., Huang, S.-Y., Chow, N.-H. & Chen, S.-H. Stableisotope dimethyl labeling for quantitative proteomics. Analytical Chemistry 75, 6843–6852 (2003). Gygi, S. P. et al. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat. Biotechnol. 17, 994–999 (1999). Thompson, A. et al. Tandem Mass Tags: A Novel Quantification Strategy for Comparative Analysis of Complex Protein Mixtures by MS/MS. Analytical Chemistry 75, 4942–4942 (2003). Ross, P. L. Multiplexed Protein Quantitation in Saccharomyces cerevisiae Using Amine-reactive Isobaric Tagging Reagents. Molecular & Cellular Proteomics 3, 1154–1169 (2004). Brun, V., Masselon, C., Garin, J. & Dupuis, A. Isotope dilution strategies for absolute quantitative proteomics. Journal of Proteomics 72, 740–749 (2009). Pratt, J. M. et al. Multiplexed absolute quantification for proteomics using concatenated signature peptides encoded by QconCAT genes. Nat Protoc 1, 1029–1043 (2006). Gerber, S. A., Rush, J., Stemman, O., Kirschner, M. W. & Gygi, S. P. Absolute quantification of proteins and phosphoproteins from cell lysates by tandem MS. Proceedings of the National Academy of Sciences 100, 6940–6945 (2003). Hanke, S., Besir, H., Oesterhelt, D. & Mann, M. Absolute SILAC for Accurate Quantitation of Proteins in Complex Mixtures Down to the Attomole Level. J. Proteome Res. 7, 1118–1130 (2008). Brun, V. et al. Isotope-labeled protein standards: toward absolute quantitative proteomics. Molecular & Cellular 49 117. 118. 119. 120. 121. 122. 123. 124. 125. 126. 127. 128. 129. 50 Proteomics 6, 2139–2149 (2007). Singh, S., Springer, M., Steen, J., Kirschner, M. W. & Steen, H. FLEXIQuant: A Novel Tool for the Absolute Quantification of Proteins, and the Simultaneous Identification and Quantification of Potentially Modified Peptides. J. Proteome Res. 8, 2201– 2210 (2009). Zeiler, M., Straube, W. L., Lundberg, E., Uhlen, M. & Mann, M. A Protein Epitope Signature Tag (PrEST) Library Allows SILACbased Absolute Quantification and Multiplexed Determination of Protein Copy Numbers in Cell Lines. Molecular & Cellular Proteomics 11, O111.009613–O111.009613 (2012). Kuzyk, M. A. et al. Multiple reaction monitoring-based, multiplexed, absolute quantitation of 45 proteins in human plasma. Mol. Cell Proteomics 8, 1860–1877 (2009). Domanski, D. et al. MRM-based multiplexed quantitation of 67 putative cardiovascular disease biomarkers in human plasma. Proteomics 12, 1222–1243 (2012). Anderson, L. Quantitative Mass Spectrometric Multiple Reaction Monitoring Assays for Major Plasma Proteins. Molecular & Cellular Proteomics 5, 573–588 (2005). Picotti, P. & Aebersold, R. Selected reaction monitoring–based proteomics: workflows, potential, pitfalls and future directions. Nat. Methods 9, 555–566 (2012). Gallien, S. et al. Targeted proteomic quantification on quadrupole-orbitrap mass spectrometer. Mol. Cell Proteomics 11, 1709–1723 (2012). YALOW, R. S. & BERSON, S. A. Immunoassay of endogenous plasma insulin in man. Journal of Clinical Investigation 39, 1157–1175 (1960). Stryer, L. Fluorescence energy transfer as a spectroscopic ruler. Annual Review of Biochemistry 47, 819–846 (1978). Snapp, E. L., Altan, N. & Lippincott-Schwartz, J. Measuring Protein Mobility by Photobleaching GFP Chimeras in Living Cells. 21.1.1–21.1.24 (John Wiley & Sons, Inc., 2001). doi:10.1002/0471143030.cb2101s19 Stadler, C., Skogs, M., Brismar, H., Uhlén, M. & Lundberg, E. A single fixation protocol for proteome-wide immunofluorescence localization studies. Journal of Proteomics 73, 1067–1078 (2010). Stokes, G. G. On the Change of Refrangibility of Light. Philosophical Transactions of the Royal Society of London 142, 463–562 (1852). Cleary, M. D., Meiering, C. D., Jan, E., Guymon, R. & Boothroyd, J. C. Biosynthetic labeling of RNA with uracil phosphoribosyltransferase allows cell-specific microarray analysis of mRNA synthesis and decay. Nat. Biotechnol. 23, 232–237 (2005). 130. 131. 132. 133. 134. 135. 136. 137. 138. 139. 140. 141. 142. 143. 144. Neymotin, B., Athanasiadou, R. & Gresham, D. Determination of in vivo RNA kinetics using RATE-seq. RNA 20, 1645–1652 (2014). Gray, J. M. et al. SnapShot-Seq: a method for extracting genome-wide, in vivo mRNA dynamics from a single total RNA sample. PLoS ONE 9, e89673 (2014). Wethmar, K., Smink, J. J. & Leutz, A. Upstream open reading frames: Molecular switches in (patho)physiology. BioEssays 32, 885–893 (2010). Barrett, L. W., Fletcher, S. & Wilton, S. D. Regulation of eukaryotic gene expression by the untranslated gene regions and other non-coding elements. Cell. Mol. Life Sci. 69, 3613– 3634 (2012). Liu, Y., Beyer, A. & Aebersold, R. On the Dependency of Cellular Protein Levels on mRNA Abundance. Cell 165, 535– 550 (2016). Boisvert, F.-M. et al. A quantitative spatial proteomics analysis of proteome turnover in human cells. Mol. Cell Proteomics 11, M111.011429–M111.011429 (2012). Brar, G. A. & Weissman, J. S. Ribosome profiling reveals the what, when, where and how of protein synthesis. Nature Reviews Molecular Cell Biology 16, 651–664 (2015). Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. S. & Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324, 218–223 (2009). Piccirillo, C. A., Bjur, E., Topisirovic, I., Sonenberg, N. & Larsson, O. Translational control of immune responses: from transcripts to translatomes. Nature Immunology 15, 503–511 (2014). Lundberg, E. et al. Defining the transcriptome and proteome in three functionally different human cell lines. Molecular Systems Biology 6, 450 (2010). Wilhelm, M. et al. Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587 (2014). Anderson, N. L. The Human Plasma Proteome: History, Character, and Diagnostic Prospects. Molecular & Cellular Proteomics 1, 845–867 (2002). Uhlén, M. et al. A proposal for validation of antibodies. Nat. Methods 13, 823–827 (2016). Lin, J.-R., Fallahi-Sichani, M. & Sorger, P. K. Highly multiplexed imaging of single cells using a high-throughput cyclic immunofluorescence method. Nature Communications 6, 8390 (2015). Bandura, D. R. et al. Mass cytometry: technique for real time single cell multitarget immunoassay based on inductively coupled plasma time-of-flight mass spectrometry. Analytical 51 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155. 156. 157. 158. 159. 160. 161. 162. 163. 52 Chemistry 81, 6813–6822 (2009). Li, Y. & Chen, L. Big Biological Data: Challenges and Opportunities. Genomics, Proteomics & Bioinformatics 12, 187– 189 (2014). Marx, V. Biology: The big challenges of big data. Nature 498, 255–260 (2013). Barrett, T. et al. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 41, D991–5 (2013). Rustici, G. et al. ArrayExpress update--trends in database growth and links to data analysis tools. Nucleic Acids Res. 41, D987–90 (2013). Petryszak, R. et al. Expression Atlas update--an integrated database of gene and protein expression in humans, animals and plants. Nucleic Acids Res. 44, D746–52 (2016). GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013). Uhlén, M. et al. Proteomics. Tissue-based map of the human proteome. Science 347, 1260419–1260419 (2015). Lizio, M. et al. Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome Biol. 16, 22 (2015). FANTOM Consortium and the RIKEN PMI and CLST (DGT) et al. A promoter-level mammalian expression atlas. Nature 507, 462–470 (2014). Danielsson, F., James, T., Gomez-Cabrero, D. & Huss, M. Assessing the consistency of public human tissue RNA-seq data sets. Brief. Bioinformatics 16, 941–949 (2015). Uhlen, M. et al. Transcriptomics resources of human tissues and organs. Molecular Systems Biology 12, 862–862 (2016). Yu, N. Y.-L. et al. Complementing tissue characterization by integrating transcriptome profiling from the Human Protein Atlas and from the FANTOM5 consortium. Nucleic Acids Res. 43, 6787–6798 (2015). Antequera, F. & Bird, A. Number of CpG islands and genes in human and mouse. Proceedings of the National Academy of Sciences 90, 11995–11999 (1993). Fields, C., Adams, M. D., White, O. & Venter, J. C. How many genes in the human genome? Nat. Genet. 7, 345–346 (1994). Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001). Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001). Cunningham, F. et al. Ensembl 2015. Nucleic Acids Res. 43, D662–9 (2015). Breuza, L. et al. The UniProtKB guide to the human proteome. Database (Oxford) 2016, bav120 (2016). Vizcaíno, J. A. et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 44, D447–56 (2016). 164. 165. 166. . 167. 168. 169. 170. 171. 172. 173. 174. 175. 176. 177. 178. 179. Uhlén, M. et al. A human protein atlas for normal and cancer tissues based on antibody proteomics. Molecular & Cellular Proteomics 4, 1920–1932 (2005). Barbe, L. et al. Toward a confocal subcellular atlas of the human proteome. Mol. Cell Proteomics 7, 499–508 (2008). Thul, P. et al. A subcellular map of the human proteome Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000). Jantzen, S. G., Sutherland, B. J., Minkley, D. R. & Koop, B. F. GO Trimming: Systematically reducing redundancy in large Gene Ontology datasets. BMC Research Notes 4, 267 (2011). Yu, G., Wang, L.-G., Han, Y. & He, Q.-Y. clusterProfiler: an R Package for Comparing Biological Themes Among Gene Clusters. OMICS: A Journal of Integrative Biology 16, 284–287 (2012). Alexa, A. & Rahnenfuhrer, J. topGO: Enrichment Analysis for Gene Ontology. Yu, G., Wang, L.-G., Han, Y. & He, Q.-Y. clusterProfiler: an R Package for Comparing Biological Themes Among Gene Clusters. OMICS: A Journal of Integrative Biology 16, 284–287 (2012). Eden, E., Navon, R., Steinfeld, I., Lipson, D. & Yakhini, Z. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics 10, 48 (2009). Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protocols 4, 44–57 (2008). Pathan, M. et al. FunRich: An open access standalone functional enrichment and interaction network analysis tool. Proteomics 15, 2597–2601 (2015). Edwards, A. M. et al. Too many roads not taken. Nature 470, 163–165 (2011). Dewhirst, M. W., Cao, Y. & Moeller, B. Cycling hypoxia and free radicals regulate angiogenesis and radiotherapy response. Nat. Rev. Cancer 8, 425–437 (2008). Asmann, Y. W. et al. Detection of redundant fusion transcripts as biomarkers or disease-specific therapeutic targets in breast cancer. Cancer Res. 72, 1921–1928 (2012). Minikel, E. V. Tissue-specific gene expression data based on Human BodyMap 2.0. (2013). Available at: http://www.cureffi.org/2013/07/11/tissue-specific-geneexpression-data-based-on-human-bodymap-2-0/. (Accessed: 11 November 2016) Labeit, S., Kolmerer, B. & Linke, W. A. The Giant Protein Titin: Emerging Roles in Physiology and Pathophysiology. Circulation Research 80, 290–294 (1997). 53 180. 181. 182. 183. 184. 185. 186. 187. 188. 189. 54 McManus, J., Cheng, Z. & Vogel, C. Next-generation analysis of gene expression regulation--comparing the roles of synthesis and degradation. Mol Biosyst 11, 2680–2689 (2015). Piccirillo, C. A., Bjur, E., Topisirovic, I., Sonenberg, N. & Larsson, O. Translational control of immune responses: from transcripts to translatomes. Nature Immunology 15, 503–511 (2014). van Holde, K. E. Chromatin. (Springer New York, 1989). doi:10.1007/978-1-4612-3490-6 Vogelstein, B. & Kinzler, K. W. The multistep nature of cancer. Trends Genet. 9, 138–141 (1993). Nilsson, P. et al. Towards a human proteome atlas: highthroughput generation of mono-specific antibodies for tissue profiling. Proteomics 5, 4327–4337 (2005). Shapiro, E., Biezuner, T. & Linnarsson, S. Single-cell sequencing-based technologies will revolutionize wholeorganism science. Nat. Rev. Genet. 14, 618–630 (2013). Ståhl, P. L. et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science 353, 78–82 (2016). Yu, N. Y.-L. et al. Complementing tissue characterization by integrating transcriptome profiling from the Human Protein Atlas and from the FANTOM5 consortium. Nucleic Acids Res. 43, 6787–6798 (2015). Grzmil, M. & Hemmings, B. A. Translation regulation as a therapeutic target in cancer. Cancer Res. 72, 3891–3900 (2012). Ha, M. & Kim, V. N. Regulation of microRNA biogenesis. Nature Reviews Molecular Cell Biology 15, 509–524 (2014).

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Integration of RNA and protein expression profiles to