Integration of RNA and protein
expression profiles to study human
cells
Frida Danielsson
© Frida Danielsson
Stockholm 2016
Royal Institute of Technology (KTH)
School of Biotechnology
Science for Life Laboratory
SE-171 65 Solna
Sweden
Printed at
Universitetsservice US-AB
Drottning Kristinas Väg 53B
SE-114 28 Stockholm
Cover image: The BUB1B mitotic checkpoint serine/threonine kinase B
localized to kinetochores during chromosomal segregation in mitosis, stained
with the antibody HPA008419 in a transformed BJ cell.
Academic dissertation which, with the permission of KTH in Stockholm, is
presented for public examination for the degree of Doctor of Technology on
Friday, 16 December 2016, at 13.00 in Rockefellersalen, Karolinska Institutet, Solna.
ISBN 978-91-7729-209-8
TRITA-BIO Report 2016-22
ISSN 1654-2312
Abstract
Cellular life is highly complex. In order to expand our understanding of the
workings of human cells, in particular in the context of health and disease,
detailed knowledge about the underlying molecular systems is needed. The
unifying theme of this thesis concerns the use of data derived from
sequencing of RNA, both within the field of transcriptomics itself and as a
guide for further studies at the level of protein expression. In paper I, we
showed that publicly available RNA-seq datasets are consistent across
different studies, requiring only light processing for the data to cluster
according to biological, rather than technical characteristics. This suggests
that RNA-seq has developed into a reliable and highly reproducible
technology, and that the increasing amount of publicly available RNA-seq data
constitutes a valuable resource for meta-analyses. In paper II, we explored
the ability to extrapolate protein concentrations by the use of RNA expression
levels. We showed that mRNA and corresponding steady-state protein
concentrations correlate well by introducing a gene-specific RNA-to-protein
conversion factor that is stable across various cell types and tissues. The
results from this study indicate the utility of RNA-seq also within the field of
proteomics.
The second part of the thesis starts with a paper in which we used
transcriptomics to guide subsequent protein studies of the molecular
mechanisms underlying malignant transformation. In paper III, we applied a
transcriptomics approach to a cell model for defined steps of malignant
transformation, and identified several genes with interesting expression
patterns whose corresponding proteins were further analyzed with
subcellular spatial resolution. Several of these proteins were further studied
in clinical tumor samples, confirming that this cell model provides a relevant
system for studying cancer mechanisms. In paper IV, we continued to explore
the transcriptional landscape in the same cell model under moderate hypoxic
conditions.
To conclude, this thesis demonstrates the usefulness of RNA-seq data,
from a transcriptomics perspective and beyond; to guide in analyses of
protein expression, with the ultimate goal to unravel the complexity of the
human cell, from a holistic point of view.
Keywords: RNA-seq, Transcriptomics, Proteomics, Malignant transformation,
Cancer, Functional enrichment
Popular science summary (Populärvetenskaplig sammanfattning)
Your body is made up of more than thirty-seven trillion living cells. Each
individual cell is a small universe of its own, where millions of molecules
interact in different functions that together create the identity of the cell. The
unique combination of molecules at work inside your cells determines, for
example, whether you are healthy or sick, what makes a kidney a kidney, and
what color your eyes are. The repertoire of molecules found in the cells is
pre-programmed in your genes, which consist of DNA. The composition of
DNA is stable and very similar between individuals, yet humans are,
biologically speaking, very different. This is because DNA is translated into
different RNA molecules and proteins. Proteins are the most important
building blocks of the cell and take part in many vital functions.
The flow of information from DNA to RNA and on to protein is called the
Central Dogma of molecular biology and was formulated in the 1950s.
Think of the cell as a small factory. Every time something is to be built in the
factory, a blueprint is fetched from a library. This library can be likened to
your DNA. The specific blueprint taken out of the library is then your RNA,
and the actual product manufactured from the blueprint is a protein, which
is what is then put to use. The complete set of RNA molecules in a cell is
called the transcriptome. Because the transcriptome can vary greatly over
time and under different conditions, it is an excellent source of information
about the state of cells. By studying the transcriptome we can obtain an
indirect measure of which proteins are present and better understand the
functions of cells.
To find out how much RNA has been produced from each gene, RNA
sequencing is used. In laboratories all over the world, advanced machines
work around the clock reading RNA sequences. The big challenge is no
longer to sequence all RNA molecules in a single experiment, but rather to
interpret the data coming out of the machine in a meaningful way. Data from
RNA sequencing experiments are often uploaded to public databases. This
promotes an open climate in which researchers can share and use each
other's data. For this to work as well as possible, it is important that we can
trust the content of the databases. In the first study of this thesis we
examined the reliability of public datasets. By studying public RNA data from
different studies, we found that these datasets were more similar to another
dataset of similar biological origin than to another dataset from the same
laboratory but of different biological origin. The results showed that public
RNA data are relatively reliable.
Now you may wonder why researchers are so interested in RNA when it is
actually the proteins that are the most important and identity-defining
molecules of the cells. Because RNA is easier and less time-consuming to
analyze on a large scale than proteins, it is convenient to use RNA as an
indicator of what the protein expression of cells looks like. The second study
of this thesis concerns the relationship between RNA and proteins. We
concluded that each gene has a fixed ratio between its concentrations of RNA
and protein, which is constant across different types of cells and tissues.
In the third study of the thesis we let the transcriptome guide us to
interesting proteins whose role is important in the development of cancer.
Here we studied the transcriptome in a model system consisting of cells at
four different steps of cancer development. In this way we could identify
genes with interesting expression profiles across four stages, from the normal
state to a fully developed and aggressive cancer state. Using antibodies we
could then analyze the corresponding proteins and detect changes in
expression at a very detailed level. We identified several potentially important
markers of cancer development and validated the results by studying the
same proteins in tumor samples. In the fourth study we examined how the
transcriptome changes at different degrees of cancer development as a result
of low oxygen supply. We subjected the same cancer model to cultivation at
3% oxygen and identified a number of interesting genes whose expression
was affected by the changed oxygen level, for example genes involved in cell
division and fat metabolism. This thesis demonstrates how analyses of RNA
can be used as a guide to identify relevant proteins to focus on further. To
increase our understanding of the complex processes going on in the
universe of the cell, continued studies of how proteins are expressed, change
and interact are needed.
Thesis defense
This thesis will be defended December 16th 2016, at 13.00, for the
degree of Doctor of Philosophy (PhD) in Biotechnology.
Location: Rockefellersalen, Nobels väg 11, Solna.
Respondent
Frida Danielsson graduated as a Master of Science and Engineering from KTH
School of Biotechnology in 2011 and pursued her PhD studies in the Cell
Profiling Group at the department of Proteomics and Nanobiotechnology.
Faculty Opponent
Christine Vogel, Assistant professor in Biology at the Department of Biology,
Center for Genomics and Systems Biology, New York University, NY
Evaluation committee
Carolina Wählby, Professor in quantitative microscopy at Uppsala University,
Uppsala
Carsten Daub, Assistant Professor in Bioinformatics, heading the Clinical
Transcriptomics research at the Department of Biosciences and Nutrition at
Karolinska Institute, Solna
Malin Andersson, Assistant professor at the Department of Pharmaceutical
Biosciences, Uppsala University, Uppsala
Chairman of the Thesis Defense
Stefan Ståhl, Professor in Molecular Biotechnology at KTH School of
Biotechnology, Stockholm
Main Supervisor
Emma Lundberg, Associate professor in Cell Biology Proteomics at KTH and
heading the Cell Profiling Group in The Human Protein Atlas, Stockholm
Co-supervisors
Mikael Huss, Associate professor in bioinformatics at KTH and academic
consulting data scientist/bioinformatician in the SciLifeLab long-term
bioinformatics support unit, Stockholm
Mathias Uhlén, Professor in Microbiology at KTH School of Biotechnology,
Stockholm
List of publications
This thesis includes the following publications, which are referred to in the
text by the corresponding Roman numerals (I-IV) and found at the end of the
thesis. All articles are reproduced with permission from their publishers.
I
Frida Danielsson, Tojo James, David Gomez-Cabrero, Mikael Huss
(2015).
Assessing the consistency of public human tissue RNA-seq datasets.
Briefings in Bioinformatics, 16(6), 2015, 941–949.
DOI: 10.1093/bib/bbv017
II
Fredrik Edfors, Frida Danielsson, Björn M Hallström, Lukas Käll,
Emma Lundberg, Fredrik Pontén, Björn Forsström and Mathias Uhlén
(2016).
Gene-specific correlation of RNA and protein levels in human cells and
tissues.
Molecular Systems Biology (2016) 12:883.
DOI: 10.15252/msb.20167144
III
Frida Danielsson, Marie Skogs, Mikael Huss, Elton Rexhepaj, Gillian
O’Hurley, Daniel Klevebring, Fredrik Pontén, Annica K. B. Gad, Mathias
Uhlén, Emma Lundberg (2013).
Majority of differentially expressed genes are down-regulated during
malignant transformation in a four-stage model.
Proceedings of the National Academy of Sciences 110(17), April 2013.
DOI: 10.1073/pnas.1216436110
IV
Frida Danielsson, Erik Fasterius, Kemal Sanli, Cheng Zhang, Adil
Mardinoglu, Cristina Al-Khalili, Mikael Huss, Mathias Uhlén, Emma
Lundberg (2016).
Transcriptome profiling of a cell line model for malignant
transformation in response to moderate hypoxia.
Manuscript
Respondent’s contributions to the included papers
I
Main responsibility for bioinformatics analyses and co-responsible
author during manuscript writing.
II
Main responsibility for cell cultivation, transcriptomics experiments
and related data analysis.
III
Main responsibility for experimental planning, laboratory
performance and data visualization. Co-responsible author during
manuscript writing.
IV
Main responsibility for experimental planning, performance, data
analysis and main responsible author during manuscript writing.
Abbreviations
AQUA     absolute quantification
CAGE     cap analysis of gene expression
cDNA     complementary DNA
CLSM     confocal laser scanning microscopy
CSF      cerebrospinal fluid
cyTOF    cytometry by time-of-flight
cycIF    cyclic immunofluorescence
DDA      data-dependent acquisition
DIA      data-independent acquisition
DNA      deoxyribonucleic acid
EB       exabyte
EBI      European Bioinformatics Institute
ELISA    enzyme-linked immunosorbent assay
ENCODE   Encyclopedia of DNA Elements
ESI      electrospray ionization
EST      expressed sequence tag
FPKM     fragments per kilobase of transcript per million reads mapped
FRAP     fluorescence recovery after photobleaching
FRET     Förster resonance energy transfer
GEO      Gene Expression Omnibus
GO       Gene Ontology
GTEx     Genotype-Tissue Expression
HPA      Human Protein Atlas
hTERT    telomerase reverse transcriptase
HTS      high-throughput sequencing
ICAT     isotope-coded affinity tag
IF       immunofluorescence
IHC      immunohistochemistry
IRES     internal ribosome entry site
iTRAQ    isobaric tags for relative and absolute quantitation
IWGAV    International Working Group for Antibody Validation
LC       liquid chromatography
MALDI    matrix-assisted laser desorption ionization
MRM      multiple reaction monitoring
mRNA     messenger RNA
MS       mass spectrometry
MS/MS    tandem mass spectrometry
MS1      first stage of MS/MS
MS2      second stage of MS/MS
NCBI     National Center for Biotechnology Information
NIH      National Institutes of Health
ORF      open reading frame
PB       petabyte
PCR      polymerase chain reaction
PFA      paraformaldehyde
Poly-A   poly-adenylated
PRIDE    The PRoteomics IDEntifications database
PRM      parallel reaction monitoring
PTM      post-translational modification
RIA      radioimmunoassay
RIN      RNA Integrity Number
RNA      ribonucleic acid
RNA-seq  RNA sequencing
RPF      ribosome protected fragments
SAM      sequence alignment map
SILAC    stable isotope labeling with amino acids in cell culture
SOP      standard operating procedures
SRA      sequence read archive
SRM      selected reaction monitoring
SV40     Simian virus 40
TMT      tandem mass tags
TPM      transcripts per million
Contents
Abstract
Popular science summary (Populärvetenskaplig sammanfattning)
Thesis defense
List of publications
Respondent’s contributions to the included papers
Abbreviations
1. The human cell
The macromolecules of the cell
Compartmentalization of biological processes
The central dogma of molecular biology
Evolution and cancer
Cell lines
Malignant transformation
The BJ model
2. The human transcriptome
RNA sequencing
RNA extraction, enrichment, library preparation and sequencing
Mapping and quantifying transcriptomes
Differential gene expression
Challenges and considerations
3. Studying the human proteome
MS-based proteomics
Quantitative mass spectrometry
Targeted mass spectrometry
Affinity-based proteomics
Immunofluorescence
Using RNA levels as an indirect measure of protein expression
Challenges and considerations
4. Data analysis and functional interpretation
Public data resources
Transcriptomics databases
Proteomics databases
The Human Protein Atlas
Gene Ontology
Functional enrichment analysis
Challenges and considerations
5. Aims of the thesis
6. Present investigation
Assessing the consistency among public RNA-seq data (Paper I)
Using transcriptomics to guide proteomics (Paper II)
Combining transcriptomics and protein profiling to study malignant
transformation (Paper III and IV)
Concluding observations and future perspectives
Acknowledgements
Bibliography
1. The human cell
Within our bodies are over 37 trillion cells, an enormous number that is far
beyond the number of stars in our galaxy [1,2]. The cell is often described as the
smallest living entity, the basic unit of all living organisms that harbors the
most fundamental properties of life; that is to grow, reproduce, sense and
respond to the environment, and to maintain a chemical composition that
favors its persistence.
The structure of the cell was first described in the 17th century by two fellows
of the Royal Society, Robert Hooke and Antonie van Leeuwenhoek, who both
developed optical instrumentation to study microorganisms. In 1665, Robert
Hooke published the book “Micrographia” with detailed drawings of various
matters based on his observations, using a first generation compound
microscope. In this groundbreaking book the term “cell” was coined, as he
described the units of a slice of dead cork observed under his lenses. The use
of this word stems from the Latin word “cella”, meaning small room. The
Dutch businessman and scientist Antonie van Leeuwenhoek perfected the
making of single lens microscopes, enabling him to observe bacteria, sperm
cells and blood cells [3,4]. Although Hooke and Leeuwenhoek share the honor of
developing the first optical instrumentation required for the observation of
cells, it took almost another two centuries until the current view of a cell as
the basic unit for a living organism emerged, the so-called cell theory. This
theory was officially postulated in 1838-1839 as a result of conclusions made
by the botanist Matthias Jacob Schleiden and zoologist Theodor Schwann [4,5].
The great fascination of the human cell has attracted substantial attention
among researchers ever since it was first discovered, continuously expanding
the understanding of the cellular landscape and its diversity. Considering its
original notion as a small room, current knowledge rather supports a view,
where opening the door to that small room will lead us to a bustling
metropolis, harboring millions of molecules that interact and orchestrate a
myriad of biological functions through complex networks. Cellular life is a
balancing act. In order for this highly complex system to function, a high
degree of regulation and control is required. To understand cellular function,
and in particular in the context of health and disease, detailed knowledge
about the cellular system is needed, from a holistic point of view.
The macromolecules of the cell
The most abundant chemical compound found within cells is water, taking up
over 70% of the total cell mass. The rest of the cell volume contains inorganic
ions and larger organic molecules. The organic molecules in the cell are
generally classified into lipids, carbohydrates, nucleic acids and proteins. The
latter three are commonly referred to as the macromolecules of the cell.
Macromolecules are large molecules (> 5 kDa) that are formed by the joining
(polymerization) of hundreds, or even thousands, of smaller precursors, and
make up about 80-90% of the cell's dry weight [6].
DNA, or deoxyribonucleic acid, is the basic unit of genetic material. It carries
information encoded in sequences of nucleotides that are made of a
five-carbon sugar (deoxyribose), a phosphate group and one of the four bases
adenine (A), cytosine (C), guanine (G) and thymine (T). The DNA molecule is
formed by the joining of two strands that are held together by hydrogen
bonding between the bases in a pairwise manner, where A always binds to T,
and C always binds to G. This double-stranded molecule is spatially arranged
into a helix-like structure that provides the robustness needed for its most
prominent function, to replicate and transfer genetic information during cell
division. The story of DNA began in the 19th century when Gregor Mendel
performed groundbreaking experiments on pea plants that revealed the basic
principles of hereditary transmission [7]. Around the same time, the Swiss
scientist Friedrich Miescher was examining leukocytes, in which he
discovered a precipitate of unknown character in the nuclei of the cells. He
concluded that this precipitate must be a “multi-based acid” and named it
Nuclein [8,9]. Although these early findings pioneered the research on DNA, they
are often overshadowed by the works of James Watson and Francis Crick,
who almost 100 years later solved the structure of the DNA molecule, which
laid the ground for a continuously evolving development of methodologies to
study genetic material [10]. In eukaryotic cells, most DNA is present in the nuclei
but a small part is also found in mitochondria. Packed around protein
complexes in the chromosomes, it coordinates the cellular distribution of
other functional molecules through the expression of genes. The term gene
was first used as early as 1909 when the work of Mendel gained new attention; it
was then referred to as a “discrete unit of heredity”. The definition of this
term has been widely debated over the last century, along with new advances
in the field of genomics, but a more recent definition states “a gene is a union
of genomic sequences encoding a coherent set of potentially overlapping
functional products” [11].
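The pairing rule described above (A with T, C with G) can be illustrated with a minimal Python sketch. The function names and the example sequence are hypothetical and chosen only for illustration; the reversal step assumes the standard convention that the two strands run in opposite directions, which the text above does not spell out.

```python
# Watson-Crick pairing as stated in the text: A pairs with T, C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base-by-base partner of each position in a DNA strand."""
    return "".join(PAIRING[base] for base in strand.upper())


def reverse_complement(strand):
    """Return the partner strand read in its own 5'->3' direction
    (assumes antiparallel strands)."""
    return complement(strand)[::-1]


example = "ATGCGT"  # hypothetical sequence, for illustration only
print(complement(example))          # TACGCA
print(reverse_complement(example))  # ACGCAT
```

Because each base has exactly one partner, either strand fully determines the other, which is the property that makes faithful replication possible.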
RNA, or ribonucleic acid, is the outcome of transcription, where regions of
DNA are copied into complementary RNA molecules called transcripts. RNA is
structurally related to DNA with a few exceptions. Like the DNA molecule,
RNA is also built up of four nucleotides, but with the major differences that
its five-carbon sugar is ribose and that it contains the base uracil (U) instead
of thymine (T). In addition, RNA appears in most cases as a single-stranded
molecule and displays a lower molecular weight than DNA. Elliot Volkin and
Lawrence Astrachan discovered the RNA molecule in 1956 and soon
thereafter, its intermediary role as a messenger between DNA and protein
was discovered by several independent researchers [12-15]. It was long believed
that RNA most of all serves as a messenger bridging between DNA and
protein, a view that has gradually changed with increased understanding of
the repertoire of different RNA subtypes that carry out various biological
functions. Over 60% of the cellular genome is transcribed into RNA. Out of the
total RNA, 85-90% is ribosomal RNA, 5% is mRNA that codes for
protein, and the rest is other non-coding RNA, which has received more
attention in recent years and has been shown to have important regulatory
functions [16-18]. In
higher eukaryotes, mRNA transcripts are processed into different isoforms
via the process of alternative splicing. Over 90% of all protein-coding genes
containing multiple exons are subject to alternative splicing, thereby
expanding both the cellular transcriptome and proteome tremendously,
without the need for additional genes [17,19].
Proteins were first described in the 18th century by researchers working with
plants, wheat gluten and albumin from blood and egg whites [20,21]. The Swedish
chemist Jöns Jacob Berzelius is acknowledged for coining the term Protein, by
suggesting the usage of the term in a letter to his fellow scientist Mulder who
then used the term in a publication from 1838 [22]. The word protein is of Greek
origin and means “of primary importance” or “of the first rank”, a name it
certainly deserves as almost all processes in the cell involve the activity and
interactions of proteins. Compared to other cellular components, proteins are
relatively large molecules, formed by sequences of amino acids that are held
together by peptide bonds. The sequence of amino acids is dictated by the
sequence of the corresponding nucleotides that undergo translation, and
human cells contain genetic recipes for 20 different amino acids. A sequence
of amino acids is called a peptide. These arrange into functional proteins by
folding into three-dimensional structures. After the process of translation,
when proteins are formed, they often undergo further modifications, called
post-translational modifications (PTMs). PTMs are commonly mediated by
enzymatic activity and have a significant impact on the protein’s functional
activity. They can occur at any time during a protein’s life cycle and two of the
most common PTMs are glycosylation and phosphorylation.
Compartmentalization of biological processes
In biology, compartmentalization of different functions is evident across
several scales. The spatial partitioning of biological functions results in a
hierarchy of specialized and robust systems, from individual organs in our
bodies, to organelles within the single cell [23]. At the cellular level,
compartmentalization is a fundamental property of the eukaryotic cell and
manifests itself through the organization into cellular organelles. The
molecular landscape of the cell is rich and complex. A cell is densely
populated by millions of protein molecules that interact in different ways to
carry out designated functions. Organization into specialized
membrane-bound organelles enables independent molecular interactions to occur
simultaneously, without crosstalk and under optimal chemical conditions. For
example, the processes of protein translation and degradation can occur in
parallel [24]. Moving deeper into the hierarchy of biological
compartmentalization, even organelles are subjected to
sub-compartmentalization into more refined structures, for example different
bodies and speckles found within the nucleus. Compartmentalization has also
been proposed as a cellular strategy for passive noise filtering, by enabling
molecular processes to occur in separate compartments from which noise, in
terms of irrelevant molecules, is prevented from entering [25].
Figure 1. A schematic illustration of the human cell and
selected subcellular compartments (labeled in the figure: the Golgi
apparatus, nucleus, microtubules, nucleolus, vesicles, cytoplasm,
mitochondria, actin filaments, endoplasmic reticulum and plasma
membrane). Illustrated by Annica Åberg.
The central dogma of molecular biology
One of the keystones of science in general is the Central Dogma of Molecular
biology; a schematic description of how genetic information is carried over
between molecules, from DNA to RNA in a process called transcription and
from RNA to protein in a process called translation. The central dogma was
first elaborated by Francis Crick in 1956 and soon thereafter published as
part of a Society for Experimental Biology symposium in 1958 [26,27].
Subsequent research on RNA cancer viruses led to the discovery of
reverse transcriptase, which enables synthesis of DNA from RNA [28,29]. These
findings led some researchers to question the accuracy of the central dogma,
which was soon thereafter commented on by Crick in an attempt to eradicate the
widespread misconceptions [30,31]. The Central Dogma is still often
misunderstood. In its original form, the Central Dogma states, “Once
information has passed into protein, it cannot get out again”, a take-home
message that, when properly understood, still holds.
Figure 2. The central dogma of molecular biology (figure labels: DNA,
replication, transcription, RNA, translation, protein).
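The two information transfers named by the central dogma, transcription and translation, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a model of the cellular machinery: the input is taken to be the coding strand, so transcription reduces to replacing T with U, and the codon table is a deliberately tiny subset of the standard genetic code. The function names and the example sequence are hypothetical.

```python
# A small subset of the standard genetic code (mRNA codons -> amino acids).
# The full table has 64 codons; only those used in the example are listed.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "GGC": "G",  # glycine
    "GUC": "V",  # valine
    "AAA": "K",  # lysine
    "UAA": "*",  # stop codon
}


def transcribe(coding_strand):
    """DNA -> RNA: the transcript equals the coding strand with U replacing T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """RNA -> protein: read non-overlapping codons until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGGCAAATAA"  # hypothetical coding-strand sequence
mrna = transcribe(dna)
print(mrna)             # AUGGGCAAAUAA
print(translate(mrna))  # MGK
```

Note that the mapping in `translate` is many-to-one (several codons encode the same amino acid in the full code), so the nucleotide sequence cannot be recovered from the protein sequence alone.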
Evolution and cancer
A fundamental theory of biology is evolution, the explanation for a set of
observations that was published by Charles Darwin in his groundbreaking
book On the Origin of Species in 1859 [32]. Simply put, the theory of evolution states
that gradual changes continuously occur in the genetic composition of
organisms, driven by a natural selection for advantageous characteristics and
increased reproduction. Evolution is typically thought of as something that
acts on whole populations over long stretches of time. However, evolutionary
processes are also evident within individual bodies, exemplified by the
process of malignant transformation when a single cell acquires cancerous
mutations that are passed on through replication. This can be considered a
type of microevolution acting within the lifetime of an individual, by selection
acting on cancerous mutations that provide growth advantages. The
microcosms of evolution that we call cancers have proven to have an extremely
harmful impact on modern society. According to the World Health
Organization, cancer is among the leading causes of death worldwide and the
annual incidence is expected to rise from 14 million to over 20 million cases
within the next two decades [33]. Cancer is a broad diagnosis consisting of more
than 200 different cancer types, each affecting a certain cell type and
pathway.
Cell lines
The fact that the majority of all living human cells are embedded in tissues
makes them difficult to attain for experimental purposes. The study of living
cells therefore requires systems where cells can be directly accessed and kept
under controlled conditions. A popular approach is the use of cell lines; these
are cells suitable for cultivation in vitro that serve as model systems for their
cell type or tissue of origin. Cell lines are often derived from tumor tissue
where they have already acquired the ability to replicate for an unlimited
number of cycles. Cell lines can also be of primary origin with a limited
lifespan, or of primary origin but further immortalized by the expression of
telomerase [34]. The use of cell lines has several advantages, for example the
possibility to obtain an almost unlimited amount of sample material, which
enables users to try out and optimize different experimental approaches
without wasting precious samples. Cell lines also provide consistent samples
of a pure population, which can be used to reproduce results across different
laboratories. In addition, cell lines are relatively cheap and can be stored in
freezers for long periods of time. Even though cell lines are tremendously
useful, they can only serve to approximate the characteristics of cells present
within the more complex in vivo environment. Long-term use of cell lines
bears the risk of genotypic drift that makes the cells diverge genetically from
their original source, by undergoing selection for cells with increased capacity
to proliferate under the given circumstances [35,36]. The first human cell line was
established in 1951 from cervical cancer cells and named HeLa after its donor
Henrietta Lacks. What Henrietta Lacks never knew was that over 20 tons of
her cells were to be cultivated in laboratories worldwide after her death. The
story of Henrietta Lacks, the HeLa cell line and its surrounding ethical issues
are discussed in the book “The immortal life of Henrietta Lacks” that was
published in 2010 [37].
Malignant transformation
Malignant transformation refers to the process when primary cells become
cancerous. This process involves changes in the genome that lead to
disruption of regulatory circuits that together result in acquisition of new
cellular characteristics. Malignant transformation is a complex molecular
process. Despite an almost incalculable body of literature, a complete
mechanistic understanding of this process still remains elusive.
In the year 2000 a review paper entitled The Hallmarks of Cancer was
published [38]. In this paper, the authors aimed to identify the common features
and disrupted pathways that are required to reach a cancerous phenotype. By
reviewing the literature of the latter half of the 20th century, detailed information
about malignant transformation was compiled into one organizing principle
for its underlying biology. In the original version of this paper, the hallmarks
of cancer encompassed six essential cellular alterations that collectively
dictate malignant transformation: Sustaining proliferative signaling, Evading
growth suppressors, Resisting cell death, Enabling replicative immortality,
Inducing angiogenesis and Activating invasion and metastasis.
Ten years after the Hallmarks of Cancer first came out, its sequel “The
Hallmarks of Cancer: Next Generation” was published [39]. This time, the
concept was complemented with two emerging hallmarks, Reprogramming
energy metabolism and Evading Immune destruction. In addition, two
enabling characteristics were defined: Tumor-promoting inflammation and
Genome instability and mutation, and the paper also covered discussion
about the role of the surrounding tumor microenvironment. The generalized
concept for the molecular biology behind malignant transformation provided
by these two papers has heavily dominated the field of cancer research ever
since.
Figure 3. The six original Hallmarks of cancer. Figure adapted from Hanahan et al. [38].
The BJ model
Malignant transformation is a multistep process. To reveal the underlying
mechanisms behind this complex progression, a model system enabling the
study of defined steps of transformation is advantageous. Traditionally,
primary murine cells have been genetically modified to undergo tumorigenic
conversion and serve as models for malignant transformation. Compared to
murine cells, human cells are more challenging to transform, mainly due to
differences in telomere biology. In contrast to murine cells, human primary
cells progressively lose telomeric DNA when passaged during in vitro cultivation,
leading to cellular senescence that is characterized by a drastic decline in
proliferation. Several attempts to transform human cells have pointed to the lack of
telomerase expression as a significant barrier to reaching malignancy. In 1999,
Weinberg and co-workers created a model system consisting of four human
fibroblast cell lines by stepwise additions of genetic alterations. Following
initial expression of the catalytic subunit of telomerase reverse transcriptase
(hTERT) in primary cells, the cells were further transformed by sequential
introduction of the Simian Virus 40 (SV40) Large-T oncogene and oncogenic
H-Ras. With this combination of genetic alterations, the cells gained a fully
transformed phenotype, which was also confirmed in vivo. By expression of
hTERT, cells become immortal and are able to replicate for an indefinite
number of cycles [40]. The second alteration introduced in this model is the
Large-T antigen, expressed by the SV40 virus. Large-T contributes to the
transformation of cells by inactivation of the well-studied tumor suppressors
p53 and Rb [41]. The final alteration in this model system is introduction of a
mutated version of H-Ras (G12V), in which one glycine is exchanged for valine,
which keeps this GTPase constantly in its active form. Ever since the creation
of the BJ model for malignant transformation, the same combination of
hTERT, Large-T and oncogenic H-Ras has been used by numerous other
researchers and proved to successfully transform a variety of primary human
cell types [42-46].
8 2.Thehumantranscriptome
The word transcriptome refers to the full set of transcribed RNA molecules
within a cell, tissue or organism at a given time point, but it can also
refer to the full set of transcripts belonging to a specific sub-population of
RNA, for example protein-coding mRNA, regulatory RNA, ribosomal RNA,
small RNA or tRNA. In contrast to the genome, which is characterized by its
stability across different cells within an organism, the transcriptome varies
greatly and displays a high degree of sensitivity to both internal changes such
as developmental stage and circadian rhythm, and external changes such as
stress and changes in the environment. The plastic nature of the
transcriptome has made it appealing to study in the context of disease, owing
to its potential to serve as a proxy for cellular identity and diversity.
An apparent strategy to reveal the active parts of the genome is to explore
which genes are transcribed, in what quantity, and how the levels of RNA
transcripts vary between different populations, over time and across different
physiological conditions. In the past, the study of transcriptomes was mainly
performed using hybridization-based methods where RNA samples were
added to pre-designed cDNA probes for complementary hybridization.
Major limitations of these methods are their dependence on an annotated genome,
problems with high background signals due to cross-hybridization, a limited
dynamic range due to saturation of signals and complicated
normalization methods47. Owing to a successful era of development within
the field of sequencing technology, High-Throughput Sequencing (HTS) based
approaches to the analysis of the transcriptome emerged in the 1990s. These
are most commonly used today. The essence of RNA sequencing (RNA-seq) is
to count reads generated from different regions in the genome in order to
reveal the transcriptional abundance of each analyzed genomic
region.
RNA sequencing
Mining the literature, the term RNA sequencing appears for the first time in
2008 in two, to date, well-cited papers48,49. However, the underlying concept
had already been published before without using the term RNA sequencing,
for example as sequencing of expressed sequence tags (EST), which was a popular
approach to analyze transcripts by sequencing of cDNA50. In these early
innovative papers, researchers anticipated that sequencing-based
approaches would subsequently replace the previous methods as the state-of-the-art
technology for the study of transcriptomes51-54. Today, sequencing
experiments are often performed at specialized service facilities, like the
National Genomics Infrastructure at SciLifeLab in Stockholm. In this way,
researchers can get access to a competitive infrastructure equipped with the
latest technology, accompanied by technical support from trained specialists.
The standard workflow of an RNA-seq experiment can be divided into two
parts. The first is a laboratory part covering RNA extraction, enrichment, library
preparation and the actual sequencing. The second is the bioinformatics
part, which depends on computers to process, analyze and interpret the
output from the sequencing machine. This typically includes quality controls,
identification of origin within a reference genome (or assembly into contigs),
quantification of transcript abundance and often analysis of differential
expression across different conditions. De novo transcriptome assembly is an
important procedure, especially in the analysis of organisms for which
a reference genome annotation is lacking; however, it is beyond the scope of
this thesis.
The most commonly used commercial platforms for RNA-seq are based on the
conversion of RNA into complementary DNA (cDNA) through reverse transcription
before sequencing takes place. The major reason for introducing this
additional step, instead of performing direct sequencing of RNA, is to retain
stability of the molecules undergoing sequencing. RNases, i.e. enzymes that
degrade RNA, are ubiquitous in nature, which makes RNA highly unstable
outside the intracellular environment. In addition, DNases are easier to
inactivate than RNases, Polymerase Chain Reaction (PCR)
amplification is more suitable for DNA than RNA, and RNA-seq relies heavily on
HTS protocols that were initially developed for DNA sequencing. Despite this
trend, several recent studies claim direct sequencing of RNA to be more
suitable and efficient55.
RNA extraction, enrichment, library preparation and
sequencing
The initial step of the RNA-seq experiment is to extract RNA from cells,
whether the biological sample being analyzed is a tissue sample or a pellet
derived from in vitro cultivation. As the majority of total RNA within cells is
ribosomal RNA (rRNA), which is most often considered uninformative, the next step is
to enrich for the RNA subpopulation of interest. For studies of the protein-coding
parts of the genome, mRNA is typically enriched by taking advantage
of the polyadenylated (poly-A) tails at the 3' end, which can be captured by the
addition of poly-T oligomers. An alternative method is to use a negative
enrichment approach, where rRNA is instead depleted using rRNA-specific
probes; however, that will leave more than just mRNA in the sample. After
purification of the desired type of RNA, purity and intactness are usually
measured. This is typically performed with an electrophoresis-based method
where the length distribution of the RNA transcripts is compared to the
ribosomal complexes, resulting in a calculated RNA Integrity Number (RIN
value) where 1 represents complete degradation and 10 represents full
intactness56. When RNA samples of desired concentration and quality are
ready, it is time to build a sequencing library, which is the collection of cDNA
molecules that are to be sequenced.
Building a sequencing library conventionally starts with fragmentation of the
RNA into appropriate size prior to cDNA synthesis. To ensure that the RNA
fragments are intact without overhangs, end-repair is usually performed.
Reverse transcriptase, dNTPs and random oligonucleotide primers are first added to
build an initial single-stranded cDNA library (first strand synthesis) that is
further converted into double-stranded cDNA by the addition of DNA
polymerase. To enable parallel sequencing of more than one sample
without the loss of information about sample origin, indexing primers are
added to the ends of the cDNA. Before amplification with PCR, adaptors are
also added to enable hybridization to surface-bound primers on the flow cells
where amplification takes place (Figure 4). After library preparation it is time
for the real showpiece of the experiment, the actual sequencing, when the
order of the bases is determined. As the work covered in this thesis did not
involve any laboratory work related to the process of RNA-seq, except for the
preparation of RNA samples and measurements of intactness, more in-depth
content covering aspects of the actual sequencing experiment is left
out and focus is directed to the post-sequencing part of the analyses, covered
in the next section.
Figure 4. The basic workflow of RNA-seq mRNA library preparation. mRNA is first
isolated, followed by fragmentation, addition of primers, first strand synthesis,
second strand synthesis, adaptor ligation and PCR amplification.
Mapping and quantifying transcriptomes
The bioinformatics part of the RNA-seq experiment begins at the level of raw
sequence data in the form of a FASTQ file. The FASTQ file contains
information for all sequenced reads organized into sections covering a
sequence identifier, the nucleotide sequence, a quality score identifier and a
quality score assigned to each base. With the FASTQ file in hand, an initial
step is typically to take advantage of the quality scores to assess the overall
quality of the sequencing run, followed by filtering to remove low-quality reads.
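The four-part FASTQ record and the quality filtering step described above can be illustrated with a minimal sketch. It assumes four lines per record and Phred+33 encoding of the quality string; the function names, the example reads and the threshold of 20 are hypothetical choices for illustration and do not correspond to any particular quality control program.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline().strip()
        if not header:
            return  # end of file (or a blank line) ends the iteration
        sequence = handle.readline().strip()
        handle.readline()  # the '+' line, i.e. the quality score identifier
        quality = handle.readline().strip()
        yield header[1:], sequence, quality  # drop the leading '@'


def mean_phred(quality, offset=33):
    """Mean Phred score of one read, assuming Phred+33 ASCII encoding."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(handle, min_mean_quality=20):
    """Keep reads whose mean Phred score reaches the (hypothetical) threshold."""
    return [
        (name, seq)
        for name, seq, qual in read_fastq(handle)
        if mean_phred(qual) >= min_mean_quality
    ]


# Two made-up reads: 'I' encodes Phred 40, '!' encodes Phred 0.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n"
kept = filter_reads(StringIO(EXAMPLE))
print(kept)
```

Filtering on the mean score is only one possible criterion; trimming low-quality ends of reads is another common option that is not shown here.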
In the next phase of the analysis, reads are aligned to a reference genome in
order to identify their origin of transcription. Reference genomes are
commonly stored in a text-based file format called FASTA and can easily be
downloaded online from the ENSEMBL consortium or the UCSC genome
browser57,58. Since the beginning of the RNA-seq era, a widely used program
for alignment of RNA-seq reads has been TopHat, which is part of an RNA-seq
pipeline named Tuxedo that offers a collection of open-source tools for
comprehensive RNA-seq data analysis59. TopHat makes use of the short read
aligner Bowtie, initially developed for alignment of DNA sequences, which
first indexes the reference genome before aligning the reads60. After initial
alignment using Bowtie, TopHat chops all unmapped reads into smaller pieces
that are realigned using the same principle. At present TopHat is largely
superseded by the faster and more recently released program HISAT2,
developed by the same authors61. Another popular tool for alignment of RNA-seq
reads is STAR, which uses a fundamentally different approach62. Due to
time-efficient alignment that requires fewer cluster computing resources,
HISAT2 and STAR have gained wide popularity in the RNA-seq community.
Read alignment generates output in the form of a Sequence Alignment/Map
(SAM) file, which contains relevant information about the alignments in a
TAB-delimited text format. As SAM files are relatively large, they are often
compressed into a binary format called BAM that requires less space.
As stated earlier, the ultimate goal of RNA-seq is to quantify expression levels
by counting the number of sequenced reads that map back to a defined region
of interest, and report this in terms that are as closely proportional to the
relative molar RNA concentrations as possible. To perform relative RNA
abundance estimation, at least two factors must be considered and corrected
for. The first is the sequencing depth, that is the total number of reads in a sample
that are mapped to any region within the genome. The longer the sequencing
process continues, the more reads will be generated, resulting in increased
read counts. Secondly, read counts are biased by the length of the transcripts.
As longer transcripts produce more reads, they display an increased
probability of having reads mapping back to them. One of the simplest and
most widely used normalization strategies, introduced early in the RNA-seq
era, is the FPKM unit, which stands for “Fragments Per Kilobase of transcript
per Million reads mapped”. If you sequence one million fragments, the FPKM
value for a certain feature is the expected number of fragments identified per
thousand bases in that feature. Another popular normalization approach,
suggested to eliminate bias inherent in the FPKM measure, is to report
relative RNA abundance in Transcripts Per Million (TPM)63. The TPM unit is
similar to FPKM in the sense that it also normalizes for target length and read
depth. The major difference between these normalization methods is that
TPM provides a measurement of the proportion of transcripts associated with a
certain feature in the pool of RNA. Since TPM normalizes for the difference in
transcript composition by simply providing a fraction of the total expression,
it is suggested to be better suited for comparison across samples, as the sum
of all TPM values will be the same across samples64. FPKM values can easily
be converted to TPM through simple calculations, and vice versa65.
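The two normalization units and the conversion between them can be written out as a short sketch. The gene names, fragment counts and lengths below are made up for illustration, and the functions use plain transcript lengths rather than the effective lengths that dedicated quantification programs may apply.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total_fragments = sum(counts.values())
    return {
        gene: counts[gene] * 1e9 / (lengths_bp[gene] * total_fragments)
        for gene in counts
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rates.values())
    return {gene: rates[gene] / total_rate * 1e6 for gene in counts}


def fpkm_to_tpm(fpkm_values):
    """Convert FPKM to TPM by rescaling so that the values sum to one million."""
    total = sum(fpkm_values.values())
    return {gene: value / total * 1e6 for gene, value in fpkm_values.items()}


# Hypothetical fragment counts and transcript lengths (base pairs).
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 3000}

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
converted = fpkm_to_tpm(fpkm_values)

for gene in counts:
    print(gene, round(fpkm_values[gene], 1), round(tpm_values[gene], 1))
```

In this sketch the TPM values always sum to one million, whereas the sum of the FPKM values depends on the length distribution of the expressed transcripts.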
FPKM and TPM values are widely used as relative measurements of RNA
abundance. However, comparisons of FPKM or TPM values for the same
transcript or gene across different samples are problematic. This stems from
the fact that the read depth, which both these normalization methods account
for, is dependent on the sample’s overall read composition. Imagine a
transcript X that is present in equal amounts in two samples A and B. Given
that all other transcripts are equally abundant, if another transcript Y is twice
as abundant in sample A compared to B, sample A will contain more reads in
total. The FPKM or TPM value for transcript X will therefore be lower in sample
A than in sample B.
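The thought experiment with transcripts X and Y can be made concrete with a few made-up numbers. The sketch below assumes, for simplicity, that all transcripts have the same length, so that TPM reduces to the fraction of fragments multiplied by one million; the counts are hypothetical.

```python
def tpm_equal_length(counts):
    """TPM when all transcripts share one length: the fraction of fragments times 1e6."""
    total = sum(counts.values())
    return {name: count / total * 1e6 for name, count in counts.items()}


# X is equally abundant in both samples; Y is twice as abundant in sample A.
sample_a = {"X": 100, "Y": 200, "other": 800}
sample_b = {"X": 100, "Y": 100, "other": 800}

tpm_a = tpm_equal_length(sample_a)
tpm_b = tpm_equal_length(sample_b)

print(round(tpm_a["X"], 1), round(tpm_b["X"], 1))
```

Although the amount of X is identical in the two samples, its normalized value is lower in sample A because Y takes up a larger share of the total.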
In 2013-2014, a new and fundamentally different concept for abundance
estimation of sequence reads emerged. This concept challenged the widely
accepted idea that precise reference-guided alignments (at least as previously
interpreted) are necessary for quantification of RNA-seq reads. The concept
of a read alignment was now redefined to only include information about
which target sequence a read originates from, without further specification
of exact position at the base level. This new concept was named “lightweight
mapping” and was first introduced in the program Sailfish, which was further
developed into Salmon66. Similar ideas on alignment-free quantification have
also been explored by others, for example “quasi-mapping” used in RapMap
and “pseudo-alignment” used in Kallisto67. In Kallisto, claimed by its creators
to be “near optimal in speed and accuracy”, a new method for rapid string
matching was also introduced, by indexing the reference transcriptome with a de
Bruijn graph to which k-mers are matched. Several programs have also been
developed that count aligned reads per feature, for example
featureCounts and HTSeq-count68,69. Read count values generated by such
programs can be used as input in several programs for analysis of differential
expression, covered in the next section.
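The idea of assigning a read to compatible target sequences without computing a base-level alignment can be sketched as follows. Note that this is a deliberately simplified toy: it uses a plain k-mer lookup table instead of a de Bruijn graph, skips k-mers that are absent from the index, and is not the algorithm implemented in Kallisto, Sailfish, Salmon or RapMap. The transcript sequences and names are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, sequence in transcripts.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all indexed k-mers found in the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if hits is None:
            continue  # k-mer not in the index (e.g. a sequencing error): skipped here
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


# Two made-up transcripts that share their first six bases.
transcripts = {"t1": "ACGTACGGTC", "t2": "ACGTACCCTA"}
index = build_kmer_index(transcripts, k=4)

print(compatible_transcripts("GTACGG", index, k=4))  # read spanning the t1-specific part
print(compatible_transcripts("ACGTA", index, k=4))   # read from the shared part
```

The result for each read is only a set of compatible transcripts; turning such sets into abundance estimates requires a further statistical step that is not shown.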
Differential gene expression
RNA-seq experiments typically aim to reveal changes in gene expression by
measuring the relative transcriptional abundance across biologically
interesting conditions. While normalized expression levels such as FPKM or
TPM values are intended to report relative abundance within samples, more
sophisticated statistical methods are generally needed for comparisons across
samples. In such analyses, statistical testing is performed to reveal whether
an observed difference in RNA abundance is significant, that is, greater than
one would expect from technical and biological variation alone.
Numerous tools have been developed for the analysis of differential
expression. These tools typically make assumptions about the underlying
distribution of read counts, which are then used to model the data in order to
test for differential expression. As RNA-seq read counts are non-negative integer
values, the normal distribution is not applicable. The Poisson distribution has
previously been proposed as a reasonable model for RNA-seq data; however,
as a single-parameter distribution where the mean is equal to the variance, it
has been shown to be too restrictive, predicting a smaller variation than is
actually present due to biological variation. The negative binomial
distribution, first used in the program edgeR, is the most widely used
distribution today, as it allows for greater variability between biological
replicates than the Poisson distribution can account for.
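The difference between the two count models can be illustrated through their mean-variance relationships. The sketch below uses the common parameterization in which the negative binomial variance equals the mean plus a dispersion term times the squared mean; the replicate counts are hypothetical, and the moment-based dispersion estimate is a simple illustration rather than the estimation procedure used by edgeR or DESeq.

```python
from statistics import mean, variance


def poisson_variance(mu):
    """Under a Poisson model the variance equals the mean."""
    return mu


def negative_binomial_variance(mu, dispersion):
    """Negative binomial variance: mean plus dispersion times squared mean."""
    return mu + dispersion * mu ** 2


# Hypothetical read counts for one gene in five biological replicates.
replicates = [80, 95, 120, 150, 105]

sample_mean = mean(replicates)
sample_variance = variance(replicates)  # unbiased sample variance

# Simple method-of-moments estimate of the dispersion (illustration only).
dispersion = max((sample_variance - sample_mean) / sample_mean ** 2, 0.0)

print("mean:", sample_mean)
print("observed variance:", sample_variance)
print("Poisson variance:", poisson_variance(sample_mean))
print("estimated dispersion:", round(dispersion, 4))
```

In this made-up example the observed variance is several times larger than the mean, which is the kind of overdispersion that the extra parameter of the negative binomial model is meant to absorb.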
In the contemporary field of RNA-seq, two different traditions have evolved in
parallel and no consensus seems to have been reached regarding whether RNA
abundance estimation and differential expression analysis should be pursued
at the level of isoforms or at the level of genes. On one side are developers and
dedicated users of programs such as edgeR and DESeq. These are programs
that operate on the gene level and use raw read counts as input. Assuming
that technical replicates are unnecessary as their variation follows a Poisson distribution,
they focus their algorithms on modeling the biological variability. The other
school of thought, mainly fronted by the professor and scientific blogger Lior
Pachter, claims that abundance estimation and analysis of differential
expression should rather be performed at the resolution of isoforms, as realized
in programs such as Cuffdiff2, Salmon, Sailfish, RSEM, Kallisto and sleuth70-74.
To bridge the gap between these two schools, a program called tximport was
recently released. By default, tximport imports abundance estimates, counts
and feature lengths at the transcript level and outputs the corresponding variables
at the gene level75.
In the work covered by this thesis, the tools DESeq and edgeR have been
used76-78. The output from these programs consists of a table with results
from the statistical testing assigned to each gene product analyzed. The final
step is to apply a cutoff for the p-value, after multiple testing correction, to
obtain a list of genes or transcripts that are significantly differentially
expressed. A cutoff for fold change is commonly also applied.
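The final filtering step can be sketched as follows. The text does not name a specific correction method; the Benjamini-Hochberg procedure is assumed here as one common choice. The gene names, p-values, log2 fold changes and both cutoffs are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest to enforce monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def significant_genes(results, alpha=0.05, min_abs_log2_fc=1.0):
    """Select genes passing both the adjusted p-value and the fold change cutoff."""
    names = list(results)
    adjusted = benjamini_hochberg([results[name][0] for name in names])
    return [
        name
        for name, q in zip(names, adjusted)
        if q <= alpha and abs(results[name][1]) >= min_abs_log2_fc
    ]


# Hypothetical test results: gene -> (raw p-value, log2 fold change).
results = {
    "g1": (0.001, 2.5),
    "g2": (0.02, 0.3),
    "g3": (0.03, -1.8),
    "g4": (0.5, 1.2),
}

print(significant_genes(results))
```

Here g2 passes the adjusted p-value cutoff but is removed by the fold change cutoff, while g4 shows a large fold change without statistical support.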
Challenges and considerations
The RNA-seq experiment is a multi-step process that involves several
laboratory steps where bias can be introduced, such as extraction,
fragmentation, amplification and sequencing. The fact that RNA-seq
experiments are often carried out at specialized facilities serves to minimize
these problems. With consistent handling of samples through development of
Standard Operating Procedures (SOPs), data quality can be optimized.
Sequencing facilities commonly advise their customers in choosing the most
suitable methods and reagents, and make sure that samples are of proper
quality and quantity before the sequencing experiment starts. As previously
described, RNA is sensitive to degradation and for RNA-seq purposes, it is
important to maintain a clean working environment where the risk of
degradation is minimized. When RNA-seq is employed on mRNA derived from
millions of cells in a sample, the output data is limited to the estimation of
mean behaviors. Cell-to-cell variation due to differences in cell cycle stage,
differentiation or any other variation will be averaged out and it is impossible
to resolve the intra-population distribution of transcripts derived from a
certain genomic region. To reach beyond this mean-based characterization,
sequencing the transcriptome of single cells has gained more attention in
recent years79.
To improve accuracy of results, replication is important. This enables
estimation of the variability within conditions. In RNA-seq, technical
replicates are often claimed to be unnecessary, since the RNA-seq technology
has proven to be robust enough to make technical variation negligible.
Biological replicates are given higher priority. The use of at least three biological
replicates seems to be somewhat of a gold standard; for cell lines, however,
two replicates are often enough, as indicated by genome-wide correlations of
over 0.99. When working with cell lines, an interesting question is what to
consider a biological replicate. A common strategy is to use early splits,
two subpopulations from the same original population that are grown in parallel for
a number of population doublings.
As mentioned, there is an ongoing debate about whether differential expression
analysis should be performed at the gene level or at the level of individual
transcripts (splice isoforms). Performing such analyses at the transcript level
raises a few questions regarding what scenario should be viewed as
differential expression. Are two transcripts differentially expressed even
though their corresponding gene shows a zero net difference between the
compared samples? Are genes differentially expressed between samples if at
least one corresponding transcript is differentially expressed? Answering
such questions requires some degree of biological insight and relates to a
common issue in today's more computer-dependent field of biology, where
people are often forced into a role of being both biologists and computer
scientists at once.
With today’s use of high-throughput technologies, the greatest challenge of
biology does not lie in the generation of data, but rather in their interpretation.
In order to understand biological processes by the use of advanced
data analysis, statistical principles must be well tested and evaluated to
obtain the most robust and efficient algorithms possible. In the field of
transcriptomics, an explosion in the number of different strategies and tools
characterizes recent years. A great challenge for researchers trained in
biology is to guide themselves through this jungle of different tools. When a
new method is published, authors often claim superiority over previous
methods, which can lead to somewhat contradictory information regarding
which method to choose. With the lack of standard pipelines for analysis,
developers are free to choose validation methods in favor of their own
tools, a scenario described as the “self-assessment trap”80.
RNA-seq is a powerful tool for determining which genes are expressed and to
what extent across samples, information that can be used to predict and
model cellular behavior. It is important though to understand that this
technique only captures an early step in a long sequence of regulatory events
that collectively dictate cellular function and phenotype. Many functions
depend on proteins that are activated only in the presence of certain
interaction partners, and PTMs and subcellular localization are also
important factors to consider when studying cellular behavior. By studying
proteins, we can confirm what has been seen on the transcriptome level, or
provide additional, more detailed, information about cellular function.
3. Studying the human proteome
The full set of proteins at a certain time point within a defined system, such as
an organism, an organ, a cell type or a cellular compartment, is called the
proteome. The word “proteome” has been around since the 1990s, first
used by the Australian student Marc Wilkins and heavily inspired by the
terms “genome” and “genomics”81. Analysis of the proteome can be driven by
either a protein-centric approach or a gene-centric approach82. Protein-centric
studies are exploratory and aim to detect all proteins that are
present in a sample mixture, given the limitations of the technology being
used. In proteome analyses with a gene-centric approach, genomic
annotations are used to guide the exploration of proteins.
Alongside the decline in the number of predicted protein-coding genes seen in
recent years, knowledge about the structural diversity of the human
proteome continues to grow. The complexity of the human proteomic
landscape extends well beyond just the number of distinct corresponding
genes and rather manifests itself through tremendous protein diversity.
Genetic variation, somatic recombination, alternative mRNA splicing and
PTMs are important factors that contribute to the presence of protein
isoforms. Rapid technology development has allowed for systematic
exploration of the human proteome in various diverse systems, such as
different cell lines and tissues. One interesting finding from these studies is
that the collection of detected proteome constituents is surprisingly similar
across diverse biological systems83-88. The ubiquitous protein expression
observed in these studies points to the importance of abundance and
organization, and not only the absence or presence of proteins, in dictating
cellular identity and function. In the past, measurements of protein
abundance have primarily relied on methods using affinity reagents, such as
quantitative western blot and enzyme-linked immunosorbent assay
(ELISA)89,90. In the 1970s, two-dimensional gel electrophoresis was
developed and provided new opportunities to study up to thousands of
proteins in a single experiment, preferably proteins of relatively high
abundance91. In the late 1990s, technological advances in the field of mass
spectrometry (MS) led to a fundamental change in experimental practice, and
since then quantitative proteome analyses have been dominated by MS-based
approaches.
Even though the field of proteomics is still lagging behind transcriptomics in
terms of coverage, large-scale systematic exploration of the human proteome
is today possible, detecting thousands of proteins in a single run. In 2014, two
drafts of the human proteome were published, based on MS experiments in
combination with crowdsourcing of data. The two studies reported MS-based
evidence for 84% and 92% of the annotated proteome92,93. These
two publications were followed by a debate that led to reanalysis of the results using other
principles for false discovery rate estimation, generating a somewhat lower
coverage94-96. In 2015, the first draft of the human tissue-based proteome was
published, using antibodies targeting 90% of the protein-coding genome97.
MS-based proteomics
MS enables accurate and high-throughput identification and quantification of
proteins from complex mixtures. The role of MS within the field of proteomics
began in the late 1980s when two methods were developed that made it
possible to ionize proteins and peptides under conditions that were mild
enough to keep the molecules intact, Electrospray Ionization (ESI) and
Matrix-Assisted Laser Desorption Ionization (MALDI)98,99. The success of
these two ionization methods paved the way for rapid technology
development and the proteomics field has ever since been heavily dominated
by technologies that are based on various MS principles and applications.
In simple terms, the technique is based on ionization of protein fragments
that are directed into a high vacuum system called a mass analyzer, where
they become separated according to their mass-to-charge ratio (m/z) before
detection. This generates a mass spectrum in which relative ion intensity is
plotted against m/z. MS experiments are typically performed with a bottom-up
approach (also known as shotgun proteomics) involving sequence-specific
proteolytic digestion of proteins into multiple peptides prior to the analysis.
The analysis of peptides instead of intact proteins is advantageous since
peptides are easier to ionize than full-length proteins. It also enables
ionization of insoluble proteins that would otherwise be challenging to
analyze with MS, and identification of proteins with modifications that would
make identification by known mass difficult100. Following digestion, peptides
are usually separated based on chemical characteristics on a chromatographic
device directly coupled to the MS instrumentation. In this way, peptide ions
can be produced within a defined range at a time, which increases the
sensitivity. The standard MS instrumentation can be divided into three parts:
an ionization source, a mass analyzer and a detector101. Several different mass
analyzers are used today, such as quadrupole, time-of-flight (TOF), Orbitrap and quadrupole
ion traps100. A common approach is to combine mass spectrometers into a
tandem system equipped with a mass filter102. In such a setup, peptide ions
that enter the mass spectrometer together are first separated in one of the
mass analyzers (MS1) to generate a precursor ion spectrum that represents
different ionized peptides. A few individual peptide ions are then selected to
undergo further fragmentation by collision with an inert gas, such as argon,
helium or nitrogen. In a second mass analyzer, the fragment ions (often called
product ions) are separated according to their mass-to-charge ratio (MS2), to yield fragment
ion spectra (Figure 5). This procedure is repeated throughout the
chromatographic run in a defined set of cycles100. Today, identification of
peptides is usually performed at the MS2 level, by search engines such as
SEQUEST, Andromeda, Mascot and X!Tandem that compare peaks in the
experimental data to theoretical spectra derived from protein sequence
databases103. By combining information about the precursor peptide mass
(revealed in MS1) with information from the matched fragment ion spectra
generated in MS2, identification of peptides can be achieved.
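The relationship between peptide mass, charge state and the m/z value observed for a precursor ion in MS1 can be illustrated with a short calculation. The sketch assumes unmodified peptides, monoisotopic residue masses and protonation as the only source of charge; the example sequence is an arbitrary test peptide and is not taken from the work described here.

```python
# Monoisotopic masses (Da) of the 20 standard amino acid residues.
MONOISOTOPIC_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O added for the free N- and C-termini
PROTON = 1.007276   # mass of a proton


def peptide_mass(sequence):
    """Monoisotopic mass of an unmodified, neutral peptide."""
    return sum(MONOISOTOPIC_RESIDUE_MASS[aa] for aa in sequence) + WATER


def mz(sequence, charge):
    """m/z of the peptide carrying `charge` extra protons."""
    return (peptide_mass(sequence) + charge * PROTON) / charge


for z in (1, 2, 3):
    print(z, round(mz("PEPTIDE", z), 4))
```

The same peptide thus appears at different positions in the spectrum depending on its charge state, which is why the charge has to be determined before a precursor mass can be used in a database search.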
Figure 5. Overview of a typical MS/MS setup. The workflow includes ionization,
peptide ion separation, fragmentation, fragment ion separation and detection.
Quantitative mass spectrometry
As the signal intensities of peptide ions are not directly proportional to
abundance, MS analysis is not inherently quantitative. Proteomic research
commonly strives to obtain quantitative information about protein expression
and several strategies for quantitative MS have emerged throughout the last
decade. Quantitative MS can be performed either in a relative manner, by
comparing protein abundance between two or more samples, or in an absolute
manner to yield absolute protein copy numbers per cell or sample, commonly
achieved by the addition of an internal standard of known concentration.
Both relative and absolute quantification can also be performed with label-free
strategies, by analysis of precursor signal intensities or by spectral
counting.
A popular strategy in quantitative MS is labeling of proteins or peptides with
heavy isotopes. This produces isotopologues, molecules that are identical
except for differences in isotope composition. Due to similar chemical
properties, they co-elute during liquid chromatography (LC) and enter the MS
instrumentation together, but due to different isotope composition they can
be differentiated in the MS analysis, identified via a mass shift in the spectra.
To minimize bias that can be introduced at several steps during the
subsequent MS experiment, early mixing of heavy and light samples is
desirable. A number of different labeling strategies for protein quantification
have been developed104.
A popular method for relative quantification of proteins is metabolic labeling
in vivo with Stable Isotope Labeling by Amino acids in Cell culture (SILAC)105. In SILAC, heavy
arginine and lysine are added to the cell cultivation medium and metabolically
incorporated into the cell's proteome by the cell's own machinery for protein
synthesis. Subsequently, heavy and light cell lysates are mixed, followed by
trypsin digestion and LC-MS/MS analysis. During trypsin digestion, proteins
are cleaved at the C-terminal side of arginine and lysine residues, so that essentially all
cleaved peptides contain at least one heavy amino acid. As heavy and light
samples are combined before sample preparation begins, the level of
laboratory bias in SILAC is relatively low.
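Why tryptic peptides from a SILAC experiment carry a heavy label can be illustrated with a simple in silico digestion. The sketch makes several assumptions that are not stated in the text: cleavage occurs after every lysine and arginine with no missed cleavages and no exception for a following proline, and the heavy amino acids are taken to be 13C6 15N2-lysine and 13C6 15N4-arginine. The protein sequence is made up.

```python
# Assumed mass shifts (Da) of the heavy amino acids relative to the light forms.
HEAVY_SHIFT = {"K": 8.014199, "R": 10.008269}


def tryptic_digest(protein):
    """Cleave after every K or R (simplified: no proline rule, no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide of the protein
    return peptides


def silac_mass_shift(peptide):
    """Mass difference between the heavy and light form of a peptide."""
    return sum(HEAVY_SHIFT.get(residue, 0.0) for residue in peptide)


protein = "AGLKTESRVEPKGW"  # hypothetical sequence
for peptide in tryptic_digest(protein):
    print(peptide, round(silac_mass_shift(peptide), 4))
```

Every peptide ending in lysine or arginine shows a mass shift, whereas the C-terminal peptide of the protein (here GW) may lack a labeled residue and then cannot be used for quantification.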
In contrast to cell lines that are easy to cultivate for efficient metabolic
labeling in vivo, many biological samples require alternative methods for
isotope labeling, for example when working with clinical blood samples or
tissue samples. In such cases, chemical or enzymatic labeling with isotopic
atoms or isotope-coded tags can be used. Examples of labeling with isotopic
atoms include enzymatic labeling with heavy oxygen during proteolytic
digestion with trypsin and labeling of primary amino groups in digested
peptides with deuterium. One major drawback of methods that label with
deuterium is that it produces a small shift in the overall chemical properties of
the peptides so that co-elution during chromatography is no longer
achieved106,107. Stable isotope dimethylation with formaldehyde in heavy
water, however, keeps the molecules chemically intact and is better suited
for chromatographic purposes108.
Labeling with isotope-coded tags is a commonly used approach in
quantitative proteomics. One of the earliest methods based on isotopic
labeling for protein quantification, developed already in the late 1990s, is
Isotope-Coded Affinity Tags (ICAT)109. In this method, free thiol groups on
cysteines are labeled with ICATs prior to digestion. The ICAT tags are
comprised of a sulfhydryl-reactive crosslinking group, a heavy or light linker
and a biotin molecule. The method takes advantage of the high affinity
between biotin and avidin/streptavidin, and enables purification of labeled
peptides by immobilization on avidin or streptavidin before heavy and light
samples are combined and analyzed.
Today, isobaric tags are also popular for labeling of peptides after protein
digestion. Isobaric tags are composed of a mass reporter (the actual tag) and a
balancing mass normalizer. Since isobaric tags have both identical net masses
and chemical properties, the isotopologues co-elute. During
fragmentation, the tags are cleaved and their reporter ions are used to
distinguish them from each other, based on differences in mass that enable
relative intensities to be compared between samples in one MS/MS spectrum.
Labeling with isobaric tags is typically performed at primary amine groups.
Several different types of isobaric tags applicable for multiplexed quantitative
analyses are available; Tandem Mass Tags (TMT) and Isobaric Tags for
Relative and Absolute Quantification (iTRAQ) are popular110,111.
In contrast to the above-mentioned methods for relative quantification of
proteins, absolute quantification aims to reveal absolute concentrations of
proteins across samples. This is most commonly achieved by spiking in
known concentrations of heavy-labeled standards before protein digestion,
followed by comparison of the obtained signal intensities112. Popular standards
include Absolute Quantification (AQUA) peptides and QconCAT
concatemers113,114. Full-length protein standards can also be spiked in, which
generates highly accurate quantification as the spiked-in protein will generate
the same peptides upon trypsin digestion as its endogenous counterpart115-117.
A major drawback with the use of full-length protein standards is that they
are challenging to produce. Recently, a strategy for absolute quantification
was established, using heavy-labeled protein fragments of 25-150 amino
acids, called QPrESTs118. This method is described in more detail in paper II.
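The arithmetic behind spike-in based absolute quantification can be sketched in a few lines. The example assumes that the light (endogenous) and heavy (standard) peptides give equal signal per molecule, so that their intensity ratio equals their molar ratio; the intensities, the spiked amount and the cell number are hypothetical and the function names are illustrative only.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def absolute_amount_fmol(light_intensity, heavy_intensity, spiked_heavy_fmol):
    """Endogenous amount from the light/heavy ratio and the known spiked amount."""
    return light_intensity / heavy_intensity * spiked_heavy_fmol


def copies_per_cell(amount_fmol, n_cells):
    """Convert a femtomole amount measured in a lysate to molecules per cell."""
    return amount_fmol * 1e-15 * AVOGADRO / n_cells


# Hypothetical measurement: the light signal is twice the heavy signal,
# 50 fmol of heavy standard was spiked into a lysate from one million cells.
amount = absolute_amount_fmol(3.0e6, 1.5e6, 50.0)
copies = copies_per_cell(amount, 1_000_000)

print(amount, "fmol")
print(round(copies), "copies per cell")
```

In practice several peptides per protein and replicate measurements would be combined, which is omitted here.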
Targeted mass spectrometry
In discovery proteomics, where the aim is to detect and quantify as many
proteins as possible given the capacity of the instrumentation, the analysis
is often performed by means of Data Dependent Acquisition (DDA). This means
that the instrument is programmed to choose the ions of highest intensity for
further fragmentation in the MS2 stage. In targeted proteomics studies, where
the aim is to obtain results for a defined subset of proteins, precursor ions
are instead chosen beforehand, independent of the data that is generated
throughout the experiment. This targeted, Data Independent Acquisition
(DIA) approach enables reproducible and sensitive results, as proteins
irrelevant to the study that could potentially mask proteins of interest can
be filtered out from the analysis119,120. A popular method used in targeted
proteomics is Selected Reaction Monitoring (SRM, also termed Multiple
Reaction Monitoring (MRM))121,122. In SRM, a triple quadrupole instrument is
used, consisting of three quadrupole mass analyzers and two stages of mass
filtering. A precursor ion of interest is preselected in the first
quadrupole, followed by fragmentation by collision with gas in the second
quadrupole. Instead of obtaining a full MS/MS scan (as in conventional tandem
MS), only a small number of sequence-specific fragment ions are analyzed in
the third quadrupole and passed on for detection. The precursor/product ion
pairs are called transitions. Once the procedure is established and
optimized, multiple samples can be analyzed over a relatively short period of
time, generating highly sensitive and reproducible results. The SRM
technology has gained wide popularity within the field of drug discovery, as
it allows for unbiased quantification of low-abundant proteins across large
sample cohorts. An alternative method to SRM is Parallel Reaction Monitoring
(PRM), which enables multiple peptides to be quantified in parallel by the
use of hybrid Quadrupole-Orbitrap instruments that combine precursor ion
selection by the quadrupole with high-resolution accurate-mass (HRAM)
detection of all fragment ions by the Orbitrap123.
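The difference between intensity-driven and predefined precursor selection can be sketched in a few lines. This is a simplified illustration only: real instruments apply dynamic exclusion, charge-state filters and retention time scheduling, none of which is modeled here. The `Precursor` type, the function names and the m/z tolerance are assumptions made for the example.

```python
from typing import NamedTuple


class Precursor(NamedTuple):
    mz: float         # mass-to-charge ratio of the precursor ion
    intensity: float  # signal intensity in the MS1 scan


def select_dda(precursors, top_n):
    """Data dependent selection: pick the top_n most intense ions."""
    ranked = sorted(precursors, key=lambda p: p.intensity, reverse=True)
    return ranked[:top_n]


def select_targeted(precursors, target_mzs, tol=0.01):
    """Targeted selection: keep only ions on a predefined m/z list."""
    return [
        p for p in precursors
        if any(abs(p.mz - target) <= tol for target in target_mzs)
    ]


if __name__ == "__main__":
    # Hypothetical MS1 scan with one low-abundant ion of interest.
    scan = [Precursor(400.2, 1e6), Precursor(512.3, 5e3), Precursor(623.8, 8e5)]
    print(select_dda(scan, top_n=2))        # the weak ion is not selected
    print(select_targeted(scan, [512.3]))   # the weak ion is selected
```

The example shows why a low-abundant peptide can be missed in an intensity-driven run but is still measured when its m/z is specified in advance.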
Affinity-based proteomics
Affinity is a measure of binding strength between molecules. In proteomics,
several methods make use of affinity reagents that can selectively bind
proteins in a mixture and thereby enable detection and quantification.
The most commonly used affinity reagents are antibodies. Antibodies are
produced by B cells in the immune system and are characterized by a striking
diversity as a result of genetic rearrangements during B cell development, in
combination with affinity maturation after binding of target antigen.
Antibodies as research reagents were first used in the 1960s when Rosalyn
Yalow and Solomon Berson developed the first radioimmunoassay (RIA)124.
In this method, antibodies were used to detect and quantify molecules by
competitive binding between isotope-labeled and unlabeled target. In the
1970s, Peter Perlmann and Eva Engvall developed the enzyme-linked
immunosorbent assay (ELISA), where enzyme-coupled antibodies are used to
detect proteins90.
Immunofluorescence
Immunofluorescence (IF) technology relies on the binding between
fluorescently labeled antibodies and their target proteins. When combined
with a high-resolution imaging technique, IF enables the identification and
in situ visualization of selected targets. IF can be applied in proteomics to
provide a window to the physiology of cells and tissues by enabling
researchers to reveal the spatial distribution of endogenous protein
expression. The spatial information provided by IF analysis is an important
advantage that differentiates this method from complementary technologies
such as MS-based applications. As the detection is based on local protein
concentrations rather than the bulk abundance, a high sensitivity is
achieved. For example, low-abundant proteins that are localized to small
organelles like the centrosome can be detected. IF is most often used for
qualitative analysis of protein expression, but it can also be used for
quantitative purposes, by comparison of relative intensities or by more
advanced techniques such as Förster resonance energy transfer (FRET) and
fluorescence recovery after photobleaching (FRAP)125,126. IF experiments can
be performed either in a direct or an indirect manner. In direct IF, a single
antibody conjugated to a fluorophore is used for detection. In indirect IF,
two antibodies are used: one primary antibody intended to target the protein
of interest, and one labeled secondary antibody that targets the primary
antibody and is used for detection. The following text considers only
indirect IF experiments using in vitro cultivated cells.
An IF experiment starts with fixation and permeabilization of the cells
containing the target protein. Fixation is performed to retain the
intracellular architecture during permeabilization, which is required for
antibodies to enter the cell and reach their targets. A commonly used
approach is to first cross-link the proteins by adding paraformaldehyde (PFA)
that reacts with nitrogen atoms in the proteins to form bridges that retain
molecules in place. Following cross-linking, the membrane is permeabilized,
aided by saponins or non-ionic detergents such as Tween 20 or Triton X-100.
Fixation and permeabilization can also be performed simultaneously by the use
of organic solvents such as acetone, ethanol or methanol. Sample preparation
including fixation and permeabilization requires careful optimization of
experimental protocols, customized for the protein targets of interest.
Chemical conditions must be mild enough to keep the internal organization
intact, and strong enough to provide access to all intracellular
compartments. Well-optimized fixation and permeabilization is important to
minimize the risk of artifacts. For example, dehydration with alcohols works
well for analysis of the cytoskeleton but is less suitable for soluble
proteins. Too long incubation with PFA can lead to blocking of antibody
epitopes, and mild detergents such as saponins are less effective when
studying mitochondrial proteins since the inner mitochondrial membrane
contains relatively little cholesterol127. When cells are fixed and
permeabilized, the sample is incubated with the primary antibody followed by
washing and incubation with the secondary antibody. To enable subsequent
localization of the protein in relation to the inner architecture of the
cell, co-localization is commonly performed, that is the simultaneous use of
additional affinity reagents or dyes that are added together with the primary
antibody. Examples include antibodies targeting the cytoskeleton of the cells
and small chemicals such as DAPI that bind to the DNA in the nucleus and
enable visualization of the cell nuclei as a reference. Following IF
staining, targeted proteins are visualized by detecting antibodies through
their conjugation to a fluorophore. In fluorescence microscopy the sample is
illuminated with a certain wavelength (excitation), followed by emission of
light with a longer wavelength. The difference in wavelength between emitted
and absorbed light is known as the Stokes shift, a phenomenon that is
fundamental to the sensitivity and low background of IF128. By filtering,
only the emitted light is allowed to reach the detector. In confocal laser
scanning microscopy (CLSM), higher resolution is obtained by illuminating the
sample with a high-power laser beam of distinct wavelength, and the emitted
light beam path is gated so that only the emission from the focal plane
reaches the detector. When using multiple fluorophores it is important to
minimize spectral overlap, which can be achieved by narrow bandpass filters
and sequential scans.
Using RNA levels as an indirect measure of protein expression
The cellular abundances of mRNA and protein depend on the fundamental
processes of transcription, mRNA storage, translation, mRNA degradation and
protein degradation.
Methods for monitoring the rate of transcription include labeling of RNA with
4-thiouridine (4sU) prior to microarray or RNA-seq analysis, thereby enabling
the user to distinguish newly transcribed RNA from the pre-existing
population129. To minimize bias due to degradation during labeling, the time
of the 4sU labeling can be kept as short as 10 minutes. In addition to 4sU
labeling, alternative methods such as RATE-seq and the purely computational
SnapShot-seq are also available130,131. After transcription, mRNA molecules
are processed before they exit the nucleus to enter the cytosol where
translation occurs. mRNA transcripts that display equal levels of expression
can generate different amounts of protein products due to differences in
translation rates and protein degradation. Translation rates are partly
influenced by the mRNA sequence, which dictates codon composition, upstream
Open Reading Frames (ORFs) and internal ribosome entry sites (IRES)132. In
addition, translation rates can be modulated by the availability of ribosomes
and the binding of regulatory proteins and non-coding RNA molecules to the
mRNA transcripts133. The post-transcriptional processing and transport of
mRNA to the cytosol are control steps that lead to a delay in protein
translation. Changes in transcript abundance will therefore always influence
protein abundance with a certain degree of temporal delay. Several studies
have shown that the temporal delay in protein synthesis is protein-specific,
and that more careful consideration of this delay would increase the
correlation between mRNA and protein abundances134. Protein kinetics, that is
protein synthesis and turnover, is typically monitored using pulse-labeling
of newly synthesized proteins with heavy isotopes, or ribosome or polysome
profiling135,136. In ribosome profiling, ribosome-protected fragments are
isolated independently of whether the mRNA is actively transcribed or not,
followed by RNA-seq to reveal information about the transcript locations
occupied by ribosomes137. In polysome profiling, mRNA molecules are separated
by the number of bound ribosomes, which enables analysis of transcripts that
are considered “efficiently translated”. A commonly used cutoff for the
identification of “heavy polysomes” is the occupancy of at least three
ribosomes138.
Even though the existence of a general correlation between mRNA and
protein levels is still being debated, several studies from the last decade
conclude that the level of mRNA variation largely explains steady-state
differences between corresponding protein levels134,139,140. As living
systems, cells are continuously exposed to changes in their molecular
compositions, mainly due to dynamic processes such as proliferation and
differentiation. In most studies, steady-state conditions are defined as
circumstances where the average mRNA and protein levels remain more or less
constant over several hours.
Challenges and considerations
The cellular proteome is highly complex. Isoforms and PTMs contribute to a
huge diversity of proteoforms, and on top of that is a tremendous dynamic
range of protein concentrations. In quantitative studies, the dynamic range
of protein expression is challenging, especially in samples derived from
various body fluids such as plasma and cerebrospinal fluid (CSF), where
proteins display concentrations that can vary by more than ten orders of
magnitude141.
In comparison to transcriptomics studies where PCR enables amplification
prior to sequencing, proteins are measured in their original concentrations.
To maximize sequence coverage across the full range of concentrations,
various methods are employed to reduce sample complexity, either by
enrichment of interesting targets or depletion of high-abundant proteins.
This is commonly achieved by gel electrophoresis, liquid chromatography or
the use of affinity reagents. In IF, an important advantage is the ability to
detect local concentrations due to the spatial resolution. Proteins that
display a high local concentration at smaller defined areas, such as the
centrosome, can be identified even though their relative cellular
concentration is very low.
The major challenge in affinity-based proteomics is to ensure the quality of
the affinity reagents that are used. Antibodies are among the most commonly
used research reagents today, but they suffer extensively from
cross-reactivity and therefore require careful consideration regarding
specificity, that is the ability to discriminate between molecules that
compete for the same binding site. To establish a standardized scheme for
antibody validation, a committee of international scientists from various
fields was formed, the International Working Group for Antibody Validation
(IWGAV). Recently, IWGAV presented a proposal of a set of standard guidelines
for antibody validation, built upon five pillars: Genetic strategies,
Orthogonal strategies, Independent antibody strategies, Expression of tagged
proteins and Immunocapture followed by mass spectrometry142. In
antibody-based proteomics, challenges remain to reach the same level of
multiplexing as can be obtained in MS-based approaches. Simultaneous
detection of more than one target protein using multiple antibodies requires
the availability of secondary antibodies that are specific to antibodies from
only one type of host species. In addition, fluorophores that are conjugated
to secondary antibodies must be well resolved to minimize spectral
bleed-through that can lead to false detection of co-localization. A typical
limit for robust IF experiments in 96- and 384-well plates is three to four
channels, but several attempts have been made to increase the multiplicity of
IF, for example infrared-shifted fluorophores, quantum dots, bar coding and
cyclic immunofluorescence (cycIF)143. In mass cytometry (CyTOF), antibodies
are conjugated to rare metals, enabling over 30 different antibodies to be
used simultaneously without problems with spectral overlap144.
4. Data analysis and functional interpretation
Public data resources
The accumulation of information is accelerating fast. The scale of biological
data produced in the world’s labs is now well beyond petabyte (PB) and even
exabyte (EB)145,146. Open access to interpretable results contributes
immensely to the ability to use this large amount of information in
meaningful ways and take full advantage of it. Several resources are
available providing users with both raw data and pre-computed gene expression
levels.
Transcriptomics databases
The amount of data generated in transcriptomics studies has grown into an
explosion of information that continues to grow. Today, many scientific
journals require transcriptomics data to be uploaded into public data
repositories before acceptance for publication. This enables open access to
large amounts of transcriptomics data, which enhances reproducibility and
maximizes usefulness by enabling meta-analyses and the ability to use
already existing data in new scientific contexts. The Gene Expression
Omnibus (GEO), hosted by the National Center for Biotechnology Information
(NCBI), is a public data repository for both array- and sequence-based
transcriptomics data submitted by the research community147. For RNA-seq
data, processed data files can be uploaded to GEO, and raw sequence files
submitted to NCBI's Sequence Read Archive (SRA) database. ArrayExpress and
the Gene Expression Atlas, both hosted by the European Bioinformatics
Institute (EBI), are equivalent European initiatives, also allowing
researchers to submit and browse data148,149.
In addition to databases that collect sequencing data from the whole research
community, several initiatives have also been launched, aiming to provide
comprehensive maps of the human transcriptome across various cells and
tissues, including for example the Genotype-Tissue Expression (GTEx)
project, the Human Protein Atlas (HPA) and FANTOM5150-152. The GTEx
consortium is based at the Broad Institute and supported by the National
Institutes of Health (NIH). The GTEx project was initiated with the aim to
provide comprehensive datasets that can be used to study gene expression in
relation to genetic variation. At the GTEx web portal, users can query gene
expression profiles for genotyped tissue samples derived from over 1,600
post-mortem samples, covering 43 different tissue types. The HPA project
also encompasses large amounts of RNA-seq data, integrated in the
image-based protein database, covering 32 different tissue types derived from
sampling of 95 individuals. The HPA RNA-seq data covers both internally
generated RNA-seq data and imported data from the GTEx consortium.
The FANTOM5 (Functional Annotation of Mammalian Genomes 5) consortium is a
large project based at RIKEN in Japan, mainly focused on mapping the
transcriptional promoter landscape, which is important for transcriptional
regulation by transcription factors and enhancers. Data generated by the
FANTOM5 project relies on the in-house developed Cap Analysis of Gene
Expression (CAGE) technology. CAGE is based on the capture of capped
5' start sites of mRNA molecules that are further sequenced with the
HeliScope sequencer without the need for PCR amplification. The FANTOM5
project provides RNA expression profiles, reported as TPM values for 975
human samples including cancer cell lines, primary cells and various
tissues153. Several studies have been performed with the aim to investigate
the overall consistency among public sources of RNA-seq data. The results
from these studies show that there is a high genome-wide correlation
between data from similar tissues generated at different labs using different
technologies154-156.
Proteomics databases
In the mid-1990s, several computer-driven and EST-based studies were
published aiming to estimate the number of human protein-coding genes,
resulting in estimates between 50,000 and 100,000 genes157,158. In the
following years, the estimates were further condensed, and when the human
genome sequence was first announced in 2001, the estimated number of
protein-coding genes was smaller than ever expected: 30,000-40,000
according to the public consortium and 38,588 (26,588 with strong evidence
plus an additional 12,000) according to the competing private
consortium159,160. In recent years, the set of protein-coding genes has
continued to shrink. In 2012, the Encyclopedia of DNA Elements (ENCODE)
project, aiming to catalogue all functional parts of the genome, identified
20,687 protein-coding genes, and according to the latest Ensembl annotation
update, there are 20,441 protein-coding human genes to date161.
UniProtKB/Swiss-Prot is a manually curated database of protein
sequences and functional information. The latest release comprises 20,197
human proteins, of which 14,876 have been experimentally shown to exist
(based on a search on 2016-09-12). The annotation of the human proteome in
UniProtKB is an ongoing project, continuously striving to improve the quality
of the data. Currently, 589 proteins are marked as uncertain due to lack of
evidence for their existence; more than half of these are predicted to be
translation products from putative pseudogenes. Their existence is therefore
a matter of debate, which partly explains the lack of complete overlap
between the databases162. A commonly used database for proteomics data
generated with MS methods is the PRoteomics IDEntifications (PRIDE) database,
a public repository for proteomics data including protein and peptide
identifications, PTMs and spectral evidence163.
The Human Protein Atlas
HPA is a research program aimed at systematically characterizing the human
proteome in cell lines and tissues of both normal and cancerous origin164.
Since the launch of the project in 2003, the HPA has continued its
exploration of the proteomic landscape by continuously gathering data into a
publicly available knowledge database. The HPA project operates from a
gene-centric point of view and relies mainly on the use of in-house produced
antibodies that are used in immunohistochemistry (IHC) and IF
applications97,165. Organized into a Cell Atlas, a Pathology Atlas and a
Tissue Atlas, protein expression is presented in high-resolution images
accompanied by transcriptomics data derived both from internal RNA-seq
experiments and external data from the GTEx consortium. As antibody reagents
carry the risk of unspecific interactions with multiple targets, their usage
requires careful validation of specificity and performance across the
different assays in which they are used. All antibodies that are used in the
HPA are extensively validated, with the results integrated on the atlas.
Major milestones in the HPA project include the release of the Tissue Atlas
in 2015, with the major conclusion that protein expression is ubiquitous over
diverse tissue types, with only a few genes showing tissue-specific
expression, specifically enriched in brain and testis97. At the end of 2016,
the Cell Atlas will be released, containing subcellular localization data for
> 12,000 human proteins166.
Gene Ontology
To gain meaningful biological insight from large-scale biological data,
typically from a list of transcripts, genes or proteins of potential interest,
functional annotations are useful. Several databases are available that
collect functional annotations for genes and gene products, and provide
interactive web-based search functions. These generally contain information
on cellular localization, biological functions, pathway involvement and
interaction partners.
The Gene Ontology (GO) is a large collaborative project that was initiated in
1998 with the aim to develop a controlled vocabulary for biological
characteristics of gene products167. In addition, GO also provides several
publicly available tools for access and interpretation of the ontology data.
GO uses both manual and automatic approaches to collect annotations from the
literature and translate them into GO terms that are further organized into a
tree-like structure. GO terms are accompanied by a description, a digital
identifier (accession) and an evidence code that denotes the type of evidence
upon which the annotation is based. The GO data is separated into three
unrelated domains: Cellular Component, Biological Process and Molecular
Function. Each of these domains is organized into a semi-hierarchical system
of GO terms that are connected by edges that represent their relationship.
The system starts from the root domain, and continues downwards through more
specialized child terms. However, there is no strict hierarchy, and GO terms
can have multiple parents as well as multiple descendants. Each GO term can
be associated with any number of gene products, and each gene product can
be associated with any number of GO terms. Since related GO terms that
appear at different depths in the hierarchy often share associated gene
products, analyses based on GO terms often suffer from redundancy. To
overcome this problem, several attempts have been made to improve GO
analysis, for example by considering the correlations among terms in the
hierarchy and by trimming strategies to reduce redundancy among GO
terms168,169.
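The semi-hierarchical structure described above, where a term can have several parents, can be sketched as a small graph traversal. The term names below are hypothetical placeholders rather than real GO accessions, and the `parents` mapping is a toy example and not the format used by the GO tools.

```python
def ancestors(term, parents):
    """Return all terms reachable from `term` by following parent edges.

    `parents` maps each term to a list of its direct parent terms.
    A visited set makes sure shared ancestors are only collected once.
    """
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    # Toy ontology: term C has two parents (A and B) that share one root.
    toy_parents = {"C": ["A", "B"], "A": ["root"], "B": ["root"]}
    print(sorted(ancestors("C", toy_parents)))
```

A gene product annotated to the child term is implicitly associated with every ancestor as well, which is one source of the redundancy between related terms mentioned above.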
Functional enrichment analysis
For proper interpretation of gene lists derived from RNA-seq experiments,
annotations for individual gene products are usually not sufficient per se.
What has become almost a standard routine among researchers with large
gene lists in their hands is to use statistical methods to find annotations
that are over-represented in their datasets. The basis for such analyses is
to look at the proportion of genes associated with a certain annotation
within the dataset and compare this to the proportion of genes associated
with that annotation in the background dataset, for example the full set of
protein-coding genes in the genome. Numerous tools are available for such
analyses, both as packages suitable for statistical programming environments
and as web-based tools, for example TopGO, enrichGO, GOrilla, the Database
for Annotation, Visualization and Integrated Discovery (DAVID) and
FunRich170-174. Functional enrichment analyses can be based on any type of
annotation; a popular approach is to make use of the GO system of
classifications.
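The comparison of proportions described above is commonly formulated as a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test). The following is a minimal sketch of that test using only the Python standard library. It is not the implementation used by any of the tools named above, and the function and parameter names are chosen for the example.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """Probability of seeing at least k annotated genes by chance.

    N: number of genes in the background set
    K: number of background genes carrying the annotation
    n: number of genes in the gene list of interest
    k: number of genes in the list carrying the annotation
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    # Sum the hypergeometric probabilities for k, k+1, ..., min(n, K).
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Hypothetical counts: 40 of 200 listed genes carry an annotation that
    # 500 of 20,000 background genes carry.
    print(enrichment_p_value(40, 200, 500, 20000))
```

When many annotations are tested at once, the resulting p-values need correction for multiple testing, which is omitted here.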
Challenges and considerations
The direct access to pre-processed gene expression values at several web
portals is convenient in many aspects. For example, researchers who are not
performing RNA-seq experiments themselves can access expression levels
without going through the extensive data processing that downloading of raw
data would require. It also enables researchers to query only a few genes of
interest, in a tissue-specific context, instead of having to deal with large
genome-wide datasets. However, the usage of these expression levels, often
reported in terms of FPKM or TPM values, should be considered with some
caution. In theory, these values are not comparable over different samples,
due to inherent sample dependence (as discussed in chapter 2), and more
sophisticated statistical methods are required to establish whether a
gene/transcript is significantly differentially expressed. Due to the
comprehensive collections of samples in the databases described above, it
would be impractical to perform pairwise analyses of differential
expression across thousands of samples. It is easier to simply present the
data in terms of abundance levels. Even though comparisons of RNA data need
statistics in theory, direct comparisons of FPKM or TPM values have proven
to work relatively well in practice. For users who are not familiar enough
with RNA-seq data to take this into consideration, comparison of reported
expression levels can however be misleading. Small differences in FPKM and
TPM values bear the risk of being overrated, leading to downstream bias in
the interpretations.
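The sample dependence of FPKM and TPM values can be made concrete with a small sketch based on the standard definitions of the two units. The read counts and gene lengths are made up for illustration, and for single-end data the same formula is often called RPKM.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalized rates scaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]


if __name__ == "__main__":
    lengths = [1000, 2000, 1000]
    # Two hypothetical samples: the first gene has 100 reads in both, but a
    # third gene is much more highly expressed in sample B.
    sample_a = [100, 200, 700]
    sample_b = [100, 200, 7000]
    print(tpm(sample_a, lengths)[0], tpm(sample_b, lengths)[0])
```

The first gene receives a lower TPM value in the second sample although its read count is unchanged, because both units express abundance relative to everything else that was sequenced in the same sample.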
As biological data continues to expand, not only in terms of quantity but also
in terms of various new data sources, achieving a standardized annotation
system is emerging as a major challenge. Functional annotations should, at
best, enable their users to both contribute and extract information in a
robust and feasible manner. To achieve a uniform system with consistent
formats and languages, large-scale adoption is required by the whole research
community. GO was initiated with the aim to establish a standardized system,
and is widely used today. However, the GO system displays a lot of
redundancy by allowing GO terms to have multiple parents and child terms
with overlapping genes linked to them. An additional drawback with GO is
that the numerical portion of GO terms has no inherent meaning related to
the hierarchy, which makes them less useful from a computational point of
view. The IDs of GO terms are assigned in numerical order by the editing
tool, thereby only indicating the order in which they were created. In the
past, the global research community has collectively focused attention on a
relatively small fraction of the proteome. A recent bibliometric analysis
revealed that this fraction has remained relatively stable over the past 20
years175. This skewed focus is an important consideration in the context of
functional analyses. As well-studied proteins are richer in annotations, they
are more likely to appear in query results, thus accumulating further in
future analyses.
The development of high-throughput technologies has reshaped biology into
a quantitative and computationally intense area of research. Researchers can
now perform large-scale and quantitative analyses of whole genomes and
transcriptomes, and a large part of proteomes. These technological advances
have led to a promising starting point for modeling the cell and its inner
workings. In order to gain complete understanding of the complex circuitry
underlying cell identity and diversity, understanding of multi-level gene
expression is important, which requires integration of heterogeneous data
sources. In addition to that, spatial resolution of expression is crucial in
the overall understanding and characterization of the human proteome.
5. Aims of the thesis
While the previous chapters are dedicated to the surrounding field of
research that I am acting within, the following two chapters aim to discuss
the scientific contribution of myself and co-workers over the past years.
This contribution is gathered in the following collection of papers:
I
The aim of this study was to explore the consistency among a set of
RNA-seq datasets generated at different laboratories using different
strategies. By comparison of mRNA expression levels for ostensibly
similar tissues we showed that publicly available RNA-seq data is
relatively consistent, requiring only light processing for samples to
cluster according to tissue type rather than corresponding laboratory.
II
In this study we investigated the quantitative relationship between
mRNA and protein levels under steady-state conditions in various
human cell types and tissues. By combining RNA-seq technology with
an in-house developed approach for targeted mass spectrometry, we
showed that there is a gene-specific conversion factor between mRNA
and protein levels that can be used to predict protein abundance in a
steady-state context.
III
The aim of this study was to use RNA-seq to guide analysis of protein
expression in a four-step cell model for malignant transformation,
called the BJ model. RNA-seq followed by protein profiling was applied
to this model, enabling us to scrutinize changes in gene expression
across defined steps of tumorigenesis. We concluded that the majority
of differentially expressed genes were downregulated within the
model, and the combined transcriptomics and protein profiling
approach enabled the identification of several proteins with
interesting expression profiles that were further validated in patient
tumor samples.
IV
Tumor tissues often contain local areas deprived of oxygen where
malignant cells are still able to sustain rapid proliferation. In this
study, we used a transcriptomics approach to explore differential
responses to decreased oxygen in the four stages of the BJ model
separately. The results show that even though the cell lines in this
model are highly similar, they respond strikingly differently to moderate
hypoxia. The introduction of SV40 Large-T introduces major molecular
changes with impact on the cells' ability to adapt to lowered oxygen
concentration. This shows the potential of the BJ model for studies of
perturbed scenarios in relation to malignant transformation.
6. Present investigation
While the work covered in this thesis is concerned with multiple biological
questions, there is one unifying aspect: assessing the use of data derived
from sequencing of mRNA of human cells and tissues. The dynamic nature of
mRNA makes RNA-seq an attractive method to generate a snapshot of the
transcriptional landscape at a particular moment of interest. RNA-seq has
become widely popular and new datasets are continually being produced and
uploaded to repositories for public use. The public availability opens up new
windows for analysis of transcriptomes, enabling researchers to expand their
data by incorporation of externally generated datasets into their own
projects. This has great potential in adding power to analyses, but requires
careful consideration regarding reproducibility of results. To assess the
usefulness of data generated in different labs, using different equipment,
strategies and data processing pipelines, we analyzed the consistency of a
number of publicly available RNA-seq datasets (paper I).
As asserted by the central dogma of molecular biology, mRNA is a messenger
carrying information that is ultimately used for proteins to be synthesized.
The increasing number of studies that use RNA-seq to characterize biological
functions commonly assume that differences in mRNA abundance would be
reflected on the protein level where the “real” action takes place. To
investigate the degree of correlation between mRNA and protein abundance,
we combined RNA-seq analysis with a targeted mass spectrometry approach
for absolute protein quantification (paper II). Here, we observed a
gene-specific relation between mRNA and protein levels, conserved across the
panel of cell lines and tissues included in the study. The results from this
study indicate great promise for transcriptomics to guide and focus
proteomics studies.
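The idea of a gene-specific relation between mRNA and protein levels can be illustrated with a minimal sketch in which protein abundance is modeled as a constant factor times the mRNA level for a given gene. Estimating the factor as the median protein-to-mRNA ratio across samples is an assumption made for this illustration and is not necessarily the procedure used in paper II. All names and numbers are hypothetical.

```python
from statistics import median


def conversion_factor(mrna_levels, protein_levels):
    """Estimate a gene-specific factor as the median protein/mRNA ratio.

    Both lists hold measurements of the same gene across several samples.
    Samples with zero mRNA are skipped because the ratio is undefined.
    """
    ratios = [p / m for m, p in zip(mrna_levels, protein_levels) if m > 0]
    if not ratios:
        raise ValueError("no sample with a positive mRNA level")
    return median(ratios)


def predict_protein(mrna_level, factor):
    """Predict protein abundance from an mRNA level and the gene's factor."""
    return factor * mrna_level


if __name__ == "__main__":
    # Hypothetical gene measured in three cell lines (mRNA in TPM,
    # protein in copies per cell).
    factor = conversion_factor([10, 20, 40], [1000, 2200, 3600])
    print(factor, predict_protein(5, factor))
```

Such a prediction is only meaningful under steady-state conditions, since changes in transcript abundance reach the protein level with a delay.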
One example of a study where information derived by RNA-seq was
successfully used to guide protein analysis is presented in paper III. The
field of cancer research constitutes one of the main branches in contemporary
molecular biology, but the molecular mechanisms underlying tumor
development are still far from completely understood. In the work presented
in paper III, we performed the first global analysis of transcriptome changes
within a four-step model for malignant transformation (see chapter 1 for
more details on the model system). Being part of the Human Protein Atlas
project, we have access to a unique library of antibodies towards all human
proteins. This enabled us to perform follow-up studies on interesting
proteins selected by analysis of transcriptomics data. In the work presented
in paper III, we found several interesting protein candidates whose
expression in the cell line model was further evaluated in tumor tissue
samples of different malignant stages. We concluded that the cell line model,
despite its inherent limitations, serves as a good model to study gene
expression changes during malignant transformation. We therefore continued to
expose this model to a perturbed environment with decreased oxygen supply,
presented in the form of a manuscript (paper IV). Tumor tissues often contain
local areas that are less oxygenated due to uncontrolled oncogene-driven
proliferation of the present tumor cells. An important characteristic of
transformed cancer cells is their capability to rewire their energy
metabolism and continue to proliferate independently of oxygen. This
aggressive phenotype is known to be associated with increased resistance to
radiotherapy as well as increased metastatic potential176. To continue our
study of the BJ model for malignant transformation, we set out to explore the
existence of such metabolic flexibility from a transcriptomics perspective.
Our results show that the BJ model can be divided into two groups, one
premalignant phenotype including the primary and immortalized cells and one
postmalignant phenotype in which the introduction of SV40 gives rise to major
metabolic changes, mainly involving upregulation of lipid, carbohydrate and
amino acid metabolic pathways. At the end of this thesis are the articles and
a manuscript, containing detailed descriptions of the work and its findings.
Assessing the consistency among public RNA-seq data (Paper I)
About three years ago, one of my supervisors was involved in a collaborative
study aiming to describe the baseline transcriptome of human skeletal
muscle. Browsing the web, he stumbled upon a blog post that immediately
caught his attention. A young man had become a self-taught bioinformatician
by exploring RNA-seq data after his wife had been diagnosed with a genetic
disease. In this blog post, he described his excitement over the increasing
availability of public RNA-seq data, especially the Illumina Body Map from
which he had downloaded BAM files and quantified gene expression in a
number of different tissue types177.
“Often I find myself needing a quick reference on how highly expressed a
particular gene is across different tissues, either to browse visually or to
combine with other data in an analysis.”178
Since my supervisor was working with RNA-seq data for muscle samples, he
was well familiar with the gene TTN, encoding the third most abundant
protein (Titin) in human muscle after Myosin and Actin179. He was therefore
surprised when he noticed that this gene was reported to have FPKM=0 in
data for muscle tissue presented on the blog. He started to dig deeper into
this finding and reanalyzed the same data using different program
parameters. This resulted in about 2.7 million reads aligned to the genomic
region occupied by TTN, corresponding to an FPKM value of about 200. The
reason for this discrepancy was a default limit on the number of alignments
per locus, set by the program Cufflinks.
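
The effect described above can be illustrated with a small sketch of the
FPKM calculation. This is a toy model under stated assumptions, not the
actual Cufflinks implementation: only the read count of about 2.7 million
comes from the text, while the transcript length, the library size and the
per-locus limit are hypothetical placeholders chosen so that the uncapped
result lands near the reported value of about 200.

```python
def fpkm(fragments, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)


def fpkm_with_locus_cap(fragments, gene_length_bp, total_mapped_fragments,
                        max_fragments_per_locus):
    """Toy model: a locus exceeding the cap is skipped and reported as 0."""
    if fragments > max_fragments_per_locus:
        return 0.0
    return fpkm(fragments, gene_length_bp, total_mapped_fragments)


# Only the read count (about 2.7 million) is taken from the text.
ttn_fragments = 2_700_000
ttn_length_bp = 100_000        # hypothetical transcript length
library_size = 135_000_000     # hypothetical number of mapped fragments
cap = 1_000_000                # hypothetical per-locus limit

capped = fpkm_with_locus_cap(ttn_fragments, ttn_length_bp, library_size, cap)
uncapped = fpkm(ttn_fragments, ttn_length_bp, library_size)
print(capped, round(uncapped))
```

With these placeholder values, the capped call reports the gene as absent
while the uncapped call yields a high expression value, which is the kind of
discrepancy that motivated a closer look at the data.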
This was the starting point for the project that is presented in paper I. In a
time when RNA-seq data is heavily used by people with a broad range of
knowledge, it is more important than ever to secure the usefulness of data
and control for bias. We therefore set out to assess the consistency of publicly
available RNA-seq data on a global level. Our focus subsequently turned
towards global expression analyses. Instead of looking at individual gene
expression values, we analyzed the variability among the different samples
and explored the impact of a number of factors related to the data
generation. If samples clustered according to tissue type rather than
laboratory of origin, or some other factor related to experimental
parameters, we considered them consistent and useful for meta-analyses. For
example, one could use data generated by others to identify the tissue origin
of an unknown sample, or use external data to expand the number of
replicates in one's own experiment. We initially analyzed the consistency
among four sets of pre-computed gene expression levels (FPKM and RPKM
values) for the human tissue types brain, heart, and kidney. These datasets
differed in several parameters related to how they were generated. Our aim
with this study was not to propose a gold standard protocol that would keep
datasets more consistent between labs, but rather to allow parameters to
vary and explore the possibility to process the data in a way that would
maximize their usefulness. Our major take-home message from this study was
that public RNA-seq datasets are relatively consistent, requiring only light
pre-processing for samples to cluster according to their tissue of origin. We
found that reprocessing of raw data did not bring about any significant
improvements, except for allowing the complete set of genes to be included.
In contrast to the blog post that initiated this project, we luckily did not
encounter any problems with missing out on important tissue-specific genes
while analyzing the published data. This project is presented in more detail in
paper I.
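
The consistency criterion described above, that samples should group by
tissue rather than by laboratory, can be sketched as follows. This is a
minimal illustration and not the analysis used in paper I: the log2(x + 1)
transform stands in for "light pre-processing", a nearest-neighbour check on
Pearson correlation stands in for clustering, and all expression values and
study labels are hypothetical.

```python
import math


def log_transform(values):
    """Apply log2(x + 1) to dampen the influence of highly expressed genes."""
    return [math.log2(v + 1.0) for v in values]


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def nearest_neighbours(samples):
    """Map each sample to the other sample with the most similar profile."""
    logged = {key: log_transform(values) for key, values in samples.items()}
    result = {}
    for key, profile in logged.items():
        scored = [(pearson(profile, other), other_key)
                  for other_key, other in logged.items() if other_key != key]
        result[key] = max(scored)[1]
    return result


# Hypothetical FPKM values for four genes; keys are (tissue, study).
samples = {
    ("brain", "studyA"): [120.0, 3.0, 0.5, 40.0],
    ("brain", "studyB"): [95.0, 5.0, 1.0, 55.0],
    ("heart", "studyA"): [2.0, 300.0, 4.0, 35.0],
    ("heart", "studyB"): [4.0, 210.0, 2.0, 60.0],
    ("kidney", "studyA"): [1.0, 6.0, 150.0, 45.0],
    ("kidney", "studyB"): [3.0, 9.0, 110.0, 50.0],
}

neighbours = nearest_neighbours(samples)
# Consistent data: every sample is most similar to a sample of its own tissue.
grouped_by_tissue = all(key[0] == nn[0] for key, nn in neighbours.items())
print(grouped_by_tissue)
```

If instead the most similar sample usually came from the same study but a
different tissue, this would point to technical factors dominating the
biological signal.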
Using transcriptomics to guide proteomics (Paper II)
Since proteins are commonly attributed the major functional role within cells,
sequencing of mRNA is based on the assumption that a difference in mRNA
abundance reflects different phenotypes via differences in corresponding
protein levels. To investigate the accuracy of this assumption, translation
research has evolved into a field of its own, aimed at establishing the link
between mRNA and protein under various scenarios134,180. During the last
decade, numerous studies have been conducted aiming to reveal the
quantitative relationship between mRNA and corresponding protein
products. One way to approach the question of whether a correlation between
mRNA levels and protein levels exists is to simply correlate mRNA and protein
expression levels for numerous different genes. This can be done for both
relative and absolute values of expression. Another approach, which was used
in the study presented in paper II, is to compare the relative abundances of
one gene at a time, across various samples. An absolute consensus has not yet
been reached regarding the extent to which mRNA levels dictate protein
levels. However, most researchers agree that the RNA-to-protein (RTP) ratios
depend on multiple related processes and can vary greatly between genes and
gene categories and even display pathological disturbances181. Given this
multitude and heterogeneity of interactions that contribute to the RTP ratios,
a more straightforward approach is to investigate this relationship for each
gene individually. In paper II, a targeted mass spectrometry assay was
developed, allowing absolute quantification of protein molecules per cell. In
this method, isotope-labeled recombinant protein fragments of known
amounts were used as internal standards, enabling absolute quantification of
55 proteins with the PRM technology. The panel of 55 proteins was quantified
in 9 cell lines and 11 tissue samples. Normalization against number of cells is
straightforward for cell lines as there are convenient methods for cell
counting. For tissue samples, however, it is more challenging, due to the
heterogeneous composition of tissues and the extracellular matrix containing
a vast amount of protein. An essential part of this study was therefore to
develop a strategy for normalization against cell number in the tissue
samples. The targeted PRM approach was developed for the four core histone
subunits H2A, H2B, H3 and H4, previously shown to be evenly distributed
along the chromosomes182. In this way, the number of cells within each tissue
sample could be calculated. The results presented in this paper show that
there is a gene-specific ratio between absolute concentration of protein and
the corresponding mRNA levels (presented as normalized TPM values) that is
conserved across the various cell types and tissues included in the study. The
RTP ratios evaluated in this study showed high variability, ranging from 200
for the transcription factor MYBL2 to 220,000 for the serine protease
inhibitor SERPINB1. The diversity observed in this relatively small set of 55
genes indicates the importance of gene-specific characteristics and
posttranscriptional regulation in determining protein concentrations. How
these differences appear between genes and gene categories on a global scale
remains to be evaluated in the future. Our study suggests that each gene has
its own inherent ratio between mRNA and protein, owing to multiple
regulatory factors that collectively control the abundance of these related
molecules. This finding suggests that correlation between mRNA and protein
levels should be assessed for each gene individually, rather than being
generalized on a global scale, and that transcriptomics is a powerful tool with
great potential to guide proteomics studies in the right directions.
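
The idea of a gene-specific RTP ratio can be made concrete with a short
sketch. It assumes, for illustration only, that the ratio is summarized as
the median of protein copies per cell divided by TPM across samples, and that
protein abundance is then extrapolated as RTP multiplied by TPM; the exact
procedure is described in paper II. The two RTP values for MYBL2 and
SERPINB1 are the ones quoted in the text, whereas the paired measurements
and the TPM value are hypothetical.

```python
from statistics import median


def estimate_rtp(protein_copies_per_cell, tpm):
    """Median protein-to-mRNA ratio across samples for a single gene."""
    ratios = [p / t for p, t in zip(protein_copies_per_cell, tpm) if t > 0]
    return median(ratios)


def predict_protein(rtp, tpm):
    """Extrapolate protein copies per cell from an mRNA level (steady state)."""
    return rtp * tpm


# Hypothetical paired measurements for one made-up gene in three samples.
example_protein = [52_000.0, 98_000.0, 21_000.0]
example_tpm = [10.0, 20.0, 4.0]
example_rtp = estimate_rtp(example_protein, example_tpm)

# The two extreme ratios reported in the text, applied to a hypothetical
# mRNA level of 15 TPM.
reported_rtp = {"MYBL2": 200, "SERPINB1": 220_000}
hypothetical_tpm = 15.0
predicted = {gene: predict_protein(rtp, hypothetical_tpm)
             for gene, rtp in reported_rtp.items()}
print(example_rtp, predicted)
```

At the same mRNA level, the two reported ratios give predicted protein
abundances that differ by three orders of magnitude, which is why a single
global conversion factor would be misleading.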
Combining transcriptomics and protein profiling to study
malignant transformation (Paper III and IV)
Malignant transformation is a complex process involving multiple molecular
events that subsequently lead to a cancerous phenotype183. To characterize
the molecular changes underlying malignant transformation, we combined
transcriptomics and immunofluorescence protein profiling applied on a cell
line model for malignant transformation. The cell line model is of fibroblast
origin and was created in order to mimic the natural steps of malignant
transformation. This was achieved by the sequential addition of three
genetic alterations (see chapter 1 for more details). Transcriptomic profiling
of the four steps in the model indicated differential expression of about 6% of
the protein-coding genome, in at least one step going from the primary state
to the immortalized state, via the transformed state to the final metastasizing
state. Among the differentially expressed genes, a majority (80%) showed
downregulation over the course of increased malignancy. Downregulation
was shown to mainly affect genes encoding proteins expressed on surface
areas of the cell, while upregulation mainly affected genes involved in
proliferation. The unique repository of in-house produced antibodies in the
Human Protein Atlas enabled subsequent analysis of proteins with interesting
expression profiles throughout the model, identified by the transcriptomics
analysis184. Confocal imaging of immunofluorescence-stained cells enabled
proteins to be visualized with spatial resolution on a single cell level. In this
way, subcellular translocation could be observed, and upregulation on the
mRNA level was reflected not only in increased staining intensity in the IF
images but also in an increased number of cells expressing a certain protein,
for example in the case of Anillin.
The BJ model has several limitations: it is a cell line model adapted to an
in vitro environment, and it is of fibroblast origin, which is not the most
typical cell type to undergo malignant transformation in vivo.
However, genome-wide transcriptomics analysis of this model enabled the
identification of several interesting candidates with potential implications in
cancer. Some of these proteins were further evaluated in cohorts of prostate
and colon tumor tissues of matched malignancy grade. Here, the observed
differential expression of ANLN, ANPEP, ANXA1 and BDH1 in the BJ model
was confirmed on a larger scale, covering almost a hundred clinical tumor
samples. The expansion of this study to also include in vivo samples is
important and shows that the BJ model is a relevant model system for the
study of malignant transformation and increased proliferative capability. As a
follow-up study, we exposed this model to an environment containing 3%
oxygen. This is interesting from the perspective of tumor hypoxia, a
commonly encountered scenario in solid tumors where local areas are
deprived of oxygen, something tumor cells, but not normal cells, can survive.
In these hypoxic microenvironments, tumor cells can still proliferate due to
rewiring of molecular functions such as energy metabolism, increased
vascularization and metastatic behavior. Another interesting aspect of
cultivation in 3% oxygen is that it resembles the in vivo conditions better,
compared to the relatively extreme level of oxygen concentration provided in
the atmospheric in vitro cultivation environment. The findings from this study
indicate once again that the most dramatic change of identity occurs as a
result of the SV40 transformation within the model system. Not only did we
observe a higher number of genes differentially regulated in the
SV40-transformed stage, the differentially expressed genes shared with the
previous stage were in most cases regulated in the opposite direction.
Analysis of metabolic pathways involving differentially expressed genes
showed that several genes involved in carbohydrate, lipid and amino acid
metabolic pathways are upregulated in the transformed cells in response to
the altered oxygen supply. An important consideration is that the BJ model
was originally created in atmospheric oxygen conditions, limiting the
possibility to study the hypoxic phenotype in direct relation to the process of
malignant transformation itself. Still, we demonstrate that, even though the
cell lines of this isogenic model are highly similar from a global view, the
three genetic introductions have a large impact on the cells' response to a
perturbed environment. The differentially expressed genes that we identified
in paper III were again identified in the data generated in this study, in the
atmospheric condition, lending additional support to our previous findings.
Concluding observations and future perspectives
I find myself in the middle of an exciting era. Technology development is
moving fast. Only a decade ago, researching gene expression by examination
of RNA transcript content was limited to the use of hybridization-based
methods, covering only a subset of the cellular transcriptome. Today,
high-throughput sequencing has taken over and continues to evolve through
applications on the single cell level and even with spatial resolution185,186. In
the first paper covered by this thesis, we concluded that publicly available
RNA-seq datasets are fairly reproducible on a general level. This finding has
also been confirmed more recently by others, concluding that the GTEx data
as well as Fantom5 data show global consistency with the data generated by
the Human Protein Atlas, even when different transcriptomics methods are
employed155,187. These studies collectively show that transcriptomics has
developed into a highly reliable technology, indicating that the increasing
body of available datasets will continue to add strength to future
meta-analyses. The robustness of modern RNA-seq technology is not only
valuable within the field of transcriptomics itself. As we showed in paper II,
its potential extends beyond this, and RNA-seq can possibly be used to guide
exploration of the proteome. We have shown that there is a gene-specific
ratio between mRNA and protein abundances, indicating that RNA-seq can be
used to extrapolate the amount of protein in the cell. It is important to note
that this is only applicable to steady-state conditions and not valid in
scenarios where the cellular environment has been altered. However, as
experiments are commonly performed during steady-state conditions,
gene-specific RTP ratios can still be useful, especially if the identification of
RTP ratios is further expanded to involve the entire protein-coding genome.
This is theoretically possible considering the large repository of recombinant
protein fragments generated within the Human Protein Atlas. In order to gain
complete mechanistic understanding of the regulation of gene expression,
especially with regard to the great diversity seen among RTP ratios, more
detailed analyses should however be considered. Neither should we forget that
translational mechanisms are often dysregulated in the context of disease, for
example by repression through binding of microRNAs to mRNA
transcripts188,189. The great potential of RNA-seq to guide analyses aiming
to understand complex biological systems is well exemplified in the study
presented in paper III. With a transcriptomics view of the cells at different
stages of cancer development, we could identify a subset of interesting
proteins that were further studied in detail with spatial resolution on a
subcellular level. We have rapidly moved into a time where technology allows
for systematic exploration of entire genomes and transcriptomes with high
accuracy. The field of proteomics is not yet capable of reaching the same
coverage as the “-omics” fields concerned with nucleotides, as no universal
technique available is capable of quantifying the entire human proteome in a
single experiment96. The immense size and complexity of the human
proteome present a great challenge for the future. On the other hand, a
multitude of specific biological aspects related to the proteomic landscape are
perhaps more important for understanding the workings of the cell. Rather
than just focusing on global quantification, large efforts are likely to be
directed towards studies aimed at quantifying protein activity in terms of
PTMs, protein-protein interactions, spatial distribution as well as
stoichiometry of protein complexes with implications in cell signaling. I
speculate that advances over the last decade within the field of
transcriptomics will contribute vastly to guiding and focusing proteomics
studies. In the future, one may come to focus more on measuring a subset of
proteins isolated in a specific biological context, with high accuracy.
Integration of multiple strategies developed out of a systems biology
framework is needed to reach the ultimate goal of understanding the
enormous complexity of the cell.
Acknowledgements
Hi, how nice that you are reading my thesis! Perhaps you have just read the
other pages and pondered all this about mRNA, transcriptomics and proteins,
or, quite likely, you have eagerly flipped straight here, to the most important
page in the whole book. Many of you have been with me during these years,
inspiring, laughing, listening, discussing and making this possible. THANK YOU
to all supervisors, colleagues, collaborators, friends and family:
Emma – you have no idea how proud I am to have been your PhD student. Your
quick wit and your ability to see possibilities in almost anything and to make
complicated things seem completely obvious have been so inspiring! You are a
star in every way, and I will always carry with me the feeling after a
meeting with you, when nothing feels more fun and meaningful than getting to
do research.
Mikael – how lucky for me that I managed to grab you early as co-supervisor
before someone else took you. You have been absolutely fantastic! Thank you
for all your time, for everything you have taught me about bioinformatics, for
all the podcast tips and pleasant coffee meetings at Mellqvists.
Mathias – for your incredible enthusiasm, for starting HPA, and for teaching
me the importance of being able to summarize one's research in a single
sentence.
To the whole Cell Profiling group, you have all been fantastic, one cell is no
cell! Mikaela, Diana and Lovisa – my San Fran Sisters, because we always
support each other and because you are just so funny, smart and wonderful!
Without you this book would never have been finished. Charlotte – my
research big sister when I started as a PhD student, you have been a role
model in many ways, both at and outside work. Marie – my first supervisor
during my degree project, thank you for all the fun excursions, strange fruits
and for your straightforward attitude. Even though you left us for Dalarna in
March, you still feel like a completely natural part of the group. Anna and
Jenny – for being so wonderful to have around, working with you is a luxury.
Hammou – for the best rampage in Philly and for your unbeatable grasp of
world affairs. Martin – for good bread and a reassuring calm that has been of
very great help through all the years. Christian and Peter – thank you for all
the good times, you are brilliant. Lasse – for your endless knowledge,
curiosity and helpfulness, you basically know... everything? Devin – for all
the fun and for always helping out making my transitions from spoken to
written text more rapid and direct. Casper – for pleasant chats and good
answers to questions about machine learning and more. Ulla and Annica – the
queens of the psychoplasm, you have been missed.
Annica Gad – for a great collaboration and valuable input from a real
biologist's perspective. Erik Fasterius – for smart questions and pleasant
meetings about RNA-seq and more, and for all the support during the final
sprint! Cheng and Kemal – for making nice figures and for smooth
collaboration. Tove B and Fredrik E – for everything you have taught me
about MS.
Sanna – my oldest friend at work, it has been so nice to have you close during
the thesis work, and how fun it was that we got time to hang out properly in
Greece! I always feel a little calmer after talking to you.
Arash and Fredrik – for the best (bathing) company and for saving my life
during the wild river rafting in Brixen! Tony – for giving the lab a touch of
glamour and teaching me not to frown too much. Beata – for many pleasant
lunches and coffee breaks. Ina – for inspiring us and for integrating art and
yoga into our research world. Leo – for being such a social animal, for never
missing out on cakes and for fantastic guided tours in Vienna. Tove A, Per,
Åsa, HPA IT, Björn H, Figge, Evelina and all the other HPA people – thank
you for all the good times! Cecilia Williams – for valuable comments and the
review of my thesis. Per Kraulis – for many good questions during
presentations I have given over the years, you are one of SciLife's true
heroes! Claudia – for good explanations of MS principles. Ola – for good
guidance in the field of translation. Ulf – for lighting up all of floor 2 and
making it such a much nicer workplace, how lucky that you came back from
Omaha! Johannes and Axel – BAM! It has been great fun hanging out with you
and organizing seminars, I hope you continue in the same style. All the
professors and PIs who make floor 2 the best: Peter and Jochen, Cecilia, Jan,
Jacob, Bo and Ola.
Ulrika, Maia, Kimi, Ronald, Björn, Cissi, Maria, David, Tea, Julia,
Mun-Gwan, Elin, Eni, Phil, Maria, Lucia, Mahya, Julie, Fahmi and Laura, among
others – for the great atmosphere on the best floor. All the fun characters on
floor 3: Mickan, Erik, Anders, Fredrik, Nemo, Sverker and others. The Seppen
assistants and AlbaNova people: Sara K, Tove, Tarek, Sara L, Micke, Jenny,
Cristina, Per-Åke, Stefan.
All my friends outside the research world who remind me what the building
blocks of life really are. And of course Mum, Dad, my beloved sister Kajsa,
Tobias, Ivar and Irma, for your boundless support in everything I do, you are
the very best! Aunt Ulla for eternal inspiration and kind support from the
perspective of someone who has already defended her thesis. Stephen for
taking the time to untangle me, over the phone, from long convoluted
sentences in English. Grandma Lisbeth for a wonderful break from the writing
with a visit to you in Vilhelmina in its autumn beauty. Lena, Lasse, Martin,
Johanna, Alvar, Lovisa and Gethy, I am so grateful to be part of your family
too. Viktor – for your warmth and curiosity, your endless patience with all
the late evenings when I have sat glued to the computer, all the new worlds
you open up for me, and for simply being there!
Bibliography
1. Bianconi, E. et al. An estimation of the number of cells in the human
body. Ann. Hum. Biol. 40, 463–471 (2013).
2. Morrison, D., Wolff, S., Fraknoi, A., Pasachoff, J. M. & Pinsky, L. S.
Abell's Exploration of the Universe, and Astronomy: From the Earth to the
Universe. Physics Today 49, 70–70 (1996).
3. Gest, H. The discovery of microorganisms by Robert Hooke and Antoni Van
Leeuwenhoek, fellows of the Royal Society. Notes and records of the Royal
Society of London 58, 187–201 (2004).
4. Fleming, D., Needham, J., Grant, E. & Roger, J. The DSB: A Review
Symposium The Dictionary of Scientific Biography. Charles Coulston
Gillispie. Isis 71, 633–652 (1980).
5. Mikroskopische Untersuchungen über die Übereinstimmung in der Struktur
und dem Wachstume der Tiere und Pflanzen. Nature 86, 41–41 (1911).
6. Cooper, G. M. & Hausman, R. E. The Cell.
7. Burndy Library, Mendel, G. & Punnett, R. C. Versuche über
Pflanzen-Hybriden. (Im Verlage des Vereines, 1866).
doi:10.5962/bhl.title.61004
8. Dahm, R. Discovering DNA: Friedrich Miescher and the early years of
nucleic acid research. Human genetics 122, 565–581 (2008).
9. Choudhuri, S. The Path from Nuclein to Human Genome: A Brief History of
DNA with a Note on Human Genome Sequencing and Its Impact on Future
Research in Biology. Bulletin of Science, Technology and Society 23,
360–367 (2003).
10. Watson, J. D. & Crick, F. H. Genetical implications of the structure of
deoxyribonucleic acid. Nature 171, 964–967 (1953).
11. Gerstein, M. B. et al. What is a gene, post-ENCODE? History and updated
definition. Genome Res. 17, 669–681 (2007).
12. Astrachan, L. & Volkin, E. The absence of ribonucleic acid in
bacteriophage T2r+. Virology 2, 594–598 (1956).
13. Jacob, F. & Monod, J. Genetic regulatory mechanisms in the synthesis of
proteins. J. Mol. Biol. 3, 318–356 (1961).
14. Brenner, S., Jacob, F. & Meselson, M. An unstable intermediate carrying
information from genes to ribosomes for protein synthesis. Nature 190,
576–581 (1961).
15. Gros, F. et al. Unstable ribonucleic acid revealed by pulse labelling of
Escherichia coli. Nature 190, 581–585 (1961).
16. Djebali, S. et al. Landscape of transcription in human cells. Nature 489,
101–108 (2012).
17. Wang, X. Next-Generation Sequencing Data Analysis. (CRC Press, 2016).
doi:10.1201/b19532
18. Morris, K. V. & Mattick, J. S. The rise of regulatory RNA. Nat. Rev.
Genet. 15, 423–437 (2014).
19. Wang, E. T. et al. Alternative isoform regulation in human tissue
transcriptomes. Nature 456, 470–476 (2008).
20. Tan, S. C. & Yiap, B. C. DNA, RNA, and Protein Extraction: The Past and
The Present. Journal of Biomedicine and Biotechnology 2009, 1–10 (2009).
21. Perrett, D. From ‘protein’ to the beginnings of clinical proteomics.
Proteomics Clin Appl 1, 720–738 (2007).
22. Vickery, H. B. The origin of the word protein. Yale J Biol Med 22,
387–393 (1950).
23. Kitano, H. Biological robustness. Nat. Rev. Genet. 5, 826–837 (2004).
24. Lowe, M. & Barr, F. A. Inheritance and biogenesis of organelles in the
secretory pathway. Nature Reviews Molecular Cell Biology 8, 429–439 (2007).
25. Stoeger, T., Battich, N. & Pelkmans, L. Passive Noise Filtering by
Cellular Compartmentalization. Cell 164, 1151–1161 (2016).
26. Nucleic Acids and Protein Synthesis. Nature 177, 602–603 (1956).
27. Crick, F. H. On protein synthesis. Symp. Soc. Exp. Biol. 12, 138–163
(1958).
28. Baltimore, D. RNA-dependent DNA polymerase in virions of RNA tumour
viruses. Nature 226, 1209–1211 (1970).
29. Temin, H. M. & Mizutani, S. RNA-dependent DNA polymerase in virions of
Rous sarcoma virus. Nature 226, 1211–1213 (1970).
30. Central dogma reversed. Nature 226, 1198–1199 (1970).
31. Crick, F. Central dogma of molecular biology. Nature 227, 561–563 (1970).
32. Burndy Library, Darwin, C., Edmonds & Remnants, Henry Sotheran Ltd. On
the origin of species by means of natural selection, or, The preservation of
favoured races in the struggle for life. (John Murray, 1859).
doi:10.5962/bhl.title.59991
33. World Cancer Report 2014. International Agency for Research on Cancer
34. Maqsood, M. I., Matin, M. M., Bahrami, A. R. & Ghasroldasht, M. M.
Immortality of cell lines: challenges and advantages of establishment. Cell
Biol. Int. 37, 1038–1045 (2013).
35. Burdall, S. E., Hanby, A. M., Lansdown, M. R. J. & Speirs, V. Breast
cancer cell lines: friend or foe? Breast Cancer Res. 5, 89–95 (2003).
36. Osborne, C. K., Hobbs, K. & Trent, J. M. Biological differences among
MCF-7 human breast cancer cell lines from different laboratories. Breast
Cancer Research and Treatment 9, 111–121 (1987).
37. Moore, J. M. The Immortal Life of Henrietta Lacks. By Rebecca Skloot.
2010. Crown Publishing. (ISBN 9781400052172). 384 pages. Hardcover. $26.00.
The American Biology Teacher 72, 586–587 (2010).
38. Hanahan, D. & Weinberg, R. A. The hallmarks of cancer. Cell 100, 57–70
(2000).
39. Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: the next generation.
Cell 144, 646–674 (2011).
40. Harrington, L. et al. Human telomerase contains evolutionarily conserved
catalytic and structural subunits. Genes & Development 11, 3109–3115 (1997).
41. Pütz, S. M., Vogiatzi, F., Stiewe, T. & Sickmann, A. Malignant
transformation in a defined genetic background: proteome changes displayed
by 2D-PAGE. Mol. Cancer 9, 254 (2010).
42. Rich, J. N. et al. A genetically tractable model of human glioma
formation. Cancer Res. 61, 3556–3560 (2001).
43. Yu, J., Boyapati, A. & Rundell, K. Critical Role for SV40 Small-t Antigen
in Human Cell Transformation. Virology 290, 192–198 (2001).
44. Elenbaas, B. et al. Human breast cancer cells generated by oncogenic
transformation of primary mammary epithelial cells. Genes & Development 15,
50–65 (2001).
45. MacKenzie, K. L. et al. Multiple stages of malignant transformation of
human endothelial cells modelled by co-expression of telomerase reverse
transcriptase, SV40 T antigen and oncogenic N-ras. Oncogene 21, 4200–4211
(2002).
46. Lundberg, A. S. et al. Immortalization and transformation of primary
human airway epithelial cells by gene transfer. Oncogene 21, 4577–4586
(2002).
47. Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for
transcriptomics. Nat. Rev. Genet. 10, 57–63 (2009).
48. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B.
Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5,
621–628 (2008).
49. Nagalakshmi, U. et al. The transcriptional landscape of the yeast genome
defined by RNA sequencing. Science 320, 1344–1349 (2008).
50. Adams, M. D. et al. Complementary DNA sequencing: expressed sequence tags
and human genome project. Science 252, 1651–1656 (1991).
51. Bainbridge, M. N. et al. Analysis of the prostate cancer cell line LNCaP
transcriptome using a sequencing-by-synthesis approach. BMC Genomics 7, 246
(2006).
52. Cheung, F. et al. Sequencing Medicago truncatula expressed sequenced tags
using 454 Life Sciences technology. BMC Genomics 7, 272 (2006).
53. Emrich, S. J., Barbazuk, W. B., Li, L. & Schnable, P. S. Gene discovery
and annotation using LCM-454 transcriptome sequencing. Genome Res. 17, 69–73
(2007).
54. Weber, A. P. M. Discovering New Biology through Sequencing of RNA. Plant
Physiol. 169, 1524–1531 (2015).
55. Garalde, D. R. et al. Highly parallel direct RNA sequencing on an array
of nanopores. bioRxiv 068809 (Cold Spring Harbor Labs Journals, 2016).
doi:10.1101/068809
56. Schroeder, A. et al. The RIN: an RNA integrity number for assigning
integrity values to RNA measurements. BMC Mol. Biol. 7, 3 (2006).
57. Aken, B. L. et al. The Ensembl gene annotation system. Database (Oxford)
2016, baw093 (2016).
58. Rosenbloom, K. R. et al. The UCSC Genome Browser database: 2015 update.
Nucleic Acids Res. 43, D670–81 (2015).
59. Trapnell, C. et al. Differential gene and transcript expression analysis
of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7, 562–578
(2012).
60. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and
memory-efficient alignment of short DNA sequences to the human genome.
Genome Biol. 10, R25 (2009).
61. Pertea, M., Kim, D., Pertea, G. M., Leek, J. T. & Salzberg, S. L.
Transcript-level expression analysis of RNA-seq experiments with HISAT,
StringTie and Ballgown. Nat Protoc 11, 1650–1667 (2016).
62. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner.
Bioinformatics 29, 15–21 (2013).
63. Li, B., Ruotti, V., Stewart, R. M., Thomson, J. A. & Dewey, C. N. RNA-Seq
gene expression estimation with read mapping uncertainty. Bioinformatics 26,
493–500 (2010).
64. Wagner, G. P., Kin, K. & Lynch, V. J. Measurement of mRNA abundance using
RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci.
131, 281–285 (2012).
65. Pachter, L. Models for transcript quantification from RNA-Seq. (2011).
66. Patro, R., Mount, S. M. & Kingsford, C. Sailfish enables alignment-free
isoform quantification from RNA-seq reads using lightweight algorithms. Nat.
Biotechnol. 32, 462–464 (2014).
67. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal
probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
68. Liao, Y., Smyth, G. K. & Shi, W. featureCounts: an efficient general
purpose program for assigning sequence reads to genomic features.
Bioinformatics 30, 923–930 (2014).
69. Anders, S., Pyl, P. T. & Huber, W. HTSeq--a Python framework to work with
high-throughput sequencing data. Bioinformatics 31, 166–169 (2015).
70. Trapnell, C. et al. Differential analysis of gene regulation at
transcript resolution with RNA-seq. Nat. Biotechnol. 31, 46–53 (2013).
71. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C.
Salmon provides accurate, fast, and bias-aware transcript expression
estimates using dual-phase inference. bioRxiv 021592 (2016).
doi:10.1101/021592
72. Patro, R., Mount, S. M. & Kingsford, C. Sailfish enables alignment-free
isoform quantification from RNA-seq reads using lightweight algorithms. Nat.
Biotechnol. 32, 462–464 (2014).
73. Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from
RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323
(2011).
74. Pimentel, H. J., Bray, N., Puente, S., Melsted, P. & Pachter, L.
Differential analysis of RNA-Seq incorporating quantification uncertainty.
bioRxiv 058164 (2016). doi:10.1101/058164
75. Soneson, C., Love, M. I. & Robinson, M. D. Differential analyses for
RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res
4, 1521 (2015).
76. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change
and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
77. Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor
package for differential expression analysis of digital gene expression
data. Bioinformatics 26, 139–140 (2010).
78. McCarthy, D. J., Chen, Y. & Smyth, G. K. Differential expression analysis
of multifactor RNA-Seq experiments with respect to biological variation.
Nucleic Acids Res. 40, 4288–4297 (2012).
79. Kanter, I. & Kalisky, T. Single cell transcriptomics: methods and
applications. Front Oncol 5, 53 (2015).
80. Norel, R., Rice, J. J. & Stolovitzky, G. The self-assessment trap: can we
all be better than average? Molecular Systems Biology 7, 537–537 (2011).
81. Wilkins, M. R. et al. From Proteins to Proteomes: Large Scale Protein
Identification by Two-Dimensional Electrophoresis and Amino Acid Analysis.
Nat. Biotechnol. 14, 61–65 (1996).
82. Rabilloud, T., Hochstrasser, D. & Simpson, R. J. Is a gene-centric human
proteome project the best way for proteomics to serve biology? Proteomics
10, 3067–3072 (2010).
83. Ponten, F. et al. A global view of protein expression in human cells,
tissues, and organs. Molecular Systems Biology 5, 337 (2009).
84. Lundberg, E. et al. Defining the transcriptome and proteome in three
functionally different human cell lines. Molecular Systems Biology 6, 450
(2010).
85. Geiger, T., Wehner, A., Schaab, C., Cox, J. & Mann, M. Comparative
proteomic analysis of eleven common cell lines reveals ubiquitous but
varying expression of most proteins. Mol. Cell Proteomics 11,
M111.014050–M111.014050 (2012).
86. Azimifar, S. B., Nagaraj, N., Cox, J. & Mann, M.
Cell-Type-Resolved Quantitative Proteomics of Murine Liver. Cell
Metabolism 20, 1076–1087 (2014).
87. Richards, A. L., Merrill, A. E. & Coon, J. J. Proteome sequencing
goes deep. Current Opinion in Chemical Biology 24, 11–17
(2015).
88. Sharma, K. et al. Cell type– and brain region–resolved mouse
brain proteome. Nature Neuroscience 18, 1819–1831 (2015).
89. Towbin, H., Staehelin, T. & Gordon, J. Electrophoretic transfer of
proteins from polyacrylamide gels to nitrocellulose sheets:
procedure and some applications. Proceedings of the National
Academy of Sciences 76, 4350–4354 (1979).
90. Engvall, E. & Perlmann, P. Enzyme-linked immunosorbent
assay (ELISA). Quantitative assay of immunoglobulin G.
Immunochemistry 8, 871–874 (1971).
91. O'Farrell, P. H. High resolution two-dimensional electrophoresis
of proteins. J. Biol. Chem. 250, 4007–4021 (1975).
92. Kim, M.-S. et al. A draft map of the human proteome. Nature
509, 575–581 (2014).
93. Wilhelm, M. et al. Mass-spectrometry-based draft of the human
proteome. Nature 509, 582–587 (2014).
94. Savitski, M. M., Wilhelm, M., Hahne, H., Kuster, B. & Bantscheff,
M. A Scalable Approach for Protein False Discovery Rate
Estimation in Large Proteomic Data Sets. Mol. Cell Proteomics
14, 2394–2404 (2015).
95. Ezkurdia, I., Vázquez, J., Valencia, A. & Tress, M. Analyzing the
first drafts of the human proteome. J. Proteome Res. 13,
3854–3855 (2014).
96. Omenn, G. S. et al. Metrics for the Human Proteome Project
2016: Progress on Identifying and Characterizing the Human
Proteome, Including Post-Translational Modifications. J.
Proteome Res. acs.jproteome.6b00511 (2016).
doi:10.1021/acs.jproteome.6b00511
97. Uhlén, M. et al. Proteomics. Tissue-based map of the human
proteome. Science 347, 1260419–1260419 (2015).
98. Fenn, J. B., Mann, M., Meng, C. K., Wong, S. F. & Whitehouse,
C. M. Electrospray ionization for mass spectrometry of large
biomolecules. Science 246, 64–71 (1989).
99. Karas, M. & Hillenkamp, F. Laser desorption ionization of
proteins with molecular masses exceeding 10,000 daltons.
Analytical Chemistry 60, 2299–2301 (1988).
100. Steen, H. & Mann, M. The ABC's (and XYZ's) of peptide
sequencing. Nature Reviews Molecular Cell Biology 5, 699–711
(2004).
101. Aebersold, R. & Mann, M. Mass spectrometry-based
proteomics. Nature 422, 198–207 (2003).
102. Payne, A. H. & Glish, G. L. in Biological Mass Spectrometry 402,
109–148 (Elsevier, 2005).
103. Perkins, D. N., Pappin, D. J., Creasy, D. M. & Cottrell, J. S.
Probability-based protein identification by searching sequence
databases using mass spectrometry data. Electrophoresis 20,
3551–3567 (1999).
104. Bantscheff, M., Lemeer, S., Savitski, M. M. & Kuster, B.
Quantitative mass spectrometry in proteomics: critical review
update from 2007 to the present. Anal Bioanal Chem 404,
939–965 (2012).
105. Ong, S.-E. et al. Stable isotope labeling by amino acids in cell
culture, SILAC, as a simple and accurate approach to
expression proteomics. Molecular & Cellular Proteomics 1,
376–386 (2002).
106. Mirgorodskaya, O. A. et al. Quantitation of peptides and proteins
by matrix-assisted laser desorption/ionization mass
spectrometry using (18)O-labeled internal standards. Rapid
Communications in Mass Spectrometry 14, 1226–1232 (2000).
107. Chakraborty, A. & Regnier, F. E. Global internal standard
technology for comparative proteomics. J Chromatogr A 949,
173–184 (2002).
108. Hsu, J.-L., Huang, S.-Y., Chow, N.-H. & Chen, S.-H.
Stable-isotope dimethyl labeling for quantitative proteomics.
Analytical Chemistry 75, 6843–6852 (2003).
109. Gygi, S. P. et al. Quantitative analysis of complex protein
mixtures using isotope-coded affinity tags. Nat. Biotechnol. 17,
994–999 (1999).
110. Thompson, A. et al. Tandem Mass Tags: A Novel Quantification
Strategy for Comparative Analysis of Complex Protein Mixtures
by MS/MS. Analytical Chemistry 75, 4942–4942 (2003).
111. Ross, P. L. Multiplexed Protein Quantitation in Saccharomyces
cerevisiae Using Amine-reactive Isobaric Tagging Reagents.
Molecular & Cellular Proteomics 3, 1154–1169 (2004).
112. Brun, V., Masselon, C., Garin, J. & Dupuis, A. Isotope dilution
strategies for absolute quantitative proteomics. Journal of
Proteomics 72, 740–749 (2009).
113. Pratt, J. M. et al. Multiplexed absolute quantification for
proteomics using concatenated signature peptides encoded by
QconCAT genes. Nat Protoc 1, 1029–1043 (2006).
114. Gerber, S. A., Rush, J., Stemman, O., Kirschner, M. W. & Gygi,
S. P. Absolute quantification of proteins and phosphoproteins
from cell lysates by tandem MS. Proceedings of the National
Academy of Sciences 100, 6940–6945 (2003).
115. Hanke, S., Besir, H., Oesterhelt, D. & Mann, M. Absolute SILAC
for Accurate Quantitation of Proteins in Complex Mixtures Down
to the Attomole Level. J. Proteome Res. 7, 1118–1130 (2008).
116. Brun, V. et al. Isotope-labeled protein standards: toward
absolute quantitative proteomics. Molecular & Cellular
Proteomics 6, 2139–2149 (2007).
117. Singh, S., Springer, M., Steen, J., Kirschner, M. W. & Steen, H.
FLEXIQuant: A Novel Tool for the Absolute Quantification of
Proteins, and the Simultaneous Identification and Quantification
of Potentially Modified Peptides. J. Proteome Res. 8,
2201–2210 (2009).
118. Zeiler, M., Straube, W. L., Lundberg, E., Uhlén, M. & Mann, M. A
Protein Epitope Signature Tag (PrEST) Library Allows
SILAC-based Absolute Quantification and Multiplexed Determination of
Protein Copy Numbers in Cell Lines. Molecular & Cellular
Proteomics 11, O111.009613–O111.009613 (2012).
119. Kuzyk, M. A. et al. Multiple reaction monitoring-based,
multiplexed, absolute quantitation of 45 proteins in human
plasma. Mol. Cell Proteomics 8, 1860–1877 (2009).
120. Domanski, D. et al. MRM-based multiplexed quantitation of 67
putative cardiovascular disease biomarkers in human plasma.
Proteomics 12, 1222–1243 (2012).
121. Anderson, L. Quantitative Mass Spectrometric Multiple Reaction
Monitoring Assays for Major Plasma Proteins. Molecular &
Cellular Proteomics 5, 573–588 (2005).
122. Picotti, P. & Aebersold, R. Selected reaction monitoring–based
proteomics: workflows, potential, pitfalls and future directions.
Nat. Methods 9, 555–566 (2012).
123. Gallien, S. et al. Targeted proteomic quantification on
quadrupole-orbitrap mass spectrometer. Mol. Cell Proteomics
11, 1709–1723 (2012).
124. Yalow, R. S. & Berson, S. A. Immunoassay of endogenous
plasma insulin in man. Journal of Clinical Investigation 39,
1157–1175 (1960).
125. Stryer, L. Fluorescence energy transfer as a spectroscopic ruler.
Annual Review of Biochemistry 47, 819–846 (1978).
126. Snapp, E. L., Altan, N. & Lippincott-Schwartz, J. Measuring
Protein Mobility by Photobleaching GFP Chimeras in Living
Cells. 21.1.1–21.1.24 (John Wiley & Sons, Inc., 2001).
doi:10.1002/0471143030.cb2101s19
127. Stadler, C., Skogs, M., Brismar, H., Uhlén, M. & Lundberg, E. A
single fixation protocol for proteome-wide immunofluorescence
localization studies. Journal of Proteomics 73, 1067–1078
(2010).
128. Stokes, G. G. On the Change of Refrangibility of Light.
Philosophical Transactions of the Royal Society of London 142,
463–562 (1852).
129. Cleary, M. D., Meiering, C. D., Jan, E., Guymon, R. &
Boothroyd, J. C. Biosynthetic labeling of RNA with uracil
phosphoribosyltransferase allows cell-specific microarray
analysis of mRNA synthesis and decay. Nat. Biotechnol. 23,
232–237 (2005).
130. Neymotin, B., Athanasiadou, R. & Gresham, D. Determination of
in vivo RNA kinetics using RATE-seq. RNA 20, 1645–1652
(2014).
131. Gray, J. M. et al. SnapShot-Seq: a method for extracting
genome-wide, in vivo mRNA dynamics from a single total RNA
sample. PLoS ONE 9, e89673 (2014).
132. Wethmar, K., Smink, J. J. & Leutz, A. Upstream open reading
frames: Molecular switches in (patho)physiology. BioEssays 32,
885–893 (2010).
133. Barrett, L. W., Fletcher, S. & Wilton, S. D. Regulation of
eukaryotic gene expression by the untranslated gene regions
and other non-coding elements. Cell. Mol. Life Sci. 69,
3613–3634 (2012).
134. Liu, Y., Beyer, A. & Aebersold, R. On the Dependency of
Cellular Protein Levels on mRNA Abundance. Cell 165,
535–550 (2016).
135. Boisvert, F.-M. et al. A quantitative spatial proteomics analysis
of proteome turnover in human cells. Mol. Cell Proteomics 11,
M111.011429–M111.011429 (2012).
136. Brar, G. A. & Weissman, J. S. Ribosome profiling reveals the
what, when, where and how of protein synthesis. Nature
Reviews Molecular Cell Biology 16, 651–664 (2015).
137. Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. S. &
Weissman, J. S. Genome-wide analysis in vivo of translation
with nucleotide resolution using ribosome profiling. Science 324,
218–223 (2009).
138. Piccirillo, C. A., Bjur, E., Topisirovic, I., Sonenberg, N. &
Larsson, O. Translational control of immune responses: from
transcripts to translatomes. Nature Immunology 15, 503–511
(2014).
139. Lundberg, E. et al. Defining the transcriptome and proteome in
three functionally different human cell lines. Molecular Systems
Biology 6, 450 (2010).
140. Wilhelm, M. et al. Mass-spectrometry-based draft of the human
proteome. Nature 509, 582–587 (2014).
141. Anderson, N. L. The Human Plasma Proteome: History,
Character, and Diagnostic Prospects. Molecular & Cellular
Proteomics 1, 845–867 (2002).
142. Uhlén, M. et al. A proposal for validation of antibodies. Nat.
Methods 13, 823–827 (2016).
143. Lin, J.-R., Fallahi-Sichani, M. & Sorger, P. K. Highly multiplexed
imaging of single cells using a high-throughput cyclic
immunofluorescence method. Nature Communications 6, 8390
(2015).
144. Bandura, D. R. et al. Mass cytometry: technique for real time
single cell multitarget immunoassay based on inductively
coupled plasma time-of-flight mass spectrometry. Analytical
Chemistry 81, 6813–6822 (2009).
145. Li, Y. & Chen, L. Big Biological Data: Challenges and
Opportunities. Genomics, Proteomics & Bioinformatics 12,
187–189 (2014).
146. Marx, V. Biology: The big challenges of big data. Nature 498,
255–260 (2013).
147. Barrett, T. et al. NCBI GEO: archive for functional genomics
data sets--update. Nucleic Acids Res. 41, D991–5 (2013).
148. Rustici, G. et al. ArrayExpress update--trends in database
growth and links to data analysis tools. Nucleic Acids Res. 41,
D987–90 (2013).
149. Petryszak, R. et al. Expression Atlas update--an integrated
database of gene and protein expression in humans, animals
and plants. Nucleic Acids Res. 44, D746–52 (2016).
150. GTEx Consortium. The Genotype-Tissue Expression (GTEx)
project. Nat. Genet. 45, 580–585 (2013).
151. Uhlén, M. et al. Proteomics. Tissue-based map of the human
proteome. Science 347, 1260419–1260419 (2015).
152. Lizio, M. et al. Gateways to the FANTOM5 promoter level
mammalian expression atlas. Genome Biol. 16, 22 (2015).
153. FANTOM Consortium and the RIKEN PMI and CLST (DGT) et
al. A promoter-level mammalian expression atlas. Nature 507,
462–470 (2014).
154. Danielsson, F., James, T., Gomez-Cabrero, D. & Huss, M.
Assessing the consistency of public human tissue RNA-seq data
sets. Brief. Bioinformatics 16, 941–949 (2015).
155. Uhlén, M. et al. Transcriptomics resources of human tissues and
organs. Molecular Systems Biology 12, 862–862 (2016).
156. Yu, N. Y.-L. et al. Complementing tissue characterization by
integrating transcriptome profiling from the Human Protein Atlas
and from the FANTOM5 consortium. Nucleic Acids Res. 43,
6787–6798 (2015).
157. Antequera, F. & Bird, A. Number of CpG islands and genes in
human and mouse. Proceedings of the National Academy of
Sciences 90, 11995–11999 (1993).
158. Fields, C., Adams, M. D., White, O. & Venter, J. C. How many
genes in the human genome? Nat. Genet. 7, 345–346 (1994).
159. Lander, E. S. et al. Initial sequencing and analysis of the human
genome. Nature 409, 860–921 (2001).
160. Venter, J. C. et al. The sequence of the human genome.
Science 291, 1304–1351 (2001).
161. Cunningham, F. et al. Ensembl 2015. Nucleic Acids Res. 43,
D662–9 (2015).
162. Breuza, L. et al. The UniProtKB guide to the human proteome.
Database (Oxford) 2016, bav120 (2016).
163. Vizcaíno, J. A. et al. 2016 update of the PRIDE database and its
related tools. Nucleic Acids Res. 44, D447–56 (2016).
164. Uhlén, M. et al. A human protein atlas for normal and cancer
tissues based on antibody proteomics. Molecular & Cellular
Proteomics 4, 1920–1932 (2005).
165. Barbe, L. et al. Toward a confocal subcellular atlas of the human
proteome. Mol. Cell Proteomics 7, 499–508 (2008).
166. Thul, P. et al. A subcellular map of the human proteome.
167. Ashburner, M. et al. Gene Ontology: tool for the unification of
biology. Nat. Genet. 25, 25–29 (2000).
168. Jantzen, S. G., Sutherland, B. J., Minkley, D. R. & Koop, B. F.
GO Trimming: Systematically reducing redundancy in large
Gene Ontology datasets. BMC Research Notes 4, 267 (2011).
169. Yu, G., Wang, L.-G., Han, Y. & He, Q.-Y. clusterProfiler: an R
Package for Comparing Biological Themes Among Gene
Clusters. OMICS: A Journal of Integrative Biology 16, 284–287
(2012).
170. Alexa, A. & Rahnenfuhrer, J. topGO: Enrichment Analysis for
Gene Ontology.
171. Yu, G., Wang, L.-G., Han, Y. & He, Q.-Y. clusterProfiler: an R
Package for Comparing Biological Themes Among Gene
Clusters. OMICS: A Journal of Integrative Biology 16, 284–287
(2012).
172. Eden, E., Navon, R., Steinfeld, I., Lipson, D. & Yakhini, Z.
GOrilla: a tool for discovery and visualization of enriched GO
terms in ranked gene lists. BMC Bioinformatics 10, 48 (2009).
173. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and
integrative analysis of large gene lists using DAVID
bioinformatics resources. Nat. Protocols 4, 44–57 (2008).
174. Pathan, M. et al. FunRich: An open access standalone
functional enrichment and interaction network analysis tool.
Proteomics 15, 2597–2601 (2015).
175. Edwards, A. M. et al. Too many roads not taken. Nature 470,
163–165 (2011).
176. Dewhirst, M. W., Cao, Y. & Moeller, B. Cycling hypoxia and free
radicals regulate angiogenesis and radiotherapy response. Nat.
Rev. Cancer 8, 425–437 (2008).
177. Asmann, Y. W. et al. Detection of redundant fusion transcripts
as biomarkers or disease-specific therapeutic targets in breast
cancer. Cancer Res. 72, 1921–1928 (2012).
178. Minikel, E. V. Tissue-specific gene expression data based on
Human BodyMap 2.0. (2013). Available at:
http://www.cureffi.org/2013/07/11/tissue-specific-gene-expression-data-based-on-human-bodymap-2-0/.
(Accessed: 11 November 2016)
179. Labeit, S., Kolmerer, B. & Linke, W. A. The Giant Protein Titin:
Emerging Roles in Physiology and Pathophysiology. Circulation
Research 80, 290–294 (1997).
180. McManus, J., Cheng, Z. & Vogel, C. Next-generation analysis of
gene expression regulation--comparing the roles of synthesis
and degradation. Mol Biosyst 11, 2680–2689 (2015).
181. Piccirillo, C. A., Bjur, E., Topisirovic, I., Sonenberg, N. &
Larsson, O. Translational control of immune responses: from
transcripts to translatomes. Nature Immunology 15, 503–511
(2014).
182. van Holde, K. E. Chromatin. (Springer New York, 1989).
doi:10.1007/978-1-4612-3490-6
183. Vogelstein, B. & Kinzler, K. W. The multistep nature of cancer.
Trends Genet. 9, 138–141 (1993).
184. Nilsson, P. et al. Towards a human proteome atlas:
high-throughput generation of mono-specific antibodies for tissue
profiling. Proteomics 5, 4327–4337 (2005).
185. Shapiro, E., Biezuner, T. & Linnarsson, S. Single-cell
sequencing-based technologies will revolutionize
whole-organism science. Nat. Rev. Genet. 14, 618–630 (2013).
186. Ståhl, P. L. et al. Visualization and analysis of gene expression
in tissue sections by spatial transcriptomics. Science 353, 78–82
(2016).
187. Yu, N. Y.-L. et al. Complementing tissue characterization by
integrating transcriptome profiling from the Human Protein Atlas
and from the FANTOM5 consortium. Nucleic Acids Res. 43,
6787–6798 (2015).
188. Grzmil, M. & Hemmings, B. A. Translation regulation as a
therapeutic target in cancer. Cancer Res. 72, 3891–3900
(2012).
189. Ha, M. & Kim, V. N. Regulation of microRNA biogenesis. Nature
Reviews Molecular Cell Biology 15, 509–524 (2014).