Download Profiling adaptive immune repertoires across multiple human tissues

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Immune system wikipedia , lookup

Hygiene hypothesis wikipedia , lookup

12-Hydroxyeicosatetraenoic acid wikipedia , lookup

Adaptive immune system wikipedia , lookup

Innate immune system wikipedia , lookup

Cancer immunotherapy wikipedia , lookup

Immunosuppressive drug wikipedia , lookup

Polyclonal B cell response wikipedia , lookup

Adoptive cell transfer wikipedia , lookup

Psychoneuroimmunology wikipedia , lookup

Molecular mimicry wikipedia , lookup

Immunomics wikipedia , lookup

Transcript
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Title:ProfilingadaptiveimmunerepertoiresacrossmultiplehumantissuesbyRNASequencing
Authors:SergheiMangul*#1,2,IgorMandric*3,HarryTaegyunYang1,NicolasStrauli4,Dennis
Montoya5,JeremyRotman1,WillVanDerWey1,JiemR.Ronas6,BenjaminStatz1,DouglasYao5,
AlexZelikovsky3,RobertoSpreafico2,SagivShifman**7,NoahZaitlen**8,MauraRossetti**9,
K.MarkAnsel**10,EleazarEskin**#1,11
Affiliations:
1
DepartmentofComputerScience,UniversityofCaliforniaLosAngeles,LosAngeles,CA,USA
2
InstituteforQuantitativeandComputationalBiosciences,UniversityofCaliforniaLosAngeles,LosAngeles,CA,USA
3
DepartmentofComputerScience,GeorgiaStateUniversity,Atlanta,USA
4
BiomedicalSciencesGraduateProgram,UniversityofCalifornia,SanFrancisco,CA,USA
5
DepartmentofMolecular,Cell,andDevelopmentalBiology,UniversityofCalifornia,LosAngeles,LosAngeles,CA,
USA
6
Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, Los
Angeles,CA90095,USA
7
DepartmentofGenetics,TheInstituteofLifeSciences,TheHebrewUniversityofJerusalem,Jerusalem,Israel
8
DepartmentofMedicine,UniversityofCalifornia,SanFrancisco,CA,USA
9
Immunogenetics Center, Department of Pathology and Laboratory Medicine, David Geffen School of Medicine,
UniversityofCaliforniaLosAngeles,LosAngeles,CA,USA
10
DepartmentofMicrobiologyandImmunology,SandlerAsthmaBasicResearchCenter,UniversityofCalifornia,San
Francisco,SanFrancisco,CA,USA
11
DepartmentofHumanGenetics,UniversityofCaliforniaLosAngeles,LosAngeles,USA
#Correspondenceto:[email protected]@cs.ucla.edu
*Equalcontribution
**Equalcontribution
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Abstract
Assay-basedapproachesprovideadetailedviewoftheadaptiveimmunesystembyprofilingT
andBcellreceptorrepertoires.However,thesemethodscarryahighcostandlackthescaleof
standard RNA sequencing (RNA-Seq). Here we report the development of ImReP, a novel
computationalmethodforrapidandaccurateprofilingoftheadaptiveimmunerepertoirefrom
regular RNA-Seq data. We applied our novel method to 8,555 samples across 544 individuals
from 53 tissues from the Genotype-Tissue Expression (GTEx v6) project. ImReP is able to
efficiently extract TCR- and BCR-derived reads from RNA-Seq data. ImReP can also accurately
assemblethecomplementarydeterminingregions3(CDR3s),themostvariableregionsofBand
T cell receptors, and determine their antigen specificity. Using ImReP, we have created a
systematicatlasofimmunologicalsequencesforBandTcellrepertoiresacrossabroadrangeof
tissuetypes,mostofwhichhavenotbeenstudiedforBandTcellreceptorrepertoires.Wealso
compared the GTEx tissues to track the flow of T- and B-clonotypes across immune-related
tissues,includingsecondarylymphoidorgansandorgansencompassingmucosal,exocrine,and
endocrinesites,andweexaminedthecompositionalsimilaritiesofclonalpopulationsbetween
these tissues. The atlas of T and B cell receptors, freely available at
https://sergheimangul.wordpress.com/atlas-immune-repertoires/, is the largest collection of
CDR3sequencesandtissuetypes.Weanticipatethisrecoursewillenhancefutureimmunology
studiesandadvancedevelopmentoftherapiesforhumandiseases.ImRePisfreelyavailableat
https://sergheimangul.wordpress.com/imrep/.
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Introduction
Akeyfunctionoftheadaptiveimmunesystem,whichiscomposedofB-cellsandT-cells,isto
mountprotectivememoryresponsestoagivenantigen.BandTcellsrecognizetheirspecific
antigens through their surface antigen receptors (B and T cell receptors, BCR and TCR,
respectively),whichareuniquetoeachcellanditsprogeny.BCRandTCRarediversifiedthrough
somaticrecombination,aprocessthatrandomlycombinesvariable(V),diversity(D),andjoining
(J)genesegments,andinsertsordeletesnon-templatedbasesattherecombinationjunctions1
(Figure1a).TheresultingDNAsequencesarethentranslatedintotheantigenreceptorproteins.
Thisprocessallowsforanastonishingdiversityofthelymphocyterepertoire(i.e.,thecollection
of antigen receptors of a given individual), with >1013 theoretically possible distinct
immunological receptors1. This diversity is key for the immune system to confer protection
againstawidevarietyofpotentialpathogens2.Inaddition,uponactivationofaB-cell,somatic
hypermutationfurtherdiversifiesBCRsintheirvariableregion.Thesechangesaremostlysinglebasesubstitutionsoccurringatextremelyhighrates(10–5to10–3 mutationsperbasepairper
generation)3. Isotype switching is another mechanism that contributes to B-cell functional
diversity.Here,antigenspecificityremainsunchangedwhiletheheavychainVDJregionsjoinwith
different constant (C) regions, such as IgG, IgA, or IgE isotypes, and alter the immunological
propertiesofaBCR.
High-throughputtechnologiesenableunprecedentedaccuracywhenprofilingtheBCRandTCR
repertoires. Commonly used assay-based approaches provide a detailed view of the adaptive
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
immunesystemwithdeepsequencingofamplifiedDNAorRNAfromthevariableregionofBCR
orTCRloci(Rep-Seq)4.Thosetechnologiesareusuallyrestrictedtoonechain,withthemajority
of studies focusing on the beta chain of TCRs and the heavy chain of BCRs. Recent studies2
successfully applied assay-based approaches to characterize the immune repertoire of the
peripheralblood.However,littleisknownabouttheimmunologicalrepertoiresofotherhuman
tissues,includingbarriertissueslikeskinandmucosae.Studiesinvolvingassay-basedprotocols
usually have small sample sizes, thus limiting analysis of intra-individual variation of
immunologicalreceptorsacrossdiversehumantissues.
RNASequencing(RNA-Seq)traditionallyusesthereadsmappedontohumangenomereferences
to study the transcriptional landscape of both single cells and entire cellular populations. In
contrasttoassay-basedprotocolsthatproducereadsfromtheamplifiedvariableregionofBCR
orTCRloci,RNA-Seqisabletocapturetheentirecellularpopulationofthesample,includingB
andTcells.However,duetotherepetitivenatureoflociencodingforBCRsandTCRs,aswellas
theextremelevelofdiversityinBCRandTCRtranscripts,mostmappingtoolsareillequippedto
handle immune repertoire sequences. Despite this, BCR and TCR transcripts often occur in
sufficient numbers within the transcriptome of many tissues to characterize their respective
immunologicalrepertoires5.
In this study, we developed ImReP, a novel computational method for rapid and accurate
profilingoftheadaptiveimmunerepertoirefromregularRNA-Seqdata.Weapplieditto8,555
samplesacross544individualsfrom53tissuesobtainedfromGenotype-TissueExpressionstudy
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
(GTExv6)6.Thedatawasderivedfrom38solidorgantissues,11brainsubregions,wholeblood,
andthreecelllines.ImRePisabletoefficientlyextractTCR-andBCR-derivedreadsfromthe
RNA-Seq data and accurately assemble the complementarity determining regions 3 (CDR3s).
CDR3 are the most variable regions of B and T cell receptors and determine the antigen
specificity.UsingImReP,wecreatedasystematicatlasofimmunologicalsequencesforB-andTcellrepertoriesacrossabroadrangeoftissuetypes,mostofwhichwerenotpreviouslystudied
for B- and T-cell repertoires. We also examined the compositional similarities of clonal
populationsbetweenthetissuestotracktheflowofTandBclonotypesacrossimmune-related
tissues, including secondary lymphoid and organs that encompass mucosal, exocrine, and
endocrinesites.OurproposedapproachisnotsuperiorincomparisontotargetedTCRorBCR;
rather,itprovidesausefultoolformininglarge-scaleRNA-Seqdatasetsforstudyofadaptive
immunerepertoires.
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Results
ImReP:atwostageapproachforadaptiveimmunerepertoiresreconstruction
WeappliedImRePto0.6trillionRNA-Seqreads(92Tbp)from8,555samplestoassembleCDR3
sequencesofBandTcellreceptors(TableS1).TheRNA-SeqdatawasgeneratedbytheGenotypeTissue Expression Consortium (GTEx v6). First, we mapped RNA-Seq reads to the human
referencegenomeusingashort-readaligner(performedbyGTExconsortium6)(Figure1).Next,
weidentifyreadsspanningtheV(D)JjunctionofBandTcellreceptorsandassembleclonotypes
(agroupofcloneswithidenticalCDR3aminoacidsequences).HereImRePused0.02trillionhigh
qualityreadsthatsuccessfullymappedtoBCRgenes,successfullymappedtoTCRgenes,orwere
unmappedreadsthatfailedtomaptothehumanreferencegenome(Figures1aandS1).
ImReP is a two-stage approach to assemble CDR3 sequences and detect corresponding V(D)J
recombinations(Figure1b).Inthefirststage,ImRePutilizesreadsthatsimultaneouslyoverlapV
andJgenesegmentstoinfertheCDR3sequences.WedefinetheCDR3asthesequenceofamino
acidsbetweenthecysteineontherightofthejunctionandphenylalanine(forallTCRchainsand
immunoglobulinlightchains)ortryptophan(forIGH)ontheleftofthejunction.Inthesecond
stage, ImReP utilizes reads that overlap a single gene segment containing a partial CDR3
sequence.ImRePthenusesasuffixtreetoperformpairwisecomparisonofthereadsandjoin
thereadsbasedonoverlapintheCDR3region.Further,ImRePusesaCASTclusteringtechnique7
to accurately assemble clonotypes for PCR and sequencing errors. We map D genes (for IGH,
TCRB,andTCRG)ontoassembledCDR3sequencesandinfercorrespondingV(D)Jrecombination.
AdetaileddescriptionofthemethodologyimplementedwithImRePisprovidedintheExtended
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Experimental
Procedures
Section.
ImReP
is
freely
available
at
https://sergheimangul.wordpress.com/imrep/.
To validate the feasibility of using RNA-Seq to study the adaptive immune repertoire, we
simulatedRNA-SeqdataasamixtureoftranscriptomicreadsandreadsderivedfromBCRand
TCR transcripts (Figure S3). BCR and TCR transcripts are simulated based on random
recombination of V and J gene segments (obtained from IMGT database8) with non-template
insertionattherecombinationjunction(FigureS2).WeassessedtheabilityofImRePtoextract
CDR3-derived reads from the RNA-Seq mixture by applying ImReP to a simulated RNA-Seq
mixture.Whileoursimulationapproachmaynotcompletelysummarizethevariousnuancesand
eccentricitiesofactualimmunerepertoires,itallowsustoassesstheaccuracyofourtool.ImReP
is able to identify 99% of CDR3-derived reads from the RNA-Seq mixture, suggesting it is a
powerful tool for profiling RNA-Seq samples of immune-related tissues. Details about the
simulationdataareprovidedintheExtendedExperimentalProceduressection.
Next,wecomparedImRePwithothermethodsdesignedtoassembleimmunerepertoires.We
alsoinvestigatedthesequencingdepthandreadlengthrequiredtoreliablyassembleTCRand
BCR sequences from RNA-Seq data. Our simulations suggest that both read length and
sequencing depth have a major impact on precision-recall rates of CDR3 sequence assembly.
ImRePisabletomaintainan80%precisionrateforthemajorityofsimulatedscenarios.Average
CDR3coveragethatishigherthan8allowsImRePtoarchivearecallratecloseto90%foraread
length above 75bp (Figure 2a). Increasing coverage has a positive effect on the number of
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
assembledclonotypesachievedbyImRePforbothBandTcellreceptors.Ingeneral,weobserve
higherprecision-recallratesofCDR3sequenceassemblyforTCRsincomparisontoBCRs(Figure
2a-b).
We compared the performance of ImReP to MiXCR (RNA-Seq mode)9, TRUST10, TraCeR11,
V’DJer12, IgBlast-based pipeline13, and iSSAKE14. These tools were developed to assemble the
hypervariablesequencesintheTandBcellreceptorsdirectlyfromRNA-Seqdata.Wesupplied
eachofthosetoolswiththeoriginalRNA-Seqreadsasrawormappedreads,dependingonthe
softwaredevelopers’recommendations.).TRUSTandTraCeRdonotsupporttheanalysisofBCR
sequencesandwereexcludedfromthecomparisonbasedfortheIGHdata.iSSAKEisnolonger
supportedandwasnotrecommendedforuse.Unfortunately,weobtainedemptyoutputafter
running V’DJer, and increasing coverage in the simulated data did not solve the problem.
Alternativeapproaches,suchasIMSEQ15,cannotbeapplieddirectlytoRNA-Seqreadsbecause
they were originally designed for targeted sequencing of B or T cell receptor loci. Thus, to
independentlyassessandcompareaccuracywithImReP,weonlyranIMSEQwiththesimulated
readsderivedfromBCRorTCRtranscripts(FigureS1).Scriptsandcommandstorunalltoolsused
inthisstudyareprovidedintheExtendedExperimentalProceduresandareavailableonlineat
https://github.com/smangul1/Profiling-adaptive-immune-repertoires-across-multiple-humantissues-by-RNA-Sequencing.ImRePconsistentlyoutperformedexistingmethodsonIGHdatain
bothrecallandprecisionratesforthemajorityofsimulatedparameters.ImRePandMiXCRshow
similarperformanceonTCRAdataandoutperformothermethods.Notably,ImRePwastheonly
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
methodwithacceptableperformanceonIGHdataat50bpreadlength,reconstructingwitha
higherprecisionratesignificantlymoreCDR3clonotypesthanothermethods.
To demonstrate the feasibility of applying non-specific RNA Sequencing to assemble immune
repertoiresequences,weusedtheTCRB-Seqdatapreparedfromthreesamplesofkidneyrenal
clearcellcarcinoma(KIRC)byLi,Bo,etal. 10.WedownloadedmatchingRNA-Seqsamplesfrom
theTCGAportal.Intotal,weobtained301million2x50bpreadsfromthreeRNA-Seqsamples.
First,wepreparedtheCDR3sequencesobtainedfromTCRB-Seqandconsideredonlycomplete
CDR3s,whichwedefinedasasequenceofaminoacidsstartingwithcysteine(C)andendingwith
phenylalanine (F). We considered the prepared, complete CDR3s obtained from TCRB-Seq as
totalimmunerepertoire.
WeusedImReP,MIXCR,TRUST,andIMSEQtoassembleCDR3softheTCRBchain.Weexcluded
V’DJer,becauseitonlysupportsimmunoglobulinchainsandisnotsuitabletoassembleCDR3s
fromTcellreceptors.ImReP,MIXCR,andTRUSTassembledcomparablenumbersofcomplete
CDR3sthatfullymatchCDR3sfromthetotalimmunerepertoireobtainedbyTCRB-Seq;IMSEQ
assemblednoneoftheCDR3sequences(Figure2candFigureS4).Onaverage,54%ofCDR3s
assembledbyImRePfullymatchCDR3sfromTCRB-Seq.Ourmethodwasabletorecover0.10.9%ofthetotalimmunerepertoireobtainedbyTCRB-Seqfromkidneytissue.Othertissues,
including spleen and whole blood, contain a higher fraction of T cells and allow RNA-Seq to
capture a higher fraction of the total immune repertoire. One should note, the number of
completeCDR3sfullymatchingCDR3sobtainedbyTCRB-Seqinourstudy(reportedinFigure2c
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
andFigureS4)arenotfullycomparablewiththeresultsreportedinLi,Bo,etal.10,whereCDR3
sequencesareconsideredtomatchCDR3sfromTCRB-Seqifatleast6aminoacidsarematched.
Scripts and commands utilized to process the data and run repertoire assembly tools are
provided in Extended Experimental Procedures and are available online at
https://github.com/smangul1/Profiling-adaptive-immune-repertoires-across-multiple-humantissues-by-RNA-Sequencing.
WefurthervalidatedtheabilityofImRePtoaccuratelyinfertheproportionofimmunecellsin
the sampled tissue. We hypothesized that the fraction of B- and T-cells in the sample will be
proportional with the fraction of receptor-derived reads in our RNA-Seq data. We used a
transcriptome-basedcomputationalmethod,SaVant16,whichusescell-specificgenesignatures
(independentofBCRorTCRtranscripts)toinfertherelativeabundanceofBorTcellswithineach
tissue sample. We found that B and T cell signatures inferred by SaVant showed positive
correlationwiththeamountofBCR(r=0.77,P<0.001)orTCR(r=0.86,P<0.001)transcripts,
respectively (Figure 2c,d). An exception to this correlation was for tissues that contain the
highestdensityofBorTcells:spleen,wholeblood,smallintestine(terminalileum),lung,and
EBV-transformedlymphocytes(LCLs).
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
a
TCRA
TCRD
IGL
TCRG
IGK
TCRB
chr2
Mappedreads
chr7
V
CDR1
N D N
CDR2
IGH
chr14
Unmappedreads
J
C
CDR3
mRNAencodingforTandBcellreceptorchains
chr22
ImReP
b
J gene
Vgene
SYEQYFGPGTRLT…
…SQTSVYFCASSE
VYFICASSEASSLGGSGGYTACCSYEQYFGPG
read
CDR3
c
J gene
Vgene
SYEQYFGPGTRLT…
…SQTSVYFCASSE
read1
SGGYTACCSYEQYFGPG
SQTSVYFICASSEASSLGGSGGY
read2
d
Suffixtree
! G
M
G
L
Q
M
read2
Y
TACCSYEQYFGPG
readsmatchingonlyVgene
areencodedassuffixtree
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Figure 1. Overview of ImReP. (a) Schematic representation of human adaptive immune repertoire. Adaptive
immune repertoire consists of four T-cell receptor loci (blue color, T cell receptor alpha locus (TCRA); Tcell receptor beta locus (TCRB);T-cell receptor delta locus (TCRD); andT-cell receptor gamma locus (TCRG)) and
three immunoglobulin loci (red color, Immunoglobulin heavy locus (IGH); Immunoglobulin kappa locus (IGK);
Immunoglobulinlambdalocus(IGL).Alternativename–BCR,Bcellreceptor).B-andT-cellreceptorscontainmultiple
variable(V,greencolor),diversity(D,presentonlyinIGH,TCRB,TCRG,violetcolor),joining(J,yellowcolor)and
constant(C,bluecolor)genesegments.V(D)Jgenesegmentsarerandomlyjointedandnon-templatedbases(N,
darkredcolor)areinsertedattherecombinationjunctions.TheresultingsplicedT-orB-cellrepertoiretranscript
incorporatestheCsegmentandistranslatedintotheantigenreceptorproteins.RNA-Seqreadsarederivedfrom
the rearranged immunoglobulin IG and TCR loci. Reads entirely aligned to genes of B- and T-cell receptors are
inferredfrommappedreads(blackcolor).Readswithextensivesomatichypermutationsandreadsspanningthe
V(D)J recombination are inferred from the unmapped reads (grey color). Complementarity determining region 3
(CDR3)isthemostvariableregionofthethreeCDRregionsandisusedtoidentifyT/B-cellreceptorclonotypes—a
group of clones with identical CDR3 amino acid sequences. (b) Receptor derived reads spanning V(D)J
recombinations are identified from unmapped reads and assembled into the CDR3 sequences. We first scan the
aminoacidsequencesofthereadanddeterminetheputativeCDR3boundariesdefinedbylastconservedcysteine
encoded by the V gene and the conserved phenylalanine (for TCR) or tryptophan (for BCR) of J gene. Given the
putativeCDR3boundaries,wechecktheprefixandsuffixofthereadtomatchthesuffixofVandprefixofJgenes,
respectively.(c-d)IncaseareadoverlapswithonlytheVorJgene,weperformthesecondstageofImRePtomatch
suchreadsbasedontheoverlapofCDR3sequenceusingsuffixtree.
0.4
0.4
0.4
0.2
1
2
4
8
16
32
64
8
16
64
8
16
coverage
128
32
32
64
0.4
2
4
64 128
8
16
0
1
2
4
8
16
32
d
64 128
32
64
128
1
coverage
2
4
8
16
32
coverage
IgBlast
1000
EBV cells
Lung
100
10
1.0
64
128
32
64
128
1.0
0.8
0.8
Small
Intestine
4
8
16
32
100
Spleen
64
EBV cells
Read
100bp
Lunglength
Whole
64 128
Blood
1
Small
Intestine
128
0.2
32
2
r = 0.77
P < 0.001
coverage
1
0.1
1
2
IMSEQ
MiXCR
2
coverage
810 16
4
4
8
0
16
coverage
MiXCR
0.5
1
1.5
B cell signature score
IMSEQ
32
64
128
4
8
16
32
coverage
64 128
10
TRUST (SE)
TRUST (PE)
Whole
Blood
Spleen
Lung
1
r = 0.86
P < 0.001
0.1
EBV cells
0.01
1
2 14
8
16 32 1.5
64 128
0.5
coverage
B cell signature
score
TRUST (SE)
2
e
0.6
1
coverage
1000
0.0
16
0
TCR reads per million
32
0.4
0.0
16
TRUST (PE)
0
2
1
0
0.0
0.1
0.0
8
0.2
0.4
0.6
2
r = 0.77
P < 0.001
1
0.8
0.6
0.4
0.6
0.2
0.2
4
Whole
Blood
64 128
Read length 100bp
0.4
0.8
0.6
0.0
2
10
Spleen
Small
Intestine
coverage
IgBlast
10000
2
0
1.0
IMSEQ
2
128
0.2
1
coverage
IMSEQ
coverage
ImReP
16
10000
0.4
32
Read 100
length 75bp
0.0
1.0
0.8
0.6
0.4
0.2
8
8
0.0
4
1
0.2
128
1
4
4
0.2
0.4
0.0
2
4
0.8
64
64 128
2
1.0
0.4
0.2
0.0
64 128
1.0
1.0
32
0.4
16
0.0
Precision(TCRA)
32
Read length 75bp
Read length 50bp
1
ImReP
16
MiXCR
0.8
0.6
0.4
0.2
0.0
8
2
2
0
Final figures
2c,d
Read length 75bp
Read length 100bp
1.0
Recall(TCRA)
1.0
0.8
0.6
0.4
4
0.2
0.0
coverage
2
MiXCR
Read length 50bp
32
8
coverage
1
1
ImReP
2
0.8
0.8
0.6
1.0
0.8
0.6
0.4
1.0
4
128
coverage
16
1
0.6
8
64 128 coverage
1
8
128
0.2
4
Readb length 50bp
4
64
1
1
Final figures 2c,d
Read length 100bp
ImReP
2
32
0
2
Read length 75bp
0.2
0.4
2
coverage
1
16
TRUST
12
2
6
coverage
2
0.6
1
0.2
1
32
8
MiXCR
735
1
coverage
0.8
0.8
0.4
1.0
16
4
TCRBSeq
ImReP
0.2
8
64 128
2
BCR reads per million
4
0.2
1
0.0
0.0
2
0.0
128
64 128
0.0
64
32
Read length 100bp
1.0
32
16
Read length 100bp
c
0.8
16
8
0.6
8
0.0 0.2 0.4 0.6
0.6 0.8 1.0
4
0.0
Precision(IGH)
32
coverage
1
4
Read length 100bp
1.0
1.0
0.6
0.4
0.2
0.0
2
Read length 50bp
16
2
coverage
Read length 75bp
coverage
8
1
Read length 75bp
0.8
0.8
Recall(IGH)
1.0
1.0
0.8
1
0.2
0.4
0.0 0.2 0.4 0.6
0.6 0.8 1.0
Read length 50bp
4
64 128
Read length 75bp
Read length 50bp
2
32
0.0 0.2 0.4 0.6
0.6 0.8 1.0
Readalength 50bp
1
16
coverage
TCR reads per million
64 128
BCR reads per million
32
1.0
16
0.8
8
coverage
0.6
4
0.4
2
0.0
1
0.0
0.2
0.0
0.0
0.2
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
0.5
1
1.5
T cell signature score
Tracer
Tracer
Figure2.EvaluationofImReP.(a-b)EvaluationofImRePbasedonthenumberofassembledCDR3sequencesand
comparison to existing methods. (c) Concordance of targeted TCRB-Seq and non-specific RNA-Seq. (d-e)
CorrespondenceofImReP-derivedreadsfromB-cell(BCR)andT-cell(TCR)receptorstotherelativeabundanceofB-
andT-cellsinferredfromcell-specificgeneexpressionprofiles.(a)PrecisionandrecallratesforImReP(blue),MiXCR
(RNA-Seq mode) (blue), IMSEQ (green), and IgBlast (orange) on simulated data for immunoglobulin heavy (IGH)
transcriptsarereportedforvariousreadslength(separateplots)andpertranscriptcoverages(1,2,4,8,16,32,64,128)
(x-axis).TRUSTandTraCeRdonotsupporttheanalysisofBCRsequencesandwereexcludedfromthecomparison
2
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
basedfortheIGHdata.(b)PrecisionandrecallratesforImReP(blue),MiXCR(RNA-Seqmode)(red),TRUST(default,
paired-endmode)(pink),TRUST(single-endmode)(violet),IMSEQ(green),andTraCeR(aqua)onsimulateddatafor
Tcellreceptoralpha(TCRA)transcriptsarereportedforvariousreadslength(separateplots)andpertranscript
coverages (1,2,4,8,16,32,64,128) (x-axis). (c) Concordance of targeted TCRB-Seq and non-specific RNA-Seq
performedonthreeTCGAsamples(onlyoneisshown)fromkidneyrenalclearcellcarcinoma(KIRC).Venndiagram
on TCGA-CZ-5463 sample presents number of matching CDR3s reported by immunoSEQ Analyzer
(http://www.adaptivebiotech.com/)andCDR3sequencesassembledfromnon-specificRNA-SeqdatabyImReP,
MiXCR(RNA-Seqmode),TRUST(default, paired-end mode). Results on other samples are presented in Figure
S4. (d)ScatterplotofthenumberofallBCRreadsper1millionRNA-Seqreads(y-axis)andB-cellsignaturescore
inferredbySaVant(x-axis).(e)ScatterplotofthenumberofallTCRreadsper1millionRNA-Seqreads(y-axis)andBcellsignaturescoreinferredbySaVant(x-axis).Pearsoncorrelationcoefficient(r)andP-valuearereported.
Characterizingtheadaptiveimmunerepertoireacross53GTExtissues
ImReP identified over 26 million reads overlapping 3.8 million distinct CDR3 sequences that
originatefromdiversehumantissues.ThemajorityofassembledCDR3sequencesderivedfrom
BCRs, with 1.7 million from the immunoglobulin heavy chain (IGH), 0.9 million from the
immunoglobulinkappachain(IGK),and1.0millionfromtheimmunoglobulinlambdachain(IGL).
AsmallerfractionofCDR3sequencesderivedfromTCRs,with0.2millionsequencesfromalpha
andbetaTCRs(TCRAandTCRB).ThevastmajorityofallassembledCDR3shadalowfrequency
inthedata.98%ofCDR3sequenceshadacountoflessthan10reads,andthemedianCDR3
sequencecountwas1.4.CDR3sequencesderivedfromIGKwerethemostabundantacrossall
tissues,accountingonaveragefor54%oftheentireB-cellpopulation(FigureS5).IntheT-cell
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
population,alphaandbetajointlyaccountedfor83%ofthepopulation.DeltaT-cellpopulation
wastherarest,accountingforlessthanonepercentoftheentireTcellpopulation(FigureS6).
We compared the length and amino acid composition of the assembled CDR3 sequences of
immunoglobulinandT-cellreceptorchains(Figure3a-g).Consistentwithpreviousstudies,we
observedthatimmunoglobulinlightchainshavenotablyshorterandlessvariableCDR3lengths
comparedtoheavychains17(Fig.3h).Thetissuetypeappearstohavenoeffectonthelength
distributionofCDR3sequencesofBCRs(FigureS7).DifferencesintheCDR3lengthdistributions
for TCRs cannot be estimated due to the small number of available TCRs. Sequencing
composition18ofCDR3regionsofbetaT-cellsassembledbyImRePrecapitulatesonedetectedby
TCRsequencing2fromthewholebloodwithtwodominantCASSandCSARmotifsinthebeginning
of the sequence (Fig. 3f). In line with other studies19, both light chains exhibited a reduced
amountofsequencingdiversity(Fig.3b-c).
a
d
IGH
TCRA
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
b
IGK
TCRB
c
f
IGL
TCRD
e
h
TCRG
g
Length(amino acids)
Figure 3. Length and amino acid composition of the assembled CDR3 sequences of immunoglobulin and T-cell
receptorchains.Thesequencelogo(usingWebLogo)ofaminoacidcompositionrepresentationforCDR3sequences
withmeanlength.Theheightoftheaminoacidwithinthestackindicatestherelativefrequency.Distributionof
CDR3sequencelengthisestimatedusingskerneldensity.(a)Sequencelogoof15-amino-acidCDR3sequenceof
IGH.(b)Sequencelogoof11-amino-acidCDR3ofIGK.(c)Sequencelogoof12-amino-acidCDR3sequenceofIGL.(d)
Sequencelogoof14-amino-acidCDR3sequenceofTCRA.(e)Sequencelogoof14-amino-acidCDR3sequenceof
TCRB. (f) Sequence logo of 17-amino-acid CDR3 sequence of TCRD. (g) Sequence logo of 13-amino-acid CDR3
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
sequenceofTCRG.(h)DistributionofCDR3sequencelengthisestimatedusingskerneldensityseparatelyforeach
chainofT-andB-cellreceptors.
We observed per sample an average of 1331 distinct clonotypes for BCRs and 20 distinct
clonotypessequencesforTCRs.Wenormalizedthenumberofdistinctclonotypesbythetotal
numberofRNA-Seqreads,whichwecallnumberofclonotypesperonemillionreads(CPM).As
thenumberofdistinctclonotypesdoesnotincreaselinearlywiththesequencingdepth,CPM
metricshouldnotbeusedinstudiescomparingclonotypediversityacrossvariousphenotypes.
Instead,CPMisintendedtobeaninformativemeasureofclonaldiversityadjustedforsequencing
depth.
We used per sample alpha diversity (Shannon entropy) to incorporate the total number of
distinctclonotypesandtheirrelativefrequenciesintoasinglediversitymetric.Amongalltissues,
spleenhasthelargestB-cellpopulation,withamedianof1301BCR-derivedreadsperonemillion
RNA-Seqreads.ItalsohasthemostdiversepopulationofBcellswithmedianpersamplealpha
diversityof7.6correspondingto1025CPM(Figure4andTableS1).Wholebloodhasboththe
largestandmostdiverseT-cellpopulation(Figure4andTableS1).Organsthatpossessmucosal,
exocrine,andendocrinesites(n=24)harborarichclonotypepopulationwithamedianof87CPM
persample.Minorsalivaryglandshavethehighestimmunediversityinthegroup(alpha=7.1)
andsurpassthediversityoftheterminalIleumcontainingPeyer’sPatches,whicharesecondary
lymphoidorgans(TableS1).
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Tissuesnotrelatedtotheimmunesystem,includingadipose,muscle,andtheorgansfromthe
centralnervoussystem,containedamedianof6CPMpersample,whicharemostlikelydueto
thebloodcontentofthetissues20.ThehighestnumberofdistinctCDR3sequencesamongnonlymphoid organs was present in the omentum, a membranous double layer of adipose tissue
containingfat-associatedlymphoidclusters.Asexpected21,Epsteinbarvirus(EBV)-transformed
lymphocytes(LCL)harboredalargehomogeneouspopulationofBcellclonotypes(TableS1and
FigureS7).
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
a
14
12
CLONOTYPIC RICHNESSOFTCELLS,CPM
10
8
6
TCRA
TCRB
TCRD
TCRG
4
2
0
b
CLONOTYPIC RICHNESSOFBCELLS, CPM
0
Spleen
SmallIntestine- TerminalIleum
WholeBlood
Artery- Coronary
Artery- Aorta
Artery- Tibial
MinorSalivaryGland
Colon- Transverse
Stomach
Colon- Sigmoid
Lung
Breast- MammaryTissue
Esophagus- Mucosa
Thyroid
Kidney- Cortex
Bladder
FallopianTube
Vagina
Cervix- Endocervix
Liver
Cervix- Ectocervix
Prostate
Uterus
Ovary
Pancreas
Pituitary
Skin- NotSunExposed…
Skin- SunExposed(Lowerleg)
AdrenalGland
Testis
Cells- EBV-transformed…
Cells- Transformedfibroblasts
Adipose- Visceral(Omentum)
Esophagus- Gastroesophageal…
Adipose- Subcutaneous
Heart- AtrialAppendage
Esophagus- Muscularis
Heart- LeftVentricle
Muscle- Skeletal
Nerve- Tibial
Brain- Spinalcord(cervicalc-1)
Brain- Amygdala
Brain- Cortex
Brain- Hypothalamus
Brain- Caudate(basalganglia)
Brain- Substantianigra
Brain- Hippocampus
Brain- Putamen(basalganglia)
Brain- Anteriorcingulatecortex
Brain- Nucleusaccumbens(basal…
Brain- Cerebellum
Brain- FrontalCortex(BA9)
Brain- CerebellarHemisphere
200
400
600
IGH
IGK
800
1000
IGL
Tissuestypes
Secondarylymphoidorgans
Bloodassociatedsites
Mucosal,exocrineandendocrineorgans
Celllines
Adiposeandmuscleorgans
Organsofcentralnervoussystem
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Figure 4. Adaptive immune repertoires across multiple human tissues. Adaptive immune repertoires of 8,555
samplesacross544individualsfrom53bodysitesobtainedfromGenotype-TissueExpressionstudy(GTExv6).We
groupthetissuesbytheirrelationshiptotheimmunesystem.Thefirstgroupincludesthelymphoidtissues(n=2,red
colors).Thesecondgroupincludesbloodassociatedsitesincludingwholebloodandbloodvessel(n=4,redcolor).
Thethirdgrouparetheorgansthatencompassesmucosal,exocrineandendocrineorgans(n=21,lavendercolor).
The fourth group are cell lines (n=3, grey color). The filth group are adipose or muscle tissues and the
gastroesophagealjunction(n=7,bluecolor).Thesixthgroupareorgansfromcentralnervoussystem(n=14,green
color).HistogramreportsclonotypicrichnessofTandBcells,calculatedasnumberofdistinctaminoacidsequences
ofCDR3peronemillionRNA-Seqreads(CPM).(a)MedianCPMsarepresentedindividuallyforTcellreceptoralpha
chain(TCRA),T-cellreceptorbetachain(TCRB),T-cellreceptordeltachain(TCRD),andT-cellreceptorgammachain
(TCRG).(b)MediannumberofdistinctaminoacidsequencesofCDR3arepresentedindividuallyforimmunoglobulin
heavychain(IGH),immunoglobulinkappachain(IGK),immunoglobulinlambdachain(IGL).
Individual-andtissue-specificT-andB-cellclonotypes
Aminoacidsequencesofclonotypesexhibitedextremeinter-individualdissimilarity,with88%of
clonotypesuniquetoasingleindividual(private)(Figure5a).Theremaining~400,000clonotypes
weresharedbyatleasttwoindividuals(public).Thenumberofindividualssharingclonotypes
variedacrossTandBcellreceptors,withimmunoglobulinlightchainshavingthehighestnumber
ofpublicclonotypes.Twenty-fivepercentofallIGKclonotypeswerepublic,andthenumberof
individuals sharing the IGK clonotype sequences can be as high as 471 (Figure 5b). T cell
clonotypeshadonaverage7%publicclonotypes,withthehighestnumberofpublicclonotypes
amongthedeltachainofgamma-deltaTCRs(12%).ThelimitedcapacityofRNA-Seqtocoverlow
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
abundant clonotypes may misclassify public clonotypes as private. Consistent with previous
studies10,22,weobservepublicclonotypestobesignificantlyshorterinlengththantheprivate
ones(p-value<2x10-16).Forexample,IGHchainpublicclonotypeshadanaveragelengthof13
aminoacids,andprivateclonotypeshadanaveragelengthof16.Wealsoexaminedwhetherthe
publicclonotypesweremoreoftensharedacrosstissueswithinanindividual.Only14%ofthe
~240,000 clonotypes shared across tissues were public. The majority of clonotypes were
individual-andtissue-specific(Figure5c).Thefulllistofpublicclonotypesisdistributedwiththe
‘AtlasofT-andB-cellrepertoires’thataccompaniesthismanuscript.
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
b
-1
106
58
IGHV3-53
105
104
BCRs
Numberofclonotypesequences
a
103
CARGHYGMDVW
IGHJ6
471
IGKV3-20
CQQYGSSPRTF
IGKJ2
102
443
10
IGLV3-1
1
IGL
TCRA
TCRB
TCRD
TCRG
TCRs
IGK
c
Tissuespecific
Sharedacross
tissues
Individualspecific(private)
3.1x10 6
0.2x10 6
Sharedacrossindividuals
(public)
0.4x10 6
0.03x10 6
IGLJ2
90
1
3
5
7
9
11
13
15
17
Numberofindividualssharing theclonotypesequence
IGH
CQAWDSSTVVF
TRAV8-7
CAGADRLQTGMRGAF
TRBV10-3
CVIKYF
TRAJ19
67
TRGV10
CAAWDYYKKLF
TRBJ2-3
TRGJ1(2)
44
Figure5.PrivateandpublicTandBcellclonotypes.(a)Distributionoffrequenciesofprivate(n=1)andpublic(n>1)
clonotypes across 544 individuals. We collect clonotypes from all tissues of the same individual into a single set
correspondingtothatindividual.(b)Themostpublicclonotypes(sharedacrossmaximumnumberofindividuals)
and corresponding VJ recombination are presented for IGH, IGK, IGL, TCRA, TCRB, and TCRG. (c) Clonotypes
sequences are classified into public clonotypes (shared across individuals), private (individual specific), tissuespecific,andclonotypessharedacrossmultipletissues.Thenumberofclonotypesfallingintoeachpairofcategories
isreportedacrossallTandBcellreceptorchains.
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
FlowofTandBcellclonotypesacrosshumanGTExtissues
The large number of individuals available through the study allow us to establish a pairwise
relationshipbetweenthetissuesandtracktheflowofTandBclonotypesacrosshumantissues.
WobservedasignificantincreaseoftheCDR3sequencessharedacrosspairsoftissuesfromthe
same individuals. Further, we observed this pattern consistently for all chains of B and T cell
receptors(p-value<2x10-16)(Figure6aandTableS2).Weobserveadifferentamountofshared
CDR3 sequences across different types of BCRs and TCRs with an increase in immunoglobulin
light chains. Decreased number of TCR reads compared to the BCRs makes it unfeasible to
comparethenumberofsharedCDR3acrossTandBcells.Onaverage,weobserve21.0CDR3
sequencestobesharedacrossapairoftissuesfromthesameindividuals.Pairsoftissuesfrom
differentindividualsshareonaverage10.6CDR3sequences(Figure6aandTableS2).
ToestablishtheflowofB-andT-cellclonotypesacrossvarioustissues,wecomparedclonotype
populationsbetweenandwithinthesameindividuals.Welimitedthisanalysistopairsoftissues
forwhichwehadatleast10individuals(870pairsoftissuesoutof1378possiblepairs).Weused
beta diversity (Sørensen–Dice similarity index) to measure compositional similarities between
thetissuesintermsofgainorlossofCDR3sequences(Figure6b-c).Forthemajorityofthe870
availabletissuepairs,weobservenoBCRorTCRsequencesincommon,whichcorrespondsto
betadiversityof0.0.
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
WeexaminedtheflowofIGHclonotypesacrosstissuesandpresenteditasanetwork(Figure
5b).Among870availabletissuepairs,wehaveidentified56tissuepairswithbetadiversityabove
.001.Spleenwasthemosthighlyconnectedtissue,with17connections,followedbylung,with
16 connections. Clonotypes represents one connected component, meaning that every two
nodesareconnecteddirectlyorviaothernodes.Clonotypepopulationsofspleenandlungare
themostsimilar(0.02betadiversity),otherpairsincludeminorsalivaryglandandesophagus
mucosa,terminalileum(smallintestine)andtransversecolon.Weobserveabove200pairsof
tissueswithbetadiversityabove.001forimmunoglobulinlightchains(FigureS9-S10).Themost
similartissuepairsforIGKchainwerespleenandtransversecolon(0.15betadiversity).
WealsoexaminedtheflowofTCRBclonotypesacrosstissues.TCRBclonotypesequenceswere
sharedacross10pairsoftissueswithaveragebetadiversityof0.02.SimilartoBcells,thenetwork
of T cell clonotypes is a single connected component, with spleen being the most highly
connectedtissue(Figure6c).FortheTCRgammachain,weobservebetadiversityof0.0forall
tissuepairs.Atthesametime12%ofTCRgammaclonotypesarepublic,showingthehighestrate
amongallTCRgenes.AtthesequencingdepthprovidedbyRNA-Seq,weareunabletoobserve
TCRdeltaclonotypesharingacrossindividualsandtissues.
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
b
a
10
TRCG
TRCD
TRCB
TRCA
IGL
-5
IGK
0
IGH
5
IGH
Average numberofclonotypes
sharedacrosspairsoftissues
15
-10
-15
Pairoftissues withinthesame individual
Pairoftissues acrossdifferentindividuals
TCRB
c
Figure6.FlowofTandBcellclonotypesacrossdiversehumantissues.Resultsarebasedonpairsoftissueswithat
least10individuals.(a)Thenumberofclonotypesequencessharedacrosspairsoftissuesfromthesameindividuals
(bluecolor)andfromdifferentindividuals(orangecolor)arepresented.Numberofclonotypessharedacrosstissues
fromthesameindividualsforTCRsis<.1.Numberofclonotypessharedacrosstissuesfromdifferentindividualsfor
TCRsis<.001.(b)-(c)Flowofclonotypesacrossdiversehumantissuesispresentedasanetwork.Eachnodeisatissue
withthesizeproportionaltoamediannumberofclonotypesofthetissue.Thecolorofthenodecorrespondstoa
typeofthetissuetype:lymphoidtissues(yellowcolors),bloodassociatedsites(redcolor),organsthatencompasses
mucosal,exocrineandendocrineorgans(lavendercolor).Compositionalsimilaritiesbetweenthetissuesintermsof
gain or loss of CDR3 sequences are measured across valid pairs of tissues using beta diversity (Sørensen–Dice
similarity index). Edges are weighted according to the beta diversity. (b) Flow of IGH clonotypes across diverse
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
humantissuesispresentedasanetwork.Edgeswithbetadiversity>.001arepresented.(c)FlowofTCRBclonotypes
acrossdiversehumantissuesispresentedasanetwork.Edgeswithbetadiversity>.001arepresented.
ImRePidentifiestissuesampleswithlymphocyteinfiltration
Histologicalimagesoftissuecross-sectionsandpathologists’noteshavebeenusedtovalidate
the ImReP’s ability to detect the samples with a high lymphocyte content, which often
correspondstoadiseasestate.WeexaminedtheIGHclonotypepopulationsfromthyroidtissue
acrossindividuals.ThemediannumberofinferreddistinctCDR3sequencespersamplewas20,
though14.5%ofthesampleshadmorethan500distinctCDR3sequences.Weobservedthe
highestnumberofCDR3sequencesamongallthethyroidsamplesinanindividualwithlatestage
Hashimoto'sthyroiditis,anautoimmunediseasecharacterizedbylymphocyteinfiltrationandTcellmediatedcytotoxicity.Accordingtopathologists’notes,Hashimoto'sdiseasewaspresentin
11.2%ofthyroidsamples,withvaryingdegreesofseverity.First,weusedpathologists’notesto
annotate samples as healthy or bearing Hashimoto's disease, and then we compared the
adaptiverepertoirediversitybetweenthesegroups.Weobservedasignificantincreaseinthe
number of distinct IGH clonotypes in samples with Hashimoto's thyroiditis (p-value= 1.5x10-5)
(FigureS11).Thenumberofclonotypesvariedfrom113forfocalHashimoto'sthyroiditisto5621
forlatestageHashimoto'sthyroiditis(Figure7a).Inaddition,highclonotypediversityinkidney
samplesindicatedthepresenceofglomerulosclerosis.Inlungsamples,highclonotypediversity
correspondedtoinflammatorydiseasessuchassarcoidosisandbronchopneumonia.
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Weobservednodifferenceinclonaldiversityinmalesandfemalesacrossthetissues,exceptin
breasttissues(p-value<3.2x10-12,BCRs).Increasedclonotypediversityofbreasttissueinmale
individualscorrespondedtogynecomastia,acommondisorderofnon-cancerousenlargementof
malebreasttissue(Figure7b).
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Diseaseseverity
Histological
images
a
LATESTAGE
HASHIMOTOS
HASHIMOTOS
496
2353
1315
EARLYHASHIMOTOS
FOCALHASHIMOTOS
113
PATCHY
HASHIMOTOS
NumberofIGH
clonotypes
5621
b
Gynecomastia
sex_s
male
3000
2000
1000
0
female
Number of IGH clonotypes
Figure7.ImRePisabletoidentifysampleswithhighactivityoflymphocytes.Histologicalimagesoftissuecrosssectionsandpathologists’noteshavebeenusedtovalidatetheImReP’sabilitytodetectthesampleswithahigh
activityoflymphocytes.(a)SampleswereorderedbyHashimoto'sthyroiditisseverity,asreportedbypathologists’
notes.Histologicalimagesareprovidedtoillustratethediseasestate.Averagenumberofclonotypesisreportedfor
eachdiseasegroup.(b)Boxplotreportingnumberofclonotypesinthebreasttissuesformalesandfemales.Outlier
amongthemalesamplesisillustratedwiththehistologicalimage.
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Discussion
We have developed a novel computational approach (ImReP) for reconstruction of adaptive
immune repertoires using RNA-Seq data. We demonstrate the ability of ImReP to efficiently
extract TCR- and BCR- derived reads from the RNA-Seq data and accurately assemble
correspondingBCRandTCRclonotypes.TheproposedalgorithmcanaccuratelyassembleCDR3
sequencesofimmunereceptorsdespitethepresenceofsequencingerrorsandshortreadlength.
Simulations generated using various read lengths and coverage depth show that ImReP
consistentlyoutperformsexistingmethodsintermsofprecisionandrecallrates.
We have demonstrated the feasibility of applying RNA-Seq to study the adaptive immune
repertoire.AlthoughRNA-Seqlacksthesequencingdepthoftargetedsequencing(Rep-Seq),it
can compensate by examining a larger sample size. Using ImReP, we have created the first
systematicatlasofimmunologicalsequenceforB-andT-cellreceptorrepertoriesacrossdiverse
humantissues.Thisprovidesarichresourceforcomparativeanalysisofarangeoftissuetypes,
mostofwhichhavenotbeenstudiedbefore.TheatlasofT-andB-cellreceptors,availablewith
thepaper,isthelargestcollectionofCDR3sequencesandtissuetypes.Weanticipatethatthis
database will enhance future studies in areas such as immunology and contribute to the
developmentoftherapiesforhumandiseases.
Using RNA-Seq to study immune repertoires has some advantages, including the ability to
simultaneouslycapturebothTandBcellclonotypepopulationsduringasinglerun.Italsoallows
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
simultaneousdetectionofoveralltranscriptionalresponsesoftheadaptiveimmunesystem,by
comparingchangesinthenumberofBCRandTCRtranscriptstothemuchlargertranscriptome.
Given large number of large-scale RNA-Seq datasets becoming available, we look forward to
scalinguptheatlasofT-andB-cellreceptorsinordertoprovidevaluableinsightsintoimmune
responsesacrossvariousautoimmunediseases,allergies,andcancers.
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
References
1.
Georgiou,G.etal.Thepromiseandchallengeofhigh-throughputsequencingofthe
antibodyrepertoire.Nat.Biotechnol.32,158–168(2014).
2.
Freeman,J.D.,Warren,R.L.,Webb,J.R.,Nelson,B.H.&Holt,R.A.ProfilingtheT-cell
receptorbeta-chainrepertoirebymassivelyparallelsequencing.GenomeRes.19,1817–
1824(2009).
3.
Rajewsky,K.,Forster,I.&Cumano,A.Evolutionaryandsomaticselectionoftheantibody
repertoireinthemouse.Science(80-.).238,1088–1094(1987).
4.
Benichou,J.,Ben-Hamo,R.,Louzoun,Y.&Efroni,S.Rep-Seq:uncoveringthe
immunologicalrepertoirethroughnext-generationsequencing.Immunology135,183–
191(2012).
5.
Mangul,S.etal.DumpsterdivinginRNA-sequencingtofindthesourceofeverylastread.
bioRxiv53041(2016).
6.
Ardlie,K.G.etal.TheGenotype-TissueExpression(GTEx)pilotanalysis:Multitissuegene
regulationinhumans.Science(80-.).348,648–660(2015).
7.
Ben-Dor,A.,Shamir,R.&Yakhini,Z.Clusteringgeneexpressionpatterns.J.Comput.Biol.
6,281–297(1999).
8.
Lefranc,M.-P.etal.IMGT{®},theinternationalImMunoGeneTicsinformationsystem{®}
25yearson.NucleicAcidsRes.gku1056(2014).
9.
Bolotin,D.A.etal.MiXCR:softwareforcomprehensiveadaptiveimmunityprofiling.Nat.
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Methods12,380–381(2015).
10.
Li,B.etal.Landscapeoftumor-infiltratingTcellrepertoireofhumancancers.Nat.
Genet.(2016).
11.
Stubbington,M.J.T.etal.Tcellfateandclonalityinferencefromsingle-cell
transcriptomes.Nat.Methods(2016).
12.
Mose,L.E.etal.Assembly-basedinferenceofB-cellreceptorrepertoiresfromshortread
RNAsequencingdatawithV’DJer.Bioinformaticsbtw526(2016).
13.
Strauli,N.&Hernandez,R.StatisticalInferenceofaConvergentAntibodyRepertoire
ResponsetoInfluenzaVaccine.bioRxiv25098(2015).
14.
Warren,R.L.,Nelson,B.H.&Holt,R.A.ProfilingmodelT-cellmetagenomeswithshort
reads.Bioinformatics25,458–464(2009).
15.
Kuchenbecker,L.etal.IMSEQ—afastanderrorawareapproachtoimmunogenetic
sequenceanalysis.Bioinformatics31,2963–2971(2015).
16.
Lefranc,M.-P.etal.SaVanT:Aweb-basedtoolforthesample-levelvisualizationof
molecularsignaturesingeneexpressionprofiles.NucleicAcidsRes.gku1056(2014).
17.
Philibert,P.etal.AfocusedantibodylibraryforselectingscFvsexpressedathighlevelsin
thecytoplasm.BMCBiotechnol.7,1(2007).
18.
Crooks,G.E.,Hon,G.,Chandonia,J.-M.&Brenner,S.E.WebLogo:asequencelogo
generator.GenomeRes.14,1188–1190(2004).
19.
Hoi,K.H.&Ippolito,G.C.Intrinsicbiasandpublicrearrangementsinthehuman
immunoglobulinVλlightchainrepertoire.GenesImmun.14,271–276(2013).
20.
Yu,H.-P.,Chiu,Y.-W.,Lin,H.-H.,Chang,T.-C.&Shen,Y.-Z.Bloodcontentinguinea-pig
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not
peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
tissues:correctionforthestudyofdrugtissuedistribution.Pharmacol.Res.23,337–347
(1991).
21.
DeRossi,A.etal.InfectionofEpstein-Barrvirus-transformedlymphoblastoidBcellsby
thehumanimmunodeficiencyvirus:evidenceforapersistentandproductiveinfection
leadingtoBcellphenotypicchanges.Eur.J.Immunol.20,2041–2049(1990).
22.
Warren,R.L.etal.ExhaustiveT-cellrepertoiresequencingofhumanperipheralblood
samplesrevealssignaturesofantigenselectionandadirectlymeasuredrepertoiresizeof
atleast1millionclonotypes.GenomeRes.21,790–797(2011).