Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Immune system wikipedia , lookup
Hygiene hypothesis wikipedia , lookup
12-Hydroxyeicosatetraenoic acid wikipedia , lookup
Adaptive immune system wikipedia , lookup
Innate immune system wikipedia , lookup
Cancer immunotherapy wikipedia , lookup
Immunosuppressive drug wikipedia , lookup
Polyclonal B cell response wikipedia , lookup
Adoptive cell transfer wikipedia , lookup
Psychoneuroimmunology wikipedia , lookup
bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. Title:ProfilingadaptiveimmunerepertoiresacrossmultiplehumantissuesbyRNASequencing Authors:SergheiMangul*#1,2,IgorMandric*3,HarryTaegyunYang1,NicolasStrauli4,Dennis Montoya5,JeremyRotman1,WillVanDerWey1,JiemR.Ronas6,BenjaminStatz1,DouglasYao5, AlexZelikovsky3,RobertoSpreafico2,SagivShifman**7,NoahZaitlen**8,MauraRossetti**9, K.MarkAnsel**10,EleazarEskin**#1,11 Affiliations: 1 DepartmentofComputerScience,UniversityofCaliforniaLosAngeles,LosAngeles,CA,USA 2 InstituteforQuantitativeandComputationalBiosciences,UniversityofCaliforniaLosAngeles,LosAngeles,CA,USA 3 DepartmentofComputerScience,GeorgiaStateUniversity,Atlanta,USA 4 BiomedicalSciencesGraduateProgram,UniversityofCalifornia,SanFrancisco,CA,USA 5 DepartmentofMolecular,Cell,andDevelopmentalBiology,UniversityofCalifornia,LosAngeles,LosAngeles,CA, USA 6 Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, Los Angeles,CA90095,USA 7 DepartmentofGenetics,TheInstituteofLifeSciences,TheHebrewUniversityofJerusalem,Jerusalem,Israel 8 DepartmentofMedicine,UniversityofCalifornia,SanFrancisco,CA,USA 9 Immunogenetics Center, Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, UniversityofCaliforniaLosAngeles,LosAngeles,CA,USA 10 DepartmentofMicrobiologyandImmunology,SandlerAsthmaBasicResearchCenter,UniversityofCalifornia,San Francisco,SanFrancisco,CA,USA 11 DepartmentofHumanGenetics,UniversityofCaliforniaLosAngeles,LosAngeles,USA #Correspondenceto:[email protected]@cs.ucla.edu *Equalcontribution **Equalcontribution bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. Abstract Assay-basedapproachesprovideadetailedviewoftheadaptiveimmunesystembyprofilingT andBcellreceptorrepertoires.However,thesemethodscarryahighcostandlackthescaleof standard RNA sequencing (RNA-Seq). Here we report the development of ImReP, a novel computationalmethodforrapidandaccurateprofilingoftheadaptiveimmunerepertoirefrom regular RNA-Seq data. We applied our novel method to 8,555 samples across 544 individuals from 53 tissues from the Genotype-Tissue Expression (GTEx v6) project. ImReP is able to efficiently extract TCR- and BCR-derived reads from RNA-Seq data. ImReP can also accurately assemblethecomplementarydeterminingregions3(CDR3s),themostvariableregionsofBand T cell receptors, and determine their antigen specificity. Using ImReP, we have created a systematicatlasofimmunologicalsequencesforBandTcellrepertoiresacrossabroadrangeof tissuetypes,mostofwhichhavenotbeenstudiedforBandTcellreceptorrepertoires.Wealso compared the GTEx tissues to track the flow of T- and B-clonotypes across immune-related tissues,includingsecondarylymphoidorgansandorgansencompassingmucosal,exocrine,and endocrinesites,andweexaminedthecompositionalsimilaritiesofclonalpopulationsbetween these tissues. The atlas of T and B cell receptors, freely available at https://sergheimangul.wordpress.com/atlas-immune-repertoires/, is the largest collection of CDR3sequencesandtissuetypes.Weanticipatethisrecoursewillenhancefutureimmunology studiesandadvancedevelopmentoftherapiesforhumandiseases.ImRePisfreelyavailableat https://sergheimangul.wordpress.com/imrep/. bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. Introduction Akeyfunctionoftheadaptiveimmunesystem,whichiscomposedofB-cellsandT-cells,isto mountprotectivememoryresponsestoagivenantigen.BandTcellsrecognizetheirspecific antigens through their surface antigen receptors (B and T cell receptors, BCR and TCR, respectively),whichareuniquetoeachcellanditsprogeny.BCRandTCRarediversifiedthrough somaticrecombination,aprocessthatrandomlycombinesvariable(V),diversity(D),andjoining (J)genesegments,andinsertsordeletesnon-templatedbasesattherecombinationjunctions1 (Figure1a).TheresultingDNAsequencesarethentranslatedintotheantigenreceptorproteins. Thisprocessallowsforanastonishingdiversityofthelymphocyterepertoire(i.e.,thecollection of antigen receptors of a given individual), with >1013 theoretically possible distinct immunological receptors1. This diversity is key for the immune system to confer protection againstawidevarietyofpotentialpathogens2.Inaddition,uponactivationofaB-cell,somatic hypermutationfurtherdiversifiesBCRsintheirvariableregion.Thesechangesaremostlysinglebasesubstitutionsoccurringatextremelyhighrates(10–5to10–3 mutationsperbasepairper generation)3. Isotype switching is another mechanism that contributes to B-cell functional diversity.Here,antigenspecificityremainsunchangedwhiletheheavychainVDJregionsjoinwith different constant (C) regions, such as IgG, IgA, or IgE isotypes, and alter the immunological propertiesofaBCR. High-throughputtechnologiesenableunprecedentedaccuracywhenprofilingtheBCRandTCR repertoires. Commonly used assay-based approaches provide a detailed view of the adaptive bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. immunesystemwithdeepsequencingofamplifiedDNAorRNAfromthevariableregionofBCR orTCRloci(Rep-Seq)4.Thosetechnologiesareusuallyrestrictedtoonechain,withthemajority of studies focusing on the beta chain of TCRs and the heavy chain of BCRs. Recent studies2 successfully applied assay-based approaches to characterize the immune repertoire of the peripheralblood.However,littleisknownabouttheimmunologicalrepertoiresofotherhuman tissues,includingbarriertissueslikeskinandmucosae.Studiesinvolvingassay-basedprotocols usually have small sample sizes, thus limiting analysis of intra-individual variation of immunologicalreceptorsacrossdiversehumantissues. RNASequencing(RNA-Seq)traditionallyusesthereadsmappedontohumangenomereferences to study the transcriptional landscape of both single cells and entire cellular populations. In contrasttoassay-basedprotocolsthatproducereadsfromtheamplifiedvariableregionofBCR orTCRloci,RNA-Seqisabletocapturetheentirecellularpopulationofthesample,includingB andTcells.However,duetotherepetitivenatureoflociencodingforBCRsandTCRs,aswellas theextremelevelofdiversityinBCRandTCRtranscripts,mostmappingtoolsareillequippedto handle immune repertoire sequences. Despite this, BCR and TCR transcripts often occur in sufficient numbers within the transcriptome of many tissues to characterize their respective immunologicalrepertoires5. In this study, we developed ImReP, a novel computational method for rapid and accurate profilingoftheadaptiveimmunerepertoirefromregularRNA-Seqdata.Weapplieditto8,555 samplesacross544individualsfrom53tissuesobtainedfromGenotype-TissueExpressionstudy bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. (GTExv6)6.Thedatawasderivedfrom38solidorgantissues,11brainsubregions,wholeblood, andthreecelllines.ImRePisabletoefficientlyextractTCR-andBCR-derivedreadsfromthe RNA-Seq data and accurately assemble the complementarity determining regions 3 (CDR3s). CDR3 are the most variable regions of B and T cell receptors and determine the antigen specificity.UsingImReP,wecreatedasystematicatlasofimmunologicalsequencesforB-andTcellrepertoriesacrossabroadrangeoftissuetypes,mostofwhichwerenotpreviouslystudied for B- and T-cell repertoires. We also examined the compositional similarities of clonal populationsbetweenthetissuestotracktheflowofTandBclonotypesacrossimmune-related tissues, including secondary lymphoid and organs that encompass mucosal, exocrine, and endocrinesites.OurproposedapproachisnotsuperiorincomparisontotargetedTCRorBCR; rather,itprovidesausefultoolformininglarge-scaleRNA-Seqdatasetsforstudyofadaptive immunerepertoires. bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. Results ImReP:atwostageapproachforadaptiveimmunerepertoiresreconstruction WeappliedImRePto0.6trillionRNA-Seqreads(92Tbp)from8,555samplestoassembleCDR3 sequencesofBandTcellreceptors(TableS1).TheRNA-SeqdatawasgeneratedbytheGenotypeTissue Expression Consortium (GTEx v6). First, we mapped RNA-Seq reads to the human referencegenomeusingashort-readaligner(performedbyGTExconsortium6)(Figure1).Next, weidentifyreadsspanningtheV(D)JjunctionofBandTcellreceptorsandassembleclonotypes (agroupofcloneswithidenticalCDR3aminoacidsequences).HereImRePused0.02trillionhigh qualityreadsthatsuccessfullymappedtoBCRgenes,successfullymappedtoTCRgenes,orwere unmappedreadsthatfailedtomaptothehumanreferencegenome(Figures1aandS1). ImReP is a two-stage approach to assemble CDR3 sequences and detect corresponding V(D)J recombinations(Figure1b).Inthefirststage,ImRePutilizesreadsthatsimultaneouslyoverlapV andJgenesegmentstoinfertheCDR3sequences.WedefinetheCDR3asthesequenceofamino acidsbetweenthecysteineontherightofthejunctionandphenylalanine(forallTCRchainsand immunoglobulinlightchains)ortryptophan(forIGH)ontheleftofthejunction.Inthesecond stage, ImReP utilizes reads that overlap a single gene segment containing a partial CDR3 sequence.ImRePthenusesasuffixtreetoperformpairwisecomparisonofthereadsandjoin thereadsbasedonoverlapintheCDR3region.Further,ImRePusesaCASTclusteringtechnique7 to accurately assemble clonotypes for PCR and sequencing errors. We map D genes (for IGH, TCRB,andTCRG)ontoassembledCDR3sequencesandinfercorrespondingV(D)Jrecombination. AdetaileddescriptionofthemethodologyimplementedwithImRePisprovidedintheExtended bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. Experimental Procedures Section. ImReP is freely available at https://sergheimangul.wordpress.com/imrep/. To validate the feasibility of using RNA-Seq to study the adaptive immune repertoire, we simulatedRNA-SeqdataasamixtureoftranscriptomicreadsandreadsderivedfromBCRand TCR transcripts (Figure S3). BCR and TCR transcripts are simulated based on random recombination of V and J gene segments (obtained from IMGT database8) with non-template insertionattherecombinationjunction(FigureS2).WeassessedtheabilityofImRePtoextract CDR3-derived reads from the RNA-Seq mixture by applying ImReP to a simulated RNA-Seq mixture.Whileoursimulationapproachmaynotcompletelysummarizethevariousnuancesand eccentricitiesofactualimmunerepertoires,itallowsustoassesstheaccuracyofourtool.ImReP is able to identify 99% of CDR3-derived reads from the RNA-Seq mixture, suggesting it is a powerful tool for profiling RNA-Seq samples of immune-related tissues. Details about the simulationdataareprovidedintheExtendedExperimentalProceduressection. Next,wecomparedImRePwithothermethodsdesignedtoassembleimmunerepertoires.We alsoinvestigatedthesequencingdepthandreadlengthrequiredtoreliablyassembleTCRand BCR sequences from RNA-Seq data. Our simulations suggest that both read length and sequencing depth have a major impact on precision-recall rates of CDR3 sequence assembly. ImRePisabletomaintainan80%precisionrateforthemajorityofsimulatedscenarios.Average CDR3coveragethatishigherthan8allowsImRePtoarchivearecallratecloseto90%foraread length above 75bp (Figure 2a). Increasing coverage has a positive effect on the number of bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. assembledclonotypesachievedbyImRePforbothBandTcellreceptors.Ingeneral,weobserve higherprecision-recallratesofCDR3sequenceassemblyforTCRsincomparisontoBCRs(Figure 2a-b). We compared the performance of ImReP to MiXCR (RNA-Seq mode)9, TRUST10, TraCeR11, V’DJer12, IgBlast-based pipeline13, and iSSAKE14. These tools were developed to assemble the hypervariablesequencesintheTandBcellreceptorsdirectlyfromRNA-Seqdata.Wesupplied eachofthosetoolswiththeoriginalRNA-Seqreadsasrawormappedreads,dependingonthe softwaredevelopers’recommendations.).TRUSTandTraCeRdonotsupporttheanalysisofBCR sequencesandwereexcludedfromthecomparisonbasedfortheIGHdata.iSSAKEisnolonger supportedandwasnotrecommendedforuse.Unfortunately,weobtainedemptyoutputafter running V’DJer, and increasing coverage in the simulated data did not solve the problem. Alternativeapproaches,suchasIMSEQ15,cannotbeapplieddirectlytoRNA-Seqreadsbecause they were originally designed for targeted sequencing of B or T cell receptor loci. Thus, to independentlyassessandcompareaccuracywithImReP,weonlyranIMSEQwiththesimulated readsderivedfromBCRorTCRtranscripts(FigureS1).Scriptsandcommandstorunalltoolsused inthisstudyareprovidedintheExtendedExperimentalProceduresandareavailableonlineat https://github.com/smangul1/Profiling-adaptive-immune-repertoires-across-multiple-humantissues-by-RNA-Sequencing.ImRePconsistentlyoutperformedexistingmethodsonIGHdatain bothrecallandprecisionratesforthemajorityofsimulatedparameters.ImRePandMiXCRshow similarperformanceonTCRAdataandoutperformothermethods.Notably,ImRePwastheonly bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. methodwithacceptableperformanceonIGHdataat50bpreadlength,reconstructingwitha higherprecisionratesignificantlymoreCDR3clonotypesthanothermethods. To demonstrate the feasibility of applying non-specific RNA Sequencing to assemble immune repertoiresequences,weusedtheTCRB-Seqdatapreparedfromthreesamplesofkidneyrenal clearcellcarcinoma(KIRC)byLi,Bo,etal. 10.WedownloadedmatchingRNA-Seqsamplesfrom theTCGAportal.Intotal,weobtained301million2x50bpreadsfromthreeRNA-Seqsamples. First,wepreparedtheCDR3sequencesobtainedfromTCRB-Seqandconsideredonlycomplete CDR3s,whichwedefinedasasequenceofaminoacidsstartingwithcysteine(C)andendingwith phenylalanine (F). We considered the prepared, complete CDR3s obtained from TCRB-Seq as totalimmunerepertoire. WeusedImReP,MIXCR,TRUST,andIMSEQtoassembleCDR3softheTCRBchain.Weexcluded V’DJer,becauseitonlysupportsimmunoglobulinchainsandisnotsuitabletoassembleCDR3s fromTcellreceptors.ImReP,MIXCR,andTRUSTassembledcomparablenumbersofcomplete CDR3sthatfullymatchCDR3sfromthetotalimmunerepertoireobtainedbyTCRB-Seq;IMSEQ assemblednoneoftheCDR3sequences(Figure2candFigureS4).Onaverage,54%ofCDR3s assembledbyImRePfullymatchCDR3sfromTCRB-Seq.Ourmethodwasabletorecover0.10.9%ofthetotalimmunerepertoireobtainedbyTCRB-Seqfromkidneytissue.Othertissues, including spleen and whole blood, contain a higher fraction of T cells and allow RNA-Seq to capture a higher fraction of the total immune repertoire. One should note, the number of completeCDR3sfullymatchingCDR3sobtainedbyTCRB-Seqinourstudy(reportedinFigure2c bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. andFigureS4)arenotfullycomparablewiththeresultsreportedinLi,Bo,etal.10,whereCDR3 sequencesareconsideredtomatchCDR3sfromTCRB-Seqifatleast6aminoacidsarematched. Scripts and commands utilized to process the data and run repertoire assembly tools are provided in Extended Experimental Procedures and are available online at https://github.com/smangul1/Profiling-adaptive-immune-repertoires-across-multiple-humantissues-by-RNA-Sequencing. WefurthervalidatedtheabilityofImRePtoaccuratelyinfertheproportionofimmunecellsin the sampled tissue. We hypothesized that the fraction of B- and T-cells in the sample will be proportional with the fraction of receptor-derived reads in our RNA-Seq data. We used a transcriptome-basedcomputationalmethod,SaVant16,whichusescell-specificgenesignatures (independentofBCRorTCRtranscripts)toinfertherelativeabundanceofBorTcellswithineach tissue sample. We found that B and T cell signatures inferred by SaVant showed positive correlationwiththeamountofBCR(r=0.77,P<0.001)orTCR(r=0.86,P<0.001)transcripts, respectively (Figure 2c,d). An exception to this correlation was for tissues that contain the highestdensityofBorTcells:spleen,wholeblood,smallintestine(terminalileum),lung,and EBV-transformedlymphocytes(LCLs). bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. a TCRA TCRD IGL TCRG IGK TCRB chr2 Mappedreads chr7 V CDR1 N D N CDR2 IGH chr14 Unmappedreads J C CDR3 mRNAencodingforTandBcellreceptorchains chr22 ImReP b J gene Vgene SYEQYFGPGTRLT… …SQTSVYFCASSE VYFICASSEASSLGGSGGYTACCSYEQYFGPG read CDR3 c J gene Vgene SYEQYFGPGTRLT… …SQTSVYFCASSE read1 SGGYTACCSYEQYFGPG SQTSVYFICASSEASSLGGSGGY read2 d Suffixtree ! G M G L Q M read2 Y TACCSYEQYFGPG readsmatchingonlyVgene areencodedassuffixtree bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. Figure 1. Overview of ImReP. (a) Schematic representation of human adaptive immune repertoire. Adaptive immune repertoire consists of four T-cell receptor loci (blue color, T cell receptor alpha locus (TCRA); Tcell receptor beta locus (TCRB);T-cell receptor delta locus (TCRD); andT-cell receptor gamma locus (TCRG)) and three immunoglobulin loci (red color, Immunoglobulin heavy locus (IGH); Immunoglobulin kappa locus (IGK); Immunoglobulinlambdalocus(IGL).Alternativename–BCR,Bcellreceptor).B-andT-cellreceptorscontainmultiple variable(V,greencolor),diversity(D,presentonlyinIGH,TCRB,TCRG,violetcolor),joining(J,yellowcolor)and constant(C,bluecolor)genesegments.V(D)Jgenesegmentsarerandomlyjointedandnon-templatedbases(N, darkredcolor)areinsertedattherecombinationjunctions.TheresultingsplicedT-orB-cellrepertoiretranscript incorporatestheCsegmentandistranslatedintotheantigenreceptorproteins.RNA-Seqreadsarederivedfrom the rearranged immunoglobulin IG and TCR loci. Reads entirely aligned to genes of B- and T-cell receptors are inferredfrommappedreads(blackcolor).Readswithextensivesomatichypermutationsandreadsspanningthe V(D)J recombination are inferred from the unmapped reads (grey color). Complementarity determining region 3 (CDR3)isthemostvariableregionofthethreeCDRregionsandisusedtoidentifyT/B-cellreceptorclonotypes—a group of clones with identical CDR3 amino acid sequences. (b) Receptor derived reads spanning V(D)J recombinations are identified from unmapped reads and assembled into the CDR3 sequences. We first scan the aminoacidsequencesofthereadanddeterminetheputativeCDR3boundariesdefinedbylastconservedcysteine encoded by the V gene and the conserved phenylalanine (for TCR) or tryptophan (for BCR) of J gene. Given the putativeCDR3boundaries,wechecktheprefixandsuffixofthereadtomatchthesuffixofVandprefixofJgenes, respectively.(c-d)IncaseareadoverlapswithonlytheVorJgene,weperformthesecondstageofImRePtomatch suchreadsbasedontheoverlapofCDR3sequenceusingsuffixtree. 0.4 0.4 0.4 0.2 1 2 4 8 16 32 64 8 16 64 8 16 coverage 128 32 32 64 0.4 2 4 64 128 8 16 0 1 2 4 8 16 32 d 64 128 32 64 128 1 coverage 2 4 8 16 32 coverage IgBlast 1000 EBV cells Lung 100 10 1.0 64 128 32 64 128 1.0 0.8 0.8 Small Intestine 4 8 16 32 100 Spleen 64 EBV cells Read 100bp Lunglength Whole 64 128 Blood 1 Small Intestine 128 0.2 32 2 r = 0.77 P < 0.001 coverage 1 0.1 1 2 IMSEQ MiXCR 2 coverage 810 16 4 4 8 0 16 coverage MiXCR 0.5 1 1.5 B cell signature score IMSEQ 32 64 128 4 8 16 32 coverage 64 128 10 TRUST (SE) TRUST (PE) Whole Blood Spleen Lung 1 r = 0.86 P < 0.001 0.1 EBV cells 0.01 1 2 14 8 16 32 1.5 64 128 0.5 coverage B cell signature score TRUST (SE) 2 e 0.6 1 coverage 1000 0.0 16 0 TCR reads per million 32 0.4 0.0 16 TRUST (PE) 0 2 1 0 0.0 0.1 0.0 8 0.2 0.4 0.6 2 r = 0.77 P < 0.001 1 0.8 0.6 0.4 0.6 0.2 0.2 4 Whole Blood 64 128 Read length 100bp 0.4 0.8 0.6 0.0 2 10 Spleen Small Intestine coverage IgBlast 10000 2 0 1.0 IMSEQ 2 128 0.2 1 coverage IMSEQ coverage ImReP 16 10000 0.4 32 Read 100 length 75bp 0.0 1.0 0.8 0.6 0.4 0.2 8 8 0.0 4 1 0.2 128 1 4 4 0.2 0.4 0.0 2 4 0.8 64 64 128 2 1.0 0.4 0.2 0.0 64 128 1.0 1.0 32 0.4 16 0.0 Precision(TCRA) 32 Read length 75bp Read length 50bp 1 ImReP 16 MiXCR 0.8 0.6 0.4 0.2 0.0 8 2 2 0 Final figures 2c,d Read length 75bp Read length 100bp 1.0 Recall(TCRA) 1.0 0.8 0.6 0.4 4 0.2 0.0 coverage 2 MiXCR Read length 50bp 32 8 coverage 1 1 ImReP 2 0.8 0.8 0.6 1.0 0.8 0.6 0.4 1.0 4 128 coverage 16 1 0.6 8 64 128 coverage 1 8 128 0.2 4 Readb length 50bp 4 64 1 1 Final figures 2c,d Read length 100bp ImReP 2 32 0 2 Read length 75bp 0.2 0.4 2 coverage 1 16 TRUST 12 2 6 coverage 2 0.6 1 0.2 1 32 8 MiXCR 735 1 coverage 0.8 0.8 0.4 1.0 16 4 TCRBSeq ImReP 0.2 8 64 128 2 BCR reads per million 4 0.2 1 0.0 0.0 2 0.0 128 64 128 0.0 64 32 Read length 100bp 1.0 32 16 Read length 100bp c 0.8 16 8 0.6 8 0.0 0.2 0.4 0.6 0.6 0.8 1.0 4 0.0 Precision(IGH) 32 coverage 1 4 Read length 100bp 1.0 1.0 0.6 0.4 0.2 0.0 2 Read length 50bp 16 2 coverage Read length 75bp coverage 8 1 Read length 75bp 0.8 0.8 Recall(IGH) 1.0 1.0 0.8 1 0.2 0.4 0.0 0.2 0.4 0.6 0.6 0.8 1.0 Read length 50bp 4 64 128 Read length 75bp Read length 50bp 2 32 0.0 0.2 0.4 0.6 0.6 0.8 1.0 Readalength 50bp 1 16 coverage TCR reads per million 64 128 BCR reads per million 32 1.0 16 0.8 8 coverage 0.6 4 0.4 2 0.0 1 0.0 0.2 0.0 0.0 0.2 bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. 0.5 1 1.5 T cell signature score Tracer Tracer Figure2.EvaluationofImReP.(a-b)EvaluationofImRePbasedonthenumberofassembledCDR3sequencesand comparison to existing methods. (c) Concordance of targeted TCRB-Seq and non-specific RNA-Seq. (d-e) CorrespondenceofImReP-derivedreadsfromB-cell(BCR)andT-cell(TCR)receptorstotherelativeabundanceofB- andT-cellsinferredfromcell-specificgeneexpressionprofiles.(a)PrecisionandrecallratesforImReP(blue),MiXCR (RNA-Seq mode) (blue), IMSEQ (green), and IgBlast (orange) on simulated data for immunoglobulin heavy (IGH) transcriptsarereportedforvariousreadslength(separateplots)andpertranscriptcoverages(1,2,4,8,16,32,64,128) (x-axis).TRUSTandTraCeRdonotsupporttheanalysisofBCRsequencesandwereexcludedfromthecomparison 2 bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. basedfortheIGHdata.(b)PrecisionandrecallratesforImReP(blue),MiXCR(RNA-Seqmode)(red),TRUST(default, paired-endmode)(pink),TRUST(single-endmode)(violet),IMSEQ(green),andTraCeR(aqua)onsimulateddatafor Tcellreceptoralpha(TCRA)transcriptsarereportedforvariousreadslength(separateplots)andpertranscript coverages (1,2,4,8,16,32,64,128) (x-axis). (c) Concordance of targeted TCRB-Seq and non-specific RNA-Seq performedonthreeTCGAsamples(onlyoneisshown)fromkidneyrenalclearcellcarcinoma(KIRC).Venndiagram on TCGA-CZ-5463 sample presents number of matching CDR3s reported by immunoSEQ Analyzer (http://www.adaptivebiotech.com/)andCDR3sequencesassembledfromnon-specificRNA-SeqdatabyImReP, MiXCR(RNA-Seqmode),TRUST(default, paired-end mode). Results on other samples are presented in Figure S4. (d)ScatterplotofthenumberofallBCRreadsper1millionRNA-Seqreads(y-axis)andB-cellsignaturescore inferredbySaVant(x-axis).(e)ScatterplotofthenumberofallTCRreadsper1millionRNA-Seqreads(y-axis)andBcellsignaturescoreinferredbySaVant(x-axis).Pearsoncorrelationcoefficient(r)andP-valuearereported. Characterizingtheadaptiveimmunerepertoireacross53GTExtissues ImReP identified over 26 million reads overlapping 3.8 million distinct CDR3 sequences that originatefromdiversehumantissues.ThemajorityofassembledCDR3sequencesderivedfrom BCRs, with 1.7 million from the immunoglobulin heavy chain (IGH), 0.9 million from the immunoglobulinkappachain(IGK),and1.0millionfromtheimmunoglobulinlambdachain(IGL). AsmallerfractionofCDR3sequencesderivedfromTCRs,with0.2millionsequencesfromalpha andbetaTCRs(TCRAandTCRB).ThevastmajorityofallassembledCDR3shadalowfrequency inthedata.98%ofCDR3sequenceshadacountoflessthan10reads,andthemedianCDR3 sequencecountwas1.4.CDR3sequencesderivedfromIGKwerethemostabundantacrossall tissues,accountingonaveragefor54%oftheentireB-cellpopulation(FigureS5).IntheT-cell bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. population,alphaandbetajointlyaccountedfor83%ofthepopulation.DeltaT-cellpopulation wastherarest,accountingforlessthanonepercentoftheentireTcellpopulation(FigureS6). We compared the length and amino acid composition of the assembled CDR3 sequences of immunoglobulinandT-cellreceptorchains(Figure3a-g).Consistentwithpreviousstudies,we observedthatimmunoglobulinlightchainshavenotablyshorterandlessvariableCDR3lengths comparedtoheavychains17(Fig.3h).Thetissuetypeappearstohavenoeffectonthelength distributionofCDR3sequencesofBCRs(FigureS7).DifferencesintheCDR3lengthdistributions for TCRs cannot be estimated due to the small number of available TCRs. Sequencing composition18ofCDR3regionsofbetaT-cellsassembledbyImRePrecapitulatesonedetectedby TCRsequencing2fromthewholebloodwithtwodominantCASSandCSARmotifsinthebeginning of the sequence (Fig. 3f). In line with other studies19, both light chains exhibited a reduced amountofsequencingdiversity(Fig.3b-c). a d IGH TCRA bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. b IGK TCRB c f IGL TCRD e h TCRG g Length(amino acids) Figure 3. Length and amino acid composition of the assembled CDR3 sequences of immunoglobulin and T-cell receptorchains.Thesequencelogo(usingWebLogo)ofaminoacidcompositionrepresentationforCDR3sequences withmeanlength.Theheightoftheaminoacidwithinthestackindicatestherelativefrequency.Distributionof CDR3sequencelengthisestimatedusingskerneldensity.(a)Sequencelogoof15-amino-acidCDR3sequenceof IGH.(b)Sequencelogoof11-amino-acidCDR3ofIGK.(c)Sequencelogoof12-amino-acidCDR3sequenceofIGL.(d) Sequencelogoof14-amino-acidCDR3sequenceofTCRA.(e)Sequencelogoof14-amino-acidCDR3sequenceof TCRB. (f) Sequence logo of 17-amino-acid CDR3 sequence of TCRD. (g) Sequence logo of 13-amino-acid CDR3 bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. sequenceofTCRG.(h)DistributionofCDR3sequencelengthisestimatedusingskerneldensityseparatelyforeach chainofT-andB-cellreceptors. We observed per sample an average of 1331 distinct clonotypes for BCRs and 20 distinct clonotypessequencesforTCRs.Wenormalizedthenumberofdistinctclonotypesbythetotal numberofRNA-Seqreads,whichwecallnumberofclonotypesperonemillionreads(CPM).As thenumberofdistinctclonotypesdoesnotincreaselinearlywiththesequencingdepth,CPM metricshouldnotbeusedinstudiescomparingclonotypediversityacrossvariousphenotypes. Instead,CPMisintendedtobeaninformativemeasureofclonaldiversityadjustedforsequencing depth. We used per sample alpha diversity (Shannon entropy) to incorporate the total number of distinctclonotypesandtheirrelativefrequenciesintoasinglediversitymetric.Amongalltissues, spleenhasthelargestB-cellpopulation,withamedianof1301BCR-derivedreadsperonemillion RNA-Seqreads.ItalsohasthemostdiversepopulationofBcellswithmedianpersamplealpha diversityof7.6correspondingto1025CPM(Figure4andTableS1).Wholebloodhasboththe largestandmostdiverseT-cellpopulation(Figure4andTableS1).Organsthatpossessmucosal, exocrine,andendocrinesites(n=24)harborarichclonotypepopulationwithamedianof87CPM persample.Minorsalivaryglandshavethehighestimmunediversityinthegroup(alpha=7.1) andsurpassthediversityoftheterminalIleumcontainingPeyer’sPatches,whicharesecondary lymphoidorgans(TableS1). bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. Tissuesnotrelatedtotheimmunesystem,includingadipose,muscle,andtheorgansfromthe centralnervoussystem,containedamedianof6CPMpersample,whicharemostlikelydueto thebloodcontentofthetissues20.ThehighestnumberofdistinctCDR3sequencesamongnonlymphoid organs was present in the omentum, a membranous double layer of adipose tissue containingfat-associatedlymphoidclusters.Asexpected21,Epsteinbarvirus(EBV)-transformed lymphocytes(LCL)harboredalargehomogeneouspopulationofBcellclonotypes(TableS1and FigureS7). bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. a 14 12 CLONOTYPIC RICHNESSOFTCELLS,CPM 10 8 6 TCRA TCRB TCRD TCRG 4 2 0 b CLONOTYPIC RICHNESSOFBCELLS, CPM 0 Spleen SmallIntestine- TerminalIleum WholeBlood Artery- Coronary Artery- Aorta Artery- Tibial MinorSalivaryGland Colon- Transverse Stomach Colon- Sigmoid Lung Breast- MammaryTissue Esophagus- Mucosa Thyroid Kidney- Cortex Bladder FallopianTube Vagina Cervix- Endocervix Liver Cervix- Ectocervix Prostate Uterus Ovary Pancreas Pituitary Skin- NotSunExposed… Skin- SunExposed(Lowerleg) AdrenalGland Testis Cells- EBV-transformed… Cells- Transformedfibroblasts Adipose- Visceral(Omentum) Esophagus- Gastroesophageal… Adipose- Subcutaneous Heart- AtrialAppendage Esophagus- Muscularis Heart- LeftVentricle Muscle- Skeletal Nerve- Tibial Brain- Spinalcord(cervicalc-1) Brain- Amygdala Brain- Cortex Brain- Hypothalamus Brain- Caudate(basalganglia) Brain- Substantianigra Brain- Hippocampus Brain- Putamen(basalganglia) Brain- Anteriorcingulatecortex Brain- Nucleusaccumbens(basal… Brain- Cerebellum Brain- FrontalCortex(BA9) Brain- CerebellarHemisphere 200 400 600 IGH IGK 800 1000 IGL Tissuestypes Secondarylymphoidorgans Bloodassociatedsites Mucosal,exocrineandendocrineorgans Celllines Adiposeandmuscleorgans Organsofcentralnervoussystem bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. Figure 4. Adaptive immune repertoires across multiple human tissues. Adaptive immune repertoires of 8,555 samplesacross544individualsfrom53bodysitesobtainedfromGenotype-TissueExpressionstudy(GTExv6).We groupthetissuesbytheirrelationshiptotheimmunesystem.Thefirstgroupincludesthelymphoidtissues(n=2,red colors).Thesecondgroupincludesbloodassociatedsitesincludingwholebloodandbloodvessel(n=4,redcolor). Thethirdgrouparetheorgansthatencompassesmucosal,exocrineandendocrineorgans(n=21,lavendercolor). The fourth group are cell lines (n=3, grey color). The filth group are adipose or muscle tissues and the gastroesophagealjunction(n=7,bluecolor).Thesixthgroupareorgansfromcentralnervoussystem(n=14,green color).HistogramreportsclonotypicrichnessofTandBcells,calculatedasnumberofdistinctaminoacidsequences ofCDR3peronemillionRNA-Seqreads(CPM).(a)MedianCPMsarepresentedindividuallyforTcellreceptoralpha chain(TCRA),T-cellreceptorbetachain(TCRB),T-cellreceptordeltachain(TCRD),andT-cellreceptorgammachain (TCRG).(b)MediannumberofdistinctaminoacidsequencesofCDR3arepresentedindividuallyforimmunoglobulin heavychain(IGH),immunoglobulinkappachain(IGK),immunoglobulinlambdachain(IGL). Individual-andtissue-specificT-andB-cellclonotypes Aminoacidsequencesofclonotypesexhibitedextremeinter-individualdissimilarity,with88%of clonotypesuniquetoasingleindividual(private)(Figure5a).Theremaining~400,000clonotypes weresharedbyatleasttwoindividuals(public).Thenumberofindividualssharingclonotypes variedacrossTandBcellreceptors,withimmunoglobulinlightchainshavingthehighestnumber ofpublicclonotypes.Twenty-fivepercentofallIGKclonotypeswerepublic,andthenumberof individuals sharing the IGK clonotype sequences can be as high as 471 (Figure 5b). T cell clonotypeshadonaverage7%publicclonotypes,withthehighestnumberofpublicclonotypes amongthedeltachainofgamma-deltaTCRs(12%).ThelimitedcapacityofRNA-Seqtocoverlow bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. abundant clonotypes may misclassify public clonotypes as private. Consistent with previous studies10,22,weobservepublicclonotypestobesignificantlyshorterinlengththantheprivate ones(p-value<2x10-16).Forexample,IGHchainpublicclonotypeshadanaveragelengthof13 aminoacids,andprivateclonotypeshadanaveragelengthof16.Wealsoexaminedwhetherthe publicclonotypesweremoreoftensharedacrosstissueswithinanindividual.Only14%ofthe ~240,000 clonotypes shared across tissues were public. The majority of clonotypes were individual-andtissue-specific(Figure5c).Thefulllistofpublicclonotypesisdistributedwiththe ‘AtlasofT-andB-cellrepertoires’thataccompaniesthismanuscript. bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. b -1 106 58 IGHV3-53 105 104 BCRs Numberofclonotypesequences a 103 CARGHYGMDVW IGHJ6 471 IGKV3-20 CQQYGSSPRTF IGKJ2 102 443 10 IGLV3-1 1 IGL TCRA TCRB TCRD TCRG TCRs IGK c Tissuespecific Sharedacross tissues Individualspecific(private) 3.1x10 6 0.2x10 6 Sharedacrossindividuals (public) 0.4x10 6 0.03x10 6 IGLJ2 90 1 3 5 7 9 11 13 15 17 Numberofindividualssharing theclonotypesequence IGH CQAWDSSTVVF TRAV8-7 CAGADRLQTGMRGAF TRBV10-3 CVIKYF TRAJ19 67 TRGV10 CAAWDYYKKLF TRBJ2-3 TRGJ1(2) 44 Figure5.PrivateandpublicTandBcellclonotypes.(a)Distributionoffrequenciesofprivate(n=1)andpublic(n>1) clonotypes across 544 individuals. We collect clonotypes from all tissues of the same individual into a single set correspondingtothatindividual.(b)Themostpublicclonotypes(sharedacrossmaximumnumberofindividuals) and corresponding VJ recombination are presented for IGH, IGK, IGL, TCRA, TCRB, and TCRG. (c) Clonotypes sequences are classified into public clonotypes (shared across individuals), private (individual specific), tissuespecific,andclonotypessharedacrossmultipletissues.Thenumberofclonotypesfallingintoeachpairofcategories isreportedacrossallTandBcellreceptorchains. bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. FlowofTandBcellclonotypesacrosshumanGTExtissues The large number of individuals available through the study allow us to establish a pairwise relationshipbetweenthetissuesandtracktheflowofTandBclonotypesacrosshumantissues. WobservedasignificantincreaseoftheCDR3sequencessharedacrosspairsoftissuesfromthe same individuals. Further, we observed this pattern consistently for all chains of B and T cell receptors(p-value<2x10-16)(Figure6aandTableS2).Weobserveadifferentamountofshared CDR3 sequences across different types of BCRs and TCRs with an increase in immunoglobulin light chains. Decreased number of TCR reads compared to the BCRs makes it unfeasible to comparethenumberofsharedCDR3acrossTandBcells.Onaverage,weobserve21.0CDR3 sequencestobesharedacrossapairoftissuesfromthesameindividuals.Pairsoftissuesfrom differentindividualsshareonaverage10.6CDR3sequences(Figure6aandTableS2). ToestablishtheflowofB-andT-cellclonotypesacrossvarioustissues,wecomparedclonotype populationsbetweenandwithinthesameindividuals.Welimitedthisanalysistopairsoftissues forwhichwehadatleast10individuals(870pairsoftissuesoutof1378possiblepairs).Weused beta diversity (Sørensen–Dice similarity index) to measure compositional similarities between thetissuesintermsofgainorlossofCDR3sequences(Figure6b-c).Forthemajorityofthe870 availabletissuepairs,weobservenoBCRorTCRsequencesincommon,whichcorrespondsto betadiversityof0.0. bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. WeexaminedtheflowofIGHclonotypesacrosstissuesandpresenteditasanetwork(Figure 5b).Among870availabletissuepairs,wehaveidentified56tissuepairswithbetadiversityabove .001.Spleenwasthemosthighlyconnectedtissue,with17connections,followedbylung,with 16 connections. Clonotypes represents one connected component, meaning that every two nodesareconnecteddirectlyorviaothernodes.Clonotypepopulationsofspleenandlungare themostsimilar(0.02betadiversity),otherpairsincludeminorsalivaryglandandesophagus mucosa,terminalileum(smallintestine)andtransversecolon.Weobserveabove200pairsof tissueswithbetadiversityabove.001forimmunoglobulinlightchains(FigureS9-S10).Themost similartissuepairsforIGKchainwerespleenandtransversecolon(0.15betadiversity). WealsoexaminedtheflowofTCRBclonotypesacrosstissues.TCRBclonotypesequenceswere sharedacross10pairsoftissueswithaveragebetadiversityof0.02.SimilartoBcells,thenetwork of T cell clonotypes is a single connected component, with spleen being the most highly connectedtissue(Figure6c).FortheTCRgammachain,weobservebetadiversityof0.0forall tissuepairs.Atthesametime12%ofTCRgammaclonotypesarepublic,showingthehighestrate amongallTCRgenes.AtthesequencingdepthprovidedbyRNA-Seq,weareunabletoobserve TCRdeltaclonotypesharingacrossindividualsandtissues. bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. b a 10 TRCG TRCD TRCB TRCA IGL -5 IGK 0 IGH 5 IGH Average numberofclonotypes sharedacrosspairsoftissues 15 -10 -15 Pairoftissues withinthesame individual Pairoftissues acrossdifferentindividuals TCRB c Figure6.FlowofTandBcellclonotypesacrossdiversehumantissues.Resultsarebasedonpairsoftissueswithat least10individuals.(a)Thenumberofclonotypesequencessharedacrosspairsoftissuesfromthesameindividuals (bluecolor)andfromdifferentindividuals(orangecolor)arepresented.Numberofclonotypessharedacrosstissues fromthesameindividualsforTCRsis<.1.Numberofclonotypessharedacrosstissuesfromdifferentindividualsfor TCRsis<.001.(b)-(c)Flowofclonotypesacrossdiversehumantissuesispresentedasanetwork.Eachnodeisatissue withthesizeproportionaltoamediannumberofclonotypesofthetissue.Thecolorofthenodecorrespondstoa typeofthetissuetype:lymphoidtissues(yellowcolors),bloodassociatedsites(redcolor),organsthatencompasses mucosal,exocrineandendocrineorgans(lavendercolor).Compositionalsimilaritiesbetweenthetissuesintermsof gain or loss of CDR3 sequences are measured across valid pairs of tissues using beta diversity (Sørensen–Dice similarity index). Edges are weighted according to the beta diversity. (b) Flow of IGH clonotypes across diverse bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. humantissuesispresentedasanetwork.Edgeswithbetadiversity>.001arepresented.(c)FlowofTCRBclonotypes acrossdiversehumantissuesispresentedasanetwork.Edgeswithbetadiversity>.001arepresented. ImRePidentifiestissuesampleswithlymphocyteinfiltration Histologicalimagesoftissuecross-sectionsandpathologists’noteshavebeenusedtovalidate the ImReP’s ability to detect the samples with a high lymphocyte content, which often correspondstoadiseasestate.WeexaminedtheIGHclonotypepopulationsfromthyroidtissue acrossindividuals.ThemediannumberofinferreddistinctCDR3sequencespersamplewas20, though14.5%ofthesampleshadmorethan500distinctCDR3sequences.Weobservedthe highestnumberofCDR3sequencesamongallthethyroidsamplesinanindividualwithlatestage Hashimoto'sthyroiditis,anautoimmunediseasecharacterizedbylymphocyteinfiltrationandTcellmediatedcytotoxicity.Accordingtopathologists’notes,Hashimoto'sdiseasewaspresentin 11.2%ofthyroidsamples,withvaryingdegreesofseverity.First,weusedpathologists’notesto annotate samples as healthy or bearing Hashimoto's disease, and then we compared the adaptiverepertoirediversitybetweenthesegroups.Weobservedasignificantincreaseinthe number of distinct IGH clonotypes in samples with Hashimoto's thyroiditis (p-value= 1.5x10-5) (FigureS11).Thenumberofclonotypesvariedfrom113forfocalHashimoto'sthyroiditisto5621 forlatestageHashimoto'sthyroiditis(Figure7a).Inaddition,highclonotypediversityinkidney samplesindicatedthepresenceofglomerulosclerosis.Inlungsamples,highclonotypediversity correspondedtoinflammatorydiseasessuchassarcoidosisandbronchopneumonia. bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. Weobservednodifferenceinclonaldiversityinmalesandfemalesacrossthetissues,exceptin breasttissues(p-value<3.2x10-12,BCRs).Increasedclonotypediversityofbreasttissueinmale individualscorrespondedtogynecomastia,acommondisorderofnon-cancerousenlargementof malebreasttissue(Figure7b). bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. Diseaseseverity Histological images a LATESTAGE HASHIMOTOS HASHIMOTOS 496 2353 1315 EARLYHASHIMOTOS FOCALHASHIMOTOS 113 PATCHY HASHIMOTOS NumberofIGH clonotypes 5621 b Gynecomastia sex_s male 3000 2000 1000 0 female Number of IGH clonotypes Figure7.ImRePisabletoidentifysampleswithhighactivityoflymphocytes.Histologicalimagesoftissuecrosssectionsandpathologists’noteshavebeenusedtovalidatetheImReP’sabilitytodetectthesampleswithahigh activityoflymphocytes.(a)SampleswereorderedbyHashimoto'sthyroiditisseverity,asreportedbypathologists’ notes.Histologicalimagesareprovidedtoillustratethediseasestate.Averagenumberofclonotypesisreportedfor eachdiseasegroup.(b)Boxplotreportingnumberofclonotypesinthebreasttissuesformalesandfemales.Outlier amongthemalesamplesisillustratedwiththehistologicalimage. bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. Discussion We have developed a novel computational approach (ImReP) for reconstruction of adaptive immune repertoires using RNA-Seq data. We demonstrate the ability of ImReP to efficiently extract TCR- and BCR- derived reads from the RNA-Seq data and accurately assemble correspondingBCRandTCRclonotypes.TheproposedalgorithmcanaccuratelyassembleCDR3 sequencesofimmunereceptorsdespitethepresenceofsequencingerrorsandshortreadlength. Simulations generated using various read lengths and coverage depth show that ImReP consistentlyoutperformsexistingmethodsintermsofprecisionandrecallrates. We have demonstrated the feasibility of applying RNA-Seq to study the adaptive immune repertoire.AlthoughRNA-Seqlacksthesequencingdepthoftargetedsequencing(Rep-Seq),it can compensate by examining a larger sample size. Using ImReP, we have created the first systematicatlasofimmunologicalsequenceforB-andT-cellreceptorrepertoriesacrossdiverse humantissues.Thisprovidesarichresourceforcomparativeanalysisofarangeoftissuetypes, mostofwhichhavenotbeenstudiedbefore.TheatlasofT-andB-cellreceptors,availablewith thepaper,isthelargestcollectionofCDR3sequencesandtissuetypes.Weanticipatethatthis database will enhance future studies in areas such as immunology and contribute to the developmentoftherapiesforhumandiseases. Using RNA-Seq to study immune repertoires has some advantages, including the ability to simultaneouslycapturebothTandBcellclonotypepopulationsduringasinglerun.Italsoallows bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. simultaneousdetectionofoveralltranscriptionalresponsesoftheadaptiveimmunesystem,by comparingchangesinthenumberofBCRandTCRtranscriptstothemuchlargertranscriptome. Given large number of large-scale RNA-Seq datasets becoming available, we look forward to scalinguptheatlasofT-andB-cellreceptorsinordertoprovidevaluableinsightsintoimmune responsesacrossvariousautoimmunediseases,allergies,andcancers. bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. References 1. Georgiou,G.etal.Thepromiseandchallengeofhigh-throughputsequencingofthe antibodyrepertoire.Nat.Biotechnol.32,158–168(2014). 2. Freeman,J.D.,Warren,R.L.,Webb,J.R.,Nelson,B.H.&Holt,R.A.ProfilingtheT-cell receptorbeta-chainrepertoirebymassivelyparallelsequencing.GenomeRes.19,1817– 1824(2009). 3. Rajewsky,K.,Forster,I.&Cumano,A.Evolutionaryandsomaticselectionoftheantibody repertoireinthemouse.Science(80-.).238,1088–1094(1987). 4. Benichou,J.,Ben-Hamo,R.,Louzoun,Y.&Efroni,S.Rep-Seq:uncoveringthe immunologicalrepertoirethroughnext-generationsequencing.Immunology135,183– 191(2012). 5. Mangul,S.etal.DumpsterdivinginRNA-sequencingtofindthesourceofeverylastread. bioRxiv53041(2016). 6. Ardlie,K.G.etal.TheGenotype-TissueExpression(GTEx)pilotanalysis:Multitissuegene regulationinhumans.Science(80-.).348,648–660(2015). 7. Ben-Dor,A.,Shamir,R.&Yakhini,Z.Clusteringgeneexpressionpatterns.J.Comput.Biol. 6,281–297(1999). 8. Lefranc,M.-P.etal.IMGT{®},theinternationalImMunoGeneTicsinformationsystem{®} 25yearson.NucleicAcidsRes.gku1056(2014). 9. Bolotin,D.A.etal.MiXCR:softwareforcomprehensiveadaptiveimmunityprofiling.Nat. bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. Methods12,380–381(2015). 10. Li,B.etal.Landscapeoftumor-infiltratingTcellrepertoireofhumancancers.Nat. Genet.(2016). 11. Stubbington,M.J.T.etal.Tcellfateandclonalityinferencefromsingle-cell transcriptomes.Nat.Methods(2016). 12. Mose,L.E.etal.Assembly-basedinferenceofB-cellreceptorrepertoiresfromshortread RNAsequencingdatawithV’DJer.Bioinformaticsbtw526(2016). 13. Strauli,N.&Hernandez,R.StatisticalInferenceofaConvergentAntibodyRepertoire ResponsetoInfluenzaVaccine.bioRxiv25098(2015). 14. Warren,R.L.,Nelson,B.H.&Holt,R.A.ProfilingmodelT-cellmetagenomeswithshort reads.Bioinformatics25,458–464(2009). 15. Kuchenbecker,L.etal.IMSEQ—afastanderrorawareapproachtoimmunogenetic sequenceanalysis.Bioinformatics31,2963–2971(2015). 16. Lefranc,M.-P.etal.SaVanT:Aweb-basedtoolforthesample-levelvisualizationof molecularsignaturesingeneexpressionprofiles.NucleicAcidsRes.gku1056(2014). 17. Philibert,P.etal.AfocusedantibodylibraryforselectingscFvsexpressedathighlevelsin thecytoplasm.BMCBiotechnol.7,1(2007). 18. Crooks,G.E.,Hon,G.,Chandonia,J.-M.&Brenner,S.E.WebLogo:asequencelogo generator.GenomeRes.14,1188–1190(2004). 19. Hoi,K.H.&Ippolito,G.C.Intrinsicbiasandpublicrearrangementsinthehuman immunoglobulinVλlightchainrepertoire.GenesImmun.14,271–276(2013). 20. Yu,H.-P.,Chiu,Y.-W.,Lin,H.-H.,Chang,T.-C.&Shen,Y.-Z.Bloodcontentinguinea-pig bioRxiv preprint first posted online Nov. 22, 2016; doi: http://dx.doi.org/10.1101/089235. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. tissues:correctionforthestudyofdrugtissuedistribution.Pharmacol.Res.23,337–347 (1991). 21. DeRossi,A.etal.InfectionofEpstein-Barrvirus-transformedlymphoblastoidBcellsby thehumanimmunodeficiencyvirus:evidenceforapersistentandproductiveinfection leadingtoBcellphenotypicchanges.Eur.J.Immunol.20,2041–2049(1990). 22. Warren,R.L.etal.ExhaustiveT-cellrepertoiresequencingofhumanperipheralblood samplesrevealssignaturesofantigenselectionandadirectlymeasuredrepertoiresizeof atleast1millionclonotypes.GenomeRes.21,790–797(2011).