Download MiTCR: software for T-cell receptor sequencing data analysis

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

List of types of proteins wikipedia , lookup

Amitosis wikipedia , lookup

Cellular differentiation wikipedia , lookup

JADE1 wikipedia , lookup

RNA-Seq wikipedia , lookup

Transcript
correspondEnce
The approaches MiTCR uses to analyze TCR sequencing data
have been shown to be efficient in previous studies5,6. The software performs CDR3 extraction, identifies V, D and J segments,
assembles clonotypes, filters out or rescues low-quality reads5 and
To the Editor: High-throughput sequencing technologies have trans- provides advanced correction of PCR and sequencing errors 1,5
formed the field of antigen receptor diversity studies, enabling deep
using either a predefined or user-specified strategy (Fig. 1a and
and quantitative analysis for deciphering adaptive immunity function Supplementary Notes 1–3). Simple command-line parameters,
in health and disease1–4. As more data are produced each year, there human-readable configuration files and a well-documented appliis steadily growing demand for standardized analysis software.
cation programming interface (API) optimized for use in scripts
Here we report MiTCR, an open-source software for rapid, make the software flexible enough for routine data extraction by
robust and comprehensive analysis of hundreds of mil- immunologists as well as for more advanced analysis and cuslions of raw high-throughput sequencing reads contain- tomization by bioinformaticians (Supplementary Note 1 and
ing sequences encoding human or mouse a or β T-cell anti- Supplementary Data 1).
gen receptor (TCR) chains (Supplementary Software). Raw
We computationally optimized and parallelized the algorithms
data in FASTQ format generated via Illumina, 454 or Ion such that MiTCR can efficiently extract CDR3 information at
Torrent sequencing can be used as input for analysis. The only
a speed of more than 50,000 sequencing reads per second (0.3–
requirement is that sequence encoding conserved positions 0.6 gigabases min–1) on standard PC hardware. For example,
Illumina MiSeq run of 10 million reads can be analyzed in ~3 min,
flanking the complementarity-determining region 3 (CDR3),
and a HiSeq lane of 100 million reads can be analyzed in ~20 min
Cys104 and Phe118 or Trp118, is covered by a sequencing read.
(Supplementary Note 4).
Output is provided in a tab-delimited
b Identified clonotypes (input = 10 true clonotypes)
a
Raw
text file that contains exhaustive inforsequencing
200
PCR errors
mation regarding TCR clonotype comreads
Sequencing errors
190
TCR-α
180
position, abundance and aggregated
TCR-β
CDR3 extractor
170
160
sequence quality (Supplementary Data
150
2). Additionally, we developed MiTCR
140
TRBV7-6 TRBD2
TRBJ1-4
CDR3-marked C
F
130
Viewer software that works with a cussequencing read
120
with quality data
110
tom (*.cls) format produced by MiTCR,
100
enabling convenient visualization, filter90
80
ing and in silico spectratyping of the data
Clonotype builder
70
Deposit
Cluster
60
(http://mitcr.milaboratory.com/viewer/;
low-quality
high-quality
50
CDR3 reads
CDR3 reads
Supplementary Data 3).
40
Map low-quality
30
CDR3 reads
Core clonotypes:
We demonstrated the accuracy and spec20
× 182
10
ificity of MiTCR for the analysis of both
0
×2
cDNA-based and genomic DNA–based
Input data
Low-quality
SMD
ETE
read mapping correction
correction
c
high-throughput sequencing datasets
Reads with identified CDR3 (input = 3 × 10 reads)
(Supplementary Tables 1 and 2).
10
0
To verify software performance with
Error corrector
100
TGTGCCAGCAGCTTAGCGCCGGGAGCAACTAATGAAAAACTGTTTTTT
datasets
of known clonotype composition,
× 219
90
80
TGTGCCAGCAGCTTAGCGCCGGGAGCAACTAATCAAAAACTGTTTTTT
we generated Illumina-like data sets in silico,
× 3
70
based on real rates of PCR and sequenc60
50
ing errors (Supplementary Note 5).
40
TRBV7-6
TRBD2
TRBJ1-4
30
We determined the efficiency of human
20
C A S S L A P G A T N E K L F F × 222
TCR-α and TCR-β CDR3 extraction, clo10
0
notype
generation and error correction for
Input data
Low-quality
SMD
ETE
read mapping correction
correction
the model data (Fig. 1b,c). Accuracy of
V and J segment identification was
Figure 1 | MiTCR data analysis. (a) Analysis pipeline. Vertical bars show sequence quality. Horizontal
97–99%.
MiTCR efficiency was supebars represent raw sequencing reads. (b,c) Analysis of in silico–simulated data of 3 × 106 sequencing
rior
compared
to that of existing CDR35
reads (average Phred quality = 33) containing 10 human TCR-a or TCR-β CDR3 clonotypes. Efficiency of
extraction packages (Supplementary
CDR3 identification and correction of PCR and sequencing errors is shown for the input clonotypes (b)
Note 6, Supplementary Table 3 and
and CDR3-containing reads (c). The error-correction strategies (SMD, save my diversity and ETE,
eliminate these errors) are described in Supplementary Notes 1–3.
Supplementary Fig. 1).
True (%)
TGT GCC AGC AGC TTA GCG CCG GGA GCA ACT AAT GAA AAA CTG TTT TTT
Erroneous (%)
5
TGT GCC AGC AGC TTA CCG CCG GGA GCA ACT AAT GAA AAA CTG TTT TTT
TGT GCC AGC AGC TTA GCG CCG GGA GCA ACT AAT GAA AAA CTG TTT TTT
TGT GCC AGC AGC TTA GCG CCG GGA GCA ACT AAT GAA AAA CTG TTT TTT
TGT GCC AGC AGC TTA GCG CCG GGA GCA ACT AAT CAA AAA CTG TTT TTT
Erroneous (%)
TGT GCC AGC AGC TTA GCG ACG GGA GCA ACT AAT GAA AAA CTG TTT TTT
6
True (%)
npg
© 2013 Nature America, Inc. All rights reserved.
MiTCR: software for T-cell receptor
sequencing data analysis
TGT GCC AGC AGC TTA GCG CCG GGA GCA ACT AAT GAA AAA CTG TTT TTT
nature methods | VOL.10 NO.9 | SEPTEMBER 2013 | 813
correspondEnce
MiTCR modules are checked by more than 80 comprehensive unit tests, which improved the reliability and correctness of
the code. The MiTCR API package can be used in Java projects
through Maven and in Groovy scripts using Groovy Grapes.
MiTCR is regularly updated; Windows installer, cross-platform
binaries and source code are available from http://mitcr.milaboratory.com/ under the terms of the GNU GPL v3 license.
a
Germ cells
Endothelial cells
Meso-/endodermal
progenitors
Tissue and tissue
mixtures
Other cancers
Teratomas
Embryonal
carcinomas
Blood cells
npg
© 2013 Nature America, Inc. All rights reserved.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
Embryonic
stem cells
Embryoid
bodies
Note: Any Supplementary Information and Source Data files are available in the
online version of the paper (doi:10.1038/nmeth.2555).
ACKNOWLEDGMENTS
We are grateful to T. Schumacher and C. Linnemann for valuable technical
discussions and to M. Eisenstein for English editing. This work was supported
by the Molecular and Cell Biology program of the Russian Academy of Sciences
and Russian Foundation for Basic Research (12-04-33139-mol-a, 12-04-33065mol-a, 12-04-00229-a, 13-04-00998-a and 13-04-40251).
Others
Mesenchymal
cells
Induced
pluripotent stem cells
Neural
stem cells
Fibroblasts
b
Query samples
List of
selected
samples
Dmitriy A Bolotin1,3, Mikhail Shugay1,3, Ilgar Z Mamedov1,
Ekaterina V Putintseva1, Maria A Turchaninova1,
Ivan V Zvyagin1,2, Olga V Britanova1 & Dmitriy M Chudakov1,2
1Shemiakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of
Sciences, Moscow, Russia. 2Central European Institute of Technology, Masaryk
University, Brno, Czech Republic. 3These authors contributed equally to this work.
e-mail: [email protected]
Analyze samples
Define 1 gene
Coexpression
tool
Define 2 sample groups
Define genes
Differential expression
tool
PUBLISHED ONLINE 28 JULY 2013; DOI:10.1038/NMETH.2555
1.
2.
3.
4.
5.
6.
Nguyen, P. et al. BMC Genomics 12, 106 (2011).
Robins, H.S. et al. Blood 114, 4099–4107 (2009).
Venturi, V. et al. J. Immunol. 186, 4285–4294 (2011).
Warren, R.L. et al. Genome Res. 21, 790–797 (2011).
Bolotin, D.A. et al. Eur. J. Immunol. 42, 3073–3083 (2012).
Britanova, O.V. et al. Bone Marrow Transplant. 47, 1479–1481 (2012).
ESTOOLS Data@Hand: human stem cell
gene expression resource
To the Editor: We developed ESTOOLS Data@Hand, a resource to
facilitate exploration of published gene expression array data in stem
cell research. The resource, updated four times a year, offers efficient
sample identification, preprocessing that enables cross-experiment
comparisons and computational analysis.
High-throughput studies provide large amounts of data on
human embryonic stem cells (hESCs) and human induced pluripotent stem cells (hiPSCs), their parent cells as well as their
differentiated progeny cells 1. Published genome-wide expression data on stem cells can be exploited to answer questions
other than those addressed in the original studies. This is one
reason why such data are actively stored in the public repositories ArrayExpress 2 and Gene Expression Omnibus (GEO) 3.
Nevertheless, the means to perform such reanalyses are currently
limited. Often, sample information is only available as free text
in publications or public databases, hindering identification of
appropriate samples. Moreover, the measurement data are often
available in heterogeneous formats, and the lack of systematic preprocessing hampers cross-experiment comparisons. Some databases have been developed for stem cell research (Supplementary
Note 1 and Supplementary Table 1), but none provide both a
814 | VOL.10 NO.9 | SEPTEMBER 2013 | nature methods
List of
coexpressed
genes
GO-KEGG
enrichment tool
List of
enriched
GO terms
List of
differentially
expressed
genes
Expression profile
tool
List of
enriched
KEGG
pathways
KEGG
pathway
figures
Clustering
and heat map tool
Expression
profile
figures
Heat map
figures
Figure 1 | Gene expression data and analysis workflows in ESTOOLS
Data@Hand. (a) Cell and tissue types of the samples. (b) The data analysis
steps available on the user interface.
wide range of human pluripotent stem cell gene expression data
and typical analysis tools.
ESTOOLS Data@Hand regroups gene expression array data
and annotations primarily from experiments including hESCs or
hiPSCs and thus involves stem cell pluripotency, differentiation and cell dedifferentiation. We selected data from GEO and
ArrayExpress manually, preferentially including large experimental series; the selection covers pluripotent cells as well as
dozens of other cell and tissue types reported in the same studies (Fig. 1a, Supplementary Methods and Supplementary
Tables 2–4). Meta-analysis, a statistical approach to combine results
from independent but related studies is a relatively inexpensive option
that has the potential to increase both the statistical power and generalizability of single-study analysis4. We established two sample metadatasets of 408 and 245 jointly normalized samples using the two most
common array types for this data collection, Affymetrix and Illumina