SUPPLEMENTAL FIGURES
Supplemental Figure 1. Distribution of IGHV sequences of the present
study according to mutation status.
Based on the percentage of identity to germline, this collection of
sequences was divided into four major “identity groups”; “truly
unmutated” (100% identity; 864 sequences), “minimally mutated” (99-99.9%
identity; 243 sequences), “borderline mutated” (98-98.9%
identity; 129 sequences) and “mutated” (<98% identity; 1426
sequences). The IGHV repertoires of the “mutated”, “minimally mutated”,
“borderline mutated” and “truly unmutated” subgroups differed
(Supplemental Table III), in keeping with previous reports. Also, at the
individual gene level, the distribution of rearrangements of IGHV genes
according to mutation status varied significantly (Supplemental Table III).
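The four identity groups can be expressed as a simple threshold rule. The following is a minimal sketch only: the function name is hypothetical, and the handling of values that fall between the stated ranges (for example 99.95%) is an assumption, since the legend gives the ranges to one decimal place.

```python
def mutation_status(identity_pct: float) -> str:
    """Assign an IGHV sequence to one of the four identity groups.

    identity_pct is the percentage identity to the germline gene.
    Values between the published ranges (e.g. 99.95) are assigned to the
    group below 100%; this boundary handling is an assumption.
    """
    if not 0.0 <= identity_pct <= 100.0:
        raise ValueError("identity must be a percentage between 0 and 100")
    if identity_pct == 100.0:
        return "truly unmutated"
    if identity_pct >= 99.0:
        return "minimally mutated"
    if identity_pct >= 98.0:
        return "borderline mutated"
    return "mutated"
```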
Supplemental Figure 2. Distribution of HCDR3 lengths of the CLL cases
from our series included in the present study. The x axis depicts HCDR3
amino acid length; the y axis refers to % of sequences with a given
length.
HCDR3 length ranged from 4-32 amino acids (AA) (median, 17)
(Supplemental Table VI). “Truly unmutated” sequences had significantly
longer HCDR3s (median 21 AA) than all other sequences. A significant
difference in HCDR3 length was also observed between “minimally
mutated” (median 19 AA) and “borderline mutated” or “mutated”
sequences (median 15 AA for both groups) (Supplemental Table VI;
Supplemental Figure 3).
Supplemental Figure 3. Distribution of HCDR3 lengths of the CLL cases
from our series according to mutation status. The x axis depicts HCDR3
amino acid length; the y axis refers to % of sequences with a given
length.
The striking peak at 9 AA in the “borderline mutated” group is made up
predominantly of IGHV3-21 cases with distinctive, stereotyped HCDR3s
(Supplemental Figure 4). A further striking peak at 13 AA in both the
“minimally mutated” and “truly unmutated” groups is made up
predominantly of rearrangements with stereotyped HCDR3s utilizing genes
of the IGHV1/5/7 clan along with IGHD6-19 and IGHJ4. The increase in
HCDR3 length observed among truly unmutated sequences is mainly
accounted for by rearrangements of the IGHV1-69 gene (for example,
40.35% of all sequences with a 24 AA-long HCDR3 utilized the IGHV1-69
gene). That notwithstanding, it is worth underscoring the fact that two
groups of mutated IGHV4-34 sequences with stereotyped HCDR3s
(corresponding to subsets 4 and 16 in Murray et al.), both utilizing the
IGHJ6 gene, carried significantly longer HCDR3s (20 and 24 AA,
respectively) compared to other mutated cases (p<0.001).
Supplemental Figure 4. Distribution of HCDR3 lengths of the CLL cases
from our series utilizing IGHV3 subgroup genes. The striking peak at 9 AA
in the “borderline mutated” group is made up predominantly of IGHV3-21
cases with distinctive, stereotyped HCDR3s. Green line: all cases; black
line: cases utilizing the IGHV3-21 gene; red line: all other IGHV3 genes
except IGHV3-21.
Clustering of CLL sequences based on HCDR3 patterns
Supplemental Figure 5. A Biolayout 2D representation of two clustering
examples with real data, namely the formation of clusters 2-0020 and
1-0089 with four samples each. Black nodes are either clusters (smaller
nodes) or samples (larger nodes). Blue lines connect clusters to samples
and clusters to clusters, while red lines connect samples between
themselves. A. Cluster 2-0020 with samples NY278, P2959, F73, and
N2518. B. Cluster 1-0089 with samples NY585, FRA-340, DK76, and FRA-069.
The clustering procedure for these samples is as follows:
Sample NY278 (HCDR3 sequence CARGPDESGWCGFRYW) was connected
to P2959 (HCDR3 sequence CARGPDISGWNGFEYW) by pattern
ARGPD.SGW.GF.Y providing 78.57% identity, forming the level 0 cluster
0-0168. P2959 was then found to be connected to F73
(CARGPDTSGWNSLDYW) by ARGPD.SGWN..[DE]Y providing 71.43%
identity and 78.57% similarity, making F73 a candidate for membership
in cluster 0-0168. However, F73 was not found to be connected to the
other member of the cluster, NY278, and therefore was not allowed to
join the cluster. Consequently, F73 formed a new level 0 cluster 0-0247
by borrowing P2959. F73 (CARGPDTSGWNSLDYW) was finally found to
be connected to N2518 (CARGPDESGWLALAYW) by ARGPD.SGW..L.Y with
71.43% identity, making N2518 a candidate for membership in cluster
0-0247. As in the previous case though, N2518 was not found to be
connected to the other member of the cluster, P2959, therefore it had to
borrow F73 and form yet another new level 0 cluster 0-0250. Since
clusters 0-0168 and 0-0247 shared P2959 they were connected on level
1 to form cluster 1-0107; similarly, clusters 0-0247 and 0-0250 shared
F73 and were thus connected on level 1 to form cluster 1-0110. Finally,
since clusters 1-0107 and 1-0110 shared cluster 0-0247, they formed the
level 2 cluster 2-0020 (Supplemental Figure 5A).
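The percentages quoted for these patterns can be reproduced by comparing two HCDR3s position by position after removing the anchoring C and W, then dividing the number of identical (or similar) positions by the remaining length. The sketch below is illustrative only and is not the authors' implementation: the function names are hypothetical, it handles equal-length sequences only, and the similarity groups are limited to the three bracketed classes that appear in this text ([AVLI], [DE], [KRH]); the full set of groups used in the study is not stated here.

```python
# Only the residue groups that appear in bracketed form in the text.
SIMILARITY_GROUPS = ("AVLI", "DE", "KRH")


def strip_anchors(hcdr3: str) -> str:
    """Drop the leading C and trailing W that flank the HCDR3."""
    if len(hcdr3) < 3 or hcdr3[0] != "C" or hcdr3[-1] != "W":
        raise ValueError("expected a sequence of the form C...W")
    return hcdr3[1:-1]


def shared_pattern(seq_a: str, seq_b: str):
    """Return (pattern, % identity, % similarity) for two HCDR3s."""
    a, b = strip_anchors(seq_a), strip_anchors(seq_b)
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length HCDR3s")
    tokens, identical, similar = [], 0, 0
    for x, y in zip(a, b):
        if x == y:
            tokens.append(x)          # conserved residue
            identical += 1
            similar += 1
            continue
        group = next((g for g in SIMILARITY_GROUPS if x in g and y in g), None)
        if group is not None:
            tokens.append(f"[{group}]")  # similar residues
            similar += 1
        else:
            tokens.append(".")           # unrelated residues
    # Leading and trailing wildcards are dropped for display only.
    pattern = "".join(tokens).strip(".")
    n = len(a)
    return pattern, round(100 * identical / n, 2), round(100 * similar / n, 2)


print(shared_pattern("CARGPDESGWCGFRYW", "CARGPDISGWNGFEYW"))
print(shared_pattern("CARGPDISGWNGFEYW", "CARGPDTSGWNSLDYW"))
```

Applied to NY278 and F73, the same function gives 64.29% identity, which is lower than every connection reported above and is consistent with these two samples not being connected directly; the threshold itself is not given in this text.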
Samples FRA-340 (CARAGEMATVFGRGAFDIW) and NY585
(CARAGEMATLMGLGAFDIW) were connected by pattern
ARAGEMAT[AVLI].G.GAFDI providing 82.35% identity and 88.24%
similarity to form level 0 cluster 0-0143. With the same identity and
similarity values, the pattern REGEMAT.[KRH]GFGAFDI connected samples
DK76 (CGREGEMATQRGFGAFDIW) and FRA-069 (CAREGEMATMKGFGAFDIW)
to form cluster 0-0144. Pattern AR.GEMAT..G.GAFDI connected FRA-069,
FRA-340, and NY585 (76.47% identity), and all four samples shared
pattern R.GEMAT..G.GAFDI (70.59% identity). Therefore the two level 0
clusters were connected into the 1-0089 cluster (Supplemental Figure 5B).
The second example provides us with the opportunity to make some
important remarks regarding the clustering procedure. Level 0 clusters are
guaranteed to contain sequences that all share patterns between
themselves, but are not guaranteed to contain all samples that display
above-threshold identity and similarity between themselves. In the second
example above, one could argue that DK76, FRA-069, FRA-340, and
NY585 should be in the same level 0 cluster since they are all connected
between themselves by the same pattern albeit with a lower score.
However, this would lead to a loss of information, in this case the fact that
DK76 and FRA-069 are more similar between themselves than across to
FRA-340 and NY585, and vice versa. The issue is addressed by higher-level
clusters, with the end result that if two sequences are above-threshold
identical and similar they are guaranteed to reside within a
cluster of some level, i.e. the two sequences could be in different level 0
and level 1 clusters but in the same level 2 cluster.
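The way higher-level clusters arise in the first example (clusters joined because they share a member) can be sketched as follows. This is a simplified illustration under stated assumptions: the function name is hypothetical, it covers only the shared-member rule and not the shared-pattern connection used in the second example, and it reports which clusters would be linked without reproducing the authors' ID numbering.

```python
from itertools import combinations


def link_by_shared_member(clusters):
    """List the pairs of clusters that share at least one member.

    clusters maps a cluster ID to the set of its members (samples for
    level 0 clusters, lower-level cluster IDs for higher levels).
    Returns (id_a, id_b, shared_members) tuples in sorted ID order.
    """
    links = []
    for id_a, id_b in combinations(sorted(clusters), 2):
        shared = set(clusters[id_a]) & set(clusters[id_b])
        if shared:
            links.append((id_a, id_b, shared))
    return links


# Level 0 clusters from the first example (IDs as given in the text).
level0 = {
    "0-0168": {"NY278", "P2959"},
    "0-0247": {"P2959", "F73"},
    "0-0250": {"F73", "N2518"},
}
print(link_by_shared_member(level0))

# The resulting level 1 clusters, whose members are level 0 cluster IDs.
level1 = {
    "1-0107": {"0-0168", "0-0247"},
    "1-0110": {"0-0247", "0-0250"},
}
print(link_by_shared_member(level1))
```

The first call links 0-0168 with 0-0247 (through P2959) and 0-0247 with 0-0250 (through F73); the second links 1-0107 with 1-0110 through 0-0247, which corresponds to the level 2 cluster 2-0020.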
Supplemental Figure 6. Major high level clusters (HCDR3 archetypes) in
CLL. The screenshot is taken from within Biolayout3D. The grey spherical
nodes can either represent sequences or clusters; the connecting lines are
coloured according to the score of the connection ranging from blue (low
score) to red (high score). Each autonomous cluster of sequences is
annotated with the cluster ID, its size (in number of sequences), and its
major IGHV genes and their frequencies. The ID (e.g. 3-0002) is made
out of the Level of hierarchy the cluster is representing (e.g. 3) and a
sequential four-digit number (e.g. 0002). Note that clusters 3-0003 and
3-0004 are connected to form the only Level 4 cluster in the hierarchy
(not shown).
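For readers processing cluster IDs programmatically, the ID format described above can be split into its two parts. This is a minimal sketch; the function name is hypothetical and the pattern assumes the level is written as one or more digits.

```python
import re

# Level of hierarchy, a hyphen, then a sequential four-digit number.
_CLUSTER_ID = re.compile(r"(\d+)-(\d{4})")


def parse_cluster_id(cluster_id: str):
    """Split an ID such as '3-0002' into (level, sequential number)."""
    match = _CLUSTER_ID.fullmatch(cluster_id)
    if match is None:
        raise ValueError(f"not a cluster ID: {cluster_id!r}")
    return int(match.group(1)), int(match.group(2))
```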
Supplemental Figure 7. Breakdown of sequences in level 3 clusters with
regard to HCDR3 length.
(i) 3-0000: 20 amino acids (aa) = 1 case; 21 aa = 13 cases; 22 aa = 17 cases;
23 aa = 29 cases; 24 aa = 34 cases; 25 aa = 9 cases; 26 aa = 3 cases
(ii) 3-0001: 9 aa = 82 cases
(iii) 3-0002: 13 aa = 144 cases; 14 aa = 29 cases
(iv) 3-0003: 20 aa = 38 cases; 21 aa = 16 cases
(v) 3-0004: 20 aa = 2 cases; 21 aa = 22 cases; 22 aa = 37 cases
(vi) 3-0005: 20 aa = 88 cases
Supplemental Figure 8. Level 0 clusters in all sequences (CLL+other
entities). Stacked percentage distribution of level 0 cluster sizes from the
group of all sequences, divided by cluster specificity. The “specificity” of a
cluster is determined by the number of sequences that belong to the CLL
group or the non-CLL group. If all sequences are non-CLL then the cluster
is considered as non-CLL unique, if the majority of the sequences are
non-CLL then the cluster is non-CLL biased, and if the number of non-CLL and
CLL sequences is equal then the cluster is considered neutral. If most
sequences are CLL then the cluster is CLL biased, and finally the cluster is
CLL-unique if all the sequences are CLL. Evidently, the majority of small
(i.e. two- or three-member) clusters are non-CLL unique or non-CLL
biased. From a cluster size of four onwards, the majority of clusters
feature CLL sequences. Of note is the large gap up to the cluster of 22
sequences, which contains 21 stereotypical IGHV3-21 sequences from patients
with CLL and one non-CLL sequence.
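The specificity categories defined above reduce to a comparison of two counts per cluster. A minimal sketch, with a hypothetical function name and category labels spelled as in this legend:

```python
def cluster_specificity(n_cll: int, n_non_cll: int) -> str:
    """Classify a cluster from its counts of CLL and non-CLL sequences."""
    if n_cll < 0 or n_non_cll < 0 or n_cll + n_non_cll == 0:
        raise ValueError("counts must be non-negative and not both zero")
    if n_cll == 0:
        return "non-CLL unique"
    if n_non_cll == 0:
        return "CLL unique"
    if n_cll == n_non_cll:
        return "neutral"
    return "CLL biased" if n_cll > n_non_cll else "non-CLL biased"
```

Under this rule the 22-sequence cluster mentioned above (21 CLL sequences, one non-CLL sequence) is classified as CLL biased.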
It should be noted that groups of unique public-database sequences with
identical or near-identical HCDR3s were often referenced in the same
publication, i.e. they are probably clonally related. In this context, 218 level 0
clusters included at least two sequences from the same publication (480
sequences referenced in 54 publications); 193 of these 218 level 0
clusters were characterized by the exclusive use of sequences from the
same publication (429 sequences in 50 publications).
Supplemental Figure 9. IGHV repertoires in clustered and non-clustered
cases. The distribution of CLL sequences with selected genes along the
process of clustering, from the whole cohort (all) to sequences in level 3
clusters. In each pie diagram, the five most frequent genes are
highlighted in color; the same color code is used in all diagrams.
Supplemental Figure 10. Effects of clustering on the IGHJ repertoire.
The percentage of CLL sequences with IGHJ4 or IGHJ6 along the process
of clustering, from the whole cohort (all) to sequences in level 3 clusters.
In general, the IGHJ4 and IGHJ6 genes were the most frequent IGHJ
genes (Supplemental Table VI). The contrasting fates of the IGHJ4 and
IGHJ6 genes are evident in this graph. More specifically, the IGHJ6 gene
was represented with an increasing frequency in each successive level of
clustering, starting from 32.8% at the cohort level and reaching up to
71.6% in level 3 (χ² test: p<0.0001). In contrast, the IGHJ4 gene started
from 42.5% at the cohort level and fell to 27.8% in level 3 (χ² test:
p<0.0001) (see also Supplemental Table VI).