Parameters used for pattern analysis
The complete set of TEIRESIAS parameters used in this analysis
is as follows: number of amino acids in the pattern (-l), number of
overlapping characters in the convolved pattern (-c), maximum length of
an elementary pattern (-w), minimum number of appearances of the pattern
(-k), maximum number of brackets (indicating equivalent amino acids)
allowed in the pattern (-n), flag for the support k to be counted as the
minimum number of sequences in which a pattern should appear (rather than
the minimum number of instances of the pattern) (-v), flag for the
algorithm to use amino acid equivalences (-b<equivalences file>), and flag
for the algorithm to consider only uppercase characters during pattern
discovery (-u). The parameters were set to [l=3] [c=2] [w=6] [k=2]
[n=2] [-v] [-b<equivalences file>] [-u].
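The difference between counting support by sequences (the -v flag) and by instances can be illustrated with a minimal sketch. It is not part of TEIRESIAS: the function name, the example pattern and the three sequences are hypothetical, and the pattern is simply interpreted as a regular expression in which "." is a wildcard position and a bracket lists equivalent amino acids.

```python
import re


def pattern_support(pattern, sequences):
    """Return (sequences containing the pattern, total instances of the pattern)."""
    # A zero-width lookahead lets overlapping instances be counted as well.
    regex = re.compile(f"(?={pattern})")
    instances_per_sequence = [len(regex.findall(seq)) for seq in sequences]
    n_sequences = sum(1 for n in instances_per_sequence if n > 0)
    n_instances = sum(instances_per_sequence)
    return n_sequences, n_instances


# Hypothetical sequences, for illustration only (not taken from the dataset).
hypothetical_hcdr3 = ["AGSRAKTR", "CARDY", "AASR"]
n_sequences, n_instances = pattern_support("A.[ST]R", hypothetical_hcdr3)

k = 2  # minimum support, as in [k=2]
print("support counted by sequences (-v):", n_sequences, n_sequences >= k)
print("support counted by instances:", n_instances, n_instances >= k)
```

With -v, a pattern occurring twice within a single sequence would not reach k=2 on its own, whereas it would if instances were counted.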
Clustering of non-CLL sequences based on HCDR3 patterns
In the 5,344 non-CLL HCDR3 sequences from public databases,
TEIRESIAS discovered 1,106,692 patterns, which were filtered down to
1,714, a reduction of approximately 99.8%. This final set of patterns was
smaller by 21.5% than the one in the CLL dataset although the number of
sequences analyzed was almost twice as high (5,344 vs. 2,845). This was
partly because this set of sequences was pooled from several different
entities. Furthermore, a significant number of groups of identical or
near-identical sequences were referenced in the same publication, i.e. they
were probably clonally related and consequently described by very few
patterns (see also Supplemental Figure 8).
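The proportions quoted in this paragraph can be rechecked from the counts given in the text. The following is a minimal arithmetic sketch; the function and variable names are illustrative only.

```python
def percent_reduction(before, after):
    """Percentage of patterns removed by the filtering step."""
    return 100.0 * (before - after) / before


# Counts as stated in the text for the non-CLL dataset.
non_cll_reduction = percent_reduction(1_106_692, 1_714)

# Ratio of non-CLL to CLL sequences analyzed.
sequence_ratio = 5_344 / 2_845

print(f"non-CLL pattern reduction: {non_cll_reduction:.3f}%")
print(f"non-CLL / CLL sequence ratio: {sequence_ratio:.2f}")
```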
The non-CLL level 0 clusters were significantly smaller in size (most
included 2-3 members) than the corresponding CLL clusters. Furthermore,
in stark contrast to high-level clusters in CLL, high-level clusters in the
non-CLL group were characterized by marked IGHV gene heterogeneity.
For instance, non-CLL level 3 clusters had a minimum of 8 different genes,
reaching up to 35 in a cluster of 152 members. Strikingly, three of the six
most frequent genes in level 3 CLL clusters (IGHV1-69, IGHV3-21 and
IGHV1-3) were under-represented in level 3 non-CLL clusters.
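IGHV gene heterogeneity of a cluster, as discussed above, amounts to counting the distinct genes among its members. A minimal sketch follows, assuming each cluster is available as a list of IGHV gene names; the function name and the example cluster are hypothetical and do not reproduce any cluster from the analysis.

```python
from collections import Counter


def gene_heterogeneity(cluster_genes):
    """Return (number of distinct IGHV genes, most frequent gene and its count)."""
    counts = Counter(cluster_genes)
    return len(counts), counts.most_common(1)[0]


# Hypothetical four-member cluster, for illustration only.
example_cluster = ["IGHV3-23", "IGHV4-34", "IGHV3-23", "IGHV1-69"]
n_genes, (top_gene, top_count) = gene_heterogeneity(example_cluster)
print(f"{n_genes} distinct genes; most frequent: {top_gene} ({top_count} members)")
```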
Clustering of all sequences based on HCDR3 patterns
Finally, in the total of 8,189 HCDR3 sequences from patients
with CLL and from the other entities, TEIRESIAS discovered 2,033,781
patterns, which were subsequently filtered down to 4,955, a reduction of
approximately 99.8%. These patterns allowed us to place 2,493 sequences
in clusters of different levels.
Taking into account the origin of the sequences in each cluster, the 1,364
level 0 clusters were subdivided into five categories along a CLL to
non-CLL direction: CLL-unique, CLL-biased, “neutral” (i.e. equal numbers of
CLL and non-CLL sequences), non-CLL-biased, and non-CLL-unique. Analysis of
the sizes of clusters in these five “specificity” categories showed a
strong bias for CLL sequences to group into significantly larger
clusters than non-CLL sequences, which mainly formed two- or
three-member level 0 clusters (Supplemental Figure 8).
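The five "specificity" categories can be expressed as a simple rule on the number of CLL and non-CLL sequences in a cluster. The sketch below is illustrative only: the text defines "neutral" as equal numbers, while treating "biased" as a strict majority of one origin is an assumption, and the function name and example clusters are hypothetical.

```python
from collections import defaultdict


def specificity_category(n_cll, n_non_cll):
    """Assign a cluster to one of the five categories from its member origins."""
    if n_cll + n_non_cll == 0:
        raise ValueError("a cluster must contain at least one sequence")
    if n_non_cll == 0:
        return "CLL-unique"
    if n_cll == 0:
        return "non-CLL-unique"
    # Assumption: "biased" means a strict majority of one origin.
    if n_cll > n_non_cll:
        return "CLL-biased"
    if n_cll < n_non_cll:
        return "non-CLL-biased"
    return "neutral"


# Hypothetical clusters given as (CLL members, non-CLL members).
example_clusters = [(12, 0), (5, 1), (2, 2), (1, 2), (0, 3)]

# Collect cluster sizes per category, as needed for a size comparison.
sizes_by_category = defaultdict(list)
for n_cll, n_non_cll in example_clusters:
    category = specificity_category(n_cll, n_non_cll)
    sizes_by_category[category].append(n_cll + n_non_cll)

for category, sizes in sizes_by_category.items():
    print(category, sizes)
```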