Parameters used for pattern analysis

The complete set of TEIRESIAS parameters used in this analysis is as follows: number of amino acids in the pattern (-l), number of overlapping characters in the convolved pattern (-c), maximum length of an elementary pattern (-w), minimum number of appearances of the pattern (-k), maximum number of brackets (indicating equivalent amino acids) allowed in the pattern (-n), flag for the support k to be counted as the minimum number of sequences in which a pattern should appear, rather than the minimum number of instances of the pattern (-v), flag for the algorithm to use amino acid equivalences (-b<equivalences file>), and flag for the algorithm to consider only the uppercase characters during pattern discovery (-u). The parameters were set to [l=3] [c=2] [w=6] [k=2] [n=2] [-v] [-b<equivalences file>] [-u].

Clustering of non-CLL sequences based on HCDR3 patterns

In the 5,344 non-CLL HCDR3 sequences from public databases, TEIRESIAS discovered 1,106,692 patterns, which were filtered down to 1,714, a reduction of 99.9%. This final set of patterns was 21.5% smaller than the one in the CLL dataset, although the number of sequences analyzed was almost twice as high (5,344 vs. 2,845). This was partly because this set of sequences pooled several different entities. Furthermore, a significant number of groups of identical or near-identical sequences were reported in the same publication, i.e. they were probably clonally related and consequently described by very few patterns (see also Supplemental Figure 8). The non-CLL level 0 clusters were significantly smaller (most included 2-3 members) than the corresponding CLL clusters. Furthermore, in stark contrast to high-level clusters in CLL, high-level clusters in the non-CLL group were characterized by marked IGHV gene heterogeneity. For instance, non-CLL level 3 clusters contained a minimum of 8 different genes, reaching up to 35 in a cluster of 152 members.
Strikingly, three of the six most frequent genes in level 3 CLL clusters (IGHV1-69, IGHV3-21 and IGHV1-3) were under-represented in level 3 non-CLL clusters.

Clustering of all sequences based on HCDR3 patterns

Finally, in the total of 8,189 HCDR3 sequences from patients with CLL and from the other entities, 2,033,781 patterns were discovered and subsequently filtered down to 4,955, a reduction of 99.7%. These patterns allowed us to place 2,493 sequences in clusters of different levels. Taking into account the origin of the sequences in each cluster, the 1,364 level 0 clusters were subdivided into five categories along a CLL-to-non-CLL axis: CLL-unique, CLL-biased, “neutral” (i.e. equal numbers of CLL and non-CLL sequences), non-CLL-biased, and non-CLL-unique. Analysis of cluster sizes in these five “specificity” categories showed a strong bias for CLL sequences to group into significantly larger clusters than non-CLL sequences, which mainly formed level 0 clusters of only two or three members (Supplemental Figure 8).
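
The five "specificity" categories can be illustrated with a short sketch. It is not the code used in the analysis: the function names and the toy cluster counts are hypothetical. It also assumes that "unique" means all members share one origin and that "biased" means a strict majority of one origin with both origins present; the text defines only the "neutral" case explicitly.

```python
# Illustrative sketch only; names and toy data are hypothetical.
CATEGORIES = (
    "CLL-unique",
    "CLL-biased",
    "neutral",
    "non-CLL-biased",
    "non-CLL-unique",
)


def specificity_category(n_cll, n_non_cll):
    """Assign a cluster to one of the five categories from its member counts.

    Assumption: 'biased' means a strict majority with both origins present.
    """
    if n_cll < 0 or n_non_cll < 0 or n_cll + n_non_cll == 0:
        raise ValueError("a cluster needs at least one sequence")
    if n_non_cll == 0:
        return "CLL-unique"
    if n_cll == 0:
        return "non-CLL-unique"
    if n_cll > n_non_cll:
        return "CLL-biased"
    if n_cll < n_non_cll:
        return "non-CLL-biased"
    return "neutral"  # equal numbers of CLL and non-CLL sequences


def sizes_by_category(clusters):
    """Collect cluster sizes per category.

    `clusters` is an iterable of (n_cll, n_non_cll) pairs, one per cluster.
    """
    sizes = {category: [] for category in CATEGORIES}
    for n_cll, n_non_cll in clusters:
        category = specificity_category(n_cll, n_non_cll)
        sizes[category].append(n_cll + n_non_cll)
    return sizes


# Hypothetical level 0 clusters as (CLL members, non-CLL members).
toy_clusters = [(12, 0), (5, 1), (2, 2), (1, 2), (0, 3), (0, 2)]

for category, sizes in sizes_by_category(toy_clusters).items():
    print(category, sizes)
```

Comparing the size lists across categories is the kind of comparison summarized in Supplemental Figure 8. The toy numbers above carry no biological meaning.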