OPEN ACCESS DOCUMENT
Information of the Journal in which the present paper is
published:
 Elsevier, Analytica Chimica Acta, 2009, 642 (1-2), pp.
117-126.

DOI: dx.doi.org/10.1016/j.aca.2008.12.052
Classification of nucleic acids structures by means of the chemometric
analysis of circular dichroism spectra
Joaquim Jaumot1*, Ramon Eritja2, Susana Navea3 and Raimundo Gargallo1
1. Department of Analytical Chemistry, Universitat de Barcelona, Diagonal 647, Barcelona,
E-08028 Spain.
2. Department of Structural Biology, IBMB-CSIC, Jordi Girona 18-26, Barcelona, E-08034
Spain
3. Acciona Agua, Av. de les Garrigues 22, El Prat de Llobregat, E-08820, Spain
* Author to whom correspondence should be addressed
Tel: +34-934034445
Fax: +34-934021233
E-mail: [email protected]
1. ABSTRACT
DNA can adopt structures in solution apart from the well-known Watson-Crick double
helix, ranging from disordered single strands to high-order structures such as triplexes
or quadruplexes. Moreover, different topologies can be adopted depending on the
polarity of the DNA strands. The elucidation of the structure and topology adopted by a
DNA sequence is usually carried out by means of spectroscopic techniques, such as
circular dichroism.
In this work, the ability of several chemometric methods to efficiently classify DNA
structures from circular dichroism data is tested. With this objective in mind, a data set
including 50 experimental spectra corresponding to different DNA structures (random
coil, duplex, hairpin, reversed and normal triplex, parallel and antiparallel G-quadruplex, and i-motif) has been analyzed by means of unsupervised Hierarchical
Clustering Analysis, Principal Component Analysis and Partial Least Squares
Discriminant Analysis. The results show that these methods allow the efficient
classification of DNA structures from circular dichroism spectra. Moreover, these
classification methods also provided the most characteristic wavelengths used in the
classification procedures.
Keywords: DNA structure, classification, Principal Component Analysis, Clustering,
Partial Least Squares Discriminant Analysis, Circular Dichroism spectroscopy
2. INTRODUCTION
Since the initial elucidation of the secondary structure of B-DNA duplex by Watson and
Crick in 1953 [1], additional DNA secondary structures have been described in the
literature (Scheme 1) [2, 3]. These are related to base and sugar geometries different
from those found in the B-DNA duplex structure and are possible due to the existence
of base pairs different from those initially proposed by Watson and Crick. Well-known
examples of these base pairs are inverted Watson-Crick, Hoogsteen, or wobble base
pairs [3]. Furthermore, it has been observed that interactions involving more than two
nitrogenated bases are also possible. This allows the formation of higher order
structures such as triplex (when three bases interact, for example,
(thymine•adenine)*adenine or (cytosine•guanine)*guanine) or quadruplex (when four
bases interact, for example, the G-tetrad, in which four guanine bases simultaneously
interact to form a planar arrangement) (see Scheme 1) [4-6].
The structure adopted by a DNA in solution is often strongly fixed by the sequence of
nitrogenated bases. As an example, the presence of several tracks containing
guanines can favor the formation of G-quadruplex structures [7, 8], whereas the
presence of several tracks containing cytosines can favor the formation of i-motif
structures at pH values near 5-6 [9].
It is interesting to point out that these DNA secondary structures can be built up by
means of intramolecular or intermolecular interactions [3]. In the first case, we will have
a DNA structure built up by just a single strand. This is the case of the hairpin structure
(Scheme 1), which is an intramolecular duplex consisting of an ordered part with base
pairs (stem) and another part without base pairs (loop). Intramolecular folding can also
be observed in higher order structures such as triplexes or quadruplexes [4]. The
complexity of DNA secondary structures increases if we also consider the variability in
the structures caused by the topologies that the different interacting strands can adopt.
Thus, it is possible to distinguish among parallel (if the chains run in the same sense)
and antiparallel (if the chains run in opposite sense) topologies, or even mixtures of
both.
Finally, we have to consider the disordered structures or random coil [3]. In this case,
there are no hydrogen bonds between the nitrogenated bases and, because of this,
ordered structures cannot form. The random coil is observed when the
nucleic acids are under denaturing conditions, such as high temperature, extreme pH
values or the presence of denaturing agents.
Several spectroscopic techniques can be used to monitor the formation and stability of
these structures, like UV molecular absorption, circular dichroism (CD) or Nuclear
Magnetic Resonance. Among these, CD in the UV region can be considered the most
appropriate technique because the measured instrumental response is extremely
sensitive to the distance between the interacting strands, the inclination and distance
between the nitrogenated bases and the axis of the structure, with an acceptable cost
[10-12]. Therefore, CD spectroscopy is widely used to distinguish between ordered and
disordered structures and, also, between different types of ordered structures.
In this work, we have tested the ability of CD spectroscopy and Chemometrics to
efficiently classify DNA secondary structures. This classification has been first
attempted by using unsupervised classification methods such as Hierarchical
Clustering Analysis (HCA) [13] and Principal Component Analysis (PCA) [14] in order
to explore the data set and obtain different sample groups. Finally, Partial Least
Squares Discriminant Analysis [15] (a supervised method) has been used to model the
different DNA structures classes from the CD spectra.
3. EXPERIMENTAL
DNA synthesis
Oligonucleotide sequences were synthesized on an Applied Biosystems 392 DNA
synthesizer using the 1 mol scale synthesis cycle. Standard phosphoramidites were
used for the natural bases. Sequences were deprotected using standard protocols
(concentrated ammonia, 55ºC, overnight). After deprotection, oligonucleotides
were purified using purification cartridges and, finally, desalted using Sephadex G-25
cartridges (NAP-10, Amersham Biosciences).
Besides, the parallel-stranded hairpins were prepared as described elsewhere [16, 17].
5’-5’ Hairpins were prepared in three steps. First, the pyrimidine part was prepared
using reversed C and T phosphoramidites and reversed C-support (linked to the
support through the 5' end). Second, a hexaethyleneglycol linker was added using a
commercially available phosphoramidite. Third, the purine part was assembled using
standard phosphoramidites. For the preparation of 3’-3’ hairpins a similar approach
was used. In this case, the purine part was assembled first, followed by the
hexaethyleneglycol. The pyrimidine part was the last to be assembled using reversed
phosphoramidites.
Finally, the oligonucleotide sequence 5’-T12-(EG)6-A12-(EG)6-T12-3’ was assembled using
standard phosphoramidites and the hexaethyleneglycol linker. The sequence 3’-T12-5’-(EG)6-5’-A12-(EG)6-T12-3’ was prepared in two steps. First, DMT-(EG)6-5’-A12-(EG)6-T12-3’ was assembled using standard phosphoramidites; then the last dodecathymidine
sequence was assembled using reversed phosphoramidites.
Sample Preparation
Samples described in Table I were prepared at a concentration of 3 μM in strand. DNA
concentration was determined by measuring the UV absorbance at 260 nm at 90ºC
and calculating the concentration by means of the nearest-neighbor method as
implemented in Oligo Parameter Calculation [18]. Appropriate volumes of phosphate
(pH 6.9) or acetate (pH 5.1) buffer solutions were added to the samples. For the
preparation of buffers, NaH2PO4 (Panreac, a.r., Spain), KH2PO4 (Panreac, a.r., Spain),
CH3COOH (Merck, a.r., Germany) and CH3COONa (Panreac, a.r., Spain) were used.
Ionic strength was adjusted to 150 mM by adding appropriate volumes of KCl, NaCl
and MgCl2 stock solutions. KCl (Merck, a.r., Germany), NaCl (Merck, a.r., Germany)
and MgCl2 (Panreac, a.r., Spain) were used. The samples used to test the prediction
ability of the proposed models consisted of equimolar mixtures of two complementary
strands. These samples were prepared by using the same ionic strength buffer as for
the other samples but the pH was adjusted using small volumes of HCl (Panreac, a.r.,
Spain) or NaOH (Panreac, a.r., Spain). All the solutions were prepared in Ultrapure
water (Millipore, France). Finally, samples were heated at 90ºC for 10 minutes and
allowed to renature by cooling slowly to room temperature. Oligonucleotide
samples were kept at 4ºC until measurement.
Spectroscopic measurements
CD spectra were recorded on a Jasco J-810 spectropolarimeter equipped with a Julabo
F-25/HD control unit. Spectra were recorded between 220 and 360 nm (data pitch: 0.5
nm; scan mode: continuous, sensitivity: 10 mdeg, speed: 50 nm/min, response: 4 s,
bandwidth: 1 nm, 2 accumulations). Spectra were recorded using a Hellma quartz cell
with a path length of 10 mm and a volume of 1400 μl. Measurements were carried out at
two different temperatures: 20ºC in annealing conditions and 85ºC in denaturing
conditions.
Samples used for building up the chemometrical models
Table I lists all the DNA sequences used in this work. The data set includes disordered
single stranded DNAs, B-DNA duplex, intramolecular and intermolecular triplexes,
parallel and antiparallel G-quadruplexes, and i-motifs.
Disordered DNAs include samples whose CD spectrum has been measured at high
temperature where it is expected that the secondary structure is lost (samples 1-10) [3].
Moreover, some sequences which do not form secondary structures have been
included (samples 11-12). Duplex structures include both intra- and intermolecular
structures and parallel and antiparallel topologies. The data set includes two
intramolecular triplex structures: one normal (sample 28) and another reversed
structure (sample 29). Several DNA structures involving four strands have been
included. The first are G-quadruplexes, which form in guanine-rich sequences.
The data set contains both parallel (samples 30 - 35) and antiparallel (samples 36 - 41)
G-quadruplexes. Second, the data set includes two spectra corresponding to i-motif
structures (samples 42 - 43). These are only stable at pH 5 - 6 because their formation
requires half-protonation of cytosine bases. For that reason, the CD spectra of these
sequences have been measured in acetate buffer. Finally, 7 samples prepared by
mixing two DNA sequences have been included. These samples could potentially
present more than one structure simultaneously. Hence, samples 44 - 45 can produce
a mixture of duplex and G-quadruplex structures due to the mixing of a G-quadruplex-forming oligonucleotide with its complementary strand. In some samples (46 - 50),
duplex and triplex structures could be simultaneously present after the addition of a
single strand target to a hairpin structure.
4. CHEMOMETRICAL METHODS
Hierarchical Clustering Analysis (HCA)
Cluster analysis is used to classify objects, characterized by the values of a set of
variables, into clusters or groups [14], in such a way that one object within a cluster is
more closely related to another object in the same cluster than to another object
assigned to a different cluster. In HCA the data are not partitioned into a particular
cluster in a single step. Instead of this, a series of partitions takes place, which may run
from a single cluster containing all objects to n clusters each containing a single object.
In order to build up these groups a measurement of the similarity between the different
objects is considered. This measurement is also known as the distance between the
objects considered. There are several methods to measure distances and the method selected
will influence the shape of the clusters [19]. Examples of these distances are the
Euclidean distance, City block distance or Mahalanobis distance.
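As a minimal illustration of how these three distances differ, the following Python sketch computes them for two short invented vectors. The vectors and the diagonal covariance matrix used for the Mahalanobis distance are assumptions made for this example only; they are not taken from the data set of this work.

```python
import numpy as np

def euclidean(a, b):
    # Square root of the sum of squared differences
    return float(np.sqrt(np.sum((a - b) ** 2)))

def city_block(a, b):
    # Sum of absolute differences (also called Manhattan distance)
    return float(np.sum(np.abs(a - b)))

def mahalanobis(a, b, cov):
    # Euclidean distance after weighting by the inverse covariance matrix
    d = a - b
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

# Invented "spectra" with three variables each (illustrative values only)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 1.0])
cov = np.diag([1.0, 4.0, 4.0])  # assumed covariance, not estimated from real data

d_euc = euclidean(a, b)
d_city = city_block(a, b)
d_mah = mahalanobis(a, b, cov)
```

With a diagonal covariance, the Mahalanobis distance reduces to a Euclidean distance in which each variable is scaled by its standard deviation, which is why variables with large variance contribute less to it.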
Among the available linkage methods, the agglomerative Ward’s method
has been selected in this work. Ward’s clustering method generates the different clusters in order
to minimize the loss associated with each cluster. At each step in the analysis, the
union of every possible cluster pair is considered and the two clusters whose fusion
results in minimum increase in 'information loss' are combined. Information loss is
defined by Ward in terms of an error sum-of-squares criterion.
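The agglomerative procedure described above can be sketched in Python with SciPy, whose pdist and linkage functions share their names with the MATLAB functions used in this work. This is only an illustrative sketch: the two spectral shapes, the noise level and the coarse wavelength grid are invented, and the fcluster call used to cut the tree into two groups is an addition made for this example.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
wavelengths = np.linspace(220.0, 360.0, 57)  # coarse hypothetical grid, in nm

def band(center, width, height):
    # Gaussian band used to build invented CD-like spectra
    return height * np.exp(-0.5 * ((wavelengths - center) / width) ** 2)

# Two invented spectral shapes (not real DNA spectra)
shape_a = band(260.0, 8.0, 10.0) - band(240.0, 8.0, 5.0)
shape_b = band(290.0, 8.0, 10.0) - band(265.0, 8.0, 4.0)

# Five noisy replicates of each shape: rows are samples, columns are wavelengths
X = np.vstack(
    [shape_a + rng.normal(0.0, 0.2, wavelengths.size) for _ in range(5)]
    + [shape_b + rng.normal(0.0, 0.2, wavelengths.size) for _ in range(5)]
)

D = pdist(X, metric="euclidean")                  # condensed pairwise distances
Z = linkage(D, method="ward")                     # Ward's agglomerative linkage
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into two clusters
```

Each row of Z records one fusion step (the two clusters merged, the linkage distance and the size of the new cluster), which is the information a dendrogram displays.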
Principal Component Analysis
Principal Component Analysis (PCA) is a multivariate technique that allows the
reduction of matrices to their lowest orthogonal space [14]. PCA assumes a bilinear
model to explain the observed data variance using a reduced number of factors (also
known as principal components):
X = U V^T + E
(Equation 1)
In particular, the principal components identified by PCA are linear combinations of the
original variables which are orthonormal (orthogonal and normalized to unit length) and
explain maximum variance. The goal of PCA is to represent the variation present in
many samples using the smallest number of components [20]. A new row space is built
up in which to plot the samples by redefining the axes using factors rather than the
original measured variables. The new axes, the principal components, allow the
investigation of data matrices with many variables and the display of the true
multivariate nature in a relatively small number of dimensions. In PCA, the matrix
related to the sample contributions (U) is called the scores matrix, and the matrix
related to the variable contributions (V^T) is called the loadings matrix. By retaining only
the significant components, one compresses the relevant data information into these two
data matrices, U and V^T, and, supposedly, the random noise contribution (E) is
suppressed.
In the present study, X contains the CD spectra for the different samples considered
(Table I). Therefore, X contains 50 rows (corresponding to the number of samples) and
282 columns (corresponding to the number of measured wavelengths). In this case, the
scores matrix, U, provides information about the samples (DNA structures) distribution
and grouping, whereas the loadings matrix, V^T, provides information about the most
relevant wavelengths used to obtain this classification.
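The bilinear decomposition of Equation 1 can be illustrated with a short Python sketch based on the singular value decomposition. The data matrix below is random and only mimics the 50 x 282 dimensions quoted above. Mean-centering of the columns is an assumption of this sketch, since the preprocessing used for PCA is not stated here, and the function name pca is hypothetical.

```python
import numpy as np

def pca(X, n_components):
    # Column mean-centering (assumed preprocessing for this sketch)
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data: Xc = U_svd * diag(s) * Vt
    U_svd, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U_svd[:, :n_components] * s[:n_components]   # U in Equation 1
    loadings = Vt[:n_components].T                         # V in Equation 1
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]   # variance fractions
    residual = Xc - scores @ loadings.T                    # E in Equation 1
    return mean, scores, loadings, explained, residual

# Synthetic rank-3 data plus small noise (not real CD spectra)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 282))
X = X + 0.01 * rng.normal(size=(50, 282))

mean, scores, loadings, explained, residual = pca(X, n_components=3)
```

In this sketch each row of scores locates one sample in the reduced space, and each column of loadings shows how strongly every wavelength contributes to the corresponding component.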
Partial Least Squares Discriminant Analysis (PLS-DA)
PLS-DA is a variant of PLS used as a classification tool. In this method, X contains the
input information (spectra) about the objects to be classified and Y the class
membership information [15]. So, this method fulfills the general equation of PLS
methods:
Y = X B
(Equation 2)
where X and Y are represented by their latent variables and B contains the regression
coefficients in the calibration step.
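A minimal sketch of Equation 2 for a single class is given below. It uses the NIPALS algorithm for PLS1 (one column of Y at a time), written directly in NumPy; this is not the PLS toolbox implementation used in this work. The spectra are invented, the function names are hypothetical, and the fixed cut-off of 0.5 is a simplifying assumption that replaces the Bayesian threshold.

```python
import numpy as np

def pls1_fit(X, y, n_latent):
    # NIPALS PLS1: returns means and a regression vector b such that
    # y is approximated by (X - x_mean) @ b + y_mean
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean            # residual (deflated) blocks
    W, P, q = [], [], []
    for _ in range(n_latent):
        w = Xr.T @ yr
        w = w / np.linalg.norm(w)              # weight vector
        t = Xr @ w                             # scores of this latent variable
        tt = t @ t
        p = Xr.T @ t / tt                      # X loadings
        qa = (yr @ t) / tt                     # y loading
        Xr = Xr - np.outer(t, p)               # deflate X
        yr = yr - qa * t                       # deflate y
        W.append(w)
        P.append(p)
        q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)        # regression coefficients
    return x_mean, y_mean, b

def pls1_predict(X, x_mean, y_mean, b):
    return (X - x_mean) @ b + y_mean

# Invented two-class data: 1 = in the class, 0 = out of the class
rng = np.random.default_rng(0)
grid = np.linspace(220.0, 360.0, 57)           # hypothetical wavelength grid, nm

def band(center, height):
    return height * np.exp(-0.5 * ((grid - center) / 8.0) ** 2)

in_class = band(260.0, 10.0) - band(240.0, 5.0)
out_class = band(290.0, 10.0) - band(265.0, 4.0)
X = np.vstack(
    [in_class + rng.normal(0.0, 0.2, grid.size) for _ in range(8)]
    + [out_class + rng.normal(0.0, 0.2, grid.size) for _ in range(8)]
)
y = np.array([1.0] * 8 + [0.0] * 8)

x_mean, y_mean, b = pls1_fit(X, y, n_latent=2)
y_hat = pls1_predict(X, x_mean, y_mean, b)
predicted_in_class = y_hat > 0.5               # assumed fixed cut-off
```

The regression vector b has one entry per wavelength, so inspecting its largest absolute values is one way to see which spectral regions drive the class assignment.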
In this work, PLS-DA has been applied to a subset of the full data set (calibration data
set) whereas the remaining samples have been used as a validation data set. The
validation set includes samples 44 - 50 (in which there is a mixture of sequences) and
9 additional samples selected by using the Kennard-Stone algorithm [21]. Leave-one-out cross-validation on the calibration data set was used to test the ability of
the method to carry out the class recognition. The threshold value that separates the
classes is calculated using a Bayesian statistical approach and assigns each DNA
structure as either inside or outside a given class.
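For readers unfamiliar with the Kennard-Stone selection mentioned above, a minimal Python sketch follows. It uses the usual formulation of the algorithm (start from the two most distant samples, then repeatedly add the sample whose nearest already-selected neighbour is farthest away) with Euclidean distances. The function name and the toy data are assumptions for illustration, and the details may differ from the implementation used in this work.

```python
import numpy as np

def kennard_stone(X, n_select):
    # Returns the indices of n_select samples spread uniformly over the data
    n = X.shape[0]
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Full matrix of pairwise Euclidean distances
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start with the two samples that are farthest apart
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_select:
        # Distance from each remaining sample to its closest selected sample
        min_d = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

# Toy one-variable data set (illustrative values only)
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [5.0]])
chosen = kennard_stone(X_toy, 4)
```

In the toy example the two extremes (0 and 10) are picked first, followed by the sample at 5 and then the sample at 2, so the selection covers the range of the data before filling in its interior.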
Software
The software used in this work for the PCA and PLS-DA was the PLS toolbox 3.5 for
MATLAB® from Eigenvector Research. HCA was performed using pdist (calculates the
pairwise distance between observations using the distance measurement method
specified by the user), linkage (creates a hierarchical clustering tree using the algorithm
specified by the user) and dendrogram (generates the dendrogram plot) functions of
the Statistics Toolbox for Matlab®.
5. RESULTS
A) Building up the chemometrical models
Figure 1 shows the measured CD spectra that have been later analyzed. Baseline
subtraction and Savitzky-Golay smoothing [22] data pretreatments have been applied
prior to the analysis. The results obtained have been organized in three blocks
corresponding to the considered chemometrical methods used in the classification
procedure. First, we have considered the unsupervised classification methods
(Hierarchical Clustering and Principal Component Analysis) and, finally, the supervised
classification method (Partial Least Squares – Discriminant Analysis).
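The pretreatment mentioned above (baseline subtraction followed by Savitzky-Golay smoothing) can be sketched in Python with SciPy. The simulated spectrum, the flat baseline, the 1 nm grid, and the window length and polynomial order are all assumptions chosen for illustration; the settings actually used in this work are not stated.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(1)
wavelengths = np.linspace(220.0, 360.0, 141)   # hypothetical 1 nm grid

# Invented noise-free band, flat buffer blank and noisy measurement (mdeg)
true_signal = 8.0 * np.exp(-0.5 * ((wavelengths - 260.0) / 10.0) ** 2)
baseline = np.full_like(wavelengths, 0.5)
raw = true_signal + baseline + rng.normal(0.0, 0.3, wavelengths.size)

# Step 1: subtract the blank spectrum
corrected = raw - baseline

# Step 2: Savitzky-Golay smoothing (local quadratic fit over 11 points, assumed)
smoothed = savgol_filter(corrected, window_length=11, polyorder=2)
```

A wider window removes more noise but starts to flatten narrow bands, so the window length is normally chosen relative to the width of the narrowest spectral feature of interest.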
Hierarchical clustering analysis (HCA)
As explained in the “Chemometrical Methods” section there are two parameters
(distance measurement and linkage method) that should be optimized to build a
reliable dendrogram. Several options for these two parameters have been investigated
and, finally, the Euclidean distance, as distance measurement, and the Ward’s method,
as linkage method, were selected. Other distance measurements, such as the
Semieuclidean or Chebychev distances, and other linkage methods, such as the
complete or the weighted methods, also provided acceptable results.
Figure 2 shows the calculated dendrogram. Two main branches can be clearly
distinguished. The branch on the left contains the disordered DNAs together with
several triplex, i-motifs and antiparallel G-quadruplex structures, whereas the branch
on the right includes most of the parallel G-quadruplex and duplex structures listed in
Table I. Based on the known CD signatures for these structures and on the
characteristics of the data set studied in this work, it can be deduced that the right
branch includes those DNA secondary structures showing a positive CD band at 260
nm (Figure 1), whereas the left branch contains those structures which do not show
this CD signature [10].
Now, we will study in detail each one of these two main branches. The left one contains
in turn three clusters. The first one (samples 2 to 16) undoubtedly contains the samples
that present a disordered structure. This cluster includes samples 1 to 10 (measured under
denaturing conditions, i.e. at high temperature) and samples 11 and 12, whose base
sequence does not allow the formation of higher order structures. In all these cases,
the measured CD spectra showed low intensity (less than 5 mdeg) and no significant
CD signatures. Finally, this cluster also contains samples 16 and 41. CD spectra of
these samples showed weak signals probably because of the ionic medium in which
the DNA sequences were dissolved (Na+ cations). For instance, in the case of sample
41, it is known that the formation of the G-quadruplex structure is clearly favored in the
presence of K+ over Na+ [7]. The second cluster includes 5 samples that show weak
positive bands between 270-290 nm and a strong negative band around 250 nm.
These five samples are the two triplex structures (samples 28 and 29) that are present
in the data set and three B-DNA duplex structures: the Dickerson oligonucleotide
(sample 19) and the two d(CGCGCGCG) oligonucleotides, either in potassium (sample
13) or sodium saline medium (sample 14). Finally, the third cluster can be assigned to
the antiparallel quadruplex structures. These samples are characterized by a positive
band around 285-290 nm and weaker negative and positive bands around 265 and 240 nm,
respectively. Moreover, we can see that samples 42 and 43
correspond to the structure known as i-motif while samples 32, 38, 39 and 40
correspond to the antiparallel G-quadruplex structure.
As explained above, the right branch corresponds to samples that show a strong
positive CD band at 260 nm. This band has been assigned to both the duplex DNA and
to the parallel G-quadruplex. Differentiation of these two structures within the different
clusters is not so obvious. However, a careful analysis of CD spectra allows us to
explain the differentiation between the two major clusters that may be observed in this
branch. Hence, the cluster on the left (samples 15 to 33) corresponds to DNA samples
whose CD spectra showed additional features. Thus, these samples show small
contributions in the CD spectrum around 280 nm possibly due to the presence of mixed
topologies or to the existence of additional structures in solution. A clear example of
this behavior is sample 33 that shows a maximum at 260 nm and a shoulder around
290 nm. In this case, a mixed parallel/antiparallel topology has been proposed for
this sequence [23]. In contrast, the cluster on the right (samples 17 to
50) corresponds to DNA structures whose spectra only show maxima around 260 nm
and minima around 240 nm.
Principal Component Analysis (PCA)
The first step in the creation of the PCA model was to determine the optimal number of
components needed to explain most of the variance of the data while avoiding
overfitting. In this case, the selected number of components was 3, explaining
approximately 91% of the data variance (PC1 explains 62.1% of the variance, PC2
17.5% and PC3 11.3%).
The results obtained are shown in Figure 3. The loadings plot allows the selection of
the key wavelengths for each of the 3 components (Figure 3a), whereas the scores
plots allow the classification of the samples (Figure 3b-d).
The first component separates samples showing a clear positive signal at 260 nm, like
B-DNA duplexes and parallel G-quadruplexes, from the others. In fact, the shape of the
corresponding loading resembles that of the typical CD spectrum for a B-DNA duplex
[11].
The second component models samples that show a positive signal around 285 nm
and negative signal around 260 nm. This group comprises i-motifs (samples 42 and 43)
and also some characteristic B-DNA duplex structures (samples 13, 14 and 19). In this
case, although described as B-DNA, the CD spectra of these samples are clearly
different from the one typically associated with B-DNA (which is similar to the loading of
the first component). This is due to the high content of guanine and cytosine bases in
these sequences, which produces several distortions of the canonical B-DNA structure
[3].
Finally, the third component models samples which show positive signals around 295
nm and 245 nm. This group mainly comprises antiparallel G-quadruplexes, which show a
typical signature at 295 nm. On the other hand, samples 13, 14, 19, 28 and 29 show a
negative correlation because their CD spectra present a significant negative signal
around 250 nm.
Looking at the scores plots, it is interesting to point out that DNA samples showing a
disordered structure are very close to the origin of coordinates in all three principal
components. This is probably because these samples do not show any significant CD
signal and, therefore, are not relevant to any of the three components.
Globally, the results obtained with PCA are consistent with those previously obtained
with HCA.
Partial Least Squares Discriminant Analysis (PLS-DA)
Finally, the supervised PLS-DA method has been applied to classify the distinct DNA
structures based on their CD spectra. In order to analyze matrix X it is first necessary to
define the class membership information (matrix Y). Hence, PLS-DA
analysis has been carried out from two different viewpoints. Firstly, the classes in Y
have been defined from the information previously obtained with HCA and PCA.
Secondly, the classes in Y have been defined from the information about the DNA
structures found in the literature (as listed in Table I). In both cases the data set has
been split into a calibration set and a validation set. The validation set contains 16
samples, including 7 samples in which there is a mixing of two DNA sequences
(samples 44 to 50) and 9 samples selected using the Kennard-Stone algorithm.
The results obtained when using the first viewpoint are shown in Figure 4. In this case,
3 latent variables have been used to build the model (B in Equation 2). Figure 4a
shows the modeling of the samples with disordered structure in which the CD spectra
do not show any significant feature. As expected, it is difficult to classify these samples
into a single class because of the absence of key features in their CD spectra. On the
other hand, a good classification of the remaining classes is obtained. Hence, triplex
and B-DNA structures (which were classified in the second HCA cluster) appear above
the calculated threshold (Figure 4b). Moreover, sample number 14, included in the
validation set, has been properly classified. Similarly, antiparallel G-quadruplexes (which
were classified in the third HCA cluster) appear quite well resolved (Figure 4c), despite a
false positive that corresponds to the sample 33 (this sample shows the maximum in
the spectrum at 260 nm and a significant contribution around 290 nm). Also, it can be
seen that validation sample number 40 has been correctly classified. Finally, the last
HCA cluster, which corresponds to B-DNA duplex and parallel G-quadruplex
structures, has been also correctly classified by the PLS-DA method (Figure 4d).
PLS-DA allowed the prediction of the structure for samples consisting of a mixture of
two sequences (for example, samples 44 and 45). It has been attempted to carry out
the classification of these samples distinguishing between the two subgroups (duplex
and quadruplex) that can be seen in HCA or PCA but the results obtained were not
satisfactory.
Finally, the PLS-DA results obtained when considering the second viewpoint are shown
in Figure 5. In this case, the classification was considerably more complex than in the
previous case because it does not take into account the spectral properties of the
different DNA structures. So, in the building of the model we have only considered the
a priori expected structure.
As observed previously when using the classes predicted by HCA, the classification of
the disordered samples is not good (Figure 5a). These samples are distributed
throughout all the space, hampering an efficient classification. In contrast, the
classification of the duplex structures is correct (Figure 5b) as most of the samples are
above the calculated threshold. However, two false negatives (samples 22 and 25) and
two false positives for G-quadruplex structures (samples 33 and 37) can be observed.
Looking at the validation set, samples that were expected to be duplex structures have
been classified correctly. Figure 5c shows the classification of the two samples
identified as triplex structures and their separation from all the other samples. Finally,
Figure 5d shows the classification obtained for both parallel and antiparallel G-quadruplexes. In this case, we have obtained an almost correct classification for the
considered samples despite the existence of three false negatives and three false
positives. In the case of the validation set, the samples that could be expected to be G-quadruplex structures (for instance, samples 30 and 40) have been properly classified.
The comparison of the two calculated PLS-DA models shows that
defining the classes from the information obtained by HCA or PCA yields a better model. Despite this,
the results obtained with the model created only with the structural information
available in the literature can be considered acceptable.
B) Application examples of the proposed chemometrical models to unknown
samples
Two additional samples (see last rows of Table I) were used to test the prediction
ability of the previously proposed HCA, PCA and PLS-DA models. Both samples
contain an equimolar mixture of a guanine-rich strand and of a cytosine-rich strand but
were measured at different pH values. Guanine-rich strands can form antiparallel or
parallel G-quadruplex structures from neutral to slightly acid pH values. Cytosine-rich
strands form antiparallel i-motif structures at slightly acid pH values. At neutral pH
values, the formation of a B-DNA duplex structure is expected. According to this, the
major structure in sample A, prepared at pH 7.1, would correspond to a B-DNA duplex
(Figure 6a). On the contrary, it is more difficult to predict the structures present in
sample B, prepared at pH 2.9, because the protonation of cytosine bases will hinder
the formation of the B-DNA duplex structure. Therefore, a mixture of an i-motif and a
G-quadruplex could be expected.
Hierarchical clustering analysis (HCA)
A new HCA model was built up using all 52 samples listed in Table I. The Euclidean
distance and the Ward’s method were used for measuring distances and linking,
respectively. The new dendrogram was almost identical to that previously calculated
when only the 50 calibration samples were considered (Figure 2). Because of this, only
the classification of the new samples will now be discussed. First, sample A has been
classified inside cluster number 4, between samples 31 and 33. As commented
previously, this cluster is characterized by a strong positive band at 260 nm and
weaker contributions around 280 nm. According to this, DNA in sample A will have a
major contribution of B-DNA duplex structure and minor contributions of other
structures, like G-quadruplex or i-motif. On the other hand, sample B has been
classified inside cluster 3, between samples 40 and 38. As said before, this cluster
contains mainly antiparallel quadruplex structures. According to this, a mixture of i-motif
structure and / or antiparallel G-quadruplex will be predominant.
Principal Component Analysis (PCA)
PCA has also been used to predict the structures adopted by the DNA sequences in
the new samples. The projection of the corresponding CD spectra into the space
spanned by the previously calculated PCA model is shown in Figures 6b and 6c,
respectively. In Figure 6b, sample A is now located at the right end of the PC1 axis,
surrounded by duplex-related samples. According to this, sample A will be assigned to
a B-DNA duplex structure. As in the case of HCA, more difficulties were found when
studying sample B. In this case, the assignment of this sample to an unambiguous
group is not straightforward. In Figure 6c, this sample has been placed in the middle of
the group of antiparallel quadruplex structures (see cluster in the first quadrant of
Figure 3d). However, when looking at Figure 6b, this sample is only close to samples
42 and 43 and far away from other samples of this group (such as 32, 38 or 39).
Moreover, it could be considered that the sample is located also close to the B-DNA
duplex structures. This behavior can help us to explain the composition of the mixture
with a component of antiparallel quadruplex structure (probably an i-motif because of its
proximity to samples 42 and 43) and another component of B-DNA duplex structure or
parallel G-quadruplex.
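The projection of a new spectrum onto a previously calculated PCA model can be sketched as follows. This is a minimal illustration under stated assumptions: the data are random numbers rather than CD spectra, and the function names `fit_pca` and `project` are hypothetical. The key point is that the new spectrum is centred with the mean of the calibration set and multiplied by the loadings of the existing model, without recalculating the model.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return the column mean, loadings and scores of a mean-centred PCA."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    loadings = Vt[:n_components].T          # (variables x components)
    scores = (X - mean) @ loadings          # (samples x components)
    return mean, loadings, scores

def project(X_new, mean, loadings):
    """Project new rows onto an existing model using the calibration mean."""
    return (np.atleast_2d(X_new) - mean) @ loadings

rng = np.random.default_rng(1)
X_train = rng.normal(size=(20, 50))         # stand-in for the calibration spectra
mean, P, T = fit_pca(X_train, n_components=3)

# A "new" sample built as a slightly perturbed copy of calibration sample 4
x_new = X_train[4] + rng.normal(0.0, 1e-6, size=50)
t_new = project(x_new, mean, P)

# Calibration samples closest to the new sample in the score space
distances = np.linalg.norm(T - t_new, axis=1)
nearest = np.argsort(distances)[:3]
print("closest calibration samples:", nearest)
```

Inspecting the closest calibration samples in each scores plot is the numerical counterpart of the visual comparison made in Figures 6b and 6c.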
Partial Least Squares Discriminant Analysis (PLS-DA)
Finally, PLS-DA has been applied to predict the structures present in samples A and B
using the first viewpoint discussed above. Figures 6d and 6e show the prediction
for classes 3 (antiparallel G-quadruplex and i-motif) and 4 (B-DNA duplex and
parallel G-quadruplex), respectively. The prediction for classes 1 (disordered structures) and 2
(triplex and some characteristic B-DNA structures) is not shown because it does not
provide any significant information. According to Figure 6d, sample A cannot be
classified as belonging to class 3; on the contrary, it can be clearly assigned to
class 4 (Figure 6e). Sample B cannot be labeled as belonging to only
one class because it shows characteristic features of two classes (see Figures 6d and
6e): 3 (antiparallel G-quadruplex and i-motif) and 4 (B-DNA duplex and parallel
G-quadruplex). Thus, the classification of both samples according to the PLS-DA model
agrees well with that previously obtained with the PCA model.
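The PLS-DA prediction step can be sketched as follows. This is a minimal illustration and not the implementation used in this work: it assumes synthetic Gaussian bands, a single class coded as a 0/1 dummy variable (one-versus-rest), a PLS1 model fitted with the NIPALS algorithm, and a decision threshold of 0.5, which is an assumption. The names `band`, `pls1_fit` and `pls1_predict` are hypothetical.

```python
import numpy as np

wavelengths = np.arange(220, 321)  # nm, hypothetical grid

def band(center, height, width=8.0):
    """Gaussian band used as a stand-in for a CD band (hypothetical)."""
    return height * np.exp(-0.5 * ((wavelengths - center) / width) ** 2)

def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS; return centring terms and the regression vector."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)              # weight vector
        t = Xr @ w                          # scores
        tt = t @ t
        p = Xr.T @ t / tt                   # X loadings
        c = (yr @ t) / tt                   # y loading
        Xr = Xr - np.outer(t, p)            # deflate X
        yr = yr - c * t                     # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)     # regression vector
    return x_mean, y_mean, b

def pls1_predict(X_new, x_mean, y_mean, b):
    return (np.atleast_2d(X_new) - x_mean) @ b + y_mean

rng = np.random.default_rng(2)
pure_a = band(260, 1.0) + band(280, 0.2)    # assumed shape for the modelled class
pure_b = band(290, 1.0) + band(260, -0.5)   # assumed shape for the other class
class_a = [pure_a + rng.normal(0.0, 0.01, wavelengths.size) for _ in range(6)]
class_b = [pure_b + rng.normal(0.0, 0.01, wavelengths.size) for _ in range(6)]

X = np.vstack(class_a + class_b)
y = np.r_[np.ones(6), np.zeros(6)]          # dummy variable: 1 = modelled class
model = pls1_fit(X, y, n_components=2)

mixture = 0.5 * pure_a + 0.5 * pure_b       # equimolar mixture of both shapes
y_pred = pls1_predict(np.vstack([pure_a, pure_b, mixture]), *model)
print("predicted Y (pure a, pure b, mixture):", y_pred)
```

Because the model is linear, a spectrum that is a mixture of two class signatures receives an intermediate predicted Y, which is one way to read the ambiguous assignment of sample B.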
6. CONCLUSIONS
The application of multivariate data analysis methods such as HCA, PCA and PLS-DA to a
CD spectra data set has been shown to be a useful tool for classifying DNA
sequences according to their experimental CD spectra. The three chemometric
methods used in this work have made it possible to extract information from the data set.
The unsupervised methods, Hierarchical Clustering Analysis and Principal Component
Analysis, have provided complementary information about the grouping of the samples.
In addition, PCA has provided information about the key wavelengths for this classification.
Finally, PLS-DA, as a supervised method, has made it possible to classify the samples that
present a characteristic spectral signature, but it fails when the samples do not show
significant features. The classification procedure proposed in this work can be especially
useful when it is applied to the elucidation of mixtures of DNA sequences where more
than one major structure could be present. This has been demonstrated in the example
presented, where the chemometric methods have allowed the prediction of the
structure of the DNA present in solution. This is the most common situation in
research on DNA structures, for example for DNA sequences
corresponding to the promoter regions of several oncogenes, such as bcl-2, c-kit or c-myc
[24, 25]. In those cases, the simultaneous presence of guanine-rich and cytosine-rich
strands can produce a mixture of several structures, such as G-quadruplex, i-motif and
B-DNA duplex, depending on the experimental conditions [26, 27]. Knowledge of the
structures present in these mixtures can be useful when designing drugs whose mode
of interaction strongly depends on the DNA structure. To make the
results shown in this manuscript available to the chemometrics and circular dichroism
communities, two routes will be followed. First, the CD dataset is publicly
available for download from our webpage (http://www.ub.es/gesq/dna/). Second,
models obtained with this dataset (or extended datasets) will be implemented in a
web-based application to allow users without chemometrics expertise to predict DNA
structures from circular dichroism spectra.
7. ACKNOWLEDGEMENTS
This research was supported by the Spanish MEC (CTQ2006-15052-C02-01/BQU,
CTQ2007-28940-E/BQU and BFU2007-63287/BMC).
8. TABLES AND FIGURES CAPTIONS
Table I. Description of the CD spectra data set.
Scheme 1. Examples of DNA structures. a) duplex (antiparallel, intermolecular), b)
hairpin (antiparallel, intramolecular), c) triplex (parallel, intermolecular), d)
G-quadruplex (antiparallel, intramolecular), e) i-motif (intramolecular).
Figure 1. Experimental CD spectra of the DNA samples
Figure 2. Dendrogram obtained by HCA using Ward’s linkage method and Euclidean
distance.
Figure 3. PCA analysis: a) Loadings plot for three components (Solid Line: PC1, Green
Dotted Line: PC2, Dashed Line: PC3), b) PC1 vs. PC2 scores plot c) PC1 vs. PC3
scores plot and d) PC2 vs. PC3 scores plot. Symbols: ▼: class 1 (disordered
structures), : class 2 (duplex structures), ■: class 3 (triplex structures), +: class 4
(quadruplex structures) and ◊: class 5 (mixture samples).
Figure 4. PLS-DA model for the first viewpoint. a) Y predicted for class 1, b) Y
predicted for class 2, c) Y predicted for class 3 and d) Y predicted for class 4. Symbols:
▼: class 1 (HCA cluster number 1), : class 2 (HCA cluster number 2), ■: class 3 (HCA
cluster number 3), +: class 4 (HCA cluster number 4) and ◊: validation samples.
Figure 5. PLS-DA model for the second viewpoint. a) Y predicted for class 1, b) Y
predicted for class 2, c) Y predicted for class 3 and d) Y predicted for class 4. Symbols:
▼: class 1 in Table I, : class 2 in Table I, ■: class 3 in Table I, +: class 4 in Table I
and ◊: validation set.
Figure 6. Results obtained for the prediction of two new samples. a) Experimental CD
spectra. Solid Line: Sample A, Dashed Line: Sample B, b) PCA analysis: PC1 vs. PC2
scores plot, c) PCA analysis: PC2 vs. PC3 scores plot, d) PLS-DA analysis: Y
predicted for class 3 and e) PLS-DA analysis: Y predicted for class 4. Symbols
in panels b) and c) as in Figure 3. Symbols in panels d) and e) as in Figure
4.
9. TABLES AND FIGURES
Table I.

| Code | DNA Sequence | Expected structure | Class Code | Conditions* |
|---|---|---|---|---|
| 1 | 5’-CCGGCCGG-3’ | Single Strand | 1 | HT, K, pH7 |
| 2 | 5’-TCTCCTCCTTC-3’ | Single Strand | 1 | HT, K, pH7 |
| 3 | 5’-GAAGGAGGAGA-3’-(EG)6-3’-TCTCCTCCTTC-5’ | Single Strand | 1 | HT, K, pH7 |
| 4 | 5’-GAAGGAGGAGA-T4-TGTGGTGGTTG-3’ | Single Strand | 1 | HT, K, pH7 |
| 5 | 5’-phos-AGGAGA-T4-AGAGGAGGAAG-T4-GAAGG | Single Strand | 1 | HT, K, pH7 |
| 6 | Mixture of DNA Sequences 2 & 4 | Single Strand | 1 | HT, K, pH7 |
| 7 | Mixture of DNA Sequences 2 & 5 | Single Strand | 1 | HT, K, pH7 |
| 8 | 5’-CGCGCGCG-3’ | Single Strand | 1 | HT, Na, pH7 |
| 9 | 5’-CCGGCCGG-3’ | Single Strand | 1 | HT, Na, pH7 |
| 10 | 5’-CCCCGGGG-3’ | Single Strand | 1 | HT, Na, pH7 |
| 11 | 5’-TCTCCTCCTTC-3’ | Single Strand | 1 | LT, K, pH7 |
| 12 | 5’-ACCCTAACCCTA-3’ | Single Strand | 1 | LT, K, pH7 |
| 13 | 5’-CGCGCGCG-3’ | Duplex Inter Antiparallel | 2 | LT, K, pH7 |
| 14 | 5’-CGCGCGCG-3’ | Duplex Inter Antiparallel | 2 | LT, Na, pH7 |
| 15 | 5’-CCGGCCGG-3’ | Duplex Inter Antiparallel | 2 | LT, K, pH7 |
| 16 | 5’-CCGGCCGG-3’ | Duplex Inter Antiparallel | 2 | LT, Na, pH7 |
| 17 | 5’-CCCCGGGG-3’ | Duplex Inter Antiparallel | 2 | LT, K, pH7 |
| 18 | 5’-CCCCGGGG-3’ | Duplex Inter Antiparallel | 2 | LT, Na, pH7 |
| 19 | 5’-CGCGAATTCGCG-3’ | Duplex Inter Antiparallel | 2 | LT, K, pH7 |
| 20 | 5’-GAAGGAGGAGA-3’-(EG)6-3’-TCTCCTCCTTC-5’ | Duplex Intra Parallel | 2 | LT, K, pH7 |
| 21 | 3’-AGANGGANGGAAG-5’-5’-T4-CTTCCTCCTCT-3’ | Duplex Intra Parallel | 2 | LT, K, pH7 |
| 22 | 3’-AGANGGANGGAAG-CTTTG-5’-5’-CTTCCTCCTCT-3’ | Duplex Intra Parallel | 2 | LT, K, pH7 |
| 23 | 5’-GAAGGANGGANGA-T4-AGAGGAGGAAG-3’ | Duplex Intra Antiparallel | 2 | LT, K, pH7 |
| 24 | 5’-GAAGGAGGAGA-T4-TGTGGTGGTTG-3’ | Duplex Intra Antiparallel | 2 | LT, K, pH7 |
| 25 | 5’-GAAGGANGGANGA-T4-TGTGGTGGTTG-3’ | Duplex Intra Antiparallel | 2 | LT, K, pH7 |
| 26 | 5’-phos-AGGAGA-T4-TGTGGTGGTTG-T4-GAAGG-3’ | Duplex Intra Antiparallel | 2 | LT, K, pH7 |
| 27 | 5’-phos-AGGAGA-T4-AGAGGAGGAAG-T4-GAAGG-3’ | Duplex Intra Antiparallel | 2 | LT, K, pH7 |
| 28 | 5’-T12-(EG)6-A12-3’-(EG)6-3’-T12-5’ | Triplex reversed | 3 | LT, K, pH7 |
| 29 | 5’-T12-(EG)6-A12-(EG)6-T12-3’ | Triplex normal | 3 | LT, K, pH7 |
| 30 | 5’-CGGGCACGGGAGGAAGGGGGCGGG-3’ | G-quadruplex Parallel | 4 | LT, K, pH5 |
| 31 | 5’-CGGGCACGGGAGGAPAGGGGGCGGG-3’ | G-quadruplex Parallel | 4 | LT, K, pH7 |
| 32 | 5’-GGCGCGGGAGGAATTGGGCGGG-3’ | G-quadruplex Parallel | 4 | LT, Na, pH7 |
| 33 | 5’-GCGCGGGAGGAATTGGGCGGG-3’ | G-quadruplex Parallel | 4 | LT, K, pH7 |
| 34 | 5’-TGGGGGT-3’ | G-quadruplex Parallel | 4 | LT, Na, pH7 |
| 35 | 5’-TGGGGGT-3’ | G-quadruplex Parallel | 4 | LT, K, pH7 |
| 36 | 5’ GGNGTTGGGTGTGGGTTGGG 3’ | G-quadruplex Antiparallel | 4 | LT, K, pH7 |
| 37 | 5’ GGGNTTGGGTGTGGGTTGGG 3’ | G-quadruplex Antiparallel | 4 | LT, K, pH7 |
| 38 | 5’-GGTTGGTGTGGTTGG-3’-biot | G-quadruplex Antiparallel | 4 | LT, K, pH7 |
| 39 | 5’-TAGGGTTAGGGT-3’ | G-quadruplex Antiparallel | 4 | LT, K, pH7 |
| 40 | 5’ GGNGTTGGGTGTGGGTTGGG 3’ | G-quadruplex Antiparallel | 4 | LT, Na, pH7 |
| 41 | 5’ GGGNTTGGGTGTGGGTTGGG 3’ | G-quadruplex Antiparallel | 4 | LT, Na, pH7 |
| 42 | 5’-CCCGCCCAATTCCTCCCGCGCCCG-3’ | i-motif | 4 | LT, K, pH5 |
| 43 | 5’-CCCGACCCCTTCAPTCCCGAGCCCG-3’ | i-motif | 4 | LT, K, pH5 |
| 44 | Mixture of DNA Sequences 33 & 42 | Duplex + Quadruplex | 5 | LT, K, pH7 |
| 45 | Mixture of DNA Sequences 33 & 42 | Duplex + Quadruplex | 5 | LT, K, pH5 |
| 46 | Mixture of DNA Sequences 2 & 20 | Duplex + Triplex | 5 | LT, K, pH7 |
| 47 | Mixture of DNA Sequences 2 & 25 | Duplex + Triplex | 5 | LT, K, pH7 |
| 48 | Mixture of DNA Sequences 2 & 22 | Duplex + Triplex | 5 | LT, K, pH7 |
| 49 | Mixture of DNA Sequences 2 & 26 | Duplex + Triplex | 5 | LT, K, pH7 |
| 50 | Mixture of DNA Sequences 2 & 27 | Duplex + Triplex | 5 | LT, K, pH7 |
| A | 5’-CCCCTCCCTCGCGCCCGCCCG-3’ + 5’-CGGGCGGGCGCGAGGGAGGGG-3’ | Duplex | - | LT, K, pH7.1 |
| B | 5’-CCCCTCCCTCGCGCCCGCCCG-3’ + 5’-CGGGCGGGCGCGAGGGAGGGG-3’ | Unknown | - | LT, K, pH2.9 |
where inter denotes intermolecular; intra denotes intramolecular; (EG)6 denotes a hexaethyleneglycol linker;
biot denotes biotin tetraethyleneglycol (biotin-TEG); phos denotes phosphate; AN denotes 8-aminoadenine; AP denotes 2-aminopurine and GN denotes 8-aminoguanine.
* In the Conditions column, HT refers to High Temperature (85 ºC), LT to Low Temperature (20 ºC), K to
a 150 mM potassium medium, Na to a 150 mM sodium medium, and pH7 and pH5 to the pH of the solution
(pH 6.9 was obtained with a phosphate buffer, pH 5.1 was obtained with an acetate buffer, and pH 7.1
and 2.9 were obtained by adding the appropriate volumes of HCl/NaOH).
10. REFERENCES
[1] J.D. Watson, F.H.C. Crick, Nature, 171 (1953) 737.
[2] W. Saenger, Principles of nucleic acid structure, Springer, New York, NY, USA,
1988.
[3] V.A. Bloomfield, D.M. Crothers, I. Tinoco Jr., Nucleic Acids: Structures, Properties and
Functions, University Science Books, Sausalito, CA, USA, 1999.
[4] D.E. Gilbert, J. Feigon, Curr Opin Struc Biol, 9 (1999) 305.
[5] L.E. Xodo, G. Manzini, M. Alunnifabbroni, B. Scaggiante, F. Quadrifoglio, Acta
Pharmaceut, 42 (1992) 299.
[6] T. Simonsson, Biological Chemistry, 382 (2001) 621.
[7] S. Neidle, S. Balasubramanian (Eds.), Quadruplex Nucleic Acids, RSC
Biomolecular Sciences, Cambridge, 2006.
[8] M.A. Keniry, Biopolymers, 56 (2000) 123.
[9] M. Gueron, J.L. Leroy, Curr Opin Struc Biol, 10 (2000) 326.
[10] D.M. Gray, S.H. Hung, K.H. Johnson, Methods in enzymology, 216 (1995) 19.
[11] G.D. Fasman, Circular Dichroism and the conformational analysis of biomolecules,
Plenum Press, New York, NY, USA, 1996.
[12] N. Berova, K. Nakanishi, R.W. Woody (Eds.), Circular Dichroism. Principles and
Applications, Wiley-VCH Inc., New York, US, 2000.
[13] D.L. Massart, L. Kaufman, The interpretation of analytical chemical data by the use
of cluster analysis, Wiley, New York, US, 1983.
[14] D.L. Massart, L.M.C. Buydens, B.G.M. Vandeginste, Handbook of Chemometrics
and Qualimetrics, Elsevier, Amsterdam, The Netherlands, 1997.
[15] S. Wold, A. Ruhe, H. Wold, W.J. Dunn, Siam Journal on Scientific and Statistical
Computing, 5 (1984) 735.
[16] A. Avino, M. Frieden, J.C. Morales, B. Garcia de la Torre, R. Guimil Garcia, F.
Azorin, J.L. Gelpi, M. Orozco, C. Gonzalez, R. Eritja, Nucleic Acids Res, 30 (2002)
2609.
[17] M.G. Grimau, A. Avino, R. Gargallo, R. Eritja, Chem Biodivers, 2 (2005) 275.
[18] http://proligo2.proligo.com/Calculation/calculation.html.
[19] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data; An Introduction to Cluster
Analysis., Wiley, New York, 1990.
[20] S. Wold, K. Esbensen, P. Geladi, Chemometrics and Intelligent Laboratory
Systems, 2 (1987) 37.
[21] R.W. Kennard, L.A. Stone, Technometrics, 11 (1969) 137.
[22] A. Savitzky, M.J.E. Golay, Analytical Chemistry, 36 (1964) 1627.
[23] J.X. Dai, T.S. Dexheimer, D. Chen, M. Carver, A. Ambrus, R.A. Jones, D.Z. Yang,
Journal of the American Chemical Society, 128 (2006) 1096.
[24] A. Chanan-Khan, Blood Reviews, 19 (2005) 213.
[25] D.J. Patel, A.T. Phan, V. Kuryavyi, Nucl. Acids Res., 35 (2007) 7429.
[26] S. Neidle, M.A. Read, Biopolymers, 56 (2000) 195.
[27] L.H. Hurley, D.D. Von Hoff, A. Siddiqui-Jain, D.Z. Yang, Seminars in Oncology, 33
(2006) 498.
Scheme 1
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6