Download VizStruct for visualization of genome

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
BIOINFORMATICS
ORIGINAL PAPER
Vol. 22 no. 13 2006, pages 1569–1576
doi:10.1093/bioinformatics/btl144
Sequence analysis
VizStruct for visualization of genome-wide SNP analyses
Kavitha Bhasi1, Li Zhang3, Daniel Brazeau1, Aidong Zhang2 and Murali Ramanathan1,
1
Department of Pharmaceutical Sciences and 2Deparment of Computer Science and Engineering,
State University of New York, Buffalo, NY 14260, USA and 3Department of Computer Science,
Eastern Michigan University, Ypsilanti, MI 48197, USA
Received on March 2, 2006; revised on April 9, 2006; accepted on April 10, 2006
Advance Access publication April 13, 2006
Associate Editor: Alex Bateman
ABSTRACT
Motivation: The size, dimensionality and the limited range of the data
values make visualization of single nucleotide polymorphism (SNP)
datasets challenging. The purpose of this study is to evaluate the usefulness of 3D VizStruct, a novel multi-dimensional data visualization
technique for analyzing patterns in SNP datasets.
Results: VizStruct is an interactive visualization technique that reduces
multi-dimensional data to two dimensions using the complex-valued
harmonics of the discrete Fourier transform (DFT). In the 3D
VizStruct extension, the multi-dimensional SNP data vectors are
reduced to three dimensions using a combination of the DFT and the
Kullback–Leibler divergence. The performance of 3D VizStruct was
challenged with several biologically relevant published datasets that
included human Chromosome 21, the human lipoprotein lipase (LPL)
gene locus and the multi-locus genotypes of coral populations. In every
case, the 3D VizStruct mapping provided an intuitive visual description
of the key characteristics of the underlying multi-dimensional genotype.
Availability: Excel and MATLAB code are available at http://www.cse.
buffalo.edu/DBGROUP/bioinformatics/supplementary/SNP/
Contact: [email protected]
1
INTRODUCTION
Technologies capable of simultaneously genotyping thousands of
single nucleotide polymorphisms (SNPs) are now widely employed
in basic biomedical research for investigating the genetic basis of
complex diseases (Mir and Southern, 2000; Suh and Vijg, 2005).
Visualization can also provide effective tools to summarize and
interpret datasets, describe the contents, and expose features in
genome-wide SNP datasets. One of the key obstacles of visualizing
genome-wide SNP data is the high dimensionality. Additional challenges include the size of the datasets (typically data on >10 000
SNPs can be obtained from a single sample), the limited range of the
data values (the data are sequences of ordinal numbers and the
number of values taken by each SNP is very limited: each SNP
is typically called as heterozygous or one of two homozygous states)
and the presence of correlated markers delimiting haplotypes.
Although genotyping technologies have advanced considerably
and a variety of sequence analysis and alignment algorithms and
tools have been developed, analytical visualization of SNP datasets,
the primary focus of this research, has not been extensively investigated in the context of SNP data analysis. However, fast efficient,
effective and easy-to-use analytical visualization tools are essential
To whom correspondence should be addressed.
for identifying and interpreting patterns in large SNP datasets,
generating hypotheses and directing subsequent research.
Many techniques have been developed to visualize multivariate
data and these can potentially be applied to SNP visualization as
well. The simplest multivariate data visualization tool to display
genetic variation is the parallel coordinates plot (Ward, 1994), in
which the data along each dimension is plotted on a separate axis;
e.g. Holter et al. used this method to plot microarray data from yeast
cell cycle experiments (Holter et al., 2000). Heat maps are also
commonly employed in multi-dimensional data visualization, e.g.
Halldorsson et al. used it to summarize the results from their SNP
tagging algorithm (Halldorsson et al., 2004). On closer inspection,
however, it becomes evident that the heat plot is a special version of
the parallel coordinates plot. The Graphical Display of Linkage
Disequilibrium (GOLD) software program is an example of a
software tool that generates heat plots of linkage disequilibriumcoefficient matrices and calculates a variety of disequilibrium
measures (Abecasis and Cookson, 2000).
Multiple sequence alignment (Batzoglou, 2005; Phillips et al.,
2000; Phillips, 2005; Snel et al., 2005; Wallace et al., 2005), which
is widely used in all areas of molecular biology including SNP
analysis, can also be considered a visualization aid because the
algorithms cluster sequences and the resultant alignments are organized to highlight regions of similarity. Conceptually, multiple
sequence alignment can be viewed as an extension of the parallel
coordinates plots because each position in the sequence vector is
assigned a separate column. As with all parallel coordinates plots,
its visualization effectiveness is dependent on prior clustering of the
sequences and it is most effective for small numbers of sequences or
when the similarities or differences are limited to a few patches; it
can be informative but visually less effective for summarizing complex heterogeneous sequence patterns. Graphical representation of
trees (also called dendrogram or cladogram) alignment can complement sequence alignment to provide a quantitative overview of
distance relationships during clustering. Trees are widely employed
in phylogenetic analysis and genome evolution but their structure is
sensitive to the assumptions of the underlying cost function and they
can be difficult to interpret (Horner and Pesole, 2004; Phillips et al.,
2000; Sanderson and Driskell, 2003; Snel et al., 2005; Wallace
et al., 2005).
In the multi-dimensional scaling (MDS) approach, the presentation in two dimensions (2D) is optimized to preserve a specific
aspect of the relationship, e.g. the Euclidean distance, block distance or rank relationships between the points in the N-dimensional
space. In many respects, MDS is the current gold standard for
The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
1569
K.Bhasi et al.
multi-dimensional visualization (Cox and Cox, 1994). For example,
MDS has been used by Hammer et al. to assess global patterns of
Y-chromosome diversity based on 43 polymorphic loci in 50 populations (Hammer et al., 2001) and by Tarazano-Santos et al. to
investigate diversity of the interleukin-13 gene (Tarazona-Santos
and Tishkoff, 2005). In practice, MDS encompasses a class of
methods and Sammon’s non-linear mapping (Sammon, 1969) is
a flavor of MDS that normalizes the distances in the stress function
to distances in the original n-dimensional space, so that it preserves
relationships between small magnitude points better than the
Euclidean distance-based stress function, which tends to ‘undervalue’ small magnitude points.
2
METHODS
2.1
The VizStruct mapping
At the core of VizStruct is a radial projection that maps the n-dimensional
vectors into 2D points while retaining correlation similarity in the original
input space (Bhadra, 2001; Hoffman et al., 1997). If the vector x[n] ¼ (x[0],
x[1], . . . , x[n 1]), represents a data item in n-dimensional space, Rn, its
mapping to a point F1(x[n]) in the complex plane C is given by
F1 ðx½nÞ ¼
n1
X
x½ je2pij/n :
The real and imaginary components of F1(x[n]) are used for creating the
2D mapping. In the Equation (1), above, i ¼ H1 and the complex exponential has the effect of dividing the circle of display into equally spaced
sectors. The equation shown represents a substantive reformulation of the
usual radial visualization mapping and the use of the complex number
notation has significant advantages: it allows easier derivations of the
theoretical underpinnings and an intuitive geometric interpretation of the
mapping (Zhang et al., 2002, 2003, 2004).
In addition, on closer inspection the mapping F1(x[n]) is seen to be
equivalent to the first harmonic of the discrete Fourier transform (DFT).
The relationship between the DFT and the radial visualization mapping,
which was first identified by our group (Zhang et al., 2002, 2003, 2004),
allows the computationally efficient fast Fourier transform algorithm (complexity of O(n log n), where n is the number of dimensions) to be used. It
allows a wide range of enhancements, including higher harmonic analysis,
that have been previously described (Zhang et al., 2002, 2003, 2004).
Coding of SNP datasets
An ordinal scale was used to code the SNP genotype sequences: the numbers
1, 2 and 3 were used for genotypes that were homozygous in the major allele,
heterozygous and homozygous in the minor allele, respectively.
A systematic, sequential approach was used for missing data. Individuals
in whom >75% of the SNP genotypes were missing were excluded. SNP
locations comprised entirely of a combination of missing data and a single
genotype were excluded from the analysis because of the absence of
information; locations where >75% of the SNP genotypes were missing
were also deleted. The remaining missing data points were replaced by
the sample mean for that SNP location.
The coding and computations were conducted in MATLAB and Microsoft
Excel (Microsoft, Bellevue, WA). The data were initially visualized by
graphing in MATLAB. For publication, the polar plots were redrawn
with Kaleidagraph (Synergy Software, Malvern, PA) and 3D graphs were
redrawn with SigmaPlot (SPSS Inc., Chicago, IL).
In all analyses, VizStruct was used without supervision, the dimensions
were uniformly weighted and the symbols and colors were added subsequent
to the computations.
1570
3D analysis
For 3D analysis, we included the Kullback–Leibler divergence (KLD) as the
third dimension or z-coordinate; the complex number corresponding the first
Fourier harmonic was used for the x and y-axes. The KLD between two
probability mass functions p(x) and q(x) is denoted by D(p jj q) and is also
known as the relative entropy. The KLD is defined as follows (Cover and
Thomas, 1991)
X
Dðp jj qÞ ¼
pðxÞlogðpðxÞ/qðxÞÞ
x2X
The base of the logarithm was taken to be 2. The KLD is a measure of the
distance between two distributions or equivalently, it is the inefficiency of
assuming that the distribution is q when the true distribution is p. The KLD
always takes non-negative values, D(pkq) 0), and is zero only if p ¼ q.
More importantly, the KLD is invariant to permutation, monotonic nonlinear transformation and amplitude scaling in the components of the variable X (Haykin, 1999).
The coded SNP data vector was normalized using the sum of its individual
components; this normalization transformed each vector to a probability
distribution summing to unity that was appropriate as the p needed for
computing the KLD. The same reference distribution, q, was used for computing the KLD for all the SNP data vectors. To obtain q, the medians were
computed at each position and the resultant vector of medians was normalized using the sum of its individual components.
ð1Þ
j¼0
2.2
2.3
3
3.1
RESULTS
Analysis of the chromosome 21 dataset
To evaluate the usefulness of the VizStruct approach, we used the
dataset from the SNP analysis of human Chromosome 21 by Patil
et al. (2001). The dataset, obtained using high-density oligonucleotide arrays in combination with somatic cell genetics, consists of
24 047 SNPs typed on 20 haploid copies of the chromosome; it has
been extensively used to assess haplotype-partitioning algorithms
[e.g. (Halldorsson et al., 2004; Zhang et al., 2002)].
The human Chromosome 21 data were downloaded from http://
www.sciencemag.org/feature/data/patil1065573/index.html. In their
paper, Patil et al. (2001) identified blocks consisting of regions of
the DNA sequence in which certain groups of samples shared
sequence similarity; the shared sequence similarity was referred
to as patterns. We coded the more common nucleotide with
1 and the less common nucleotide with 2. For each block, we
projected the sequence corresponding to each sample using
VizStruct with the goal of determining whether the samples identified by Patil et al. (2001) as belonging to given pattern were visually
grouped in the 2D mapping. Figures 1A and B summarize the results
from VizStruct for two representative blocks, Blocks 4 and 5; the
samples belonging to the same pattern are labeled with the same
color to facilitate comparisons. The results show that samples corresponding to the same pattern are placed near each other. However,
the method also identifies specific samples that are outliers relative
to others in the pattern. We analyzed 48 other blocks from the same
dataset and achieved excellent visualization results for each (data
not shown).
3.2
Analysis of lipoprotein lipase genotypes
We also used the human lipoprotein lipase (LPL) gene dataset from
http://droog.gs.washington.edu/mdecode/data/lpl/lpl.prettybase.txt.
wo_n wherein LPL was genotyped at 88 polymorphic sites in
48 individuals (Clark et al., 1998; Nickerson et al., 1998). The
Visualizing SNP data
90
10
A
90
B
45
45
5
0
0
225
315
270
Phase, degrees
Amplitude
Amplitude
5
0
0
225
315
270
Phase, degrees
Fig. 1. (A and B) VizStruct mapping of the Blocks 4 and 5 from Patil et al. (2001). The samples corresponding to the same pattern are highlighted in the
same symbol. The samples with the most frequent pattern are filled with open circles; the samples with the second most frequent pattern are in filled circles;
the remaining patterns had relatively few samples, sometimes just one sample, and are filled squares, open squares and filled triangles.
haplotype phase is available in this dataset; however, because haplotype phase information is generally not available in the majority of
experimental situations, we intentionally coded each SNP location
as being either homozygous in the major allele, heterozygous or
homozygous in the minor allele for visualization.
The dataset contains genotypes of 24 Americans of African
ancestry from M. S. Jackson, who participated in the Family
Blood Pressure Program, a hypertension study, and 24 Americans
of European ancestry from M. N. Rochester, who participated in the
Rochester Family Heart Study.
Figure 2A shows the VizStruct mapping of the M. S. Jackson
(filled circles) and the M. N. Rochester (open circles) samples and
indicates visual separation of the Jackson and Rochester genotypes.
However, there was also partial overlap between the Jackson and
Rochester groups in the 2D VizStruct and this allowed us to assess
the potential usefulness of 3D visualization. The KLD approach was
used for the 3D z-coordinate. The normalized distribution of medians at each site was used to create the reference distribution q,
needed for computing the KLD. The 3D scatter plot in
Figure 2B includes the KLD with the VizStruct mapping and
from comparing Figure 2A with B, it is apparent that the inclusion
of the KLD in Figure 2B, results in further differentiation of the
Jackson data from the Rochester data. From Figure 2B, it is apparent
that the VizStruct mapping of the Rochester data visually shows a
better defined cluster and is generally less variable than the Jackson
data. This is consistent with the conclusions of Clark et al. (1998)
and Nickerson et al. (1998) because it visually reflects the lesser
variability and fewer number of population specific substitutions in
the Rochester group. Thus, the VizStruct approach may potentially
be useful for detecting and estimating the extent of admixture.
3.3
Analysis of the LPL haplotypes
The third analysis involved the 88 LPL haplotypes identified by
Clark et al. (1998) in Figure 2 of their paper (Clark et al., 1998); the
authors compared the human haplotypes with the genotype
sequence from chimpanzee. We coded the major nucleotide at
each site with 1 and variable nucleotide with 2.
Figure 3A shows the polar plot from VizStruct of the distribution
of the 88 human LPL haplotypes and the corresponding chimpanzee
genotype. The chimpanzee data point is highlighted in the filled
square to indicate that the VizStruct method clearly identifies the
chimpanzee genotype sequence as being an outlier. It is important to
note that for every SNP site in humans examined, the chimpanzee
genotype contains one of the human nucleotide variants (Clark
et al., 1998) and that the overall differences between the human
haplotypes sequences and the chimpanzee genotype are therefore
relatively modest. Despite these challenges, the VizStruct method
highlights the chimpanzee genotype sequence sufficiently to invite
visual scrutiny.
A salient finding of Clark et al. (1998) paper was that there were
roughly two major clades that could be identified by counting the
number of pairwise mismatches between all the 142 chromosome
pairs. The VizStruct points corresponding to haplotypes in the two
major clades, in the open and filled circles in Figure 3A, generally
occupy diametrically opposite halves of the circular polar plot
frame. Figure 3B shows the distribution of amplitude values
from VizStruct for the 88 haplotypes as a probability plot; recall,
a single normally distributed population would be a straight line on
this plot. Figure 3B shows the presence of two different groups as
identified by Clark et al. (1998); the inset in Figure 3B is the
corresponding histogram that also highlights the bimodal nature
of the haplotypes distribution. It is important to note that the
VizStruct analysis readily identifies this salient finding with significantly less computational effort because pairwise mismatch counting is not needed. Figure 3C, a 3D plot of the VizStruct mapping
combined with the KLD, further underscores separation in visually
effective manner. Figure 3D, a polar plot of KLD versus VizStruct
phase angle, highlights the additional separation that can be
1571
K.Bhasi et al.
90
20
A
B
45
15
Amplitude
10
5
0
0
225
315
270
Phase, degrees
Fig. 2. (A) The figure shows the VizStruct mapping of the LPL genotypes of the African American patients from M. S. Jackson (filled circles) and Caucasian
American patients from M. N. Rochester, (open circles). (B) The figure shows the 3D plot with the real and imaginary parts of the VizStruct mapping on the x and
y-axes, respectively and the KLD on the z-axis. The graph legends are the same as in (A).
achieved by including the KLD. The KLD selectively induces
separation of data points that were incompletely separated in the
original VizStruct mapping in Figure 3A. The KLD is thus a useful
metric that complements the VizStruct mapping and substantively
enhances its visualization capabilities.
3.4
Analysis of the Y-chromosome dataset
To assess the scalability of the visualization approach, we also
analyzed the human Y-chromosome data from International
HapMap
project
(http://www.hapmap.org/cgi-perl/gbrowse/
gbrowse/hapmap/) using VizStruct. The sequence variations in the
Y-chromosome have applications in forensic identification, paternity testing and the study of human migrations. The HapMap project
obtained 270 DNA samples from four diverse human populations:
(1) Yoruba in Ibadan, Nigeria, (YRI); (2) Americans of European
descent from Utah, USA; (CEU), (3) Han Chinese from Beijing,
China (CHB) and (4) Japanese from Tokyo, Japan (JPT). The 90 individuals in each of the YRI and CEU groups consisted of 30 parent–
offspring trios, whereas the 45 samples in each of the CHB and JPT
groups were unrelated. Because the Y-chromosome is present only in
males, only 141 samples (53 YRI, 44 CEU, 22 CHB and 22 JPT) were
available for visualization. The YRI and CEU groups had 23 and 14
father–son duos, respectively. As before, each SNP location was
coded for visualization based on whether it homozygous in major
allele, heterozygous or homozygous in minor allele.
Figure 4A and B show the results from VizStruct, and in these
figures, it is important to note that each father–son duo projected to
the same point because the underlying Y-chromosome genetic
sequences were identical. In Figure 4A, the 2D VizStruct mapping
of the Y-chromosome dataset, and Figure 4B, the corresponding
3D VizStruct mapping, the YRI group (closed circles) is readily
discernible and forms a cluster that is well separated from the other
groups. Although, there was overlap of the CEU group with the
CHB and JPT groups, the overlap between CHB and JPT samples
1572
was greater. The substantial overlap between the CHB and JPT
groups is consistent with their geographical proximity and the postulated founding of the Japanese population by migration from
China. The findings using VizStruct are consistent with two
other independent studies of Y-chromosome genetic variation
(Hammer et al., 2001; Rosser et al., 2000): these studies highlighted
the distinct divisions in the human Y-chromosome pool and the
importance of geographic factors in creating these divisions.
3.5
Analysis of the coral dataset
Next, we analyzed a dataset obtained by genotyping individual
corals from 5 coral reefs by Brazeau et al. (2005). These authors
used amplification fragment length polymorphism, a multi-locus
technique employed for genetic analysis of organisms with limited
available sequence information that involves a combination of
restriction digestion followed by the polymerase chain reaction
and sequencing. The names and locations of the reefs are summarized in Figure 5A: DNA samples from individual coral specimens
from geographical locations in the Bahamas (23 2800 N, 75 4200 W),
the Crocker and Conch reefs (two sites separated by 12 km at
24 5500 N, 80 3100 W near the Key Largo, FL, area) and the Flower
Gardens Banks (27 5500 N, 93 3600 W, 110 km south-southeast of
Galveston, TX in the Gulf of Mexico) were analyzed using two
separate sets of primers. Coral larvae can derive from either local
adult populations or emigrate from distant locations. Samples from
coral larvae (referred to as recruits) from the Flower Gardens Banks
reef were also obtained and the object of the study was to determine
the likely source from which the recruits migrated; the authors used
discriminant analysis to assign all but one of the recruits to the
Flower Gardens banks. The data were nominal variables indicating
whether a fragment of a given length was present for each set
of primers for 45 polymorphic markers. For this dataset, the
dimensions were ordered so that the mean across all the samples
approximated a cosine-like function; this was achieved by sorting
Visualizing SNP data
90
45
Amplitude
6
3
0
0
B
99
95
90
80
70
50
8
30
20
Frequency
A
99.9
Cumulative % of Samples
9
10
5
4
2
0
1
225
6
0
2
4
6
Amplitude Values
.1
315
0
2
4
6
8
Amplitude Value
270
Phase, degrees
90
C
Kullback-Leibler Divergence
0.09
D
45
0.06
0.03
0
0
225
315
270
Phase, degrees
Fig. 3. (A) The figure shows the VizStruct mapping of the 88 human (the two clades are shown in open and filled circles) and the chimpanzee haplotypes
(highlighted as filled square) in polar coordinates. (B) The distribution of amplitude values, shown on probability paper, demonstrates deviations from normality;
the inset is the corresponding histogram and demonstrates the high degree of bimodality. (C) This is a 3D scatter plot that plots the real and imaginary parts of the
VizStruct mapping against the KLD on the z-axis. (D) This figure plots the KLD versus VizStruct phase in polar coordinates. The graph legends in (C) and (D) are
the same as in (A).
the results for one primer in increasing order and the other primer in
decreasing order.
The VizStruct results shown in Figure 5B demonstrate that the
majority of individuals from each site segregate into distinct areas in
the 3D mapping; this is consistent with the existence of genetic
differences between the majority of individuals from the Bahamas,
Crocker and Conch and Flower Gardens Banks locations. However,
a minority of individuals at each reef visually maps to regions more
typical of populations from other reefs, and this is consistent with
immigration between sites. In particular, several data points from
the Crocker and Conch reefs map to regions characteristic of the
samples from the Bahamas and the Flower Gardens; this pattern
may be made possible because these reefs, which are strategically
located in Florida Keys, may be able to exchange larvae with both
the Bahamas and the Flower Gardens reefs. Consistent with the
findings of Brazeau et al. (2005), the VizStruct analysis also indicates that the recruits are most similar to the samples from the Flower
Gardens Banks. These findings suggest that the 3D VizStruct
visualization approach could also potentially be useful for data
generated by multi-locus techniques that are used for poorly characterized genomes.
4
DISCUSSION
The objective of this report was to evaluate VizStruct, a multidimensional visualization approach based on radial visualization
1573
K.Bhasi et al.
90
10
A
45
Amplitude
5
0
0
225
315
270
Phase, degrees
Fig. 4. (A and B) Shows the VizStruct mapping of the Y-chromosome dataset in polar coordinates and 3-dimensions, respectively. In both graphs, the open
circles, filled circles, open and filled squares represent the CEU, YRI, CHB and JPT, respectively.
for SNP data visualization. We analyzed several datasets to
demonstrate the usefulness of the VizStruct approach to SNP
data obtained from haplotype-tagging studies of an entire chromosome, Chromosome 21, the Y-chromosome and a densely genotyped candidate gene, LPL. We also highlighted the additional
insights that can be obtained from 3D visualization by employing
the KLD.
Our motivation was to assess the role of visualization in general
and VizStruct in particular to SNP analysis. As noted in Introduction, a variety of SNP tagging and multiple sequence analysis techniques can be applied to SNP data analysis (Abecasis et al., 2001;
Abecasis and Cookson, 2000; Halldorsson et al., 2004; Ke and
Cardon, 2003; Snel et al., 2005; Wallace et al., 2005; Zhang
et al., 2002). VizStruct complements these existing techniques in
a unique way. It maps the entire genotyping sequence vector to a
single point and allows a global assessment of the similarities/
dissimilarities between individuals and identifies clusters. As demonstrated in the results, the visualization can be used to efficiently
generate hypothesis regarding the similarities and dissimilarities,
clusters and outliers in the data. The user can thus use VizStruct to
explore large complex genotyping datasets visually and learn more
about the data. The Fourier harmonic and KLD components underlying VizStruct are relatively non-parametric and assumption free,
but both have features (e.g. the higher harmonics of the DFT and the
reference distribution q in the KLD) that allow extensive user interactions (Zhang et al., 2003, 2004). The heuristics for interpreting the
Fourier harmonics and the KLD have well-developed underlying
theory but are also simple. Thus, the visualization process in
3D VizStruct is an aid to and enriches the data mining experience.
The VizStruct mapping is directly related to the complex-valued
harmonics of the DFT. Although the DFT, more specifically the
periodogram, is used widely in spectral analysis (Diggle, 1990), this
adaptation of the DFT for multi-dimensional visualization that
includes both amplitude and phase is novel and has not been
explored in detail. Frequency domain terminology, amplitude
1574
versus frequency and phase versus frequency plots are also commonplace in control theory to describe filter performance and system responses; again however, the applications in
multi-dimensional visualization of genetic datasets have not been
systematically investigated.
We also extended our VizStruct approach to 3D during this
research. The primary motivation for the 3D enhancement was to
create additional ‘real estate’ to accommodate a large number of
data; this will be useful when a large number of SNPs are to be
viewed simultaneously and will also assist in reducing the crowding
of the visualization field that can result because SNP data take on a
limited number of ordinal values. The important considerations for
the variable used for the third dimension were information content
non-redundant with the first harmonic of the DFT and the analytical
utility to endow 3D VizStruct with both increased capacity and new
capabilities. The rich theoretical foundations and roles of the
KLD in both information theory and hypothesis testing motivated
us to evaluate it for 3D VizStruct. The KLD is the expected
log-likelihood ratio and has several important and useful properties
such as (Cover and Thomas, 1991; Haykin, 1999): (1) Convergence
in the KL sense implies convergence in the L1 norm sense (but no
proof is known for the reverse), because the L1 norm is statistically
more robust than the L2 norm, the standard Euclidean distance, this
property of the KL distance is extremely useful; (2) the x2 statistic is
twice the first term in the Taylor expansion of the KLD and (3)
D(pkq) is convex in the pair (p,q). Across the datasets examined, the
KLD complements the original VizStruct mapping presumably
because it is order insensitive and extracts the order-insensitive
features in the SNP sequences.
Mapping from n-dimensions to 2D or 3D necessarily involves
loss of information. For example, in mapping from n-dimensions to
2D or 3D, points that are distant in n-dimensional space may appear
close in the 2D or 3D mapping and vice versa. Furthermore, in
multi-dimensional visualization, one has to be always vigilant about
the curse of dimensionality, which refers to the exponential growth
Visualizing SNP data
A
B
Fig. 5. (A) is a map of the southeastern United States [made using Weinelt (1999, http://www.aquarius.geomar.de/omc/make_map.html)] showing the locations
of the Bahamas (BAH), Crocker and Conch (CC) and Flower Gardens Banks (FGB) coral reefs from which the samples were derived. The grid on the map
indicates latitude north and longitude west. (B) shows the VizStruct mapping of the results from amplification fragment length polymorphism analysis of the coral
samples from reefs located in the Bahamas (filled circles), Crocker and Conch reefs (both open circles), the Flower Garden Banks (open square) and the recruits
from the Flower Garden Banks (filled triangles).
of hypervolume with increasing dimensionality, and confirm any
findings in the mapped space with other multivariate quantitative
techniques. However, the critical challenges in visualization are
also to preserve complexity sufficient to reflect the character of
the raw data without obscuring the underlying structure that is to
be visualized and to provide intuitive heuristics for interpretation:
our results with a diverse group of independently characterized,
complex, biologically relevant datasets indicate that 3D VizStruct
could be effective at meeting all of these challenges.
In conclusion, 3D VizStruct is effective, versatile and has strong
theoretical underpinnings that aid intuitive data interpretation; it
could therefore have potential applications in the visualization of
SNP genotyping data.
ACKNOWLEDGEMENTS
This work was supported in part by grants from the Kapoor
Foundation, National Science Foundation (Research Grant
0234895) and the National Institutes of Health (P20-GM 067650).
Conflict of Interest: none declared.
REFERENCES
Abecasis,G.R. et al. (2001) GRR: graphical representation of relationship errors.
Bioinformatics, 17, 742–743.
Abecasis,G.R. and Cookson,W.O. (2000) GOLD—graphical overview of linkage
disequilibrium. Bioinformatics, 16, 182–183.
Batzoglou,S. (2005) The many faces of sequence alignment. Brief Bioinform., 6,
6–22.
Bhadra,D. (2001) An interactive visual framework for detecting clusters of a multidimensional dataset. Computer Science and Engineering. State University of
New York, Buffalo, NY.
Brazeau,D.A. et al. (2005) A multi-locus genetic assignment technique to assess
sources of Agaricia agaricites larvae on coral reefs. Marine Biol., 147, 1141–1148.
Clark,A.G. et al. (1998) Haplotype structure and population genetic inferences from
nucleotide-sequence variation in human lipoprotein lipase. Am. J. Hum. Genet., 63,
595–612.
Cover,T.M. and Thomas,J.A. (1991) Elements of information theory. Wiley,
New York.
Cox,T.F. and Cox,M.A.A. (1994) Multidimensional Scaling. Chapman and Hall,
London, England.
Diggle,P.J. (1990) Time series. Clarendon Press, Oxford.
Halldorsson,B.V. et al. (2004) Optimal haplotype block-free selection of tagging SNPs
for genome-wide association studies. Genome Res., 14, 1633–1640.
Hammer,M.F. et al. (2001) Hierarchical patterns of global human Y-chromosome
diversity. Mol. Biol. Evol., 18, 1189–1203.
Haykin,S. (1999) Neural Networks: A Comprehensive Foundation. College Publishing
Co., New York.
Hoffman,P.E., Grinstein,G.G. and Marx,K. (1997) DNA visual and analytic data mining. In Proceedings of IEEE Visualization, Phoenix, AZ, pp. 437–441.
Holter,N.S. et al. (2000) Fundamental patterns underlying gene expression profiles:
simplicity from complexity. Proc. Natl Acad. Sci. USA, 97, 8409–8414.
Horner,D.S. and Pesole,G. (2004) Phylogenetic analyses: a brief introduction to
methods and their application. Expert Rev. Mol. Diagn., 4, 339–350.
Ke,X. and Cardon,L.R. (2003) Efficient selective screening of haplotype tag SNPs.
Bioinformatics, 19, 287–288.
Mir,K.U. and Southern,E.M. (2000) Sequence variation in genes and genomic
DNA: methods for large-scale analysis. Annu. Rev. Genomics Hum. Genet., 1,
329–360.
1575
K.Bhasi et al.
Nickerson,D.A. et al. (1998) DNA sequence diversity in a 9.7-kb region of the human
lipoprotein lipase gene. Nat. Genet., 19, 233–240.
Patil,N. et al. (2001) Blocks of limited haplotype diversity revealed by high-resolution
scanning of human chromosome 21. Science, 294, 1719–1723.
Phillips,A. et al. (2000) Multiple sequence alignment in phylogenetic analysis. Mol.
Phylogenet. Evol., 16, 317–330.
Phillips,A.J. (2005) Homology assessment and molecular sequence alignment.
J. Biomed. Inform., 39, 18–33.
Rosser,Z.H. et al. (2000) Y-chromosomal diversity in Europe is clinal and influenced
primarily by geography, rather than by language. Am. J. Hum. Genet., 67,
1526–1543.
Sammon,J.W. (1969) A nonlinear mapping for data structure analysis. IEEE Trans.
Comput., C-18, 401–409.
Sanderson,M.J. and Driskell,A.C. (2003) The challenge of constructing large phylogenetic trees. Trends Plant Sci., 8, 374–379.
Snel,B. et al. (2005) Genome trees and the nature of genome evolution. Annu. Rev.
Microbiol., 59, 191–209.
Suh,Y. and Vijg,J. (2005) SNP discovery in associating genetic variation with human
disease phenotypes. Mutat. Res., 573, 41–53.
1576
Tarazona-Santos,E. and Tishkoff,S.A. (2005) Divergent patterns of linkage disequilibrium and haplotype structure across global populations at the interleukin-13
(IL13) locus. Genes Immun., 6, 53–65.
Wallace,I.M. et al. (2005) Multiple sequence alignments. Curr. Opin. Struct. Biol., 15,
261–266.
Ward,M.O. (1994) Integrating multiple methods for visualizing multivariate data. In
Proceedings of IEEE Visualization, Washington, DC, pp. 326–336.
Weinelt,M. (1999) Online map creation.
Zhang,K. et al. (2002) A dynamic programming algorithm for haplotype block
partitioning. Proc. Natl Acad. Sci. USA, 99, 7335–7339.
Zhang,L., Zhang,A. and Ramanathan,M. (2002) Visualized classification of multiple
sample types. , Proceedings of the 2nd Workshop on Data Mining in Bioinformatics
(BIOKDD 2002), The ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining Edmonton, Alberta, Canada, pp. 55–62.
Zhang,L., Zhang,A. and Ramanathan,M. (2003) Enhanced visualization of time series
through higher Fourier harmonics. In Proceedings of BIOKDD03: 3rd ACM
SIGKDD Workshop on data mining in bioinformatics, Washington, DC, pp. 49–56.
Zhang,L. et al. (2004) VizStruct: exploratory visualization for gene expression
profiling. Bioinformatics, 20, 85–92.