Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
BIOINFORMATICS ORIGINAL PAPER Vol. 22 no. 13 2006, pages 1569–1576 doi:10.1093/bioinformatics/btl144 Sequence analysis VizStruct for visualization of genome-wide SNP analyses Kavitha Bhasi1, Li Zhang3, Daniel Brazeau1, Aidong Zhang2 and Murali Ramanathan1, 1 Department of Pharmaceutical Sciences and 2Deparment of Computer Science and Engineering, State University of New York, Buffalo, NY 14260, USA and 3Department of Computer Science, Eastern Michigan University, Ypsilanti, MI 48197, USA Received on March 2, 2006; revised on April 9, 2006; accepted on April 10, 2006 Advance Access publication April 13, 2006 Associate Editor: Alex Bateman ABSTRACT Motivation: The size, dimensionality and the limited range of the data values make visualization of single nucleotide polymorphism (SNP) datasets challenging. The purpose of this study is to evaluate the usefulness of 3D VizStruct, a novel multi-dimensional data visualization technique for analyzing patterns in SNP datasets. Results: VizStruct is an interactive visualization technique that reduces multi-dimensional data to two dimensions using the complex-valued harmonics of the discrete Fourier transform (DFT). In the 3D VizStruct extension, the multi-dimensional SNP data vectors are reduced to three dimensions using a combination of the DFT and the Kullback–Leibler divergence. The performance of 3D VizStruct was challenged with several biologically relevant published datasets that included human Chromosome 21, the human lipoprotein lipase (LPL) gene locus and the multi-locus genotypes of coral populations. In every case, the 3D VizStruct mapping provided an intuitive visual description of the key characteristics of the underlying multi-dimensional genotype. Availability: Excel and MATLAB code are available at http://www.cse. buffalo.edu/DBGROUP/bioinformatics/supplementary/SNP/ Contact: [email protected] 1 INTRODUCTION Technologies capable of simultaneously genotyping thousands of single nucleotide polymorphisms (SNPs) are now widely employed in basic biomedical research for investigating the genetic basis of complex diseases (Mir and Southern, 2000; Suh and Vijg, 2005). Visualization can also provide effective tools to summarize and interpret datasets, describe the contents, and expose features in genome-wide SNP datasets. One of the key obstacles of visualizing genome-wide SNP data is the high dimensionality. Additional challenges include the size of the datasets (typically data on >10 000 SNPs can be obtained from a single sample), the limited range of the data values (the data are sequences of ordinal numbers and the number of values taken by each SNP is very limited: each SNP is typically called as heterozygous or one of two homozygous states) and the presence of correlated markers delimiting haplotypes. Although genotyping technologies have advanced considerably and a variety of sequence analysis and alignment algorithms and tools have been developed, analytical visualization of SNP datasets, the primary focus of this research, has not been extensively investigated in the context of SNP data analysis. However, fast efficient, effective and easy-to-use analytical visualization tools are essential To whom correspondence should be addressed. for identifying and interpreting patterns in large SNP datasets, generating hypotheses and directing subsequent research. Many techniques have been developed to visualize multivariate data and these can potentially be applied to SNP visualization as well. The simplest multivariate data visualization tool to display genetic variation is the parallel coordinates plot (Ward, 1994), in which the data along each dimension is plotted on a separate axis; e.g. Holter et al. used this method to plot microarray data from yeast cell cycle experiments (Holter et al., 2000). Heat maps are also commonly employed in multi-dimensional data visualization, e.g. Halldorsson et al. used it to summarize the results from their SNP tagging algorithm (Halldorsson et al., 2004). On closer inspection, however, it becomes evident that the heat plot is a special version of the parallel coordinates plot. The Graphical Display of Linkage Disequilibrium (GOLD) software program is an example of a software tool that generates heat plots of linkage disequilibriumcoefficient matrices and calculates a variety of disequilibrium measures (Abecasis and Cookson, 2000). Multiple sequence alignment (Batzoglou, 2005; Phillips et al., 2000; Phillips, 2005; Snel et al., 2005; Wallace et al., 2005), which is widely used in all areas of molecular biology including SNP analysis, can also be considered a visualization aid because the algorithms cluster sequences and the resultant alignments are organized to highlight regions of similarity. Conceptually, multiple sequence alignment can be viewed as an extension of the parallel coordinates plots because each position in the sequence vector is assigned a separate column. As with all parallel coordinates plots, its visualization effectiveness is dependent on prior clustering of the sequences and it is most effective for small numbers of sequences or when the similarities or differences are limited to a few patches; it can be informative but visually less effective for summarizing complex heterogeneous sequence patterns. Graphical representation of trees (also called dendrogram or cladogram) alignment can complement sequence alignment to provide a quantitative overview of distance relationships during clustering. Trees are widely employed in phylogenetic analysis and genome evolution but their structure is sensitive to the assumptions of the underlying cost function and they can be difficult to interpret (Horner and Pesole, 2004; Phillips et al., 2000; Sanderson and Driskell, 2003; Snel et al., 2005; Wallace et al., 2005). In the multi-dimensional scaling (MDS) approach, the presentation in two dimensions (2D) is optimized to preserve a specific aspect of the relationship, e.g. the Euclidean distance, block distance or rank relationships between the points in the N-dimensional space. In many respects, MDS is the current gold standard for The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] 1569 K.Bhasi et al. multi-dimensional visualization (Cox and Cox, 1994). For example, MDS has been used by Hammer et al. to assess global patterns of Y-chromosome diversity based on 43 polymorphic loci in 50 populations (Hammer et al., 2001) and by Tarazano-Santos et al. to investigate diversity of the interleukin-13 gene (Tarazona-Santos and Tishkoff, 2005). In practice, MDS encompasses a class of methods and Sammon’s non-linear mapping (Sammon, 1969) is a flavor of MDS that normalizes the distances in the stress function to distances in the original n-dimensional space, so that it preserves relationships between small magnitude points better than the Euclidean distance-based stress function, which tends to ‘undervalue’ small magnitude points. 2 METHODS 2.1 The VizStruct mapping At the core of VizStruct is a radial projection that maps the n-dimensional vectors into 2D points while retaining correlation similarity in the original input space (Bhadra, 2001; Hoffman et al., 1997). If the vector x[n] ¼ (x[0], x[1], . . . , x[n 1]), represents a data item in n-dimensional space, Rn, its mapping to a point F1(x[n]) in the complex plane C is given by F1 ðx½nÞ ¼ n1 X x½ je2pij/n : The real and imaginary components of F1(x[n]) are used for creating the 2D mapping. In the Equation (1), above, i ¼ H1 and the complex exponential has the effect of dividing the circle of display into equally spaced sectors. The equation shown represents a substantive reformulation of the usual radial visualization mapping and the use of the complex number notation has significant advantages: it allows easier derivations of the theoretical underpinnings and an intuitive geometric interpretation of the mapping (Zhang et al., 2002, 2003, 2004). In addition, on closer inspection the mapping F1(x[n]) is seen to be equivalent to the first harmonic of the discrete Fourier transform (DFT). The relationship between the DFT and the radial visualization mapping, which was first identified by our group (Zhang et al., 2002, 2003, 2004), allows the computationally efficient fast Fourier transform algorithm (complexity of O(n log n), where n is the number of dimensions) to be used. It allows a wide range of enhancements, including higher harmonic analysis, that have been previously described (Zhang et al., 2002, 2003, 2004). Coding of SNP datasets An ordinal scale was used to code the SNP genotype sequences: the numbers 1, 2 and 3 were used for genotypes that were homozygous in the major allele, heterozygous and homozygous in the minor allele, respectively. A systematic, sequential approach was used for missing data. Individuals in whom >75% of the SNP genotypes were missing were excluded. SNP locations comprised entirely of a combination of missing data and a single genotype were excluded from the analysis because of the absence of information; locations where >75% of the SNP genotypes were missing were also deleted. The remaining missing data points were replaced by the sample mean for that SNP location. The coding and computations were conducted in MATLAB and Microsoft Excel (Microsoft, Bellevue, WA). The data were initially visualized by graphing in MATLAB. For publication, the polar plots were redrawn with Kaleidagraph (Synergy Software, Malvern, PA) and 3D graphs were redrawn with SigmaPlot (SPSS Inc., Chicago, IL). In all analyses, VizStruct was used without supervision, the dimensions were uniformly weighted and the symbols and colors were added subsequent to the computations. 1570 3D analysis For 3D analysis, we included the Kullback–Leibler divergence (KLD) as the third dimension or z-coordinate; the complex number corresponding the first Fourier harmonic was used for the x and y-axes. The KLD between two probability mass functions p(x) and q(x) is denoted by D(p jj q) and is also known as the relative entropy. The KLD is defined as follows (Cover and Thomas, 1991) X Dðp jj qÞ ¼ pðxÞlogðpðxÞ/qðxÞÞ x2X The base of the logarithm was taken to be 2. The KLD is a measure of the distance between two distributions or equivalently, it is the inefficiency of assuming that the distribution is q when the true distribution is p. The KLD always takes non-negative values, D(pkq) 0), and is zero only if p ¼ q. More importantly, the KLD is invariant to permutation, monotonic nonlinear transformation and amplitude scaling in the components of the variable X (Haykin, 1999). The coded SNP data vector was normalized using the sum of its individual components; this normalization transformed each vector to a probability distribution summing to unity that was appropriate as the p needed for computing the KLD. The same reference distribution, q, was used for computing the KLD for all the SNP data vectors. To obtain q, the medians were computed at each position and the resultant vector of medians was normalized using the sum of its individual components. ð1Þ j¼0 2.2 2.3 3 3.1 RESULTS Analysis of the chromosome 21 dataset To evaluate the usefulness of the VizStruct approach, we used the dataset from the SNP analysis of human Chromosome 21 by Patil et al. (2001). The dataset, obtained using high-density oligonucleotide arrays in combination with somatic cell genetics, consists of 24 047 SNPs typed on 20 haploid copies of the chromosome; it has been extensively used to assess haplotype-partitioning algorithms [e.g. (Halldorsson et al., 2004; Zhang et al., 2002)]. The human Chromosome 21 data were downloaded from http:// www.sciencemag.org/feature/data/patil1065573/index.html. In their paper, Patil et al. (2001) identified blocks consisting of regions of the DNA sequence in which certain groups of samples shared sequence similarity; the shared sequence similarity was referred to as patterns. We coded the more common nucleotide with 1 and the less common nucleotide with 2. For each block, we projected the sequence corresponding to each sample using VizStruct with the goal of determining whether the samples identified by Patil et al. (2001) as belonging to given pattern were visually grouped in the 2D mapping. Figures 1A and B summarize the results from VizStruct for two representative blocks, Blocks 4 and 5; the samples belonging to the same pattern are labeled with the same color to facilitate comparisons. The results show that samples corresponding to the same pattern are placed near each other. However, the method also identifies specific samples that are outliers relative to others in the pattern. We analyzed 48 other blocks from the same dataset and achieved excellent visualization results for each (data not shown). 3.2 Analysis of lipoprotein lipase genotypes We also used the human lipoprotein lipase (LPL) gene dataset from http://droog.gs.washington.edu/mdecode/data/lpl/lpl.prettybase.txt. wo_n wherein LPL was genotyped at 88 polymorphic sites in 48 individuals (Clark et al., 1998; Nickerson et al., 1998). The Visualizing SNP data 90 10 A 90 B 45 45 5 0 0 225 315 270 Phase, degrees Amplitude Amplitude 5 0 0 225 315 270 Phase, degrees Fig. 1. (A and B) VizStruct mapping of the Blocks 4 and 5 from Patil et al. (2001). The samples corresponding to the same pattern are highlighted in the same symbol. The samples with the most frequent pattern are filled with open circles; the samples with the second most frequent pattern are in filled circles; the remaining patterns had relatively few samples, sometimes just one sample, and are filled squares, open squares and filled triangles. haplotype phase is available in this dataset; however, because haplotype phase information is generally not available in the majority of experimental situations, we intentionally coded each SNP location as being either homozygous in the major allele, heterozygous or homozygous in the minor allele for visualization. The dataset contains genotypes of 24 Americans of African ancestry from M. S. Jackson, who participated in the Family Blood Pressure Program, a hypertension study, and 24 Americans of European ancestry from M. N. Rochester, who participated in the Rochester Family Heart Study. Figure 2A shows the VizStruct mapping of the M. S. Jackson (filled circles) and the M. N. Rochester (open circles) samples and indicates visual separation of the Jackson and Rochester genotypes. However, there was also partial overlap between the Jackson and Rochester groups in the 2D VizStruct and this allowed us to assess the potential usefulness of 3D visualization. The KLD approach was used for the 3D z-coordinate. The normalized distribution of medians at each site was used to create the reference distribution q, needed for computing the KLD. The 3D scatter plot in Figure 2B includes the KLD with the VizStruct mapping and from comparing Figure 2A with B, it is apparent that the inclusion of the KLD in Figure 2B, results in further differentiation of the Jackson data from the Rochester data. From Figure 2B, it is apparent that the VizStruct mapping of the Rochester data visually shows a better defined cluster and is generally less variable than the Jackson data. This is consistent with the conclusions of Clark et al. (1998) and Nickerson et al. (1998) because it visually reflects the lesser variability and fewer number of population specific substitutions in the Rochester group. Thus, the VizStruct approach may potentially be useful for detecting and estimating the extent of admixture. 3.3 Analysis of the LPL haplotypes The third analysis involved the 88 LPL haplotypes identified by Clark et al. (1998) in Figure 2 of their paper (Clark et al., 1998); the authors compared the human haplotypes with the genotype sequence from chimpanzee. We coded the major nucleotide at each site with 1 and variable nucleotide with 2. Figure 3A shows the polar plot from VizStruct of the distribution of the 88 human LPL haplotypes and the corresponding chimpanzee genotype. The chimpanzee data point is highlighted in the filled square to indicate that the VizStruct method clearly identifies the chimpanzee genotype sequence as being an outlier. It is important to note that for every SNP site in humans examined, the chimpanzee genotype contains one of the human nucleotide variants (Clark et al., 1998) and that the overall differences between the human haplotypes sequences and the chimpanzee genotype are therefore relatively modest. Despite these challenges, the VizStruct method highlights the chimpanzee genotype sequence sufficiently to invite visual scrutiny. A salient finding of Clark et al. (1998) paper was that there were roughly two major clades that could be identified by counting the number of pairwise mismatches between all the 142 chromosome pairs. The VizStruct points corresponding to haplotypes in the two major clades, in the open and filled circles in Figure 3A, generally occupy diametrically opposite halves of the circular polar plot frame. Figure 3B shows the distribution of amplitude values from VizStruct for the 88 haplotypes as a probability plot; recall, a single normally distributed population would be a straight line on this plot. Figure 3B shows the presence of two different groups as identified by Clark et al. (1998); the inset in Figure 3B is the corresponding histogram that also highlights the bimodal nature of the haplotypes distribution. It is important to note that the VizStruct analysis readily identifies this salient finding with significantly less computational effort because pairwise mismatch counting is not needed. Figure 3C, a 3D plot of the VizStruct mapping combined with the KLD, further underscores separation in visually effective manner. Figure 3D, a polar plot of KLD versus VizStruct phase angle, highlights the additional separation that can be 1571 K.Bhasi et al. 90 20 A B 45 15 Amplitude 10 5 0 0 225 315 270 Phase, degrees Fig. 2. (A) The figure shows the VizStruct mapping of the LPL genotypes of the African American patients from M. S. Jackson (filled circles) and Caucasian American patients from M. N. Rochester, (open circles). (B) The figure shows the 3D plot with the real and imaginary parts of the VizStruct mapping on the x and y-axes, respectively and the KLD on the z-axis. The graph legends are the same as in (A). achieved by including the KLD. The KLD selectively induces separation of data points that were incompletely separated in the original VizStruct mapping in Figure 3A. The KLD is thus a useful metric that complements the VizStruct mapping and substantively enhances its visualization capabilities. 3.4 Analysis of the Y-chromosome dataset To assess the scalability of the visualization approach, we also analyzed the human Y-chromosome data from International HapMap project (http://www.hapmap.org/cgi-perl/gbrowse/ gbrowse/hapmap/) using VizStruct. The sequence variations in the Y-chromosome have applications in forensic identification, paternity testing and the study of human migrations. The HapMap project obtained 270 DNA samples from four diverse human populations: (1) Yoruba in Ibadan, Nigeria, (YRI); (2) Americans of European descent from Utah, USA; (CEU), (3) Han Chinese from Beijing, China (CHB) and (4) Japanese from Tokyo, Japan (JPT). The 90 individuals in each of the YRI and CEU groups consisted of 30 parent– offspring trios, whereas the 45 samples in each of the CHB and JPT groups were unrelated. Because the Y-chromosome is present only in males, only 141 samples (53 YRI, 44 CEU, 22 CHB and 22 JPT) were available for visualization. The YRI and CEU groups had 23 and 14 father–son duos, respectively. As before, each SNP location was coded for visualization based on whether it homozygous in major allele, heterozygous or homozygous in minor allele. Figure 4A and B show the results from VizStruct, and in these figures, it is important to note that each father–son duo projected to the same point because the underlying Y-chromosome genetic sequences were identical. In Figure 4A, the 2D VizStruct mapping of the Y-chromosome dataset, and Figure 4B, the corresponding 3D VizStruct mapping, the YRI group (closed circles) is readily discernible and forms a cluster that is well separated from the other groups. Although, there was overlap of the CEU group with the CHB and JPT groups, the overlap between CHB and JPT samples 1572 was greater. The substantial overlap between the CHB and JPT groups is consistent with their geographical proximity and the postulated founding of the Japanese population by migration from China. The findings using VizStruct are consistent with two other independent studies of Y-chromosome genetic variation (Hammer et al., 2001; Rosser et al., 2000): these studies highlighted the distinct divisions in the human Y-chromosome pool and the importance of geographic factors in creating these divisions. 3.5 Analysis of the coral dataset Next, we analyzed a dataset obtained by genotyping individual corals from 5 coral reefs by Brazeau et al. (2005). These authors used amplification fragment length polymorphism, a multi-locus technique employed for genetic analysis of organisms with limited available sequence information that involves a combination of restriction digestion followed by the polymerase chain reaction and sequencing. The names and locations of the reefs are summarized in Figure 5A: DNA samples from individual coral specimens from geographical locations in the Bahamas (23 2800 N, 75 4200 W), the Crocker and Conch reefs (two sites separated by 12 km at 24 5500 N, 80 3100 W near the Key Largo, FL, area) and the Flower Gardens Banks (27 5500 N, 93 3600 W, 110 km south-southeast of Galveston, TX in the Gulf of Mexico) were analyzed using two separate sets of primers. Coral larvae can derive from either local adult populations or emigrate from distant locations. Samples from coral larvae (referred to as recruits) from the Flower Gardens Banks reef were also obtained and the object of the study was to determine the likely source from which the recruits migrated; the authors used discriminant analysis to assign all but one of the recruits to the Flower Gardens banks. The data were nominal variables indicating whether a fragment of a given length was present for each set of primers for 45 polymorphic markers. For this dataset, the dimensions were ordered so that the mean across all the samples approximated a cosine-like function; this was achieved by sorting Visualizing SNP data 90 45 Amplitude 6 3 0 0 B 99 95 90 80 70 50 8 30 20 Frequency A 99.9 Cumulative % of Samples 9 10 5 4 2 0 1 225 6 0 2 4 6 Amplitude Values .1 315 0 2 4 6 8 Amplitude Value 270 Phase, degrees 90 C Kullback-Leibler Divergence 0.09 D 45 0.06 0.03 0 0 225 315 270 Phase, degrees Fig. 3. (A) The figure shows the VizStruct mapping of the 88 human (the two clades are shown in open and filled circles) and the chimpanzee haplotypes (highlighted as filled square) in polar coordinates. (B) The distribution of amplitude values, shown on probability paper, demonstrates deviations from normality; the inset is the corresponding histogram and demonstrates the high degree of bimodality. (C) This is a 3D scatter plot that plots the real and imaginary parts of the VizStruct mapping against the KLD on the z-axis. (D) This figure plots the KLD versus VizStruct phase in polar coordinates. The graph legends in (C) and (D) are the same as in (A). the results for one primer in increasing order and the other primer in decreasing order. The VizStruct results shown in Figure 5B demonstrate that the majority of individuals from each site segregate into distinct areas in the 3D mapping; this is consistent with the existence of genetic differences between the majority of individuals from the Bahamas, Crocker and Conch and Flower Gardens Banks locations. However, a minority of individuals at each reef visually maps to regions more typical of populations from other reefs, and this is consistent with immigration between sites. In particular, several data points from the Crocker and Conch reefs map to regions characteristic of the samples from the Bahamas and the Flower Gardens; this pattern may be made possible because these reefs, which are strategically located in Florida Keys, may be able to exchange larvae with both the Bahamas and the Flower Gardens reefs. Consistent with the findings of Brazeau et al. (2005), the VizStruct analysis also indicates that the recruits are most similar to the samples from the Flower Gardens Banks. These findings suggest that the 3D VizStruct visualization approach could also potentially be useful for data generated by multi-locus techniques that are used for poorly characterized genomes. 4 DISCUSSION The objective of this report was to evaluate VizStruct, a multidimensional visualization approach based on radial visualization 1573 K.Bhasi et al. 90 10 A 45 Amplitude 5 0 0 225 315 270 Phase, degrees Fig. 4. (A and B) Shows the VizStruct mapping of the Y-chromosome dataset in polar coordinates and 3-dimensions, respectively. In both graphs, the open circles, filled circles, open and filled squares represent the CEU, YRI, CHB and JPT, respectively. for SNP data visualization. We analyzed several datasets to demonstrate the usefulness of the VizStruct approach to SNP data obtained from haplotype-tagging studies of an entire chromosome, Chromosome 21, the Y-chromosome and a densely genotyped candidate gene, LPL. We also highlighted the additional insights that can be obtained from 3D visualization by employing the KLD. Our motivation was to assess the role of visualization in general and VizStruct in particular to SNP analysis. As noted in Introduction, a variety of SNP tagging and multiple sequence analysis techniques can be applied to SNP data analysis (Abecasis et al., 2001; Abecasis and Cookson, 2000; Halldorsson et al., 2004; Ke and Cardon, 2003; Snel et al., 2005; Wallace et al., 2005; Zhang et al., 2002). VizStruct complements these existing techniques in a unique way. It maps the entire genotyping sequence vector to a single point and allows a global assessment of the similarities/ dissimilarities between individuals and identifies clusters. As demonstrated in the results, the visualization can be used to efficiently generate hypothesis regarding the similarities and dissimilarities, clusters and outliers in the data. The user can thus use VizStruct to explore large complex genotyping datasets visually and learn more about the data. The Fourier harmonic and KLD components underlying VizStruct are relatively non-parametric and assumption free, but both have features (e.g. the higher harmonics of the DFT and the reference distribution q in the KLD) that allow extensive user interactions (Zhang et al., 2003, 2004). The heuristics for interpreting the Fourier harmonics and the KLD have well-developed underlying theory but are also simple. Thus, the visualization process in 3D VizStruct is an aid to and enriches the data mining experience. The VizStruct mapping is directly related to the complex-valued harmonics of the DFT. Although the DFT, more specifically the periodogram, is used widely in spectral analysis (Diggle, 1990), this adaptation of the DFT for multi-dimensional visualization that includes both amplitude and phase is novel and has not been explored in detail. Frequency domain terminology, amplitude 1574 versus frequency and phase versus frequency plots are also commonplace in control theory to describe filter performance and system responses; again however, the applications in multi-dimensional visualization of genetic datasets have not been systematically investigated. We also extended our VizStruct approach to 3D during this research. The primary motivation for the 3D enhancement was to create additional ‘real estate’ to accommodate a large number of data; this will be useful when a large number of SNPs are to be viewed simultaneously and will also assist in reducing the crowding of the visualization field that can result because SNP data take on a limited number of ordinal values. The important considerations for the variable used for the third dimension were information content non-redundant with the first harmonic of the DFT and the analytical utility to endow 3D VizStruct with both increased capacity and new capabilities. The rich theoretical foundations and roles of the KLD in both information theory and hypothesis testing motivated us to evaluate it for 3D VizStruct. The KLD is the expected log-likelihood ratio and has several important and useful properties such as (Cover and Thomas, 1991; Haykin, 1999): (1) Convergence in the KL sense implies convergence in the L1 norm sense (but no proof is known for the reverse), because the L1 norm is statistically more robust than the L2 norm, the standard Euclidean distance, this property of the KL distance is extremely useful; (2) the x2 statistic is twice the first term in the Taylor expansion of the KLD and (3) D(pkq) is convex in the pair (p,q). Across the datasets examined, the KLD complements the original VizStruct mapping presumably because it is order insensitive and extracts the order-insensitive features in the SNP sequences. Mapping from n-dimensions to 2D or 3D necessarily involves loss of information. For example, in mapping from n-dimensions to 2D or 3D, points that are distant in n-dimensional space may appear close in the 2D or 3D mapping and vice versa. Furthermore, in multi-dimensional visualization, one has to be always vigilant about the curse of dimensionality, which refers to the exponential growth Visualizing SNP data A B Fig. 5. (A) is a map of the southeastern United States [made using Weinelt (1999, http://www.aquarius.geomar.de/omc/make_map.html)] showing the locations of the Bahamas (BAH), Crocker and Conch (CC) and Flower Gardens Banks (FGB) coral reefs from which the samples were derived. The grid on the map indicates latitude north and longitude west. (B) shows the VizStruct mapping of the results from amplification fragment length polymorphism analysis of the coral samples from reefs located in the Bahamas (filled circles), Crocker and Conch reefs (both open circles), the Flower Garden Banks (open square) and the recruits from the Flower Garden Banks (filled triangles). of hypervolume with increasing dimensionality, and confirm any findings in the mapped space with other multivariate quantitative techniques. However, the critical challenges in visualization are also to preserve complexity sufficient to reflect the character of the raw data without obscuring the underlying structure that is to be visualized and to provide intuitive heuristics for interpretation: our results with a diverse group of independently characterized, complex, biologically relevant datasets indicate that 3D VizStruct could be effective at meeting all of these challenges. In conclusion, 3D VizStruct is effective, versatile and has strong theoretical underpinnings that aid intuitive data interpretation; it could therefore have potential applications in the visualization of SNP genotyping data. ACKNOWLEDGEMENTS This work was supported in part by grants from the Kapoor Foundation, National Science Foundation (Research Grant 0234895) and the National Institutes of Health (P20-GM 067650). Conflict of Interest: none declared. REFERENCES Abecasis,G.R. et al. (2001) GRR: graphical representation of relationship errors. Bioinformatics, 17, 742–743. Abecasis,G.R. and Cookson,W.O. (2000) GOLD—graphical overview of linkage disequilibrium. Bioinformatics, 16, 182–183. Batzoglou,S. (2005) The many faces of sequence alignment. Brief Bioinform., 6, 6–22. Bhadra,D. (2001) An interactive visual framework for detecting clusters of a multidimensional dataset. Computer Science and Engineering. State University of New York, Buffalo, NY. Brazeau,D.A. et al. (2005) A multi-locus genetic assignment technique to assess sources of Agaricia agaricites larvae on coral reefs. Marine Biol., 147, 1141–1148. Clark,A.G. et al. (1998) Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase. Am. J. Hum. Genet., 63, 595–612. Cover,T.M. and Thomas,J.A. (1991) Elements of information theory. Wiley, New York. Cox,T.F. and Cox,M.A.A. (1994) Multidimensional Scaling. Chapman and Hall, London, England. Diggle,P.J. (1990) Time series. Clarendon Press, Oxford. Halldorsson,B.V. et al. (2004) Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies. Genome Res., 14, 1633–1640. Hammer,M.F. et al. (2001) Hierarchical patterns of global human Y-chromosome diversity. Mol. Biol. Evol., 18, 1189–1203. Haykin,S. (1999) Neural Networks: A Comprehensive Foundation. College Publishing Co., New York. Hoffman,P.E., Grinstein,G.G. and Marx,K. (1997) DNA visual and analytic data mining. In Proceedings of IEEE Visualization, Phoenix, AZ, pp. 437–441. Holter,N.S. et al. (2000) Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proc. Natl Acad. Sci. USA, 97, 8409–8414. Horner,D.S. and Pesole,G. (2004) Phylogenetic analyses: a brief introduction to methods and their application. Expert Rev. Mol. Diagn., 4, 339–350. Ke,X. and Cardon,L.R. (2003) Efficient selective screening of haplotype tag SNPs. Bioinformatics, 19, 287–288. Mir,K.U. and Southern,E.M. (2000) Sequence variation in genes and genomic DNA: methods for large-scale analysis. Annu. Rev. Genomics Hum. Genet., 1, 329–360. 1575 K.Bhasi et al. Nickerson,D.A. et al. (1998) DNA sequence diversity in a 9.7-kb region of the human lipoprotein lipase gene. Nat. Genet., 19, 233–240. Patil,N. et al. (2001) Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294, 1719–1723. Phillips,A. et al. (2000) Multiple sequence alignment in phylogenetic analysis. Mol. Phylogenet. Evol., 16, 317–330. Phillips,A.J. (2005) Homology assessment and molecular sequence alignment. J. Biomed. Inform., 39, 18–33. Rosser,Z.H. et al. (2000) Y-chromosomal diversity in Europe is clinal and influenced primarily by geography, rather than by language. Am. J. Hum. Genet., 67, 1526–1543. Sammon,J.W. (1969) A nonlinear mapping for data structure analysis. IEEE Trans. Comput., C-18, 401–409. Sanderson,M.J. and Driskell,A.C. (2003) The challenge of constructing large phylogenetic trees. Trends Plant Sci., 8, 374–379. Snel,B. et al. (2005) Genome trees and the nature of genome evolution. Annu. Rev. Microbiol., 59, 191–209. Suh,Y. and Vijg,J. (2005) SNP discovery in associating genetic variation with human disease phenotypes. Mutat. Res., 573, 41–53. 1576 Tarazona-Santos,E. and Tishkoff,S.A. (2005) Divergent patterns of linkage disequilibrium and haplotype structure across global populations at the interleukin-13 (IL13) locus. Genes Immun., 6, 53–65. Wallace,I.M. et al. (2005) Multiple sequence alignments. Curr. Opin. Struct. Biol., 15, 261–266. Ward,M.O. (1994) Integrating multiple methods for visualizing multivariate data. In Proceedings of IEEE Visualization, Washington, DC, pp. 326–336. Weinelt,M. (1999) Online map creation. Zhang,K. et al. (2002) A dynamic programming algorithm for haplotype block partitioning. Proc. Natl Acad. Sci. USA, 99, 7335–7339. Zhang,L., Zhang,A. and Ramanathan,M. (2002) Visualized classification of multiple sample types. , Proceedings of the 2nd Workshop on Data Mining in Bioinformatics (BIOKDD 2002), The ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Edmonton, Alberta, Canada, pp. 55–62. Zhang,L., Zhang,A. and Ramanathan,M. (2003) Enhanced visualization of time series through higher Fourier harmonics. In Proceedings of BIOKDD03: 3rd ACM SIGKDD Workshop on data mining in bioinformatics, Washington, DC, pp. 49–56. Zhang,L. et al. (2004) VizStruct: exploratory visualization for gene expression profiling. Bioinformatics, 20, 85–92.