Visual Analysis of Tracts of Homozygosity in the Human Genome
Sean Reber and Ye Zhao, Kent State University, USA
Li Zhang, Mohammed Orloff, and Charis Eng, Cleveland Clinic Foundation, USA
Presented by Ye Zhao

Human Genomics
Genomics
  ◦ The study of DNA and its functions, and their impact on human health
  ◦ Has growing relevance to the biomedical sciences
  ◦ Highly reliant on computational methods
Big data is created by high-throughput, high-resolution experimental techniques
Knowledge discovery of genetic factors associated with diseases is very important for early diagnosis and prevention

SNP: Single Nucleotide Polymorphism
SNPs are among the most common genetic variants
Each SNP represents a single-nucleotide difference at a specific locus between two alleles
  ◦ For example, a SNP may replace the nucleotide cytosine (C) with the nucleotide thymine (T) in a certain stretch of DNA
  ◦ A SNP occurs every 300 nucleotides on average, meaning there are about 10 million SNPs in the human genome
A genome-wide assay can genotype over 500,000 human SNPs
A SNP can act as a biomedical marker associated with disease risk
[Image from Wiki]

Tract of Homozygosity
Homozygosity: the same alleles at a particular SNP locus
Homozygous alleles are associated with clinical phenotypes through statistical study
  ◦ Traditional single-SNP association studies are complicated by multiple testing and false associations
  ◦ More importantly, complex disorders may involve groups of SNPs
Tract of homozygosity (TOH) regions have been shown and validated to be associated with cancers

Common TOH (cTOH) Region
A window of at least L consecutive SNPs where the number of SNPs belonging to a TOH is at least n.
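The common TOH region definition above can be sketched in a few lines of Python. This is a minimal sketch of one possible reading, not the authors' implementation: it assumes that n counts the TOHs (for example, one per subject) covering each SNP, and that a region is a maximal run of at least L consecutive SNPs whose coverage stays at or above n. The function name `common_toh_regions` and the representation of a TOH as an inclusive (start, end) pair of SNP indices are hypothetical.

```python
from typing import List, Tuple

def common_toh_regions(
    tohs: List[Tuple[int, int]],  # hypothetical input: inclusive (start, end) SNP indices
    num_snps: int,
    min_len: int,    # plays the role of L
    min_count: int,  # plays the role of n (ASSUMED to count covering TOHs)
) -> List[Tuple[int, int]]:
    """Return maximal runs of >= min_len SNPs each covered by >= min_count TOHs."""
    # Difference array: +1 where a TOH starts, -1 just past where it ends.
    diff = [0] * (num_snps + 1)
    for start, end in tohs:
        diff[start] += 1
        diff[end + 1] -= 1

    regions = []
    run_start = None  # first SNP index of the current qualifying run
    coverage = 0
    for i in range(num_snps):
        coverage += diff[i]  # number of TOHs covering SNP i
        if coverage >= min_count:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_len:
                regions.append((run_start, i - 1))
            run_start = None
    # Close a run that extends to the last SNP.
    if run_start is not None and num_snps - run_start >= min_len:
        regions.append((run_start, num_snps - 1))
    return regions
```

Calling `common_toh_regions(tohs, num_snps, min_len=L, min_count=n)` on per-subject TOH intervals would yield the candidate regions from which one is selected for clustering. The difference array keeps the coverage count linear in the number of SNPs plus the number of TOHs.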
Example: n = 10

Related Work
Existing techniques and tools for TOH computation and disease risk provide naive functionality and limited usability
Most previous work focuses on single-SNP association with simple visualization
  ◦ A few existing TOH detection and analysis tools, such as PLINK and Golden Helix, have limited visualization and exploration capability
[Figures: HomozygosityMapper© with bar chart; Golden Helix© with heatmap]

Our Highlights
Propose a new visual analytics system to explore tracts of homozygosity (TOH) in the human genome
Contribute to the discovery of genetic factors related to human diseases
Visually examine TOH clusters identified from any genome-wide SNP database
Incorporate clustering and statistical measures, and allow interactive navigation

TOH Clustering
Find TOH clusters in a selected cTOH region
Define pairwise TOH similarity by their overlapping (OP) ratio
Adapt spectral clustering to compute clusters
  ◦ Reduce the dimensionality of the similarity matrix before clustering
  ◦ Then perform k-means in fewer dimensions

Hierarchical Clustering
It is hard to define k in spectral clustering because we have no prior knowledge
To overcome this problem, we design a hierarchical clustering process
  ◦ Set k = 2 to perform clustering
  ◦ Repeat this binary clustering over the clusters created in the previous step
  ◦ Stopping criteria: depth threshold, TOH count threshold, statistical attributes of similarities

TOH Cluster Tree
A cluster tree, the TOHC tree, is created for a cTOH region
Navigation tools are built on top of it

Statistical Association Study
A P-value is obtained for each cluster to represent the significance of its association with a human disease or phenotype
The smaller a P-value is, the more significant the association
P-values are loaded into the visualization system for further exploration
P-values are mapped to colors in the visualization: high P-values at the blue end of the spectrum and low P-values at the red end.
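The P-value color mapping can be illustrated with a small Python sketch. The slides only state that high P-values sit at the blue end and low P-values at the red end; the logarithmic scale, the floor value `p_floor`, the straight blue-to-red RGB interpolation, and the function name `p_value_to_rgb` are all assumptions made for illustration.

```python
import math
from typing import Tuple

def p_value_to_rgb(p: float, p_floor: float = 1e-8) -> Tuple[float, float, float]:
    """Map a P-value to an (r, g, b) triple with components in [0, 1].

    ASSUMPTIONS (not from the slides): a log10 scale, clamping at p_floor,
    and linear interpolation between pure blue and pure red.
    """
    # Clamp so that extremely small P-values saturate at pure red.
    p = min(max(p, p_floor), 1.0)
    # t = 0 for p = 1 (least significant), t = 1 for p <= p_floor (most significant).
    t = math.log10(1.0 / p) / math.log10(1.0 / p_floor)
    return (t, 0.0, 1.0 - t)
```

A log scale is used here because P-values of interest often span many orders of magnitude, so a linear scale would render nearly all significant clusters in the same shade. A perceptually tuned colormap could be substituted without changing the idea.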
Visualization System
TOH Cluster Explorer
Chromosome & cTOH Selector
Navigation Rings of TOHC Trees
Zooming
TOH Cluster View

NCBI Genome Browser
Provides a broad spectrum of information linked with genome sequences
  ◦ Gene symbols, gene names, proteins, regulation, etc.

Parameters & Reclustering
The TOHC tree is computed based on user-defined parameters
Users can adjust the parameters to see whether the clustering results are affected

Case Study: Lung Cancer
Data: a lung cancer data set from the PLCO Cancer Screening Trial by the National Cancer Institute
DNA sequencing samples from a group of subjects, comprising diseased cases and controls
1,618 subjects: 788 lung cancer cases and 830 controls, all European Americans
514,355 SNPs were available for subsequent TOH analyses by DNA microarray

Statistical Analysis
Besides the association study applied to detected TOH cluster regions, we also performed single-SNP association analysis with PLINK for comparison

P-values Comparison
Small P-values (more significant) are shown higher than large P-values (less significant)
Dots: single-SNP association study results
Blue line: the whole cTOH region
Red line: one TOH cluster target region

Case Study Conclusion
Introducing TOH clusters can provide genetic background or ancestry-related information, and hence a biologically meaningful association with the disease
The visual analytics tools helped domain experts in TOH study
  ◦ More biomedical background and statistical analysis results are published in PLOS One

Summary
A new visual analytics system to compute and identify various patterned clusters of SNPs demonstrating regional homozygosity
Future work
  ◦ More statistical testing methods
  ◦ Visualization methods to manifest cluster relationships

Thanks!