Visual Analysis of Tracts of Homozygosity in the Human Genome
Sean Reber and Ye Zhao
Kent State University, USA
Li Zhang, Mohammed Orloff, and Charis Eng
Cleveland Clinic Foundation, USA
Presented by Ye Zhao
Human Genomics

Genomics
◦ The study of DNA, its functions, and its
impact on human health
◦ Has growing relevance to the biomedical sciences
◦ Highly reliant on computational methods


Big data is created by high-throughput and
high-resolution experimental techniques
Discovering the genetic factors associated
with diseases is very important for early
diagnosis and prevention
SNP: Single Nucleotide Polymorphism


SNPs are among the most common genetic
variants
Each SNP represents a single-nucleotide
difference between two alleles at a
specific locus
◦ For example, a SNP may replace the
nucleotide cytosine (C) with the nucleotide
thymine (T) in a certain stretch of DNA
◦ A SNP occurs every 300 nucleotides on
average; with roughly 3 billion nucleotides
in the human genome, that is about 10 million SNPs
Genome-wide assays can genotype over
500,000 human SNPs
SNPs can act as biomedical markers
associated with disease risk

Tract of Homozygosity
Homozygosity: both alleles at a particular
SNP locus are the same
Homozygous alleles are associated with clinical
phenotypes through statistical studies

◦ Traditional single-SNP association studies are confounded
by multiple testing and false associations
◦ More importantly, complex disorders may involve
groups of SNPs

Tract of homozygosity (TOH) regions have been shown
and validated to be associated with cancers
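To make the idea of a tract concrete, here is a minimal sketch that scans one subject's genotype calls and reports runs of consecutive homozygous SNPs. The function name `find_tohs`, the two-letter call format, and the minimum run length `min_snps` are assumptions made for illustration; the slides do not state how tracts are detected or how long a run must be.

```python
def find_tohs(genotypes, min_snps=3):
    """Return (start, end) index pairs, end exclusive, for runs of at
    least min_snps consecutive homozygous calls.

    genotypes: list of two-letter calls such as "AA" or "AG" (assumed format).
    Calls that are not two matching A/C/G/T letters (e.g. "--") break a run.
    """
    tracts = []
    start = None
    for i, call in enumerate(genotypes):
        homozygous = len(call) == 2 and call[0] == call[1] and call[0] in "ACGT"
        if homozygous and start is None:
            start = i                      # a new run begins here
        elif not homozygous and start is not None:
            if i - start >= min_snps:      # keep only long enough runs
                tracts.append((start, i))
            start = None
    # Close a run that extends to the last SNP.
    if start is not None and len(genotypes) - start >= min_snps:
        tracts.append((start, len(genotypes)))
    return tracts


# Hypothetical calls for one subject, ordered by position on a chromosome.
calls = ["AA", "CC", "GG", "AG", "TT", "CC", "CT", "GG", "GG", "AA", "TT"]
print(find_tohs(calls))
```

The run of two homozygous calls in the middle is dropped because it is shorter than the assumed threshold.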
Common TOH (cTOH) Region

A window of at least L consecutive SNPs
where the number of SNPs belonging to a
TOH is at least n.
n=10
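The definition above can be read in more than one way. The sketch below assumes one reading: for every SNP, count how many TOHs (across all subjects) cover it, then report maximal windows of at least L consecutive SNPs whose count stays at or above n. The name `find_ctoh_regions`, its parameters, and the interval representation are hypothetical, and the counting rule itself is an assumption rather than something the slides confirm.

```python
def find_ctoh_regions(tohs, num_snps, min_len, min_count):
    """Find windows of >= min_len consecutive SNPs, each covered by
    >= min_count TOHs (assumed reading of the cTOH definition).

    tohs: list of (start, end) SNP-index intervals, end exclusive.
    """
    # Count how many TOHs cover each SNP index.
    coverage = [0] * num_snps
    for start, end in tohs:
        for i in range(start, end):
            coverage[i] += 1

    regions = []
    run_start = None
    # Iterate one past the end so a run reaching the last SNP is closed.
    for i in range(num_snps + 1):
        covered = i < num_snps and coverage[i] >= min_count
        if covered and run_start is None:
            run_start = i
        elif not covered and run_start is not None:
            if i - run_start >= min_len:
                regions.append((run_start, i))
            run_start = None
    return regions


# Four hypothetical TOHs over 12 SNPs; L = 3 and n = 3 for a small example.
example_tohs = [(0, 6), (2, 8), (3, 9), (10, 12)]
print(find_ctoh_regions(example_tohs, num_snps=12, min_len=3, min_count=3))
```

Only SNPs 3 to 5 are covered by three tracts at once, so that window is the single region reported.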
Related Work

Existing techniques and tools for
TOH computation and disease-risk analysis
provide only basic functionality and
limited usability
Most previous work focuses on single-SNP
association with simple
visualization
◦ A few existing TOH detection and
analysis tools, such as PLINK and Golden Helix, have
limited visualization and exploration
capability

Figures: HomozygosityMapper© with bar chart; Golden Helix© with heatmap
Our Highlights
Propose a new visual analytics system to
explore tracts of homozygosity (TOH) in
the human genome
Contribute to the discovery of genetic
factors related to human diseases
Visually examine TOH clusters identified
from any genome-wide SNP database
Incorporate clustering and statistical
measures, and allow interactive navigation

TOH Clustering


Find TOH clusters in a selected cTOH region
Define pairwise TOH similarity by the overlap
(OP) ratio

Adapt spectral clustering to compute clusters
◦ Reduce the dimensionality of the similarity matrix before
clustering
◦ Then perform k-means in fewer dimensions
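As an illustration of the similarity step, the sketch below computes a pairwise matrix from TOH intervals. The slides name the overlap (OP) ratio but do not give its formula, so the definition used here (shared length divided by the length of the union of the two tracts) is an assumption, as are the function names.

```python
def overlap_ratio(a, b):
    """Assumed OP ratio: shared length / union length of two intervals.

    a, b: (start, end) positions with end exclusive.
    """
    shared = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - shared
    return shared / union if union else 0.0


def similarity_matrix(tohs):
    """Symmetric matrix of pairwise overlap ratios (1.0 on the diagonal)."""
    return [[overlap_ratio(a, b) for b in tohs] for a in tohs]


# Three hypothetical tracts: two overlapping, one far away.
tracts = [(0, 10), (5, 15), (20, 30)]
for row in similarity_matrix(tracts):
    print([round(x, 3) for x in row])
```

A matrix of this shape is the kind of input a spectral method expects: it is symmetric, non-negative, and equal to 1 for identical tracts.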
Hierarchical Clustering
It is hard to choose k for spectral clustering because
we do not have any prior knowledge
To overcome this problem, we design a
hierarchical clustering process

◦ Set k = 2 to perform clustering
◦ Repeat this binary clustering over the clusters
created in the previous step
◦ Stopping criteria:
depth threshold, TOH count threshold, statistical
attributes of the similarities
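The recursive procedure can be sketched as follows, under several stated assumptions. For k = 2 the sketch splits on the sign of the second eigenvector of the normalized similarity matrix, a common shortcut that stands in for the k-means step. It finds that eigenvector by plain power iteration to stay dependency-free. Only the depth and TOH-count stopping rules are implemented; the statistical criterion is omitted. All names (`spectral_bipartition`, `build_tree`, `max_depth`, `min_size`) are hypothetical.

```python
import math


def spectral_bipartition(W, iters=300):
    """Split indices 0..n-1 into two groups from a symmetric similarity matrix.

    Uses the second eigenvector of D^-1/2 W D^-1/2 and splits on its sign.
    Assumes every row of W has a positive sum (e.g. self-similarity of 1).
    """
    n = len(W)
    root_deg = [math.sqrt(sum(row)) for row in W]
    norm = math.sqrt(sum(x * x for x in root_deg))
    lead = [x / norm for x in root_deg]  # known leading eigenvector
    M = [[W[i][j] / (root_deg[i] * root_deg[j]) for j in range(n)]
         for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(iters):
        # Power step on (I + M) / 2, whose eigenvalues lie in [0, 1].
        v = [(v[i] + sum(M[i][j] * v[j] for j in range(n))) / 2
             for i in range(n)]
        dot = sum(a * b for a, b in zip(v, lead))
        v = [a - dot * b for a, b in zip(v, lead)]  # remove leading component
        length = math.sqrt(sum(a * a for a in v))
        if length < 1e-12:
            break
        v = [a / length for a in v]
    left = [i for i in range(n) if v[i] < 0]
    right = [i for i in range(n) if v[i] >= 0]
    return left, right


def build_tree(W, items=None, depth=0, max_depth=3, min_size=2):
    """Recursively bisect items; stop at max_depth or at <= min_size TOHs."""
    if items is None:
        items = list(range(len(W)))
    node = {"items": items, "children": []}
    if depth >= max_depth or len(items) <= min_size:
        return node
    sub = [[W[a][b] for b in items] for a in items]
    left, right = spectral_bipartition(sub)
    if not left or not right:
        return node  # no meaningful split was found
    for part in (left, right):
        child_items = [items[i] for i in part]
        node["children"].append(
            build_tree(W, child_items, depth + 1, max_depth, min_size))
    return node


# Hypothetical similarities: TOHs 0 and 1 overlap strongly, as do 2 and 3.
W = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
tree = build_tree(W)
print([child["items"] for child in tree["children"]])
```

Fixing k = 2 at every level avoids guessing the number of clusters up front; the thresholds decide how deep the tree grows instead.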
TOH Cluster Tree
A cluster tree, the TOHC tree, is created for
a cTOH region
Navigation tools are built on top of it

Statistical Association Study




A P-value is obtained for each cluster to
represent the significance of its association
with a human disease or phenotype
The smaller the P-value, the more significant
the association
P-values are loaded into the visualization system for
further exploration
P-values are mapped to colors in the visualization: high
P-values at the blue end of the spectrum and
low P-values at the red end
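One possible way to realize such a color mapping is sketched below. The slides only state which end of the spectrum each extreme maps to, so the logarithmic scale, the floor `p_min`, and the sweep through HSV hue are assumptions, and `p_value_to_rgb` is a hypothetical name.

```python
import colorsys
import math


def p_value_to_rgb(p, p_min=1e-6):
    """Map a P-value to an (R, G, B) color: blue for p = 1, red for p <= p_min.

    The log scale and the p_min floor are illustrative assumptions.
    """
    p = min(max(p, p_min), 1.0)             # clamp into [p_min, 1]
    t = math.log10(p) / math.log10(p_min)   # 0 at p = 1, 1 at p = p_min
    hue = (1.0 - t) * (2.0 / 3.0)           # hue 2/3 is blue, hue 0 is red
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return round(r * 255), round(g * 255), round(b * 255)


for p in (1.0, 1e-3, 1e-6):
    print(p, p_value_to_rgb(p))
```

A log scale spreads out the small P-values, which are the ones of interest, instead of compressing them all into one shade of red.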
Visualization System
TOH Cluster Explorer
Chromosome & cTOH Selector
Navigation Rings of TOHC Trees
Zooming
TOH Cluster View
NCBI Genome Browser

Provide a broad spectrum of information
linked with genome sequences
◦ gene symbols, gene names, protein, regulation, etc.
Parameters & Reclustering
The TOHC tree is computed based on user-defined parameters
Users can adjust them to see whether the clustering
results are affected

Case Study: Lung Cancer




Data: a lung cancer data set from the
PLCO Cancer Screening Trial of the National
Cancer Institute
DNA sequencing samples from a group of
subjects comprising diseased cases and controls
1,618 subjects: 788 lung cancer cases and 830
controls, all European Americans
514,355 SNPs genotyped by DNA microarray were
available for subsequent TOH analyses
Statistical Analysis

Besides the association study applied to
the detected TOH cluster regions, we also
performed single-SNP association analysis
with PLINK for comparison
P-values Comparison
Small P-values (more significant) are plotted
higher than large P-values (less significant)
Dots: single-SNP association study results
Blue line: the whole cTOH region
Red line: one TOH cluster target region
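Plotting small P-values higher is what a negative base-10 logarithm does, and that is the usual convention for association plots. The sketch below assumes that transform; the slides do not name the scale, and the P-values used are made up for illustration, not taken from the study.

```python
import math


def plot_height(p):
    """Vertical position of a P-value under the assumed -log10 scale."""
    return -math.log10(p)


# Illustrative P-values only (not results from the case study).
example_p_values = [0.5, 0.01, 1e-4]
heights = [plot_height(p) for p in example_p_values]
print([round(h, 2) for h in heights])
```

Under this scale each factor of ten in significance adds one unit of height, so a region-level line can be compared directly against the single-SNP dots.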

Case Study Conclusion
Introducing TOH clusters provides genetic
background or ancestry-related information,
and hence a biologically meaningful
association with the disease
The visual analytics tools helped domain
experts in their TOH studies

◦ More biomedical background and statistical
analysis results published in PLOS One
Summary
A new visual analytics system to compute
and identify various patterned clusters of
SNPs demonstrating regional homozygosity
Future work

◦ More statistical testing methods
◦ Visualization methods to reveal cluster
relationships
Thanks!