Computational Challenges in
Whole-Genome Association Studies
Ion Mandoiu
Computer Science and Engineering Department
University of Connecticut
Approaches to Disease Gene Mapping

Linkage analysis
LOD := log10(L(θ)/L(1/2))
Very successful for Mendelian diseases (cystic fibrosis, Huntington’s, …)
Low power to detect genes with small relative risk in complex diseases [RischMerikangas’96]

Association analysis
Cases vs. controls, χ²-test
Genome-wide scans made possible by recent progress in SNP genotyping technologies
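
The two statistics named on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the LOD function assumes phase-known meioses, so that L(θ) is proportional to θ^r (1-θ)^(n-r) for r recombinants out of n, and the association test assumes a plain 2x2 table of allele counts in cases and controls with no continuity correction. The function names and the example counts are made up for illustration.

```python
from math import log10

def lod_score(recombinants, meioses, theta):
    """LOD = log10(L(theta) / L(1/2)) for phase-known meioses.

    Assumes L(theta) is proportional to theta^r * (1 - theta)^(n - r);
    the binomial coefficient cancels in the ratio. Requires 0 < theta <= 0.5.
    """
    r, n = recombinants, meioses
    log_l_theta = r * log10(theta) + (n - r) * log10(1 - theta)
    log_l_null = n * log10(0.5)
    return log_l_theta - log_l_null

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom) for the table
    [[a, b], [c, d]], e.g. rows = cases/controls, columns = allele A/allele B."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical data: 1 recombinant in 10 meioses, evaluated at theta = 0.1
print(round(lod_score(1, 10, 0.1), 3))
# Hypothetical allele counts: cases 60/40, controls 40/60
print(chi_square_2x2(60, 40, 40, 60))
```

A LOD score is usually maximized over θ in (0, 0.5]; the sketch evaluates a single θ to keep the arithmetic visible.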
Computational Challenges







Detecting genotyping errors
Imputation of missing genotypes
Imputation of untyped genotypes based on reference population (e.g., Hapmap)
Haplotype inference and haplotype-based association tests
Modeling gene-gene interactions
Handling structural variation data provided by new sequencing technologies
Optimal multi-stage study design
Genotype Error Detection


A real problem despite advances in technology
In [KMP07] we proposed efficient methods for error detection in trio data, based on a log-likelihood ratio (LLR) approach combined with an HMM of haplotype diversity
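
The general shape of an LLR test can be sketched on a toy example. This is not the method of [KMP07]: the sketch replaces the HMM with a small, made-up haplotype frequency table, looks at one unrelated individual instead of a trio, and uses an arbitrary threshold. It only shows the idea of flagging a genotype when changing it would make the data much more likely under the haplotype model. All names (`HAP_FREQ`, `error_llr`, `flag_errors`) are hypothetical.

```python
from itertools import product
from math import log10

# Hypothetical haplotype frequencies over 3 SNPs (alleles coded 0/1).
# This table stands in for an HMM of haplotype diversity.
HAP_FREQ = {
    (0, 0, 0): 0.55,
    (1, 1, 1): 0.40,
    (0, 1, 0): 0.03,
    (1, 0, 1): 0.02,
}

def genotype_likelihood(genotype, hap_freq=HAP_FREQ):
    """P(unphased genotype): sum over ordered haplotype pairs whose
    per-SNP allele sums equal the genotype (coded 0/1/2)."""
    total = 0.0
    for (h1, f1), (h2, f2) in product(hap_freq.items(), repeat=2):
        if all(a + b == g for a, b, g in zip(h1, h2, genotype)):
            total += f1 * f2
    return total

def error_llr(genotype, snp, hap_freq=HAP_FREQ, floor=1e-12):
    """log10 of (best likelihood with the genotype at `snp` changed)
    over (likelihood of the genotype as observed)."""
    observed = max(genotype_likelihood(genotype, hap_freq), floor)
    best_alternative = floor
    for g in (0, 1, 2):
        if g == genotype[snp]:
            continue
        candidate = genotype[:snp] + (g,) + genotype[snp + 1:]
        best_alternative = max(best_alternative,
                               genotype_likelihood(candidate, hap_freq))
    return log10(best_alternative / observed)

def flag_errors(genotype, threshold=1.0):
    """Indices of SNPs whose LLR exceeds the (arbitrary) threshold."""
    return [i for i in range(len(genotype))
            if error_llr(genotype, i) > threshold]

print(flag_errors((0, 2, 0)))  # middle genotype is unlikely given its neighbours
print(flag_errors((1, 1, 1)))  # consistent with the two common haplotypes
```

Raising the threshold trades sensitivity for a lower false positive rate, which is the trade-off a sensitivity vs. FP rate curve summarizes.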
[Figure: two histograms, Parents-COMBINED and Children-COMBINED, each with NO_ERR and ERR series; counts on a log scale from 1 to 1000000, horizontal axis from 0 to about 5.9]
[Figure: Sensitivity vs. FP rate (0 to 0.015) for TotalProb-TRIO, TotalProbCOMBINED, and FAMHAP, with the #FP=#FN line]
In ongoing work we seek to improve error detection accuracy
by using low-level data such as typing confidence scores
Genotype Imputation


Current genotyping platforms cover <1 mil. of ~10 mil. SNPs → causal variant unlikely to be assayed directly
Untyped SNPs can be imputed based on linkage disequilibrium
info inferred from high-density datasets such as Hapmap
Maximum likelihood approach:

probabilities computed using HMM
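
A minimal sketch of maximum likelihood imputation, under loud assumptions: the HMM is replaced by haplotype frequencies counted directly from a tiny, made-up reference panel, and exactly one SNP is untyped (marked `None`). The imputed genotype is the value that maximizes the joint probability of the full multilocus genotype. `PANEL`, `joint_probability`, and `impute` are hypothetical names, not part of any published tool.

```python
from collections import Counter
from itertools import product

# Hypothetical reference panel of phased haplotypes over 3 SNPs (alleles 0/1).
PANEL = [(0, 0, 0)] * 6 + [(1, 1, 1)] * 3 + [(0, 1, 0)] * 1

def haplotype_frequencies(panel):
    """Relative frequency of each distinct haplotype in the panel."""
    counts = Counter(panel)
    return {hap: count / len(panel) for hap, count in counts.items()}

def joint_probability(genotype, hap_freq):
    """P(unphased genotype) as a sum over ordered pairs of panel haplotypes
    whose per-SNP allele sums equal the genotype (coded 0/1/2)."""
    return sum(
        f1 * f2
        for (h1, f1), (h2, f2) in product(hap_freq.items(), repeat=2)
        if all(a + b == g for a, b, g in zip(h1, h2, genotype))
    )

def impute(genotype, hap_freq):
    """Fill the single None entry with the genotype value that maximizes
    the joint probability; also return the score of each candidate."""
    missing = genotype.index(None)
    scores = {}
    for g in (0, 1, 2):
        candidate = genotype[:missing] + (g,) + genotype[missing + 1:]
        scores[g] = joint_probability(candidate, hap_freq)
    best = max(scores, key=scores.get)
    return best, scores

freq = haplotype_frequencies(PANEL)
best, scores = impute((1, None, 1), freq)
print(best, scores)
```

Enumerating haplotype pairs is only workable for a handful of SNPs; an HMM lets the same probabilities be computed over long marker sets without listing haplotypes explicitly.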
[Figure panels: allele frequency, typed genotypes; allele frequency, imputed genotypes]
Acknowledgements & Advertisement


Justin Kennedy, Bogdan Pasaniuc
NSF funding (Awards 0546457 and 0543365)
DIMACS Workshop on Computational Issues in Genetic Epidemiology
August 21 - 22, 2008
DIMACS Center, CoRE Building, Rutgers University
Presented under the auspices of the DIMACS/BioMaPS/MB Center Special Focus
on Information Processing in Biology.
Organizers:
Andrew Scott Allen, Duke University
Ion Mandoiu, University of Connecticut
Dan Nicolae, University of Chicago
Yi Pan, Georgia State University
Alex Zelikovsky, Georgia State University