Disease Gene Identification: A
Practical Guide to Techniques
Candidate gene, interval (linkage), association
(disequilibrium)
Difficult for both groups (biological / computational).
Biologists tend to understand the biological
justifications, computerists are better qualified to
tackle the underlying math and associated
computations. (The statistics befuddle us both.)
Overview
Review
Linkage analysis
– markers
– SNPs and microarrays
– pooling – parallel genotyping
– what is a LOD score (2-point, multipoint)
– TDT
– programs (pros and cons)
– files and formats
Linkage disequilibrium
– experimental results
Demo
Review

meiosis
– produces haploid gametes and is mechanism for
transmission of genetic material, independent
assortment, and recombination between loci

marker
– an informative marker is used to observe the
genetic state at a particular genomic location which
enables the observation of the transmission

links to marker D15S160
Marker D15S643 (153) and
Genotypes – (min, max) 143,159;
CEPH 147,145; 149,149
AATTGCTCTGAGTTCTGAGGC
>chr15:72,091,076-72,091,409
CAGCTGATCTTTAGGAAACATTTAGGGGGAGGAGGCACTCCTTTCAAATA
ACCTTTCTTTAGACAGGTTTCTGATCTGATTCAAGGCCACATCCTGGCCA
TCTGGTTTCTGTAACTCAGAGAATTACTGCTCCTGAT AAATTGCTCTGAG
TTCTGAGGC (22)
TACTGCTGTCATATTGCATTCTCCGACCATTTTCCAGGTCT 41
CTCAAG 6 acacacacacacacacacacacacacacacacacacacacacac
acacac (50) TCCTCAAGC (9) CGTTAGACTCCATTCCCATGTAGTA
(25) TCCAAATAAG
TTTTACAGCAAGACACACTGGAGAGATTGAAGCT
TACTACATGGGAATGGAGTCTAACG
ATGATGTACCCTTACCTCAGATTGC
D15S160
>chr15:72091076-72091409
CAGCTGATCTTTAGGAAACATTTAGGGGGAGGAGGCACTCCTTTCAAATA
GTCGACTAGAAATCCTTTGTAAATCCCCCTCCTCCGTGAGGAAAGTTTAT
ACCTTTCTTTAGACAGGTTTCTGATCTGATTCAAGGCCACATCCTGGCCA
TGGAAAGAAATCTGTCCAAAGACTAGACTAAGTTCCGGTGTAGGACCGGT
TCTGGTTTCTGTAACTCAGAGAATTACTGCTCCTGAT AAATTGCTCTGAG
AGACCAAAGACATTGAGTCTCTTAATGACGAGGACTA TTTAACGAGACTC
TTCTGAGGC TACTGCTGTCATATTGCATTCTCCGACCATTTTCCAGGTCT
AAGACTGGC ATGACGACAGTATAACGTAAGAGGCTGGTAAAAGGTCCAGA
CTCAAGacacacacacacacacacacacacacacacacacacacacacac
GAGTTCtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtg
acacacTCCTCAAGC CGTTAGACTCCATTCCCATGTAGTA TCCAAATAAG
tgtgtgAGGAGTTCG GCAATCTGAGGTAAGGGTACATCAT AGGTTTATTC
TTTTACAGCAAGACACACTGGAGAGATTGAAGCT
AAAATGTCGTTCTGTGTGACCTCTCTAACTTCGA
30 cycles yields
2^30 = 1.07x10^9
molecules
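The doubling arithmetic can be checked directly. This is a minimal sketch that assumes ideal PCR, where every template is copied in every cycle (real reactions fall short of this); the function name is made up for illustration.

```python
def pcr_copies(cycles, start=1):
    """Copies after `cycles` rounds of ideal doubling (assumes 100% efficiency)."""
    return start * 2 ** cycles


copies = pcr_copies(30)
print(copies)            # 1073741824
print(f"{copies:.2e}")   # 1.07e+09
```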
PCR cycle (figure): heat; add primers, nts, enzymes; extension; repeat for 30 cycles, then run on gel
Gel (figure): genotypes 143,159; CEPH 147,145; 149,149; 153 (genomic)
Allele size ladder: 159, 157, 155, 153, 151, 149, 147, 145, 143
DNA Pooling
Pooling is a Potential
Alternative to Genotyping

Pool
– parents and offspring
– affecteds and unaffecteds
– fathers, mothers, offspring

Advantage
– high-throughput
– cost

Disadvantage
– does it work?
Example
fathers
mothers
offspring
“SNPs”



Single-Nucleotide Polymorphisms
1 every 1000 bp (estimated)
2,972,052 SNPs submitted to dbSNP
– 50% of all SNPs are in question
– 10% of UTRs have SNPs


100,000 - 500,000 SNPs needed (for
association)
Why don’t we do this?
– $$$
Strachan, Human Molecular Genetics 2, pg 412. Mutation Detection
Strachan, Human Molecular Genetics 2, pg 412. Minisequencing
Fundamental Genetics

meiosis
– humans (Hs, Homo sapiens) are diploid
– meiosis produces haploid gametes
– mechanism for transmission of genetic
material to offspring
– recombination by cross-over (Holliday
structure) or by independent segregation of
homologous pairs
Fundamental Genetics (Background for
Linkage Analysis)

Rule of Segregation
– offspring receive ONE allele (genetic
material) from the pair of alleles possessed
by EACH parent

Rule of Independent Assortment
– alleles of one gene can segregate
independently of alleles of other genes
– (Linkage Analysis relies on the violation of
Independent Assortment Rule)
Genetic Marker … Prelude to LA
– A genetic marker allows for the observation of
the genetic state at a particular genomic location
(locus).
A genotype is the measured state of a genetic marker.
 May never be feasible to sequence cases directly.

– An “informative” marker is often “heterozygous,”
or “polymorphic” and enables the observation of
the inheritance of genetic material.
Monogenic and Polygenic Diseases
– monogenic (Mendelian) -- one gene
 “simple” (dominant and recessive) Mendelian
inheritance
 direct correspondence between one gene
mutation and one disorder
 majority of disease genes found are monogenic
– polygenic -- (complex, nonmendelian) multiple
genes
 heterogeneity and epistasis
 combinatorics
 no longer have direct correspondence between
one gene and disorder
 majority of disorders are probably polygenic
– complexity of organisms and observed pathways
...Monogenic and Polygenic Diseases
phenocopy
 reduced penetrance

– Example -- sickle cell anemia
“classic” recessive disorder
 defect in red blood cells (hemoglobin)
 but… infant hemoglobin gene can “leak”
 wide range of phenotypes

H-W
f(AA) = p^2
f(Aa) = 2pq
f(aa) = q^2
(p+q)^2 = p^2 + 2pq + q^2
(p+q+r)^2 = p^2 + q^2 + r^2 + 2pq + 2pr + 2qr
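The H-W expansion works for any number of alleles: homozygotes get the squared allele frequency, heterozygotes get twice the product. A minimal sketch follows; the allele frequencies used are illustrative values, not data from the slides.

```python
def hardy_weinberg(allele_freqs):
    """Expected genotype frequencies under Hardy-Weinberg.

    allele_freqs: dict mapping allele name -> frequency (must sum to 1).
    Returns a dict mapping an (allele, allele) pair -> expected frequency.
    """
    if abs(sum(allele_freqs.values()) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    alleles = sorted(allele_freqs)
    genotypes = {}
    for i, a in enumerate(alleles):
        for b in alleles[i:]:
            product = allele_freqs[a] * allele_freqs[b]
            # p^2 for homozygotes, 2pq for heterozygotes
            genotypes[(a, b)] = product if a == b else 2 * product
    return genotypes


# Illustrative frequencies (assumed): p = 0.9, q = 0.1
two_alleles = hardy_weinberg({"A": 0.9, "a": 0.1})
# Three alleles p, q, r: six genotypes, matching the (p+q+r)^2 expansion
three_alleles = hardy_weinberg({"p": 0.5, "q": 0.3, "r": 0.2})
```

In both cases the genotype frequencies sum to 1, which is just the statement (p+q)^2 = 1 or (p+q+r)^2 = 1.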
Dominant and Recessive
Penetrance Modeled
penetrance = P(pt | gt) = P(phenotype | genotype)
                                 DD    Dd    dd
dominant                         1     1     0
recessive                        0     0     1
dominant, reduced penetrance     0.9   0.9   0.0
recessive, reduced penetrance    0     0     0.8
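A penetrance vector combines with Hardy-Weinberg genotype frequencies to give the expected population prevalence, the sum over genotypes of f(gt) x P(pt | gt). A minimal sketch under stated assumptions: the model names are labels chosen here, and the allele frequency 0.1 for D is illustrative.

```python
# Penetrance vectors P(affected | genotype) from the slide; the keys are
# labels chosen for this sketch.
PENETRANCE = {
    "dominant": {"DD": 1.0, "Dd": 1.0, "dd": 0.0},
    "recessive": {"DD": 0.0, "Dd": 0.0, "dd": 1.0},
    "dominant_reduced": {"DD": 0.9, "Dd": 0.9, "dd": 0.0},
    "recessive_reduced": {"DD": 0.0, "Dd": 0.0, "dd": 0.8},
}


def genotype_freqs(p_D):
    """Hardy-Weinberg genotype frequencies given the frequency of allele D."""
    q = 1 - p_D
    return {"DD": p_D ** 2, "Dd": 2 * p_D * q, "dd": q ** 2}


def prevalence(penetrance, p_D):
    """Expected fraction of the population affected under the model."""
    freqs = genotype_freqs(p_D)
    return sum(freqs[g] * penetrance[g] for g in freqs)


# Assumed allele frequency for D, for illustration only
for name, model in PENETRANCE.items():
    print(name, round(prevalence(model, p_D=0.1), 3))
```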
D-R Heterogeneous, DD Epistatic

D-R Heterogeneous
      AA   Aa   aa
BB    1    1    0
Bb    1    1    0
bb    1    1    1

DD Epistatic
      AA   Aa   aa
BB    1    1    0
Bb    1    1    0
bb    0    0    0

reduced penetrance
3, 9, 27, 81, 243, … 3^n
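The two tables can be written as rules over two-locus genotypes, which also shows where the 3^n growth comes from. A minimal sketch, reading the heterogeneity table as "dominant at A OR recessive at B" and the epistasis table as "dominant at A AND dominant at B"; the function names are made up for illustration.

```python
from itertools import product

GENOTYPES_A = ("AA", "Aa", "aa")
GENOTYPES_B = ("BB", "Bb", "bb")


def dr_heterogeneity(ga, gb):
    """Affected if dominant at locus A (AA or Aa) OR recessive at locus B (bb)."""
    return 1 if ga in ("AA", "Aa") or gb == "bb" else 0


def dd_epistasis(ga, gb):
    """Affected only if dominant at locus A AND dominant at locus B."""
    return 1 if ga in ("AA", "Aa") and gb in ("BB", "Bb") else 0


def penetrance_table(model):
    """Evaluate a model over all 3 x 3 two-locus genotype combinations."""
    return {(ga, gb): model(ga, gb) for ga, gb in product(GENOTYPES_A, GENOTYPES_B)}


def n_genotype_combinations(n_loci):
    """Number of multi-locus genotypes for n biallelic loci: 3^n."""
    return 3 ** n_loci


heterogeneity = penetrance_table(dr_heterogeneity)
epistasis = penetrance_table(dd_epistasis)
print([n_genotype_combinations(n) for n in range(1, 6)])  # [3, 9, 27, 81, 243]
```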
Linkage
theta = recombination fraction = R/(NR + R)
Close = 0, Far = 0.5
M2 and B: 3/(4+3) = 0.43
A and B: 2/7 = 0.29
What about M1 and B (0.57)????
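The slide's arithmetic is a simple count ratio. A minimal sketch: the 3/4 and 2/7 counts come from the slide, while 4 recombinants out of 7 for M1 and B is an assumption inferred from the 0.57 shown.

```python
def recombination_fraction(recombinants, nonrecombinants):
    """theta = R / (NR + R), from counts of informative meioses."""
    total = recombinants + nonrecombinants
    if total == 0:
        raise ValueError("no informative meioses")
    return recombinants / total


print(round(recombination_fraction(3, 4), 2))  # M2 and B: 0.43
print(round(recombination_fraction(2, 5), 2))  # A and B: 0.29
print(round(recombination_fraction(4, 3), 2))  # M1 and B (assumed counts): 0.57
```

A raw estimate above the "Far = 0.5" limit, as for M1 and B, is the point of the slide's question: with only seven meioses, unlinked loci can easily produce it by chance.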
Linkage Analysis
Goal: find a marker “linked” to a disease
gene.
 LOD score = log10 of the likelihood ratio
(linkage at θ versus no linkage, θ = 0.5)
 LR[θ; data] == k P[data; θ]
 theta = estimate of genetic distance
(recombination fraction) between marker
and disease
 = proportion of recombinant gametes
among all gametes
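For the simplest case the LOD score has a closed form. This is a minimal sketch assuming phase-known, fully informative meioses, so the likelihood is just theta^R (1 - theta)^NR; real pedigrees need the pedigree-likelihood programs discussed later. The counts are hypothetical.

```python
import math


def lod(theta, recombinants, nonrecombinants):
    """Two-point LOD: log10 of L(theta) / L(theta = 0.5).

    Assumes phase-known, fully informative meioses (a simplification).
    """
    n = recombinants + nonrecombinants
    likelihood = theta ** recombinants * (1 - theta) ** nonrecombinants
    if likelihood == 0:
        return float("-inf")  # e.g. theta = 0 but a recombinant was observed
    return math.log10(likelihood / 0.5 ** n)


def max_lod(recombinants, nonrecombinants, steps=50):
    """Grid search over theta in [0, 0.5]; returns (best LOD, best theta)."""
    thetas = [i * 0.5 / steps for i in range(steps + 1)]
    return max((lod(t, recombinants, nonrecombinants), t) for t in thetas)


# Hypothetical counts: 1 recombinant among 10 meioses
best_lod, best_theta = max_lod(1, 9)
print(round(best_lod, 2), best_theta)
```

The maximum falls at theta = R/(NR + R), the same ratio used as the recombination fraction estimate.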

…Linkage Analysis

Linkage analysis calculates the likelihood that
the inheritance pattern of the phenotype
(disease) is supported by the observed
inheritance patterns (genotypes) in a
pedigree.

parametric – requires a precise genetic model
– linkage analysis
nonparametric – no model
– association (linkage disequilibrium), TDT, IBD,
ASP, etc.

So which one (parametric or
nonparametric)???




Even with its dependence on a genetic model,
parametric analysis is typically as powerful as, or
more powerful than, nonparametric analysis for
identifying candidate loci….
however, parametric (linkage) analysis generally
requires families (people have to be related)…
however, some nonparametric methods (association)
are not limited by this…
however, it has been shown that some nonparametric
methods, such as ASP, would require infeasibly large
samples to detect susceptibility loci (Risch and
Merikangas 1996).
Study Design
genome-wide screen by linkage
 then narrow candidate region by
disequilibrium mapping OR
 candidate gene approaches

Linkage Analysis and Problems
with Nonmendelian Disorders
– few monogenic models, so they are easy to test
– more difficult to find models explaining
inheritance in polygenic disorders
– diagnostic criteria are often more difficult
to establish for nonmendelian disorders
BP
 obesity
 psychiatric disorders (autism, schizophrenia)
 Bardet-Biedl syndrome

How to Address Difficulties
Seek families in which the disease
segregates in a near-mendelian manner.
 Use affected pedigree members only in
a parametric analysis.
 Use nonparametric (model-free) method
of linkage analysis.

Elston-Stewart Algorithm




Human Heredity, 21: 523-542 (1971)
Example? – see handout
Take home: Linkage calculation is difficult,
complicated, and tedious – best left to
computer programs.
Kruglyak L, et al. Parametric and
nonparametric linkage analysis: a unified
multipoint approach. Am J Hum Genet
58:1347-1363 (1996).
Linkage Analysis Programs

FASTLINK - 2 point
– O(n^2), where n = number of markers
– O(n), where n = number of people

GeneHunter - multipoint, 2 point, and
parametric and non-parametric LOD (NPL)
– NPL == alleles shared IBD
– O(n^2), where n = number of people
– O(n), where n = number of markers
– 2n-f < 16 (n = nonfounders, f = founders)
GeneHunter

GeneHunter –
– multipoint
– 2-point
– parametric
– non-parametric LOD (NPL)
NPL == alleles shared IBD
 typically expressed as “p-value”
 “significant” threshold is not as obvious

GeneHunter

genome-wide p value == probability that
the observed value will be exceeded
anywhere in the genome, assuming the
null hypothesis of no linkage
Criteria for linkage in complex diseases
(Lander and Kruglyak 1995)
– Suggestive linkage is a lod score or p value that would be
expected to occur once by chance in a whole genome
scan
– Significant linkage is a lod score or p value that would be
expected to occur by chance 0.05 times in a genome
scan (p = 0.05)
– Highly significant linkage is a lod score or p value that
would be expected to occur by chance 0.001 times in a
whole genome scan
– Confirmed linkage – linkage is regarded as confirmed
when a significant linkage observed in one study is
confirmed by finding a lod score or p value that would be
expected to occur 0.01 times by chance in a specific
search of the candidate region
2-point VS multipoint
(figure labels: M, Disease, theta; M, M2, Disease, M3, M4)
Linkage Disequilibrium
– Association: particular alleles at 2 or more loci
show allelic association if they occur together with
frequencies significantly different from those
predicted from the individual allele frequencies –
aka disequilibrium
– disease-bearing chromosomes must descend from
one or a few individuals
– Generally need case vs controls
– TDT (heterozygous marker transmitted), HRR
(untransmitted alleles as control)
– IBD
– allelic associations (outbred populations)
maintained only over about 1 cM or less
– ASHG 2001: LD maintained up to 20 – 30 kb
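The definition of allelic association above can be stated as a single number, D: the observed haplotype frequency minus the frequency predicted from the individual allele frequencies. A minimal sketch; the frequencies are hypothetical and the function name is made up for illustration.

```python
def disequilibrium(p_ab, p_a, p_b):
    """D = P(AB) - P(A) * P(B).

    p_ab: observed frequency of the haplotype carrying alleles A and B.
    p_a, p_b: individual allele frequencies at the two loci.
    D = 0 means the alleles occur together exactly as often as predicted.
    """
    return p_ab - p_a * p_b


# Hypothetical frequencies, for illustration only
print(disequilibrium(0.25, 0.5, 0.5))  # 0.0  (equilibrium)
print(disequilibrium(0.5, 0.5, 0.5))   # 0.25 (alleles always travel together)
```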
Complex Disorders: Case Study

Schizophrenia
– LOD of 6
– tried many different diagnostic criteria, then
selected the best

Breast cancer
– 1990 locus mapped to 17q21 (BRCA1)
– confirmed by 2 more groups
– narrowed to 8 cM
– 1994, second locus to 13q12 (BRCA2)
– BRCA1 cloned in 1994, BRCA2 in 1995
Transmission Disequilibrium Test
(TDT) – Linkage Disequilibrium

Spielman, et al. Transmission Test for
Linkage Disequilibrium: The Insulin Gene
Region and Insulin-dependent Diabetes
Mellitus (IDDM). Am J Hum Genet 52:506-516, 1993.

TDT = (n12 – n21)^2/(n12 + n21) where n12
and n21 are the numbers of times heterozygous
(1/2) parents transmit allele 1 but not allele 2,
and allele 2 but not allele 1, to affected
offspring across N families
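The TDT statistic is a McNemar-type chi-square with 1 degree of freedom. A minimal sketch; the transmission counts are hypothetical, and the p-value uses the identity that a 1-df chi-square tail equals erfc(sqrt(x/2)).

```python
import math


def tdt(n12, n21):
    """TDT chi-square (1 df) and p-value.

    n12: transmissions of allele 1 (not 2) from heterozygous parents.
    n21: transmissions of allele 2 (not 1) from heterozygous parents.
    """
    if n12 + n21 == 0:
        raise ValueError("no transmissions from heterozygous parents")
    chi2 = (n12 - n21) ** 2 / (n12 + n21)
    # Upper tail of a chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, for illustration only
chi2, p = tdt(40, 20)
print(round(chi2, 2), round(p, 4))
```

Under the null hypothesis each allele is transmitted half the time, so equal counts give a statistic of 0.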
LOD Properties



Lods are additive across pedigrees
“Significant” linkage for LOD >= 3.0
Heterogeneity LOD (Het-LOD)
– LOD maximized over an additional parameter, alpha,
where alpha is the proportion of families linked to
the disease locus
– can only raise the LOD score

Typically perform LODs over 2 models, dom
and recessive – which may affect your cutoff
– if you maximize over all parameters, you run the
risk of erroneously obtaining a “significant” LOD
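Both properties can be shown with per-family LOD scores. This is a minimal sketch assuming the usual admixture form of the heterogeneity LOD, sum over families of log10(alpha * 10^LOD_i + 1 - alpha), with all family LODs evaluated at the same theta; the family scores are hypothetical.

```python
import math


def total_lod(family_lods):
    """LODs are additive across pedigrees."""
    return sum(family_lods)


def het_lod(family_lods, alpha):
    """Heterogeneity LOD for a given alpha (assumed admixture form).

    alpha = proportion of families linked; alpha = 1 recovers the plain sum.
    """
    return sum(math.log10(alpha * 10 ** z + (1 - alpha)) for z in family_lods)


def max_het_lod(family_lods, steps=100):
    """Grid search over alpha in [0, 1]; returns (best Het-LOD, best alpha)."""
    alphas = [i / steps for i in range(steps + 1)]
    return max((het_lod(family_lods, a), a) for a in alphas)


# Hypothetical per-family LODs at one theta; the third family looks unlinked
family_lods = [1.5, 1.2, -2.0, 0.8]
hlod, alpha = max_het_lod(family_lods)
print(round(total_lod(family_lods), 2), round(hlod, 2), alpha)
```

Because alpha = 1 is one of the values searched, the maximized Het-LOD can never fall below the plain LOD, which is why a higher cutoff is needed when maximizing over extra parameters.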
Files and Formats

datain.dat
– genetic model
– marker description – allele frequencies

pedin.ped
– pre-makeped file
– used by GeneHunter
– pedigree information, affection status,
genotypes
Example – pedin.ped
1000 1 0 0 1 1 5 4 6 4
1000 2 0 0 2 1 4 3 6 6
1000 3 1 2 1 2 5 4 6 6
1000 4 1 2 1 2 4 3 6 4
(pedigree figure: individuals 1, 2, 3, 4)
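The four rows follow the LINKAGE pre-makeped column order: pedigree, individual, father, mother, sex (1 = male, 2 = female), affection status (1 = unaffected, 2 = affected), then two alleles per marker. A minimal parsing sketch; the function and dictionary keys are made up for illustration, and the last lines apply GeneHunter's 2n-f limit to this pedigree.

```python
PEDIN = """\
1000 1 0 0 1 1 5 4 6 4
1000 2 0 0 2 1 4 3 6 6
1000 3 1 2 1 2 5 4 6 6
1000 4 1 2 1 2 4 3 6 4
"""


def parse_pre_makeped(text):
    """Parse pre-makeped rows into dictionaries (illustrative helper)."""
    people = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        alleles = [int(x) for x in fields[6:]]
        people.append({
            "pedigree": fields[0],
            "id": fields[1],
            "father": fields[2],   # "0" means founder (parent not in pedigree)
            "mother": fields[3],
            "sex": int(fields[4]),
            "affection": int(fields[5]),
            # pair consecutive alleles: one (allele1, allele2) tuple per marker
            "genotypes": list(zip(alleles[0::2], alleles[1::2])),
        })
    return people


people = parse_pre_makeped(PEDIN)
founders = [p for p in people if p["father"] == "0" and p["mother"] == "0"]
nonfounders = [p for p in people if p not in founders]
bits = 2 * len(nonfounders) - len(founders)  # 2n - f
print(bits, bits < 16)
```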
Example – pedin.dat
1000 1 0 0 3 0 0 1 1 1 5 4 6 4 Ped: 1000 Per: 1
1000 2 0 0 3 0 0 2 0 1 4 3 6 6 Ped: 1000 Per: 2
1000 3 1 2 0 4 4 1 0 2 5 4 6 6 Ped: 1000 Per: 3
1000 4 1 2 0 0 0 1 0 2 4 3 6 4 Ped: 1000 Per: 4
Example -- datain.dat
11 0 0 5 << NO. OF LOCI, RISK LOCUS, SEXLINKED (IF 1) PROGRAM
0 0.0 0.0 0 << MUT LOCUS, MUT MALE, MUT FEM, HAP FREQ (IF1)
1 2 3 4 5 6 7 8 9 10 11
1 2 << AFFECTION, NO. OF ALLELES
0.9 0.1 << GENE FREQUENCIES
1 << NO. OF LIABILITY CLASSES
0 0.99 0.99 << PENETRANCES
3 5 D13S794 << ALLELE NUMBERS, NO. OF ALLELES GATA48C10 Marker name goes here
0.22957894 0.51578945 0.24736843 0.005263158 0.0020 << GENE FREQUENCIES
Where does the data come from?

A lab
A spreadsheet
A napkin
A database

Where do allele frequencies come from?



– Published estimates
– You calculate them

Example: autism, 400 people, 300 markers =
120,000 genotype pairs
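Calculating allele frequencies yourself is a counting exercise. A minimal sketch, assuming 0 codes a missing allele (the LINKAGE convention) and using the first-marker genotypes of the two founders in the pedin.ped example; counting founders only avoids counting the same allele twice through relatives.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Allele frequencies by counting; allele 0 is treated as missing."""
    counts = Counter(a for genotype in genotypes for a in genotype if a != 0)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# First marker of founders 1 and 2 from the pedin.ped example
founder_genotypes = [(5, 4), (4, 3)]
freqs = allele_frequencies(founder_genotypes)
print(freqs)

# Size of the autism example: one genotype (allele pair) per person per marker
n_genotype_pairs = 400 * 300
print(n_genotype_pairs)  # 120000
```

Two founders are far too few for a usable estimate; the point is only the bookkeeping.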
Demo
Code sources and documents (links)
 Linkage Analysis

– FASTLINK
– GeneHunter2
Fundamental Genetics and
Probability Concepts
meiosis and sampling
 patterns of inheritance
 monogenic and complex inheritance

– phenocopy
– reduced penetrance

DNA variation
– polymorphisms, SNPs, and mutations

positional cloning
Examples
BBS4 Pedigree
Dom-Rec Heterozygous
Screen genes A, B?, b
Uninformative Marker
Informative Marker

Given the following observations (family structure,
affection status, genotypes, and disease allele
frequencies) and assuming a model for the disease, can
we calculate the probability that these observations
“fit” the assumed model???
BBS2 genetic mapping
(figure labels: C16; 1 to 12)
BBS2 genetic mapping
(figure labels: affected, unaffected; C16; 1 to 12)
Summary

Disease Gene Identification
– challenges
– interval localization

genotyping and genetic markers, linkage
analysis, allele sharing, association studies
(“SNiPs”), homozygosity mapping
– disease gene identification techniques

Take home
– A complex disorder (with interacting genes)
has yet to be characterized
Allele Sharing

tries to show that affected family
members inherit the same chromosomal
regions more often than expected by
chance
Allele Sharing Example
Needs at least sibs.
Association Studies


“Allelic association studies provide the most
powerful method for locating genes of small
effect contributing to complex diseases and
traits.” Daniels, Am J Hum Genet 62:1189-1197,
1998.
Linkage analysis
– genome wide screen, 400 markers ~ 10 cM (10 Mb),
association needs 4000+ polymorphic markers
– generally need nuclear family or larger

Association finds “linkage disequilibrium”
Association Studies

“Association is simply a statistical
statement about the co-occurrence of
alleles or phenotypes. Allele A is
associated with disease D if people who
have D also have A more (or maybe
less) often than would be predicted from
the individual frequencies of D and A in
the population.” Pg. 286 Human
Molecular Genetics 2, Tom Strachan
Examples

HLA-DR4 (antigen marker)
– 36% in UK
– 78% with rheumatoid arthritis

CF (RFLP markers XV2.c (X1,X2), KM19 (K1,K2))
– Marker Alleles    CF(case)    Normal(control)
– X1, K1            3           49
– X1, K2            147         19
– X2, K1            8           70
– X2, K2            8           25
– CF associated with X1, K2 in ‘89 (Strachan)
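The CF table can be tested for association with an ordinary chi-square test on the 2 x 4 table of haplotype counts. A minimal sketch using the counts from the slide; treating this as a plain contingency-table test is a choice made here, not necessarily the analysis in the original 1989 work.

```python
CF = {"X1,K1": 3, "X1,K2": 147, "X2,K1": 8, "X2,K2": 8}
NORMAL = {"X1,K1": 49, "X1,K2": 19, "X2,K1": 70, "X2,K2": 25}


def chi_square(cases, controls):
    """Pearson chi-square for a 2 x k table of haplotype counts."""
    n_cases = sum(cases.values())
    n_controls = sum(controls.values())
    n = n_cases + n_controls
    chi2 = 0.0
    for haplotype in cases:
        column_total = cases[haplotype] + controls[haplotype]
        for observed, row_total in ((cases[haplotype], n_cases),
                                    (controls[haplotype], n_controls)):
            expected = row_total * column_total / n
            chi2 += (observed - expected) ** 2 / expected
    df = len(cases) - 1  # (2 - 1) * (k - 1)
    return chi2, df


chi2_cf, df_cf = chi_square(CF, NORMAL)
share_x1k2_cf = CF["X1,K2"] / sum(CF.values())
share_x1k2_normal = NORMAL["X1,K2"] / sum(NORMAL.values())
print(round(chi2_cf, 1), df_cf)
print(round(share_x1k2_cf, 2), round(share_x1k2_normal, 2))
```

The statistic is far above 7.81, the 5% critical value for 3 degrees of freedom, driven mostly by the X1, K2 haplotype.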
Hardy-Weinberg Equilibrium


Rule that relates allelic and genotypic frequencies in a
population of diploid, sexually reproducing individuals
if that population has random mating, large size, no
mutation or migration, and no selection
Consequences (when these assumptions hold)
– allelic frequencies will not change in a population
from one generation to the next
– genotypic frequencies are determined in a
predictable way by allelic frequencies
– the equilibrium is neutral -- if perturbed, it will
reestablish within one generation of random mating
at the new allelic frequency
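The "within one generation" claim can be checked numerically. A minimal sketch for one locus with two alleles; the starting genotype frequencies are hypothetical and deliberately far from equilibrium.

```python
def allele_freq(genotype_freqs):
    """Frequency of allele A from (f_AA, f_Aa, f_aa)."""
    f_AA, f_Aa, _ = genotype_freqs
    return f_AA + f_Aa / 2


def random_mating(genotype_freqs):
    """Genotype frequencies after one generation of random mating."""
    p = allele_freq(genotype_freqs)
    q = 1 - p
    return (p * p, 2 * p * q, q * q)


# Hypothetical start: only homozygotes, no heterozygotes at all
start = (0.5, 0.0, 0.5)
gen1 = random_mating(start)
gen2 = random_mating(gen1)
print(gen1)  # (0.25, 0.5, 0.25)
print(gen2)  # (0.25, 0.5, 0.25)
```

The allele frequency stays at 0.5 throughout, and the genotype frequencies stop changing after the first generation.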
Equilibrium
Homozygosity Mapping