Statistical Considerations for Population-Based Studies in Cancer I
Special Topic: Statistical analyses of twin and family data

Kim-Anh Do, Ph.D.
Associate Professor
Department of Biostatistics
Email: [email protected]
http://odin.mdacc.tmc.edu/~kim

• The usual idea of a gene is a specific region of DNA that codes for a single protein or enzyme, and the position of a gene on a chromosome is its locus.
• The basis for research by human geneticists is to try to identify traits, or phenotypes, whose inheritance patterns are consistent with the action of individual genes.
• Recent advances in genetics show that the relationship between DNA sequence and phenotype is both more complex and more interesting than we thought.
• Some functions of DNA do not even depend on its nucleotide sequence, and DNA sequence variation includes a variety of direct and indirect forms of feedback among various regions of the DNA within and between cells.

Allele and genotype frequencies
• The most fundamental quantitative variable in population genetics is the allele frequency, a prevalence measure.
• When a locus has only two alleles, denote their frequencies $p$ and $q = 1 - p$.
• Let $P_g$ denote the frequency of genotype $g$.
• The frequency of an $ii$ homozygote is $P_{ii} = p_i p_i = p_i^2$.
• The frequency of an $ik$ heterozygote is $P_{ik} = 2 p_i p_k$.
• For a diallelic system the genotypes have frequencies $P_{AA} = p^2$, $P_{Aa} = 2pq$, $P_{aa} = q^2$.

Frequency relationships between genotype and phenotype
The concept of penetrance
• A given genotype does not always produce the same phenotype. The association between the two is known as the penetrance.
• Individuals with a given genotype will have some distribution of phenotypes; the penetrance function specifies the probability that an individual with genotype $g$ has phenotype $\phi$: $\Omega_g(\phi) = \Pr(\phi \mid g)$.

Frequency relationships between genotype and phenotype (cont'd)
• For many quantitative biological traits there is some measurement scale on which the phenotypes are approximately normally distributed:
  $\Omega_g(\phi) = \frac{1}{\sigma_g \sqrt{2\pi}} \exp\left[ -\frac{(\phi - \mu_g)^2}{2\sigma_g^2} \right]$
• Penetrance is a statistical, population-specific association between genotype and phenotype, not a biological explanation of such a relationship.
• Many factors may affect the expression of a given genotype: genes, environmental factors, errors in measurement or classification, sampling error, etc.

Nuclear families and sibships
The distribution of traits in families
• A diploid, sexually reproducing organism has two sets of genes, one inherited from each parent.
• Each time that individual produces his/her own gamete (sperm or egg), one of his/her inherited alleles, at each locus, will be randomly chosen and transmitted in the gamete. There is thus a probability of ½ that an offspring will inherit a specific parental allele. THIS probabilistic aspect of inheritance IS A FUNDAMENTAL ASPECT OF OUR BIOLOGY.

Segregation analysis: discrete traits in families
• We can understand the basic principles of genetic epidemiology by studying the behavior of alleles at a single locus in nuclear families.
• We can take advantage of evolution-based constraints on the distribution of genetic variation in families.
• The analysis of trait distributions in families is known as segregation analysis, after Gregor Mendel's Law of Segregation of individual alleles at a locus.
• The idea is to judge if the pattern of phenotypes in families is consistent with a genetic model.
• Families are ascertained via one or more index individuals, or probands, who may be either randomly identified, or chosen because of their disease or other phenotype status.
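
The genotype frequencies and the normal penetrance function from the earlier slides can be tried out numerically. The following is only a minimal Python sketch: the function names, the genotype means and standard deviations, and the allele frequency are hypothetical illustrations, and the genotype frequencies assume random mating. The last function mixes the three penetrance curves by genotype frequency, which gives the phenotype density in the whole population.

```python
import math


def genotype_frequencies(p):
    """Genotype frequencies at a diallelic locus, assuming random mating."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


def normal_penetrance(phi, mu_g, sigma_g):
    """Normal density of phenotype phi for a genotype with mean mu_g and SD sigma_g."""
    z = (phi - mu_g) / sigma_g
    return math.exp(-0.5 * z * z) / (sigma_g * math.sqrt(2.0 * math.pi))


def phenotype_density(phi, p, means, sds):
    """Population density of phi: penetrances weighted by genotype frequency."""
    freqs = genotype_frequencies(p)
    return sum(freqs[g] * normal_penetrance(phi, means[g], sds[g]) for g in freqs)


if __name__ == "__main__":
    # Hypothetical genotype means and SDs, chosen only for illustration.
    means = {"AA": 120.0, "Aa": 110.0, "aa": 100.0}
    sds = {"AA": 8.0, "Aa": 8.0, "aa": 8.0}
    print(genotype_frequencies(0.3))
    print(phenotype_density(105.0, 0.3, means, sds))
```

With well separated genotype means the population density is a visible mixture of three curves; with overlapping means it can look like a single bell curve, which is one reason genotype cannot simply be read off phenotype.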
Nuclear families and sibships (cont'd)
Transmission probabilities
• For a single diallelic locus with alleles A and a, define the transmission probability $t(x \mid g)$ as the probability that a parent of genotype $g$ produces a gamete with allele $x$. These are conditional probabilities because they depend on the genotypic state of the parent.
• For autosomal loci $t(A \mid AA) = 1$, $t(A \mid Aa) = ½$, $t(A \mid aa) = 0$.

Table 5.1A. Genotypic mating table for an autosomal diallelic locus

Mating type   Empiric mating   Frequency under             Offspring genotype probabilities
              frequency        random mating               Conditional        Unconditional
                                                           AA    Aa    aa     AA           Aa           aa
AA × AA       M11              $p^2 p^2 = p^4$             1     0     0      $p^4$        0            0
AA × Aa       M12              $2p^2(2pq) = 4p^3q$         ½     ½     0      $2p^3q$      $2p^3q$      0
AA × aa       M13              $2p^2q^2$                   0     1     0      0            $2p^2q^2$    0
Aa × Aa       M22              $(2pq)(2pq) = 4p^2q^2$      ¼     ½     ¼      $p^2q^2$     $2p^2q^2$    $p^2q^2$
Aa × aa       M23              $2(2pq)q^2 = 4pq^3$         0     ½     ½      0            $2pq^3$      $2pq^3$
aa × aa       M33              $q^2 q^2 = q^4$             0     0     1      0            0            $q^4$

Nuclear families and sibships (cont'd)
Mating types
• The probability that an individual has a given genotype is determined by the genotypes, or mating type, of its parents.
• A nuclear family is a set of repeated selections of offspring genotypes from the mating type, $M_{kl}$, of parents with genotypes $k$ and $l$.
• In a population (or sample), $0 \le \Pr(M_{kl}) \le 1$; $\sum_{k,l} \Pr(M_{kl}) = 1$; $\mathrm{est}(M_{kl}) = n_{kl}/N$, the observed number of $k \times l$ matings out of $N$.
• If there is random mating relative to the locus in question, the mating type frequencies are determined by the genotype frequencies (which are in turn determined by the allele frequencies).

Nuclear families and sibships (cont'd)
Transition probabilities
• Family data consist of parent-offspring triads.
• Define transition probabilities $P(g_o \mid g_f, g_m)$ as the conditional probabilities of genotypes in offspring given those in the father and mother.
• For a diallelic locus, there are three possible offspring genotypes (AA, Aa, aa) with transition probabilities
  AA: $t(A \mid f)\, t(A \mid m)$
  Aa: $t(A \mid f)\,(1 - t(A \mid m)) + t(A \mid m)\,(1 - t(A \mid f))$
  aa: $(1 - t(A \mid m))\,(1 - t(A \mid f))$
  See Table 5.1B.

Table 5.1B. Parent to offspring transition probabilities for a diallelic locus
Each entry lists the offspring probabilities {AA Aa aa}.

Father's      Mother's genotype
genotype      AA            Aa            aa
AA            {1 0 0}       {½ ½ 0}       {0 1 0}
Aa            {½ ½ 0}       {¼ ½ ¼}       {0 ½ ½}
aa            {0 1 0}       {0 ½ ½}       {0 0 1}

Table 5.2. Phenotypic mating table for an autosomal diallelic locus

Mating type    Random mating       Offspring segregation proportions ($\theta$)
               frequency           Conditional                          Unconditional
                                   D                  R                 D               R
Dominant by dominant matings
AA × AA        $p^4$               1                  0                 $p^4$           0
AA × Aa        $4p^3q$             1                  0                 $4p^3q$         0
Aa × Aa        $4p^2q^2$           ¾                  ¼                 $3p^2q^2$       $p^2q^2$
All D × D      $(1-q^2)^2$         $(1+2q)/(1+q)^2$   $q^2/(1+q)^2$     $p^2(1+2q)$     $p^2q^2$
Dominant by recessive matings
AA × aa        $2p^2q^2$           1                  0                 $2p^2q^2$       0
Aa × aa        $4pq^3$             ½                  ½                 $2pq^3$         $2pq^3$
All D × R      $2q^2(1-q^2)$       $1/(1+q)$          $q/(1+q)$         $2pq^2$         $2pq^3$
Recessive by recessive matings
aa × aa        $q^4$               0                  1                 0               $q^4$
All R × R      $q^4$               0                  1                 0               $q^4$

D = dominant, R = recessive. "All {mating phenotype}" rows are weighted by their population frequencies. The segregation proportion, $\theta$, can be interpreted as the probability that a random offspring is affected.

Segregation analysis: discrete traits in families (cont'd)
Ascertainment bias and correction: sibship data
• The way in which families are ascertained can have a major effect on the interpretation we make of the data.
Example: Ascertain affected children through the school system. Collect data on all siblings of the affected children.
Suppose the segregation proportion (also the probability that a random offspring is affected) is $\theta$. The probability that a family of sibship size $s$ produces $r$ affected children follows a binomial distribution
  $\Pr(r \mid s, \theta) = \frac{s!}{r!\,(s-r)!}\, \theta^r (1-\theta)^{s-r}$
Therefore the probability that such a family will produce $s$ normal children is $(1-\theta)^s$. These families will never be identified if we ascertain sibships through affected school children.

Ascertainment bias and correction: sibship data (cont'd)
• We must correct for ascertainment to obtain unbiased estimates.
• One simple way: recognize that our sample contains all families except those with no affected children, i.e. our sample represents a fraction $1 - (1-\theta)^s$ of the total population of sibships in this example.
• The corrected probability of $r \ge 1$ affected children in a family of size $s$ is
  $\Pr(r \mid s, \theta) = \frac{s!}{r!\,(s-r)!}\, \theta^r (1-\theta)^{s-r} \Big/ \left[ 1 - (1-\theta)^s \right]$
• Another way of correcting for ascertainment is to perform analyses ignoring the affected probands. This is acceptable only if the probability that a given affected child is ascertained is small.
• Other ascertainment problems: families with many affected members may have a higher chance of being ascertained by a given sampling scheme.
• Corrections for some simple sampling situations have long been known in medical genetics, but methods for complex situations are still inexact.

Segregation analysis: quantitative traits in families
• Quantitative traits may be affected by a large number of loci acting together, as well as by environmental factors.
• Examples of important disease-related traits: blood pressure, obesity measures, cholesterol, triglycerides.
• We need to understand the effect of the genotypes, and the environment, on the phenotype.
• The effects of genotypes on quantitative phenotypes are relative: does genotype AA increase the phenotype, or does aa decrease it?
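
To see how much the missing all-normal sibships matter, the ascertainment-corrected binomial from the sibship slides can be evaluated directly. This is a rough Python sketch, assuming complete ascertainment (every sibship with at least one affected child is sampled); the function names and the example values s = 3 and theta = 0.25 are illustrative choices, not part of the lecture.

```python
from math import comb


def binomial_pmf(r, s, theta):
    """Pr(r affected | sibship size s, segregation proportion theta)."""
    return comb(s, r) * theta**r * (1.0 - theta) ** (s - r)


def ascertained_pmf(r, s, theta):
    """The same probability, conditional on at least one affected child."""
    if r < 1:
        return 0.0  # sibships with no affected child are never sampled
    return binomial_pmf(r, s, theta) / (1.0 - (1.0 - theta) ** s)


def expected_affected_fraction(s, theta):
    """Expected fraction of affected children among ascertained sibships."""
    mean_r = sum(r * ascertained_pmf(r, s, theta) for r in range(1, s + 1))
    return mean_r / s


if __name__ == "__main__":
    s, theta = 3, 0.25  # hypothetical recessive trait in sibships of three
    for r in range(s + 1):
        print(r, binomial_pmf(r, s, theta), ascertained_pmf(r, s, theta))
    print("naive fraction affected:", expected_affected_fraction(s, theta))
```

The expected fraction affected in the ascertained sample is theta / [1 - (1 - theta)^s], which is always larger than theta; counting affected children in the sample without the correction therefore overestimates the segregation proportion, most severely in small sibships.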
Segregation analysis: quantitative traits in families (cont'd)
• The simplest measure of genetic effect is the genotypic value, the mean phenotype observed amongst individuals with a given genotype in the population of reference: $\mu_g = \sum_i \phi_i\, \Omega_g(\phi_i)$, where the $\phi_i$ are the phenotype values and $\Omega_g$ is the penetrance function of genotype $g$.
• The mean number of doses of a given allele, say A, in genotypes in a population is $\mu_g = p^2(2) + 2pq(1) + q^2(0) = 2p$.
• The mean phenotype in the population is the weighted average $\mu = \sum_g P_g \mu_g = p^2 \mu_{AA} + 2pq\, \mu_{Aa} + q^2 \mu_{aa}$ for a diallelic locus.

Genetic variation for a quantitative trait
• The genotypic variance is defined as the variance among the genotypic values in the population: $\sigma_g^2 = \sum_g P_g (\mu_g - \mu)^2 = \sum_g P_g \mu_g^2 - \mu^2$, which equals $2pq$ when the genotypic values are the allele doses 2, 1, 0.
• It is often convenient to express genotypic values as deviations from the population mean, denoted by $\gamma_g = \mu_g - \mu$.
• In the simplest situation, the effects of the individual alleles are additive, and the genotypic value is the sum of the effects of the two alleles in the genotype.

Genetic variation for a quantitative trait (cont'd)
• Define $\alpha_i$ to be the allelic value that each allele contributes to the genotype. Since allele A is paired with another A a fraction $p$ of the time, and with a a fraction $q$ of the time, we have
  $\alpha_A = p\gamma_{AA} + q\gamma_{Aa}$
  $\alpha_a = p\gamma_{Aa} + q\gamma_{aa}$
• Special characteristic of effects expressed as deviations: their average over all genotypes must be zero, i.e. $p\alpha_A + q\alpha_a = 0$. When the allelic effects are additive, the breeding value, or average deviation, of genotype $ik$ is $\alpha_i + \alpha_k$.

Genetic variation for a quantitative trait (cont'd)
• Define the additive genotypic variance, $\sigma_A^2$, as the sum of squares of the breeding values, weighted by the genotype frequencies:
  $\sigma_A^2 = p^2 (2\alpha_A)^2 + 2pq (\alpha_A + \alpha_a)^2 + q^2 (2\alpha_a)^2 = 2(p\alpha_A^2 + q\alpha_a^2)$
• Define the dominance displacement $d$ as the position of the heterozygote relative to the two homozygotes: $d = (\mu_{Aa} - \mu_{aa}) / (\mu_{AA} - \mu_{aa})$.
• If the effects are purely additive, the heterozygote genotypic value will be exactly halfway between those of the homozygotes, i.e. $d = 1/2$.
• The dominance variance is the variance due to dominance deviations from additivity and equals
  $\sigma_D^2 = p^2 (\gamma_{AA} - 2\alpha_A)^2 + 2pq (\gamma_{Aa} - \alpha_A - \alpha_a)^2 + q^2 (\gamma_{aa} - 2\alpha_a)^2$
  where $\gamma_g$ is the genotypic value of genotype $g$ expressed as a deviation from the population mean and $\alpha_A$, $\alpha_a$ are the allelic values.

Environmental effects on quantitative phenotypes
• Environmental factors are responsible for within-genotype variance. The simplest way to account for environmental variance is to aggregate all unmeasured effects on the phenotype, usually assuming that they have a normal distribution.
• We can now express the determination of the phenotype as a sum of additive genetic, dominance, and environmental effects
  $\phi = A + D + E$
  with variance
  $\sigma^2 = \sigma_A^2 + \sigma_D^2 + \sigma_E^2$
• The environmental effects can be additive, i.e. act similarly on each genotype, or there can be a genotype by environment (G × E) interaction if the same environmental exposure affects different genotypes differently (add $\sigma_{GE}^2$ to the above equation).

Kinship and inbreeding coefficients: probabilities of shared genes
Several quantities are used to measure the genetic relationship between two individuals.
• The coefficient of kinship, $F_{XY}$, between individuals X and Y, is the probability that two alleles at the same locus, one chosen randomly from each individual, are identical by descent (ibd) from some common ancestor.
• The inbreeding coefficient, $F$, of an individual is the probability that his/her two alleles at a locus are ibd. This equals the kinship coefficient of the individual's parents.
• The coefficient of relationship, $r = 2F_{XY}$, is the fraction of genes shared ibd by two individuals.
• Table 6.2 gives kinship coefficients $F$ for various important kinds of relative pair.
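
The definition of the kinship coefficient lends itself to a short recursive calculation on a pedigree. The sketch below uses a standard recursion that is not taken from the slides: an individual's kinship with itself is ½(1 + kinship of its parents), and the kinship of two different individuals is the average of the kinships between one individual's parents and the other individual. The pedigree, the numbering convention (parents always have smaller ids than their children, founders are unrelated and non-inbred), and the function names are all hypothetical.

```python
from functools import lru_cache

# Hypothetical pedigree: id -> (father, mother); founders have (None, None).
# Ids are assigned so that parents always have smaller ids than their children.
PEDIGREE = {
    1: (None, None), 2: (None, None),  # founding couple
    3: (1, 2), 4: (1, 2),              # full sibs
    5: (None, None), 6: (None, None),  # unrelated spouses of 3 and 4
    7: (3, 5), 8: (4, 6),              # first cousins
}


@lru_cache(maxsize=None)
def kinship(x, y):
    """Probability that random alleles from x and y are identical by descent."""
    if x is None or y is None:
        return 0.0  # an unknown (founder) parent is treated as unrelated
    if x == y:
        father, mother = PEDIGREE[x]
        return 0.5 * (1.0 + kinship(father, mother))
    if x < y:
        x, y = y, x  # always recurse on the individual with the larger id
    father, mother = PEDIGREE[x]
    return 0.5 * (kinship(father, y) + kinship(mother, y))


def relationship(x, y):
    """Coefficient of relationship r = 2 F_XY."""
    return 2.0 * kinship(x, y)


if __name__ == "__main__":
    print("parent-offspring:", kinship(1, 3))
    print("full sibs:", kinship(3, 4))
    print("avuncular:", kinship(4, 7))
    print("first cousins:", kinship(7, 8))
```

For this pedigree the recursion gives ¼ for parent-offspring and full-sib pairs, 1/8 for the avuncular pair and 1/16 for first cousins; a child of individuals 7 and 8 would have inbreeding coefficient equal to their kinship.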
Table 6.2 (Weiss). Genetic relationships among various types of relative

                                Degree of       Coefficient of
Relative type                   relationship    Kinship (F)    Relationship (r)
MZ twins                        __              __             1
Parent-offspring                1st             ¼              ½
Full sibs / DZ twins            1st             ¼              ½
Half-siblings                   2nd             1/8            ¼
Avuncles*                       2nd             1/8            ¼
Half-avuncles*                  3rd             1/16           1/8
First cousins                   3rd             1/16           1/8
Double first cousins            2nd             1/8            ¼
Half first cousins              4th             1/32           1/16
First cousins once removed      4th             1/32           1/16
Second cousins                  5th             1/64           1/32

* Avuncles refers to uncle/aunt-nephew/niece pairs.

Genotypic correlation between relatives
Consider the genotypic values of parents and offspring, for an additive diallelic locus. See Table 6.3.
• For a locus with three genotypes, there are nine possible parent-offspring genotype pairs.
Example: first row of the table.
• The probabilities that an AA father has an AA, Aa, or aa child are $p$, $1-p$, and 0 respectively. Every offspring receives an A from the father with probability 1, so offspring cannot have genotype aa. They receive an A from the mother with probability $p$ (making their genotype AA), or an a from the mother with probability $1-p$ (making their genotype Aa).

Table 6.3. Parent-offspring relationships

Parent                                Offspring genotype probabilities
Genotype   Prob         Dose value    AA (2)    Aa (1)    aa (0)       Tot    Mean
AA         $p^2$        2             $p$       $1-p$     0            1.0    $p+1$
Aa         $2p(1-p)$    1             $p/2$     ½         $(1-p)/2$    1.0    $p+½$
aa         $(1-p)^2$    0             0         $p$       $1-p$        1.0    $p$

From this table, the covariance between parent (P) and offspring (O) can be calculated as $E[PO] - \mu_g^2$, summing over all the values in the table, to arrive at
  $\mathrm{Cov}(P,O) = p(1-p) = ½\, \sigma_g^2$
Recall: $\sigma_g^2 = 2pq$ and $\mu_g = 2p$.

Table 6.4 (Weiss). Components of genetic covariance for various types of relative

                          Coefficient of
Relative type             $\sigma_A^2$    $\sigma_D^2$
MZ twins                  1               1
Full sibs / DZ twins      ½               ¼
Parent-offspring          ½               0
Mid-parent-offspring      ½               0
Half-siblings             ¼               0
Avuncles*                 ¼               0
Double first cousins      ¼               1/16
First cousins             1/8             0
General                   $r$             $u$

* Avuncles refers to uncle/aunt-nephew/niece pairs.

The covariance between any pair of relatives, P and Q, can be expressed as a weighted combination of additive and dominance effects.
• Let the parents of P be denoted by A and B.
• Let the parents of Q be denoted by C and D.
  $\mathrm{Cov}(P,Q) = r_{PQ}\, \sigma_A^2 + u_{PQ}\, \sigma_D^2$, where $u_{PQ} = F_{AC} F_{BD} + F_{AD} F_{BC}$
The $F$ values are kinship coefficients as given in Table 6.2.

Extension to multiple loci: polygenic traits
Fisher (1918) showed that the single-locus genetic relationships among relatives are preserved for multiple additive loci.
Example:
• At a single locus, there are three genotypes (AA, Aa, aa) and three genotypic dose values (0, 1, and 2).
• At two such loci, there are nine genotypes (aabb, aabB, aaBB, aAbb, aAbB, aABB, AAbb, AAbB, AABB) and five different genotypic values (0, 1, 2, 3, 4).
• In general, for $n$ such loci there are $3^n$ genotypes and $2n+1$ genotypic values, so as $n$ gets large the distribution of additive genotypic values resembles the continuous distribution of a quantitative trait.
In practice, the distribution of summed additive effects can be approximated by a normal distribution. The genotypic correlations between relatives also hold for multiple additive loci.
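
The counting argument for multiple additive loci can be checked by brute force. The Python sketch below enumerates every multilocus genotype and tabulates the total allele dose; it assumes, purely for illustration, that all loci have the same allele frequency, contribute equal additive effects, and are independent, each with random-mating genotype frequencies. Under those assumptions the total dose follows a binomial distribution with 2n trials, which the helper `binomial_pmf` (a name introduced here) makes explicit.

```python
from itertools import product
from math import comb


def dose_distribution(n_loci, p):
    """Enumerate all genotypes at n_loci diallelic loci and tabulate total dose."""
    q = 1.0 - p
    locus = {0: q * q, 1: 2.0 * p * q, 2: p * p}  # dose -> genotype frequency
    n_genotypes = 0
    dist = {}
    for doses in product(locus, repeat=n_loci):  # one dose (0, 1, 2) per locus
        n_genotypes += 1
        prob = 1.0
        for d in doses:
            prob *= locus[d]  # loci are assumed independent
        total = sum(doses)
        dist[total] = dist.get(total, 0.0) + prob
    return n_genotypes, dist


def binomial_pmf(k, n, p):
    """Probability of k successes in n trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


if __name__ == "__main__":
    for n in (1, 2, 5):
        count, dist = dose_distribution(n, 0.5)
        print(n, "loci:", count, "genotypes,", len(dist), "genotypic values")
```

Enumeration grows as 3^n, so it is only practical for a handful of loci; for larger n the binomial form (or its normal approximation) gives the same distribution directly.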
Extension to multiple loci: polygenic traits (cont'd)
• Dominance refers to non-additive (interaction) effects between alleles at the same locus.
• Epistasis refers to interactions among alleles at different loci. This adds another term to the expression for the determination of the phenotype
  $\phi = PG + E = A + D + I + E$
  with variance
  $\sigma^2 = \sigma_{PG}^2 + \sigma_E^2 = \sigma_A^2 + \sigma_D^2 + \sigma_I^2 + \sigma_E^2$
  which can be rewritten as
  $1 = \sigma_{PG}^2 / \sigma^2 + \sigma_E^2 / \sigma^2$
• Define heritability as $h^2 = \sigma_{PG}^2 / \sigma^2$. Heritability represents the ratio of the observed phenotypic correlation to the theoretical genotypic correlation.
• In twins: $h^2 = (\sigma_{DZ}^2 - \sigma_{MZ}^2) / \sigma_{DZ}^2$
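
To connect the heritability definition back to the single-locus variance components from the earlier slides on genetic variation, the following Python sketch computes the additive and dominance variances from three genotypic values and then forms the heritability ratio. It is an illustration under stated assumptions only: a single diallelic locus with random mating, no epistasis (the interaction variance is taken as zero), and an environmental variance supplied by hand; the function names and all numbers are hypothetical.

```python
def genetic_variances(p, mu_AA, mu_Aa, mu_aa):
    """Additive and dominance variance at one diallelic locus (random mating)."""
    q = 1.0 - p
    mu = p * p * mu_AA + 2.0 * p * q * mu_Aa + q * q * mu_aa  # population mean
    # Genotypic values as deviations from the population mean.
    g_AA, g_Aa, g_aa = mu_AA - mu, mu_Aa - mu, mu_aa - mu
    # Allelic values: average deviation of genotypes carrying each allele.
    alpha_A = p * g_AA + q * g_Aa
    alpha_a = p * g_Aa + q * g_aa
    var_A = 2.0 * (p * alpha_A**2 + q * alpha_a**2)
    var_D = (
        p * p * (g_AA - 2.0 * alpha_A) ** 2
        + 2.0 * p * q * (g_Aa - alpha_A - alpha_a) ** 2
        + q * q * (g_aa - 2.0 * alpha_a) ** 2
    )
    return var_A, var_D


def heritability(var_pg, var_e):
    """Share of phenotypic variance attributed to genetic effects."""
    return var_pg / (var_pg + var_e)


if __name__ == "__main__":
    # Hypothetical completely dominant locus: AA and Aa share the same mean.
    var_A, var_D = genetic_variances(0.3, 2.0, 2.0, 0.0)
    var_E = 1.0  # hypothetical environmental variance
    print("additive:", var_A, "dominance:", var_D)
    print("h2:", heritability(var_A + var_D, var_E))
```

For a purely additive locus (heterozygote exactly halfway between the homozygotes) the dominance variance comes out as zero and the additive variance accounts for all of the genotypic variance.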