SOLAR 101: Intro to Theory & Intro to Software
Laura Almasy

Outline on web site

Part 1: 2:00-3:00pm (John Blangero and Laura Almasy)
I. Introduction to SOLAR: Download, Installation, Registration
II. Creating a SOLAR project: family vs. epidemiological sample, pedigree file
III. Loading traits and pedigree files
IV. Manipulation with traits: statistics, normalization, regression

Break (10 minutes)

Part 2: 3:10-4:10pm (John Blangero)
I. Multivariate linear model: Assumptions, departures, and basic construction
II. Polygenic model, and heritability
III. Statistical Inference
IV. Basic example: transformation, heritability estimation, hypothesis testing
V. Fundamentals of GWAS and linkage

Outline - revised

Part 1: THEORY
I. Basic concepts of variance component models for genetics
II. Heritability
III. Covariates
IV. Association
V. A little bit about linkage

Break

Part 2: SOFTWARE
I. SOLAR: download, registration, documentation, user support
II. Pedigree file: what if you don't have families?
III. Phenotype file: polygenic, heritability, covariates, normalization
IV. SNP and map files: linkage disequilibrium, association analysis
V. Just a little bit about linkage

[Figure: distribution labeled with mean $\mu$ and variance $\sigma^2$]

Variance Decomposition
$\sigma_p^2 = \sigma_g^2 + \sigma_e^2$
$\sigma_p^2$ = Total phenotypic variance

$\sigma_g^2 = \sigma_a^2 + \sigma_d^2$
$\sigma_a^2$ = Additive genetic variance
$\sigma_d^2$ = Dominance variance

Genotype          AA    AB    BB
Genotypic value   -a    d     +a

If the heterozygote is halfway between the two homozygotes, there's a "dose-response" effect, d is zero, and there is no dominance.

$\sigma_a^2 = 2pq[a + d(q-p)]^2$
$\sigma_d^2 = (2pqd)^2$

$\sigma_g^2 = \sigma_a^2 + \sigma_d^2 + \sigma_i^2$
$\sigma_i^2$ = Interaction variance (epistasis)

Genotype          AA    AB    BB
Genotypic value   -a    d     +a

Interaction variance exists when a (or d) is a function of the genotypes at another locus.

$\sigma_e^2 = \sigma_c^2 + \sigma_{ue}^2$
$\sigma_c^2$ = Common or shared environment
$\sigma_{ue}^2$ = Unique environment

Heritability ($h^2$): the proportion of the phenotypic variance in a trait that is attributable to the additive effects of genes.
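As a numerical check on the single-locus formulas above, here is a minimal Python sketch. It assumes the usual convention that p is the frequency of the allele whose homozygote has genotypic value +a and that q = 1 - p; the slides do not state which allele p refers to. The function names and the example numbers (including the environmental variance) are hypothetical, chosen only for illustration.

```python
def single_locus_variances(p, a, d):
    """Additive and dominance variance for one biallelic locus.

    Assumed convention (not stated on the slides): p is the frequency of
    the allele whose homozygote has genotypic value +a, and q = 1 - p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency p must lie in [0, 1]")
    q = 1.0 - p
    var_a = 2.0 * p * q * (a + d * (q - p)) ** 2   # sigma_a^2 = 2pq[a + d(q-p)]^2
    var_d = (2.0 * p * q * d) ** 2                 # sigma_d^2 = (2pqd)^2
    return var_a, var_d


def genotypic_variance_direct(p, a, d):
    """Variance of genotypic values computed directly from Hardy-Weinberg
    genotype frequencies; used only to cross-check the formulas above."""
    q = 1.0 - p
    freqs_values = [(q * q, -a), (2.0 * p * q, d), (p * p, a)]
    mean = sum(f * v for f, v in freqs_values)
    return sum(f * (v - mean) ** 2 for f, v in freqs_values)


def narrow_sense_h2(var_a, var_p):
    """h^2 = sigma_a^2 / sigma_p^2."""
    return var_a / var_p


# Hypothetical locus: p = 0.3, a = 1.0, d = 0.5, environmental variance 1.0.
var_a, var_d = single_locus_variances(0.3, 1.0, 0.5)
var_p = var_a + var_d + 1.0
print(var_a, var_d, narrow_sense_h2(var_a, var_p))
```

With d = 0 the dominance variance vanishes and the additive variance reduces to 2pq a^2, which is the purely additive case described on the slide.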
Broad sense heritability
$h^2 = \sigma_g^2 / \sigma_p^2$ (often written $H^2$)

Additive genetic (narrow sense) heritability
$h^2 = \sigma_a^2 / \sigma_p^2$

Modeling the Phenotype
$p = \mu + \sum_i \beta_i x_i + a + e$
μ   Baseline mean
β   Regression coefficients
x   Scaled covariates
a   Additive genetic effects
e   Random environmental effects

Modeling phenotypic covariance
$\Omega = 2\Phi\sigma_a^2 + I\sigma_e^2$
$\sigma_a^2$ = additive genetic variance
$\sigma_e^2$ = variance due to unique environmental influences
Here $2\Phi$ holds twice the kinship coefficient for each pair of individuals (values below) and $I$ is the identity matrix.

Relationship               2φ
Self                       1
MZ twin pair               1
Parent-offspring           1/2
Siblings                   1/2
Grandparent-grandchild     1/4
Avuncular                  1/4
Half-siblings              1/4
1st cousins                1/8
2nd cousins                1/32

Hypothesis Testing: Null hypothesis $h^2 = 0$
Parameters estimated:
Model       σ²a   σ²e
Sporadic    0     +
Additive    +     +
Twice the difference in ln likelihoods between the two models is distributed as a mixture of chi-square distributions: a 1/2:1/2 mixture of a chi-square with 1 df and a point mass at zero, because $\sigma_a^2$ is constrained to be non-negative.

Trait               h²
Alcoholism          0.39
NIDDM               0.49
Body mass index     0.51
HDL cholesterol     0.52
Thrombosis          0.61
Height              0.81

Limitation
This model assumes that the only source of correlation among family members is genetic.

Solution
Add more components to the model.

Shared environment
Modeling the Phenotype:
$p = \mu + \sum_i \beta_i x_i + a + c + e$
μ   Baseline mean
β   Regression coefficients
x   Scaled covariates
a   Additive genetic effects
c   Shared environmental effects
e   Random environmental effects

Variance Decomposition
$\sigma_p^2 = \sigma_a^2 + \sigma_c^2 + \sigma_e^2$

Shared environmental (household) effects
$c^2 = \sigma_c^2 / \sigma_p^2$

Modeling phenotypic covariance
$\Omega = 2\Phi\sigma_a^2 + H\sigma_c^2 + I\sigma_e^2$
$\sigma_a^2$ = additive genetic variance
$\sigma_c^2$ = variance due to shared environmental influences
$\sigma_e^2$ = variance due to unique environmental influences
$H$ = household matrix (1 for pairs of individuals who share a household, 0 otherwise)

Assumption of VC analysis: Trait is normally distributed.
What happens if this assumption is violated?

4th Central Moment: Kurtosis

Data Transformation: Waist Circumference, Serum Leptin

What about those covariates?
Modeling the Phenotype:
$p = \mu + \sum_i \beta_i x_i + a + c + e$
μ   Baseline mean
β   Regression coefficients
x   Scaled covariates
a   Additive genetic effects
c   Shared environmental effects
e   Random environmental effects

What good are covariates?
[Figure: total trait variance containing the contributions of genes G1 through G5]
$h^2 = (G_1 + G_2 + G_3 + G_4 + G_5) / \text{Total}$

Covariates absorb variance
[Figure: the same diagram with portions of the total variance accounted for by Age and Sex]
$h^2 = (G_1 + G_2 + G_3 + G_4 + G_5) / (\text{Total} - \text{Age} - \text{Sex})$

Hypothesis Testing: Null hypothesis: regression coefficient for covariate = 0
Parameters estimated:
Model       β
Null        0
Alternate   +
Twice the difference in ln likelihoods between the two models is distributed as a chi-square with 1 df.

Caution! In general, you only want to use covariates that are demographic or environmental.
[Figure: the same diagram with BMI added as a covariate, overlapping G1, G2, and G5]
$h^2 = (G_3 + G_4) / (\text{Total} - \text{Age} - \text{Sex} - \text{BMI})$
If we include BMI as a covariate, we reduce our power to detect G1, G2, or G5, genes that influence both BMI and the trait of interest.

Standard association analysis uses genotype as a covariate
[Figure: variance diagram with genes G1, G2, G3, and G5, covariates Age and Sex, and the measured variant A39T]
Measured genotype association tests whether the trait mean differs by genotype, taking into account the nonindependence between family members.

QTN-Specific Heritability
$h_{qtn}^2 = \sigma_{qtn}^2 / \sigma_p^2$

Additive genetic variance due to a QTN, assuming additivity:
Genotype          AA    AB       BB
Genotypic value   -a    0 = d    +a
$\sigma_{qtn}^2 = H a^2$, where
a = 1/2 of the phenotypic difference between homozygotes
H = 2p(1 - p), i.e. heterozygosity

Genotypes as covariates
If the effect of the QTL is modeled as additive:
Genotype   AA   Aa   aa
Cov        0    1    2

Power for Association Studies
Power for association studies is a function of the sample size, the family configuration, the QTN variance, and the LD between a QTN and a genotyped marker.
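To make the QTN quantities above concrete, the following is a minimal Python sketch of the additive QTN variance, the QTN-specific heritability, the attenuation by LD (the relation effective h2snp = actual h2qtn * r2 given in these slides), and the 0/1/2 genotype covariate coding. All function names and numeric inputs are hypothetical illustrations, not SOLAR functions or results from the slides.

```python
def qtn_variance(p, a):
    """Additive variance due to a QTN under pure additivity (d = 0):
    sigma_qtn^2 = H * a^2, with heterozygosity H = 2p(1 - p)."""
    heterozygosity = 2.0 * p * (1.0 - p)
    return heterozygosity * a * a


def qtn_heritability(p, a, var_p):
    """h2_qtn = sigma_qtn^2 / sigma_p^2."""
    return qtn_variance(p, a) / var_p


def effective_snp_heritability(h2_qtn, r2):
    """Heritability seen at a genotyped marker in LD (r^2) with the QTN."""
    return h2_qtn * r2


def additive_covariate(genotype, counted_allele="a"):
    """Code a two-character genotype as the number of copies of one allele,
    matching the slide's coding AA -> 0, Aa -> 1, aa -> 2."""
    if len(genotype) != 2:
        raise ValueError("expected a two-allele genotype such as 'Aa'")
    return sum(1 for allele in genotype if allele == counted_allele)


# Hypothetical inputs: allele frequency 0.25, a = 1.0 trait unit,
# total phenotypic variance 2.0, marker-QTN r^2 of 0.8.
h2_qtn = qtn_heritability(p=0.25, a=1.0, var_p=2.0)
print(h2_qtn, effective_snp_heritability(h2_qtn, r2=0.8))
print([additive_covariate(g) for g in ("AA", "Aa", "aa")])
```

The coding function counts alleles rather than looking genotypes up in a table, so "Aa" and "aA" receive the same covariate value.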
effective $h_{snp}^2$ = actual $h_{qtn}^2 \times r^2$

Example: FXII levels by F12 46C/T genotype
Genotype       CC        CT       TT
FXII levels    128.88    92.23    55.58
p < 1×10^-7

[Figure: Prothrombin levels by G20210A genotype; y-axis: prothrombin activity levels (%); x-axis: genotype (G/G, G/A, A/A); p < 1×10^-7]

Linkage analysis
$\Omega = \hat{\Pi}\sigma_{qtl}^2 + 2\Phi\sigma_a^2 + I\sigma_e^2$

Identity By Descent
We want to know the proportion of alleles shared by a relative pair that derive from a common ancestral source (IBD). This is in contrast to alleles shared identical by state (IBS), in which the ancestral source of the alleles is not considered.
IBS = association
IBD = linkage

$\pi_{ij}$ = IBD probability for individuals i and j
$\pi_{ij} = \frac{1}{2}\Pr(\text{1 allele shared IBD}) + \Pr(\text{2 alleles shared IBD})$

Variance component-based linkage analysis
In a region containing a QTL influencing the trait, relatives who are phenotypically similar will share more alleles IBD than relatives who are phenotypically dissimilar.

[Figure: Linkage and association analyses; axis: time; label: linkage analysis]

Break Time!
Next up: How to actually run these models/analyses in SOLAR

Downloading SOLAR
http://solar.txbiomedgenetics.org/
SOLAR is available for the following systems:
Linux: Intel or AMD cpu
Solaris 10+: Sun SPARC workstation
Mac OS X 10.4+: Intel cpu
Mac OS X 10.4+: G3-G5 PPC
Solaris x86 10+ OS: Intel or AMD cpu
Windows XP or later (using VMware Player)

SOLAR Registration
To maximize any likelihood models, you will need a SOLAR key, obtained by emailing [email protected]. SOLAR is funded by an NIH grant, and registration allows us to report numbers of users and justify our funding. We do not give out the SOLAR mailing list and will send you email maybe once a year or so letting you know of updates or bugs.

SOLAR is helpful
Documentation is online, but also available interactively from within SOLAR.
COMMANDS:
solar> help
solar> help XXX

SOLAR data files
Two general file formats:
1) Comma delimited text file
2) PEDSYS files
Five types of files:
1) Pedigree
2) Phenotype
3) Marker
4) Map
5) Freq

Pedigree file
ID, FA, MO, SEX, [FAMID],[MZTWIN],[HHID]

ID,MO,FA,SEX
1,0,0,2
2,0,0,1
3,1,2,1
4,1,2,1
5,0,0,2
6,0,0,1
7,5,6,2
8,5,6,2

[Figure: pedigree drawing of the two nuclear families listed above (parents 1 and 2 with children 3 and 4; parents 5 and 6 with children 7 and 8)]

What if you don't have families?
ID,FA,MO,SEX
sam,,,M
bob,,,M
joshua,,,M
aliesha,,,F
sophie,,,F
ralph,,,M
gladys,,,F

Loading the pedigree file
COMMANDS:
solar> pedigree load
solar> pedigree show

Phenotype file
Must contain ID [and FAMID if needed]. May have anything else you want.
CRUCIAL: missing data must be left BLANK. Yes/no or affected/unaffected traits are coded 1/0 (or 1/2).

ID,LDL,AGE,SMOKING
sam,157,42.5,1
gladys,200,65.34,0
sophie,127,22.1,0
joshua,146,38.5,0
aliesha,,46.2,1

Loading the phenotype file
COMMANDS:
solar> pheno load phenos
solar> stats -all
solar> pheno

SOLAR output files
Housekeeping files:
pedindex.out
pedindex.cde
pedigree.info
phenotypes.info
phi2.gz
Copies of the results shown on the screen:
stats.out

The polygenic model
COMMANDS:
solar> trait LDL
solar> covariate age sex
solar> polygenic [-s]

How polygenic -s deals with covariates
Parameters estimated:
Model        β1   β2   β3
Full model   +    +    +
Covar 1      0    +    +
Covar 2      +    0    +
Each covariate is dropped one at a time and the likelihood is compared with that of the full model.

Looking under the hood
COMMAND:
solar> model

Normalization
VC models assume a normally distributed trait and can be sensitive to kurtosis in the trait distribution.
COMMANDS:
solar> define LDLnorm = inorm_LDL
Define can be used for general manipulation of phenotypes:
solar> define newthing = (LDL + AGE)^2

What does inorm really do?
The trait values are sorted, and for any value V found at position I in the sorted list, a quantile is computed for it by the formula I/(N+1). The inverse of the normal cumulative distribution function is then computed for each quantile and stored.
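The rank-based procedure described for inorm can be sketched in a few lines of Python. This is an illustration of the description in these slides (including the averaging over tied values), not SOLAR's own implementation; the function name `inorm` is reused here only for readability, and the sketch assumes the input contains no missing values.

```python
from statistics import NormalDist


def inorm(values):
    """Rank-based inverse normal transformation, following the slide text.

    Each value at 1-based position I in the sorted list gets quantile
    I / (N + 1); the standard normal inverse CDF is applied to it.
    Tied values receive the average of the z-scores of their positions.
    Assumes no missing values (an assumption of this sketch).
    """
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    order = sorted(range(n), key=lambda i: values[i])

    # Collect the z-score of every sorted position, grouped by trait value.
    z_scores_by_value = {}
    for position, index in enumerate(order, start=1):
        z = std_normal.inv_cdf(position / (n + 1))
        z_scores_by_value.setdefault(values[index], []).append(z)

    # Return results in the original order, averaging within ties.
    return [
        sum(z_scores_by_value[v]) / len(z_scores_by_value[v]) for v in values
    ]


# LDL values taken from the example phenotype file (non-missing rows).
ldl = [157, 200, 127, 146]
print(inorm(ldl))
```

Because the quantiles depend only on ranks, the transformed trait keeps the ordering of the original values while taking on a fixed, approximately normal set of scores.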
When the value V occurs multiple times, the inverse normal is computed for each applicable quantile and the results are averaged; the average is what is stored.

Marker file
ID [and FAMID if needed] and genotypes only

ID,rs14756,rs93456,rs34526
bob,AA,GC,AT
gladys,TT,CC,AT
joshua,TA,CG,TT
ralph,AT,CC,TT
sophie,AA,CC,TA
sam,AA,CG,TT
aliesha,TA,CG,AT

Loading the marker file
COMMANDS:
solar> snp load mysnps
solar> snp show

Map file
Header line specifying the type of map (cM for linkage or basepair position), then each line has a marker name and location, separated by spaces.

SNP basepair
rs14756 45667823
rs34526 40693821
rs93456 45692598

Preparing SNPs for analysis
COMMANDS:
solar> snp covar [-nohaplos]
solar> snp ld [plot]
solar> snp effnum

Simplest way to run association
COMMAND:
solar> mgassoc -files snp.genocov

Measured genotype model
Asks "does the trait mean differ by genotype?"
Likelihood ratio test comparing the likelihood of a model where a regression parameter for genotype is estimated to a model where it is fixed to zero (1 df chi-square).
ASSUMPTION: additive model of gene action where heterozygotes are midway between the two homozygotes.

If you have families
COMMANDS:
solar> snp qtldcov
solar> pheno load phenos snp.qtldcov
solar> qtld
Runs measured genotype but also provides a quantitative trait TDT and a test for stratification.

Making life easier
You can use most Unix commands from inside SOLAR. Wildcards (*) are an exception.
You can use Tcl scripts to automate any series of commands and build your own custom analyses. (See also the 'toscript' command.)
Power users: you can directly specify the mean and variance equations.

Linkage analysis
To quote Monty Python: "Not dead yet!"
Requires families
Obtain MLEs for allele frequencies
Calculate and store estimates of identity by descent (IBD) allele sharing
Run twopoint or multipoint
See the SOLAR tutorial for a detailed walk-through.
Email [email protected] for help with MIBD estimation.
Recap
Download, documentation, etc: http://solar.txbiomedgenetics.org/
User support: [email protected]