Disease gene mapping in the post-genomic age
Pak C. Sham
Department of Psychiatry & Genome Research Centre, HKU
Bioinformatic and Comparative Genome Analysis Course
HKU-Pasteur Research Centre - Hong Kong, China
August 17 - August 29, 2009

Overview
  Introduction
  Linkage
  Model-based Linkage Analysis
  Model-free Linkage Analysis
  Population Association Analysis
  Case-Control Association Analysis
  Family-based Association Analysis
  Linkage Disequilibrium
  Genome-wide Association Studies
  Data Quality Checks of Genetic Data
  Population stratification
  Levels of Statistical Significance
  Statistical Power Calculation
  Quantile-Quantile Plots
  Gene-Gene and Gene-Environment Interactions
  Meta-analysis

Human Diseases
  Inherited “Mendelian” diseases
    Huntington’s disease
    Cystic fibrosis
    Haemophilia
  Complex diseases
    Cancers
    Ischaemic heart disease
    Diabetes
    Depression
  Health-related quantitative traits
    Body mass index
    Blood pressure
    Blood sugar level
    Neuroticism

Classical Genetics
  Mendel: the gene as the unit of biological inheritance
  Inference of genetic effects
    Mendelian segregation ratios
    Twin studies ($C_{MZ} > C_{DZ}$)
    Adoption studies
  Estimation of genetic parameters
    Gene frequency, penetrance, mutation rate
    Heritability: narrow and broad
  Based only on the pattern of disease occurrence in families and in the population
  Not direct analysis of DNA

DNA Sequence Variation
  Single nucleotide polymorphisms
  Structural variations

Types of Genetic Traits
  Monogenic diseases (e.g. Huntington’s disease, cystic fibrosis)
  Polygenic inheritance:
    Quantitative traits (e.g. blood pressure, IQ)
    Complex disorders (e.g. type 2 diabetes)

Levels of Genetic Analysis
  Heritability
    Based on genetic sharing between family members
    Determines the relative contributions of genetic and environmental variation to trait variation in the population
  Gene finding
    Based on measuring genetic markers (naturally occurring sequence variants) and associating these with disease or trait within families (LINKAGE) or in an entire population (ASSOCIATION)
    Identifies genes that contain sequence variation influencing disease risk, course, outcome and response to different treatments
  Gene function
    Involves molecular genetics, cell biology and animal models
    Identifies mechanisms through which sequence variation influences cell function and therefore disease-related traits

Mendelian Double Backcross
  (Figure: crossing scheme)
  Pure lines: AA/BB × aa/bb
  F1: Aa/Bb, the informative parent (doubly heterozygous, phase-known), crossed with aa/bb
  Offspring, parental types: Aa/Bb, aa/bb
  Offspring, recombinants: Aa/bb, aa/Bb

Recombination Fraction
  For two loci (e.g. A and B), the recombination fraction (θ) is the proportion of gametes that are recombinant
  For two loci on different chromosomes, θ = ½
  For two loci on the same chromosome (syntenic), θ < ½
  As the distance between 2 syntenic loci → 0, θ → 0

Crossing Over in Meiosis
  (Figure: cross-over points)
  The recombination fraction between two loci is the probability of having an odd number of cross-over points between the two loci

Genetic Markers
  Linkage can be used to map disease-related genes if we can measure naturally occurring DNA sequence variations (polymorphisms) at known chromosomal locations
  The determination of the alleles present at a polymorphic marker locus is called genotyping
  Classical genetic markers
    Mendelian disorders
    Blood groups
    HLA antigens
  Molecular genetic markers
    Restriction fragment length polymorphisms (RFLPs)
    Variable-length short-sequence repeats, SSRs (e.g. CACACA…)
    Single-nucleotide polymorphisms, SNPs (e.g. C/T)

Disease Pedigrees
  Marker allele A1 cosegregates with dominant disease
  (Pedigree figure; marker genotypes shown: A3A4, A1A2, A1A3, A1A2, A1A4, A2A4, A3A4, A2A3, A3A2)
  Problem: genotypes at the disease locus are not directly measured, although they can be partially deduced from disease status

Single Major Locus Model
  Mendelian dominant disease
    DD, Dd: Risk = 1; dd: Risk = 0
  Mendelian recessive disease
    DD: Risk = 1; Dd, dd: Risk = 0
  Generalisation (the SML model)
    DD: Risk = $f_2$; Dd: Risk = $f_1$; dd: Risk = $f_0$
    $f_2$, $f_1$, $f_0$ are called penetrances
  Additional parameter
    Disease allele (D) frequency q

Statistics Diversion: Likelihood
  Likelihood: probability of the data as a function of the value of an unknown parameter
  Example
    An urn contains red and blue marbles in the proportions p and 1-p respectively, where p is unknown
    Of 10 marbles randomly drawn from the urn with replacement, 7 are red
    Likelihood function: $L(p) = p^7 (1-p)^3$
    Maximum likelihood estimate of p: 0.7
    Likelihood ratio test of hypothesis H0: p = 0.5
      $T = 2\left[\ln(0.7^7 \times 0.3^3) - \ln(0.5^{10})\right] \approx 1.65$

LOD Scores
  Given an SML model with specified parameter values, the likelihood of the pedigree data (disease and marker status) can be calculated numerically (using the Elston-Stewart or Lander-Green algorithms) for a range of chromosomal locations of the disease locus
  The lod score at a particular location is defined as the common logarithm of the ratio of the likelihood at that location (among the markers) to the likelihood at a very distant location (unlinked to the markers)
    $\mathrm{lod}(x) = \log_{10} \dfrac{L(X = x)}{L(X = \infty)}$
  A lod score of 3 or more at a chromosomal location is by convention considered significant evidence for declaring linkage at that location.
  For Mendelian diseases, the lod score threshold of 3 has an empirical false positive rate of ~2%

Computer Programs
  Algorithm             Suited to                        Program
  Elston-Stewart        Large pedigrees, few markers     LINKAGE
  Lander-Green          Small pedigrees, many markers    MERLIN
  Simulations (MCMC)    Large pedigrees, many markers    SIMWALK (very slow)

Identity by Descent (IBD)
  Two alleles are IBD if they are descended from the same allele of a common ancestor in the pedigree
  (Pedigree figure; genotypes shown: AC, AB, AC, AB, AB, AC, AB, BC, AB, AC)

Affected Sib Pair (ASP) Method
  At any locus the number of alleles shared IBD by a sib pair can be 0 (25%), 1 (50%) or 2 (25%)
  For sib pairs where both sibs are affected, the IBD distribution at a marker locus linked to the disease locus is distorted such that fewer pairs have IBD 0 and more pairs have IBD 2
  The ASP method is based on a test of whether IBD sharing is increased at each marker locus
  ASP is considered “model-free” or “non-parametric” as it does not require the assumption of an SML model

Generalized NPL methods
  The IBD concept can be applied to types of relative pairs other than sib pairs
  Each type of relative pair has a characteristic distribution of IBD sharing in the absence of linkage
  For marker loci linked to a disease locus, the IBD sharing of affected relative pairs will be elevated
  Non-parametric linkage (NPL) tests are based on the detection of an elevated level of IBD sharing at a marker locus in all the affected pairs in general pedigrees

Statistics Diversion
  Many test statistics are calculated by summing up the contributions from many observation units (e.g. families)
  The frequency distribution of such a statistic is often approximately normal (central limit theorem)
  If the standard deviation of such a statistic can be calculated, then this can be used to “standardize” the statistic so that it has a standard deviation of 1
  The resulting test statistic can be referred to a standard normal distribution to obtain a p-value
  Many tests in genetic analysis (including NPL tests) are of this form

Computer Programs
  Lander-Green algorithm for IBD calculation, followed by calculation of NPL test statistics from IBD sharing of affected pairs
    Genehunter
    Merlin

Disease Association
  Two variables are statistically associated if their values in the population are not independent but related to each other
  A disease and a genetic variant (allele) are associated if individuals with different numbers of copies of the variant have different risks of disease
  Hypothetical example
    Genotype      DD      Dd      dd
    Risk          0.09    0.03    0.01
    Risk Ratio    9       3       1

Prospective Design
  Collect a cohort of individuals from the population (random sampling, stratified sampling, cluster sampling etc.)
  Collect DNA and measure genotypes at genetic markers (plus other relevant covariates, e.g. age, sex, ethnicity, environmental exposures)
  Follow up the cohort for the occurrence of the disease of interest
  Perform statistical analysis for association between marker genotype and disease
    Hypothesis test (typically chi-square tests)
    Parameter estimation (typically odds ratios)
    Statistical modelling (e.g. logistic regression), required if the analysis is to include other loci or covariates in addition to the locus of interest

Statistics Diversion: Odds ratio
  Odds = Risk / (1 - Risk)
  Odds Ratio = Odds (exposed) / Odds (unexposed)
  Example
    Genotype      DD        Dd        dd
    Risk          0.09      0.03      0.01
    Odds          0.0989    0.0309    0.0101
    Odds Ratio    9.79      3.06      1

Statistics Diversion: Simple odds ratio (OR) analysis
                  Disease    Healthy
    “B” allele    a          b
    “b” allele    c          d
  $\mathrm{OR} = \dfrac{ad}{bc}$
  $\mathrm{SD}(\ln \mathrm{OR}) = \sqrt{1/a + 1/b + 1/c + 1/d}$
  Chi-square test statistic $= \left(\ln \mathrm{OR} / \mathrm{SD}\right)^2$

Statistics Diversion: Logistic Regression
  The log-odds of disease is modelled as a linear combination of predictor variables
  Predictor variables include the marker genotype of interest plus other covariates (e.g. age, sex, ethnicity)
  Regression coefficients represent log odds ratios and are estimated by maximum likelihood
  Significance tests can be performed for a single predictor variable or jointly for a group of predictors
  There are automated methods for model selection (e.g. forward stepwise, backward stepwise)
  Available in most popular statistical packages

Codings for Genetic Models
  What values to assign to different genotypes?
    Genotype     DD    Dd     dd
    Dominant     1     1      0
    Recessive    1     0      0
    Additive     1     0.5    0
  Any 2 of these codings can be included in a model, but including all 3 leads to “collinearity” (the additive code is half the sum of the dominant and recessive codes) and an error message

Retrospective Design
  The prospective design is inefficient for rare diseases
  In the retrospective (case-control) design, a sample of affected individuals (cases) and a sample of unaffected individuals (controls) are collected from the population
  Collect DNA and measure genotypes at genetic markers (plus other relevant covariates, e.g. age, sex, ethnicity, environmental exposures)
  Perform statistical analysis for association between marker genotype and disease
    Hypothesis test (typically chi-square tests)
    Parameter estimation (typically odds ratios)
    Statistical modelling (e.g. logistic regression), required if the analysis is to include other loci or covariates in addition to the locus of interest

Statistics Diversion: Simple odds ratio (OR) analysis
                “B” allele    “b” allele
    Cases       a             b
    Controls    c             d
  $\mathrm{OR} = \dfrac{ad}{bc}$
  $\mathrm{SD}(\ln \mathrm{OR}) = \sqrt{1/a + 1/b + 1/c + 1/d}$
  Chi-square test statistic $= \left(\ln \mathrm{OR} / \mathrm{SD}\right)^2$
  Same calculations as for the prospective study

Statistics Diversion
  In a statistical test
    False positive result = Type 1 error
    False negative result = Type 2 error
  The Type 1 error rate is controlled by setting the p-value threshold for declaring the test significant
  The probability of detecting a true effect is defined as statistical power (= 1 - Type 2 error rate)
  A study should be designed to achieve adequate statistical power (e.g. 80%)
  Statistical power may depend on parameters whose values are unknown (e.g. effect size of genotypes as measured by odds ratio).
  This calls for consideration of a range of plausible assumptions.

Selection of Controls
  The controls should be from the same population as the cases
  Some minor mismatching between cases and controls can be handled by stratified analysis or statistical adjustment (e.g. in a logistic regression model)
  For a fixed number of cases, statistical power increases with increasing number of controls
  For a fixed total sample size, a balanced design (equal number of cases and controls) is optimal for maximizing statistical power

Hidden Population Stratification
  In mixed populations of diverse origin, case-control studies run the risk of mismatch between cases and controls, and this can lead to false positive and false negative association results
  Example
    Peptic ulcer and O blood group are both more common in “Orientals” than Europeans
    In a mixed population, “Orientals” are likely to be overrepresented in cases with peptic ulcer
    Cases with peptic ulcer would have a higher frequency of O blood group than randomly drawn controls from the population
    If ethnic origin is known then this can be entered in the statistical model for adjustment.
    If not, this could lead to a false positive association between peptic ulcer and the O blood group

Transmission/Disequilibrium Test
  Using family members as controls should ensure matching between cases and controls and avoid the problem of population stratification
  The TDT uses the non-transmitted alleles of the parents of cases as controls
  Example: parents Bb × Bb, affected child BB
                T (transmitted)    NT (non-transmitted)
    Paternal    B                  b
    Maternal    B                  b
  Overall table, counted over heterozygous parents
    b = number of parents transmitting “B” and not “b”
    c = number of parents transmitting “b” and not “B”
  $\mathrm{TDT} = \dfrac{(b-c)^2}{b+c}$
  $\mathrm{OR} = b/c$
  $\mathrm{Var}(\ln \mathrm{OR}) = 1/b + 1/c$

Generalizations of TDT
  To multiallelic markers (ETDT)
  To siblings instead of parents (STDT)
  To general pedigrees (PTDT, FBAT)
  To quantitative instead of disease phenotypes (QTDT)

TDT Designs
  Disadvantages
    Parents may be difficult to recruit for late-onset diseases
    Less statistical power for the same number of subjects genotyped (3 for TDT ~ 2 for balanced CC)
    Asymmetry between transmitted and non-transmitted alleles leads to a propensity to artifacts (e.g. from missing and wrong genotype calls)
  BUT
    Opportunity to examine de novo mutations and parent-of-origin effects

Linkage disequilibrium (LD)
  LD is defined as association between alleles at closely linked loci
  Example
    “A” allele frequency = 0.1
    “B” allele frequency = 0.2
    Expected frequency of “AB” haplotype = 0.1 × 0.2 = 0.02
    Observed frequency of “AB” haplotype = 0.1
    i.e. positive LD between alleles “A” and “B”
  Measures of LD
    $d = f(AB) - f(A)f(B)$
    $d' = d / \max(d)$, given $f(A)$ and $f(B)$
    $r^2 = d^2 / \left[f(A)f(a)f(B)f(b)\right]$, i.e. squared correlation

Haplotype Frequency Estimation
  Genotype data tell us how many copies of each allele are present in an individual, not how these are made up in terms of haplotypes (i.e. their parental origin)
  Example
    Genotype: AaBb
    Possible haplotype combinations: AB/ab or Ab/aB
  Therefore haplotype frequency estimation requires a statistical procedure rather than simple counting
  Alternative methods
    EM algorithm (maximum likelihood), e.g. EH, EH+
    Bayesian algorithms, e.g. PHASE

Systematic Association Mapping
  (Figure: direct association of a variant with disease versus indirect association through a marker in LD with it; alleles labelled A and G)
  From a set of SNPs in high LD with each other, select only one (the tag SNP) to genotype (others can be imputed, e.g. IMPUTE, MACH, PLINK)
  A systematic association analysis of a gene, a chromosome region, or the entire genome can be achieved by the appropriate choice of tag SNPs
  This requires a catalogue of SNPs in the genome and their LD relationships; this information is provided by the HapMap
  Tagging 90% of all common SNPs in the human genome requires genotyping ~1 M SNPs (e.g. Illumina 1M or Affy 6.0) in European and Asian populations (Genome-Wide Association Studies, GWAS)
  (Figure: GWAS findings by quarter, 2005 to second quarter 2008, with RET and NRG1 marked. Updated from Manolio, Brooks, Collins. J Clin Invest 2008; 118:1590-625)

GWAS: Current Experience
  A few SNPs were detectable with small sample sizes (100s) but the majority required very large sample sizes (1000s)
  (Figure: Frequency distribution of effect sizes for the risk alleles of 92 validated SNPs identified from GWAS on 16 disorders.
  Wray et al., 2008)

“Data Cleaning”
  Large-scale genetic data allow systematic checks of data integrity and removal of problematic subjects or SNPs from further analyses
  Subject (sample) checks
    Genotype call rate (low suggests poor DNA quality)
    Autosomal heterozygosity (low suggests poor DNA quality, high suggests contamination)
    X heterozygosity (to check against recorded sex; inconsistency suggests sample mix-up)
    Inconsistencies between recorded family relationships and those inferred from genotype data
  SNP checks
    Low call rate
    Hardy-Weinberg proportions
    Allele frequency (consistency with reference values, e.g. HapMap)
    LD relationships (consistency with reference values, e.g. HapMap)
    Consistency of genotypes at SNP pairs in strong LD

Example: Gender Check
  (Figure: X chromosome heterozygosity (F) plotted against DNA No. for samples labelled male and samples labelled female; panel label “Sty 306”)
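Two of the SNP checks listed under “Data Cleaning” are simple enough to sketch in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the course: genotypes are assumed to be coded as 0, 1 or 2 copies of one allele with None for a failed call, and the function names call_rate and hwe_chi_square are hypothetical. The Hardy-Weinberg check shown is the ordinary 1-df chi-square goodness-of-fit test; an exact test is often preferred when genotype counts are small.

```python
import math

def call_rate(genotypes):
    """Fraction of non-missing genotype calls (a missing call is None)."""
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square test (1 df) of Hardy-Weinberg proportions from genotype counts.

    Returns (chi-square statistic, p-value).
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        return 0.0, 1.0               # monomorphic SNP: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical data, for illustration only
snp_calls = [0, 1, None, 2, 1, 1, 0, 2]
print("call rate:", call_rate(snp_calls))
print("HWE chi-square, p-value:", hwe_chi_square(30, 40, 30))
```

In practice the thresholds for excluding a SNP (for example a minimum call rate or a maximum Hardy-Weinberg p-value) are study-specific choices and are not given in the slides.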

Safeguards against stratification
  Genomic Control (GC) method
    Find the median of the chi-square test statistics of all SNPs
    Calculate an inflation factor (IF): the observed median divided by the median expected under the null hypothesis (about 0.455 for 1 df)
    Use IF to adjust down all chi-square statistics (divide each by IF)
  Extracting and adjusting for population structure
    Bayesian approach: STRUCTURE
    Maximum likelihood latent class analysis: L-POP
    Principal components analysis: EIGENSTRAT
    Multidimensional scaling: PLINK
    Hierarchical complete-linkage clustering: PLINK

Empirical assessment of ancestry
  (Figure: CEPH/European, Yoruba, Han Chinese, Japanese; ~2K SNPs)
  (Figure: entire Phase I HapMap)
  (Figure: Han Chinese, Japanese; ~10K SNPs)

Sample Checks: MDS Plot
  (Figure: MDS plot)

Multiple Testing Correction
  Many tests for which H0 is true will have small p-values, e.g. in 100,000 tests, 5,000 are expected to have p < 0.05 by chance
  Therefore the threshold for declaring statistical significance needs to be set at a more stringent level
  Methods
    Bonferroni: set α = 0.05 / n, for n tests
    “Bayesian”: fix α to give the desired false positive rate given certain prior assumptions about the proportion of SNPs with true association and the distribution of effect sizes
    False Discovery Rate (FDR): use the empirical distribution of test statistics or p-values to set an appropriate α to give the desired false positive rate
  For GWAS, it is usual to set α at $5 \times 10^{-8}$, regardless of the number of SNPs tested

Statistics Diversion: Benjamini & Hochberg (1995) Procedure
  1. Set FDR (e.g. to 0.05)
  2. Rank the tests in ascending order of p-value, giving $p_1 \le p_2 \le \dots \le p_r \le \dots \le p_m$
  3. Find the test with the highest rank, r, for which the p-value, $p_r$, is less than or equal to $(r/m) \times \mathrm{FDR}$
  4. Declare the tests of rank 1, 2, …, r as significant
  A minor modification is to replace m by $m_0$ (an estimate of the number of tests for which H0 is true)

B & H FDR Method (FDR = 0.05, m = 10 tests)
  Rank    P-value    (Rank/m) × FDR    Reject H0?
  1       .001       .005              Yes
  2       .010       .010              Yes
  3       .165       .015              No
  4       .205       .020              No
  5       .396       .025              No
  6       .450       .030              No
  7       .641       .035              No
  8       .781       .040              No
  9       .901       .045              No
  10      .953       .050              No

Power Calculations
  Help design studies that are likely to succeed
  Determine the minimum sample size necessary to achieve the desired level of statistical power (usually > 80%), for a given effect size
  Determine the minimum effect size that can be detected with adequate statistical power, for a fixed sample size
  Usually obligatory for grant applications

Steps in Power Calculation
  Specify
    Study design (e.g. case-control)
    Statistical test
  Assume hypothetical values for 2 of the 3 parameters:
    Sample size
    Effect size (including effect frequency)
    Statistical power
  Calculate the remaining parameter
  Make informative plots:
    Sample size against effect size, for fixed power
    Power against sample size, for fixed effect size

Program for Power Calculation
  Genetic Power Calculator (on-line interactive tool)
  Contains modules for power or sample size calculations for a variety of genetic study designs, e.g.
    Quantitative trait linkage analysis
    Quantitative trait association analysis
    Case-control association analysis

Ways to Enhance Power
  Increase sample size
  Increase marker density
  Increase accuracy of phenotype measurements
  Increase accuracy of genotyping
  Rigorous quality control and error checking
  Collect and adjust for environmental covariates
  Appropriate treatment of heterogeneity
  Appropriate treatment of population substructure
  Select most “extreme” individuals for study
  Optimal statistical test that extracts maximum information

QQ Plots
  An effective way of visualizing the overall pattern of p-values from a large-scale systematic association analysis, especially GWAS
  Plots the observed -log(p), ranked in magnitude, against their expected values under the null hypothesis (i.e. p-values uniform between 0 and 1)
  If the null hypothesis is true for all SNPs and the tests are behaving appropriately, then the plot should follow a straight line at 45° from the origin
  Deviations from the null line may suggest
    The presence of true association!
    Misbehaviour of statistical tests due to a variety of reasons

Example QQ Plot: Hirschsprung’s Disease
  (Figure: two panels, without EIGENSTRAT correction and with EIGENSTRAT correction)

Statistical Interaction
  If the effect of factor B (on risk of disease) is different depending on whether factor A is also present, then the factors A and B are said to display statistical interaction
  Example:
    OR of allele “B” in smokers = 3
    OR of allele “B” in non-smokers = 1
  Interactions can be GG (epistasis), GE, or EE

Gene-Environment Interaction: Male conduct disorder
  (Figure: % conduct disorder, 0 to 100, against child maltreatment (Low, Mild, Severe), plotted separately for low and high MAOA activity. Caspi et al., 2002 (Science))

Analysis of Interactions
  Statistical modelling
    Inclusion of a product term in the logistic regression model
  Set association
    Stepwise approach to find the best subset of SNPs among a large number of SNPs in a gene or pathway
  Multifactor Dimensionality Reduction (MDR)
    Model-free approach to reduce high-dimensional genotype data into high- and low-risk classes
  Canonical correlation
    Considers interactions between 2 sets of SNPs

Meta-analysis
  Combination of results from multiple studies in order to detect effects that are too small to detect in the individual studies
  Methodology
    Consistent phenotype definition
    Shared SNPs (possibly by imputation)
    Consistent coding of SNP data
    Combine estimates, weighted by sample size or inverse variance
    Combine p-values: Fisher’s method and variants
    Checks for publication bias (not so much a concern for GWAS)
    Heterogeneity tests
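
The “combine estimates” and “combine p-values” steps of the meta-analysis methodology can be illustrated with a short Python sketch. It assumes that each study reports a log odds ratio and its standard error for the same SNP and the same coded allele. The function names and the example numbers are hypothetical, and the fixed-effect inverse-variance model shown is only one of the weighting choices mentioned above. Cochran's Q is included as one common heterogeneity test.

```python
import math

def inverse_variance_meta(log_ors, ses):
    """Fixed-effect inverse-variance pooling of per-study log odds ratios.

    Returns (pooled log OR, its standard error, z statistic, Cochran's Q).
    Under homogeneity, Q is approximately chi-square with k - 1 df.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / total
    pooled_se = math.sqrt(1.0 / total)
    z = pooled / pooled_se
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_ors))
    return pooled, pooled_se, z, q

def fisher_combined(p_values):
    """Fisher's method: -2 * sum(ln p) is chi-square with 2k df under H0.

    Returns (test statistic, combined p-value). The tail probability uses
    the closed form that holds for a chi-square with an even number of df.
    """
    k = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)
    half = stat / 2.0
    tail = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
    return stat, tail

# Hypothetical per-study results for one SNP, for illustration only
log_ors = [0.18, 0.25, 0.10]
ses = [0.10, 0.12, 0.08]
pooled, pooled_se, z, q = inverse_variance_meta(log_ors, ses)
print("pooled OR:", math.exp(pooled), "z:", z, "Cochran's Q:", q)
print("Fisher:", fisher_combined([0.04, 0.20, 0.01]))
```

Inverse-variance weighting needs effect estimates on a common scale, which is why consistent coding of SNP data comes first in the list; Fisher's method ignores the direction of effect, so it is usually a fallback when only p-values are available.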