Multifactorial traits and complex genetics I
Genome-wide association studies in humans
Wellcome Trust Centre for Human Genetics

Overview
Describe studies aiming to find genetic differences between individuals that influence susceptibility to diseases (or other traits).

Why find disease genes?
- Identify putative drug targets.
- Identify high-risk individuals.
- Gene therapy?
- Personalised medicine (e.g. stratifying cancer).
- Understand the biology of disease.

How do genetic factors influence traits?
Two somewhat competing views:
- “Mendelian inheritance”: genetic influence on traits is inherited in big, discrete lumps. Gregor Mendel (1865); Morgan (1915); e.g. Mendelian inheritance of the ABO blood group established (1924).
- “Biometrical”, “multifactorial” or “polygenic” viewpoint: genetic influence on traits is inherited in essentially continuous quantities. Darwin (1859); Galton (1886), e.g. human height.
Fisher, Haldane and Wright (1920s-1930s): the modern evolutionary synthesis.

Genomics timeline
- 1950s: structure of DNA
- 1970s: ‘Sanger sequencing’
- 1980s: RFLP (genetic barcode / inexpensive genotyping of marker loci)
- 1990s: linkage studies using RFLPs
- 2000s: Human Genome Project completed; International HapMap Project; first genotyping microarrays; first large-scale association studies
- 2010s: 1000 Genomes Project; direct-to-consumer genetic testing
- Present: massively large-scale biobank / population sequencing projects (UK Biobank); 100,000 Genomes Project (UK); Precision Medicine Initiative (US); …

Finding Disease Genes 1 (linkage)
- Familial Aggregation
- Segregation Analysis
- Genome-wide Linkage Analysis

Linkage Mapping
Small number of typed markers.
[Figure: pedigree of affected and unaffected individuals showing the haplotypes each carries at typed markers A/a, B/b, C/c, … along a chromosome.]

Linkage Mapping
Typical result if successful: a strong signal (good), but not well localised within a chromosome.
[Figure: linkage signal plotted along the chromosome.]
The initial discovery led to the finding of APOE variants affecting risk of Alzheimer's disease. Pericak-Vance et al., Am. J. Hum. Genet. (1991)

Finding Disease Genes 1 (linkage)
- Familial Aggregation
- Segregation Analysis
- Genome-wide Linkage Analysis
- Candidate Gene Studies + Fine Mapping
- Gene Characterization
(Slide annotation: “We aren’t very good at this!”)

Successes and Failures
Linkage mapping has been successful in identifying the genetic basis of many human diseases in which the disease penetrance resembles a simple Mendelian model, e.g. Huntington’s disease (HD 1993), cystic fibrosis, some forms of breast cancer (BRCA1 1993), Alzheimer's disease (APOE 1991)…
But “the literature is now replete with linkage screens for an array of common ‘complex’ disorders such as schizophrenia, manic depression, autism, asthma, type I and type II diabetes, Multiple Sclerosis, Lupus. Although many of these studies have reported significant linkage findings, none has led to convincing replication” (Risch, 2000)

Successes and Failures
Why? Because linkage studies aren’t the right study design for detecting non-Mendelian-like effects. These so-called ‘complex’ traits have fundamentally different genetic architectures.

Relative risk (RR) = P(disease | risk allele) / P(disease | non-risk allele)

- ‘Mendelian’-like trait => RR > 4 or so, i.e. you are many times more likely to get the disease if you carry a risk allele.
- For common diseases, RRs are typically thought to be below 1.5. (But there may be many such variants.)

[Figure: relative risk (RR) plotted against allele frequency, from rare (e.g. <1%) to common (e.g. 5-50%), with a region labelled “Complex diseases”.]
The genetic basis of common complex disease consists of multiple mutations of modest effect, typically RR < 1.5.

Successes and Failures
Linkage studies aren’t the right study design for detecting complex trait effects.
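To see why a relative risk below 1.5 is so much harder to detect than a Mendelian-like one, Bayes' rule can turn the RR definition above into the risk-allele frequency expected among cases. The following is a minimal sketch, not taken from the slides: the function name `case_allele_frequency` and the population frequency of 0.3 are illustrative assumptions, and the population frequency is treated as a stand-in for the control frequency (reasonable when the disease is rare).

```python
def case_allele_frequency(population_freq, relative_risk):
    """Expected risk-allele frequency among cases.

    With RR = P(disease | risk allele) / P(disease | non-risk allele)
    and p the risk-allele frequency in the population, Bayes' rule gives
        P(risk allele | disease) = RR * p / (RR * p + (1 - p)).
    """
    if not 0.0 < population_freq < 1.0:
        raise ValueError("population_freq must lie strictly between 0 and 1")
    if relative_risk <= 0.0:
        raise ValueError("relative_risk must be positive")
    weighted = relative_risk * population_freq
    return weighted / (weighted + (1.0 - population_freq))


if __name__ == "__main__":
    p = 0.3  # hypothetical population frequency of the risk allele
    for rr in (4.0, 1.5, 1.2):
        freq = case_allele_frequency(p, rr)
        print(f"RR = {rr}: frequency in cases = {freq:.3f} (population {p})")
```

Under these assumptions an RR of 4 moves the allele frequency from 0.30 in the population to about 0.63 in cases, while an RR of 1.2 moves it only to about 0.34, which is the kind of small difference an association study has to detect.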
[Figure: number of families (linkage study) or case-control pairs (case/control GWAS study) needed, as a function of relative risk = P(disease | risk allele) / P(disease | non-risk allele). Risch (2000)]

Finding Disease Genes 2 - GWAS
- Familial Aggregation (Still want a heritable trait!)
- Segregation Analysis
- Genome-wide Linkage Analysis
- Genome-wide Association Analysis
- Candidate Gene Studies + Fine Mapping
- Gene Characterization

Association mapping
[Figure: chromosomes sampled from cases (D) and controls (U), with typed markers indicated by asterisks.]
1. Collect a set of unrelated affected individuals (cases) and unaffected individuals (controls).
2. Genotype many thousands of genetic markers (but probably not the causal, functional mutations themselves).
3. Hope to rely on correlations between typed markers and the causal mutations.

The red variant is what we’re looking for. In this toy example,

RR = P(D | red) / P(D | not red)
   = [P(red | D) × P(not red)] / [P(not red | D) × P(red)]
   = (5/6 × 5/6) / (1/6 × 1/6)
   = 25

So real effects, e.g. RR < 1.5, are much more subtle than this!

Association mapping
E.g. at a typed marker in our toy example:

            Not white   White   Frequency of white
Cases           5         1           1/6
Controls        2         4           2/3

=> Estimate RR = 10 at this marker SNP.
Perform a statistical test for evidence of a difference in allele frequencies between cases and controls (e.g. a chi-squared test). In this toy example P = 0.24, so there is not enough data even for this strong effect.
P < (a stringent threshold) => success!

(Aside: association studies using the TDT, the transmission disequilibrium test)
- Collect (lots of) trios of individuals.
- Condition on the phenotype of the offspring (case).
- High-risk alleles should be over-transmitted.
- An internal control is formed by the untransmitted alleles.
[Figure: a trio showing parental and offspring genotypes at a marker with alleles A and a.]

Difference between linkage and association
Linkage studies
- Collect a set of families with individuals carrying the disease or phenotype.
- Look for co-segregation of a small number of markers with disease status.
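The numbers in the toy association example above can be checked with a short script. This is a sketch under stated assumptions: the slides do not say which form of the chi-squared test produced P = 0.24, and a Yates continuity-corrected test on the 2x2 marker table is used here because it gives a value close to the one quoted. The function names are illustrative, and only the Python standard library is used.

```python
import math
from fractions import Fraction


def relative_risk_from_frequencies(freq_in_cases, freq_in_population):
    """RR = P(D | allele) / P(D | other allele), rewritten with Bayes' rule as
    [P(allele | D) * P(other)] / [P(other | D) * P(allele)]."""
    return (freq_in_cases * (1 - freq_in_population)) / (
        (1 - freq_in_cases) * freq_in_population
    )


def chi2_yates_2x2(a, b, c, d):
    """Yates-corrected chi-squared test (1 df) for the table [[a, b], [c, d]].

    Returns (statistic, p_value). For 1 degree of freedom the upper tail
    probability is erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column must have a non-zero total")
    corrected = max(abs(a * d - b * c) - n / 2, 0)
    statistic = n * corrected ** 2 / margins
    return statistic, math.erfc(math.sqrt(statistic / 2))


if __name__ == "__main__":
    # Causal (red) variant: frequency 5/6 in cases, 1/6 in the population.
    print(relative_risk_from_frequencies(Fraction(5, 6), Fraction(1, 6)))
    # Marker SNP (not-white allele): 5/6 in cases, 2/6 in controls.
    print(relative_risk_from_frequencies(Fraction(5, 6), Fraction(2, 6)))
    # Marker table: cases 5 not-white / 1 white, controls 2 not-white / 4 white.
    statistic, p_value = chi2_yates_2x2(5, 1, 2, 4)
    print(f"chi-squared = {statistic:.2f}, P = {p_value:.2f}")
```

Using exact fractions keeps the two relative risks at exactly 25 and 10, matching the slides, and the corrected test gives a P-value of about 0.24 on only twelve chromosomes.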
Association studies
- Collect unrelated individuals and look at allele frequency differences between cases and controls (or cases and parents for the TDT).
- Requires genotyping many thousands of markers.
- Exploits correlations between nearby genetic variants along chromosomes within the population.

Theory
Association studies provide more power, allowing us to detect the small effect sizes of the genes underlying common disease.

Questions
- How many SNPs would actually be needed to cover the genome?
- Can we actually type enough SNPs, and cheaply enough, for the large sample sizes required?

Tagging genetic diversity
How many markers are actually required to tag the diversity?
- To understand this, we must first understand patterns of diversity in natural populations.
- Identify a catalogue of variants to type.
Can we design experiments to analyse such large numbers of SNPs?

Correlation between SNPs
[Figure: correlation between SNPs against physical distance along the chromosome, comparing real data with the previous prediction. Reich et al., Nature 2001]

Why? Recombination hotspots
Count the number of recombination events in (lots of) sperm in the MHC region of chromosome 6. Jeffreys et al. 1998

Hotspots are a genome-wide feature
- More than 80% of recombination occurs in less than 10% of the genome.
- Recombination gives linkage disequilibrium (LD) a block-like structure.

HapMap project
A consortium of a large number of scientists conducting a study to catalogue and describe human genetic diversity.
- Discovery of over 5 million SNPs across the genome.
- Estimate that 200,000 to 500,000 SNPs are required to tag the genome (at least in European and Asian populations).

Competition drove technology improvements
[Figure: cost against coverage for commercial genotyping arrays.]
- Affymetrix 100K, Affymetrix 500K, Affymetrix 6.0 (~1M SNPs), …
- Illumina 650Y, Illumina 1M, Illumina 2.5M, Illumina 5M, …
Which one to buy?
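The correlation between SNPs that tagging relies on is commonly summarised by the squared correlation r² between the alleles at two sites. The slides do not name the statistic they plot, so treating "correlation" as r² is an assumption here, as are the function name `ld_r_squared` and the tiny haplotype lists, which are invented for illustration.

```python
def ld_r_squared(haplotypes):
    """Squared allelic correlation (r^2) between two biallelic SNPs.

    `haplotypes` is a list of (allele_at_snp1, allele_at_snp2) pairs,
    one pair per chromosome, with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    p_a = sum(a for a, _ in haplotypes) / n        # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n        # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                           # linkage disequilibrium coefficient D
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denominator


if __name__ == "__main__":
    # Hypothetical haplotypes: the marker always travels with the causal allele.
    perfectly_tagged = [(1, 1), (1, 1), (0, 0), (0, 0)]
    # Hypothetical haplotypes: alleles at the two SNPs occur independently.
    unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]
    print(ld_r_squared(perfectly_tagged))
    print(ld_r_squared(unlinked))
```

A marker with r² near 1 to a causal variant carries almost all of its association signal, which is why a few hundred thousand well-chosen SNPs can stand in for many more untyped ones.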
Costing a GWA
Competition and anticipation of the power of GWA studies drove the cost of genotyping chips down.

Cost per genotype
2003   ~ $1
2005   ~ $0.1
2006   ~ $0.001
2009   ~ $0.0005 (ish)

High-throughput microchip arrays. Main players: Affymetrix and Illumina.

Power to find weak effects
[Figure: power against sample size (number of cases and controls) for a relative risk of 1.2, with curves for Illumina 650k, Illumina 550k, Illumina 300k, Affymetrix 500k and Affymetrix 100k.]

Theory
Association studies provide more power, allowing us to detect the small effect sizes of the genes underlying common disease.
HapMap
Strong correlations between neighbouring SNPs, due to hotspots, mean that we don’t necessarily need to type the causal variant.
Technology
Competition and commercial drive have meant that we can now affordably type the necessary number of SNPs in large numbers of individuals.

GWAS recipe
1. Collect large numbers of case individuals (1000s).
2. Collect large numbers of controls (perhaps randomly from the population).
(3. Get consent.)
4. Extract DNA.
5. Genotype individuals at lots of markers.
6. At each SNP, test for an allele frequency difference between cases and controls (chi-squared test, logistic regression).
7. Look for small p-values (how small?).

It works!
Study of ulcerative colitis (inflammatory bowel disease): 2,321 cases and 4,818 controls typed on the Affy 6.0 array (~1M SNPs).
There are now (2016) over 160 common SNPs with effects RR < 2 associated with IBD, accounting for ~20% of disease heritability.

It works!
www.well.ox.ac.uk/wtccc2/ms
Study of multiple sclerosis (2011): 9,772 cases and 17,376 controls from across Europe.
www.genome.gov/gwastudies/

What can possibly go wrong?
[Figure: association mapping in cases (D) and controls (U), with the genotyped genetic markers indicated by asterisks.]

Potential confounders
Testing for small differences in allele frequency in large samples at around a million different SNPs in the genome:
• Statistical tests are sensitive to possible confounding, e.g. ??
• Large amounts of data make it difficult to visually inspect the data.

Some potential problems
Population structure
- Population differentiation: tends to affect all parts of the genome.
- Natural selection: has a pronounced effect at particular loci.
Experimental biases
- Subtle differences in DNA collection, storage or analysis can lead to both consistent and sporadic differences.

Confounding by population structure
[Figure: genotype (aa, Aa, AA) frequencies in cases and controls for Subpopulation A and Subpopulation B. Test statistics shown: χ² = 2.1 (p = 0.34), χ² = 1.57 (p = 0.46), χ² = 16.3 (p < 0.001).]

SNP genotyping
[Figure: intensity of probe B against intensity of probe A.]
• SNP genotyping is achieved by measuring the evidence for the presence of each of the two alleles at each SNP, in each individual independently.
• Genotypes are then obtained by “clustering” the data.
• This is hard!

Differences in genotype calling
[Figure: genotype calling in cases and in controls.]
The experimental process is not perfect, and slight differences can lead to apparent allele frequency differences.

An embarrassing example
Plausible hypothesis, big study, genome-wide markers, very small P-value (< 1 × 10^-10). In a respected journal (Science)... But not real, and now retracted. Why? Because of genotyping errors!

A quick example to demonstrate some of the analytical and statistical challenges…
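As a separate illustration of the population-structure confounding described above (this is not the lecturer's own example, which is not included in this text), the sketch below uses invented allele counts. Within each subpopulation the allele frequency is identical in cases and controls, but cases are drawn mostly from subpopulation A and controls mostly from subpopulation B. The counts, names and the choice of an uncorrected 1-df Pearson chi-squared test are all assumptions.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (1 df, no continuity correction) for the
    table [[a, b], [c, d]]. Returns (statistic, p_value)."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column must have a non-zero total")
    statistic = n * (a * d - b * c) ** 2 / margins
    return statistic, math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical allele counts, ordered as
# (allele A in cases, allele a in cases, allele A in controls, allele a in controls).
SUBPOP_A = (160, 640, 40, 160)   # allele A frequency 0.2 in cases and in controls
SUBPOP_B = (120, 80, 480, 320)   # allele A frequency 0.6 in cases and in controls
POOLED = tuple(x + y for x, y in zip(SUBPOP_A, SUBPOP_B))

if __name__ == "__main__":
    for label, counts in (("A", SUBPOP_A), ("B", SUBPOP_B), ("pooled", POOLED)):
        statistic, p_value = chi2_2x2(*counts)
        print(f"{label}: chi-squared = {statistic:.1f}, p = {p_value:.3g}")
```

With these counts neither subpopulation shows any association on its own, yet the pooled table gives a very large test statistic, purely because the two groups were sampled unevenly from populations with different allele frequencies.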