Applications in Bioinformatics, Proteomics, and Genomics
SNPs (1)
J. Gray (UT)
Oct 1 2015

88 million genetic variants mapped in humans
The realization that DNA differs from person to person much more than researchers had suspected may transform medicine, but it could also threaten personal privacy.

Today's lecture
1: Genetic variation in humans
2: What are SNPs?
3: Why should we care about SNPs?
4: SNP Discovery - The SNP Consortium / The International HapMap / The Personal Genome Project / The 1000 Genomes Project / 10,000 genomes project
5: Haplotypes and how chromosomal recombination gives rise to new haplotypes
6: Overview of SNP detection methods

1: Understanding Human Genetic Variation

Human genome: 3 billion base pairs
All human genetic variation: ~100 million bp (~3%)
Medically relevant: < 500,000

“Every drop of human blood contains a history book written in the language of our genes” - Spencer Wells, “The Journey of Man: A Genetic Odyssey”, 2002

Founder mutations and genetic disease
Two men born in the US, thousands of miles apart, share a condition known as hereditary hemochromatosis.
The error in their genes originated in a single European ancestor, whose descendants now number nearly 22 million.
The original mutation is known as a “founder mutation”; such mutations spread because of bottlenecks in human migration.
The study of these mutations is intimately linked to the study of the recent evolution and spread of the human species.

(Figure: simple illustration of the founder effect. The original population is on the left, with three possible founder populations on the right.)

Human migration in the past 100K years
“Once modern humans began their migration out of Africa about 70,000 years ago, they kept going until they had spread to all corners of the globe. How far and fast they went depended on climate, the pressures of population and the invention of boats and other technologies. Less tangible qualities also sped their footsteps: imagination, adaptability and curiosity.”
Human demographic history has shaped the pattern of variation observed in modern populations.
The general consensus is that Africa is the cradle of modern humans (approx. 200K yrs ago).
Genetic data show that all non-Africans are the descendants of a small group of Africans that moved into the Middle East about 70K yrs ago.
(Figure: timeline on a logarithmic scale.)

All humans are very closely related
Humans went through a very narrow genetic bottleneck: an estimated 1 to 10 million humans were in the world after the last ice age (10K yrs ago).
The greatest diversity of genetic markers is in Africa, indicating it was the earliest home of modern humans.
Only a handful of people, carrying a few markers, left Africa, seeding the genetic makeup of the rest of the world.
National Geographic March 2006
See also http://www.bradshawfoundation.com/stephenoppenheimer/

Genetic mutations that act as markers trace the journey of human migration.
The earliest known mutation to spread outside of Africa is M168 (Haplotype CT), about 50K yrs ago.
(Figure: the Y chromosome of a Native American man with various mutations including M168 (Haplotype CT), proving his African ancestry.)

Founder mutations on the Y chromosome give rise to haplotypes: “Eurasian Adam”
In human genetics, Haplogroup CT is a Y-chromosome haplogroup, defining one of the major lines of common ancestry of humanity along father-to-son male lines.
Men within this haplogroup have Y chromosomes with the SNP mutation M168, along with P9.1 and M294.
These mutations are present in all modern human male lines except A and B, which are both found almost exclusively in Africa.

Origin and spread of Haplotype CT
Haplogroup CT is therefore the common ancestral male lineage of all men alive today, including most Africans, except those who belong to haplogroups A or B.

Y-DNA Haplogroup Mutations Table
The Y haplotype is very stable because there is no recombination happening with any other chromosome.
The mitochondrial genome supplies a similar grouping in the maternal lineage.

Haplogroup   Mutations
A            no mutations
B            SRY10831.1
C            SRY10831.1>M168
D            SRY10831.1>M168>M174
E            SRY10831.1>M168>M96
F            SRY10831.1>M168>M89
G            SRY10831.1>M168>M89>M201
H            SRY10831.1>M168>M89>M69
I            SRY10831.1>M168>M89>M170
J            SRY10831.1>M168>M89>M304
K            SRY10831.1>M168>M89>M9
L            SRY10831.1>M168>M89>M9>M11
M            SRY10831.1>M168>M89>M9>M5
N            SRY10831.1>M168>M89>M9>M214
O            SRY10831.1>M168>M89>M9>M214>M175
O3           SRY10831.1>M168>M89>M9>M214>M175>M122
P            SRY10831.1>M168>M89>M9>M45
Q            SRY10831.1>M168>M89>M9>M45>P36
R            SRY10831.1>M168>M89>M9>M45>M207
R1b          SRY10831.1>M168>M89>M9>M45>M207>M343

The pattern of genetic diversity in modern human populations is the result of many evolutionary processes.
New tools and resources promise to help identify functional mutations important for normal phenotypic variation as well as susceptibility to genetic disease.
The same approaches are just as important for deciding how to protect biodiversity and for aiding plant breeding and animal husbandry.

Q: How much do humans differ?
A: Very, very, very little! But everyone is unique.
The Human Genome Project (HGP) involved DNA from 9 individuals from diverse ethnic backgrounds.
It identified about 26,000 genes and about 1.5 million Single Nucleotide Polymorphisms (SNPs).
These are the most prevalent form of genetic variation in humans.
The HGP was launched 25 years ago. (Photo: members of the 1989 meeting that launched the project.)
http://www.nature.com/collections/dcfqmlgsrw

2: What is a SNP? (Single Nucleotide Polymorphism)

GCATGCATGCATGCAT
||||||||||||||||   Gene allele A1
CGTACGTACGTACGTA

GCATGCAaGCATGCAT
||||||||||||||||   Gene allele A2
CGTACGTtCGTACGTA

Comparing DNA between two individuals shows that about every 1.5 kb there is one base pair difference: a single nucleotide polymorphism (SNP).
When a variant nucleotide is present in more than one percent of a population, that DNA position is the location of a SNP.
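The definition above can be sketched in a few lines of Python. This is only an illustration, not a real variant caller: the function names `find_mismatches` and `is_snp` and the sample counts are hypothetical, the two alleles are the A1/A2 strands from the slide, and the sequences are assumed to be already aligned and of equal length.

```python
def find_mismatches(seq_a, seq_b):
    """Return 0-based positions where two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = zip(seq_a.upper(), seq_b.upper())
    return [i for i, (a, b) in enumerate(pairs) if a != b]


def is_snp(variant_count, chromosomes_sampled, threshold=0.01):
    """Apply the 1% rule: the variant allele must exceed the frequency threshold."""
    return variant_count / chromosomes_sampled > threshold


# Top strands of gene alleles A1 and A2 from the slide (lowercase marks the variant).
allele_a1 = "GCATGCATGCATGCAT"
allele_a2 = "GCATGCAaGCATGCAT"

positions = find_mismatches(allele_a1, allele_a2)
print(positions)  # [7]

# Hypothetical counts: variant seen on 3 of 200 sampled chromosomes (1.5%).
print(is_snp(3, 200))  # True
# Variant seen on 1 of 200 sampled chromosomes (0.5%): a "rare" allele instead.
print(is_snp(1, 200))  # False
```

Real SNP discovery works from many sequenced individuals and has to account for sequencing error, so the position comparison and the frequency test here are deliberately the simplest possible versions of each step.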
Variants present in less than 1% of a population are considered “rare” alleles.
Only 2% of the genome encodes protein.
93% of all annotated genes have 1 SNP.
59% have 5 or more SNPs.
39% have 10 or more SNPs.

Scientists often distinguish between ancient “founder mutations”, where the surrounding DNA is the same as in others in the population, and “hot spot mutations”, which occur in error-prone regions.
Sci. Amer. Oct 2005

Old originals versus numerous newcomers
Sickle cell anemia is most often caused by a “founder mutation”.
Achondroplasia (a form of human dwarfism) ordinarily results from a “hot spot mutation”.

Noteworthy Founder Mutations

Gene        Condition            Mutation origin       Migration                       Possible advantage of 1 copy
HFE         Iron overload        NW Europe             Across Europe                   Protection from anemia
CFTR        Cystic fibrosis      SW Europe             Across Europe                   Protection from diarrhea
HbS         Sickle cell disease  Africa, Middle East   To New World                    Protection from malaria
ALDH2       Alcohol toxicity     Far east Asia         North & West across Asia        Protection from alcoholism
LCT         Lactose tolerance    Asia                  West & North across Eurasia     Allows animal milk consumption
GJB2        Deafness             Middle East           West & North across Europe      Unknown
FV Leiden   Blood clots          W. Europe             Worldwide                       Protection from sepsis

In addition to SNPs there are Copy Number Variations (CNVs) or Structural Variations (SVs).
CNVs can be caused by structural rearrangements of the genome such as deletions, duplications, inversions, and translocations.
Some are associated with disease, most are not, and some are advantageous.
Approximately 0.4% of the genomes of unrelated people typically differ with respect to copy number.
(Figure: a gene duplication has created a copy-number variation. The chromosome now has two copies of this section of DNA, rather than one.)

3: Why should we care about SNPs?

We want to know the basis of human variation and disease susceptibility.
How can some who never smoke get lung cancer and others who smoke heavily stay cancer free?
Why do some people exposed to HIV never develop AIDS?
SNPs are useful for:
1: DNA fingerprinting for criminal or parental identification
2: Helping to map polygenic/disease traits by comparing the DNA of groups with and without inheritance of that disease
3: Genotype-specific medication (pharmacogenomics)
4: Studying human evolution

4: SNP Discovery

The urgency and importance of identifying thousands of SNPs resulted in 11 major pharmaceutical and technology companies cooperating (2001-2008).
First, a pool of 24 DNAs was digested with one of several restriction enzymes, size fractionated and cloned into M13-based vectors.
Individual clones were sequenced, repeats were discarded, and gene pairs were accepted only if 99% homologous.
SNP finding and validation steps isolated more than 1.5 million SNPs.

www.hapmap.org
See also http://www.ncbi.nlm.nih.gov/SNP/
The goal of the International HapMap Project was to develop a “haplotype” map of the human genome, the HapMap, which describes the common (not rare) patterns of human DNA sequence variation (variants in >1% of the population).
The HapMap became a key resource for researchers to use to find genes affecting health, disease, and responses to drugs and environmental factors.
Phase 3 was completed and there are >6 million SNPs defined. The information is freely available.
(See Nature 27 Oct 2005 for the report on phase I of the project, Nature 18 Oct 2007 for phase II and Nature 2 Sep 2010 for phase III.)

Sequencing Entire Genomes - The Terabyte era
July 10, 2008: DNA sequencing enters the terabase era
The Wellcome Trust Sanger Institute announced something remarkable: its scientists had sequenced 300 human genomes in six months.
To put this in perspective, they sequenced more DNA every 2 seconds than was sequenced during the first five years of international genome-sequencing efforts, from 1982 to 1987.
The institute has now sequenced 1 trillion (1000 billion) letters of the genetic code.
The cost of sequencing a human genome has fallen:
$3 billion in 2001 (Human Genome Project)
$1 million in 2007 (for James Watson)
$50,000 in 2010 (James Lupski)
$1000 in Jan 2014 (Illumina HiSeq X Ten, 30X coverage)
$1000 in Sep 2015 for Personal Genome Project volunteers

The Personal Genome Project
The Personal Genome Project (PGP) is a long-term, large-cohort study which aims to sequence and publicize the complete genomes and medical records of 100,000 volunteers, in order to enable research into personal genomics and personalized medicine.
~5000 volunteers to date
www.personalgenomes.org
Dr. George Church is the founder of the project.

Would you have your genome sequenced if you could afford it?
Yes 81%   No 9%   Undecided 10%

If you had your genome sequenced would you want to know everything?
Yes 74%   No 16%   Undecided 10%

In 2013, researchers were able to identify 50 people whose DNA had been posted anonymously on the Internet for genetics studies. The results highlight a trade-off between making genetic data widely available for researchers and protecting personal privacy.

SNP Discovery by sequencing an individual genome
Lupski, J.R. et al., New England Journal of Medicine 362:1181-1191, 2010
James Lupski, a physician-scientist who suffers from a neurological disorder called Charcot-Marie-Tooth, searched for the genetic cause for > 25 years.
Late last year, he finally found it by sequencing his entire genome: it lies in SH3TC2 (the SH3 domain and tetratricopeptide repeats 2 gene). Cost ~$50,000.
This was the first study to show how whole-genome sequencing can be used to identify the genetic cause of an individual's disease.
“I have hundreds of thousands of differences from all the other genomes that have been sequenced. I expect that to hold true for others. Everyone is truly unique.”

SNP Discovery by sequencing family genomes
How much genetic variation is there in each family?
Sequenced the entire genomes of two parents and 2 children who both have a recessive genetic disease named Miller Syndrome.
Estimated a human intergenerational mutation rate of ~1.1 x 10^-8 per position per haploid genome.
Across ~3 billion positions this implies, with a high degree of certainty, that each parent passes about 30 new mutations (for a total of about 60) to their offspring.
Also narrowed the candidate genes to just four.
Roach et al., Analysis of Genetic Inheritance in a Family Quartet by Whole-Genome Sequencing. Science DOI: 10.1126/science.1186802 March 2010

SNP Discovery by sequencing 1000 genomes
With advances in sequencing technology, the 1000 Genomes Project became feasible; it revealed more SNPs than the HapMap project.
www.genome.gov/27542240 - useful video tutorials

Whose 1000 (actually 1096) genomes?
Figure S2. 1000 Genomes Project Phase I populations. A - Total number of samples sequenced; B - Source of DNA (blood (bld) or LCL); C - Gender composition; D - Number that are part of trios (t), parent-child duos (d) or singletons (s).

Phase III: 2504 genomes
1000 Genomes Project Phase III populations. Population sampling.
a, Polymorphic variants within sampled populations. The area of each pie is proportional to the number of polymorphisms within a population. Pies are divided into four slices, representing variants private to a population (darker colour unique to population), private to a continental area (lighter colour shared across continental group), shared across continental areas (light grey), and shared across all continents (dark grey). Dashed lines indicate populations sampled outside of their ancestral continental region.
Nature 526 68-74 Oct 2015

The number of variant sites per genome.
The total number of observed non-reference sites differs greatly among populations (Fig. 1b).
Individuals from African ancestry populations harbour the greatest numbers of variant sites, as predicted by the out-of-Africa model of human origins.
Individuals from recently admixed populations show great variability in the number of variants, roughly proportional to the degree of recent African ancestry in their genomes.
Nature 526 68-74 Oct 2015

Phase III: 2504 genomes
The number of variants within the phase 3 sample as a function of alternative allele frequency:
~64 million autosomal variants have a frequency <0.5%
~12 million have a frequency between 0.5% and 5%
~8 million have a frequency >5%
Nevertheless, the majority of variants observed in a single genome are common: just 40,000 to 200,000 of the variants in a typical genome (1–4%) have a frequency <0.5%.

Table 1 and Fig 1c: the average number of singletons per genome is higher in African populations, and in LWK (Luhya in Webuye, Kenya), which is the centre of origin of humans.
A typical genome contained 149–182 sites with protein-truncating variants, 10,000 to 12,000 sites with peptide sequence-altering variants, and 459,000 to 565,000 variant sites overlapping known regulatory regions. These are the variants most likely to affect gene function.

Whole Genome Sequencing
Deep whole-genome sequencing of 129 trios (mother-father-daughter) from 2 populations
Low-coverage sequencing of 179 unrelated individuals from 4 populations
Exon sequencing of 906 randomly-selected genes in 697 individuals from 7 populations
Overall Findings:
84.7 million SNPs (less than 1% of entire genome)
3.6 million indels (short insertions/deletions)
60,000 structural variants
Nature 526 68-74 Oct 2015

1000 genomes website: http://browser.1000genomes.org/index/html
All data is deposited at 1000genomes.org
Paper: A map of human variation from population-scale sequencing, Nature Vol 467 p 1061 October 2010
NCBI hosts a public SNP database, dbSNP: http://www.ncbi.nlm.nih.gov/snp

10K Genomes Project identifies rare variants in health and disease
The goal is to explore the contribution of rare and low-frequency variants to human traits.
Sequence the whole genomes (low read depth, 7×) or exomes (high read depth, 80×) of nearly 10,000 individuals.
http://www.uk10k.org
Found ~24 million novel single nucleotide variants (SNVs).
Nature 526 82-90 1st Oct 2015

Figure 1: The UK10K-cohorts resource for variation discovery. Number of SNVs identified in the UK10K-cohorts data set in all autosomal regions in different allele frequency (AF) bins, and percentages that were shared with samples of European ancestry from the 1000 Genomes Project (phase I, EUR n = 379) and/or the Genomes of the Netherlands (GoNL, n = 499) study, or unique to the UK10K-cohorts data set. AF bins were calculated using the UK10K data set, for allele count (AC) = 1, AC = 2, and non-overlapping AF bins for higher AC.

Sub-populations of the 10K were chosen for rare diseases, obesity and neurodevelopmental problems. About 4000 were unselected.
Having 10X more European samples than the 1000GP yields substantial improvements in imputation accuracy and coverage for low-frequency and rare variants.
Nature 526 82-90 1st Oct 2015

Figure 5 | Enrichment of single-marker associations by functional annotation in the UK10K-cohorts study. Distribution of fold enrichment statistics for single-variant associations of low-frequency (Minor Allele Frequency, MAF, 1–5%) and common (MAF >5%) SNVs in near-genic elements or selected chromatin states and DNase I hotspots (DHS). Boxplots represent distributions of fold enrichment statistics estimated across the five (out of 31 core) traits where at least 10 independent SNVs were associated with the trait at a 10^-7 P value (permutation test) threshold (HDL, LDL, TC, APOA1 and APOB).

5: Haplotypes and how chromosomal recombination gives rise to new Haplotypes

(Figure: parental haplotypes XYZ and xyz, and recombinant haplotypes such as Xyz, xYz and xYZ.)
During meiosis, homologous chromosomes (1 from each parent) pair along their lengths.
The chromosomes cross over at points called chiasmata.
At each chiasma, the chromosomes break and rejoin, trading some of their genes.
This recombination results in genetic variation (new haplotypes).

Crossing over occurs during meiosis
http://www.youtube.com/watch?v=BhJf9MHHmc4
http://www.youtube.com/watch?v=3qgBKrAZCLg

Crossing over during meiosis increases genetic variability
http://www.dnatube.com/video/350/Crossing-Over-increases-genetic-variability
If every homologous pair in humans has just one crossing over event, then there will be many possible new gametes (sperm or eggs) with many new haplotypes (how many depends on how the chromosomes randomly segregate and on the number of crossovers).
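To make the last point concrete, here is a small Python sketch under simplifying assumptions. It models one crossover at a fixed point between two three-marker haplotypes (the XYZ/xyz labels come from the figure), and it counts gametes produced by random segregation alone for the 23 human chromosome pairs. The helper names `crossover` and `assortment_combinations` are hypothetical, and real crossover positions are not fixed.

```python
def crossover(maternal, paternal, point):
    """Exchange everything after `point`, returning the two recombinant haplotypes."""
    recombinant_1 = maternal[:point] + paternal[point:]
    recombinant_2 = paternal[:point] + maternal[point:]
    return recombinant_1, recombinant_2


def assortment_combinations(n_pairs=23):
    """Each pair contributes one of two homologues, so segregation alone gives 2**n."""
    return 2 ** n_pairs


# One crossover between marker positions 1 and 2 of the parental haplotypes.
rec1, rec2 = crossover("XYZ", "xyz", 1)
print(rec1, rec2)  # Xyz xYZ

# Gamete combinations from random segregation of 23 chromosome pairs,
# before any crossing over is taken into account.
print(assortment_combinations())  # 8388608
```

Since a crossover can fall at many different positions along each chromosome, the number of distinct gametes once recombination is included is far larger than the segregation-only count.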
SNPs that are inherited close to one another on a given chromosome are said to be genetically “linked”.

            Maternal chromosome    Paternal chromosome
Patient A   SNP1 C, SNP2 A         SNP1 C, SNP2 A
Patient B   SNP1' T, SNP2' G       SNP1' T, SNP2' G

Haplotype refers to the set of alleles on one particular chromosome.
Patient C has two haplotypes:

            Maternal chromosome    Paternal chromosome
Patient C   SNP1 C, SNP2 A         SNP1' T, SNP2' G

Each haplotype is passed on to offspring as a complete unit unless recombination occurs between the SNPs to create new haplotypes.
A trio is the set of genotypes of mother, father and offspring.

Recombination in patient C leads to 2 new haplotypes in gametes (sperm or egg) that are passed on to the next generation:

            Maternal chromosome    Paternal chromosome
Patient C   SNP1 C, SNP2 A         SNP1' T, SNP2' G

            “New” chromosome       “New” chromosome
Gametes     SNP1 C, SNP2' G        SNP1' T, SNP2 A

http://www.youtube.com/watch?v=3qgBKrAZCLg

Because of recombination, a haplotype that surrounds a founder mutation will get shorter over generations as chromosomes mix.
It follows that a “recent” founder mutation will be associated with a long haplotype, and an “ancient” founder mutation with a short haplotype.
Sci. Amer. Oct 2005
This underlies the method of Genome Wide Association Studies (GWAS).

6: How to detect SNPs?

SNP assay requirements
a: Assay must be easily developed from sequence information
b: Low cost of assay development (reagents/personnel)
c: Assay must be robust
d: Easily automated
e: Simple analysis, accurate genotype calling
f: Scalable assay (up to millions/day)
g: Low cost per genotype assay
Genotyping methods are evolving rapidly and costs are decreasing greatly.

How can we detect SNPs?
Since most association studies require genotyping large numbers of individuals at a large number of SNPs, SNP assays must clearly distinguish between different alleles.
There are several methods, and this is an area of intense investigation and improvement.

Sequence-specific SNP Detection Methods

1: Hybridization: allele-specific probes that only hybridize when there is a perfect match. There are several methods to detect hybridization.
Affymetrix® SNP Array 6.0: 1.8 million SNPs, ~$400
http://www.affymetrix.com/estore/browse/staticHtmlContentTemplate.jsp?staticHtmlMediaId=m1621192&isHtmlStatic=true&navMode=35810&aId=productsNav

2: Nucleotide incorporation: addition of nucleotides with DNA polymerase can only occur if the 3’ end of the primer is a perfect match with the SNP.
This method can be miniaturized and large numbers of SNPs assayed in a short time.
e.g. Illumina Infinium II Assay Protocol: can assay 650,000 SNPs on one chip, with a three-day protocol from start to finish.
Infinium HD now does up to 1.2 million. www.illumina.com
Illumina Omni: 5 million SNPs, $580
For an online video see http://www.illumina.com/applications/genotyping.ilmn

Next lecture
1: Mapping complex traits using SNPs
2: Genome Wide Association Studies (GWAS)
3: Example of complex trait mapping: using SNP analysis to find a gene linked to genetic disease
Genome-wide association study of systemic sclerosis (autoimmune disease) identifies CD247 as a new susceptibility locus