Applications in Bioinformatics,
Proteomics, and Genomics
SNPs (1)
J. Gray (UT)
[email protected]
Oct 1 2015
88 million genetic variants
mapped in humans
The realization that DNA differs from person to person much more
than researchers had suspected may transform medicine but could
also threaten personal privacy.
Today's lecture
1: Genetic Variation in humans
2. What are SNPs?
3. Why should we care about SNPs ?
4. SNP Discovery – The SNP Consortium/The
International HapMap/The Personal Genome
Project/The 1000 Genomes Project
/10,000 genomes project
5. Haplotypes and how chromosomal
recombination gives rise to new Haplotypes
6. Overview of SNP detection methods
1: Understanding Human Genetic Variation
Human genome: 3 billion base pairs
All human genetic variation: ~100 million bp (~3%)
< 500,000 are medically relevant
“Every drop of human blood contains a history book
written in the language of our genes” - Spencer Wells
“The Journey of Man: A Genetic Odyssey” 2002
Founder mutations and genetic disease
Two men born in the US - thousands
of miles apart - share a condition
known as hereditary
hemochromatosis.
The error in their genes originated in
a single European ancestor, whose
descendants now number nearly 22
million
The original mutation is known as a
“founder mutation” due to
bottlenecks in human migration
The study of these mutations is
intimately linked to the study of the
recent evolution and spread of the
human species.
Simple illustration of founder
effect. The original population
is on the left with three
possible founder populations
on the right.
Human Migration in the
past 100K years
“Once modern humans began their migration out of Africa about
70,000 years ago, they kept going until they had spread to all corners
of the globe. How far and fast they went depended on climate, the
pressures of population and the invention of boats and other
technologies. Less tangible qualities also sped their footsteps:
imagination, adaptability and curiosity.”
Human demographic history has shaped the pattern of
variation observed in modern populations.
The general consensus is that Africa is the cradle of
modern humans (approx 200K years ago)
Genetic data shows that ALL non-Africans are the
descendants of a small group of Africans that moved
into the Middle East about 70K years ago.
Logarithmic scale
All humans are very closely related
Humans went through a very narrow genetic bottleneck: an estimated 1 to 10 million humans
were alive in the world after the last ice age (about 10K years ago)
The greatest diversity of genetic markers is in Africa
indicating it was the earliest home of modern humans. Only
a handful of people - carrying a few markers - left Africa
seeding the genetic makeup of the rest of the world.
National Geographic March 2006
See also http://www.bradshawfoundation.com/stephenoppenheimer/
Genetic mutations that act as markers trace the journey of
human migration. The earliest known mutation to spread
outside of Africa is M168 (Haplogroup CT), about 50K years
ago.
This graphic shows the Y chromosome of a Native American
man with various mutations including M168 (Haplogroup CT),
showing his African ancestry.
Founder mutations on Y
chromosome give rise to Haplotypes
“Eurasian Adam”
In human genetics, Haplogroup CT is a Y-chromosome
haplogroup, defining one of the major lines of
common ancestry of humanity along father-to-son male lines.
Men within this haplogroup have Y chromosomes with the SNP mutation M168,
along with P9.1 and M294. These mutations are present in all modern human male
lines except A and B, which are both found almost exclusively in Africa.
Origin and spread of Haplogroup CT
Haplogroup CT is therefore the common ancestral male
lineage of all men alive today except the ones that
belong to A or B haplogroups, including most Africans
Y-DNA Haplogroup Mutations Table
The Y haplotype is very stable because there is no
recombination happening with any other chromosome.
The mitochondrial genome supplies a similar grouping in
the maternal lineage.
Haplogroups Mutations
A no mutations
B SRY10831.1
C SRY10831.1>M168
D SRY10831.1>M168>M174
E SRY10831.1>M168>M96
F SRY10831.1>M168>M89
G SRY10831.1>M168>M89>M201
H SRY10831.1>M168>M89>M69
I SRY10831.1>M168>M89>M170
J SRY10831.1>M168>M89>M304
K SRY10831.1>M168>M89>M9
L SRY10831.1>M168>M89>M9>M11
M SRY10831.1>M168>M89>M9>M5
N SRY10831.1>M168>M89>M9>M214
O SRY10831.1>M168>M89>M9>M214>M175
O3 SRY10831.1>M168>M89>M9>M214>M175>M122
P SRY10831.1>M168>M89>M9>M45
Q SRY10831.1>M168>M89>M9>M45>P36
R SRY10831.1>M168>M89>M9>M45>M207
R1b SRY10831.1>M168>M89>M9>M45>M207>M343
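The table can be read as nested marker paths: each haplogroup carries the markers of its ancestors plus its own. As a minimal sketch in Python (the dictionary holds only a subset of the rows above, and the names `markers` and `shared_markers` are illustrative, not from the lecture), the markers two haplogroups share show where their paternal lines split:

```python
# Marker paths copied from a subset of the table above; each path lists
# the defining mutations from oldest to most recent.
HAPLOGROUP_PATHS = {
    "A": "",
    "B": "SRY10831.1",
    "C": "SRY10831.1>M168",
    "E": "SRY10831.1>M168>M96",
    "I": "SRY10831.1>M168>M89>M170",
    "Q": "SRY10831.1>M168>M89>M9>M45>P36",
    "R1b": "SRY10831.1>M168>M89>M9>M45>M207>M343",
}


def markers(haplogroup):
    """Return the list of markers for a haplogroup (empty for A)."""
    path = HAPLOGROUP_PATHS[haplogroup]
    return path.split(">") if path else []


def shared_markers(group1, group2):
    """Return the markers both haplogroups inherited before they diverged."""
    shared = []
    for m1, m2 in zip(markers(group1), markers(group2)):
        if m1 != m2:
            break
        shared.append(m1)
    return shared


print(shared_markers("Q", "R1b"))  # ['SRY10831.1', 'M168', 'M89', 'M9', 'M45']
print(shared_markers("E", "I"))    # ['SRY10831.1', 'M168']
```

Two men in haplogroups Q and R1b therefore share a paternal line down to M45, while E and I part ways right after M168.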
The pattern of genetic diversity in modern human
populations is the result of many evolutionary
processes.
New tools/resources promise to help identify
functional mutations important for normal phenotypic
variation as well as susceptibility to genetic disease.
The same approaches are just as important for
deciding how to protect biodiversity and in aiding
plant breeding and animal husbandry
Q: How much do humans differ ?
A: very very very little! But everyone is unique
The Human Genome Project (HGP) involved DNA from 9
individuals from diverse ethnic backgrounds
Identified about 26,000 genes and about 1.5 million
Single Nucleotide Polymorphisms – SNPs
These are the most prevalent form of genetic
variation in humans
The HGP was launched 25
years ago – above are
members of the 1989
meeting that launched the
project
http://www.nature.com/collections/dcfqmlgsrw
2: What is a SNP ?
(Single Nucleotide Polymorphism)
2: So what is a SNP ?
GCATGCATGCATGCAT
|||||||||||||||| Gene allele A1
CGTACGTACGTACGTA
GCATGCAaGCATGCAT
|||||||||||||||| Gene allele A2
CGTACGTtCGTACGTA
Comparing DNA between two individuals
shows that about every 1.5 kb there is one
base pair difference – a single nucleotide
polymorphism (SNP).
When a variant nucleotide is present in more
than one percent of a population, that DNA
position is the location of the SNP.
(less than 1% considered “rare” alleles).
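As a rough illustration of the two ideas on this slide, the Python sketch below first locates the base that differs between alleles A1 and A2, then applies the 1% rule. The function names and the chromosome counts in the usage lines are hypothetical, chosen only for the example.

```python
def find_differences(seq1, seq2):
    """Return (position, base1, base2) for every mismatch between two
    equal-length sequences; positions are 0-based."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be the same length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seq1.upper(), seq2.upper()))
        if a != b
    ]


def is_snp(variant_count, chromosomes_sampled, threshold=0.01):
    """Apply the >1% rule: True if the variant allele frequency exceeds 1%."""
    return variant_count / chromosomes_sampled > threshold


allele_a1 = "GCATGCATGCATGCAT"
allele_a2 = "GCATGCAaGCATGCAT"  # lower case marks the variant base on the slide
print(find_differences(allele_a1, allele_a2))  # [(7, 'T', 'A')]

# Hypothetical counts: variant seen on 30 or on 5 of 2000 sampled chromosomes.
print(is_snp(30, 2000))  # True  (1.5%)
print(is_snp(5, 2000))   # False (0.25%, a "rare" allele)
```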
Only 2% of genome encodes protein
93% of all annotated genes have at least 1 SNP
59% have 5 or more SNPs
39% have 10 or more SNPs
Often scientists distinguish between ancient “founder
mutations”, where the surrounding DNA is the same as in others in
the population, and “hot spot mutations”, which occur in
error prone regions.
Sci. Amer. Oct 2005
Old Originals versus
numerous newcomers
Sickle cell anemia is most
often caused by a
“founder mutation”
Achondroplasia (a form of
human dwarfism)
ordinarily results from a
“hotspot mutation”
Noteworthy Founder Mutations
Gene       Condition            Mutation origin       Migration                    Possible advantage of 1 copy
HFE        Iron overload        NW Europe             Across Europe                Protection from anemia
CFTR       Cystic fibrosis      SW Europe             Across Europe                Protection from diarrhea
HbS        Sickle cell disease  Africa / Middle East  To New World                 Protection from malaria
ALDH2      Alcohol toxicity     Far East Asia         North & West across Asia     Protection from alcoholism
LCT        Lactose tolerance    Asia                  West & North across Eurasia  Allows animal milk consumption
GJB2       Deafness             Middle East           West & North across Europe   Unknown
FV Leiden  Blood clots          W. Europe             Worldwide                    Protection from sepsis
In addition to SNPs there are
Copy Number variations (CNVs) or
Structural Variations (SVs)
CNVs can be caused by structural
rearrangements of the genome
such as deletions, duplications,
inversions, and translocations.
Some associated with disease,
most are not and some are
advantageous
Approximately 0.4% of the genome
of unrelated people typically differs
with respect to copy number
This gene duplication has created a copy-number
variation. The chromosome now has two copies of
this section of DNA, rather than one.
3: Why should we care about SNPs?
We want to know the
basis of human variation
and disease susceptibility
How can some who never
smoke get lung cancer
and others who smoke
heavily stay cancer free ?
Why do some people
exposed to HIV never
develop AIDS ?
SNPs are useful for:
1: DNA fingerprinting for criminal
or parental identification
2: Mapping polygenic/disease
traits by comparing DNA of
groups with and without
inheritance of that disease
3: Genotype-specific medication
(pharmacogenomics)
4: Studying human evolution
4: SNP Discovery
The urgency and importance of identifying thousands
of SNPs resulted in 11 major pharmaceutical and
technology companies cooperating (2001-2008)
First a pool of 24 DNAs was digested with one of
several restriction enzymes, size fractionated and
cloned into M13-based vectors.
Individual clones were sequenced, repeats discarded, and gene pairs
accepted only if 99% homologous.
SNP finding and validation steps isolated more than 1.5
million SNPs
www.hapmap.org
See also http://www.ncbi.nlm.nih.gov/SNP/
The Goal of the International HapMap Project was to
develop a “haplotype” map of the human genome, the
HapMap, which will describe the common (not rare)
patterns of human DNA sequence variation (variants in
>1% of population).
The HapMap became a key resource for researchers to use to find
genes affecting health, disease, and responses to drugs and
environmental factors. Phase 3 was completed and there are
>6 million SNPs defined.
The information is freely available.
(see Nature 27 Oct 2005 for report on phase 1 of project, Nature 18
Oct 2007 for phase II and 2 Sep 2010 for phase III)
Sequencing Entire Genomes – The Terabase era
July 10, 2008
DNA sequencing enters the terabase era
The Wellcome Trust Sanger Institute announced something
remarkable: its scientists had sequenced 300 human genomes
in six months.
In perspective. They sequenced more DNA every 2 seconds than was
sequenced during the first five years of international genome-sequencing
efforts, from 1982 to 1987. The institute has now sequenced 1 trillion =
1000 billion letters of the genetic code.
The cost of sequencing a human genome has fallen from:
$3 billion in 2001 (Human Genome Project)
$1 million in 2007 (for James Watson)
$50,000 in 2010 (James Lupski)
$1000 in Jan 2014 (Illumina 30X coverage HiSeq X Ten)
$1000 in Sep 2015 for Personal Genome Project Volunteers
The Personal Genome Project
The Personal Genome
Project (PGP) is a long
term, large cohort study
which aims to sequence
and publicize the
complete genomes and
medical records of
100,000 volunteers, in
order to enable research
into personal genomics
and personalized
medicine.
~5000 volunteers to date
www.personalgenomes.org
Dr. George Church
founder of project
Would you have your genome sequenced if you
could afford it?
Yes
No
Undecided
81%
9%
10%
If you had your genome sequenced would you
want to know everything?
Yes
No
Undecided
74%
16%
10%
In 2013 researchers were able to identify 50 people whose
DNA had been posted anonymously on the Internet for
genetics studies.
The results highlight a trade-off in making genetic data widely
available for researchers and protecting personal privacy.
SNP Discovery by sequencing individual genome
Lupski, J.R. et al., New England Journal of Medicine 362:1181-1191 2010
James Lupski, a physician-scientist who suffers from a
neurological disorder called Charcot-Marie-Tooth,
searched for the genetic cause for > 25 years.
Late last year, he finally found it by sequencing his
entire genome: it lies in SH3TC2 (the SH3 domain and
tetratricopeptide repeats 2 gene). Cost ~$50,000
First to show how whole-genome sequencing can be used
to identify the genetic cause of an individual's disease.
"I have hundreds of thousands of differences from all
the other genomes that have been sequenced. I expect
that to hold true for others. Everyone is truly unique.”
SNP Discovery by sequencing family genomes
How much genetic variation in each family?
Sequenced entire genome of two parents and 2 children
who both have a recessive genetic disease named Miller
Syndrome
Estimated a human intergeneration mutation rate of ~1.1 x
10^-8 per position per haploid genome
Showed with a high degree of certainty that each parent passes 30 new
mutations, for a total of 60, to their offspring
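A quick arithmetic check shows that the per-position rate and the per-parent count are consistent. This sketch assumes a haploid genome of about 3 billion bp (the figure used earlier in this lecture); the variable names are illustrative.

```python
MUTATION_RATE = 1.1e-8    # per position per haploid genome (from the slide)
HAPLOID_GENOME_BP = 3e9   # assumption: ~3 billion base pairs

# Expected new mutations in the one haploid genome each parent transmits.
new_per_parent = MUTATION_RATE * HAPLOID_GENOME_BP
# A child receives one haploid genome from each of two parents.
new_per_child = 2 * new_per_parent

print(round(new_per_parent))  # 33
print(round(new_per_child))   # 66
```

Roughly 33 per parent and 66 per child, in line with the rounded figures of 30 and 60 quoted above.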
Also narrowed candidate genes to just four
Roach et al., Analysis of Genetic Inheritance in a Family
Quartet by Whole-Genome Sequencing.
Science DOI: 10.1126/science.1186802 March 2010
SNP Discovery by sequencing 1000 genomes
With advances in sequencing technology, the 1000
genomes project became feasible – revealed more SNPs
than the HapMap project.
www.genome.gov/27542240 - useful video tutorials
Whose 1000 (actually 1096) genomes?
Figure S2. 1000 Genomes Project Phase I populations. A – Total number of samples
sequenced; B – Source of DNA (blood (bld) or LCL); C – Gender composition; D – Number
that are part of trios (t), parent-child duos (d) or singletons (s).
Phase III 2504 genomes
1000 Genomes Project Phase III populations. Population sampling. a, Polymorphic variants
within sampled populations. The area of each pie is proportional to the number of
polymorphisms within a population. Pies are divided into four slices, representing variants
private to a population (darker colour unique to population),
private to a continental area (lighter colour shared across continental group),
shared across continental areas (light grey),
and shared across all continents (dark grey).
Dashed lines indicate populations sampled outside of their ancestral continental region.
Nature 526 68-74 Oct 2015
Phase III 2504 genomes
1000 Genomes Project Phase III populations.
The number of variant sites per genome. The total number of observed non-reference sites
differs greatly among populations (Fig. 1b).
Individuals from African ancestry populations harbour the greatest numbers of variant sites, as
predicted by the out-of-Africa model of human origins.
Individuals from recently admixed populations show great variability in the number of variants,
roughly proportional to the degree of recent African ancestry in their genomes.
Nature 526 68-74 Oct 2015
Phase III 2504 genomes
1000 Genomes Project Phase III
populations.
~ 64 million autosomal variants
have a frequency <0.5%,
~ 12 million have a frequency
between 0.5% and 5%,
~ 8 million have a frequency >5%
Nevertheless, the majority of
variants observed in a single
genome are common: just 40,000 to
200,000 of the variants in a typical
genome (1–4%) have a frequency
<0.5%
Nature 526 68-74 Oct 2015
The number of variants within the phase 3
sample as a function of alternative allele
frequency.
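The counts on this slide can be turned into shares with a few lines of Python. The second part is a back-of-the-envelope estimate that assumes 40,000 rare variants corresponds to the 1% end of the range and 200,000 to the 4% end; that pairing is an assumption made here, not stated on the slide.

```python
# Autosomal variant counts by frequency bin, in millions (from the slide).
bins = {"<0.5%": 64, "0.5-5%": 12, ">5%": 8}
total = sum(bins.values())
shares = {label: round(100 * count / total, 1) for label, count in bins.items()}
print(total)   # 84
print(shares)  # {'<0.5%': 76.2, '0.5-5%': 14.3, '>5%': 9.5}

# Implied number of variant sites in a typical genome, under the
# assumed pairing of 40,000 <-> 1% and 200,000 <-> 4%.
implied_sites_low = round(40_000 / 0.01)
implied_sites_high = round(200_000 / 0.04)
print(implied_sites_low, implied_sites_high)  # 4000000 5000000
```

So about three quarters of all catalogued variants are rare across the sample, yet rare variants make up only a few percent of the roughly 4 to 5 million variant sites implied for a single genome.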
Phase III 2504 genomes
Table 1 and Fig 1c: the average number of singletons per genome is higher in African
populations, including LWK (Luhya in Webuye, Kenya), in keeping with Africa as the centre
of origin of humans. The variants most likely to affect gene function in a typical genome
comprised 149–182 sites with protein truncating variants, 10,000 to 12,000 sites with
peptide sequence-altering variants, and 459,000 to 565,000 variant sites overlapping
known regulatory regions.
Nature 526 68-74 Oct 2015
Whole Genome Sequencing
Deep whole-genome sequencing of 129 trios (mother-father-daughter) from 2 populations
Low-coverage sequencing of 179 unrelated individuals
from 4 populations
Exon sequencing of 906 randomly-selected genes in 697
individuals from 7 populations.
Overall Findings:
84.7 million SNPs (less than 1% of entire genome)
3.6 million Indels (Short Insertions/Deletions)
60,000 structural variants
Nature 526 68-74 Oct 2015
1000 genomes website
http://browser.1000genomes.org/index/html
All data is deposited at 1000genomes.org
Paper: A map of human variation from population-scale sequencing
Nature Vol 467 p 1061 October 2010
NCBI hosts a public SNP database dbSNP
http://www.ncbi.nlm.nih.gov/snp
10K Genomes Project identifies rare variants in
health and disease
Goal is to explore contribution
of rare and low-frequency
variants to human traits
Sequence whole genomes
(low read depth, 7x) or
exomes (high read depth,
80x) of nearly 10,000
individuals
http://www.uk10k.org
Found ~24 million novel
single nucleotide variants
(SNVs)
Nature 526 82-90 1st Oct 2015
10K Genomes Project identifies
rare variants in health and disease
Nature 526 82-90
1st Oct 2015
Figure 1 The UK10K-cohorts resource for variation discovery. Number
of SNVs identified in the UK10K-cohorts data set in all autosomal regions in
different allele frequency (AF) bins, and percentages that were shared with
samples of European ancestry from the 1000 Genomes Project (phase I,
EUR n=379) and/or the Genomes of the Netherlands (GoNL, n=499) study,
or unique to the UK10K-cohorts data set. AF bins were calculated using the
UK10K data set, for allele count (AC)=1, AC=2, and non-overlapping AF bins
for higher AC.
10K Genomes Project identifies
rare variants in health and disease
Sub-populations of the 10K were chosen for rare diseases, obesity and
neurodevelopmental problems. About 4000 were unselected. Having 10X more
European samples than the 1000GP yields substantial improvements in
imputation accuracy and coverage for low-frequency and rare variants
Nature 526 82-90 1st Oct 2015
10K Genomes Project identifies
rare variants in health and disease
Nature 526 82-90
1st Oct 2015
Figure 5 | Enrichment of single-marker associations by functional annotation in the UK10K-cohorts
study. Distribution of fold enrichment statistics for single-variant associations of low-frequency
(Minor Allele Frequency, MAF 1–5%) and common (MAF>5%) SNVs in near-genic
elements or selected chromatin states and DNase I hotspots (DHS).
Boxplots represent distributions of fold enrichment statistics estimated across the five (out of 31
core) traits where at least 10 independent SNVs were associated with the trait at the 10^-7 P value
(permutation test) threshold (HDL, LDL, TC, APOA1 and APOB).
5: Haplotypes and how chromosomal
recombination gives rise to new Haplotypes
xyz
XYZ
xyz
Haplotypes
Xyz Xyz xYz xYZ
During meiosis, homologous chromosomes (1 from each parent) pair along
their lengths. The chromosomes cross over at points called chiasmata. At each
chiasma, the chromosomes break and rejoin, trading some of their genes.
This recombination results in genetic variation (new haplotypes).
Crossing over occurs during Meiosis
http://www.youtube.com/watch?v=BhJf9MHHmc4
http://www.youtube.com/watch?v=3qgBKrAZCLg
Crossing Over during Meiosis increases
genetic variability
http://www.dnatube.com/video/350/Crossing-Over-increases-genetic-variability
If every homologous pair in humans has just one crossing over event then
there will be many possible new gametes (sperm or eggs) with many new
haplotypes (the number depends on how the chromosomes randomly segregate
and how many there are).
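To get a feel for "many", here is a small counting sketch in Python. It uses a deliberately simplified model that is an assumption for illustration only: each chromosome pair has a fixed number of possible crossover sites, at most one crossover occurs per pair, and the two homologues differ on both sides of every site, so each site yields two distinct recombinant chromosomes. The function name `gamete_types` is hypothetical.

```python
CHROMOSOME_PAIRS = 23  # human homologous pairs


def gamete_types(pairs, crossover_sites):
    """Count distinct gametes under the simplified model described above.

    Per pair a gamete can carry either parental chromosome (2 choices) or
    one of the two recombinants produced at each possible crossover site.
    """
    per_pair = 2 + 2 * crossover_sites
    return per_pair ** pairs


# Random segregation alone, with no crossing over: 2^23 combinations.
print(gamete_types(CHROMOSOME_PAIRS, 0))  # 8388608

# Allowing a single crossover at any of 10 hypothetical sites per pair.
print(gamete_types(CHROMOSOME_PAIRS, 10))
```

Even without any crossing over, random segregation gives more than 8 million combinations; allowing one crossover per pair multiplies that number enormously.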
SNPs that are inherited close to one another
on a given chromosome are said to be
genetically “linked”
            Maternal chromosome     Paternal chromosome
Patient A   SNP1 = C, SNP2 = A      SNP1 = C, SNP2 = A
Patient B   SNP1’ = T, SNP2’ = G    SNP1’ = T, SNP2’ = G
Haplotype refers to the set of alleles on
one particular chromosome
Patient C has two haplotypes
            Maternal chromosome     Paternal chromosome
Patient C   SNP1 = C, SNP2 = A      SNP1’ = T, SNP2’ = G
Each haplotype is passed on to
offspring as a complete unit unless
recombination occurs between them to
create new haplotypes
A trio is the set of genotypes of mother,
father and offspring
Recombination in patient C leads to 2 new
haplotypes in gametes (sperm or egg) that are
passed onto next generation
Patient C   Maternal chromosome: SNP1 = C, SNP2 = A
            Paternal chromosome: SNP1’ = T, SNP2’ = G
Gametes     “New” chromosome: SNP1 = C, SNP2’ = G
            “New” chromosome: SNP1’ = T, SNP2 = A
http://www.youtube.com/watch?v=3qgBKrAZCLg
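The recombination in patient C can be mimicked with a short Python sketch. Haplotypes are represented as lists of alleles ordered along the chromosome; the function name `crossover` and the breakpoint index are illustrative choices, not part of the lecture.

```python
def crossover(maternal, paternal, breakpoint):
    """Swap everything from `breakpoint` onward between two haplotypes.

    Each haplotype is a list of alleles ordered along the chromosome.
    """
    new1 = maternal[:breakpoint] + paternal[breakpoint:]
    new2 = paternal[:breakpoint] + maternal[breakpoint:]
    return new1, new2


# Patient C: alleles at the first and second SNP on each chromosome.
maternal = ["C", "A"]
paternal = ["T", "G"]

# A crossover between the two SNPs corresponds to breakpoint index 1.
print(crossover(maternal, paternal, 1))  # (['C', 'G'], ['T', 'A'])
```

The two results, C-G and T-A, are the "new" chromosomes; a breakpoint outside the interval between the SNPs (index 0 or 2) leaves the two parental combinations unchanged.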
Because of recombination a haplotype that
surrounds a founder mutation will get shorter
over generations as chromosomes mix
Sci. Amer. Oct 2005
It follows that a “recent” founder mutation will be
associated with a long haplotype, and an “ancient”
founder mutation with a short haplotype.
Sci. Amer. Oct 2005
This underlies the method of
Genome Wide Association Studies
(GWAS)
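The shrinking of a founder haplotype can be put into rough numbers. The sketch below rests on an assumed textbook-style model, not on anything stated in the lecture: crossovers fall at random at a rate of 1 per Morgan per generation and chromosome ends are ignored, so the intact stretch on each side of the mutation averages 1/g Morgans after g generations, or 200/g centimorgans in total.

```python
def expected_haplotype_cm(generations):
    """Expected length in centimorgans of the ancestral haplotype still
    surrounding a founder mutation after `generations` generations.

    Assumed model: random crossovers at 1 per Morgan per generation,
    chromosome ends ignored, so each flank averages 1/generations Morgans.
    """
    if generations < 1:
        raise ValueError("generations must be at least 1")
    return 200.0 / generations


for g in (10, 100, 1000):
    print(g, expected_haplotype_cm(g))
# 10 20.0
# 100 2.0
# 1000 0.2
```

Under these assumptions a mutation 10 generations old sits on a haplotype about 100 times longer than one 1000 generations old, which is the "recent is long, ancient is short" pattern described above.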
6: How to detect SNPs ?
SNP assay requirements
a: Assay must be easily developed from sequence
information
b: Low cost of assay development
(reagents/personnel)
c: Assay must be robust
d: Easily automated
e: Simple analysis, accurate genotype calling
f: Scalable assay (up to millions/day)
g: Low cost per genotype assay
Genotyping methods are evolving
rapidly and costs greatly decreasing
How can we detect SNPs ?
Since most association studies require
genotyping large numbers of individuals
at a large number of SNPs, SNP
assays must clearly distinguish between
different alleles.
There are several methods, and this is an
area of intense investigation and
improvement.
Sequence-specific SNP Detection Methods
1: Hybridization: Allele-specific probes that only
hybridize when there is a perfect match - several
methods to detect hybridization
Affymetrix® SNP Array 6.0
1.8 million SNPs ~ $400
http://www.affymetrix.com/estore/browse/staticHtmlContentTemplate.jsp?staticHtmlMediaId=m1621192&isHtmlStatic=true&navMode=35810&aId=productsNav
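The logic of allele-specific hybridization can be sketched in Python. This is an idealization for illustration only, not the Affymetrix procedure: a probe is taken to bind only when it is the exact reverse complement of a strand in the sample, with no partial binding or signal noise. The sequences are the A1 and A2 alleles from the "What is a SNP?" slide, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def call_genotype(sample_strands, probes):
    """Return the alleles whose probe perfectly base-pairs with a sample strand."""
    return sorted(
        allele
        for allele, probe in probes.items()
        if any(probe == reverse_complement(strand) for strand in sample_strands)
    )


A1 = "GCATGCATGCATGCAT"
A2 = "GCATGCAAGCATGCAT"
probes = {"A1": reverse_complement(A1), "A2": reverse_complement(A2)}

print(call_genotype([A1, A1], probes))  # ['A1']        homozygous A1
print(call_genotype([A1, A2], probes))  # ['A1', 'A2']  heterozygous
```

A real array reports fluorescence intensities rather than yes/no matches, so genotype calling there involves thresholds and clustering that this sketch leaves out.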
Sequence-specific SNP Detection Methods
2: Nucleotide incorporation: addition of
nucleotides with DNA polymerase can only occur
if the 3’ end of the primer is a perfect match with the SNP
This method can be miniaturized and large
numbers of SNPs assayed in a short time
e.g. Illumina Infinium II Assay Protocol
- can assay 650,000 SNPs on one chip
- three day protocol from start to finish
Now Infinium HD does up to 1.2 million
www.illumina.com
Illumina Omni 5 million SNPs $580
For online video see
http://www.illumina.com/applications/genotyping.ilmn
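The nucleotide incorporation principle (extension only when the primer's 3’ end matches the SNP) can be sketched as follows. This is a simplified model, not the Illumina protocol: sequences are written in the primer's own sense, so a string match here stands in for perfect base pairing with the complementary template strand. The primers, the SNP index and the function names are hypothetical, built from the A1 and A2 alleles shown earlier.

```python
def can_extend(primer, allele_seq, snp_index):
    """True if the primer, ending exactly at the SNP, matches the allele.

    Idealization: the polymerase extends only on a perfect match that
    includes the 3'-terminal base sitting on the SNP position.
    """
    start = snp_index - len(primer) + 1
    if start < 0:
        raise ValueError("primer is longer than the sequence up to the SNP")
    return primer == allele_seq[start:snp_index + 1]


def genotype(sample_alleles, primers, snp_index):
    """Return the allele names whose primer extends on any sample sequence."""
    return sorted(
        name
        for name, primer in primers.items()
        if any(can_extend(primer, seq, snp_index) for seq in sample_alleles)
    )


A1 = "GCATGCATGCATGCAT"
A2 = "GCATGCAAGCATGCAT"
SNP_INDEX = 7  # 0-based position of the variant base
primers = {"A1": "GCATGCAT", "A2": "GCATGCAA"}  # differ only at the 3' end

print(can_extend(primers["A1"], A2, SNP_INDEX))  # False: 3' mismatch
print(genotype([A1, A2], primers, SNP_INDEX))    # ['A1', 'A2']
```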
Next lecture
1: Mapping complex
traits using SNPs
2: Genome Wide
Association Studies
(GWAS)
3. Example of complex
trait mapping
Using SNP analysis to
find gene linked to
genetic disease
Genome-wide association study of systemic
sclerosis (autoimmune disease) identifies
CD247 as a new susceptibility locus