Introduction to genetic variation
He Zhang
Bioinformatics Core Facility
6/22/2016
Outline
• Basic concepts of genetic variation
• Genetic variation in human populations
• Variation and genetic disorders
• Databases and resources
Human genetic variation
• Genetic variation refers to the genetic differences
both within and among populations
• No two humans, including monozygotic twins,
are genetically identical.
• On average, in terms of DNA sequence, all
humans are more than 99% similar to any
other human
Primary sources of genetic variation
• Random mutations are the ultimate source of
genetic variation.
– DNA fails to copy accurately
– Induced mutation by chemicals or radiation
• Crossing over and random segregation during
meiosis can result in the production of new
alleles or new combinations of alleles.
Hereditary mutations (inherited mutations)
• Hereditary mutations are inherited from a parent and are
usually present throughout a person’s life in virtually every
cell in the body.
• These mutations are also called germline mutations
because they are present in the parent’s egg or sperm cells,
which are also called germ cells.
Acquired mutations
• Acquired mutations may occur relatively early in
development or at any later time throughout the
lifespan, generally affecting fewer cells
• These changes can be caused by environmental
factors such as ultraviolet radiation from the sun,
or can occur if a mistake is made as DNA copies
itself during cell division.
Acquired mutations
• Acquired mutations in somatic cells (cells
other than sperm and egg cells) cannot be
passed on to the next generation.
Donald Freed, et al, 2014
Acquired mutations can be inherited in some cases
• Acquired mutations may occur at an early stage of development and affect
both germ cells and somatic cells
• An acquired mutation may occur in a person’s egg or sperm cell but not be
present in any of the person’s other cells.
• In other cases, the mutation occurs in the fertilized egg shortly after the
egg and sperm cells unite.
Donald Freed, et al, 2014
Mosaic mutations
• Acquired mutations that happen in a single cell in
embryonic development can lead to a situation called
mosaicism.
• These genetic changes are not present in a parent’s egg
or sperm cells, or in the fertilized egg, but happen a bit
later when the embryo includes several cells.
• As all the cells divide during growth and development,
cells that arise from the cell with the altered gene will
have the mutation, while other cells will not.
De novo mutations
• De novo mutations are operationally defined as
genotypes observed in a child but not in either parent.
• They may originate in a parental germ cell or
postzygotically
Donald Freed, et al, 2014
Mutation rate in human genome
• The overall error rate of DNA polymerase is 10^-8 per base
pair. Repair enzymes fix 99% of these lesions for an overall
error rate of 10^-10 per bp.
• The mutation rate in some somatic cells can reach 1.06x10^-6
per bp (David Araten, et al, 2005)
• ~40 germline de novo mutations per generation (Donald
Conrad, et al, 2011)
• ~1,500 non-germline de novo mutations per person
(Donald Conrad, et al, 2011)
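The arithmetic behind the first bullet can be checked directly. This is a minimal sketch using only the rates quoted above; the haploid genome size of about 3.2 billion bp is an assumption added for illustration and does not come from the slides.

```python
# Rates quoted on the slide.
polymerase_error_rate = 1e-8   # errors per base pair before repair
repair_fraction = 0.99         # share of lesions fixed by repair enzymes

# Only the unrepaired 1% of errors remain after repair.
net_error_rate = polymerase_error_rate * (1 - repair_fraction)

# ASSUMPTION (not in the slides): a haploid genome of roughly 3.2e9 bp.
HAPLOID_GENOME_BP = 3.2e9
expected_errors = net_error_rate * HAPLOID_GENOME_BP

print(f"net error rate: {net_error_rate:.1e} per bp")
print(f"expected uncorrected errors per haploid genome copy: {expected_errors:.2f}")
```

Under that assumed genome size, a single replication would leave well under one uncorrected error per haploid copy, which helps explain why the per-generation count of germline de novo mutations is in the tens rather than the thousands.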
Types of genetic variation
• Single Nucleotide Polymorphism (SNP)
• Insertion/Deletion (Indel)
• Copy Number Variation (CNV)
• Rearrangement
– Inversion
– Translocation
– Segmental duplication
• Numerical variation
– Polyploidy
– Aneuploidy
Single Nucleotide Polymorphism (SNP)
• A SNP is a variation at a specific position in the
genome in which a single nucleotide is exchanged for
another
• Transitions: replacement of a purine base with another
purine or replacement of a pyrimidine with another
pyrimidine
– A <-> G , C <-> T
• Transversions: replacement of a purine with a pyrimidine or
vice versa.
– A <-> C , A <-> T , C <-> G , G <-> T
• ts/tv ~ 2 for whole genome
• ts/tv ~ 3 for whole exome
(Figure legend: α = transition, β = transversion)
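The transition/transversion rules above translate directly into a small classifier. This is an illustrative sketch; the function names and the toy list of SNPs are hypothetical and not taken from any real call set.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or {ref, alt} - (PURINES | PYRIMIDINES):
        raise ValueError(f"not a valid SNP: {ref}>{alt}")
    # Same chemical class (purine<->purine or pyrimidine<->pyrimidine).
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_ratio(snps):
    """Compute ts/tv for an iterable of (ref, alt) pairs."""
    kinds = [classify_snp(ref, alt) for ref, alt in snps]
    ts = kinds.count("transition")
    tv = kinds.count("transversion")
    if tv == 0:
        raise ValueError("no transversions; ratio is undefined")
    return ts / tv


# Hypothetical toy data: four transitions and two transversions.
toy_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(ts_tv_ratio(toy_snps))
```

In practice the ratio is often used as a quick quality check: a call set whose ts/tv falls far from the values quoted above may contain many false positives.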
Effects of SNPs in coding sequence
• Silent mutation
– A silent mutation changes a codon but doesn’t affect the protein
sequence
– Different codons can lead to differential protein expression levels
• Missense mutation
– A missense mutation changes a codon and generates a different amino
acid.
• Nonsense mutation
– A nonsense mutation converts an amino acid codon into a termination
codon.
– This shortens the protein because the stop codon interrupts its
normal coding sequence
• Read-through mutation
– A read-through mutation changes a stop codon to a sense codon
• Splice site mutation
– Results in one or more introns remaining in mature mRNA and may
lead to the production of abnormal proteins
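The first four categories above can be told apart by translating the codon before and after the change. The sketch below assumes the standard nuclear genetic code and DNA codons; the function name is hypothetical, and splice site mutations are not covered because they cannot be detected from a single codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Classify a codon substitution by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"          # same amino acid (or stop to stop)
    if alt_aa == "*":
        return "nonsense"        # sense codon becomes a stop codon
    if ref_aa == "*":
        return "read-through"    # stop codon becomes a sense codon
    return "missense"            # different amino acid


for ref, alt in [("CTT", "CTC"), ("GAG", "GTG"), ("TAC", "TAA"), ("TAA", "CAA")]:
    print(ref, "->", alt, classify_codon_change(ref, alt))
```

The GAG to GTG example is the glutamate-to-valine substitution seen in sickle cell disease, one of the monogenic disorders mentioned later in these slides.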
Insertion/Deletion (Indel)
• Insertions add one or more extra nucleotides
into the DNA.
• Deletions remove one or more nucleotides
from the DNA.
• They are usually caused by transposable
elements, or errors during replication of
repeating elements.
Effect of Indels in coding sequence
• Reading-frame shift
– Occurs when the number of nucleotides inserted into or deleted
from a coding sequence is not divisible by three
– The message in the gene is no longer correctly parsed.
• Insertion or deletion of one or more amino acids
• Altering splicing of the mRNA
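The divisible-by-three rule can be expressed as a short check. This is a simplified sketch: the function name is hypothetical, alleles are given as plain reference/alternate strings, and effects on splicing are ignored.

```python
def indel_effect(ref_allele, alt_allele):
    """Label an indel in coding sequence as frameshift or in-frame."""
    delta = len(alt_allele) - len(ref_allele)
    if delta == 0:
        raise ValueError("alleles have equal length; not an indel")
    kind = "insertion" if delta > 0 else "deletion"
    # A length change that is not a multiple of 3 shifts the reading frame.
    if abs(delta) % 3 != 0:
        return f"frameshift {kind}"
    # A multiple of 3 adds or removes whole codons (amino acids).
    return f"in-frame {kind}"


print(indel_effect("A", "AT"))       # 1 bp inserted
print(indel_effect("ATCTT", "AT"))   # 3 bp deleted
```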
Inversion
• An inversion is a chromosome rearrangement
in which a segment of a chromosome is
reversed end to end.
• Inversions do not change the overall amount
of the genetic material
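An inversion can be illustrated on a toy sequence. Because DNA is double-stranded, a segment that is reversed end to end reads as its reverse complement on the reference strand; the sketch below models it that way. The function name, sequence, and coordinates are all hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def invert_segment(seq, start, end):
    """Invert seq[start:end] (0-based, end-exclusive) in place on one strand."""
    segment = seq[start:end]
    inverted = segment.translate(COMPLEMENT)[::-1]  # reverse complement
    return seq[:start] + inverted + seq[end:]


original = "AAACCGTTT"
rearranged = invert_segment(original, 3, 6)
print(original, "->", rearranged)
print("length unchanged:", len(original) == len(rearranged))
```

The total length is preserved, which matches the point that inversions do not change the overall amount of genetic material.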
Effect of inversions
An Introduction to Genetic Analysis. 7th edition
Translocation
• Translocation is a chromosome abnormality
caused by rearrangement of parts between
nonhomologous chromosomes
Braude P, et al, 2002
Effect of translocations
• Balanced translocation
– An even exchange of material with no genetic
information extra or missing, and ideally full
functionality
• Unbalanced translocation
– The exchange of chromosome material is unequal
resulting in extra or missing genes
Unbalanced translocation
Copy Number Variation (CNV)
• A copy-number variation (CNV) is a difference in the genome due to
the deletion or duplication of a large region of DNA on a chromosome.
• Duplications lead to multiple copies of the affected chromosomal
regions, increasing the dosage of the genes located within them.
• Deletions of large chromosomal regions lead to loss of the
genes within those regions.
• Recent research indicates that approximately two thirds of the
entire human genome is composed of repeats and 4.8-9.5% of the
human genome can be classified as copy number variations (Mehdi
Zarrei, et al, 2015).
Polyploidy and Aneuploidy
• Polyploidy refers to a numerical change in a whole set of
chromosomes
• Polyploidy occurs in humans in the form of triploidy, with
69 chromosomes and tetraploidy with 92 chromosomes.
• Aneuploidy refers to a numerical change in part of the
chromosome set
• Karyotypes with 45 or 47 chromosomes are the common
aneuploidies found in humans
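The chromosome counts above follow from a haploid set of 23 chromosomes (69 = 3 x 23, 92 = 4 x 23). The sketch below classifies a karyotype from its total chromosome count alone, which is a deliberate simplification: it cannot see structural changes or a gain and a loss that cancel out. The function name is hypothetical.

```python
HAPLOID_SET = 23  # chromosomes in one human haploid set

PLOIDY_NAMES = {1: "haploid", 2: "diploid", 3: "triploid", 4: "tetraploid"}


def describe_karyotype(chromosome_count):
    """Describe a karyotype using only the total chromosome count."""
    if chromosome_count <= 0:
        raise ValueError("chromosome count must be positive")
    sets, remainder = divmod(chromosome_count, HAPLOID_SET)
    if remainder == 0:
        # Whole sets only: a change here is a change in ploidy.
        return PLOIDY_NAMES.get(sets, f"{sets}-ploid")
    # Anything else is a gain or loss of part of the set.
    return "aneuploid"


for count in (45, 46, 47, 69, 92):
    print(count, describe_karyotype(count))
```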
Genetic variations in human populations
Human reference genome
• The human genome is the complete set of
nucleic acid sequences for humans (Homo
sapiens)
• Haploid human genome
– 22 autosomes
– X chromosome
– Y chromosome
Human reference genome
• The human reference genome does not correspond to any
actual human individual
• Genome Reference Consortium human genome (build
37) is a mosaic haploid genome derived from 13
anonymous volunteers
– One male accounts for 66% of the total
• The latest human reference genome (GRCh38)
integrated whole genome sequencing data from other
projects to improve completeness, but still has
gaps covering ~5% of the genome
How many variants in human genomes
• A typical genome differs from the reference human
genome at 4.1 million to 5.0 million sites
• >99.9% of variants consist of SNPs and short indels
• 2,100 to 2,500 structural variants (affecting ~20 million
bases of sequence)
Population  Samples  Mean coverage  SNPs   Indels  Large deletions  CNVs  Inversions
AFR         661      8.2            4.31M  625k    1.1k             170   12
AMR         347      7.6            3.64M  557k    949              153   9
EAS         504      7.7            3.55M  546k    940              158   10
EUR         503      7.4            3.53M  546k    939              157   9
SAS         489      8              3.60M  556k    947              165   11
The 1000 Genomes Project Consortium, 2015
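The claim that more than 99.9% of variants are SNPs and short indels can be sanity-checked against the per-population counts listed above. This is a rough sketch: it uses only the three structural classes shown in the table (large deletions, CNVs, inversions), so the totals are approximate, and the variable names are hypothetical.

```python
# (SNPs, indels, large deletions, CNVs, inversions) per population,
# copied from the table above.
counts = {
    "AFR": (4.31e6, 625e3, 1.1e3, 170, 12),
    "AMR": (3.64e6, 557e3, 949, 153, 9),
    "EAS": (3.55e6, 546e3, 940, 158, 10),
    "EUR": (3.53e6, 546e3, 939, 157, 9),
    "SAS": (3.60e6, 556e3, 947, 165, 11),
}

small_variant_fraction = {}
for population, (snps, indels, large_dels, cnvs, inversions) in counts.items():
    small = snps + indels
    total = small + large_dels + cnvs + inversions
    small_variant_fraction[population] = small / total

for population, fraction in small_variant_fraction.items():
    print(f"{population}: {fraction:.4%} SNPs and short indels")
```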
Loss of function variants in human genome
• Human genomes typically contain ~100
genuine loss of function (LoF) variants with
~20 genes completely inactivated
Genetic diversity in different populations
(Figure: populations compared are Europe, East Asia, South Asia, America, and Africa)
Modern humans originated from Africa
L. Luca Cavalli-Sforza & Marcus W. Feldman, 2003
Bottleneck effects during migrations reduced
the diversity of human genetic variation
Michael C. Campbell and Sarah A. Tishkoff, 2009
• Effective population size (Albert Tenesa et al, 2009)
– Non-African populations: ∼3100
– African populations: ∼7500
Genetic variation exists between populations
• Founder effect and past small population size
(increasing the likelihood of genetic drift) may
have had an important influence in neutral
differences between populations.
• Natural selection may confer an adaptive
advantage to individuals in a specific environment
if an allele provides a competitive advantage.
• Genetic drift causes some neutral mutations to become
fixed or to disappear at random in a population.
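Genetic drift can be illustrated with a small simulation. The sketch below uses the Wright-Fisher model, a standard textbook idealization that is not named in the slides: each generation, every allele copy is drawn at random from the previous generation, with no selection or mutation. The function name, population size, and seed are hypothetical choices for illustration.

```python
import random


def wright_fisher(pop_size, start_freq, generations, seed=0):
    """Simulate the frequency of a neutral allele under pure drift.

    pop_size is the number of diploid individuals, so each generation
    has 2 * pop_size allele copies. Stops early once the allele is
    lost (frequency 0) or fixed (frequency 1).
    """
    rng = random.Random(seed)
    n_copies = 2 * pop_size
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        # Binomial sampling: each copy inherits the allele with probability freq.
        count = sum(rng.random() < freq for _ in range(n_copies))
        freq = count / n_copies
        trajectory.append(freq)
        if freq in (0.0, 1.0):
            break
    return trajectory


traj = wright_fisher(pop_size=20, start_freq=0.5, generations=2000, seed=1)
outcome = "fixed" if traj[-1] == 1.0 else "lost" if traj[-1] == 0.0 else "segregating"
print(f"after {len(traj) - 1} generations the allele is {outcome}")
```

Rerunning with a larger pop_size generally takes many more generations to reach fixation or loss, which is the sense in which small past population sizes increase the influence of drift.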
Genes mirror geography within Europe
Nature. 2008 Nov 6; 456(7218): 98–101.
Variations and genetic disorders
Genetic variants and health
• Most of the variants in the human genome do not affect health
• A typical human genome contains ~100 loss of function
(LoF) variants with ~20 genes completely inactivated
(Daniel MacArthur, et al, 2012).
• LoF variants found in healthy individuals will fall into several
overlapping categories
– Severe recessive disease alleles in the heterozygous state
– Alleles that are less deleterious but nonetheless have an impact
on phenotype and disease risk
– Benign LoF variation in redundant genes
– Genuine variants that do not seriously disrupt gene function
Genetic disorder
• A genetic disorder is a disease caused in whole or in
part by a change in the DNA sequence away from the
normal sequence.
• Genetic disorders can be caused by a mutation in one
gene (monogenic disorder), by mutations in multiple
genes (multifactorial inheritance disorder), by a
combination of gene mutations and environmental
factors, or by damage to chromosomes (changes in the
number or structure of entire chromosomes, the
structures that carry genes).
Monogenic disorders
• Monogenic disorders (single-gene disorders, Mendelian
disorders) are caused by mutations in a single gene.
• These are usually rare diseases.
• The mutation may be present on one or both chromosomes
• Over 4000 human diseases are caused by single-gene
defects
– Sickle cell disease
– Cystic fibrosis
Multifactorial inheritance disorders
• Multifactorial inheritance disorders are caused by
a combination of variations in different genes,
often acting together with environmental factors.
• The effect of each variant/gene is usually small
• Many common diseases including cardiovascular
disease, diabetes, and most cancers are examples
of such disorders.
Chromosome disorders
• Chromosome disorders are caused by an excess
or deficiency of the genes that are located on
chromosomes, or by structural changes within
chromosomes.
• Down syndrome is caused by an extra copy of
chromosome 21 (called trisomy 21)
• Prader-Willi syndrome is caused by the absence
or non-expression of a group of genes on
chromosome 15.
Genetic Mapping in Human Diseases
• Genetic mapping is the localization of genes
underlying phenotypes on the basis of
correlation with DNA variation
• Methods
– Linkage analysis
– Association study
Linkage analysis
• Genetic linkage analysis is a statistical method that is used
to associate the functionality of genes with their location on
chromosomes.
• It is based on the observation that genes that reside
physically close on a chromosome remain linked during
meiosis.
• If a disease is often passed to offspring along with
specific marker genes, then it can be concluded that the
gene(s) responsible for the disease are located
close to these markers.
• A pedigree is required for linkage analysis
Association study
• Genetic association studies test for a
correlation between disease status and
genetic variation
• Case-control studies investigate whether the
allele frequency differs significantly
between the cases and their matched control
group
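One common way to test whether allele frequencies differ between cases and controls is a Pearson chi-square test on a 2x2 table of allele counts. The slides do not specify a test, so this choice is an assumption, and the sketch omits continuity correction and any adjustment for population stratification. The function name and the counts are hypothetical.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on allele counts.

    Returns (chi2, p_value). Counts are numbers of allele copies,
    i.e. two per diploid individual.
    """
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero")
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical data: 100 cases and 100 controls (200 alleles each).
chi2, p_value = allelic_chi_square(case_alt=60, case_ref=140,
                                   control_alt=40, control_ref=160)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

In a genome-wide setting the same test is repeated at many variants, so the significance threshold has to be far stricter than the conventional 0.05.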
Population-based design
• Cases and controls are unrelated
• Easier to collect
• Susceptible to population stratification bias
Family-based design
• Cases and controls are related: parents, siblings, etc.
– Commonly used design: case-parent trios
• Not susceptible to population stratification bias
• Not easy to collect
• Not appropriate for late-onset diseases
(Pedigree figure legend: female, male, disease-affected, healthy)
Databases and resources
NCBI dbSNP and dbVar
• The Single Nucleotide Polymorphism database (dbSNP) is a public-domain
archive for a broad collection of simple genetic polymorphisms.
• dbVar is NCBI's database of genomic structural variation (SV)
1000 Genomes Project
• The goal of the 1000 Genomes Project was to find most genetic
variants with frequencies of at least 1% in the populations studied.
• The project reconstructed the genomes of 2,504 individuals from 26
populations using a combination of low-coverage whole-genome sequencing,
deep exome sequencing, and dense microarray genotyping.
Exome Aggregation Consortium
• The Exome Aggregation Consortium (ExAC) is a coalition of
investigators seeking to aggregate and harmonize exome
sequencing data from a variety of large-scale sequencing projects.
• The data set spans 60,706 unrelated individuals.
OMIM
• Online Mendelian Inheritance in Man (OMIM) is a
comprehensive, authoritative compendium of human genes and
genetic phenotypes that is freely available and updated
daily
• http://omim.org
Class of phenotype                              Phenotype  Gene*
Single gene disorders and traits                4,728      3,182
Susceptibility to complex disease or infection  700        499
"Nondiseases"                                   141        111
Somatic cell genetic disease                    202        115
NCBI ClinVar
• ClinVar is a public archive of reports of the
relationships among human variations and
phenotypes, with supporting evidence.