Algorithms for Genetics: Introduction, and sources of variation
Instructor: Vineet Bafna∗
Scribe: David Dean
1 Terms
Genotype: the genetic makeup of an individual. For example, we may refer to an individual as having a
heterozygous genotype "Aa", or a homozygous genotype "AA" or "aa" for a particular gene.
Phenotype: a measurable trait of an organism, usually due to genetic variation. A phenotype may refer
to a common trait, such as height, the presence of a particular disease, or some other measurable
biological characteristic.
Gene: a region of an organism’s genome that codes for inherited biological traits. Some genes have been
found to play a critical role in the development of disease, such as the ApoE gene, whose ApoE4 allele is
associated with Alzheimer’s disease.
Allele: a specific genetic variant at a location. For example, the locus for the ApoE gene has 3 major
variants, or alleles, ApoE2, ApoE3, and ApoE4.
Locus: the location of an allele; can refer to a nucleotide position, a genetic marker, a gene, or a chromosomal
segment. For example, "19q13.2" refers to a particular location on chromosome 19.
Ploidy: the number of copies of each chromosome that is contained in somatic (non-gamete) cells of a
species. In humans and most other animal species, the somatic cells are usually diploid, meaning
they have 2 copies of each chromosome, whereas the gamete cells are haploid and have a single copy
of each chromosome. Some plant and animal species are known to have more than 2 copies of each
chromosome, which is called polyploidy.
Haplotype: a particular combination of alleles in an individual that are located on a single chromosome.
For example, an individual may carry the sequence of alleles "DEf" on one chromosome and a different
sequence, "DeF", on the homologous chromosome.
2 Sources of Variation
A number of mechanisms introduce variation into a population. The main sources are described below.
Point Mutations: Small-scale mutational events:
* The typical mutation rate seen in humans is low, estimated to be about $10^{-8}$ per base pair per
generation. Point mutations arise from errors in DNA replication and from exposure to mutagens, such as
UV or ionizing radiation.
* The infinite sites assumption states that each site will undergo at most one mutation
over the course of human evolution. Perhaps the biggest implication of the infinite sites assumption is
that it enables a phylogenetic record of evolutionary history to be constructed. SNPs from mitochondrial
DNA, which is inherited only through the mother and does not recombine, can be analyzed to construct
an ancestry for an individual in a population.
∗ Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
* A single nucleotide polymorphism, or SNP, is a point mutation for which both the wild-type allele and the
mutated allele are known. Essentially, a SNP refers to a single nucleotide base that is known to have
mutated to a different nucleotide base at some point during evolution.
Given the infinite sites assumption, a convenient way to view and analyze SNPs from population data
is to create a binary matrix, using individuals as rows and variant sites as columns. Each individual
can be represented by 1 or 2 rows, which contain the sequence data of one or both of their
chromosomes. By convention, '0' represents the ancestral allele and '1' represents the
mutated allele, if it is known which allele is ancestral.
* Short tandem repeats (STRs) are regions of DNA where a short DNA sequence is repeated a variable
number of times. For a given locus, different individuals can have different numbers of repeats.
[Figure: example showing a variable number of repeats of the sequence 'ATC']
To create a "DNA fingerprint", a set of STR locations can be chosen such that the set of
measured repeat lengths forms a unique identifier for an individual. Enough STR locations are chosen
to ensure that it is extremely unlikely for two people to have the same DNA fingerprint. A system
used by the FBI, the Combined DNA Index System (CODIS), performs a form of DNA fingerprinting that
examines 13 core loci that have variable numbers of STRs. The alleles from these loci are generally
inherited independently, which means the probability of having a particular combination of STR values
can be determined by multiplying the probabilities of having a particular STR value at each locus.
This procedure creates a DNA fingerprint so distinctive that the probability of two individuals
having the same fingerprint is less than $10^{-18}$. One of the few exceptions where this breaks down is
the case of identical twins.
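The product rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the CODIS procedure itself: the per-locus probability of 0.04 is a made-up value chosen only to show how 13 independent loci can push the match probability below $10^{-18}$, and `profile_probability` is a hypothetical helper name.

```python
from math import prod

def profile_probability(per_locus_probs):
    """Probability of a full STR profile, assuming the loci are inherited
    independently so that per-locus probabilities simply multiply."""
    return prod(per_locus_probs)

# Hypothetical input: each of 13 loci has a genotype seen in 4% of people.
per_locus = [0.04] * 13
match_probability = profile_probability(per_locus)

print(f"{match_probability:.2e}")  # about 6.7e-19, i.e. below 1e-18
```

With fewer loci the same calculation gives a much larger probability (for example, 5 loci at 0.04 each give about $10^{-7}$), which is why a sufficiently large panel of loci is needed.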
Structural Variation: Large-scale mutation events:
* A structural variation is a large section of DNA that is inserted, deleted, or inverted in a genome.
These large-scale genetic changes can cause disease, including certain cancers.
* An experimental protocol to detect structural variations would compare regions
of an individual’s genome to those of a "wild-type" (reference) human genome. The haplotype map provided by
the International HapMap Consortium is an example of a human genome database that could be
used. Also, a karyotype of whole chromosomes can identify large structural changes to a
chromosome. In a karyotype, the chromosomes are conventionally ordered from largest to smallest.
Recombination: Variation due to crossover
* Recombination events are caused by crossing-over between homologous chromosomes during meiosis (the cell
division that produces gametes). This mixes genetic material between the two chromosomes. DNA recombination
can also refer to an artificial recombination of DNA performed in a biology lab, such as inserting a
gene into an E. coli bacterium.
* The typical recombination rate for humans is similar to the mutation rate, estimated to be about $10^{-8}$
per base pair per generation.
* Not all of human DNA recombines. In particular, mitochondrial DNA (inherited from the mother only)
and most of the Y chromosome (inherited from the father only) do not recombine.
Gene conversion: Variation due to non-reciprocal transfer
* During gene conversion, a gene on one chromosome is copied onto the homologous gene on the other
chromosome, leaving the first chromosome unchanged. This is similar to recombination in that genes
are being transferred from one chromosome to another. However, in recombination, DNA is exchanged
between the two chromosomes, whereas with gene conversion, only one of the chromosomes is changed.
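The difference between the two mechanisms can be made concrete with a toy model. The sketch below treats a haplotype as a string of allele labels (reusing the "DEf" / "DeF" example from the Terms section); the function names and the single-breakpoint model of crossover are simplifying assumptions made for illustration.

```python
def crossover(hap_a, hap_b, point):
    """Reciprocal exchange: the two haplotypes swap everything from
    position `point` onward, so both chromosomes change."""
    return hap_a[:point] + hap_b[point:], hap_b[:point] + hap_a[point:]

def gene_conversion(donor, recipient, start, end):
    """Non-reciprocal copy: alleles donor[start:end] overwrite the same
    positions in the recipient; the donor is returned unchanged."""
    return donor, recipient[:start] + donor[start:end] + recipient[end:]

hap_1, hap_2 = "DEf", "DeF"

# Crossover after the second locus: both haplotypes change.
crossed = crossover(hap_1, hap_2, 2)              # ("DEF", "Def")

# Gene conversion at the second locus: only the recipient changes.
converted = gene_conversion(hap_1, hap_2, 1, 2)   # ("DEf", "DEF")
```

After the crossover, allele "e" is still present in the pair of chromosomes; after the gene conversion it has been overwritten, which is why gene conversion can change allele counts while crossover only reshuffles them.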
3 Equilibria
Population geneticists study the entirety of genetic variation (genotypes) and its consequences for phenotypes.
As variants arise and disappear within a population, they give rise to several equilibria under 'neutral'
conditions. An important goal in population genetics is to find regions that are not in equilibrium and to
investigate the cause of this departure.
Hardy-Weinberg equilibrium: an equilibrium of allele frequencies.
* The Hardy-Weinberg equilibrium is defined as follows. Given that a set of assumptions is met (including
large population size, random mating, no natural selection, etc.), then for a locus that has two alleles,
A and a, with frequencies $p$ and $q$ (where $p + q = 1$), the frequencies of the 3 possible genotypes are
$p^2$ (for AA), $2pq$ (for Aa), and $q^2$ (for aa).
* The Hardy-Weinberg equilibrium can be extended to multiple alleles with frequencies $p_i$, $i = 1, 2, \ldots, k$.
In that case, the HW equilibrium states something similar: the frequency of a
homozygous genotype is $p_i^2$, and the frequency of a heterozygous genotype is $2 p_i p_j$.
* The HW equilibrium can also be extended to consider multiple loci. If the alleles of different loci are not
linked (e.g. the loci are on different chromosomes), then the frequency of a combined genotype is simply the
product of the frequency of the genotype at one locus and the frequency of the genotype at the other locus.
These frequencies are independent and can thus be multiplied together. If the loci are
linked, one needs to know the probability of each combination of alleles being inherited together.
For example, if we consider 2 loci that have 2 alleles each, we can label the 4 alleles 'A', 'a', 'B', and 'b'.
Then, if we know the probability of these alleles being inherited together (i.e. P(AB), P(Ab), P(aB),
and P(ab)), these combinations can be treated as multiple alleles at a single locus. Applying the
HW equilibrium to multiple alleles at a single locus is described above.
* If we assume an infinitely large population with random mating, the allele frequency does not change from
generation to generation. The allele frequency remains constant over time because the inheritance of
alleles follows the laws of probability. Going back to our simple example of two alleles at a single locus,
A and a, with frequencies $p$ and $q$, we should have a population with the genotype frequencies $p^2$ (for
AA), $2pq$ (for Aa), and $q^2$ (for aa). An AA parent always transmits A, an Aa parent transmits A half of
the time, and an aa parent never does, so the frequency of A expected in the next generation is:
\[
\begin{aligned}
p_{\text{next generation}} &= (1)\,p^2 + (0.5)\,2pq + (0)\,q^2 \\
&= p^2 + pq \\
&= p^2 + p(1 - p) \\
&= p^2 + p - p^2 \\
&= p
\end{aligned}
\]
The infinite population size enables any deviations from the expected frequency to be averaged out.
With finite populations, and especially small populations, the allele frequencies can vary
randomly from generation to generation due to a sampling effect. This effect is called genetic drift.
* Example: Phenylketonuria
Phenylketonuria (PKU) is a disease that is caused by an autosomal recessive allele, with an observed
frequency of 1 in 10,000 Caucasians. With this information, we can use the HW equilibrium to calculate
the frequency of this allele in the population, and the percentage of the population who are
carriers of the allele. If we define $q$ to be the frequency of the recessive allele (a), then the disease
genotype (aa) should occur with frequency $q^2$. Setting $q^2 = 1/10000$ gives $q = 1/100$,
so the frequency of the allele in the population is 1/100. The frequency of carriers is
the value of $2pq$ (for the genotype Aa): $2(99/100)(1/100) = 0.0198$, or about 2% of the population.
* Example: Red-green colorblindness
Males are 100 times more likely to have the red type of color blindness than females. Males are much
more likely to have this form of colorblindness because the genetic mutation occurs on the X chromosome.
The disease mutation is recessive, so female carriers of a single copy do not develop
the phenotype. Males have only a single copy of the X chromosome, so they develop the
phenotype whenever the mutation is present. With this information, we can use the HW equilibrium to
calculate the frequency of this 'disease' allele in the population. Let $q$ be the frequency of
the recessive allele. For men, the frequency of the 'disease' phenotype is simply the frequency of the
recessive allele, $q$. Women, having 2 X chromosomes, develop the 'disease' phenotype with
frequency $q^2$. Knowing that men are 100 times more likely to have red-green colorblindness allows us
to calculate the frequency of this allele (dividing both sides by $q \neq 0$):
$q = 100 q^2$
$q = 1/100$
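The Hardy-Weinberg calculations in this list can be checked numerically. The following Python sketch reproduces the two-allele genotype frequencies, the next-generation allele frequency, and the PKU and colorblindness examples; the function names are illustrative choices, and the starting frequency 0.3 is an arbitrary value picked for the demonstration.

```python
from math import sqrt

def hw_genotype_frequencies(p):
    """Expected genotype frequencies for alleles A (freq p) and a (freq 1 - p)."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

def next_generation_allele_frequency(genotype_freqs):
    """Frequency of A among transmitted alleles: AA parents always pass A,
    Aa parents pass it half of the time, aa parents never do."""
    return 1.0 * genotype_freqs["AA"] + 0.5 * genotype_freqs["Aa"]

def carrier_frequency(disease_incidence):
    """Autosomal recessive disease: incidence = q^2, carriers (Aa) = 2pq."""
    q = sqrt(disease_incidence)
    return 2 * (1.0 - q) * q

def x_linked_allele_frequency(male_to_female_ratio):
    """X-linked recessive trait: males are affected with frequency q and
    females with q^2, so the ratio is q / q^2 = 1 / q."""
    return 1.0 / male_to_female_ratio

freqs = hw_genotype_frequencies(0.3)
p_next = next_generation_allele_frequency(freqs)   # stays at 0.3
pku_carriers = carrier_frequency(1 / 10_000)       # 0.0198
q_colorblind = x_linked_allele_frequency(100)      # 0.01
```

Note that these functions return expected values for an idealized infinite population; in a finite population the realized frequencies would fluctuate around them because of genetic drift.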
Linkage (dis-)equilibrium (LD): Describes correlation of allelic values across multiple loci.
* Linkage disequilibrium (LD) is a measure of the correlation, or lack of independence, in the inheritance of
alleles from different loci. With high recombination rates, or when examining loci on different
chromosomes, the probability of two alleles both being inherited is simply the product of the probabilities of
each allele being inherited independently. This is referred to as linkage equilibrium. With low
recombination rates, or when the loci are very close to each other on a chromosome, the probability of
two alleles both being inherited differs from the product of the probabilities of each allele being
inherited independently. This difference is the measure of linkage disequilibrium.
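* Measures of LD: $D$, $D'$, $\rho$.
\[ D = |P_{00} - P_{0*} P_{*0}| \]
\[ D_{\max} = \max \{ P_{0*} P_{*0},\ P_{0*} P_{*1},\ P_{1*} P_{*0},\ P_{1*} P_{*1} \} \]
\[ D' = \frac{|P_{00} - P_{0*} P_{*0}|}{D_{\max}} \]
\[ \rho = \frac{D}{\sqrt{P_{0*} P_{1*} P_{*0} P_{*1}}} \]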
* Extra Credit: Compare LD with other measures of correlation between loci, such as the correlation coefficient
and the Hamming distance.
* LD is known to vary with the distance between the loci: LD decays exponentially as the distance
between the two loci increases. It can be assumed that the recombination rate increases linearly
with distance; however, recombination rates are known to vary from region to region.
* Similarly, there is a relationship between LD and time. There is an exponential decay of LD as time
increases. As time moves forward, recombination events cause this decay of LD until it disappears
completely (i.e. linkage equilibrium).
* LD can be used for gene mapping by exploiting the fact that LD varies with the distance between two
loci. Instead of measuring the LD between two loci, we can replace one of the loci with a vector of
disease diagnoses (reflecting the presence or absence of a disease in individuals). Then, by measuring
the LD between this diagnosis vector and SNPs throughout the genome, the location of a possible
disease gene can be inferred from its high measures of LD.
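The LD measures listed above can be computed directly from a binary haplotype matrix (rows are haplotypes, columns are variant sites, with 0 and 1 as the two alleles). The Python sketch below is written under several assumptions: $P_{00}$ is read as the frequency of haplotypes carrying allele 0 at both sites, $P_{0*}$ and $P_{*0}$ as the marginal frequencies of allele 0 at the first and second site, and $D_{\max}$ is taken exactly as written in these notes (other texts normalize $D$ differently, for example by choosing $D_{\max}$ according to the sign of the unsigned difference). The decay function uses the standard textbook recurrence $D_t = (1 - r)^t D_0$ for recombination fraction $r$, which is not stated in the notes.

```python
from math import sqrt

def ld_measures(haplotypes, i, j):
    """Return D, D' and rho for sites i and j of a 0/1 haplotype matrix,
    following the definitions given in these notes."""
    n = len(haplotypes)
    p0_ = sum(1 for h in haplotypes if h[i] == 0) / n   # P_{0*}
    p_0 = sum(1 for h in haplotypes if h[j] == 0) / n   # P_{*0}
    p00 = sum(1 for h in haplotypes if h[i] == 0 and h[j] == 0) / n
    p1_, p_1 = 1.0 - p0_, 1.0 - p_0
    if min(p0_, p1_, p_0, p_1) == 0.0:
        raise ValueError("both sites must be polymorphic in the sample")
    d = abs(p00 - p0_ * p_0)
    d_max = max(p0_ * p_0, p0_ * p_1, p1_ * p_0, p1_ * p_1)
    rho = d / sqrt(p0_ * p1_ * p_0 * p_1)
    return {"D": d, "D_prime": d / d_max, "rho": rho}

def ld_after_generations(d0, r, t):
    """Expected LD after t generations with recombination fraction r
    (assumed model: D_t = (1 - r)^t * D_0)."""
    return d0 * (1.0 - r) ** t

# Toy data: sites 0 and 1 always agree; site 2 is independent of site 0.
haplotypes = [
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
]
linked = ld_measures(haplotypes, 0, 1)     # D = 0.25, D' = 1.0, rho = 1.0
unlinked = ld_measures(haplotypes, 0, 2)   # D = 0.0,  D' = 0.0, rho = 0.0
halved = ld_after_generations(0.25, 0.5, 1)  # 0.125 for unlinked loci
```

For gene mapping as described in the last bullet, the same function could be applied with one column replaced by a 0/1 vector of disease diagnoses, again as an illustrative simplification.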
HapMap consortium
* Extra Credit: Describe the goals of the HapMap project. Read through the paper and describe a few of
the conclusions.