Genomics
• Genomics is the study of the entire genome:
the sequence of all the DNA in the cell.
– For humans, the haploid genome is about 3 billion (3 x
10^9) base pairs (bp). Since we are diploid, we have
about 6 x 10^9 bp per cell.
• Related subjects: (plus many more!)
– Proteome: all the proteins in the cell
– Transcriptome: all of the RNAs (i.e. transcripts) in the
cell
– Metabolome: all of the small-molecule metabolites in the cell (the
substrates and products of its metabolic pathways)
• What makes it possible to study all of these
things is DNA sequencing. It is possible (and
not all that hard) to determine the DNA
sequence of an entire genome.
– Which then lets you find all the genes, which can then
be translated into proteins using the well-known
genetic code.
– Important point: virtually all RNA and protein
sequences are inferred from DNA sequence data.
RNAs and proteins are not directly sequenced (except
for some minor applications).
The DNA sequence of the
human genome contains all
the information needed to
produce a person. If we
know someone’s DNA
sequence, can we predict
their phenotype?
DNA Polymerase Reaction
• The basis of DNA sequencing, as well as PCR and other DNA
techniques, is the enzyme DNA polymerase.
• DNA polymerase is a protein (like nearly all enzymes). It consists of several
polypeptide subunits.
• DNA polymerase catalyzes the synthesis of the second strand of a
DNA molecule.
• To start, it needs:
– Single stranded DNA template molecule
– Primer: a short piece of DNA base-paired with a region of the template
– dNTPs: the 4 deoxy nucleoside triphosphates dATP, dCTP, dGTP and
dTTP, which are the raw materials for the new DNA strand.
• The basic reaction: The 2 phosphate groups on the end of the dNTP
molecule (the gamma (γ) and beta (β) phosphates) are removed, and
the phosphate next to the sugar (the alpha (α) phosphate) is attached
to the 3’ OH group of the growing DNA chain.
– The removal of the phosphates provides the energy needed to drive the
reaction.
More DNA Polymerase
• The DNA polymerase reaction is processive: it starts from the 3’ end of the
primer and adds new nucleotides, one at a time, until it reaches the end
of the template. You now have a complete double stranded DNA
molecule.
• Each nucleotide added is complementary to the nucleotide on the
template strand: A paired with T, and G paired with C.
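The base-pairing rule above can be sketched in a few lines of Python. This is only an illustration of the pairing logic, not a model of the enzyme; the function name `synthesize_second_strand` is made up for this example.

```python
# Watson-Crick pairing: A with T, G with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def synthesize_second_strand(template_5to3):
    """Return the new strand, written 5'->3', for a template written 5'->3'."""
    # Each new base is the complement of the template base it pairs with.
    complement = [PAIR[base] for base in template_5to3]
    # The two strands are antiparallel, so reverse the result to read it 5'->3'.
    return "".join(reversed(complement))

print(synthesize_second_strand("ATGGC"))  # prints GCCAT
```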
Replication in the Cell
• Uses RNA primers made by the primase enzyme.
• Occurs on both strands simultaneously, at a replication fork.
• The replication fork is created by helicase unwinding the double
helix.
• Leading strand is continuously synthesized, in the same direction
the replication fork is moving.
• Lagging strand is replicated in short stretches called Okazaki
fragments. The DNA polymerase replaces the RNA primers with
DNA. Then, the Okazaki fragments are ligated together by the DNA
ligase enzyme.
• The DNA polymerase has an error-correction function: if the wrong
nucleotide is added, DNA polymerase backs up, removes it,
and tries to add the proper nucleotide again.
Polymerase Chain Reaction
• The polymerase chain reaction (PCR) is used to make many identical copies
of a short region of DNA, so it can be analyzed further.
• PCR uses 2 primers that bind to opposite strands of the DNA molecule, a
short distance apart (say, 1000 bp or less).
• PCR is a set of 3 reactions, done repeatedly in a cycle. With each cycle the
number of DNA molecules doubles: exponential growth.
– Starting with 1 DNA molecule, after 30 cycles you have 2^30 (about 1 billion)
identical molecules, which can easily be detected.
– Only the region between the primers gets amplified.
• The 3 steps of PCR:
– Denaturation (melting). Double-stranded DNA is converted to single strands by heating it
to a high temperature (say 94 °C)
– Primer annealing. The primers bind to complementary regions on the DNA when the
mixture is incubated at a lower temperature (say 50 °C)
– Primer extension. New second strands are built on the template by DNA polymerase,
starting at the primers (say 72 °C)
• All reactions occur in a single tube, with just the temperature changing
every minute or so.
– Uses a DNA polymerase that survives high temperatures: Taq polymerase. It was isolated
from bacteria growing in hot springs at Yellowstone National Park.
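The doubling arithmetic, and the point that only the region between the primers is copied, can be sketched as follows. This assumes ideal PCR (perfect doubling every cycle, exact primer matches); the function names and the toy template are invented for the illustration.

```python
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return "".join(PAIR[base] for base in reversed(seq))

def copies_after(cycles, starting_molecules=1):
    """Ideal PCR: every cycle doubles the number of target molecules."""
    return starting_molecules * 2 ** cycles

def amplicon(template, forward_primer, reverse_primer):
    """Region of the top strand bounded by the two primers, or None if absent."""
    start = template.find(forward_primer)
    if start == -1:
        return None
    # The reverse primer binds the opposite strand, so its reverse
    # complement is what appears in the top strand.
    site = revcomp(reverse_primer)
    end = template.find(site, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(site)]

print(copies_after(30))  # prints 1073741824
print(amplicon("TTTACGTAGGGGCCCCATTGAAA", "ACGTA", "CAATG"))
```

The flanking TTT and AAA in the toy template are left out of the product, which is the sense in which only the region between the primers gets amplified.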
PCR
Thermal profile of 1 PCR cycle. It cycles between 94 °C (denaturation),
50 °C (annealing), and 72 °C (elongation).
What happens in each of the 3 PCR steps.
Exponential growth of DNA molecules
DNA Sequencing
• Many sequencing methods have been invented, and it’s still a very
active area of research.
• Most use the concept of sequencing by synthesis: starting with a
primer, use DNA polymerase to add new bases one at a time, paying
attention to which base is added.
• In the Illumina method (current favorite), fluorescent tags attached
to the 3’ OH group are used.
– Each of the 4 nucleotides has a different colored tag.
• The fluorescent tags block the 3’-OH of the new nucleotide, and so the
next base can only be added when the tag is removed.
• A cycle: add one new base, then read its color, then remove the
fluorescent tag to give a free 3’ OH group.
• Repeat the cycle up to 200 times.
• End up with 200 bp of sequence information.
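The add / read / unblock cycle can be mimicked in a toy simulation. The color assigned to each base below is an arbitrary assumption (the slides do not say which dye goes with which base), and the function name is invented for the sketch.

```python
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
# Hypothetical color assignment: one distinct color per tagged nucleotide.
TAG_COLOR = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
COLOR_TO_BASE = {color: base for base, color in TAG_COLOR.items()}

def sequence_by_synthesis(template_3to5, max_cycles=200):
    """Return the read (5'->3') made while copying a template given 3'->5'."""
    read = []
    for template_base in template_3to5[:max_cycles]:
        added = PAIR[template_base]         # add one blocked, tagged base
        color = TAG_COLOR[added]            # image the spot and record its color
        read.append(COLOR_TO_BASE[color])   # call the base from the color
        # Removing the tag frees the 3' OH, so the next cycle can add a base.
    return "".join(read)

print(sequence_by_synthesis("TACGGT"))  # prints ATGCCA
```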
More Sequencing
• To get enough signal from the DNA molecule being sequenced, each DNA molecule needs
to be amplified using PCR.
• For the Illumina method, this is done by attaching individual DNA molecules to a solid
surface, then amplifying them in place, giving tiny spots with about a million identical
copies.
• The DNA polymerase sequencing reactions are then monitored with a high resolution
video camera.
Sequence Assembly
• The big problem with all current sequencing methods: you only get very
short reads: 200 bp maximum for Illumina, up to 1000 bp for the older
(slower, much more expensive) Sanger method, etc.
– The human genome is 23 DNA molecules (chromosomes) that total 3 billion bp.
Human chromosomes are 50-250 million base pairs long.
– You need to assemble the tiny reads into much longer contigs (continuous sequences).
With a perfectly sequenced genome, the final contigs would be identical to the DNA
sequence of the chromosomes.
• How reads are assembled into contigs: overlapping sequences.
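A minimal sketch of assembly by overlap, assuming error-free reads that all come from the same strand. Real assemblers are far more elaborate; the greedy strategy, the function names, and the toy reads here are illustrative assumptions.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two reads (or contigs) with the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # nothing overlaps any more: the contigs stay separate
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
    return contigs

print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))  # prints ['ATGGCGTACCTGA']
```

Requiring a minimum overlap (`min_len`) keeps chance matches of 1 or 2 bases from joining unrelated reads.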
Assembly Problems
• Chromosomes, especially eukaryotic chromosomes, are filled with sequences that are
repeated many times. If you have a read from a repeated sequence, how do you know which
copy it is?
– Some repeats are next to each other (tandem repeats) and some are scattered all over
the genome (dispersed repeats).
• The main solution to this problem is to start with longer DNA template molecules and
sequence both ends. You don’t know the sequence in between, but you do know how far
apart the ends are. This often allows you to jump over repeated sequences.
– It’s not perfect, and even now there are no human chromosomes sequenced to 100% accuracy.
BLAST
How to find similar sequences
Finding Similar Sequences
• Once you have the DNA sequence of your gene, how do you find
other, similar genes?
– Within the genome: are there duplicate genes present? For example, in
the human genome there are related genes for alpha globin and beta
globin. Are there other globins in the genome?
– Between genomes: if you find a gene in one species, is it present in
others? For example, are there globin genes in plants or fungi?
– Gene function: if the function of a protein is determined by laborious
experimentation, you can extend the value of the results by saying that
similar genes in other species probably do the same thing.
– What parts of the protein are conserved across evolutionary lines (and
thus are probably important to protein function)?
• BLAST (Basic Local Alignment Search Tool) is the standard tool
for doing sequence comparison.
Protein Sequences Are More
Conserved Than DNA Sequences
• Evolutionary fitness of a particular gene depends on how well it functions
under different conditions.
• Function is determined by the amino acid sequence of the protein.
– It is also determined by the 3-dimensional structure of the protein, which is based on
the amino acid sequence
• The genetic code has many synonymous codons. A mutation that changes
the nucleotide sequence but not the amino acid sequence produces the
exact same protein, so there is no effect on evolutionary fitness.
– this means that many mutations within a gene can accumulate in a population based
solely on genetic drift.
• As a consequence, we want to compare protein sequences in preference
to DNA sequences.
– However, almost all protein sequences are derived from DNA sequences that have been
translated in silico (by a computer).
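The effect of synonymous codons can be seen by translating two DNA sequences in silico. The codon table below is only a small excerpt of the standard genetic code, and the two "genes" are invented for the illustration.

```python
# Excerpt of the standard genetic code (DNA codon -> 1-letter amino acid).
CODON_TABLE = {
    "ATG": "M",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
    "AAA": "K", "AAG": "K",
    "GAA": "E", "GAG": "E",
    "GAT": "D", "GAC": "D",
}

def translate(dna):
    """Translate a DNA coding sequence, reading complete codons from position 0."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def differences(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))

gene_1 = "ATGGCTCTGAAAGAA"
gene_2 = "ATGGCCCTTAAGGAG"  # differs from gene_1 only at third codon positions

print(differences(gene_1, gene_2))            # prints 4
print(translate(gene_1), translate(gene_2))   # prints MALKE MALKE
```

Four DNA differences, zero protein differences: a protein-level comparison still sees these two sequences as identical.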
Dotplots
• We want to align one sequence with
another: for example, we can learn about
a newly sequenced gene by aligning it
with all the sequences in a large
database, to see if it is similar to a
previously known sequence.
• The dotplot is a tool for graphically
aligning 2 sequences.
– Put the letters of one sequence along the
x-axis and the other sequence on the y-axis.
– At each intersection where the sequence
letters match, output a dot.
– You can see the alignment, even though
there are some mismatches and indels
(indel= insertion/deletion, some
nucleotides present in one sequence but
not the other).
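The dotplot recipe above is short enough to write out directly. This is a text-only sketch (a `*` where the letters match, a `.` elsewhere); the function names and the two toy sequences are made up.

```python
def dotplot(seq_x, seq_y):
    """One string per letter of seq_y; '*' marks a match with the seq_x letter."""
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]

def print_dotplot(seq_x, seq_y):
    print("  " + seq_x)                      # x-axis labels
    for letter, row in zip(seq_y, dotplot(seq_x, seq_y)):
        print(letter + " " + row)            # y-axis label, then the row of dots

# The second sequence lacks the "TT" present in the first (an indel),
# so the diagonal shifts sideways partway through the plot.
print_dotplot("GATTACA", "GAACA")
```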
A Real Dotplot
• Two haptoglobin sequences. (Haptoglobin is a blood protein that binds to hemoglobin that has gotten out
of the red blood cells).
• You can see a gap in one sequence, a region of poor similarity just before it, and a simple sequence repeat
near the beginning.
Automated Sequence Alignment
• Dotplots have some problems:
– Very slow: they require humans to examine them and then judge what a good
alignment is.
– It’s hard to know what to do with difficult regions, where the alignment isn’t
clear.
– Dotplots consider all amino acid changes as equally bad: every position is
either a match or not. In practice, we know that some changes are
conservative: a different amino acid produces very little change in the
protein’s function.
• What BLAST does is mathematically create a dotplot and find the best
diagonal alignment, without needing our human visualization skills.
An Actual BLAST Search Result
• This is an alignment between two superoxide dismutase genes, from two different
Bacillus species. It uses the 1 letter amino acid code.
• Between the Query and Subject lines you see a letter if there is a match, and a + if
the two amino acids are similar.
• Gaps are shown as dashes (---): amino acids present in one sequence but not the
other.
• A sequence alignment program needs to deal with matching amino acids,
conservative (similar) amino acids, complete mismatches, and gaps (indels).
Substitution Matrices
• If you align many sequences, it is clear that some amino acid substitutions
are common, while others are very rare.
• A substitution matrix gives a score for each possible amino acid
substitution between 2 sequences.
– Substitution matrices are created by counting the different substitutions in large
numbers of sequences that have been carefully aligned by hand.
• Two commonly used sets of matrices: PAM and BLOSUM.
– Both PAM and BLOSUM have several matrices that have been tuned to work with
different levels of evolutionary divergence.
• We are just going to use the BLOSUM62 matrix, which is the default for
BLAST searches.
• The alignment score is just the sum of the individual BLOSUM scores for
each pair of aligned amino acids.
BLOSUM62 Substitution Matrix
Observations on BLOSUM62 Matrix
• Numbers are small positive or negative integers. They represent how frequently
different substitutions were seen in the manually curated sequences relative to
completely random pairings.
– A score of 0 means that the substitution occurs about as often as would be expected by
chance alone.
– A positive number means the substitution occurs more frequently than expected by
chance.
– A negative number means it occurs less frequently than expected by chance.
• Along the diagonal are scores for keeping the same amino acid in both
sequences.
– Some amino acids are very conserved in evolution: for example C (cysteine) has a score
of 9 and W (tryptophan) has a score of 11. These amino acids are only rarely
substituted.
– Other amino acids are less conserved: I (isoleucine), L (leucine) and V (valine) have
scores of 4.
• The body of the matrix shows scores for amino acid changes.
– Most are 0 or less
– Some are positive: I,L, and V substitute for each other fairly often, for example
Gaps
• When many different sequence alignments were examined, it became obvious
that there are 2 important types of mutation: nucleotide substitutions and short
indels.
– Indel = insertion/deletion: some nucleotides present in one sequence but not the other.
– Indels appear as gaps in the aligned sequences, symbolized by dashes (---).
• Substitution matrices work for substitutions but not gaps.
• There is no good theory for the relationships between substitutions, gaps,
and gap lengths, so gaps are dealt with heuristically.
– Heuristic = a method or value determined by trial-and-error experiments, without a
strong guiding theory.
– In this case, gap penalties are the result of trying many possibilities and seeing which
ones give the most pleasing alignments.
• The existence of an indel seems to be relatively independent of its length.
Because of this, gaps are scored in 2 ways:
– Gap opening penalty: a negative score for each gap. BLAST default is -11.
– Gap extension penalty: a smaller negative score proportional to the length of the gaps.
BLAST default is -1.
Scoring an Alignment
• Alignment score (S) = sum of all aligned amino acid scores –
(number of gaps x gap opening penalty) – (number of positions in
gaps x gap extension penalty). The penalties are used here as positive
sizes (11 and 1 for the BLAST defaults), so both terms lower the score.
• First add up the scores for each aligned pair of amino acids.
• Then count the number of gaps (indels), multiply that by the gap-opening
penalty, and subtract from the total.
• Then, count the number of positions that are aligned with gaps,
multiply by the gap extension penalty, and subtract from the total
score.
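The scoring recipe can be sketched as below, treating the gap penalties as positive sizes (11 to open, 1 per gapped position) that are subtracted from the total. Only a handful of BLOSUM62 entries are included, just enough for the invented example alignment; a real calculation would use the full matrix.

```python
# A few BLOSUM62 entries (not the full matrix).
BLOSUM62_EXCERPT = {
    ("M", "M"): 5, ("L", "L"): 4, ("W", "W"): 11,
    ("K", "R"): 2, ("I", "V"): 3,
}

def pair_score(a, b, matrix):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]

def alignment_score(query, subject, matrix, gap_open=11, gap_extend=1):
    """Score two aligned sequences of equal length; '-' marks a gap position."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    open_gap = None  # which sequence, if any, has a gap running right now
    for q, s in zip(query, subject):
        gapped = "query" if q == "-" else "subject" if s == "-" else None
        if gapped:
            if gapped != open_gap:
                score -= gap_open    # charged once for each new gap
            score -= gap_extend      # charged for every gapped position
        else:
            score += pair_score(q, s, matrix)
        open_gap = gapped
    return score

# Pairs: M/M 5, K/R 2, V/I 3, L/L 4, W/W 11 = 25; one gap of length 1: -11 - 1.
print(alignment_score("MKVLAW", "MRIL-W", BLOSUM62_EXCERPT))  # prints 13
```

Lengthening the gap by one position costs only 1 more, while a second, separate gap would cost another 11: this is how the scheme reflects the idea that the existence of an indel matters more than its length.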
Practical BLAST
• Let us say you have a sequence that you want to find matches for.
Your sequence is the query sequence.
• BLAST compares the query sequence against a database of
sequences, the subject sequences.
– The most commonly used database is the nr database. “nr” stands for non-redundant, and it
consists of all known sequences with any identical ones removed (at NCBI, nr is the protein
collection; the nucleotide counterpart is called nt).
– It is also common to use a database for a single organism or group of organisms,
such as human-only or mammalian-only.
• The sequences can be either DNA or protein.
– Protein sequences are better conserved in evolution, so they are typically used for
cross-species comparisons.
– Almost all protein sequences are sequenced DNA that was translated using the
genetic code.
• There are several versions of BLAST that do slightly different things.
We are going to concentrate on blastn (both the query and the
subject are nucleotide sequences) and blastp (both the query and the
subject are peptide sequences).
BLAST E-values
• BLAST results are usually reported as e-values (“expect-values”). The e-value for a
match between a query sequence and a subject sequence is the number of subject
sequences in a completely random database that would be expected to have the same
match score or better. The random database must be the same size as the one you are
using.
– Really bad matches have e-values of 1 or more: An e-value of 1 means that even in a completely
random database you would expect to find one match as good as the one being reported.
– Most e-values are numbers less than 1, written in exponential notation. They look like 3e-23, for
example. This means 3 x 10^-23, which indicates a good match.
– The larger the negative exponent is, the better the match is. Thus, 1e-80 is a better match than
3e-23.
– The best matches have an e-value of 0.0. This score implies an e-value better than 1e-180:
BLAST reports anything smaller than that as 0.0. (This is a reporting cutoff; standard floating
point numbers can go smaller.)
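Arranging e-values from best to worst is just a numeric sort once the text is converted to numbers. The hit names and values below are invented for the illustration.

```python
# Hypothetical hits with e-values written the way BLAST prints them.
hits = {"hit_a": "3e-23", "hit_b": "1e-80", "hit_c": "0.0", "hit_d": "2.5"}

def rank_hits(evalues):
    """Hit names ordered from best (smallest e-value) to worst (largest)."""
    return sorted(evalues, key=lambda name: float(evalues[name]))

print(rank_hits(hits))  # prints ['hit_c', 'hit_b', 'hit_a', 'hit_d']
```

The smallest number wins: 0.0 first, then the largest negative exponent, with anything at 1 or above at the bottom of the list.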
BLAST Algorithm
1. Make a list of all possible 3 letter “words” in the query
sequence.
2. Use substitution matrix to find all synonyms for each
word that exceed a minimum score.
3. Search the database for sequences that have matching
words. This might include many sequences.
4. Extend the ungapped alignment between the 2 sequences,
starting at the matching word and moving in both directions,
until the score starts to drop.
More BLAST Algorithm
5. If the ungapped alignment’s score is greater than a
threshold, do a full alignment that allows gaps. (This is a
much slower process than an ungapped alignment.)
6. Process the scores to convert them to e-values, using
parameters from the scoring matrix, the length of the query
sequence, and the size of the database.
7. Report all matches with e-values below a threshold
(default=10).
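Steps 1, 3 and 4 can be sketched in simplified form. This is not the real BLAST implementation: it only seeds on identical words (skipping the synonym step), scores +1 for a match and -1 for a mismatch instead of using a substitution matrix, and stops extending once the score has dropped 2 below its best. All names, parameters, and sequences are assumptions made for the example.

```python
def words(sequence, size=3):
    """Step 1: every overlapping word of length `size` in the sequence."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]

def seed_hits(query, subject, size=3):
    """Step 3 (simplified): positions where the same word occurs in both."""
    positions = {}
    for i, word in enumerate(words(query, size)):
        positions.setdefault(word, []).append(i)
    return [(i, j) for j, word in enumerate(words(subject, size))
            for i in positions.get(word, [])]

def _extend(query, subject, q, s, step, match, mismatch, drop):
    """Walk in one direction; return (best score gain, letters kept)."""
    score = best = steps = best_steps = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        steps += 1
        if score > best:
            best, best_steps = score, steps
        elif best - score >= drop:
            break  # the score has started to drop: stop extending
        q += step
        s += step
    return best, best_steps

def extend_ungapped(query, subject, q_pos, s_pos, size=3,
                    match=1, mismatch=-1, drop=2):
    """Step 4: grow a seed word in both directions without allowing gaps."""
    right, right_len = _extend(query, subject, q_pos + size, s_pos + size,
                               1, match, mismatch, drop)
    left, left_len = _extend(query, subject, q_pos - 1, s_pos - 1,
                             -1, match, mismatch, drop)
    score = size * match + left + right
    return score, query[q_pos - left_len:q_pos + size + right_len]

print(words("MATKA"))  # prints ['MAT', 'ATK', 'TKA']
print(extend_ungapped("GATTACAGG", "CCATTACATT", 2, 3))  # prints (6, 'ATTACA')
```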
Things that might appear on a test
• Use BLOSUM62 matrix plus gap penalties to
score an alignment
• Arrange BLAST scores from best to worst
• Find all 3 letter words in a sequence
An Actual BLAST search
• Superoxide dismutase is a gene coding for the enzyme that destroys
superoxide radicals inside the cell. Superoxide is a highly reactive
byproduct of aerobic respiration. Mutations in this gene cause amyotrophic
lateral sclerosis (aka Lou Gehrig’s disease).
• Starting with the protein sequence from humans, obtained from the
National Center for Biotechnology Information (NCBI).
– Uses 1 letter amino acid code.
– FASTA format: a comment (title) line starting with ‘>’, then the sequence itself on one or
more lines after the comment line.
• Most BLAST searches are done at NCBI
(http://blast.ncbi.nlm.nih.gov/Blast.cgi ), because it is easy and they have
the most up-to-date version of the nr database. However, it is also quite
slow.
• We are going to use a local BLAST database, because we can (probably) do it
in real time. http://biolinx2.bios.niu.edu/rjohns/bmeg/bmeg_blast.htm
>gi|4507149|ref|NP_000445.1| superoxide dismutase [Cu-Zn] [Homo sapiens]
MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSR
KHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGN
AGSRLACGVIGIAQ
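A FASTA record like the one above is easy to read by hand in code. This is a minimal sketch with an invented function name; it handles several records in one text but does no validation of the sequence letters.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA-formatted text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new comment line closes the previous record, if there was one.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)  # sequence may be spread over several lines
    if title is not None:
        records.append((title, "".join(chunks)))
    return records

fasta_text = """>gi|4507149|ref|NP_000445.1| superoxide dismutase [Cu-Zn] [Homo sapiens]
MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSR
KHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGN
AGSRLACGVIGIAQ
"""

for title, sequence in parse_fasta(fasta_text):
    print(title)
    print(sequence[:10] + "...")
```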
Does Bacillus megaterium have Superoxide dismutase?
Results, part 1
• The header section.
– Version of BLAST. Note it’s blastp (for proteins).
– Literature references.
– Comment line from the query sequence and its size (184 letters)
– Subject database and its size: 5629 sequences (genes) and 1,503,829 letters
(amino acids).
More Results
• This is the summary list of BLAST hits. It shows every hit, every gene
that gives a match better than some cutoff value. Sorted so the best
one is on top.
• Here, only 2 hits
• Each line shows the gene ID number (BMQ_2135 and BMQ_4952)
along with a bit of description (which is cut off somewhat). Also the
bit score and the e-value.
• E-values here are moderately good. This protein is conserved
between humans and bacteria.
Hit Details
• Comment line (note the >) from database for this gene. It is indeed a
superoxide dismutase, with a total length of 207 amino acids.
• Repeat of score and e-value (Expect), along with:
Identities: after the sequences are aligned, the number of identical amino acids
Positives: counts similar amino acids as well as identical
Gaps: places where the alignment has an amino acid in one sequence but not in the
other. Gaps are indel mutations.
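The three counts can be recovered from the three lines of an alignment block. The sketch assumes the middle line follows the convention described for BLAST output (a letter for an identical pair, `+` for a similar pair, a space otherwise); the short alignment is invented, not taken from the search shown here.

```python
def alignment_stats(query, midline, subject):
    """Count identities, positives and gaps in one aligned block."""
    identities = sum(1 for mark in midline if mark.isalpha())
    positives = identities + midline.count("+")   # similar + identical
    gaps = query.count("-") + subject.count("-")  # gapped positions
    return {"length": len(query), "identities": identities,
            "positives": positives, "gaps": gaps}

query   = "MKVLAW"
midline = "M++L W"
subject = "MRIL-W"
print(alignment_stats(query, midline, subject))
```

For this toy block the function reports 3 identities, 5 positives and 1 gap out of 6 aligned positions.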
More Detail
• The groups of 3 lines show the query sequence, the subject sequence,
and the positions where they match (between). A + means a similar
amino acid (one of the “Positives” from the last slide).
• The numbers are the start and end of the match. This is one
continuous sequence: between amino acids 25 and 150 in the human
(query) sequence, and between positions 80 and 205 in the B.
megaterium (subject) sequence.
The other hit and other info
What We Know about Genomes
Eukaryotic Genomes
• Linear chromosomes
• Lots of gene duplication
• Transposable elements
• Repeat sequence DNA
Orthologs and Paralogs
• Homologues : genes that match
each other in a BLAST search.
• During speciation, one species splits
into 2 different species. The
ancestral gene now has versions in
both descendant species: these
genes are called orthologs.
– The beta-globin genes in humans and
chimpanzees are orthologs
• Within a species, the gene might be
duplicated. The different copies of
the gene are called paralogs.
– The alpha and beta globin genes in
humans are paralogs.
• Paralogs are free to evolve new
functions.
Synteny
• We defined “syntenic” to mean
genes on the same chromosome as
each other.
• Now, we extend this definition a bit to
cover comparisons between species.
A region of chromosome in one
species is syntenic with a region in
another species if they both have the
same genes (orthologs) in the same
order.
• Closely related species contain many
large blocks of syntenic genes. All
mammals, for instance.
– Useful for finding orthologs
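One simple way to look for a syntenic block is to search two gene orders for the longest run of orthologs that appears in the same order in both. The sketch below assumes orthologs have already been given a shared name, ignores inverted blocks, and uses placeholder gene names.

```python
def longest_syntenic_block(order_a, order_b):
    """Longest run of genes found contiguously, in the same order, in both lists."""
    best = []
    # previous[j] = length of the shared run ending at order_a[i-2], order_b[j-1]
    previous = [0] * (len(order_b) + 1)
    for i in range(1, len(order_a) + 1):
        current = [0] * (len(order_b) + 1)
        for j in range(1, len(order_b) + 1):
            if order_a[i - 1] == order_b[j - 1]:
                current[j] = previous[j - 1] + 1   # extend the run by one gene
                if current[j] > len(best):
                    best = order_a[i - current[j]:i]
        previous = current
    return best

# Placeholder names standing in for orthologous genes in two species.
region_species_1 = ["geneA", "geneB", "geneC", "geneD", "geneE"]
region_species_2 = ["geneX", "geneB", "geneC", "geneD", "geneY"]
print(longest_syntenic_block(region_species_1, region_species_2))
```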
Part of a study looking for genes
affecting alcoholism. A region on
human chromosome 1 that affects
severity of alcohol withdrawal
symptoms is syntenic with a
mouse gene that affects
mitochondrial respiration
Waardenburg syndrome
• Waardenburg syndrome. There
are several types: we are
discussing type 1 here.
• Different-colored eyes, white
forelock, white skin patches,
deafness.
• Due to partial absence of
melanocytes. Melanocytes are
neural crest cells: they originate
near the developing neural tube
and migrate laterally down the
flanks of the body in the embryo.
• Autosomal dominant, but varies
in expressivity.
Synteny and Waardenburg Syndrome
• The disease was mapped to a region on human chromosome 2. However,
there were dozens of genes in the mapped region: which was the real one?
• A mutation in mice, Splotch, also shows white patches and deafness, and
maps to a syntenic region on mouse chromosome 1.
• Also, the PAX3 gene, mapped as a transcribed DNA sequence, was in the same
mouse chromosome region. PAX3 is a transcription factor active in the neural
crest.
• An unmapped human gene HuP2 showed strong sequence identity to PAX3.
• When DNA of the HuP2 gene from Waardenburg patients was examined, 6 out of
17 unrelated patients had altered DNA, and 0 out of 50 normal controls had
altered DNA.
• Since the human HuP2 gene was very similar to the mouse gene, it was renamed
PAX3.
The mouse on the left is a Splotch mutant
Transposable Elements
• Genes have fixed locations on the chromosomes: that’s why we can map
them.
• However, certain DNA sequences, called transposable elements or
transposons, can move from place to place in the genome.
– First discovered by Barbara McClintock in maize, then later seen in
bacteria: it became clear that they were in all organisms.
• They are mostly thought of as intracellular parasites: as long as they
replicate more frequently than mutation can inactivate them, they remain
in the genome.
• About 45% of the human genome is (mostly inactive) transposable elements.
• Two basic types: DNA transposons and retrotransposons (which use RNA).
DNA Transposons
• DNA transposons have short inverted
repeats at their ends. Between the
repeats is the gene for transposase.
• Transposase is an enzyme that cuts the
transposon out of one location and then
inserts it into a new location: DNA
transposons move by a cut-and-paste
mechanism.
• Transposase works on the inverted
repeats. It can act on any transposon in
the cell that has the proper inverted
repeats, not just the one carrying the
transposase gene. This means that some
DNA transposons are autonomous (they
carry their own transposase gene) and
others are non-autonomous (they rely on
the transposase gene from another
transposon).
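Inverted repeats at the two ends of an element can be detected by comparing the element with its own reverse complement. This is a sketch assuming exact repeats; the function names and the toy element are invented.

```python
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return "".join(PAIR[base] for base in reversed(seq))

def terminal_inverted_repeat(element, min_len=4):
    """Length of the inverted repeat at the element's ends (0 if under min_len)."""
    # If the first k bases equal the reverse complement of the last k bases,
    # the element and its reverse complement share a prefix of length k.
    rc = revcomp(element)
    length = 0
    limit = len(element) // 2  # the two repeats cannot overlap each other
    while length < limit and element[length] == rc[length]:
        length += 1
    return length if length >= min_len else 0

# Toy element: 7 bp inverted repeats flanking an arbitrary middle section.
toy_transposon = "CAGGGTT" + "GGCATTCAAG" + "AACCCTG"
print(terminal_inverted_repeat(toy_transposon))  # prints 7
```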
Retrotransposons
• Retrotransposons replicate through an RNA intermediate:
they are transcribed just like a regular gene, and then the
RNA is reverse-transcribed back into DNA, which inserts at
random locations in the genome.
– Retrotransposons move by a copy-and-paste
mechanism.
• Retrotransposons are closely related to
retroviruses, such as HIV (AIDS virus) or feline
leukemia virus. The only difference is, retroviruses
have a coat protein gene that allows them to move
outside the cell.
• The ends of retrotransposons are long terminal
repeats (LTRs), which are identical direct (as
opposed to inverted) repeats.
• Autonomous retrotransposons carry a gene for
reverse transcriptase, which converts RNA into
DNA.
– Non-autonomous retrotransposons use reverse
transcriptase from an autonomous element.
Non-LTR Retrotransposons
• Some retrotransposons don’t have LTRs. Like all retrotransposons, they are
transcribed into RNA, then reverse-transcribed back into DNA.
• Humans have 2 types of non-LTR retrotransposon: LINE elements (long
interspersed nuclear elements) and SINE elements (short interspersed nuclear elements).
• LINE elements have a reverse transcriptase gene. About 500,000 copies in the
human genome (17% of the genome), mostly inactive. They seem to have an
important role in maintaining chromatin structure: they aren’t just parasites.
• SINE elements are non-autonomous: they use reverse transcriptase from other
elements.
– In primates, the most common SINE is the Alu sequence: present in 1.5 million copies in
the human genome (11% of the genome). The Alu sequence originated from the RNA used
to guide mRNA molecules to the endoplasmic reticulum for translation into the
membrane.
Highly Repeated Sequences
• Short sequences (say 5-200 bp) in long tandem arrays, mostly near
centromeres or on the short arms of acrocentric chromosomes. Some are
also on other chromosome arms, appearing as “secondary constrictions” in
metaphase chromosomes under the microscope (centromere is the primary
constriction).
• Centromeres are composed of highly repeated simple sequences.
• Constitutive heterochromatin is composed of highly repeated DNA. As seen
in the microscope, it is densely staining and late replicating chromosomal
material. It contains very few genes.
• These sequences are not normally transcribed.
Molecular Phylogeny
• These days, most evolutionary relationships are determined (or confirmed) using DNA
analysis.
– Orthologs are compared across species
• In general, the more time that has passed since the 2 species diverged from a common
ancestor, the more changes in the DNA
– Especially for synonymous mutations (DNA changes but the amino acid stays the same).
• A phylogenetic tree is a representation of the ancestor-descendant relationships
between species. It shows the evolutionary relationships inferred from the data.
– Based on the concept that all species diverged from a common ancestor.
• Some trees are rooted: they show which species are ancestral and which are
descendants. Other trees are unrooted: they show how closely species are related
without implying ancestry.
– Trees are rooted using an outgroup: a species known to be less closely related to all the
others than they are to each other. For example, chimpanzees are a good outgroup for
human phylogenies.
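Counting DNA changes between orthologs is the raw material for a tree. A minimal sketch, assuming the sequences are already aligned and of equal length; the three sequences and species labels are made up and carry no biological meaning.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)

def distance_matrix(sequences):
    """Pairwise distances for a dict of name -> aligned sequence."""
    return {a: {b: p_distance(sequences[a], sequences[b]) for b in sequences}
            for a in sequences}

# Invented aligned orthologs from three hypothetical species.
orthologs = {
    "species_A": "ATGGCCCTGA",
    "species_B": "ATGGCCCTAA",
    "species_C": "ATGACTCTAG",
}
for name, row in distance_matrix(orthologs).items():
    print(name, row)
```

In this toy data species_A and species_B differ at 1 position in 10, while each differs from species_C at 3 or 4 positions, which is the pattern expected if species_C is the outgroup.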
Ultrametric vs. Additive Trees
• Ultrametric trees: all leaves on the same level. A
result of the molecular clock idea: mutations
occur and are selected for at the same rate in all
lineages
– All leaves (present day taxa) are at the same level, which
represents the present day.
– Certainly not true: some genes in some lineages
evolve much faster than others.
• Additive trees: branch length is proportional to
number of mutational changes in the lineage:
leaves are usually not all at the same level,
because some lineages evolve faster than others.
– Additive trees are more realistic than ultrametric
trees: we know some genes and some lineages
evolve at different rates.
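Whether a set of pairwise distances fits an ultrametric (clock-like) tree can be tested with the three-point condition: for every trio of taxa, the two largest of the three distances must be equal. The sketch below applies that standard check to two invented distance tables; the function names are made up.

```python
from itertools import combinations

def symmetric(pairs):
    """Build a nested distance table from {(a, b): distance} entries."""
    dist = {}
    for (a, b), d in pairs.items():
        dist.setdefault(a, {})[b] = d
        dist.setdefault(b, {})[a] = d
    return dist

def is_ultrametric(dist, tolerance=1e-9):
    """Three-point condition: in every trio the two largest distances are equal."""
    for a, b, c in combinations(list(dist), 3):
        d = sorted([dist[a][b], dist[a][c], dist[b][c]])
        if abs(d[2] - d[1]) > tolerance:
            return False
    return True

# Same rate in every lineage: A and B are equally far from C.
clock_like = symmetric({("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6})
# Lineage B has changed faster, so it is farther from C than A is.
uneven = symmetric({("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 8})

print(is_ultrametric(clock_like))  # prints True
print(is_ultrametric(uneven))      # prints False
```

The second table still fits an additive tree (branch lengths 0 for A, 2 for B, 6 for C), just not one with all leaves at the same level.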
Tree of Life
• Carl Woese noticed that all living things had
ribosomes, and ribosomal RNA was easy to
sequence. By comparing 16S ribosomal RNA
sequences (or the equivalent 18S sequences
in eukaryotes), it was possible to determine
how all organisms are related.
– There are very few protein-coding genes found in all
organisms.
– Doesn’t work with viruses: they don’t have
ribosomes
– Also doesn’t work with extinct organisms unless
there is fairly recent (less than 100,000 years old)
tissue available.
• One major finding: prokaryotic organisms can
be divided into Bacteria and Archaea, based
on a very ancient split between them
– It is still not clear how the eukaryotes are related to
the bacteria and archaea.
– Evolutionary relationships between many groups are
still being determined.
Prokaryotic Genomes
• Circular chromosome, tightly packed with
genes
• Most genes are single copy; very little
repeat sequence DNA.
• Very few genes found in all species, and
many cases of convergent evolution: genes
with clearly different origins performing the
same enzyme activity.
Horizontal Gene Transfer
• An important issue: horizontal gene
transfer: transfer of DNA between distantly
related species. As opposed to vertical
gene transfer: the normal method, genes
transferred from parent to offspring.
– It’s a small problem in eukaryotes (at least, things
like plants and animals), but a major issue in
prokaryotes, where 10% or more of DNA in a
species has been transferred in across large
evolutionary distances.
– Prokaryotic sexual processes (conjugation,
transduction, transformation) often work very
well between species.
• Detected because a gene’s sequence
resembles orthologs in very different
species more than in closely related species.
Humans, Gorillas, Chimpanzees
• In the Great Ape
lineage, which species
split off first, humans,
chimpanzees, or
gorillas?
– Note that we humans
think chimps and
gorillas are much more
similar to each other
than to us.
Homo sapiens originated in Africa
• The older theory, called the multiregional
hypothesis, stated that Homo erectus arose in
Africa and spread throughout Europe, Asia, and
Africa. Different groups of modern humans are
descended from different populations of H.
erectus. The idea is that several independent
groups evolved from H. erectus to H. sapiens
separately from each other (making differences
between human groups very ancient).
• The Out of Africa theory says that H. sapiens
arose in Africa and spread out from there,
displacing H. erectus and others (like the
Neandertals).
Mitochondrial DNA Analysis
• Mitochondria have their own DNA, a small circle that is
easy to isolate and sequence.
• Mitochondria are inherited strictly from the mother, so it
is possible to treat the entire mitochondrial chromosome
as a single ortholog, and construct a phylogenetic tree
from it
– Use chimpanzees as outgroup to root it
– Later it became obvious that Neandertals are much more distant
than any living humans
• Different people share common mutations: called
haplogroups.
• At the root of the tree is haplogroup L.
– All members of group L and related subgroups come from Africa
• Haplogroups M and N are derived from L. Some people
in these groups are found in Africa, but everyone whose
ancestors came from Europe or Asia is haplogroup M or
N, or something derived from that.
• North and South America were first colonized 10,000-20,000
years ago, from Asia to Alaska.
Mitochondrial Haplogroup Migration