Genomics

• Genomics is the study of the entire genome: the sequence of all the DNA in the cell.
  – For humans, the haploid genome is about 3 billion (3 x 10^9) base pairs (bp). Since we are diploid, we have about 6 x 10^9 bp per cell.
• Related subjects (plus many more!):
  – Proteome: all the proteins in the cell
  – Transcriptome: all of the RNAs (i.e. transcripts) in the cell
  – Metabolome: all of the metabolites (the small molecules of the metabolic pathways) in the cell
• What makes it possible to study all of these things is DNA sequencing. It is possible (and not all that hard) to determine the DNA sequence of an entire genome.
  – This then lets you find all the genes, which can then be translated into proteins using the well-known genetic code.
  – Important point: virtually all RNA and protein sequences are inferred from DNA sequence data. RNAs and proteins are not directly sequenced (except for some minor applications).
• The DNA sequence of the human genome contains all the information needed to produce a person. If we know someone's DNA sequence, can we predict their phenotype?

DNA Polymerase Reaction
• The basis of DNA sequencing, as well as PCR and other DNA techniques, is the enzyme DNA polymerase.
• DNA polymerase is a protein (like almost all enzymes). It consists of several polypeptide subunits.
• DNA polymerase catalyzes the synthesis of the second strand of a DNA molecule.
• To start, it needs:
  – A single-stranded DNA template molecule
  – A primer: a short piece of DNA base-paired with a region of the template
  – dNTPs: the 4 deoxynucleoside triphosphates dATP, dCTP, dGTP and dTTP, which are the raw materials for the new DNA strand.
• The basic reaction: the 2 phosphate groups on the end of the dNTP molecule (the gamma (γ) and beta (β) phosphates) are removed, and the phosphate next to the sugar (the alpha (α) phosphate) is attached to the 3' OH group of the growing DNA chain.
  – The removal of the phosphates provides the energy needed to drive the reaction.
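As a minimal sketch of the reaction described above, the Python snippet below models primer extension on strings. It assumes standard A-T / G-C base pairing; the function name `extend_primer` and the example template and primer are hypothetical, chosen only for illustration, and the model ignores enzyme kinetics, errors, and the chemistry of the phosphates.

```python
# Watson-Crick pairing: the base that is added opposite each template base.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def extend_primer(template_3to5, primer_5to3):
    """Extend a primer along a template, one nucleotide at a time.

    template_3to5: template strand written 3'->5' (left to right)
    primer_5to3:   primer written 5'->3', base-paired with the start of the template
    Returns the finished new strand, written 5'->3'.
    """
    # The primer must already be base-paired with the template.
    for t, p in zip(template_3to5, primer_5to3):
        if PAIR[t] != p:
            raise ValueError("primer is not complementary to the template")

    new_strand = list(primer_5to3)
    # Each step attaches one nucleotide to the free 3' end of the growing chain,
    # continuing until the end of the template is reached.
    for t in template_3to5[len(primer_5to3):]:
        new_strand.append(PAIR[t])
    return "".join(new_strand)

template = "TACGGATCCA"  # hypothetical template, written 3'->5'
primer = "ATG"           # hypothetical primer, written 5'->3'
print(extend_primer(template, primer))
```

The new strand only ever grows at its 3' end, which is why a primer with a free 3' OH group is required before the polymerase can start.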
More DNA Polymerase
• The DNA polymerase reaction is processive: it starts from the 3' end of the primer and adds new nucleotides, one at a time, until it reaches the end of the template. You now have a complete double-stranded DNA molecule.
• Each nucleotide added is complementary to the nucleotide on the template strand: A paired with T, and G paired with C.

Replication in the Cell
• Uses RNA primers made by the primase enzyme.
• Occurs on both strands simultaneously, at a replication fork. The replication fork is created by helicase unwinding the double helix.
• The leading strand is synthesized continuously, in the same direction the replication fork is moving.
• The lagging strand is replicated in short stretches called Okazaki fragments.
• DNA polymerase replaces the RNA primers with DNA. Then the Okazaki fragments are ligated together by the DNA ligase enzyme.
• DNA polymerase has an error-correction function: if the wrong nucleotide is added, DNA polymerase backs up, removes it, and tries to add the proper nucleotide again.

Polymerase Chain Reaction
• The polymerase chain reaction (PCR) is used to make many identical copies of a short region of DNA, so it can be analyzed further.
• PCR uses 2 primers that bind to opposite strands of the DNA molecule, a short distance apart (say, 1000 bp or less).
• PCR is a set of 3 reactions, done repeatedly in a cycle. With each cycle the number of DNA molecules doubles: exponential growth.
  – Starting with 1 DNA molecule, after 30 cycles you have 2^30 (about a billion) identical molecules, which can easily be detected.
  – Only the region between the primers gets amplified.
• The 3 steps of PCR:
  – Denaturation (melting). Double-stranded DNA is converted to single strands by heating it to a high temperature (say 94°C).
  – Primer annealing. The primers bind to complementary regions on the DNA when they are incubated at a lower temperature (say 50°C).
  – Primer extension.
New second strands are built on the template by DNA polymerase, starting at the primers (say 72°C).
• All reactions occur in a single tube, with just the temperature changing every minute or so.
  – Uses a DNA polymerase that survives high temperatures: Taq polymerase. It was isolated from bacteria growing in hot springs at Yellowstone National Park.

PCR
Figures: thermal profile of 1 PCR cycle, which cycles between 94°C (denaturation), 50°C (annealing), and 72°C (elongation); what happens in each of the 3 PCR steps; exponential growth of DNA molecules.

DNA Sequencing
• Many sequencing methods have been invented, and it's still a very active area of research.
• Most use the concept of sequencing by synthesis: starting with a primer, DNA polymerase adds new bases one at a time, and the method records which base is added at each step.
• In the Illumina method (the current favorite), fluorescent tags attached to the 3' OH group are used.
  – Each of the 4 nucleotides has a different colored tag.
• The fluorescent tags block the 3' OH of the new nucleotide, so the next base can only be added when the tag is removed.
• A cycle: add one new base, then read its color, then remove the fluorescent tag to give a free 3' OH group.
• Repeat the cycle up to 200 times.
• End up with 200 bp of sequence information.

More Sequencing
• To get enough signal from the DNA molecule being sequenced, each DNA molecule needs to be amplified using PCR.
• For the Illumina method, this is done by attaching individual DNA molecules to a solid surface, then amplifying them in place, giving tiny spots with about a million identical copies.
• The DNA polymerase sequencing reactions are then monitored with a high-resolution video camera.

Sequence Assembly
• The big problem with all current sequencing methods: you only get very short reads: 200 bp maximum for Illumina, up to 1000 bp for the older (slower, much more expensive) Sanger method, etc.
  – The human genome is 23 DNA molecules (chromosomes) that total 3 billion bp. Human chromosomes are 50-250 million base pairs long.
  – You need to assemble the tiny reads into much longer contigs (contiguous sequences). With a perfectly sequenced genome, the final contigs would be identical to the DNA sequence of the chromosomes.
• How reads are assembled into contigs: overlapping sequences.

Assembly Problems
• Chromosomes, especially eukaryotic chromosomes, are filled with sequences that are repeated many times. If you have a read from a repeated sequence, how do you know which copy it came from?
  – Some repeats are next to each other (tandem repeats) and some are scattered all over the genome (dispersed repeats).
• The main solution to this problem is to start with longer DNA template molecules and sequence both ends. You don't know the sequence in between, but you do know how far apart the ends are. This often allows you to jump over repeated sequences.
  – It's not perfect, and even now there are no human chromosomes sequenced to 100% accuracy.

BLAST: How to find similar sequences

Finding Similar Sequences
• Once you have the DNA sequence of your gene, how do you find other, similar genes?
  – Within the genome: are there duplicate genes present? For example, in the human genome there are related genes for alpha globin and beta globin. Are there other globins in the genome?
  – Between genomes: if you find a gene in one species, is it present in others? For example, are there globin genes in plants or fungi?
  – Gene function: if the function of a protein is determined by laborious experimentation, you can extend the value of the results by saying that similar genes in other species probably do the same thing.
  – What parts of the protein are conserved across evolutionary lines (and thus are probably important to protein function)?
• BLAST (Basic Local Alignment Search Tool) is the standard tool for doing sequence comparison.
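To make the overlap idea from the Sequence Assembly slide concrete, here is a toy greedy assembler in Python. It is only a sketch under simplifying assumptions: reads are error-free, all come from the same strand, and the genome has no repeats. The function names `overlap` and `greedy_assemble`, the `min_len` parameter, and the example reads are all hypothetical; real assemblers use far more elaborate methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two sequences with the largest overlap."""
    contigs = list(reads)
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            # No remaining overlaps: whatever is left is the final contig set.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)

# Three hypothetical reads taken from the sequence ATGGCGTACGTTAGC
reads = ["ATGGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(reads))
```

With repeated sequences, several different reads can overlap equally well, and a greedy merge like this one can join the wrong copies. That is the problem described on the Assembly Problems slide, and it is why paired-end information from longer templates is needed.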
Protein Sequences Are More Conserved Than DNA Sequences
• Evolutionary fitness of a particular gene depends on how well it functions under different conditions.
• Function is determined by the amino acid sequence of the protein.
  – It is also determined by the 3-dimensional structure of the protein, which is based on the amino acid sequence.
• The genetic code has many synonymous codons. A mutation that changes the nucleotide sequence but not the amino acid sequence produces the exact same protein, so there is no effect on evolutionary fitness.
  – This means that many mutations within a gene can accumulate in a population based solely on genetic drift.
• As a consequence, we want to compare protein sequences in preference to DNA sequences.
  – However, almost all protein sequences are derived from DNA sequences that have been translated in silico (by a computer).

Dotplots
• We want to align one sequence with another: for example, we can learn about a newly sequenced gene by aligning it with all the sequences in a large database, to see if it is similar to a previously known sequence.
• The dotplot is a tool for graphically aligning 2 sequences.
  – Put the letters of one sequence along the x-axis and the other sequence on the y-axis.
  – At each intersection where the sequence letters match, output a dot.
  – You can see the alignment as a diagonal line, even though there are some mismatches and indels (indel = insertion/deletion: some nucleotides present in one sequence but not the other).

A Real Dotplot
• Two haptoglobin sequences. (Haptoglobin is a blood protein that binds to hemoglobin that has gotten out of the red blood cells.)
• You can see a gap in one sequence, a region of poor similarity just before it, and a simple sequence repeat near the beginning.

Automated Sequence Alignment
• Dotplots have some problems:
  – Very slow: they require humans to examine them and then judge what a good alignment is.
  – It's hard to know what to do with difficult regions, where the alignment isn't clear.
  – Dotplots consider all amino acid changes as equally bad: every position is either a match or not. In practice, we know that some changes are conservative: a different amino acid produces very little change in the protein's function.
• What BLAST does, in effect, is mathematically create a dotplot and find the best diagonal alignment, without needing our human visualization skills.

An Actual BLAST Search Result
• This is an alignment between two superoxide dismutase genes, from two different Bacillus species.
• It uses the 1 letter amino acid code.
• Between the Query and Subject lines you see a letter if there is a match, and a + if the two amino acids are similar.
• Gaps are shown as dashes (---): amino acids present in one sequence but not the other.
• A sequence alignment program needs to deal with matching amino acids, conservative (similar) amino acids, complete mismatches, and gaps (indels).

Substitution Matrices
• If you align many sequences, it is clear that some amino acid substitutions are common, while others are very rare.
• A substitution matrix gives a score for each possible amino acid substitution between 2 sequences.
  – Substitution matrices are created by counting the different substitutions in large numbers of sequences that have been carefully aligned by hand.
• Two commonly used sets of matrices: PAM and BLOSUM.
  – Both PAM and BLOSUM have several matrices that have been tuned to work with different levels of evolutionary divergence.
• We are just going to use the BLOSUM62 matrix, which is the default for BLAST searches.
• The alignment score is just the sum of the individual BLOSUM scores for each pair of aligned amino acids.

BLOSUM62 Substitution Matrix
(Figure: the BLOSUM62 matrix.)

Observations on BLOSUM62 Matrix
• Numbers are small positive or negative integers. They represent how frequently different substitutions were seen in the manually curated sequences relative to completely random pairings.
  – A score of 0 means that the substitution occurs about as often as would be expected by chance alone.
  – A positive number means the substitution occurs more frequently than expected by chance.
  – A negative number means it occurs less frequently than expected by chance.
• Along the diagonal are scores for keeping the same amino acid in both sequences.
  – Some amino acids are very conserved in evolution: for example C (cysteine) has a score of 9 and W (tryptophan) has a score of 11. These amino acids are only rarely substituted.
  – Other amino acids are less conserved: I (isoleucine), L (leucine) and V (valine) have scores of 4.
• The body of the matrix shows scores for amino acid changes.
  – Most are 0 or less.
  – Some are positive: I, L, and V substitute for each other fairly often, for example.

Gaps
• Once many different sequence alignments had been done, it became obvious that there were 2 important types of mutation: nucleotide substitutions and short indels.
  – Indel = insertion/deletion: some nucleotides present in one sequence but not the other.
  – Indels appear as gaps in the aligned sequences, symbolized by dashes (---).
• Substitution matrices work for substitutions but not gaps.
• There is no good theory for the relationships between substitutions, gaps, and gap lengths, so gaps are dealt with heuristically.
  – Heuristic = a method or value determined by trial-and-error experiments, without a strong guiding theory.
  – In this case, gap penalties are the result of trying many possibilities and seeing which ones give the most pleasing alignments.
• The existence of an indel seems to be relatively independent of its length. Because of this, gaps are scored in 2 ways:
  – Gap opening penalty: a negative score for each gap. BLAST default is -11.
  – Gap extension penalty: a smaller negative score proportional to the length of the gap. BLAST default is -1.
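The scoring rules above (substitution scores plus the two-part gap penalty) can be sketched in Python as follows. This is an illustration, not BLAST's own code. The table holds only a handful of BLOSUM62 entries: the diagonal values quoted in the text, plus the I/L/V off-diagonal values, which are taken from the published matrix and should be checked against it. The names `BLOSUM62_SUBSET`, `pair_score`, and `score_alignment`, and the example alignment, are hypothetical.

```python
GAP_OPEN = 11    # magnitude of the BLAST default gap opening penalty
GAP_EXTEND = 1   # magnitude of the BLAST default gap extension penalty

# A small subset of BLOSUM62, enough for the example below.
BLOSUM62_SUBSET = {
    ("C", "C"): 9, ("W", "W"): 11,
    ("I", "I"): 4, ("L", "L"): 4, ("V", "V"): 4,
    ("I", "L"): 2, ("I", "V"): 3, ("L", "V"): 1,
}

def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]  # KeyError if the pair is not in the subset

def score_alignment(query, subject):
    """Score two aligned sequences of equal length; '-' marks a gap position."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    gaps = 0              # number of separate gaps (each pays the opening penalty)
    gapped_positions = 0  # total positions aligned with a gap (each pays extension)
    previous_side = None  # which sequence the current gap is in, if any
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            side = "query" if q == "-" else "subject"
            if side != previous_side:
                gaps += 1
            gapped_positions += 1
            previous_side = side
        else:
            total += pair_score(q, s)
            previous_side = None
    return total - gaps * GAP_OPEN - gapped_positions * GAP_EXTEND

# Hypothetical alignment: C/C, I/L, V/V, one gap of length 2, then W/W
print(score_alignment("CIVLLW", "CLV--W"))
```

Here the aligned pairs sum to 9 + 2 + 4 + 11 = 26, and the single gap of length 2 costs 11 + 2 x 1 = 13, so the alignment score is 13. One long gap is much cheaper than several short ones, which reflects the observation that the existence of an indel is relatively independent of its length.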
Scoring an Alignment
• Alignment score (S) = sum of all aligned amino acid scores - (number of gaps x gap opening penalty) - (number of amino acids aligned with gaps x gap extension penalty), with the penalties taken as positive magnitudes (11 and 1 for the BLAST defaults).
• First add up the scores for each aligned pair of amino acids.
• Then count the number of gaps (indels), multiply that by the gap opening penalty, and subtract from the total.
• Then count the number of amino acids that are aligned with gaps, multiply by the gap extension penalty, and subtract from the total score.

Practical BLAST
• Let us say you have a sequence that you want to find matches for. Your sequence is the query sequence.
• BLAST compares the query sequence against a database of sequences, the subject sequences.
  – The most commonly used database is the nr database. "nr" stands for non-redundant, and it consists of all known sequences with any identical ones removed.
  – It is also common to use a database for a single organism or group of organisms, such as human-only or mammalian-only.
• The sequences can be either DNA or protein.
  – Protein sequences are better conserved in evolution, so they are typically used for cross-species comparisons.
  – Almost all protein sequences are sequenced DNA that was translated using the genetic code.
• There are several versions of BLAST that do slightly different things. We are going to concentrate on blastn (both the query and the subject are nucleotide sequences) and blastp (both the query and the subject are peptide sequences).

BLAST E-values
• BLAST results are usually reported as e-values ("expect-values"). The e-value for a match between a query sequence and a subject sequence is the number of subject sequences in a completely random database that would be expected to have the same match score or better. The random database must be the same size as the one you are using.
  – Really bad matches have e-values of 1 or more: an e-value of 1 means that even in a completely random database you could find a match as good as the one being reported.
  – Most e-values are numbers less than 1, on an exponential scale. They look like 3e-23, for example. This means 3 x 10^-23, which indicates a good match.
  – The larger the negative exponent is, the better the match is. Thus, 1e-80 is a better match than 3e-23.
  – The best matches have an e-value of 0.0. This score implies an e-value smaller than about 1e-180: BLAST reports anything below that limit as 0.0.

BLAST Algorithm
1. Make a list of all possible 3 letter "words" in the query sequence.
2. Use the substitution matrix to find all synonyms for each word that exceed a minimum score.
3. Search the database for sequences that have matching words. This might include many sequences.
4. Extend the ungapped alignment between the 2 sequences, starting at the matching word and moving in both directions, until the score starts to drop.

More BLAST Algorithm
5. If the ungapped alignment's score is greater than a threshold, do a full alignment that allows gaps. (This is a much slower process than an ungapped alignment.)
6. Process the scores to convert them to e-values, using parameters from the scoring matrix, the length of the query sequence, and the size of the database.
7. Report all matches with e-values better (lower) than a threshold (default = 10).

Things that might appear on a test
• Use the BLOSUM62 matrix plus gap penalties to score an alignment
• Arrange BLAST scores from best to worst
• Find all 3 letter words in a sequence

An Actual BLAST search
• The superoxide dismutase gene codes for the enzyme that destroys superoxide radicals inside the cell. Superoxide is a highly reactive byproduct of aerobic respiration. Mutations in this gene can cause amyotrophic lateral sclerosis (aka Lou Gehrig's disease).
• Starting with the protein sequence from humans, obtained from the National Center for Biotechnology Information (NCBI).
  – Uses the 1 letter amino acid code.
  – FASTA format: a comment (title) line starting with '>', then the sequence itself on one or more lines after the comment line.
• Most BLAST searches are done at NCBI (http://blast.ncbi.nlm.nih.gov/Blast.cgi), because it is easy and they have the most up-to-date version of the nr database. However, it is also quite slow.
• We are going to use a local BLAST database, because we can (probably) do it in real time. http://biolinx2.bios.niu.edu/rjohns/bmeg/bmeg_blast.htm

>gi|4507149|ref|NP_000445.1| superoxide dismutase [Cu-Zn] [Homo sapiens]
MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSR
KHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGN
AGSRLACGVIGIAQ

Does Bacillus megaterium have Superoxide dismutase?

Results, part 1
• The header section.
  – Version of BLAST. Note it's blastp (for proteins).
  – Literature references.
  – Comment line from the query sequence and its size (154 letters).
  – Subject database and its size: 5629 sequences (genes) and 1,503,829 letters (amino acids).

More Results
• This is the summary list of BLAST hits. It shows every hit: every gene that gives a match better than some cutoff value. The list is sorted so the best one is on top.
• Here, there are only 2 hits.
• Each line shows the gene ID number (BMQ_2135 and BMQ_4952) along with a bit of description (which is cut off somewhat), followed by the bit score and the e-value.
• E-values here are moderately good. This protein is conserved between humans and bacteria.

Hit Details
• Comment line (note the >) from the database for this gene. It is indeed a superoxide dismutase, with a total length of 207 amino acids.
• Repeat of score and e-value (Expect), along with:
  – Identities: after the sequences are aligned, the number of identical amino acids
  – Positives: counts similar amino acids as well as identical ones
  – Gaps: places where the alignment has an amino acid in one sequence but not in the other. Gaps are indel mutations.

More Detail
• The groups of 3 lines show the query sequence, the subject sequence, and the positions where they match (between). A + means a similar amino acid (one of the "Positives" from the last slide).
• The numbers are the start and end of the match. This is one continuous sequence: between amino acids 25 and 150 in the human (query) sequence, and between positions 80 and 205 in the B. megaterium (subject) sequence.

The other hit and other info

What We Know about Genomes

Eukaryotic Genomes
• Linear chromosomes
• Lots of gene duplication
• Transposable elements
• Repeat sequence DNA

Orthologs and Paralogs
• Homologues: genes that match each other in a BLAST search.
• During speciation, one species splits into 2 different species. The ancestral gene now has versions in both descendant species: these genes are called orthologs.
  – The beta-globin genes in humans and chimpanzees are orthologs.
• Within a species, the gene might be duplicated. The different copies of the gene are called paralogs.
  – The alpha and beta globin genes in humans are paralogs.
• Paralogs are free to evolve new functions.

Synteny
• We defined "syntenic" to mean genes on the same chromosome as each other.
• Now, we extend this definition a bit to cover comparisons between species. A region of a chromosome in one species is syntenic with a region in another species if they both have the same genes (orthologs) in the same order.
• Closely related species contain many large blocks of syntenic genes. All mammals, for instance.
  – Useful for finding orthologs

Part of a study looking for genes affecting alcoholism.
A region on human chromosome 1 that affects severity of alcohol withdrawal symptoms is syntenic with a mouse gene that affects mitochondrial respiration.

Waardenburg syndrome
• Waardenburg syndrome. There are several types: we are discussing type 1 here.
• Different-colored eyes, white forelock, white skin patches, deafness.
• Due to partial absence of melanocytes. Melanocytes are derived from the neural crest: they originate near the developing neural tube and migrate laterally down the flanks of the body in the embryo.
• Autosomal dominant, but varies in expressivity.

Synteny and Waardenburg Syndrome
• The disease was mapped to a region on human chromosome 2. However, there were dozens of genes in the mapped region: which was the real one?
• A mutation in mice, Splotch, also shows white patches and deafness, and maps to a syntenic region on mouse chromosome 1.
• Also, the PAX3 gene, mapped as a transcribed DNA sequence, was in the same mouse chromosome region. PAX3 is a transcription factor active in the neural crest.
• An unmapped human gene, HuP2, showed strong sequence identity to PAX3.
• DNA of the HuP2 gene from Waardenburg patients was examined: 6 out of 17 unrelated patients had altered DNA, and 0 out of 50 normal controls had altered DNA.
• Since the human HuP2 gene was very similar to the mouse gene, it was renamed PAX3.

The mouse on the left is a Splotch mutant.

Transposable Elements
• Genes have fixed locations on the chromosomes: that's why we can map them.
• However, certain DNA sequences, called transposable elements or transposons, can move from place to place in the genome.
  – First discovered by Barbara McClintock in maize, then later seen in bacteria: it became clear that they were in all organisms.
• They are mostly thought of as intracellular parasites: as long as they replicate more frequently than mutation can inactivate them, they remain in the genome.
• Nearly half (about 45%) of the human genome is (mostly inactive) transposable elements.
• Two basic types: DNA transposons and retrotransposons (which use RNA).

DNA Transposons
• DNA transposons have short inverted repeats at their ends. Between the repeats is the gene for transposase.
• Transposase is an enzyme that cuts the transposon out of one location and then inserts it into a new location: DNA transposons move by a cut-and-paste mechanism.
• Transposase works on the inverted repeats. It can act on any transposon in the cell that has the proper inverted repeats, not just the one carrying the transposase gene. This means that some DNA transposons are autonomous (they carry their own transposase gene) and others are non-autonomous (they rely on the transposase gene from another transposon).

Retrotransposons
• Retrotransposons replicate through an RNA intermediate: they are transcribed just like a regular gene, and then the RNA is reverse-transcribed back into DNA, which inserts at random locations in the genome.
  – Retrotransposons move by a copy-and-paste mechanism.
• Retrotransposons are closely related to retroviruses, such as HIV (AIDS virus) or feline leukemia virus. The only difference is that retroviruses have a coat protein gene that allows them to move outside the cell.
• The ends of retrotransposons are long terminal repeats (LTRs), which are identical direct (as opposed to inverted) repeats.
• Autonomous retrotransposons carry a gene for reverse transcriptase, which converts RNA into DNA.
  – Non-autonomous retrotransposons use reverse transcriptase from an autonomous element.

Non-LTR Retrotransposons
• Some retrotransposons don't have LTRs. Like all retrotransposons, they are transcribed into RNA, then reverse-transcribed back into DNA.
• Humans have 2 types of non-LTR retrotransposon: LINEs (long interspersed elements) and SINEs (short interspersed elements).
• LINE elements have a reverse transcriptase gene. About 500,000 copies in the human genome (17% of the genome), mostly inactive.
They seem to have an important role in maintaining chromatin structure: they aren't just parasites.
• SINE elements are non-autonomous: they use reverse transcriptase from other elements.
  – In primates, the most common SINE is the Alu sequence: present in 1.5 million copies in the human genome (11% of the genome). The Alu sequence originated from the RNA used to guide mRNA molecules to the endoplasmic reticulum for translation into the membrane.

Highly Repeated Sequences
• Short sequences (say 5-200 bp) in long tandem arrays, mostly near centromeres or on the short arms of acrocentric chromosomes. Some are also on other chromosome arms, appearing as "secondary constrictions" in metaphase chromosomes under the microscope (the centromere is the primary constriction).
• Centromeres are composed of highly repeated simple sequences.
• Constitutive heterochromatin is composed of highly repeated DNA. As seen in the microscope, it is densely staining and late-replicating chromosomal material. It contains very few genes.
• These sequences are not normally transcribed.

Molecular Phylogeny
• These days, most evolutionary relationships are determined (or confirmed) using DNA analysis.
  – Orthologs are compared across species.
• In general, the more time that has passed since the 2 species diverged from a common ancestor, the more changes there are in the DNA.
  – Especially for synonymous mutations (the DNA changes but the amino acid stays the same).
• A phylogenetic tree is a representation of the ancestor-descendant relationships between species. It shows the evolutionary relationships inferred from the data.
  – Based on the concept that all species diverged from a common ancestor.
• Some trees are rooted: they show which species are ancestral and which are descendants. Other trees are unrooted: they show how closely species are related without implying ancestry.
  – Trees are rooted using an outgroup: a species known to be less closely related to the others than they are to each other.
For example, chimpanzees are a good outgroup for human phylogenies.

Ultrametric vs. Additive Trees
• Ultrametric trees: all leaves on the same level. A result of the molecular clock idea: mutations occur and are selected for at the same rate in all lineages.
  – All leaves (present-day taxa) are at the same level, which represents the present day.
  – The clock assumption is certainly not strictly true: some genes in some lineages evolve much faster than others.
• Additive trees: branch length is proportional to the number of mutational changes in the lineage. Leaves are usually not all at the same level, because some lineages evolve faster than others.
  – Additive trees are more realistic than ultrametric trees: we know some genes and some lineages evolve at different rates.

Tree of Life
• Carl Woese noticed that all living things had ribosomes, and ribosomal RNA was easy to sequence. By comparing 16S ribosomal RNA sequences (or the equivalent 18S sequences in eukaryotes), it was possible to determine how all organisms are related.
  – There are very few protein-coding genes found in all organisms.
  – Doesn't work with viruses: they don't have ribosomes.
  – Also doesn't work with extinct organisms unless there is fairly recent (less than 100,000 years old) tissue available.
• One major finding: prokaryotic organisms can be divided into Bacteria and Archaea, based on a very ancient split between them.
  – It is still not clear how the eukaryotes are related to the bacteria and archaea.
  – Evolutionary relationships between many groups are still being determined.

Prokaryotic Genomes
• Circular chromosome, tightly packed with genes.
• Most genes are single copy; very little repeat sequence DNA.
• Very few genes found in all species, and many cases of convergent evolution: genes with clearly different origins performing the same enzyme activity.

Horizontal Gene Transfer
• An important issue: horizontal gene transfer: transfer of DNA between distantly related species.
As opposed to vertical gene transfer: the normal method, genes transferred from parent to offspring.
  – It's a small problem in eukaryotes (at least, things like plants and animals), but a major issue in prokaryotes, where 10% or more of the DNA in a species has been transferred in across large evolutionary distances.
  – Prokaryotic sexual processes (conjugation, transduction, transformation) often work very well between species.
• Detected because a gene's sequence resembles orthologs in very different species more than those in closely related species.

Humans, Gorillas, Chimpanzees
• In the Great Ape lineage, which species split off first: humans, chimpanzees, or gorillas?
  – Note that we humans think chimps and gorillas are much more similar to each other than to us.

Homo sapiens originated in Africa
• The older theory, called the multiregional hypothesis, stated that Homo erectus arose in Africa and spread throughout Europe, Asia, and Africa. Different groups of modern humans are descended from different populations of H. erectus. The idea is that several independent groups evolved from H. erectus to H. sapiens separately from each other (making differences between human groups very ancient).
• The Out of Africa theory says that H. sapiens arose in Africa and spread out from there, displacing H. erectus and others (like the Neandertals).

Mitochondrial DNA Analysis
• Mitochondria have their own DNA, a small circle that is easy to isolate and sequence.
• Mitochondria are inherited strictly from the mother, so it is possible to treat the entire mitochondrial chromosome as a single ortholog, and construct a phylogenetic tree from it.
  – Use chimpanzees as the outgroup to root it.
  – Later it became obvious that Neandertals are much more distant than any living humans.
• Different people share common mutations: these groups are called haplogroups.
• At the root of the tree is haplogroup L.
  – All members of group L and related subgroups come from Africa.
• Haplogroups M and N are derived from L.
Some people in these groups are found in Africa, but everyone whose ancestors came from Europe or Asia is haplogroup M or N, or something derived from that.
• North and South America were first colonized 10,000-20,000 years ago, from Asia via Alaska.

Mitochondrial Haplogroup Migration
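The haplogroup relationships described above form a small rooted tree, which can be sketched as a parent map in Python. Only the relationships stated in the text are included (L at the root, with M and N derived from it); the names `PARENT`, `lineage`, and `is_derived_from` are hypothetical, and a real haplogroup tree has many more branches.

```python
# Parent of each haplogroup; None marks the root of the tree.
PARENT = {"L": None, "M": "L", "N": "L"}

def lineage(haplogroup):
    """Return the path from a haplogroup back to the root of the tree."""
    path = []
    while haplogroup is not None:
        path.append(haplogroup)
        haplogroup = PARENT[haplogroup]
    return path

def is_derived_from(haplogroup, ancestor):
    """True if ancestor lies on the path from haplogroup back to the root."""
    return ancestor in lineage(haplogroup)[1:]

print(lineage("M"))
print(is_derived_from("N", "L"))
```

Tracing any haplogroup back through its parents always ends at L, which matches the picture of an African root with later branches spreading outward.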