Genome Sizes (haploid)

Organism                    Base pairs      Genes           Notes
Phi-X 174                   5,386           10              virus of E. coli
Human mitochondrion         16,569          37
Epstein-Barr virus (EBV)    172,282         80              causes mononucleosis
Nanoarchaeum equitans       490,885         552             This parasitic archaean has the smallest genome of a true organism yet found.
Mycoplasma genitalium       580,073         483             One of the smallest true organisms
Rickettsia prowazekii       1,111,523       834             bacterium that causes epidemic typhus
Mimivirus                   1,181,404       1,262           A virus (of an amoeba) with a genome larger than many cellular organisms
Borrelia burgdorferi        1.44 x 10^6     1,738           bacterium that causes Lyme disease
Haemophilus influenzae      1,830,138       1,738           bacterium that causes middle ear infections
Neisseria meningitidis      2,272,351       2,221           Group B; the most frequent cause of meningitis in the U.S.
Propionibacterium acnes     2,560,265       2,333           causes acne
Listeria monocytogenes      2,944,528       2,926           2,853 of these encode proteins; the rest RNAs
E. coli                     4,639,221       4,377           4,290 of these genes encode proteins; the rest RNAs
Saccharomyces cerevisiae    12,495,682      5,770           Budding yeast. A eukaryote.
Arabidopsis thaliana        115,409,949     25,498          a flowering plant
Drosophila melanogaster     122,653,977     13,379          the "fruit fly"
Humans                      3.3 x 10^9      20,000–25,000
Rice                        4.3 x 10^8      ~60,000
Amphibians                  10^9 - 10^11    ?

Shotgun sequencing
• Experimental methods for DNA sequencing work reliably for segments ~500 bases long.
• The genome must be randomly divided into fragments of ~500 bp, which are then sequenced and reassembled.
• A continuous stretch of overlapping segments is called a contig.
• Let $a = NL/G$ be the coverage of a genome of size $G$ by $N$ segments of length $L$.

[Figure: contigs]

Problems:
1) Find the mean proportion of the genome covered by the contigs.
Ans: $\langle G_c \rangle = 1 - e^{-a}$
2) Determine the average number of contigs for given $G$ and $L$ as a function of $a$.
Ans: $\langle n \rangle = N e^{-a}$
3) Determine the mean contig size.
Ans: $\langle l \rangle = L\,\frac{e^{a} - 1}{a}$, which is the mean covered length $G \langle G_c \rangle$ divided by the mean number of contigs $\langle n \rangle$.

Genetic code

[Figure: schematic of gene expression with labels genome, gene, RNAP, regulatory proteins, mRNA (A, C, U, G), ribosome, nuclease, protein (AA), protease]

DNA: sequence of 4 nucleotides: A adenine, G guanine, C cytosine, T thymine
Base pairing of DNA: A–T, C–G

mRNA: sequence of 4 nucleotides: A adenine, G guanine, C cytosine, U uracil
Base pairing of RNA: A–U, C–G, G·U

protein: 20 amino acids: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y

A  Alanine        Ala      I  Isoleucine  Ile      R  Arginine    Arg
C  Cysteine       Cys      K  Lysine      Lys      S  Serine      Ser
D  Aspartic Acid  Asp      L  Leucine     Leu      T  Threonine   Thr
E  Glutamic Acid  Glu      M  Methionine  Met      V  Valine      Val
F  Phenylalanine  Phe      N  Asparagine  Asn      W  Tryptophan  Trp
G  Glycine        Gly      P  Proline     Pro      Y  Tyrosine    Tyr
H  Histidine      His      Q  Glutamine   Gln

Mapping DNA –> mRNA: A –> A, G –> G, C –> C, T –> U

Mapping mRNA –> protein: uses a triplet code

UUU Phe    UCU Ser    UAU Tyr     UGU Cys
UUC Phe    UCC Ser    UAC Tyr     UGC Cys
UUA Leu    UCA Ser    UAA STOP    UGA STOP
UUG Leu    UCG Ser    UAG STOP    UGG Trp

CUU Leu    CCU Pro    CAU His     CGU Arg
CUC Leu    CCC Pro    CAC His     CGC Arg
CUA Leu    CCA Pro    CAA Gln     CGA Arg
CUG Leu    CCG Pro    CAG Gln     CGG Arg

AUU Ile    ACU Thr    AAU Asn     AGU Ser
AUC Ile    ACC Thr    AAC Asn     AGC Ser
AUA Ile    ACA Thr    AAA Lys     AGA Arg
AUG Met    ACG Thr    AAG Lys     AGG Arg

GUU Val    GCU Ala    GAU Asp     GGU Gly
GUC Val    GCC Ala    GAC Asp     GGC Gly
GUA Val    GCA Ala    GAA Glu     GGA Gly
GUG Val    GCG Ala    GAG Glu     GGG Gly

• AUG (Met) at the beginning of a gene signals the START of translation
• The code is redundant, comma-less (requires specification of the reading frame), and almost universal

ORF: open reading frame, a region devoid of stop codons

Sequence motifs and patterns
• Translation start and stop codons
• Restriction enzyme cutting sites
• Promoters
• Regulatory or structural protein binding sites
• Exons & introns

Restriction Enzymes
• Bind to short (4 to 6 bases) sequences of DNA
• Cut the DNA at the binding site
• Function during recombination
• Help to protect against viruses

EcoRI
5'-GAATTC-3'
3'-CTTAAG-5'

Examples of Restriction Enzymes

Enzyme     Organism from which derived      Target sequence (cut at *), 5' --> 3'
Bam HI     Bacillus amyloliquefaciens       G*GATCC
Bgl II     Bacillus globigii                A*GATCT
Eco RI     Escherichia coli RY 13           G*AATTC
Hae III    Haemophilus aegyptius            GG*CC
Hha I      Haemophilus haemolyticus         GCG*C
Hind III   Haemophilus influenzae Rd        A*AGCTT
Hpa I      Haemophilus parainfluenzae       GTT*AAC
Kpn I      Klebsiella pneumoniae            GGTAC*C
Mbo I      Moraxella bovis                  *GATC
Pst I      Providencia stuartii             CTGCA*G
Sma I      Serratia marcescens              CCC*GGG
Sst I      Streptomyces stanford            GAGCT*C
Sal I      Streptomyces albus G             G*TCGAC
Taq I      Thermus aquaticus                T*CGA
Xma I      Xanthomonas malvacearum          C*CCGGG

Analysis of a single sequence

Statistical models for a single sequence

Independent distribution: $P(A) = p_A$, $P(G) = p_G$, $P(C) = p_C$, $P(T) = p_T$
iid DNA: $p_A = p_G = p_C = p_T = 0.25$

Weight matrix: probability of finding the base $b$ at location $i$: $p_{bi}$

Markov type distribution (homogeneous):
Transition matrix: probability that base $i + 1$ is $b$ given that base $i$ is $d$: $p_{bd}$
Limiting probability distribution $\pi_b$: $\sum_{d} p_{bd}\,\pi_d = \pi_b$, $b = A, G, C, T$

Problems:
1) Find the mean and variance of the number of occurrences $Y$ of an $M$ base long word in an $N$ base long iid DNA.
Ans: for large $N$,
$\frac{\langle Y \rangle}{N} \cong 4^{-M}, \qquad \frac{\mathrm{Var}(Y)}{N} \cong 4^{-2M}\left(4^{M} - (2M - 1) + 2\sum_{s=1}^{M-1} w_s\,4^{s}\right)$
where $w_s$ is taken to be 1 if the first $s$ bases of the word coincide with its last $s$ bases (the word can overlap itself) and 0 otherwise. For $M = 1$ this reduces to the binomial value $\mathrm{Var}(Y)/N = 3/16$.
2) Find the mean distance between consecutive occurrences of the word.
Ans: $\langle l \rangle = 4^{M} - 1$

Location of protein binding sites
• A DNA binding protein can attach to a variety of sites
• The binding site can be described by the consensus sequence or a weight matrix

Problem: Find the most likely binding sites for a protein in a given DNA sequence.
Ans: Maximize the score $S = \sum_i s_{x_i,\,i}$, where $x_i$ is the base at position $i$ of the candidate site.

Ex: narL binding site in E. coli

Data:
TACCTCAATAGCGGTA
TGCTCCTTTATAGGTA
TAACTCTTTCCGGGTA
TACATCGGTAAGGGTA
TTACTCACTATGGGTA
GTATTCCCCATGGGTA
TTATCCTAAAGGGGTA
TCACTCGAAAGTAGTA
TACCCCGATCGGGGTA
TACTCCTTAATGGGTA
TAACTCTAAAGTGGTA

Consensus:
TAACTCTATAGGGGTA

Weight matrix (frequencies) $p_{bi}$: column counts of the 11 sequences above divided by 11, rounded to two decimals

     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16
A    0     0.55  0.55  0.09  0     0     0.18  0.45  0.36  0.82  0.09  0.09  0.09  0     0     1
G    0.09  0.09  0     0     0     0     0.27  0.09  0     0     0.45  0.64  0.91  1     0     0
C    0     0.09  0.45  0.55  0.36  1     0.09  0.18  0.09  0.18  0.09  0.09  0     0     0     0
T    0.91  0.27  0     0.36  0.64  0     0.45  0.27  0.55  0     0.36  0.18  0     0     1     0

Weight matrix (scores): $s_{bi} = \ln\left(\frac{p_{bi}}{p_b} + f\right)$

The tabulated scores are consistent with $p_b = 0.25$ and $f \approx 0.02$ (so that $p_{bi} = 0$ gives $-3.9$ and $p_{bi} = 1$ gives $1.4$).

     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16
A    −3.9  0.8   0.8   −1.0  −3.9  −3.9  −0.3  0.6   0.4   1.2   −1.0  −1.0  −1.0  −3.9  −3.9  1.4
G    −1.0  −1.0  −3.9  −3.9  −3.9  −3.9  0.1   −1.0  −3.9  −3.9  0.6   0.9   1.3   1.4   −3.9  −3.9
C    −3.9  −1.0  0.6   0.8   0.4   1.4   −1.0  −0.3  −1.0  −0.3  −1.0  −1.0  −3.9  −3.9  −3.9  −3.9
T    1.3   0.1   −3.9  0.4   0.9   −3.9  0.6   0.1   0.8   −3.9  0.4   −0.3  −3.9  −3.9  1.4   −3.9

Sequence logo (http://weblogo.berkeley.edu/)
Height: $h_{ia} = p_{ia}\left(2 + \sum_{a'} p_{ia'} \log_2 p_{ia'}\right)$, that is, $p_{ia}$ times the information content of position $i$ (2 bits minus the Shannon entropy of that column).

Available software on the web

Regulatory sequence analysis tool (RSA) (http://rsat.ulb.ac.be/rsat/)
• Allows search for patterns in single or multiple sequences
• Construction of consensus sequences
• Search for weight matrix motifs

Consensus server (Stormo Laboratory) (http://adric.wustl.edu/oldconsensus/)
• Construction of weight matrices by searching multiple sequences.
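The answers to the shotgun sequencing problems can be checked numerically. The sketch below is only an illustration and not part of the original notes: the function name `shotgun_stats` and the example values of G, L and N are hypothetical. It evaluates the coverage a = NL/G, the covered fraction 1 − e^(−a), and the mean number of contigs N e^(−a). The mean contig size is taken as the covered length divided by the number of contigs, which works out to L (e^a − 1)/a.

```python
import math


def shotgun_stats(G, L, N):
    """Return (a, covered_fraction, mean_contigs, mean_contig_size).

    G: genome size in bases, L: fragment length, N: number of fragments.
    Hypothetical helper for illustration only.
    """
    a = N * L / G                              # coverage
    covered_fraction = 1.0 - math.exp(-a)      # <G_c>
    mean_contigs = N * math.exp(-a)            # <n>
    # Covered length divided by number of contigs: L (e^a - 1) / a
    mean_contig_size = L * (math.exp(a) - 1.0) / a
    return a, covered_fraction, mean_contigs, mean_contig_size


if __name__ == "__main__":
    # Assumed example: 1 Mb genome, 500-base fragments, 10,000 fragments.
    a, frac, n, size = shotgun_stats(G=1_000_000, L=500, N=10_000)
    print(f"coverage a          = {a:.2f}")
    print(f"fraction covered    = {frac:.4f}")
    print(f"mean no. of contigs = {n:.1f}")
    print(f"mean contig size    = {size:.0f} bases")
```

With these assumed values the coverage is a = 5: roughly 99% of the genome is covered, in about 67 contigs averaging about 14.7 kb each.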
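The genetic code mappings (DNA –> mRNA –> protein) and the search for restriction sites can be sketched in a few lines. This is a minimal illustration under stated assumptions: the input DNA is treated as the coding strand, so transcription is just T –> U as in the mapping given in these notes. All function names (`transcribe`, `translate_first_orf`, `find_sites`) are hypothetical, and the codon table is the standard code written in U, C, A, G order with `*` marking STOP.

```python
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(dna):
    """Map a coding-strand DNA string to mRNA (T -> U)."""
    return dna.upper().replace("T", "U")


def translate_first_orf(mrna):
    """Translate from the first AUG up to (not including) the first in-frame STOP."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[pos:pos + 3]]
        if amino == "*":          # STOP codon ends the reading frame
            break
        protein.append(amino)
    return "".join(protein)


def find_sites(dna, site="GAATTC"):
    """Return all start positions (0-based) of a recognition sequence, e.g. EcoRI."""
    positions = []
    pos = dna.find(site)
    while pos >= 0:
        positions.append(pos)
        pos = dna.find(site, pos + 1)
    return positions


if __name__ == "__main__":
    print(translate_first_orf(transcribe("CCATGTTTTGGTGACC")))
    print(find_sites("AAGAATTCTTGAATTC"))
```

Scanning from the first AUG only is a simplification; a fuller ORF search would examine all three reading frames on both strands.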
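The narL example can be reproduced with a short script: build the frequency matrix from the 11 sites, convert it to scores with s = ln(p/p_b + f), and slide the matrix along a DNA string to find the window with the maximal score S. This is a sketch, not the method of any particular tool. The function names are hypothetical, and the background probability p_b = 0.25 and pseudocount f = 0.02 are assumptions, chosen because they reproduce score values such as 1.4 and −3.9 in the score table.

```python
import math

# The 11 narL binding-site sequences from the example above.
SITES = [
    "TACCTCAATAGCGGTA", "TGCTCCTTTATAGGTA", "TAACTCTTTCCGGGTA",
    "TACATCGGTAAGGGTA", "TTACTCACTATGGGTA", "GTATTCCCCATGGGTA",
    "TTATCCTAAAGGGGTA", "TCACTCGAAAGTAGTA", "TACCCCGATCGGGGTA",
    "TACTCCTTAATGGGTA", "TAACTCTAAAGTGGTA",
]
BASES = "AGCT"


def frequency_matrix(sites):
    """One dict per position: fraction of sites carrying each base there."""
    n = len(sites)
    width = len(sites[0])
    return [{b: sum(s[i] == b for s in sites) / n for b in BASES}
            for i in range(width)]


def score_matrix(freqs, background=0.25, f=0.02):
    """Scores ln(p/p_b + f); background and f are assumed values."""
    return [{b: math.log(col[b] / background + f) for b in BASES}
            for col in freqs]


def consensus(freqs):
    """Most frequent base at each position."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in freqs)


def best_site(dna, scores):
    """Return (score, start, window) of the highest-scoring window in dna."""
    width = len(scores)
    best = None
    for start in range(len(dna) - width + 1):
        window = dna[start:start + width]
        total = sum(scores[i][window[i]] for i in range(width))
        if best is None or total > best[0]:
            best = (total, start, window)
    return best


if __name__ == "__main__":
    freqs = frequency_matrix(SITES)
    scores = score_matrix(freqs)
    print("consensus:", consensus(freqs))
    # Hypothetical test sequence: the consensus flanked by arbitrary bases.
    print(best_site("GGGG" + consensus(freqs) + "CCCC", scores))
```

In practice one would report every window whose score exceeds a threshold rather than the single maximum, since a genome can contain several binding sites.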