Looking at Whole Genomes: Frequency of Occurrence of Oligonucleotides
Lecture I
Winter School on Modern Biophysics
National Taiwan University, December 16-18, 2002
HC Lee
Dept Physics & Dept Life Science, National Central University

The Book of Life
• Millions of sequences

Growth of sequenced genome data
• Genome data exploded after 1995 (GenBank: as of 2002 January 13)
CBL@NCU

The Human Genome
• Humans have 23 pairs of chromosomes (24 distinct types)
• 3 billion bp

First working draft of the Human Genome
• Sequencing of the first working draft published in February 2001
• Nature, 409, February 15, 860-921 (2001)
• Science, 291, February 16, 1304-1351 (2001)

Genome - Book of Life written in four letters
• Genome: packaged pair of DNA strands with double helix structure
• DNA: a polymer of nucleotides
• Nucleotide: backbone + base
• Four types of bases: A, C, G, T (the four letters)
• Gene: coded sequence of bases
• Genome: set of all genes; set of all chromosomes

Central Dogma
• Genome (DNA): genetic information (genes)
• Transcription and translation: genes (nucleotide sequences) are transcribed into mRNA, which ribosomes translate into proteins (amino acid sequences)
• Proteins: expression and function

New way to do Life Science Research
• in vivo: in the living organism
• in vitro: in the test tube
• in silico: in the computer

Frequency of occurrence of oligonucleotides
A simple first look at whole genomes

Oligo (or k-mer) Frequency
• Oligonucleotide (oligo): short sequence of several nucleotides (k ~ 2-30 long); a k-mer
• There are $4^k$ different kinds of k-mers
• Frequencies of occurrence of all k-mers in a sequence can be obtained by reading with a "sliding window"
• Complete set of frequencies of k-mers characterizes a DNA sequence
• Very fast to compute; scales with sequence length
• For multiple sequences, scales with the number of sequences
• Related to alignment

Counting k-mers with Sliding Window
• At each window position, increment the count of the oligo read, e.g. N(GTTACCC) = N(GTTACCC) + 1
• Sum over all N(oligo) = sequence (circular) length
• Sequence is represented by the set {N(oligo) | all oligos}
• Or: for each k, the sequence is represented by a $4^k$-component vector

[Figure: frequency distribution of 6-mers; axes: frequency of oligo vs. number of oligos]
More about this in lecture II

"Portraits" of microbial genomes

Making a portrait
• Divide a rectangle into $2^k$ by $2^k$ cells, each cell corresponding to one of the $4^k$ different kinds of k-mers
• Write in each cell the frequency of the k-mer
• Color-code ranges of frequencies

Mycoplasma genitalium
• Length 0.58 Mb; G+C content 32%
• Bacteria, Firmicutes
• Pathogen from the human urogenital tract

Mycoplasma pneumoniae
• Length 0.816 Mb; G+C content 40%
• Bacteria, Firmicutes
• Parasite of the human respiratory tract

Borrelia burgdorferi
• Length 0.911 Mb; G+C content 30%
• Bacteria, Spirochaetales
• Causative agent of Lyme disease (neurologic complications, arthritis)

Rhizobium sp. NGR234
• Length 0.53 Mb; G+C content 59%
• Bacteria, Proteobacteria
• Representative bacterium that fixes nitrogen in symbiosis with many plants

Aquifex aeolicus
• Length 1.55 Mb; G+C content 40%
• Bacteria, Aquificales
• Earliest diverging and most thermophilic bacterium known. Can grow on hydrogen, oxygen, carbon dioxide.

Haemophilus influenzae
• Length 1.83 Mb; G+C content 38%
• Bacteria, Proteobacteria
• Blood-loving bacterium, historically (and mistakenly) named as the causative agent of influenza

Methanococcus jannaschii
• Length 1.66 Mb; G+C content 31%
• Archaea, Euryarchaeota
• Anaerobic, methane-producing hyperthermophile; grows at > 200 atm and an optimum temperature of 85 degrees C
• Note: fractals

Helicobacter pylori
• Length 1.67 Mb; G+C content 40%
• Bacteria, Proteobacteria
• Acid-loving causative agent of chronic gastric diseases
• Note: fractals

Archaeoglobus fulgidus
• Length 2.18 Mb; G+C content 49%
• Archaea, Euryarchaeota
• Hyperthermophilic sulphur-reducer; causes havoc by souring oil wells

Synechocystis sp. PCC6803
• Length 3.587 Mb; G+C content 48%
• Bacteria, Cyanobacteria
• Unicellular cyanobacterium widely used for study of the oxygen-producing photosynthesis mechanism
• Exceptionally wide distribution of frequency of occurrence of short oligos

Phylogeny based on alignment of homologous sequences

Molecular Evolution & Phylogeny
• Organism represented by genome
• A universal ancestor is believed to have existed
• Random mutation of DNA sequence leads to divergence and new species
• Pressure from fitness causes conservation of sequence

Phylogeny & Sequence similarity
• Because fitness exerts pressure on functional sequence to be conserved, if the rate of change induced by mutation is assumed constant, then the dissimilarity between two homologous sequences indicates the time elapsed since they diverged. Hence sequence similarity can be used to study phylogeny.
• E.g. phylogeny based on 16S/18S rRNA

Sequence Alignment
• Most important method for studying sequence homology
• Example: alignment of two sequences a and b

Seq a: TACCATCGCAAACAT--GG (length 17b)
       x||||x|x|||x-|x--x|
Seq b: AACCACCACAAG-ACCTCG (length 18b)

• Consensus length 19, 10 matches (|), 6 mismatches (x), 1 single gap (-, SG), 1 extended gap (--, EG)
• Score = matches – (SG + EG)*P – (length of EG – 1)*PE (P: penalty per gap, single or extended; PE: penalty per extra position in an extended gap)
• With P = 1 and PE = 1: Score = 10 – 2 – 1 = 7
• Similarity = matches / total length = 10/19 = 53%

Sequence Alignment (II)
• Result intuitive, evolution based
• Widely used in sequence analysis: homology search, phylogeny, etc.
• Parameter dependent: many alignments possible (Needleman-Wunsch algorithm)
• DNA & protein sequences
• Good software, e.g. BLAST, GCG, ...
• Fast for length < 2000
• Costly for long and remotely related sequences; NP-complete for multiple alignments

The Ribosome
• E.g. phylogeny based on 16S/18S rRNA
  – 16S (Prokaryotes): 1550 bases; 18S (Eukaryotes): 1800 bases
• Ribosomal enzyme
• Transcription & translation
• Among the most ancient and best conserved biological machines
• In genome of EVERY organism
• Two subunits: 30S + 50S
• 30S (small subunit): 16S/18S + 20 proteins
• Translates mRNA

[Figure: "Cartoon" of 16S rRNA; labels: Head, Body, Platform]
[Figure: E. coli 16S rRNA secondary structure; labels: Head, Platform, Body, 3'm]

16S rRNA alignment tree
• 35 organisms: 19 bacteria, 9 archaea, 7 eukarya
[Figure: alignment tree; labels: Bacteria, E. coli, Bacillus, Aquifex, Herpetosiphon, Thermotoga, Mouse, Homo sapiens, Eukarya, Methanococcus, Archaea, Archaeoglobus, C. elegans]

Phylogeny based on frequency of k-mers
Sequence distance based on Oligo Frequency

16S/18S rRNA k-mer tree as function of k
[Figure: k-mer trees; labels: Bacteria, Archaea, Eukarya]

Oligo Frequency and sequence alignment distances correlated
• If sequences evolve ONLY by uncorrelated single mutations, then $S = X^{n}$ (because the chance of any one base not changing is X)
  – X: alignment similarity
  – S: oligo frequency similarity
  – n: oligo length
• In practice, there is more than single mutation, e.g. extended gaps. Then $S = X^{kn}$ with $k < 1$. Empirically, $k = 2/3$.

Simulated Random Mutations
• Oligo length = 9; $S = X^{9}$
[Figure: log S_oligo vs. log X_align]

Extended Gaps I
Extended Gaps II

Simulated Random Mutations with Extended Gaps
• Oligo length = 9; $S = X^{6.3}$
• h = 4, n_g = 3, k_th = 0.625
[Figure: log S_oligo vs. log X_align]

Tree of Life (35 organisms)
• Oligo length = 9
• h = 5, n_g = 2.5, k_th = 0.8, k_ex = 0.66
[Figure: log S_oligo vs. log X_align]

[Figure: oligo-frequency tree and alignment tree; labels: Eukarya, Archaea, Bacteria, Aquifex, Thermotoga]

Comparison of 16S/18S rRNA Trees of Life (35 organisms)
• Similar topology
• Differences in detail
[Figure: overlaid trees; black: oligo frequency, red: sequence alignment; labels: Bacteria, Aquifex, Thermotoga, Eukarya, Archaea]

Oligo method is Robust
• Three tests (Bacteria and Archaea)
  – Random truncation of 16S rRNA to 800 to 1200 bases
  – Random inversion of 16S rRNA (splice, reverse order and reconnect)
  – Random concatenation of 23S, 16S and 5S rRNA sequences
[Figure: alignment and oligo trees for the robustness tests; panels: 16S rRNA truncated, mixed 5S+16S+23S rRNAs; labeled taxa include Aquifex, Thermotoga, Sulfolobus, Aeropyrum; scale bar 0.1]

Towards a Consensus Tree based on whole genomes

Tree is sequence dependent
• Phylogenetic relations expressed by genes are not universal
• A tree extracted from the 16S rRNA gene differs, not always just in detail, from a tree extracted from another well conserved gene
• A consensus tree may be constructed but depends on criteria that are subjective

Can a Consensus Tree be constructed from whole genomes?
• Also a subjective choice
• Genomes are vastly complex, hence the number of possible combinations of criteria that can be chosen for tree construction is huge
• Frequency of occurrence of oligonucleotides has universal characteristics across life forms (see next lecture)
  – Extremely frequent and extremely rare oligos (EFEROs) characterize groups of organisms

"Consensus" tree of 65 microbes with complete genomes
[Figure: tree; labels: Proteobacteria, Firmicutes, Archaea, Others]
Topology of first-trial EFEROs tree from 6-mers

SUMMARY
• Oligo frequency characterizes DNA sequences
• Oligo similarity is related to alignment similarity
• Oligo vs. alignment gives a handle on the mechanism of generation of extended gaps
• Oligo method is robust to truncation and inversions
• May be developed into a tool for analysis and comparison of very long sequences or complete genomes
• (Preview of lecture II): hints at how genomes grow

Lecture and Book
• Lecture by Paul Higgs
  – online.itp.ucsb.edu/online/infobio01/higgs/
  – see online.itp.ucsb.edu/online/infobio01/ for many lectures
• Book by Wen-Hsiong Li 李文雄
  – "Molecular Evolution" (Sinauer Associates, 1997)

Some web sites on Molecular Evolution
• CMS Molecular Biology Resource
  – www.unl.edu/stc-95/ResTools/cmshp.html
• Phylogeny - Molecular Evolution
  – www.unl.edu/stc-95/ResTools/biotools/biotools2.html
• The Tree of Life Web Project
  – tolweb.org/tree/phylogeny.html
• Web Resources in Molecular Evolution and Systematics
  – darwin.eeb.uconn.edu/molecular-evolution.html

Some web sites on ClustalW (tree drawer)
• On-line service
  – www.ebi.ac.uk/clustalw/
  – clustalw.genome.ad.jp/
• Software
  – ftp-igbmc.u-strasbg.fr/pub/ClustalX/
  – ftp-igbmc.u-strasbg.fr/pub/ClustalW/

Bacillus subtilis
• Length 4.21 Mb; G+C content 40%
• Bacteria, Firmicutes
• Aerobic bacterium commonly found in soil; important source of industrial enzymes

Methanobacterium thermoautotrophicum
• Length 1.75 Mb; G+C content 49%
• Archaea, Euryarchaeota
• Anaerobic microorganism used as representative of methanogens

Escherichia coli
• Length 4.64 Mb; G+C content 50%
• Bacteria, Proteobacteria
• Parasitic human pathogen of the digestive tract
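
Worked example: counting k-mers with a sliding window
The sliding-window count from the "Counting k-mers with Sliding Window" slide can be sketched in a few lines. This is a minimal illustration, not the lecture's own program: the function name `count_kmers`, the `circular` flag and the toy sequence are assumptions made here. The sequence is treated as circular, so the counts sum to the sequence length, as stated on the slide.

```python
from collections import Counter


def count_kmers(seq: str, k: int, circular: bool = True) -> Counter:
    """Count every k-mer read by a window of width k moved one base at a time."""
    seq = seq.upper()
    n = len(seq)
    if k < 1 or k > n:
        raise ValueError("k must be between 1 and the sequence length")
    # For a circular sequence the window wraps around the end, so there are
    # n windows in total and the counts add up to the sequence length.
    padded = seq + seq[:k - 1] if circular else seq
    windows = n if circular else n - k + 1
    counts = Counter()
    for i in range(windows):
        counts[padded[i:i + k]] += 1  # N(oligo) = N(oligo) + 1
    return counts


# Toy sequence (hypothetical): two copies of GTTACCC read as a circle.
counts = count_kmers("GTTACCCGTTACCC", 7)
total = sum(counts.values())  # equals the sequence length, 14
```

The loop visits each base once, which is why the cost scales with sequence length.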
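
Worked example: laying out a portrait
The "Making a portrait" slide does not say which k-mer goes in which cell, so the layout below is an assumption: each base selects one quadrant (A top-left, C top-right, G bottom-left, T bottom-right) and successive bases subdivide that quadrant. The names `QUADRANT`, `kmer_cell` and `portrait` and the toy sequence are hypothetical. Color-coding the frequency ranges is left to whatever plotting tool is at hand.

```python
# Assumed quadrant for each base, as (row bit, column bit).
QUADRANT = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def kmer_cell(kmer: str) -> tuple[int, int]:
    """Map a k-mer to its (row, column) in a 2^k by 2^k grid."""
    row = col = 0
    for base in kmer:
        r, c = QUADRANT[base]  # raises KeyError for letters other than A, C, G, T
        row = (row << 1) | r
        col = (col << 1) | c
    return row, col


def portrait(seq: str, k: int) -> list[list[int]]:
    """Fill the grid with k-mer counts, reading the sequence as circular."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    padded = seq + seq[:k - 1]  # wrap the window around the end
    for i in range(len(seq)):
        row, col = kmer_cell(padded[i:i + k])
        grid[row][col] += 1
    return grid


# Toy sequence (hypothetical), 2-mers: a 4 by 4 grid of counts.
grid = portrait("ACGTACGGTTAC", 2)
```

With this layout, k-mers sharing a prefix land in the same sub-square, so structure at several scales shows up as nested blocks.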
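
Worked example: scoring the alignment of Seq a and Seq b
The score on the "Sequence Alignment" slide can be checked with a short script. It assumes the gap positions implied by the match line (one gap in Seq b, a two-base gap in Seq a) and takes both penalties as 1, which is what the slide's arithmetic 10 – 2 – 1 = 7 implies. The function name `score_alignment` and its parameter names are hypothetical. As a simplification, a gap in one sequence directly followed by a gap in the other is counted as a single run.

```python
def score_alignment(a: str, b: str, gap_open: float = 1.0, gap_extend: float = 1.0) -> dict:
    """Score two already aligned sequences of equal length; '-' marks a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gap_runs = gap_extensions = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            if in_gap:
                gap_extensions += 1  # extra position in an extended gap
            else:
                gap_runs += 1        # first position of a single or extended gap
            in_gap = True
        else:
            in_gap = False
            if x == y:
                matches += 1
            else:
                mismatches += 1      # mismatches score zero in this scheme
    score = matches - gap_runs * gap_open - gap_extensions * gap_extend
    return {
        "matches": matches,
        "mismatches": mismatches,
        "gap_runs": gap_runs,
        "score": score,
        "similarity": matches / len(a),
    }


result = score_alignment("TACCATCGCAAACAT--GG", "AACCACCACAAG-ACCTCG")
```

For these two sequences the script finds 10 matches, 6 mismatches and 2 gaps, giving a score of 7 and a similarity of 10/19.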
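
Worked example: oligo similarity predicted from alignment similarity
The relation S = X^(kn) from the "Oligo Frequency and sequence alignment distances correlated" slide is easy to tabulate. The sketch below only evaluates the formula; it does not reproduce the simulations. The function names and the sample value X = 0.9 are assumptions. The helper `effective_k` reads k off a single (X, S) pair, which corresponds to the slope of the log S vs. log X plots divided by n.

```python
import math


def predicted_oligo_similarity(x: float, n: int, k: float = 1.0) -> float:
    """Return S = X**(k*n); k = 1 is the uncorrelated single-mutation case."""
    if not 0.0 < x <= 1.0:
        raise ValueError("alignment similarity X must lie in (0, 1]")
    return x ** (k * n)


def effective_k(s: float, x: float, n: int) -> float:
    """Solve S = X**(k*n) for k, given one observed pair (X, S) with X < 1."""
    return math.log(s) / (n * math.log(x))


# Sample values (hypothetical): alignment similarity 0.9, 9-mers.
s_single = predicted_oligo_similarity(0.9, 9)          # single mutations only
s_gapped = predicted_oligo_similarity(0.9, 9, k=2 / 3)  # empirical k = 2/3
k_back = effective_k(s_gapped, 0.9, 9)                  # recovers about 2/3
```

Because k < 1 when extended gaps are present, the oligo similarity falls off more slowly with X than the single-mutation prediction.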