COMP.3500/5800 Topics in Bioinformatics

What is bioinformatics?
• Study of DNA sequences, genomes, and proteins
• Modeling/Inference

What is involved in bioinformatics?
• Biology
• Statistics
• http://sphweb.bumc.bu.edu/otlt/MPHModules/PH/PH709_DNA-Genetics/
• http://pages.uoregon.edu/aarong/teaching/G4074_Outline/node1.html
• Computer Science
  – Algorithms
  – Machine learning

Why study bioinformatics?
• What makes us different? 99.9% of our genomes are identical
• How do different cells develop from the same genome?
• Study of mutations in the genome -> drug development

Topics today (Textbook, pgs. 16-19)
• Genes to Proteins
  – Transcription
  – Translation
• DNA & RNA
• Genome

Genome
• One or more chromosomes that contain the code (genes) directing the synthesis of the proteins essential for an organism's structure and function
• Human: 22 pairs of homologous chromosomes plus the sex chromosomes X and Y
• http://www.ncbi.nlm.nih.gov/genome/?term=txid9606[orgn]

Genes
• Allele
  – Alternative forms of the same gene
  – Dominant, recessive

Central Dogma in Molecular Biology
• Information flows in one direction
• DNA/genome: a template or a roadmap
• RNA: copies of genes to be expressed (activated)
• Protein: biochemical molecules performing biological functions

Gene to Protein: Transcription & Translation
[Figure: transcription, showing the sense and anti-sense strands]
[Figure: gene to protein, showing the protein-coding region with exons and introns, the 5’UTR and 3’UTR, intergenic and non-protein-coding regions, and alternative splicing into Protein 1 and Protein 2]

Translation
• Genetic code
  – A triplet of bases (called a codon)
  – Ribosome moves along mRNA 3 bases at a time
• Degenerate coding
  – 4x4x4 = 64 possible triplets map onto 20 amino acids
  – For 8 amino acids the 3rd base is irrelevant, so a mutation at that position does not change the amino acid
• Anti-codon: reverse complement of a codon

Genetic Code
[Figure: genetic code table]

Amino Acids
• General structure of amino acids
  – an amino group
  – a carboxyl group
  – an α-carbon bonded to a
hydrogen and a side-chain group, R
• R determines the identity of the particular amino acid
[Figure legend: R, large white and gray; carbon, black; nitrogen, blue; oxygen, red; hydrogen, white]

DNA and RNA
• DNA (deoxyribonucleic acid) and RNA (ribonucleic acid) are linear chains of monomeric units called nucleotides
• A nucleotide has three parts: a sugar, a phosphate, and a base
• Four bases in each: A, C, G, T in DNA; A, C, G, U in RNA

Base Types
• Nucleic acid bases are of two types
  – Pyrimidine [pairímədì:n]: C, T, U (two nitrogens in a 6-member ring, at positions 1 and 3)
  – Purine: A, G (a pyrimidine ring fused to an imidazole ring, C3H4N2)
[Figure: ambiguity codes R, Y, W, S, M, K, B, D, H, V, N, each marking a subset of the bases A, G, T, C]

Primary Structure of DNA and RNA
• Nucleotides are joined by phosphodiester bonds and form a sugar-phosphate backbone
• Sugar is deoxyribose in DNA (left) and ribose in RNA (right)
• Nitrogen-containing nucleobases are bonded to the sugar

Online course on Biology
• Educational Portal
• DNA chemical structure
• http://education-portal.com/academy/lesson/dna-and-thechemical-structure-of-nucleic-acids.html

Secondary Structure
• Double helix: proposed in 1953 by Watson and Crick using X-ray diffraction data
• Sugar-phosphate backbone is the outer part of the helix
• Two strands run in antiparallel directions
• Dimensions
  – Inside diameter of backbone: 11 Å (1.1 nm)
  – Outer diameter: 20 Å (1 Å = 10^-10 m = 0.1 nm)
  – Length of one complete turn: 34 Å, 10 base pairs
• Major and minor grooves: sites where drugs or polypeptides bind to DNA

Secondary Structure of DNA
• Two strands are complementary
• Base pairing: A-T; G-C
• A pyrimidine and a purine pair through complementary hydrogen bonding

Monomer counts in DNA
• In double strands
  – # of A = # of T; # of G = # of C
  – Erwin Chargaff’s 1st Parity Rule, 1951
• In a single strand?
  – # of A ≈ # of T; # of G ≈ # of C (holds only approximately)
  – Erwin Chargaff’s 2nd Parity Rule

Importance of Hydrogen Bonding
• Many consider the hydrogen bond essential to the evolution of life
• An individual hydrogen bond is weak, but many H bonds collectively exert a very strong force
• The orderly, repetitive arrangement of H bonds in polymers determines their shape

Online course on Biology
• Educational Portal
• Four bases
• http://education-portal.com/academy/lesson/dna-adenineguanine-cytosine-thymine-complementary-base-pairing.html

Chromosome Length
• 3.4 Å per base pair
• 3 billion base pairs
  – About 1.8 meters of DNA per diploid cell
  – About 0.09 mm (90 µm) of chromatin after being wound on histones
• Five families of histones
  – H1/H5, H2A, H2B, H3, and H4

RNA
• Sugar in an RNA nucleotide is ribose rather than 2’-deoxyribose
• Thymine is replaced by uracil (U)
• RNA polymers are usually a few thousand nucleotides or shorter
• RNA in cells is usually single-stranded
• RNA is considered to be the original gene-coding material, and it still codes genes in some viruses

RNA Types
• Four RNAs are involved in protein synthesis

RNA Type            Size      Function
Transfer RNA        Small     Transports amino acids to protein synthesis sites
Ribosomal RNA       Variable  Combines with proteins to form the ribosome, where the polypeptide chain grows
Messenger RNA       Variable  Carries the amino-acid sequence information transcribed from genes
Small nuclear RNA   Small     Processes the initial mRNA to its mature form in eukaryotes

Online course on Biology
• Educational Portal
• RNA
• http://education-portal.com/academy/lesson/differences-betweenrna-and-dna-types-of-rna-mrna-trna-rrna.html

Genome
• Genome
  – The entire DNA of a cell is the genome
  – Individual units coding for proteins or RNA are genes
  – A protein-coding gene starts with ATG and ends with one or two stop codons
  – This stretch is called an ORF (Open Reading Frame)
• Biological information
  – Contained in the genome
  – Encoded in nucleotide sequences of DNA or RNA
  – Partitioned into discrete units, genes

Cell
• Different types of cells
  – Prokaryote (/proʊˈkærioʊts/; pro for “before”; karyon, “kernel” in Greek)
  – Eukaryote
(eu for “true”)
• Main difference is the presence of organelles, especially the nucleus, in eukaryotes

Organelle               Prokaryotes           Eukaryotes
Nucleus                 No definite nucleus   Present
Cell membrane           Present               Present
Mitochondria            None                  Present
Endoplasmic reticulum   None                  Present
Ribosomes               Present               Present
Chloroplasts            None                  Present in green plants

[Figures: animal cell, plant cell, prokaryotic cell]

Three Domains
• Classification based purely on biochemistry (RNA): C. Woese, 1981
  – Eubacteria (true bacteria)
  – Archaea (archaebacteria, early bacteria)
  – Eukarya (eukaryotes)

Genome Sequencing Projects
• Major genome sequencing centers
  – U.S. Dept. of Energy Joint Genome Institute (435 projects)
  – J. Craig Venter Institute (302)
  – The Institute for Genomic Research (TIGR) (206)
  – Washington Univ. (184)
  – Institut Pasteur, Univ. of Tokyo
  – www.ncbi.nlm.nih.gov/genomes/static/lcenters.html (National Center for Biotechnology Information)
• Completely sequenced genomes include
  – Several hundred bacteria, over 20 archaea, and over 30 eukarya
  – Human (Homo sapiens), chimpanzee (Pan troglodytes), mouse (Mus musculus), brown rat (Rattus norvegicus), dog (Canis familiaris), thale cress (Arabidopsis thaliana), rice (Oryza sativa), fruit fly (Drosophila melanogaster), yeast (Saccharomyces cerevisiae)
• http://www.ebi.ac.uk/2can/genomes/genomes.html has descriptions of species and their clinical and scientific significance
• http://www.genomesonline.org has the current status of genome projects

Genome Databases
• Completed genomes
  – FTP site: ftp://ftp.ncbi.nlm.nih.gov/genomes/
  – http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/allorg.html
  – http://www.ebi.ac.uk/genomes/mot/index.html
  – http://pir.georgetown.edu/pirwww/search/genome.html
• Organism-specific databases
  – http://www.unl.edu/stc-95/ResTools/biotools/biotools10.html
  – http://www.fp.mcs.anl.gov/~gaasterland/genomes.html
  – http://www.hgmp.mrc.ac.uk/GenomeWeb/genome-db.html
  – http://www.bioinformatik.de/cgibin/browse/Catalog/Databases/Genome_Proejcts

Genomes of Prokaryotes
• Circular double-stranded DNA
• Protein-coding
regions do not contain introns
• Protein-coding regions are partially organized into operons: tandem genes transcribed into a single mRNA molecule
  – The trp operon in E. coli begins with a control region, followed by genes (e.g., trpE, trpD) performing successive steps in the synthesis of the amino acid tryptophan
• The density of coding regions is high: ~89% in E. coli

Genome of E. coli
• Many E. coli proteins were known before the sequencing (1853 proteins)
• Genome of Escherichia coli strain MG1655 published in 1997 by F. Blattner at Univ. of Wisconsin
• 4.64 Mbp
  – 4284 protein-coding genes
  – 122 structural RNA genes
  – Non-coding repeat sequences, regulatory elements, etc.
• Average size of an ORF is 317 AA
• Average inter-genic gap is 118 bp
• ¾ of transcription units contain single genes; the rest are operons (gene clusters)
• Functions are known for 60% of the proteins
• http://wishart.biology.ualberta.ca/BacMap/index.html contains an atlas of bacterial genome diagrams (2005)

Genome of Archaea
• The microorganism Methanococcus jannaschii thrives in hydrothermal vents at temperatures from 48 to 94 °C
[Figure: genes from 45 strains]
• Capable of self-reproduction from inorganic components
• Its metabolism synthesizes methane from H2 and CO2
• Sequenced in 1996 by TIGR
• 1.665 Mbp chromosome consisting of a circular DNA molecule, plus two extrachromosomal elements
• 1,784 protein-coding regions
• Archaeal proteins for transcription and translation are closer to those of eukaryotes
• Proteins involved in metabolism are closer to those of bacteria

Genomes of Eukarya
• Majority of DNA is in the nucleus
• Smaller amounts of DNA are in organelles such as mitochondria and chloroplasts
  – Organelles originated as intra-cellular parasites
  – Organelle genomes are usually circular, but sometimes linear or multi-circular
  – Their genetic code is different from the one for nuclear genes
• Diverse among species
• Organized into chromosomes, each containing a single DNA molecule
• Humans have 23 pairs of chromosomes, chimpanzees have 24 pairs
• Human chromosome 2 is equivalent to a fusion of chimpanzee chromosomes 12 and 13
• List of genome
sequences: http://en.wikipedia.org/wiki/List_of_sequenced_eukaryotic_genomes

Genome of Saccharomyces cerevisiae (Yeast)
• One of the simplest eukaryotic organisms
• Sequencing by about 100 labs, completed in 1996
• 12.06 Mbp
• 16 chromosomes
• 6,172 protein-coding genes
• Dense: only 231 genes contain introns

Genome of Caenorhabditis elegans (C. elegans)
• Completed in 1998
• First full DNA sequence of a multi-cellular organism
• 97 Mbp
• Paired chromosomes
  – XX for a self-fertilizing hermaphrodite (simultaneously male and female)
  – XO for male
• Avg. 5 introns per gene
• Proteins
  – 42% have homologues in other species
  – 34% are specific to nematodes (round worms)
  – 24% have no known homologues

Chromosome   Size (Mbp)   Protein genes   Kbp/gene
I            7.9          2803            5.06
II           8.5          3559            3.05
III          7.6          2508            5.40
IV           9.2          3094            5.17
V            9.8          4082            4.15
X            10.1         2631            6.54

Genome of Drosophila melanogaster (Fruit fly)
• Completed in 1999 by Celera Genomics and Berkeley
• 180 Mbp
• Five chromosomes: 3 large autosomes, Y, and a tiny fifth
• 13,601 genes, 1 gene/8 Kbp
• Has 289 homologues of human genes implicated in diseases such as cancer, cardiovascular, and neurological disorders
• There are fly models for Parkinson's disease and malaria

Genome of Arabidopsis thaliana
• Relatively small genome, 146 Mbp, completed in 2000
• Five chromosomes
• 25,498 predicted genes; 1 gene/4.6 Kbp
• Proteins
  – Most A. thaliana proteins have homologues in animals
  – 60% of genes have human homologues, e.g., BRCA2
• Gene distribution
  – Nucleus: genome size (125 Mbp), genes (25,500)
  – Chloroplast: genome (154 Kbp), genes (79)
  – Mitochondrion: genome (367 Kbp), genes (58)
[Figure: 20 of 54 genes in a 340-Kbp stretch of the rice genome (top) are conserved and retain the same order in five A. thaliana strands]

Human Genome
• Human Genome Project
  – Conceived in 1984, begun in 1990; draft sequence published in 2001 and finished sequence completed in 2003, ahead of the original 2005 schedule
• What did the sequence reveal?
  – 3 Bbp (billion base pairs)
  – 24 distinct chromosomes: 22 autosomes plus two sex chromosomes (X, Y)
  – Longest 250 Mbp, shortest 55 Mbp
  – Mitochondrial genome: circular DNA molecule of 16,569 bp (16.569 Kbp)
  – ~10^13 cells in the human body
• How many is 3 Bbp?
  – In a typical 11-pt font, 60 nucleotides print in about 10 cm (~4 in)
  – In this format, 3 Bbp would stretch about 5,000 km (~3,100 mi)

Genome of Homo sapiens
• 22 autosomes plus X (163 Mbp) and Y (51 Mbp)
• Web resources
  – Interactive access to DNA and protein sequences: http://www.ensembl.org
  – Images of chromosomes, maps, loci: http://www.ncbi.nlm.nih.gov/projects/genome/guide/
  – Gene map 99: http://www.ncbi.nlm.nih.gov/genemap99
  – Overview of human genome structure: http://www.ims.u-tokyo.ac.jp/imsut/en
  – SNP (single nucleotide polymorphisms): http://snp.cshl.org
  – Human genetic diseases: http://www.ncbi.nlm.nih.gov/Omim (Online Mendelian Inheritance in Man), http://www.geneclinics.org/profiles/all-html

Human Genome Insights (ENCODE)
• Majority of the genome is transcribed
• ~50% transposons
• ~25% protein-coding genes / 1.3% exons
• ~23,700 protein-coding genes
• ~160,000 transcripts
• Average gene ~36,000 bp
  – 7 exons @ ~300 bp
  – 6 introns @ ~5,700 bp
  – 7 alternatively spliced products (95% of genes)
• RefSeq: ~34,600 “reference sequence” genes (includes pseudogenes, known RNA genes)

Genome of Homo sapiens (cont’d)
• Repeat sequences: >50% of the genome
  – Short interspersed nuclear elements (SINEs): 13%; long interspersed nuclear elements (LINEs): 21%
  – Simple stutters (repeats of short oligomers, including mini- and microsatellites)
  – Triplet repeats such as CAG are implicated in numerous diseases (CAG codes for glutamine, giving glutamine repeats in the corresponding protein)
• SNP (pronounced “snip”)
  – An A->T mutation in beta-globin changes Glu -> Val, creating a sticky surface on haemoglobin molecules => sickle-cell anaemia
  – Progeria
  – Avg 1 SNP/Kbp (100 SNPs per 100 Kbp)
  – Many 100-Kbp regions tend to remain intact, with fewer than five SNPs
  – Discrete combinations of SNPs define an individual’s haplotype (haploid genotype)
• Individual genomes are characterized by a distribution of genetic markers including SNPs
  – Int’l HapMap Consortium

Genome of Homo sapiens (cont’d)
• SNP Consortium
  – Collects human SNPs, nearly 5 million SNPs
• Results show
  – Most of the variation appears in all populations
  – However, a few SNPs are unique to particular
populations
  – Genomes of individuals from Japan and China are very similar
  – Chromosome X varies more than other chromosomes (X is more subject to selective pressure)

Mitochondrial DNA
• Double-stranded closed circular molecule of 16,569 bp
• Inherited almost exclusively through maternal lines
• Not subject to recombination; changes only by mutation
• About 1 mutation every 25,000 years

mtDNA and Y
• mtDNA
  – Inherited through maternal lines
  – Both sons and daughters get it from their mother
  – All existing sequence variants trace back to a single woman (Mitochondrial Eve) in Africa roughly 200,000 years ago
  – Supports the “Out of Africa” hypothesis
  – Avg difference in mtDNA between pairs of individuals is 61.1; between Africans, 76.7; between non-Africans, 38.5
  – Populations have been diverging in Africa for much longer than in the rest of the world
• Y chromosome
  – Most recent common male ancestor (Y-chromosome Adam) lived around 59,000 years ago
  – The most divergent sequences are found among Africans

Other Species

Organism                      Genome size    # of genes
Epstein-Barr virus            0.17 Mbp       80
E. coli                       4.6 Mbp        4,406
Yeast (S. cerevisiae)         12.5 Mbp       6,172
Nematode worm (C. elegans)    100.3 Mbp      19,099
Thale cress (A. thaliana)     115.4 Mbp      25,498
Fruit fly (D. melanogaster)   128.3 Mbp      13,601
Human (H. sapiens)            3223.0 Mbp     20,500
Fugu (Takifugu rubripes)      390.0 Mbp      30,000
Wheat                         16000.0 Mbp    30,000
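
As a rough illustration of how the "Other Species" figures can be compared, the sketch below computes the average number of Kbp per gene for each organism and sorts the organisms from most to least gene-dense. The genome sizes and gene counts are copied from the table; the names GENOMES, kbp_per_gene, and density_table are hypothetical helpers introduced only for this example. Dividing genome size by gene count is a crude measure that ignores gene length, introns, and repeats.

```python
# Genome size (Mbp) and number of genes, copied from the "Other Species" table.
# The dictionary and function names are illustrative assumptions, not course code.
GENOMES = {
    "Epstein-Barr virus": (0.17, 80),
    "E. coli": (4.6, 4406),
    "Yeast (S. cerevisiae)": (12.5, 6172),
    "Nematode worm (C. elegans)": (100.3, 19099),
    "Thale cress (A. thaliana)": (115.4, 25498),
    "Fruit fly (D. melanogaster)": (128.3, 13601),
    "Human (H. sapiens)": (3223.0, 20500),
    "Fugu (Takifugu rubripes)": (390.0, 30000),
    "Wheat": (16000.0, 30000),
}


def kbp_per_gene(size_mbp, n_genes):
    """Average genome length per gene in Kbp (1 Mbp = 1000 Kbp)."""
    return size_mbp * 1000.0 / n_genes


def density_table(genomes):
    """Return (organism, Kbp per gene) pairs, most gene-dense organism first."""
    rows = [(name, kbp_per_gene(size, genes))
            for name, (size, genes) in genomes.items()]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for name, density in density_table(GENOMES):
        print(f"{name:30s} {density:10.2f} Kbp/gene")
```

With these numbers, E. coli comes out as the most gene-dense genome and wheat as the least, which matches the slides' point that prokaryotic genomes are densely coding while large eukaryotic genomes are mostly non-coding.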