Introduction to Biological Sequences
Sushmita Roy
www.biostat.wisc.edu/bmi576/
September 4, 2014
BMI/CS 576

Goals for today
• A few key concepts in molecular biology
  – Nucleic acids
  – Genes
  – Proteins
  – The Central Dogma
• Connection between DNA, RNA and proteins
• Problems in sequence similarity
  – Sequence alignment
  – Sequence search

A Living Cell
• The fundamental unit of life
• There are unicellular (one cell) and multi-cellular organisms
• A cell has different cellular components
• We will be concerned with
  – Nucleus
  – Ribosomes
  – Cytoplasm
• Prokaryotes (single-celled organisms lacking a nucleus)
• Eukaryotes (organisms with a nucleus)

[Figure: An animal cell. Source: http://www.genome.gov/Glossary/index.cfm?id=25]

Deoxyribonucleic acid (DNA)
[Figure: image from the DOE Human Genome Program, http://www.ornl.gov/hgmis]

DNA is a double helical molecule
[Photos: Watson and Crick, Maurice Wilkins, Rosalind Franklin]
• In 1953, James Watson and Francis Crick discovered that the DNA molecule has two strands arranged in a double helix
• This was possible through the X-ray diffraction data from Maurice Wilkins and Rosalind Franklin
http://www.chemheritage.org/discover/online-resources/chemistry-in-history/themes/biomolecules/dna/watson-crick-wilkins-franklin.aspx

Nucleotides
• DNA is composed of small chemical units called nucleotides
• Nucleotide
  – Nitrogen-containing base
  – 5-carbon sugar: deoxyribose
  – Phosphate group
  – Phosphate-hydroxy bonds connect the nucleotides
• Four nucleotides make DNA
  – adenine (A), cytosine (C), guanine (G) and thymine (T)
  – Each nucleotide differs in the base
[Figure labels: Phosphate, Base, Sugar, Hydroxy]

Bases in the nucleotides
• Purines (two rings): Adenine (A), Guanine (G)
• Pyrimidines (one ring): Thymine (T), Cytosine (C)

Nucleotides are linked to form one strand of DNA
[Figure: two nucleotides joined through a phosphate group, with the sugar carbons labeled 1' to 5' and the base attached to each sugar]

5' and 3' of a DNA molecule
• Each strand is made up of linkages between the 5' position (phosphate) on one nucleotide and the 3' position of the
following nucleotide
• At one end, there is a free phosphate group: the 5' end
• At the other end, there is a free OH group: the 3' end
• Therefore we can talk about directionality
  – the 5' and the 3' ends of a DNA strand

5' and 3' of a DNA molecule contd.
• A DNA sequence is read from 5' to 3'
• The two strands run anti-parallel to each other
  – One is the complement of the other
• For example, if AAG is the sequence on one strand, the sequence on the other strand (also read 5' to 3') is CTT
  – Not TTC

Watson-Crick base pairing
A always bonds to T
C always bonds to G
• This base pairing is also called "complementary base pairing"
• Each strand has a base sequence that is complementary to the sequence on the other strand
• If you know the sequence on one strand, you know the sequence on the other strand

DNA stores the blueprint of an organism
• The heredity molecule
• Has the information needed to make an organism
• Double-strandedness of the DNA molecule provides stability and prevents errors in copying
  – one strand has all the information
• DNA replication is the process by which this information is copied through generations of daughter cells

DNA replication
• Helicase, an enzyme, separates the double helix
• DNA polymerase makes a copy of each strand using free nucleotides
• Each strand of DNA serves as a template
[Figure, adapted from "Understanding Bioinformatics": the parent DNA double helix pairs strand A (CATTGCCCAGT) with strand B (GTAACGGGTCA). After replication there are two double helices: template strand A paired with a new strand B, and template strand B paired with a new strand A.]

Videos on DNA replication
https://www.youtube.com/watch?v=zdDkiRw1PdU
https://www.youtube.com/watch?
v=27TxKoFU2Nw

Chromosomes
• All the DNA of an organism is divided up into individual chromosomes
• Each chromosome is really a DNA molecule
• Different organisms have different numbers of chromosomes
[Figure: DNA's Double Helix, with labels for the cell nucleus and the adenine-thymine and guanine-cytosine base pairs. Image from www.genome.gov]
Figure caption: DNA molecules are found inside the cell's nucleus, tightly packed into chromosomes. Scientists use the term "double helix" to describe DNA's winding, two-stranded chemical structure. Alternating sugar and phosphate groups form the helix's two parallel strands, which run in opposite directions. Nitrogen bases on the two strands chemically pair together to form the interior, or the backbone of the helix. The base adenine (A) always pairs with thymine (T), while guanine (G) always pairs with cytosine (C).

Different organisms have different numbers of chromosomes
Organism       # of chromosomes
Yeast          32
Human          46
Fly            8
Mouse          40
Arabidopsis    10
Worm           12

Genes
• Genes are the units of heredity
• A gene is a sequence of bases which specifies a protein or RNA molecule
• The human genome has ~25,000 protein-coding genes (still being revised)
• One gene can have many functions
• One function can require many genes
…GTATGTCTAAGCCTGAATTCAGTCTGCTTTAAACGGCTTC…

Genomes
• Refers to the complete complement of DNA for a given species
• The human genome consists of 2 x 23 chromosomes
• Every cell (except egg and sperm cells and mature red blood cells) contains the complete genome of an organism

Some Greatest Hits
Genome                         Where                           Year
H. influenzae (bacteria)       TIGR                            1995
E. coli (K12)                  Wisconsin                       1997
S. cerevisiae (yeast)          International collab.           1997
C. elegans (worm)              Washington U./Sanger            1998
D. melanogaster (fruit fly)    Multiple groups                 2000
E. coli O157:H7 (pathogen)     Wisconsin                       2000
H. sapiens (humans)            International collab./Celera    2001
M. musculus (mouse)            International collab.           2002
R. norvegicus (rat)            International collab.           2004

Some Genome Sizes
Genome             # bases
HIV                9750
E. coli            4.6 million
S. cerevisiae      12 million
C. elegans         97 million
D. melanogaster    137 million
H. sapiens         3.1 billion

The central dogma of Molecular biology
DNA -> (Transcription) -> RNA -> (Translation) -> Proteins

RNA: Ribonucleic acid
• RNA
  – Made up of repeating nucleotides
  – The sugar is ribose
  – U is used in place of T
• A strand of RNA can be thought of as a string composed of the four letters: A, C, G, U
• RNA is single stranded
  – More flexible than DNA
  – Can double back and form loops
  – Such structures can be more stable

Transcription
• In eukaryotes: happens inside the nucleus
• RNA polymerase (RNA Pol) is an enzyme that builds an RNA strand from a gene
• RNA Pol is recruited to specific parts of the genome in a condition-specific way
• Transcription factor proteins are responsible for recruiting RNA Pol
• RNA that is transcribed from a protein-coding region is called messenger RNA (mRNA)
• The RNA string produced is identical to the non-template strand except that T is replaced by U

Translation
• Process of turning mRNA into proteins
• Happens outside of the nucleus, in the cytoplasm, on ribosomes
• Ribosomes are the machines that synthesize proteins from mRNA

Proteins
• Proteins are polymers too
• The repeating units are amino acids
• There are 20 different amino acids known
• DNA codes for protein
  – How many nucleotides are needed to specify 20 amino acids?

Amino Acids
Name             3-letter    1-letter
Alanine          Ala         A
Arginine         Arg         R
Aspartic Acid    Asp         D
Asparagine       Asn         N
Cysteine         Cys         C
Glutamic Acid    Glu         E
Glutamine        Gln         Q
Glycine          Gly         G
Histidine        His         H
Isoleucine       Ile         I
Leucine          Leu         L
Lysine           Lys         K
Methionine       Met         M
Phenylalanine    Phe         F
Proline          Pro         P
Serine           Ser         S
Threonine        Thr         T
Tryptophan       Trp         W
Tyrosine         Tyr         Y
Valine           Val         V

Codons
• Each triplet of bases is called a codon
• How many codons are possible?
• Some codons are special
  – One start codon, AUG: start of translation (AUG also codes for methionine)
  – Three stop codons (UAA, UAG, UGA): end of translation
• All other codons code for a particular amino acid

The Genetic Code: Specifies how mRNA is translated into protein
• The genetic code is degenerate: several codons can specify the same amino acid

Codons and Reading Frames
5' CUCAGCGUUACCAU 3'
Frame 1:   CUC AGC GUU ACC AU     Leu Ser Val Thr
Frame 2:   C UCA GCG UUA CCA U    Ser Ala Leu Pro
Frame 3:   CU CAG CGU UAC CAU     Gln Arg Tyr His

Proteins are the workhorses of the cell
• structural support
• transport of substances
• coordination of an organism's activities
• response of the cell to chemical stimuli
• protection against disease
• catalyzing chemical reactions

Proteins are complex molecules
• Primary amino acid sequence
• Secondary structure
• Tertiary structure
• Quaternary structure
• These structures are formed through different levels of protein folding and packaging

Some well-known proteins
• Hemoglobin: carries oxygen (http://en.wikipedia.org/wiki/Hemoglobin)
• Insulin: metabolism of sugar (http://en.wikipedia.org/wiki/Insulin)
• Actin: maintenance of cell structure (http://en.wikipedia.org/wiki/Actin)

Hemoglobin protein
HBA1 DNA sequence (491 bp)
>gi|224589807:226679-227520 Homo sapiens chromosome 16, GRCh37.p9 Primary Assembly
  1 CCCACAGACT CAGAGAGAAC CCACCATGGT GCTGTCTCCT GACGACAAGA CCAACGTCAA
 61 GGCCGCCTGG GGTAAGGTCG GCGCGCACGC TGGCGAGTAT GGTGCGGAGG CCCTGGAGAG
121 GATGTTCCTG TCCTTCCCCA CCACCAAGAC CTACTTCCCG CACTTCGACC TGAGCCACGG
181 CTCTGCCCAG GTTAAGGGCC ACGGCAAGAA GGTGGCCGAC GCGCTGACCA ACGCCGTGGC
241 GCACGTGGAC GACATGCCCA ACGCGCTGTC CGCCCTGAGC GACCTGCACG CGCACAAGCT
301 TCGGGTGGAC CCGGTCAACT TCAAGCTCCT AAGCCACTGC CTGCTGGTGA CCCTGGCCGC
361 CCACCTCCCC GCCGAGTTCA CCCCTGCGGT GCACGCCTCC CTGGACAAGT TCCTGGCTTC
421 TGTGAGCACC GTGCTGACCT CCAAATACCG TTAAGCTGGA GCCTCGGTGG CCATGCTTCT
481 TGCCCCTTTG G

Amino acid sequence (142 aa)
>sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens GN=HBA1 PE=1 SV=2
MVLSPADKTNVKAAWGKVGAHAG
EYGAEALERMFLSFPTTKTYFPHFDL
SHGSAQVKGHGKKVADALTNAVAH
VDDMPNALSALSDLHAHKLRVDPV
NFKLLSHCLLVTLAAHLPAEFTPAVH
ASLDKFLASVSTVLTSKYR

RNA genes
• Not all genes encode proteins
• For some genes the end product is RNA
  – ribosomal RNAs (rRNAs), which are major constituents of ribosomes
  – transfer RNAs (tRNAs), which carry amino acids to ribosomes
  – micro RNAs (miRNAs), which play an important regulatory role in various plants and animals
  – linc RNAs (long non-coding RNAs), which play important regulatory roles

RECAP
• Key components of a eukaryotic cell
  – Nucleus, Cytoplasm, Ribosome
• What are DNA and RNA?
  – Large molecules called polymers
  – Made up of repeated units: nucleotides
  – DNA: ATGC
  – RNA: AUGC
• What is a protein?
  – Also a polymer, but the units are amino acids
• The Central Dogma: DNA->RNA->protein
• Important processes
  – DNA replication, Transcription, Translation
• Some resources
  – http://www.genome.gov/Glossary/index.cfm
  – http://www.youtube.com/watch?v=41_Ne5mS2ls (a video on transcription and translation)

Things we did not talk about
• DNA packaging
• Alternative splicing
• Polyadenylation
• Post-translational modifications

A few important biological data/knowledge bases
• The 2014 Nucleic Acids Research Database issue reports 1,552 databases
• National Center for Biotechnology Information (NCBI)
  – http://www.ncbi.nlm.nih.gov
  – GenBank: Database of sequences
  – RefSeq: Reference sequences
• Ensembl
  – http://useast.ensembl.org/info/about/index.html
• UniProt: Protein sequence and protein function
• Protein Data Bank: Protein structure
• Pathway databases
  – Gene Ontology
  – KEGG
• Interaction databases
  – BioGRID
  – STRING
See also http://nar.oxfordjournals.org/content/42/D1/D1.full#T1

[Figure: Number of genomes in RefSeq. Source: http://www.ncbi.nlm.nih.gov/refseq/statistics/]

Sequence similarity
• Sequence similarity is central to addressing many questions in biology
  – Are two sequences related?
• Similarity in sequence can imply similarity in function.
  – Assign function to uncharacterized sequences based on characterized sequences
• Sequences from different species can be compared to estimate the evolutionary relationships between species
  – We will come back to this in Phylogenetic trees.

Overview of sequence similarity problems
• Assessing similarity between a small number of DNA or protein sequences
  – Pairwise sequence alignment
  – Multiple sequence alignment
• Searching databases for a query sequence
  – Heuristic search using BLAST

What is sequence alignment
The task of locating equivalent regions of two or more sequences to assess their overall similarity

A very simple alignment of two sequences (aligned/matched positions share a column)
T H I S S E Q U E N C E
T H A T S E Q U E N C E

How to align these two sequences?
T H I S S E Q U E N C E
T H A T I S A S E Q U E N C E
The problem arises when the sequences to be compared are of unequal length

How do sequences change?
• Sequences change through mutations
  – substitutions: ACGA -> AGGA
  – insertions: ACGA -> ACGGA
  – deletions: ACGA -> AGA

Need to incorporate gaps while aligning sequences
Alignment 1: 3 gaps, 9 matches
_ _ _ T H I S S E Q U E N C E
T H A T I S A S E Q U E N C E
Alignment 2: 3 gaps, 10 matches
T H I S _ _ _ S E Q U E N C E
T H A T I S A S E Q U E N C E

Issues in sequence alignment
• What type of alignment?
  – Align the entire sequence or part of it?
  – Two sequences or multiple sequences?
• How to find the alignment?
  – Search algorithms for alignment
• How to score an alignment?
  – the sequences we're comparing typically differ in length
  – some characters (nucleotides or amino acids) are more substitutable than others
• How to tell if the alignment is biologically meaningful?
  – Assessing how likely it is that the alignment could have happened by random chance

Algorithms for alignment
• Pairwise alignment algorithms based on dynamic programming
  – Global alignment
  – Local alignment
• Multiple sequence alignment
  – Progressive/guide-tree based approaches
  – Iterative alignments
• BLAST
  – Searching a query sequence in a database of sequences with efficient pre-processing

Scoring alignments
• Percent identity
• Substitution matrices of amino acids
  – Genuine matches may not be identical
  – PAM, BLOSUM50 matrices
• Gap penalty functions

Reading assignment for Sep 9th
• Chapter 2, Sections 2.1-2.3, from Textbook: Biological Sequence Analysis
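
Worked example: strands, transcription and reading frames
The rules from the slides on base pairing, transcription and codons can be written as a few string operations. This is a minimal Python sketch, not course code. All function names are hypothetical. The codon table is deliberately partial and holds only the twelve codons that appear in the reading-frame example, so it is not the full genetic code.

```python
from itertools import product

# Watson-Crick pairing: A with T, C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Partial codon table (assumption: only the codons from the
# "Codons and Reading Frames" example are included).
SLIDE_CODONS = {
    "CUC": "Leu", "AGC": "Ser", "GUU": "Val", "ACC": "Thr",
    "UCA": "Ser", "GCG": "Ala", "UUA": "Leu", "CCA": "Pro",
    "CAG": "Gln", "CGU": "Arg", "UAC": "Tyr", "CAU": "His",
}


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(non_template: str) -> str:
    """The mRNA equals the non-template strand with T replaced by U."""
    return non_template.replace("T", "U")


def all_codons() -> list[str]:
    """Enumerate every triplet over the RNA alphabet (4 * 4 * 4 of them)."""
    return ["".join(triplet) for triplet in product("ACGU", repeat=3)]


def reading_frames(rna: str) -> list[list[str]]:
    """Split an RNA string into complete codons for offsets 0, 1 and 2."""
    frames = []
    for offset in range(3):
        # Stop early enough that every slice has exactly three bases.
        frames.append([rna[i:i + 3] for i in range(offset, len(rna) - 2, 3)])
    return frames


def translate(codons: list[str]) -> list[str]:
    """Look up each codon in the partial table above."""
    return [SLIDE_CODONS[codon] for codon in codons]
```

Calling reverse_complement("AAG") gives CTT, matching the 5'/3' slide, and len(all_codons()) answers the question on the Codons slide. Three bases per codon are needed because two bases give only 16 combinations, fewer than the 20 amino acids.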
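
Worked example: scoring a given alignment
The "Scoring alignments" slide lists percent identity and gap penalties without formulas. The sketch below shows one plausible way to compute both for an alignment that has already been written out as two equal-length rows. The function names, the "-" gap symbol, the choice of alignment length as the denominator, and the default match/mismatch/gap values are all assumptions made for illustration. Other conventions exist, and amino acid scoring would normally use a substitution matrix such as PAM or BLOSUM50 instead of a single mismatch value.

```python
GAP = "-"


def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage of alignment columns holding two identical characters.

    Assumption: the denominator is the alignment length, gap columns included.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != GAP)
    return 100.0 * matches / len(row_a)


def alignment_score(row_a: str, row_b: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum a per-column score with a linear gap penalty (illustrative values)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == GAP or b == GAP:
            total += gap          # one sequence has no character in this column
        elif a == b:
            total += match
        else:
            total += mismatch
    return total
```

For the simple alignment of THISSEQUENCE with THATSEQUENCE, ten of the twelve columns match, so the identity is about 83%. For the deletion example ACGA -> AGA, written as the rows ACGA and A-GA, three of four columns match.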
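
Worked example: global alignment by dynamic programming
The "Algorithms for alignment" slide names dynamic programming for pairwise global alignment but does not give the recurrence. The sketch below is a standard Needleman-Wunsch style formulation with a linear gap penalty, offered as an illustration under assumed scores (match +1, mismatch -1, gap -1) and not as the course's reference implementation. The function name global_align is hypothetical. When several alignments tie for the best score, the traceback returns just one of them.

```python
def global_align(a: str, b: str,
                 match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (best score, aligned row for a, aligned row for b)."""
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Score for placing a[i-1] and b[j-1] in the same column.
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))
```

The table has (n+1) x (m+1) cells and each cell is filled in constant time, so the running time grows with the product of the two sequence lengths. That cost is one reason heuristic tools such as BLAST are used for database search.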