Bioinformatics and modelling

Biological sequences & Challenges for the Bioinformatician

The tree of life

All living organisms (on Earth) use 3 major macromolecules: DNA, RNA, and proteins.

DNA - RNA - proteins

  DNA       CCTGAGCCAACTATTGATGAA
    |  transcription (subject to transcriptional regulation)
  mRNA      CCUGAGCCAACUAUUGAUGAA
    |  translation
  Protein   PEPTIDE

The main role of DNA is information storage and transmission.
The main role of proteins is to direct the processes on which life depends (energy metabolism, biosynthesis, inter-cellular communication).

"A glance at the chain of command in the cell reveals DNA as the boss, running the show by its coded instructions, getting RNA to do all the fetching and carrying work, and telling the ribosomes what proteins to make next. The proteins have a completely servile role, but are the real workers."
P. Davies (2003) The origin of life, Penguin, UK, p. 105

DNA

DNA (deoxyribonucleic acid)

A DNA sequence is a linear biopolymer of nucleotides. A nucleotide is composed of 3 main parts: a base, a pentose sugar, and a phosphate group (fig. A). There are 4 bases, separated into 2 groups: purines (adenine, guanine) and pyrimidines (cytosine, thymine) (fig. B). In DNA, nucleotides are linked by phosphodiester bonds (fig. C).

DNA structure

X-ray diffraction of DNA: Rosalind Franklin.
James Watson & Francis Crick (Nobel Prize, 1962). Watson & Crick (1953) Nature 171:737-8.

In cells, DNA exists mainly as a two-strand coiled structure (double helix). The two strands are held together by hydrogen bonds between the bases. The bases are located inside the helix, the phosphate-linked sugars forming a backbone on the outside. The base pairing is complementary:
- A can only bind T: A = T (2 hydrogen bonds),
- G can only bind C: G ≡ C (3 hydrogen bonds).

Chargaff's rules

Chargaff studied the nucleotide composition of each strand of the DNA and found two rules. The 2 strands of DNA are sometimes called watson (w) and crick (c); below, the letters represent the molar fraction of a base on one strand.

Chargaff's first rule:
  A_c = T_w,  A_w = T_c,  C_c = G_w,  C_w = G_c
Chargaff's first rule expresses the fact that double-stranded DNA obeys Watson-Crick base pairing.

Chargaff's second rule (much less understood):
  A_c ≈ T_c,  A_w ≈ T_w,  C_c ≈ G_c,  C_w ≈ G_w
This parity is observed in most genomes (eukaryotic chromosomes, bacteria, viruses, but not in mitochondria and plasmids).

Chargaff (1951) Some recent studies on the composition and structure of nucleic acids. J Cell Physiol Suppl 38:41-59
Erwin Chargaff (1905-2002)

DNA packing

In cells, DNA is packed into a compact structure thanks to specialized proteins called histones. "Chromatin" usually refers to the DNA / histone complex.
The fundamental packing unit is known as a nucleosome. Each nucleosome is about 11 nm in diameter. The DNA double helix wraps around a central core of 8 histone protein molecules (an octamer) to form a single nucleosome. Additional histone proteins fasten the DNA to the nucleosome core.
Nucleosomes are usually packed together, with the aid of yet another histone, to form a 30 nm fiber.
As a 30 nm fiber, the typical human chromosome would be about 0.1 cm in length and would span the nucleus 100 times. This suggests higher orders of packaging, which give a chromosome its compact structure.

DNA, chromosomes, and genome

The main role of DNA is information storage. It is transmitted from generation to generation: all the information required to make and maintain a new organism is stored in its DNA. The information required to reproduce even very complex organisms is stored on a relatively small number of DNA molecules (the chromosomes). This set of molecules is called the organism's genome.
In humans, there are 46 DNA molecules in each cell, organised into chromosomes. In bacteria, there is often a single, circular chromosome, but also nonchromosomal DNA molecules called plasmids.

Genome size

  Organism                    Year   Size (Mb)
  Mycoplasma genitalium       1995   0.6
  Haemophilus influenzae      1995   1.8
  Escherichia coli            1997   4.6
  Saccharomyces cerevisiae    1996   12
  Schizosaccharomyces pombe   2002   14
  Caenorhabditis elegans      1998   97
  Arabidopsis thaliana        2001   120
  Oryza sativa                2002   5,000
  Drosophila melanogaster     2000   180
  Gallus gallus               2004   1,200
  Rattus norvegicus           2004   2,900
  Mus musculus                2002   3,400
  Homo sapiens                2001   3,400

1 Mb = 1,000,000 bases
Sources: Jacques van Helden + GenBank
See also: Database Of Genome Sizes http://www.cbs.dtu.dk/databases/DOGS/ and http://www.genomesize.com/

(Figure: comparison of genome sizes, in bp. Source: Cann (1997) Principles of Molecular Virology, Academic Press)

How many books or CDs do I need to "write" the human genome?

If the sequence obtained was to be stored in book form, and if each page contained 1000 base-pairs recorded and each book contained 1000 pages, then 3300 such books would be needed in order to store the complete genome. However, if expressed in units of computer data storage, 3.3 billion base-pairs recorded at 2 bits per pair would equal 786 megabytes of raw data (825 million bytes, counting one megabyte as 2^20 bytes). This is comparable to a fully data loaded CD.
Source: http://en.wikipedia.org/wiki/Human_Genome_Project

(Figure: the first printout of the human genome to be presented as a series of books, displayed at the Wellcome Collection, London)

Challenge for the bioinformatician

Is the nucleotide sequence random? Or are there some "preferences" in the choice and in the order of the nucleotides?
Are those "preferences" related to biological functions or structure?
If yes, could we predict the biological function or structure of DNA on the basis of the sequence?
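As a first, deliberately naive probe of the question "is the nucleotide sequence random?", one can count how often each base occurs. The sketch below is an illustration added here, not part of the original course material; the function names `base_frequencies` and `gc_content` are hypothetical, and characters other than A, C, G, T are simply ignored (an assumption made for brevity).

```python
from collections import Counter


def base_frequencies(sequence):
    """Return the relative frequency of A, C, G and T in a DNA string.

    Characters other than A/C/G/T (e.g. N) are ignored.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def gc_content(sequence):
    """Return the fraction of G and C among the A/C/G/T bases."""
    freqs = base_frequencies(sequence)
    return freqs["G"] + freqs["C"]


if __name__ == "__main__":
    example = "CCTGAGCCAACTATTGATGAA"  # short DNA example used earlier in the slides
    print(base_frequencies(example))
    print(round(gc_content(example), 3))
```

Under a uniform random model each base would have frequency 0.25; a strong departure from that on a long sequence is one of the simplest "preferences" to look for, although a single short sequence such as the one above proves nothing by itself.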
Sequence comparison (pairwise alignments, multiple alignments, database search, motif search, etc.) and sequence statistics (based on symmetries, GC content, motif occurrence statistics, entropy, correlation, etc.) will help bioinformaticians to answer these questions.

DNA representation

The complementarity between base pairs (A = T and G ≡ C) implies that if you know one sequence you can deduce the complementary sequence. It is common to represent DNA sequences by 4-letter strings:

  TGCTAATGCCGCTACTCTATCTGC

By convention, we write sequences from the 5' end to the 3' end.

  5'- TGCTAATGCCGCTACTCTATCTGC - 3'

Source: Jacques van Helden

Don't forget the second strand!

When we analyze a DNA sequence represented by ATGCGCGGATG, we should keep in mind that the corresponding molecule is a double-stranded helix with the following base pairs:

  5' - ATGCGCGGATG - 3'   (upper strand)
       |||||||||||
  3' - TACGCGCCTAC - 5'   (lower strand)

Note that the concepts of upper strand and lower strand are purely artificial. A DNA molecule is a 3D structure and there is no reason to consider one strand preferentially over the other. This does not mean, however, that the two strands are functionally equivalent: in coding regions for example, only one strand will serve as a template for the synthesis of RNA.
Source: Jacques van Helden

Reverse complementarity

Reverse complementary sequences represent the two strands of the same DNA molecule. The reverse complement is obtained by transposing each nucleotide into its complementary nucleotide (A → T, T → A, C → G, G → C), and then reversing the string.
For example, the sequences ATGCGCGGATG and CATCCGCGCAT are mutually reverse complementary. These strings describe the two strands of the same DNA molecule.
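The two-step recipe above (complement each nucleotide, then reverse the string) can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `reverse_complement` is chosen here for convenience, and only the four unambiguous bases (in upper or lower case) are handled.

```python
# Translation table mapping each base to its Watson-Crick partner.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA string written 5' to 3'."""
    # Step 1: complement every base; step 2: reverse the resulting string.
    return sequence.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ATGCGCGGATG"))  # prints CATCCGCGCAT
```

Applying the function twice returns the original string, which is another way of saying that the two strings describe the same double-stranded molecule.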
Consequently, the two following double-strand schemes represent the same molecule:

  5'- ATGCGCGGATG - 3'        5'- CATCCGCGCAT - 3'
      |||||||||||                 |||||||||||
  3'- TACGCGCCTAC - 5'        3'- GTAGGCGCGTA - 5'

Source: Jacques van Helden

Symmetries in DNA sequences

Tandem repeat

  GATAAGATAAGATAAGATAA = 2 x GATAAGATAA = 4 x GATAA
  GATAAGATAAatgtagGATAAGATAA = 2 x GATAAGATAA separated by a non-repeated sequence.

Tandem repeats are presumed to occur frequently in genomic sequences, comprising perhaps 10% or more of the human genome (Benson, NAR 27:573, 1999).
Tandem repeats are sometimes associated with a repeated structure in a protein (e.g. some ABC transporters have been shown to contain a tandem repeat of six transmembrane helices, Tusnady et al, FEBS Lett 402:1-3, 1997).
In recent years, short tandem repeat polymorphisms have been found to be involved in various diseases (e.g. cancer, Huntington's disease, Parkinson's disease, ...; Zhang & Yu, Eur J Surg Oncol 33:529-34, 2007).
(Figure: ATP-binding cassette (ABC) transporters)
Source: Jacques van Helden

Textual palindromes

  ATGGCCGGTA = ATGGC|CGGTA

Note that the corresponding DNA molecule does not contain any axis of symmetry, since in 3D space a nucleotide cannot be superimposed on its own mirror image. Therefore, searching for textual palindromes is not relevant for detecting biological features.

  5'- ATGGC - 3'  ≠  5'- CGGTA - 3'

Source: Jacques van Helden

Reverse complementary palindromes

A reverse complementary palindrome is a sequence identical to its reverse complement.
Example: ATGGGGCCCCAT
Reverse complementary palindromes correspond to 3D symmetries in DNA molecules. In the following 2-strand representation, a 180° rotation around the center would swap the two strands, and each letter would take the place of an identical letter on the complementary strand.

  5'- ATGGGG CCCCAT - 3'
      ||||||.||||||
  3'- TACCCC GGGGTA - 5'

Note that this sequence is not a textual palindrome.
Note also that the two halves of a reverse complementary palindrome can be separated by a stretch of nonsymmetrical nucleotides.
Source: Jacques van Helden

Reverse complementary motifs play important roles in biological mechanisms.

Example 1: some classes of transcription factors (e.g. helix-turn-helix) typically form homodimers whose three-dimensional structure is symmetrical. These protein complexes specifically recognize reverse complementary motifs in gene promoters.

  cAMP Receptor Protein (CRP): TGTGA-N6-TCACA

Example 2: in bacteria, hexamers with a reverse complementary palindromic structure also play an essential role as recognition sites for restriction enzymes.

  5'- ---AAGCTT--- - 3'
  3'- ---TTCGAA--- - 5'

The restriction enzyme HindIII specifically cuts DNA at instances of AAGCTT.

Example 3: reverse complementary motifs separated by a stretch of nucleotides are frequent in RNA, where they mediate the pairing between distant segments of the molecule.

  5'- UCGGGcucauaaCCCGA - 3'

After folding, this sequence forms a stem-loop structure:

        c a u
      u       a
      c       a
       G  -  C
       G  -  C
       G  -  C
       C  -  G
       U  -  A
       |     |
       5'    3'

Source: Jacques van Helden

RNA

RNA (ribonucleic acid)

An RNA sequence is also a linear biopolymer of nucleotides, but RNA nucleotides differ from DNA nucleotides in 2 features: (1) the sugar carries one additional hydroxyl (OH) group (ribose instead of deoxyribose), (2) RNA contains the base uracil instead of thymine.

RNA is synthesized using DNA as a template, with a one-to-one correspondence:

  DNA (template)   RNA (synthesized)
  A                U
  C                G
  G                C
  T                A

Thus, it is possible to deduce the RNA sequence (that will be synthesized) from the (template) DNA sequence:

  DNA   CTGCTAGCAAGATCTG   (template)
  RNA   GACGAUCGUUCUAGAC   (synthesized)

Roles of RNA

RNA molecules have multiple roles, mainly related to the transfer of information from DNA to protein.
They can be classified into several types:
- messenger RNA (mRNA): their role is to mediate the synthesis of proteins from the DNA (genes). They are synthesized during transcription and are used during translation as a template to build proteins.
- transfer RNA (tRNA): amino acids do not recognise RNA codons directly. The role of tRNA is to bring the right amino acid to the growing polypeptide chain.
- ribosomal RNA (rRNA): they are structural and catalytic components of the ribosomes.
- microRNA (miRNA): recently discovered, miRNAs have been found to have multiple roles, including regulation of gene expression and protein activity.

RNA structure

Generally, RNA does not form a double helix, but contains base-paired regions and loops (we refer to these structures as the secondary structure of RNA, the primary structure being its ribonucleotide sequence).
(Figure: Tetrahymena ribozyme)

Challenge for the bioinformatician

One of the challenges for the bioinformatician is to predict the secondary structure of RNA on the basis of its primary structure.
One approach is based on primary sequence analysis. The idea is to find which parts of the sequence are complementary and would therefore be able to pair. Another approach relies on minimum energy computation.
Note that this topic will not be covered in this course. For more details, see Mount (2004) Bioinformatics: Sequence and Genome Analysis (Chapter 8 - prediction of RNA secondary structure).

PROTEINS

Proteins

A protein (or, more generally, a polypeptide) is a biopolymer of amino acids (aa). An amino acid contains an amine group (NH2), a carboxyl group (COOH), and a side chain (usually denoted by R for "residue"). Amino acids differ by their residue. In natural proteins, 20 residues have been identified. Amino acids are linked by a peptide bond between the amine and the carboxyl groups.

Here are the 20 amino acids. The residues are highlighted in red.
  Glycine        G  Gly      Alanine        A  Ala
  Valine         V  Val      Leucine        L  Leu
  Methionine     M  Met      Isoleucine     I  Ile
  Serine         S  Ser      Threonine      T  Thr
  Cysteine       C  Cys      Proline        P  Pro
  Asparagine     N  Asn      Glutamine      Q  Gln
  Phenylalanine  F  Phe      Tyrosine       Y  Tyr
  Tryptophan     W  Trp      Lysine         K  Lys
  Arginine       R  Arg      Histidine      H  His
  Aspartate      D  Asp      Glutamate      E  Glu

The amino acids can be classified into different groups:
- Charged: (+) Arg, His, Lys; (-) Asp, Glu
- Polar (uncharged): Ser, Thr, Asn, Gln, Tyr
- Nonpolar (hydrophobic): Ala, Ile, Leu, Met, Phe, Trp, Val
- Others: Cys, Gly, Pro

Roles of proteins

Proteins have multiple roles:
- They catalyze most of the biochemical reactions (enzymes).
- They regulate gene expression (transcription factors).
- They play important roles in cellular structure and motion (cytoskeleton, channels in membranes).
- They are involved in signalling pathways (hormones, receptors).
- They are involved in the immune system (antibodies).
- They act as transporters (hemoglobin, myoglobin).
- ...

Protein structure

- Primary structure: amino acid sequence.
- Secondary structure: alpha helices and beta sheets.
- Tertiary structure: 3D structure of a single protein molecule (one chain).
- Quaternary structure: complex of several protein molecules or polypeptide chains (called protein subunits).

Example: myoglobin, amino acid sequence (primary structure):

  GLSDGEWQLVLNVWGKVEADIP
  GHGQEVLIRLFKGHPETLEKFD
  KFKHLKSEDEMKASEDLKKHGA
  TVLTALGGILKKKGHHEAEIKP
  LAQSHATKHKIPVKYLEFISEC
  IIQVLQSKHPGDFGADAQGAMN
  KALELFRKDMASNYKELGFQG

(Figure: myoglobin α helix (secondary structure) and 3D folding (tertiary structure))
(Figure: chain B of Protein Kinase C, and quaternary structure of Protein Kinase C)

The two main types of secondary structure are α helices and β sheets.
(Figure: anti-parallel β sheet and α helix, showing the amino acid subunits)

Turns, hairpins and loops

A third type of secondary structure is the β turn. These are short regions where the protein chain takes a 180° change in direction, doubling back on itself.
Such hairpin turns are found, for example, between two adjacent β strands. The side chain R3 is usually H (glycine).
The remainder of the protein structure has much less order, and can be viewed as simply the connecting pieces (loops) that allow the α helices and β sheets to pack.

Protein folding

Protein chains themselves rarely have a biological function. It is only when the chain has folded into a three-dimensional structure that the protein has functional activity.
In the folded form, some distant residues can come close to each other.
Many proteins fold up into several discrete structural units, each of which is termed a protein domain. Each domain is associated with a specific biochemical or binding function.

Challenge for the bioinformatician

One challenge for the bioinformatician will be to predict the structure, function and organisation (protein-protein interactions, protein complexes) of a protein on the basis, for example, of its amino acid sequence (or, sometimes, of the coding DNA sequence!).
Sequence analysis is an alternative to 3D modeling to predict secondary structure and to detect functional domains. Due to the various properties of the amino acid side chains, certain residues are found more often in one or the other type of structural unit. Some residues have been classified, for example, as α-helix breakers. Proline, for example, is a poor helix former because its backbone N atom is already bound to its own side chain and cannot form H-bonds within the helix. Good α-helix formers are Ala, Glu, Leu, and Met, whereas good β-strand formers are Val, Ile, Tyr, and Cys. These types of preferences have been used to predict secondary structure on the basis of amino acid composition.
A second approach is to make use of evolutionary relations: proteins that have a common ancestor are said to be homologous. Sequence alignment and database searching can identify homologous proteins. Such homologs often (but not always) share a common structure and function.
Structure and function can thus be inferred from known proteins.

DNA - RNA - PROTEIN

Central dogma

In 1958, Francis Crick formulated his famous “central dogma”:

"The central dogma states that once "information" has passed into protein it cannot get out again. In more detail, the transfer of information from nucleic acid to nucleic acid, or from nucleic acid to protein may be possible, but transfer from protein to protein, or from protein to nucleic acid is impossible. Information means here the precise determination of sequence, either of bases in the nucleic acid or of amino acid residues in the protein."
Quoted from Crick (1958).

Crick FH (1958) On protein synthesis. Symp Soc Exp Biol 12:138-63.
Crick F (1970) Central dogma of molecular biology. Nature 227:561-3.
Source: Jacques van Helden

This “central dogma” is often summarized by the following sentence:

  DNA makes RNA makes protein

Note that what Crick called the “central dogma” is in no way a dogma. On the contrary, what he proposed fully deserves to be called a “scientific theory”.
This formulation is admirably clear and seems to be as valid today as on the day it was first formulated. Even prions are no exception to this dogma, since Crick defines information as “the precise determination of sequence”, and not as the conformation that a protein might take under particular conditions.
The central idea of Crick’s 1958 paper has been misunderstood so often that Crick himself felt it necessary to write a clarification 12 years later (Crick, 1970).
Source: Jacques van Helden

DNA is transcribed into mRNA, which is in turn translated into protein molecules. Knowing the DNA sequence, it is in principle possible to deduce the mRNA sequence and the amino acid sequence of the corresponding protein.
However, because the genetic code is degenerate (several codons can code for the same amino acid), the reverse is not possible: we cannot deduce the actual DNA sequence coding for a given protein.

Replication

Replication: when the cell divides, the whole genome needs to be replicated (each daughter cell must receive the full DNA content). In addition to DNA polymerase (which synthesizes new DNA by polymerization), several additional enzymes are needed:
- DNA topoisomerase: to untwist the DNA
- DNA helicase: to separate the 2 strands
- DNA ligase: to join the newly synthesized DNA fragments

Replication fork

DNA is first untwisted, and the 2 separated strands then serve as templates to synthesize new double-stranded DNA.
Each DNA strand is read from 3' to 5' and the complementary strand is synthesized from 5' to 3'.
This leads to an asymmetry: on one strand (the leading strand) the complementary strand is synthesized continuously in the direction of the DNA opening, while on the other strand (the lagging strand) the complementary strand is synthesized segment by segment (Okazaki fragments).

Gene

A gene is a portion of DNA that codes for a protein. In practice, the gene is sometimes considered also to include the surrounding regions of non-coding DNA that act as control regions.
(Figure: a gene on the DNA with its control region and reverse complement strand, the direction of RNA synthesis, the mRNA with its 5' and 3' ends, and the protein)

Transcription

From DNA to RNA: transcription

One strand of the DNA is involved in the synthesis of RNA. Note that the RNA synthesized is complementary to the template DNA.
DNA transcription is carried out by the RNA polymerase. RNA polymerase is a large protein complex which reads the DNA, recruits the correct ribonucleotides, and joins them together.
Remark: RNA polymerase reads the DNA template strand (from 3' to 5'), which is complementary to the coding strand. RNA is thus synthesized from 5' to 3'.
Its sequence is complementary to the template and identical to the coding strand (except that it is composed of ribonucleotides and that thymine is replaced by uracil).

mRNA organisation

5' cap: the 5' cap is a modified guanine nucleotide added to the "front" (5' end) of the pre-mRNA using a 5'-5' triphosphate linkage. This modification is critical for recognition and proper attachment of the mRNA to the ribosome, as well as for protection from 5' exonucleases.

Coding regions (CDS): coding regions are composed of codons, which are decoded and translated into proteins by the ribosome. Coding regions begin with the start codon (see later) and end with one of the three possible stop codons. In addition to coding for protein, portions of coding regions may also serve as regulatory sequences.

Untranslated regions (UTR): untranslated regions are sections of the RNA before the start codon and after the stop codon that are not translated, termed the 5' untranslated region (5' UTR) and the 3' untranslated region (3' UTR). Several roles in gene expression have been attributed to the untranslated regions, including mRNA stability, mRNA localization, and translational efficiency.

3' poly(A) tail (polyadenylation): the 3' poly(A) tail is a long sequence of adenine nucleotides (often several hundred) added to the 3' end of the pre-mRNA through the action of an enzyme, polyadenylate polymerase. Interestingly, in higher eukaryotes, the poly(A) tail is added onto transcripts that contain a specific sequence, the AAUAAA signal.

Translation

From RNA to protein: translation

RNA translation is carried out by the ribosomes. A ribosome is a large complex of proteins and rRNA (composed of a large subunit and a small subunit) which reads the messenger RNA (mRNA), recruits the correct amino acids, and joins them together.

Genetic code

Gamow's diamond code

George Gamow (one of the proponents of the Big Bang theory!)
In Gamow's proposal (1954), which he called the diamond code, double-stranded DNA acted directly as a template for assembling amino acids into proteins. As Gamow saw it, the various combinations of bases along one of the grooves in the double helix could form distinctively shaped cavities into which the side chains of amino acids might fit. Each cavity would attract a specific amino acid; when all the amino acids were lined up in the correct order along the groove, an enzyme would come along to polymerize them.
Each of Gamow's cavities was bounded by the bases at the four corners of a diamond. If the DNA helix is oriented vertically, the bases at the top and bottom corners of a diamond are on the same strand and are separated by a single intervening base; the left and right corners of the diamond are defined by that intervening base and by its complementary partner on the opposite strand.
Source: http://www.americanscientist.org

The forgotten code cracker...

Marshall Nirenberg (Nobel Prize 1968)
Nirenberg M, Leder P (1964) RNA codewords and protein synthesis: the effect of trinucleotides upon the binding of sRNA to ribosomes. Science 145:1399-1407.
Nirenberg M, Leder P, Bernfield M, Brimacombe R, Trupin J, Rottman F, O'Neal C (1965) RNA codewords and protein synthesis, VII. On the general nature of the RNA code. Proc Natl Acad Sci USA 53:1161-1168.
Source: E. Regis, Sci. Am., Nov. 2007

Genetic code

Three successive nucleotides (called a codon) code for one amino acid. The correspondence between the codons and the amino acids constitutes the genetic code.

Reading frame

The codons do not overlap. On one strand of DNA, there are thus three possible reading frames. Each frame would code for a different amino acid sequence.

START and STOP signals

The start and end of translation are marked by specific start (AUG) and stop (UAA, UAG, UGA) codons, but that's not all...
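The notions of codon, genetic code, and reading frame introduced above can be made concrete with a short sketch. It is an illustration added to these notes, not code from the course: the names `codons` and `translate` are hypothetical, the table is the standard genetic code written in the DNA alphabet (U is converted to T on input), and `*` stands for a stop codon. Sequences containing other letters (such as N) would raise a `KeyError` in this simplified version.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each of the 3 codon positions.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons(sequence, frame=0):
    """Split a sequence into non-overlapping codons, starting at offset `frame`."""
    seq = sequence.upper().replace("U", "T")
    # An incomplete codon at the end of the sequence is dropped.
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def translate(sequence, frame=0):
    """Translate one reading frame; '*' marks a stop codon."""
    return "".join(CODON_TABLE[codon] for codon in codons(sequence, frame))


if __name__ == "__main__":
    dna = "CCTGAGCCAACTATTGATGAA"  # coding-strand example from the first slides
    for frame in range(3):
        print(frame, translate(dna, frame))
    # prints:
    # 0 PEPTIDE
    # 1 LSQLLM
    # 2 *ANY**
```

The three lines of output show why the choice of frame matters: the same string of nucleotides yields three unrelated amino acid sequences, and only frame 0 reproduces the PEPTIDE example. Note that the sketch translates every codon of a frame and does not stop at the first stop codon.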
Prokaryotes
- Transcription START: consensus sequences at position -10 (TATAAT) and at position -35 (TTGACA).
- Transcription STOP signal: 2 short stretches of complementary sequence that can base-pair to form an RNA double helix, usually involving several CG base pairs.

Eukaryotes
- Transcription START: consensus sequence at position -25, the TATA box (recognized by a TATA-binding protein, TBP).
- Transcription STOP signal: the AAUAAA sequence results in the cleavage of the 3' end of the transcript some 10-30 bases after the signal.

Open reading frame (ORF)

An open reading frame (ORF) is a part of DNA which contains a sequence that could potentially code for a protein. It is usually a long portion of DNA sequence starting with a start codon and not interrupted by a stop codon.
The detection of a long ORF is usually a good indication of the presence of a gene, but additional information, such as the codon bias, might also be used in order to support the prediction.
Since the start and stop ends of the ORF are not equivalent to the ends of the mRNA, a typical ORF finder will employ algorithms based on existing genetic codes, codon usage, and all possible reading frames.
Additional difficulties may arise in eukaryotes, where long parts of the DNA within a gene are not translated into the protein (introns).
Short ORFs can also occur by chance outside of a gene. Usually such ORFs are not very long and terminate after a few codons. In a "random" sequence, 3 of the 64 possible codons are stop codons, so on average we will find about 1 stop codon every 21 codons.
A typical human gene is 10-15 kb long. Thus, we do not expect to find a stop codon in such a 3,300-5,000 codon long sequence. However...

Introns and exons

Exons are the parts of the DNA which will be translated into the protein. Introns are the parts of the DNA that will not be translated into the protein. Introns are nevertheless present in the primary transcript (pre-mRNA), but are subsequently removed by a process called splicing.
Typically, the introns are spliced out by a two-stage process: (1) the pre-mRNA forms a loop structure (called a lariat) involving an adenine base, (2) the two exons are then joined and the intron is released.

Gene length

Human genes vary enormously in size and exon content.
(Figure: exon content shown as a percentage of the length of the indicated genes.)
Source: Strachan & Read, Human molecular genetics, BIOS 1999
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=hmg

Overlapping genes

Overlapping genes are defined as a pair of adjacent genes whose coding regions partially overlap. In other words, a single stretch of DNA codes for portions of two separate proteins.
Such an arrangement of genetic code is ubiquitous. Many overlapping genes have been identified in the genomes of prokaryotes, eukaryotes, mitochondria, and viruses.
One consequence of overlapping genes is to reduce the tolerance for mutation. It was shown that overlapping reduces the probability of accumulating so-called neutral mutations in a gene (mutations that have no effect). Neutral mutations are unlikely in overlapping genes, because the mutation must have no effect on two genes with different reading frames.
Veeramachaneni et al (2004) identified 1316 pairs of overlapping genes in humans and mice.
(Figure: example of overlap between 3 human genes: MUTH, FLJ13949, and TESK2. Dark green boxes represent coding sequences. Light green boxes represent untranslated regions.)
Source: Veeramachaneni et al (2004) Mammalian overlapping genes: the comparative perspective. Genome Res 14:280-286.

Operons

In prokaryotes, several genes can be under the control of a single promoter. Each of these genes nevertheless has its own start codon and ribosome binding site. Such an organisation is called an operon.
Examples (from E. coli):
- The lac operon: promoter - lacZ - lacY - lacA
- The trp operon: promoter - trpE - trpD - trpC - trpB - trpA

Coding vs non-coding DNA

Most of the DNA does not code for protein...
  Organism                   Year   Genome size (Mb)   Number of genes   Gene spacing (kb)   Coding (%)   Non-coding (%)
  Mycoplasma genitalium      1995   0.6                481               1.2                 90           10
  Haemophilus influenzae     1995   1.8                1,717             1.0                 86           14
  Escherichia coli           1997   4.6                4,289             1.1                 87           13
  Saccharomyces cerevisiae   1996   12                 6,286             1.9                 72           28
  Arabidopsis thaliana       2001   120                27,000            4.4                 30           70
  Caenorhabditis elegans     1998   97                 19,000            5.1                 27           73
  Drosophila melanogaster    2000   165                16,000            10.3                15           85
  Mus musculus               2002   3,400              -                 -                   -            -
  Homo sapiens               2001   3,400              30,000            103.2               3            97

Although non-coding DNA sequences are sometimes referred to as junk DNA, they contain many signals that allow a proper regulation of gene expression. Also, note that part of the DNA is transcribed into RNA which is subsequently not translated into protein (cf. tRNA, rRNA, siRNA, but also introns).

Challenge for the bioinformatician

One challenge for the bioinformatician is to predict coding and non-coding genes from raw (genomic) sequences. This question has become particularly important with the advent of large-scale sequencing experiments.
Gene prediction is not a trivial task. A long ORF does not always mean that a gene is encoded: long ORFs might occur by chance. On the other hand, short ORFs can be exons. To support the gene predictions, additional criteria (based for example on nucleotide composition) are needed. The predictions can also be validated by comparing the sequence with genes identified in other organisms.
Another challenge is to identify the regulatory region (promoter) of a gene. For this purpose, we can make use of the various signals (TATA box, etc.).
Similarly, mRNA sequences can be analysed in order to detect the coding sequence (CDS).

Control of transcription - Gene regulation

The promoter of a gene does not code for a protein. It is a regulatory region of DNA, usually located upstream of a gene, that contains binding sites for transcription factors.
Those transcription factors can be activators (if they activate transcription) or repressors (if they repress transcription). Such control is crucial for the regulation of the gene. This is why the binding sites are often referred to as regulatory sequences.

Transcription is tightly controlled by specific proteins called transcription factors. These factors bind DNA sequences in the promoter of genes and interact with the RNA polymerase. They can either activate transcription (activators) or repress it (repressors).
Figure legend: colored lines are binding sites (DNA sequence patterns). Blobs are factors (proteins) that recognize binding sites.

Transcription factors recognise and bind specific DNA patterns (motifs) called binding sites. Such sites are called regulatory sequences (or regulatory elements). For example:
The transcription factor Pho4p specifically recognises the sequence CACGTG in the promoter of genes in yeast.
The cAMP Receptor Protein (CRP) specifically recognises the pattern TGTGA-N6-TCACA.
Exercise: assuming equal probabilities for each nucleotide, calculate the probability of finding the sequence CACGTG at a particular position. Assuming that the genome is 6,000,000 bp long, calculate the expected number of occurrences of this pattern.

Challenge of the bioinformatician
One challenge for the bioinformatician is to predict the regulation of genes on the basis of the regulatory elements found in their promoters. Two questions can be addressed:
(1) You already know a regulatory element and you want to find the genes whose promoter contains it. This is referred to as pattern matching.
(2) You do not know the regulatory element, but you have, for example, a set of co-regulated genes. You can then search for a regulatory element that they share. This approach is called pattern discovery.
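The exercise and the pattern-matching question above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (site_probability, expected_occurrences, find_sites) are hypothetical and not part of the course material, and it assumes independent, equiprobable nucleotides and a single strand. The CRP pattern TGTGA-N6-TCACA is written here as a regular expression in which N6 stands for any six nucleotides.

```python
import re


def site_probability(pattern_length, base_prob=0.25):
    """Probability of one fixed pattern at one given position,
    assuming independent nucleotides that each occur with base_prob."""
    return base_prob ** pattern_length


def expected_occurrences(pattern_length, genome_length, base_prob=0.25):
    """Expected number of occurrences on one strand: the number of
    possible start positions times the per-position probability."""
    positions = genome_length - pattern_length + 1
    return positions * site_probability(pattern_length, base_prob)


def find_sites(sequence, motif_regex):
    """Pattern matching: start positions (0-based) of every match,
    using a lookahead so that overlapping matches are also reported."""
    return [m.start() for m in re.finditer(f"(?=({motif_regex}))", sequence)]


# Pho4p site from the text, and the CRP pattern written as a regex (assumption:
# N means any of A, C, G, T).
PHO4_SITE = "CACGTG"
CRP_SITE = "TGTGA[ACGT]{6}TCACA"

p = site_probability(len(PHO4_SITE))                    # (1/4)**6 = 1/4096
n = expected_occurrences(len(PHO4_SITE), 6_000_000)     # about 1465
hits = find_sites("AACACGTGTT", PHO4_SITE)              # toy sequence, not real data
```

Under these assumptions the probability is 1/4096 (about 0.00024), so roughly 1465 occurrences of CACGTG are expected by chance alone in a 6,000,000 bp sequence. Real genomes have unequal nucleotide frequencies, so the figure is only an order-of-magnitude guide.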
The difficulty will be to distinguish real patterns from patterns occurring by chance, and to estimate the probability that a pattern found is indeed a binding sequence.

DNA and evolution
Organisms are linked together in evolutionary history, all having evolved from one or a very few ancient ancestral life forms. This process of evolution, still in action, involves changes in the genome that are passed on to subsequent generations. These changes can alter the encoded protein and RNA molecules, and thus change the organism, making its survival more (or less) likely in the circumstances in which it lives. In this way the forces of evolution are inextricably linked to the genomic DNA molecules.

Challenge for the bioinformatician
Finding the evolutionary links between genes and genomes would greatly help to refine the tree of life. This is the major goal of phylogenetics. On the other hand, when two organisms are known to be closely related from an evolutionary perspective, comparing them can help to predict the function of unknown genes or proteins in one organism when their homologs have been characterized in other organisms. This approach is called comparative genomics.

Tree of life
Because organisms are evolutionarily related, many things can be inferred by comparing genes and genomes... but not everything!

Human intervention
While computer-based analysis has the benefit of being easily carried out (large memory, fast computation) in an objective way, it cannot guarantee biologically relevant results. Manual checking (i.e. interpreting the results using biological knowledge) remains essential! Ultimately, only experiments will validate (or not) the bioinformatic predictions. Bioinformatic predictions can be used to reduce the number of possibilities.

References
Zvelebil and Baum (2007) Understanding Bioinformatics, Garland Science.
Mount DW (2004) Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press, New York.
Alberts, Bray, Lewis, Raff, Roberts, Watson (2002) Molecular Biology of the Cell, Garland Science.
Lewin B (1997) Genes VI, Oxford University Press.