* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download File Formats
Molecular cloning wikipedia , lookup
Genetic code wikipedia , lookup
DNA barcoding wikipedia , lookup
Promoter (genetics) wikipedia , lookup
Silencer (genetics) wikipedia , lookup
Nucleic acid analogue wikipedia , lookup
Gene expression wikipedia , lookup
Non-coding DNA wikipedia , lookup
Ancestral sequence reconstruction wikipedia , lookup
Cre-Lox recombination wikipedia , lookup
Point mutation wikipedia , lookup
Molecular evolution wikipedia , lookup
Homology modeling wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
An Introduction to Bioinformatics Sequence information and file formats AIMS To understand the conventions regarding the presentation of DNA and protein sequence information To understand the logic underlying these conventions To become familiar with the commonly used sequence file formats To become familiar with the READSEQ programme for the interconversion of file formats OBJECTIVES Present a nucleotide or protein sequence according to accepted conventions Recognize different sequence files formats Interconvert files between formats INTRODUCTION Virtually all the information one deals with in computational molecular biology is either in the form of DNA or protein sequences There are conventions applying to the presentation/storage of sequence information The way in which sequence information is stored, retrieved and manipulated varies There are different computer file types for sequence information DNA The DNA of living organisms is normally double stranded It is the convention to show only one strand of the DNA Which strand do you show? Which way round do you show it? The orientation of a DNA strand is determined by which end has a 5'-phosphate group and which has a 3'-hydroxyl group It is usually the case that either strand can be the template (or coding) strand at any particular point Given that the two strands are anti-parallel, the genes on the two strands will face in opposite directions RNA polymerase in all organisms moves along the template strand of the DNA in the 3'-5' direction producing RNA that grows in the 5'-3' direction The RNA sequence will be identical to that of the nontemplate strand, except for the presence of uracil instead of thymine 3’ GGCATAGCAGGTACGTTATGCCAGCATTG 5’ template 5’ CCGTATCGTCCATGCAATACGGTCGTAAC 3’ non-template 5’ CCGUAUCGUCCAUGCAAUACGGUCGUAAC 3’ mRNA The convention is to show the non-template strand of the DNA because it resembles the RNA For purely cultural reasons the sequence is shown running from right to left on the page, with the 5' end of the sequence on the right Sometimes when sequencing projects are in the draft state there are still ambiguities in the sequence IUPAC have defined a standard table for the nucleotide ambiguity codes R = A or G K = G or T S = G or C Y = C or T M = A or C W = A or T B = not A H = not G D = not C V = not T N = any PROTEIN Polypeptides have a polarity, with an N-terminal and C-terminal ends possessing a free amino group and carboxyl group respectively A polypeptide is presented with its N-terminus on the left of and its C-terminus on the right. H2N-Methionine-Valine-Tyrosine-Glycine-Isoleucine-Lysine-COOH To keep the polypeptide information in a form that can be conveniently handled by computers the amino acids are each given a single letter code Myoglobin Glycine G Isoleucine I Alanine A Cysteine C Tryptophan W Arginine R Phenylalanine F Threonine T Proline P Lysine K Valine V Tyrosine Y Methionine M Aspartate D Histidine H Leucine L Serine S Asparagine N Glutamate G Glutamine Q H2N-Methionine-Valine-Tyrosine-Glycine-Isoleucine-Lysine-COOH MVYGIK FILE TYPES Many software packages have been developed for the analysis of DNA and protein sequences A variety of different file formats have been developed to store/analyse DNA and protein sequence information The various software packages will usually only accept a specific file format The situation is made worse by the fact that different databases hold the information in different file formats An essential skill is be able to recognize the different formats and to be able to interconvert files between formats Common File Formats IG/Stanford Fitch Plain/Raw GenBank/GB Fasta/Pearson PIR/CODATA NBRF Zuker MSF EMBL Olsen ASN 1.8 GCG Phylip 3.2 PAUP/NEXUS DNAStrider Phylip Pretty PLAIN SEQUENCE FORMAT A sequence in plain format may contain only IUPAC characters and spaces (no numbers!). Note: A file in plain sequence format may only contain one sequence, while most other formats accept several sequences in one file. An example sequence in plain format is: AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAA CCTCCCATCCGTGTCTATTGTACCCTGTTGCTTCGGCGGGCCCGC CGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTG CCCGCCGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTC TGAGTTGATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCT FASTA FORMAT A sequence file in FASTA format can contain several sequences. One sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line must begin with a greater-than (">") symbol in the first column. An example sequence in FASTA format is: >U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1) AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT TTCAACAATGGATCTCTTGGTTCCGGC EMBL FORMAT A sequence file in EMBL format can contain several sequences. One sequence entry starts with an identifier line ("ID "), followed by further annotation lines. The start of the sequence is marked by a line starting with "SQ" and the end of the sequence is marked by two slashes ("//"). An example sequence in EMBL format is: ID XX AC XX DE DE XX SQ // AA03518 standard; DNA; FUN; 237 BP. U03518; Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S rRNA and 5.8S rRNA genes, partial sequence. Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other; aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg catccgtgtc ggcgcctctg gcgtgcagtc ttccggc 60 120 180 237 GENBANK FORMAT A sequence file in GenBank format can contain several sequences. One sequence in GenBank format starts with a line containing the word LOCUS and a number of annotation lines. The start of the sequence is marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//"). An example sequence in GenBank format is: LOCUS DEFINITION AAU03518 237 bp DNA PLN 04-FEB-1995 Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S rRNA and 5.8S rRNA genes, partial sequence. U03518 41 a 77 c 67 g 52 t ACCESSION BASE COUNT ORIGIN 1 aacctgcgga 61 tattgtaccc 121 ccccccgggc 181 tgagttgatt // aggatcatta tgttgcttcg ccgtgcccgc gaatgcaatc ccgagtgcgg gcgggcccgc cggagacccc agttaaaact gtcctttggg cgcttgtcgg aacacgaaca ttcaacaatg cccaacctcc ccgccggggg ctgtctgaaa gatctcttgg catccgtgtc ggcgcctctg gcgtgcagtc ttccggc