MODULE 1
Sequence Information and File Formats
AIMS

To understand the conventions regarding the presentation of DNA and protein sequence
information.

To understand the logic underlying these conventions.

To become familiar with the commonly used sequence file formats.

To become familiar with the READSEQ programme for the interconversion of file
formats.
OBJECTIVES
The student should be able to:

Present a nucleotide or protein sequence according to accepted conventions

Recognize different sequence file formats

Interconvert files between formats
INTRODUCTION
Virtually all the information one deals with in computational molecular biology is either in the
form of DNA or protein sequences. There are certain conventions applying to the way this
sequence information is presented both in the conventional literature and in databases.
Furthermore, the way in which sequence information is stored, retrieved and manipulated varies.
That is to say there are different computer file types. This module explains the conventions of
sequence presentation, describes the various file types, and illustrates how these file types can be
interconverted. This material, while really not that exciting, is of absolutely fundamental
importance to anyone wishing to work in bioinformatics.
DNA
The DNA of living organisms is normally double stranded. However, whenever you look at a
paper which includes DNA sequence information it is the convention to show only one strand of
the DNA. This raises two questions: "which strand do you show?" and "which way round do
you show it?"
It is usually the case, except in some viral genomes, that at any particular point either strand
can be the template strand (the strand that is transcribed), but not both. Given that the two
strands are anti-parallel, the genes on one strand will face in one direction and the genes on the
other strand will face in the opposite direction.
As you will remember, the orientation of a DNA strand is determined by which end has a
5'-phosphate group and which has a 3'-hydroxyl group. Thus, any DNA strand has a 5'-3' polarity.
RNA polymerase in all organisms moves along the template strand of the DNA in the 3'-5'
direction producing RNA that grows in the 5'-3' direction. So in fact the RNA sequence will be
identical to that of the non-template strand, except for the presence of uracil instead of thymine.
Consequently, it has become the convention to show the non-template (coding) strand of the DNA
when presenting sequence information, because it resembles the RNA encoded by that particular gene.
Presumably for purely cultural reasons, the sequence is shown running from left to right on the
page, with the 5' end of the sequence on the left. This corresponds to the protein sequence
also running from left to right, starting with its N-terminus.
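The relationship between the two strands and the RNA can be checked with a few lines of Python. This is only a minimal sketch: the sequence is a made-up example, and the function names are chosen here for illustration.

```python
# Watson-Crick pairing for unambiguous DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, also written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def transcribe(template: str) -> str:
    """Return the RNA (5'->3') made from a template strand given 5'->3'.

    The polymerase reads the template 3'->5', so the RNA is the reverse
    complement of the template, with uracil in place of thymine.
    """
    return reverse_complement(template).replace("T", "U")


# Hypothetical non-template strand, written 5'->3' as in the literature.
non_template = "ATGGTCTACTGC"
template = reverse_complement(non_template)
rna = transcribe(template)

print("non-template:", non_template)
print("template:    ", template)
print("RNA:         ", rna)
```

The printed RNA differs from the non-template strand only in having U where the DNA has T, which is why the non-template strand is the one shown.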
Sometimes, when a sequencing project is still at the draft stage, there are ambiguities in the
sequence that have yet to be resolved. IUPAC has defined a standard table of nucleotide
ambiguity codes.
R = A or G
K = G or T
S = G or C
Y = C or T
M = A or C
W = A or T
B = not A (G or C or T)
H = not G (A or T or C)
N = any nucleotide
D = not C (G or A or T)
V = not T (A or G or C)
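In a program the ambiguity table is naturally stored as a lookup from each code to the set of bases it stands for. The sketch below (in Python, with function names invented for this illustration) uses such a lookup to test whether a concrete sequence is compatible with an ambiguous one, and to list every sequence an ambiguous pattern could represent.

```python
from itertools import product

# Each IUPAC code mapped to the bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def matches(pattern: str, sequence: str) -> bool:
    """True if every base of sequence is allowed by the code at that position."""
    if len(pattern) != len(sequence):
        return False
    return all(
        base in IUPAC[code]
        for code, base in zip(pattern.upper(), sequence.upper())
    )


def expand(pattern: str) -> list:
    """List all unambiguous sequences that the pattern could represent."""
    choices = (IUPAC[code] for code in pattern.upper())
    return ["".join(bases) for bases in product(*choices)]


print(matches("GAYTN", "GATTC"))  # Y allows T, N allows C
print(expand("RY"))
```

Note that the number of expansions grows quickly with the number of ambiguous positions, so expand is only sensible for short patterns.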
PROTEIN
Polypeptides, like DNA strands, have a polarity: the N-terminal end possesses a free amino
group and the C-terminal end a free carboxyl group. It follows both from the fact that
the N-terminal part of the protein is synthesized first and from the convention regarding the
presentation of DNA sequences, that a polypeptide is presented with its N-terminus on the left of
the page and its C-terminus on the right.
H2N-Methionine-Valine-Tyrosine-Cysteine-Arginine-Glycine-Isoleucine-Lysine-COOH
To keep the polypeptide information in a form that can be conveniently handled by computers
the amino acids are each given a single letter code. Thus, the sequence above would be
represented as:
MVYCRGIK
You will notice that each amino acid is not necessarily represented by its initial letter. This table
provides the standard one-letter code for amino acids.
Glycine G
Isoleucine I
Cysteine C
Tryptophan W
Arginine R
Alanine A
Phenylalanine F
Threonine T
Proline P
Lysine K
Valine V
Tyrosine Y
Methionine M
Aspartic acid D
Histidine H
Leucine L
Serine S
Asparagine N
Glutamic acid E
Glutamine Q
There is a recent agreement in IUPAC that selenocysteine, which occasionally occurs in proteins,
should be represented by the letter U.
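As a small illustration, the table can be held as a Python dictionary and used to convert the spelled-out peptide shown earlier into its one-letter form. This is a sketch only; the names ONE_LETTER and to_one_letter are invented for the example.

```python
# One-letter codes from the table above, plus U for selenocysteine.
ONE_LETTER = {
    "Glycine": "G", "Isoleucine": "I", "Cysteine": "C", "Tryptophan": "W",
    "Arginine": "R", "Alanine": "A", "Phenylalanine": "F", "Threonine": "T",
    "Proline": "P", "Lysine": "K", "Valine": "V", "Tyrosine": "Y",
    "Methionine": "M", "Aspartic acid": "D", "Histidine": "H", "Leucine": "L",
    "Serine": "S", "Asparagine": "N", "Glutamic acid": "E", "Glutamine": "Q",
    "Selenocysteine": "U",
}


def to_one_letter(residue_names) -> str:
    """Convert residue names, listed from N-terminus to C-terminus, to one-letter code."""
    return "".join(ONE_LETTER[name] for name in residue_names)


peptide = "Methionine-Valine-Tyrosine-Cysteine-Arginine-Glycine-Isoleucine-Lysine"
print(to_one_letter(peptide.split("-")))
```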
FILE TYPES
Many software packages have been developed for the analysis of DNA and protein sequences
and an unfortunate by-product of this is that a variety of different file formats have been
developed to store DNA and protein sequence information. The various software packages will
usually only accept a specific file format. However, there are programmes which will convert
sequence information between the different file formats. The situation is made even worse by the
fact that different sequence databases hold the information in different file formats. So an
important set of basic skills is to be able to recognize the different file formats and to be able to
interconvert files between formats. The Table below lists many of the file formats and the most
commonly used have hyperlinks to nucleotide sequence examples.
IG/Stanford
Fitch
Plain/Raw
GenBank/GB
Pearson/Fasta
PIR/CODATA
NBRF
Zuker
MSF
EMBL
Olsen
ASN.1
Phylip 3.2
PAUP/NEXUS
GCG
DNAStrider
Phylip
Pretty
READSEQ
There are many programmes for file conversion (the GCG suite alone has 23!). In this module
we will use READSEQ which is particularly useful as it automatically detects many sequence
formats and interconverts them.
Web-based versions of READSEQ are available at NIH, BCM Search Launcher and Bioportal.
Have a look at them. All you need to do is select and copy a sequence, paste it into the
Readseq window and select the format you want it converted to (there is help available if you
really need it).
An extensive guide to sequence exchange and re-formatting in the GCG suite of programmes is
available in the EMBnet Biocomputing tutorials.
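To give a feel for how a format can be recognized automatically, here is a deliberately simple sketch in Python. It is not how READSEQ itself works; it only checks the first non-blank line against three well-known conventions (a FASTA entry starts with ">", a GenBank flatfile with LOCUS and an EMBL entry with an ID line) and treats letters-only text as plain sequence. The function name guess_format is invented for this example.

```python
def guess_format(text: str) -> str:
    """Guess a sequence file format from its first non-blank line (rough heuristic)."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return "unknown"
    first = lines[0]
    if first.startswith(">"):
        return "fasta"
    if first.startswith("LOCUS"):
        return "genbank"
    if first.startswith("ID "):
        return "embl"
    # Nothing but letters on every line: assume plain/raw sequence.
    if all(line.strip().isalpha() for line in lines):
        return "raw"
    return "unknown"


print(guess_format(">seq1 example\nACGTACGT\n"))
print(guess_format("ACGTACGT\nTTGACA\n"))
```

A real converter has to distinguish many more formats and cope with malformed input, which is a good reason to use an established tool rather than a home-made check.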
A DETAILED LOOK AT TWO FILE FORMATS
FastA
The simplest sequence file format, and one used by many molecular biology analysis tools
available on the Web, is the FASTA (or Pearson) format. The first line of the file always begins
with the > (greater than) symbol and this is followed by the sequence identification, or
sometimes by some more informative content, and then an end-of-line (line break). The
sequence, DNA or protein, is then represented by a simple string of characters, which may be
wrapped over several lines.
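Because the format is so simple, a reader for it takes only a few lines. The following Python sketch (the function name and the two example records are made up for illustration) collects each ">" line as a header and joins the lines that follow it into one sequence string.

```python
def parse_fasta(text: str) -> list:
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical example: one DNA record wrapped over two lines, one peptide.
EXAMPLE = """>seq1 hypothetical DNA fragment
ATGGTCTACTGC
CGCGGCATCAAA
>pep1 hypothetical peptide
MVYCRGIK
"""

for header, sequence in parse_fasta(EXAMPLE):
    print(header, len(sequence), sequence)
```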
GENBANK
Perhaps the most important file format to be familiar with is the GenBank flatfile. Let’s start by
explaining why this format is so important. There are three international sequence databases
(GenBank in the USA, EMBL in Europe and DDBJ in Japan) which together constitute the
International Nucleotide Sequence Database Collaboration. A key feature of this collaboration is
an agreement on the rules relating to the way that data are stored and annotated (i.e. the file
format) that permits the three databases to exchange information on a daily basis. The GenBank,
EMBL, and DDBJ nucleic acid sequence data banks have from their inception used tables of
sites and features to describe the roles and locations of higher order sequence domains and
elements within the genome of an organism. In February 1986, GenBank and EMBL began a
collaborative effort (joined by DDBJ in 1987) to devise a common feature table format and
common standards for annotation practice. It is very useful to have a look at The
DDBJ/EMBL/GenBank Feature Table Definition. The term flatfile refers to the structure of the
database (databases can either be flatfile or relational). The sequence data in GenBank are held as
ASN.1 files that are machine readable, but don’t make much sense to human beings (have a look
at an example). The ASN.1 file can be converted to a human-readable GenBank flatfile (GBFF).
The GBFF consists of three distinct parts:
The Header contains database-specific information that gives the sequence its unique
identification, together with other information, e.g. source organism, associated literature
reference etc.
The Features section comprises the annotation of the sequence and describes features such as
exons, introns, coding sequences etc.
The nucleotide sequence itself constitutes the third part of the file.
Have a look at an example GBFF, with a good explanation of what each part of the file means,
produced by the National Center for Biotechnology Information (NCBI).
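The three parts can be separated mechanically, because in a flatfile the feature table begins at the line starting with FEATURES, the sequence begins after the line starting with ORIGIN, and the entry ends with a line containing //. The Python sketch below relies only on those three markers. The record is a made-up miniature, its spacing is not meant to follow the official column layout, and the function name split_genbank is invented for this example.

```python
# Made-up miniature record; real entries carry far more header and feature lines.
EXAMPLE = """LOCUS       EXAMPLE1   24 bp   DNA
DEFINITION  Hypothetical example record.
ACCESSION   EXAMPLE1
FEATURES             Location/Qualifiers
     source          1..24
     CDS             1..24
ORIGIN
        1 atggtctact gccgcggcat caaa
//
"""


def split_genbank(text: str):
    """Split one flatfile entry into (header lines, feature lines, sequence)."""
    header, features, sequence_lines = [], [], []
    section = header
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end of the entry
        if line.startswith("FEATURES"):
            section = features
        elif line.startswith("ORIGIN"):
            section = sequence_lines
            continue  # the ORIGIN line itself holds no sequence
        section.append(line)
    # Drop the position numbers and spaces, keeping only the letters.
    sequence = "".join(ch for line in sequence_lines for ch in line if ch.isalpha())
    return header, features, sequence


header, features, sequence = split_genbank(EXAMPLE)
print(len(header), "header lines")
print(len(features), "feature-table lines")
print(sequence)
```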
EXERCISES
1. Take a sequence in GenBank format and convert it to GCG and FastA using one of the Web
READSEQ sites
1.1 Go to the Table and click the GenBank link.
1.2 Select the whole of the GenBank file and copy it.
1.3 Go to one of the Readseq links (e.g. NIH)
1.4 Click within the boundaries of the text box
1.5 Paste the contents of the clipboard
1.6 Select the desired output format from the dropdown menu
1.7 Press the Submit button
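For comparison, the GenBank-to-FastA half of this exercise can be roughed out in a few lines of Python. This is a simplified sketch, not a substitute for READSEQ: it assumes a single entry, takes the name from the second field of the LOCUS line, and ignores all annotation. The record and the function name genbank_to_fasta are invented for the example.

```python
def genbank_to_fasta(text: str, width: int = 60) -> str:
    """Convert one GenBank flatfile entry to FASTA, keeping only name and sequence."""
    name, residues, in_sequence = "unnamed", [], False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            if len(fields) > 1:
                name = fields[1]
        elif line.startswith("ORIGIN"):
            in_sequence = True
        elif line.startswith("//"):
            break
        elif in_sequence:
            # Keep the letters, dropping position numbers and spaces.
            residues.extend(ch for ch in line if ch.isalpha())
    sequence = "".join(residues).upper()
    # Wrap the sequence at a fixed width, as FASTA files usually are.
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(wrapped) + "\n"


# Made-up miniature entry; spacing does not follow the official column layout.
RECORD = """LOCUS       EXAMPLE1   24 bp   DNA
FEATURES             Location/Qualifiers
     CDS             1..24
ORIGIN
        1 atggtctact gccgcggcat caaa
//
"""

print(genbank_to_fasta(RECORD), end="")
```

Note how much is lost in the conversion: the whole feature table disappears, which is why going from FastA back to GenBank cannot restore the annotation.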