Databases
Bioinformatics (Bachelor)
OVERVIEW OF DATABASES
Adapted 17/09/2013
Overview of databases
2.1. Need for integration and standardisation
2.1.1 What is a database?
2.1.2 Types of databases
2.1.3 Purposes of biological databases
2.2. Sequence formats
2.2.1 GenBank format
2.2.2 FastA format
2.3. Information retrieval from biological databases
2.4. Sequence databases
2.4.1 Redundant databases: sequence repositories
2.4.2 From redundant databases to comprehensive databases
Kathleen Marchal
updated 17/09/13
2.1. Need for integration and standardisation
Considerable amounts of data are collected from genome analysis, protein analysis, microarray
experiments, and so on. Keeping track of so much data is beyond the capacity of most laboratories.
Data are often kept in spreadsheets (Excel) because these are easy to use. However, such data
should be organized into a suitable database.
2.1.1
What is a database?
A database is a computerized archive used to store and organize data in such a way that
information can be retrieved easily via a variety of search criteria. Databases are composed of
computer hardware and software for data management. The chief objective of the development of a
database is to organize data in a set of structured records to enable easy retrieval of information.
Each record, also called an entry, should contain a number of fields that hold the actual data items,
for example, fields for names, telephone numbers, addresses, dates. To retrieve a particular record
from the database, a user can specify a particular piece of information, called value, to be found in a
particular field and expect the computer to retrieve the whole data record. This process is called
making a query. Although data retrieval is the main purpose of all databases, biological databases
often have a higher level of requirement, known as knowledge discovery, which refers to the
identification of connections between pieces of information that were not known when the
information was first entered. For example, databases containing raw sequence information can
perform extra computational tasks to identify sequence homology or conserved motifs. These
features facilitate the discovery of new biological insights from the data.
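The vocabulary above (record, field, value, query) can be illustrated with a minimal sketch. The records, field names and the helper `query` are invented for illustration; a real database system would use indexes rather than the simple loop shown here.

```python
# Each record (entry) holds the same named fields; the values are made up.
records = [
    {"name": "Alice", "telephone": "555-0101", "city": "Ghent"},
    {"name": "Bob", "telephone": "555-0102", "city": "Leuven"},
    {"name": "Carol", "telephone": "555-0103", "city": "Ghent"},
]


def query(entries, field, value):
    """Return every whole record whose `field` holds exactly `value`."""
    return [entry for entry in entries if entry.get(field) == value]


# The user specifies a value for one field and gets back the complete records.
for hit in query(records, "city", "Ghent"):
    print(hit["name"], hit["telephone"])
```

The query names only one field, yet each whole record is returned, which is the behaviour described in the text.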
2.1.2
Types of databases
The architecture of a database refers to the manner in which the entries in a database are organized
for archiving and easy retrieval.

Flat file databases, which are long text files that contain many entries separated by a
delimiter (e.g. |) and within each entry a number of fields separated by tabs or commas, were
the first databases (Fig. 3.1a). Except for the raw values in each field, the entire text file
does not contain any hidden instructions for computers to search for specific information or
to create reports based on certain fields from each record. The text file can be considered a
single table. Thus, to search a flat file for a particular piece of information, a computer has
to read through the entire file, an obviously inefficient process. This is manageable for a
small database, but as database size increases or data types become more complex, this
database style can become very difficult for information retrieval. Indeed, searches through
such files often cause crashes of the entire computer system because of the memory-intensive nature of the operation.
Therefore, database management systems were developed that not only contain raw data
records but also operational instructions to help identify hidden connections among data
records. Depending on the types of data structures, these database management systems can be
classified into relational databases or object-oriented databases.

Relational databases use a set of tables to organize the data (Fig. below). Each table, also
called a relation, is made up of columns and rows. Columns represent individual fields.
Rows represent values in the fields of records. The columns in a table are indexed according
to a common feature called an attribute, so they can be cross-referenced in other tables. To
execute a query in a relational database, the system selects linked data items from different
tables and combines the information into one report. Therefore, information can be found
more quickly. Relational databases can be created using a special programming language
called structured query language (SQL). The creation of this type of database can take a
great deal of planning during the design phase. After creation of the original database, a new
data category can be easily added without requiring all existing tables to be modified. The
subsequent database searching and data gathering for reports are relatively straightforward.
Object-oriented databases store data as objects and can describe complex hierarchical
relationships between data items (Fig. below). In an object-oriented programming language,
an object can be considered as a unit that combines data and mathematical routines that act
on the data. The database is structured such that the objects are linked by a set of pointers
defining predetermined relationships between the objects. Searching the database involves
navigating through the objects with the aid of the pointers linking different objects.
Programming languages like C++ and Java are used to create object-oriented databases. This
type of database is not widely used, but it provides high performance for
organizations with extensive amounts of highly complex data. The object-oriented
database is more flexible in expressing data relations than the relational database. Data can
be structured based on hierarchical relationships. By doing so,
programming tasks can be simplified for data that are known to have complex relationships.
However, the object-oriented database lacks the rigorous mathematical foundation of the
relational database. For example, it is not easy to add new data to the database, because the
whole database structure then needs to be updated. There is also a risk that some of the
relationships between objects may be misrepresented. Some current databases have
therefore incorporated features of both types of database programming, creating the object-relational database. Current biological databases use all three types of database structure:
flat files, relational and object-oriented databases.
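The difference between searching a flat file and querying a relational database can be sketched as follows. This is only a toy illustration: the gene names, the delimiter choices and the table layout are assumptions made for the example, and SQLite (through Python's standard `sqlite3` module) stands in for a relational database management system.

```python
import sqlite3

# Hypothetical flat file: entries separated by "|", fields separated by commas.
FLAT_FILE = "geneA,human,1200|geneB,mouse,950|geneC,human,300"


def scan_flat_file(text, organism):
    """Linear scan: every entry is read and split, whether it matches or not."""
    hits = []
    for entry in text.split("|"):
        name, org, _length = entry.split(",")
        if org == organism:
            hits.append(name)
    return hits


def build_relational(text):
    """Load the same data into two linked tables in an in-memory database."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    con.execute(
        "CREATE TABLE gene (name TEXT, length INTEGER, "
        "organism_id INTEGER REFERENCES organism(id))"
    )
    for entry in text.split("|"):
        name, org, length = entry.split(",")
        con.execute("INSERT OR IGNORE INTO organism (name) VALUES (?)", (org,))
        (org_id,) = con.execute(
            "SELECT id FROM organism WHERE name = ?", (org,)
        ).fetchone()
        con.execute("INSERT INTO gene VALUES (?, ?, ?)", (name, int(length), org_id))
    return con


def query_relational(con, organism):
    """One SQL query combines linked items from both tables into one report."""
    rows = con.execute(
        "SELECT gene.name FROM gene "
        "JOIN organism ON gene.organism_id = organism.id "
        "WHERE organism.name = ? ORDER BY gene.name",
        (organism,),
    )
    return [row[0] for row in rows]


print(scan_flat_file(FLAT_FILE, "human"))
print(query_relational(build_relational(FLAT_FILE), "human"))
```

Both approaches return the same genes here. The point of the relational version is that the organism name is stored once and cross-referenced by an identifier, so a new table can be linked in later without rewriting the existing ones.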
2.1.3
Purposes of biological databases
Repository databases contain large redundant datasets including high and low quality data.
The aim of these databases is to keep biologists at the cutting edge by providing
essential information without any extras. Examples of such databases include those
containing uncurated genomic sequence or single-pass sequence data (EST database).
Curated databases contain curated information in which a user would expect to find
annotated entries. Although the organization of information and its proper implementation
are clearly important, the elimination of wrong information and addition of informative
features are equally important for the growth of biological databases. This can be achieved
by careful annotation, either automatically (Ensembl, Gene) or manually (VEGA,
Swiss-Prot). Databases such as Swiss-Prot and VEGA have invested considerable effort in
having expert biologists annotate information about genes and proteins. Although manual
annotation is time-consuming, its benefits to the community far outweigh the labor and
expense involved.
Comprehensive databases are the extreme case of annotated databases. These databases not
only provide clean data, but also try to give as complete an overview as possible of all the
information on the specific topic. For instance, Ensembl and Gene are examples of
comprehensive sequence databases.
Fig. 2. The schematic illustrates the basic design of relational and object oriented databases.
(From Navarro et al., 2003)
(a) In a relational database model, the data are stored in tables (Tables 1, 3 and 4) and relationships can be defined on
a many-to-one basis or a many-to-many basis. The latter type requires a separate relationship table linking the two
tables (Table 2). All records in a given table must have identical features, and any unique features have to
be stored in a separate table. This often requires creation of a large number of tables. (b) In an object oriented
database, relationships between data records (objects) are primarily of a parent-child type (in the illustration, root is
parent of Proteins and Proteins is parent of Protein 1 and Protein 2). Although in a relational system, a record is
addressed by its record identifier and its containing table, in an object system, a record is identified by the entire
hierarchy leading up to the object (for example, the pTyr record in the example is addressed as root/Proteins/Protein
1/Modifications/pTyr). Because each record in an object database is a code object, it can change the data it contains
based on current conditions making it easier to generate data dynamically. The ‘catalog’ object, for example, can
automatically update the contents of its indexes based on the contents and the rest of the database.
2.2. Sequence formats
Once sequences have been determined they are made publicly available in databases. A sequence
usually is submitted at the time of publication of the sequence in a journal article. Most sequence
databases are relational databases. The sequence repositories also offer the possibility to retrieve the
sequence and a lot of biological information. Genes and proteins can have different identifiers,
depending on the organism, but also depending on the database the sequence is stored in and the
type of sequence. The conversion of different identifiers, from gene to protein or from the one
database to the other, is called ID mapping.
Two important sequence databases exist: Gene (National Center for Biotechnology Information,
NCBI) and Ensembl (European Bioinformatics Institute, EBI). Both Ensembl and Gene
(NCBI) will be used during exercises.
Before a sequence file is used in a sequence analysis program, it is important that the program
can recognize the sequence. Special formats have therefore been defined. Different
computer programs use different formats, and it is sometimes necessary to convert one format
into another. Converter programs exist for this purpose, e.g. READSEQ
(http://www.ebi.ac.uk/cgi-bin/readseq.cgi). Most computer programs accept ASCII files (i.e.
files saved as text only).
Two major formats can be distinguished:
2.2.1
GenBank format
Although GenBank is a relational database, the search output for sequence files is produced as flat
files for easy reading. The resulting flat file, the GenBank sequence format (Fig. above), contains
three sections, a header, features and a sequence, which all contain many fields with unique
identifiers for easy indexing by computer software.
The top line of the header section is the LOCUS, which contains a unique database identifier for a
sequence location in the database (not a chromosome locus). The identifier is followed by sequence
length and molecule type and a three-letter code for GenBank divisions. Next to the division is the
date when the record was made public.
The following line, “DEFINITION”, provides the summary information for the sequence record
including the name of the sequence, the name and taxonomy of the source organism if known, and
whether the sequence is complete or partial.
This is followed by “ACCESSION” and “VERSION”, which also includes the GI. At NCBI, each
sequence record has two sequence identification numbers, a GenInfo Identifier (GI) and an
Accession (AC). The GenInfo Identifier is a unique NCBI nucleotide sequence identification
number. If a nucleotide sequence changes in any way, a new GI number will be assigned (see
Display Settings: Revision History). A separate GI number is also assigned to each protein
translation within a nucleotide sequence record, and a new GI is assigned if the protein translation
changes in any way. These identifiers appear as qualifiers for CDS features in the features portion
of a GenBank entry. An accession number is a unique identifier given to a sequence when it is
submitted to one of the DNA repositories (GenBank, EMBL-EBI, DDBJ) and remains constant over
the lifetime of the record, even when there is a change to the sequence
or annotation. The accession number is also shared between the three databases (GenBank,
DDBJ, EMBL). The initial deposition of a sequence record is referred to as
version 1. If the sequence is updated, the version number is incremented, but the accession number
will remain constant (see Display Settings: Revision History).
The accession.version system of sequence identifiers runs parallel to the GI number system, i.e.,
when any change is made to a sequence, it receives a new GI number AND an increase to its
version number. The accession number will always retrieve the most recent version of the record;
the older versions remain available under the old Accession.version identifiers and their original GI
numbers.
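The accession.version scheme can be handled with a few lines of code. The sketch below is illustrative only: the identifiers are made-up placeholders that do not follow real accession formats, and the helper names are hypothetical.

```python
def split_identifier(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is None when the identifier carries no '.version' suffix.
    """
    accession, dot, version = identifier.partition(".")
    return accession, int(version) if dot else None


def latest_version(identifiers):
    """Map each accession to the highest version number seen for it."""
    latest = {}
    for identifier in identifiers:
        accession, version = split_identifier(identifier)
        if version is not None and version > latest.get(accession, 0):
            latest[accession] = version
    return latest


# Placeholder identifiers: the accession stays constant, the version increases.
history = ["ACC0001.1", "ACC0001.2", "ACC0002.1"]
print(latest_version(history))
```

Searching with the bare accession corresponds to looking up the highest version in this mapping, while the older accession.version identifiers remain valid keys for the earlier records.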
The next line in the header section is the “ORGANISM” field, which gives the source
organism with the scientific name of the species, its taxonomic classification and sometimes the
tissue type.
The “REFERENCE” field provides the publication citation(s) related to the sequence entry. The
features table includes annotation information about the gene and the gene product, as well as
regions of biological significance reported in the sequence, with identifiers and qualifiers. The
“source” field provides the length of the sequence, the scientific name of the organism, and the
taxonomy identification number. Some optional information includes the clone source, the tissue
type and the cell line. The “gene” field gives information about the nucleotide coding sequence and
its name. For DNA entries, the “CDS” field gives information about the boundaries of the sequence
that can be translated into amino acids. The third section of the flat file is the sequence itself starting
with the label “ORIGIN”. The format of the sequence display can be changed. For DNA entries,
there is a base count report that includes the numbers of A, G, C and T in the sequence. This section
ends for all types of sequences with two forward slashes (//). In retrieving sequences from
GenBank, the search can be limited to the different fields described.
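Because every section of the flat file starts with a keyword, a record can be read with very little code. The following is a minimal sketch, not a complete GenBank parser: the record `TOY_RECORD` is invented for illustration, only a few header fields are extracted, and the LOCUS line is simply split on whitespace, which assumes the layout shown here.

```python
from collections import Counter

# Made-up record that imitates the layout described above.
TOY_RECORD = """\
LOCUS       TOY00001                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up record for illustration only.
ACCESSION   TOY00001
VERSION     TOY00001.1
FEATURES             Location/Qualifiers
     source          1..24
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def parse_genbank(text):
    """Extract a few header fields and the sequence from one flat-file record."""
    record = {"sequence": ""}
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # two forward slashes end the record
            break
        if in_sequence:
            # Drop the position numbers and spaces, keep only the letters.
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("LOCUS"):
            tokens = line.split()
            record["locus"] = tokens[1]
            record["length"] = int(tokens[2])
            record["molecule"] = tokens[4]
            record["division"] = tokens[-2]
            record["date"] = tokens[-1]
        elif line.startswith(("DEFINITION", "ACCESSION", "VERSION")):
            keyword, _, value = line.partition(" ")
            record[keyword.lower()] = value.strip()
        elif line.startswith("ORIGIN"):
            in_sequence = True
    return record


def base_count(sequence):
    """Count how often each base occurs, as in the base count report."""
    return Counter(sequence.lower())


rec = parse_genbank(TOY_RECORD)
print(rec["locus"], rec["length"], rec["division"], rec["version"])
print(dict(base_count(rec["sequence"])))
```

For real work an established parser is preferable, since actual records contain multi-line fields and many more feature types than this sketch handles.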
2.2.2
FastA format
>gi|1071819|pir||B54759 ba-type ubiquinol oxidase (EC 1.10.3.-) chain I Paracoccus denitrificans
MATFSNETTFLLGRLNWDAIPKEPIVWATFVVVAIGGIAALAALTKYRLWGWLWREWFTSVDHKKIGIMYIVLALIMFVR
GFADAIMMRLQQVWAFGGSEGYLNSHHYDQIFTAHGVIMIFFVAMPFITGLMNYVVPLQIGARDVSFPFLNNFSFWMTVG
GAVITMASLFLGEFAQTGWLAFPPLSGIGYSPWVGVDYYIWGLQVAGVGTTLSGINLLVTILKMRAPGMTMMRMPIFTWT
SFCANILIVASFPVLTMTLILLTLDRYVGTNFFTNDLGGNPMMYINLIWIWGHPEVYILILPLFGVFSEVTSTFSGKRLF
GYSSMVYATVCITVLSYLVWLHHFFTMGSGASVNSFFGITTMIISIPTGAKLFNWLFTMYRGRIRYELPMMWTIAFMLTF
VIGGMTGVLLAVPPADFVLHNSLFLIAHFHNVIIGGVLFGLFAAINFWWPKAFGFKLDVFWGKVSFWFWVVGFWAAFMPL
YILGLMGVTRRLRVFDDPDLRIWFAIAAFGAVLIACGIAAMFVQFGVSILRRDRPEYRDVSGDPWDGRTLEWATSSPPPA
YNFAFNPISHGLDTWWEMKQQGATRPTGGYMPIHMPKNTGTGVILAALATVCGMALVWYVWWLAALSFLGIIAVSIAHTF
NYNRDYYIPVSEIEATEDARTRQLAQGV
Fig. FASTA format: The FASTA format includes 3 parts: (1) a comment line identified by a “>”
character in the first column followed by the name and origin of the sequence; (2) the sequence in
standard 1 letter symbols and (3) optional * indicating the end of the sequence. The presence of the
* may be essential for some computer programs.
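The three parts of the FASTA format translate directly into a small reader. This is a minimal sketch: the function name `parse_fasta` and the example headers are assumptions, the first example sequence reuses the first 20 residues of the record shown above, and the optional trailing * is simply stripped.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new comment line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks).rstrip("*")
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks).rstrip("*")


EXAMPLE = """>seq1 shortened protein example
MATFSNETTF
LLGRLNWDAI*
>seq2 made-up DNA example
ATGGCC
"""

for name, sequence in parse_fasta(EXAMPLE):
    print(name, len(sequence))
```

Sequence lines are concatenated, so the line width used in the file does not matter to the reader.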
Two formats will be discussed as examples.
2.3. Information retrieval from biological databases
A popular retrieval system for biological databases is Entrez at NCBI, which provides access to
multiple databases for retrieval of integrated search results.
Entrez (www.ncbi.nlm.nih.gov/Entrez/) is the primary retrieval tool for molecular biologists. Just
as general web search engines index the web, the Entrez browser indexes and integrates information
from different NCBI databases: cross-references are made based on preexisting and logical
relationships between individual entries. Entrez integrates the scientific literature, DNA and protein
sequence databases, 3D protein structure and protein domain data, population study datasets,
expression data, assemblies of complete genomes, and taxonomic information into a tightly
interlinked system. It is a retrieval system designed for searching its linked databases (the Entrez
cross-database search page).
Using Entrez Global query, a search across all Entrez databases is performed by entering a simple
search term or phrase in the "Search across databases" query box. The results found in each
database are displayed on the Global Query page and clicking on the result number or its adjacent
database name will lead to the specific results. The Entrez linking scheme facilitates links within
and between databases.
Popular search strategies such as the Limits, Advanced Search, and Clipboard can be used within
most of the individual databases. “Limits” helps to restrict the search to a subset of a particular
database, to a particular database or a particular type of data. “Advanced Search” connects
different searches with the Boolean operators and uses a string of logically connected keywords to
perform a new search. The search can also be limited to a particular search field e.g. gene name,
accession number. The History option provides a record of the previous searches so that the user
can review, revise, or combine the results of earlier searches. There is also a “Send to” option,
where you can select “Clipboard”, which stores search results for later viewing for a limited time.
Alternatively, for a complex search, field tags can be used to improve the efficiency of obtaining
the search results. The tags are identifiers of each field and are placed in brackets. For example,
[Author] limits the search for author name, [Journal] for journal name, [Organism] for organism,
[Title] for publication title.
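Field tags and Boolean operators can also be combined programmatically. The sketch below only builds a query string and does not contact NCBI. The helper names are hypothetical, and the use of the NCBI E-utilities `esearch` address is an addition that is not part of these notes; it is included only to show how such a term could be passed on in a URL.

```python
from urllib.parse import urlencode


def tagged(value, field):
    """Attach a field tag in brackets, e.g. 'Homo sapiens[Organism]'."""
    return f"{value}[{field}]"


def combine(operator, *terms):
    """Join terms with a Boolean operator (AND, OR, NOT) inside parentheses."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported Boolean operator: {operator}")
    return "(" + f" {operator} ".join(terms) + ")"


def esearch_url(database, term):
    """Build (but do not send) a search URL for the given database and term."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return base + "?" + urlencode({"db": database, "term": term})


term = combine(
    "AND",
    tagged("Paracoccus denitrificans", "Organism"),
    tagged("ubiquinol oxidase", "Title"),
)
print(term)
print(esearch_url("protein", term))
```

The same term string can be typed directly into the Entrez search box; building it in code mainly helps when many similar searches have to be run.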
2.4. Sequence databases
2.4.1
Redundant databases: sequence repositories
NCBI offers several repositories, for instance for literature (PubMed), gene sequences (Nucleotide,
GenBank) and protein sequences (Protein). NCBI uses a relational model for its databases.
These databases are growing dramatically. There are approximately 200000 users
making 4 million database queries per day.
For instance, GenBank is the nucleotide sequence repository of NCBI. New sequences can be
submitted to GenBank through a submission page. GenBank also accepts submission of sequences with a
high error rate. For example, ESTs are first-pass sequences with an error rate as high as 1 in 100,
including incorrectly identified bases and insertions (GenBank database).
GenBank contains several repository databases, which can all be accessed via the Nucleotide
database results of Entrez:
dbEST (http://www.ncbi.nlm.nih.gov/dbEST/) is a division of GenBank that contains sequence
data of "single-pass" cDNA sequences, or "Expressed Sequence Tags (ESTs)", from a number of
organisms. ESTs are short, unedited, single pass sequence reads derived from randomly selected
complementary DNA (cDNA) libraries, which are built by reverse transcribing extracted mRNA.
ESTs can be rapidly generated from either the 5' or 3' end of a cDNA clone in a high-throughput
manner from a particular cell, tissue or organism of interest at a low cost to get a quick insight into
transcriptionally active regions. dbEST contains a great deal of redundancy. ESTs were first used to
construct expression maps of the human genome, then to assess the gene coverage from EST
sequencing alone and to develop and map gene-based site markers.
Now, more and more, RNA-seq data will be used for these applications.
dbGSS is also a division of GenBank that stores unannotated short “single-pass” primarily genomic
sequences, including random survey sequences, clone-end sequences and exon-trapped sequences.
Genome Survey Sequences (GSS) are nucleotide sequences similar to ESTs, with the exception
that most of them are genomic in origin, rather than mRNA. GSS are typically generated and
submitted to NCBI by labs performing genome sequencing and are used, amongst other things, as a
framework for the mapping and sequencing of genome size pieces.
The Sequence Read Archive (SRA) is an INSDC repository for raw data from sequencing projects
that use the new massively parallel sequencing technologies, often called next-generation
sequencing. The SRA is accessible at http://www.ncbi.nlm.nih.gov/Traces/sra from NCBI, at
http://www.ebi.ac.uk/ena from EBI and at http://trace.ddbj.nig.ac.jp from DDBJ. SRA data include
short read sequences from sequencing of new genomes, re-sequencing of targeted genomic regions,
sequencing complete genomes of multiple individuals to mine for variations, transcriptome
sequencing to sample splice variants and expression levels, environmental samples and other
metagenome sequencing, and chromatin DNA binding protein analysis. The SRA not only provides
a place where researchers can archive their short read data, but also enables them to quickly access
known data and their associated experimental descriptions (metadata) with pin-point accuracy. The
preservation of experimental data is an important part of the scientific record, and increasing
numbers of journals and funding agencies require that next-generation sequence data are deposited
into the SRA. In September 2010, the SRA contained >500 billion reads consisting of 60 trillion
base pairs. Almost 80% of the sequencing data were derived from the Illumina GA platform. The
SOLiD™ and Roche/454 platforms account for 15% and 5% of submitted base pairs, respectively.
A problem with the general sequence databanks is that they are growing in size and often contain
redundant entries as well as fragments of varying length, so that it becomes difficult to find the
sequence you need for your research. For a particular gene, many independent redundant
records might thus exist in GenBank. Such redundant GenBank entries may, for example, each represent
a distinct indication on the transcript of a gene (see exercises).
2.4.2
From redundant databases to comprehensive databases
To go from a redundant database to a comprehensive database, curation is required. This curation
can be partially automated, as is the case in UniGene or Ensembl. More extensively curated databases
make use of manual curation by experts (VEGA, SwissProt).
2.4.2.1
UniGene
UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a
unique gene, as well as related information such as the tissue types in which the gene has been
expressed and map location. These clusters represent the same gene based on the alignment of EST
sequences with each other and with the genome sequences of the organism. As more overlapping
sequences are added the number of clusters for an organism decreases.
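Why the number of clusters decreases as overlapping sequences are added can be shown with a toy clustering. This is not the UniGene procedure: the sequence names are invented, the overlaps are given as input rather than computed by alignment, and single-linkage grouping with a union-find structure is used only as a simple stand-in.

```python
def cluster(sequence_ids, overlaps):
    """Group sequences so that any chain of pairwise overlaps ends up in one cluster."""
    parent = {seq: seq for seq in sequence_ids}

    def find(item):
        # Follow parent links to the representative, shortening the path on the way.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for left, right in overlaps:
        parent[find(left)] = find(right)

    clusters = {}
    for seq in sequence_ids:
        clusters.setdefault(find(seq), set()).add(seq)
    return list(clusters.values())


ids = ["est1", "est2", "est3", "est4"]
overlaps = [("est1", "est2"), ("est3", "est4")]
print(len(cluster(ids, overlaps)))

# A new sequence that overlaps members of both clusters merges them into one.
ids_after = ids + ["est5"]
overlaps_after = overlaps + [("est2", "est5"), ("est5", "est3")]
print(len(cluster(ids_after, overlaps_after)))
```

Before est5 is added, the two pairs look like two separate genes; the bridging sequence shows that they belong to the same cluster.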
In addition to sequences of well-characterized genes, hundreds of thousands of novel expressed
sequence tag (EST) sequences have been included. Consequently, the collection may be of use to
the community as a resource for gene discovery.
It should also be noted that no attempt has been made to produce contigs or consensus sequences.
There are several reasons why the sequences of a set may not actually form a single contig. For
example, all of the splicing variants for a gene are put into the same set. Moreover, EST-containing
sets often contain 5' and 3' reads from the same cDNA clone, but these sequences do not always
overlap.
Currently, sequences from the animals human, rat, mouse, cow, zebrafish, clawed frog, fruit fly and
mosquito have been processed. The plant organisms are wheat, rice, barley, maize and cress. These
species were chosen because they have the greatest amounts of EST data available and represent a
variety of species. Additional organisms may be added in the future.
2.4.2.2
RefSeq database of NCBI
The Reference Sequence (RefSeq) collection aims at providing a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for
major research organisms. The RefSeq database is the counterpart of the Ensembl database. For
a particular gene many independent redundant records might exist in the nucleotide database of
GenBank (entries for the coding sequence, for ESTs…). All this information is integrated as such
that for a particular locus in the genome a complete description is given that is no longer redundant:
the GeneID. As indicated in the scheme below, redundant GenBank entries, e.g. those representing distinct
indications on the transcript of a gene (incomplete cDNA sequences, ESTs), are unified into a single
RefSeq that represents the complete transcript. A RefSeq sequence can be a protein (starting with
NP_) or a genomic sequence (starting with NG_). All RefSeq sequences that belong to the
same locus on the genome receive the same locus link. Additional links to other interesting
databases containing additional functional annotation or information are made (e.g. to Gene
Ontology, KEGG,…).
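Because the prefix of a RefSeq accession indicates the type of molecule, a record can be classified from its identifier alone. In the sketch below only NP_ and NG_ come from the text; the NM_ and NR_ entries are added for illustration, the table is not a complete list of RefSeq prefixes, and the accessions used are placeholders rather than real records.

```python
REFSEQ_PREFIXES = {
    "NP_": "protein",                    # mentioned in the text
    "NG_": "genomic region",             # mentioned in the text
    "NM_": "mRNA transcript",            # not in the text, added for illustration
    "NR_": "non-coding RNA transcript",  # not in the text, added for illustration
}


def refseq_type(accession):
    """Return the molecule type implied by the accession prefix, or 'unknown'."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


# Placeholder accessions, used only to exercise the lookup.
for accession in ["NP_000000", "NG_000000", "NM_000000", "AB000000"]:
    print(accession, refseq_type(accession))
```

An accession without a known prefix, such as an ordinary GenBank accession, is reported as unknown rather than guessed.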
2.4.2.3
The Ensembl database
A nice example of a database that combines all the available information in an attempt to reliably
annotate vertebrate genomes is the Ensembl database. Ensembl consists of a carefully crafted
database that takes into account a complete body of information and includes essential cross
references to other important databases. The Ensembl database is an effort of the EBI (European
Bioinformatics Institute) and is the European counterpart of Gene at NCBI.
Genes are annotated by the Ensembl automatic analysis pipeline (see figures below)
• using either a GeneWise model from a human/vertebrate protein, a set of aligned human
cDNAs followed by GenomeWise for ORF prediction
• or from Genscan exons supported by protein, cDNA and EST evidence. GeneWise models
are further combined with available aligned cDNAs to annotate UTRs.
As this automated process is fast and not tedious it is repeated each time a better draft of a genome
becomes available. This results in frequent new releases of a genome annotation in Ensembl.
After the automated curation process, each chromosome is subjected to a manual curation process.
This process is much slower and only performed if the genomic sequence release is stable and does
not need any updates anymore. Results of this manual curation process can be found in the database
VEGA.
ENSEMBL data model and automated curation process
[Figure not reproduced. Labels in the diagram: human protein (Swiss Prot), Genewise; other proteins, Blast; cDNA, exonerate; EST, exonerate; add UTR; ab initio gene prediction, GeneScan; cluster merge; add variants; EST genes; genes.]
Automatic pipeline of annotation used by Ensembl. Manual curation is performed by the Havana
group (Sanger Centre, http://www.sanger.ac.uk/HGP/havana/) and released in the VEGA database.
2.1.5.1
Microarray databases:
2.1.5.2
SAGE Database:
2.1.5.3
EST based expression analysis
2.1.5.4
Proteome database
2.1.6.1
KEGG database
2.1.6.2
ECOcyc database
2.1.6.3
Gene ontology