Databases Bioinformatics (Bachelor)
OVERVIEW OF DATABASES
Adapted 17/09/2013
Kathleen Marchal, updated 17/09/13

Overview of databases
2.1. Need for integration and standardisation
2.1.1 What is a database?
2.1.2 Types of databases
2.1.3 Purposes of biological databases
2.2. Sequence formats
2.2.1 Genbank format
2.2.2 FastA format
2.3. Information retrieval from biological databases
2.4. Sequence databases
2.4.1 Redundant databases: sequence repositories
2.4.2 From redundant databases to comprehensive databases

2.1. Need for integration and standardisation

Considerable amounts of data are collected from genome analysis, protein analysis, microarray experiments and similar studies. Keeping track of so much data is beyond the capacity of most laboratories.
Data are often kept in spreadsheets (e.g. Excel) because these are easy to use. However, such data should be organized into a suitable database.

2.1.1 What is a database?

A database is a computerized archive used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria. Databases are composed of computer hardware and software for data management. The chief objective in developing a database is to organize data in a set of structured records to enable easy retrieval of information. Each record, also called an entry, should contain a number of fields that hold the actual data items, for example fields for names, telephone numbers, addresses and dates. To retrieve a particular record from the database, a user specifies a particular piece of information, called a value, to be found in a particular field and expects the computer to retrieve the whole data record. This process is called making a query.

Although data retrieval is the main purpose of all databases, biological databases often have a higher-level requirement, known as knowledge discovery: the identification of connections between pieces of information that were not known when the information was first entered. For example, databases containing raw sequence information can perform extra computational tasks to identify sequence homology or conserved motifs. These features facilitate the discovery of new biological insights from the data.

2.1.2 Types of databases

The architecture of a database refers to the manner in which its entries are organized for archiving and easy retrieval. The first databases were flat file databases: long text files that contain many entries separated by a delimiter (e.g. |), with the fields within each entry separated by tabs or commas (Fig. 3.1a).
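The flat-file layout and the notion of a query described above can be sketched in a few lines of Python. This is only an illustration: the three records, the field names and the helper functions `parse_flat_file` and `query` are made up for this example and do not come from any real database.

```python
# Toy flat-file "database": entries separated by "|", fields separated by commas.
# The records and field names below are hypothetical.
FLAT_FILE = (
    "P1,kinase,Homo sapiens|"
    "P2,oxidase,Paracoccus denitrificans|"
    "P3,kinase,Mus musculus"
)
FIELDS = ("id", "function", "organism")


def parse_flat_file(text, entry_sep="|", field_sep=","):
    """Split the text into entries, then each entry into named fields."""
    records = []
    for entry in text.split(entry_sep):
        entry = entry.strip()
        if not entry:
            continue  # skip empty entries, e.g. after a trailing delimiter
        values = [value.strip() for value in entry.split(field_sep)]
        records.append(dict(zip(FIELDS, values)))
    return records


def query(records, field, value):
    """Return every record whose given field equals the given value.

    Every record is inspected: a flat file offers no index to shortcut the scan.
    """
    return [record for record in records if record.get(field) == value]


records = parse_flat_file(FLAT_FILE)
kinases = query(records, "function", "kinase")
print([record["id"] for record in kinases])
```

The query has to visit every record, which is exactly the linear scan that makes large flat files slow to search.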
Apart from the raw values in each field, the text file contains no hidden instructions that would let a computer search for specific information or create reports based on certain fields of each record. The text file can be considered a single table. Thus, to search a flat file for a particular piece of information, a computer has to read through the entire file, an obviously inefficient process. This is manageable for a small database, but as the database grows or the data types become more complex, information retrieval from this style of database becomes very difficult. Indeed, searches through such files can crash the entire computer system because the operation is memory-intensive.

Therefore, database management systems were developed that contain not only raw data records but also operational instructions that help identify hidden connections among data records. Depending on the type of data structure, these database management systems are classified as relational databases or object-oriented databases.

Relational databases use a set of tables to organize the data (Fig. below). Each table, also called a relation, is made up of columns and rows. Columns represent individual fields. Rows represent values in the fields of records. The columns in a table are indexed according to a common feature called an attribute, so they can be cross-referenced in other tables. To execute a query in a relational database, the system selects linked data items from different tables and combines the information into one report. Information can therefore be found more quickly. Relational databases can be created using a special programming language called structured query language (SQL). Creating this type of database can take a great deal of planning during the design phase.
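As a minimal sketch of the relational model, the example below uses Python's built-in `sqlite3` module to create two linked tables and run one query that combines them. The table names, column names and rows are assumptions chosen for illustration; they do not reflect the schema of any actual biological database.

```python
import sqlite3

# In-memory database with two tables linked through organism_id.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE organism (
        organism_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE protein (
        protein_id  INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        organism_id INTEGER NOT NULL REFERENCES organism(organism_id)
    );
    INSERT INTO organism VALUES (1, 'Homo sapiens'), (2, 'Paracoccus denitrificans');
    INSERT INTO protein VALUES
        (10, 'ubiquinol oxidase chain I', 2),
        (11, 'hexokinase', 1),
        (12, 'insulin', 1);
    """
)

# One query combines linked rows from both tables into a single report.
rows = conn.execute(
    """
    SELECT protein.name, organism.name
    FROM protein
    JOIN organism ON protein.organism_id = organism.organism_id
    WHERE organism.name = ?
    ORDER BY protein.name
    """,
    ("Homo sapiens",),
).fetchall()
print(rows)
conn.close()
```

The shared `organism_id` column is the cross-reference between the tables: the organism name is stored once and reused by every protein row that points to it.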
After the original database has been created, a new data category can easily be added without requiring all existing tables to be modified. Subsequent database searching and data gathering for reports are relatively straightforward.

Object-oriented databases store data as objects and can describe complex hierarchical relationships between data items (Fig. below). In an object-oriented programming language, an object can be considered a unit that combines data with the routines that act on the data. The database is structured such that the objects are linked by a set of pointers defining predetermined relationships between the objects. Searching the database involves navigating through the objects with the aid of the pointers that link them. Programming languages like C++ and Java are used to create object-oriented databases. This type of database is less widely used. It provides high performance for companies with extensive amounts of highly complex data.

The object-oriented database is more flexible in expressing data relations than the relational database. Data can be structured based on hierarchical relationships, which simplifies programming tasks for data that are known to have complex relationships. However, the object-oriented database lacks the rigorous mathematical foundation of the relational database. For example, adding a new kind of data is not easy, because the whole database structure may need to be updated. There is also a risk that some of the relationships between objects are misrepresented. Some current databases have therefore incorporated features of both types of database programming, creating the object-relational database.

Current biological databases use all three types of database structure: flat files, relational databases and object-oriented databases.
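The idea of objects linked by pointers and searched by navigation can be sketched as follows. This is a toy in-memory structure, not an object-oriented database system; the `Node` class and the object names (Proteins, Protein 1, Modifications, pTyr) are illustrative assumptions.

```python
class Node:
    """An object that holds data plus pointers to its parent and children."""

    def __init__(self, name, data=None):
        self.name = name
        self.data = data
        self.parent = None
        self.children = {}  # child name -> child object (the "pointers")

    def add(self, child):
        """Link a child object under this one and return the child."""
        child.parent = self
        self.children[child.name] = child
        return child

    def path(self):
        """Address of this object: the whole hierarchy leading up to it."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))

    def find(self, relative_path):
        """Navigate from this object by following child pointers step by step."""
        node = self
        for part in relative_path.split("/"):
            node = node.children[part]
        return node


root = Node("root")
proteins = root.add(Node("Proteins"))
protein1 = proteins.add(Node("Protein 1"))
modifications = protein1.add(Node("Modifications"))
ptyr = modifications.add(Node("pTyr", data={"residue": "Y"}))

print(ptyr.path())
print(root.find("Proteins/Protein 1/Modifications/pTyr").data)
```

A lookup follows one pointer per level of the hierarchy, which is why the relationships have to be fixed in advance when the structure is designed.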
2.1.3 Purposes of biological databases

Repository databases contain large redundant datasets that include both high and low quality data. The aim of these databases is to bring biologists to the cutting edge by providing essential information without any extras. Examples include databases containing uncurated genomic sequence or single-pass sequence data (EST database).

Curated databases contain curated information in which a user would expect to find annotated entries. Although the organization of information and its proper implementation are clearly important, the elimination of wrong information and the addition of informative features are equally important for the growth of biological databases. This can be achieved by careful annotation, either automatically (Ensembl, Gene) or manually (VEGA, SwissProt). Databases such as SwissProt and VEGA have put considerable effort into having expert biologists annotate information about genes and proteins. Although manual annotation is time-consuming, its benefits to the community far outweigh the labor and expense involved.

Comprehensive databases are the extreme case of annotated databases. They not only provide clean data, but also try to give as complete an overview as possible of all the information on a specific topic. Ensembl and Gene are examples of comprehensive sequence databases.

Fig. 2. The schematic illustrates the basic design of relational and object-oriented databases. (From Navarro et al., 2003) (a) In a relational database model, the data are stored in tables (Tables 1, 3 and 4) and relationships can be defined on a many-to-one basis or a many-to-many basis. The latter type requires a separate relationship table linking the two tables (Table 2). All records in a given table must have identical features, and any unique features have to be stored in a separate table.
This often requires the creation of a large number of tables. (b) In an object-oriented database, relationships between data records (objects) are primarily of a parent-child type (in the illustration, root is the parent of Proteins, and Proteins is the parent of Protein 1 and Protein 2). Whereas in a relational system a record is addressed by its record identifier and its containing table, in an object system a record is identified by the entire hierarchy leading up to the object (for example, the pTyr record in the example is addressed as root/Proteins/Protein 1/Modifications/pTyr). Because each record in an object database is a code object, it can change the data it contains based on current conditions, which makes it easier to generate data dynamically. The ‘catalog’ object, for example, can automatically update the contents of its indexes based on the contents of the rest of the database.

2.2. Sequence formats

Once sequences have been determined, they are made publicly available in databases. A sequence is usually submitted at the time the sequence is published in a journal article. Most sequence databases are relational databases. The sequence repositories also make it possible to retrieve the sequence together with a lot of biological information.

Genes and proteins can have different identifiers, depending on the organism, but also on the database in which the sequence is stored and on the type of sequence. The conversion between identifiers, from gene to protein or from one database to another, is called ID mapping.

Two important sequence databases exist: Gene (National Center for Biotechnology Information, NCBI) and Ensembl (European Bioinformatics Institute, EBI). Both Ensembl and Gene (NCBI) will be used during the exercises.

Before a sequence file is used in a sequence analysis program, it is important that the program can recognize the sequence.
Therefore, special formats have been invented. Different computer programs use different formats, and sometimes it is essential to convert one format into another. Converter programs exist for this purpose, e.g. READSEQ (http://www.ebi.ac.uk/cgi-bin/readseq.cgi). Most computer programs accept ASCII files (i.e. files saved as text only). Two major formats can be distinguished:

2.2.1 Genbank format

Although GenBank is a relational database, the search output for sequence files is produced as flat files for easy reading. The resulting flat file, the GenBank sequence format (Fig. above), contains three sections: a header, features and a sequence. Each section contains many fields with unique identifiers for easy indexing by computer software.

The top line of the header section is the LOCUS line, which contains a unique database identifier for a sequence location in the database (not a chromosome locus). The identifier is followed by the sequence length, the molecule type and a three-letter code for the GenBank division. Next to the division is the date on which the record was made public. The following line, “DEFINITION”, provides summary information for the sequence record, including the name of the sequence, the name and taxonomy of the source organism if known, and whether the sequence is complete or partial. This is followed by “ACCESSION” and “VERSION”, the latter of which also includes the GI.

At NCBI, each sequence record has two sequence identification numbers, a GenInfo Identifier (GI) and an Accession (AC). The GenInfo Identifier is a unique NCBI nucleotide sequence identification number. If a nucleotide sequence changes in any way, a new GI number is assigned (see Display Settings: Revision History). A separate GI number is also assigned to each protein translation within a nucleotide sequence record, and a new GI is assigned if the protein translation changes in any way.
These identifiers appear as qualifiers for CDS features in the features portion of a GenBank entry.

An accession number is a unique identifier given to a sequence when it is submitted to one of the DNA repositories (GenBank, EMBL-EBI, DDBJ). It remains constant over the lifetime of the record, even when the sequence or the annotation changes. The accession number is also shared between the three databases (GenBank, DDBJ, EMBL). The initial deposition of a sequence record is referred to as version 1. If the sequence is updated, the version number is incremented, but the accession number remains constant (see Display Settings: Revision History). The accession.version system of sequence identifiers runs parallel to the GI number system: when any change is made to a sequence, it receives a new GI number and its version number is increased. The accession number always retrieves the most recent version of the record; older versions remain available under their old accession.version identifiers and their original GI numbers.

The next line in the header section is the “ORGANISM” field, which gives the source organism with the scientific name of the species, its taxonomic classification and sometimes the tissue type. The “REFERENCE” field provides the publication citation(s) related to the sequence entry.

The features table includes annotation information about the gene and the gene product, as well as regions of biological significance reported in the sequence, with identifiers and qualifiers. The “source” field provides the length of the sequence, the scientific name of the organism and the taxonomy identification number. Optional information includes the clone source, the tissue type and the cell line. The “gene” field contains information about the nucleotide coding sequence and its name.
For DNA entries, the “CDS” field gives the boundaries of the sequence that can be translated into amino acids.

The third section of the flat file is the sequence itself, starting with the label “ORIGIN”. The format of the sequence display can be changed. For DNA entries, there is a base count report that gives the numbers of A, G, C and T in the sequence. For all types of sequences, this section ends with two forward slashes (//).

When retrieving sequences from GenBank, the search can be limited to the different fields described.

2.2.2 FastA format

>gi|1071819|pir||B54759 ba-type ubiquinol oxidase (EC 1.10.3.-) chain I Paracoccus denitrificans
MATFSNETTFLLGRLNWDAIPKEPIVWATFVVVAIGGIAALAALTKYRLWGWLWREWFTSVDHKKIGIMYIVLALIMFVR
GFADAIMMRLQQVWAFGGSEGYLNSHHYDQIFTAHGVIMIFFVAMPFITGLMNYVVPLQIGARDVSFPFLNNFSFWMTVG
GAVITMASLFLGEFAQTGWLAFPPLSGIGYSPWVGVDYYIWGLQVAGVGTTLSGINLLVTILKMRAPGMTMMRMPIFTWT
SFCANILIVASFPVLTMTLILLTLDRYVGTNFFTNDLGGNPMMYINLIWIWGHPEVYILILPLFGVFSEVTSTFSGKRLF
GYSSMVYATVCITVLSYLVWLHHFFTMGSGASVNSFFGITTMIISIPTGAKLFNWLFTMYRGRIRYELPMMWTIAFMLTF
VIGGMTGVLLAVPPADFVLHNSLFLIAHFHNVIIGGVLFGLFAAINFWWPKAFGFKLDVFWGKVSFWFWVVGFWAAFMPL
YILGLMGVTRRLRVFDDPDLRIWFAIAAFGAVLIACGIAAMFVQFGVSILRRDRPEYRDVSGDPWDGRTLEWATSSPPPA
YNFAFNPISHGLDTWWEMKQQGATRPTGGYMPIHMPKNTGTGVILAALATVCGMALVWYVWWLAALSFLGIIAVSIAHTF
NYNRDYYIPVSEIEATEDARTRQLAQGV

Fig. FASTA format: The FASTA format includes three parts: (1) a comment line identified by a “>” character in the first column, followed by the name and origin of the sequence; (2) the sequence in standard one-letter symbols; and (3) an optional * indicating the end of the sequence. The presence of the * may be essential for some computer programs.

GenBank and FASTA are the two formats discussed here as examples.
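The three parts of a FASTA record can be read with a short parser such as the sketch below. It is a simplified illustration rather than a reference implementation: the function name `parse_fasta` and the two short example records are made up, and real files may need extra handling (for instance for comment conventions this sketch ignores).

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text.

    The header is the comment line without its leading ">". Sequence lines
    are concatenated, and an optional trailing "*" end marker is removed.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks).rstrip("*")))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks).rstrip("*")))
    return records


# Hypothetical input: two short records, the first ending with the optional "*".
example = """>seq1 hypothetical protein fragment
MATFSNETTFLLGRLNWDAI
PKEPIVWATF*
>seq2 another made-up record
MKV
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

Stripping the `*` on input keeps the sequence itself clean; a program that requires the end marker can add it back when writing the file.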
2.3. Information retrieval from biological databases

A popular retrieval system for biological databases is Entrez at NCBI, which provides access to multiple databases and returns integrated search results. Entrez (www.ncbi.nlm.nih.gov/Entrez/) is the primary retrieval tool for molecular biologists. Just as general web search engines index the web, the Entrez browser indexes and integrates information from different NCBI databases: cross-references are made based on preexisting and logical relationships between individual entries. Entrez integrates the scientific literature, DNA and protein sequence databases, 3D protein structure and protein domain data, population study datasets, expression data, assemblies of complete genomes and taxonomic information into a tightly interlinked system. It is a retrieval system designed for searching its linked databases (the Entrez cross-database search page).

With the Entrez Global Query, a search across all Entrez databases is performed by entering a simple search term or phrase in the "Search across databases" query box. The results found in each database are displayed on the Global Query page, and clicking on a result number or its adjacent database name leads to the specific results. The Entrez linking scheme provides links within and between databases.

Popular search tools such as Limits, Advanced Search and Clipboard can be used within most of the individual databases. “Limits” restricts the search to a subset of a particular database, to a particular database or to a particular type of data. “Advanced Search” connects different searches with Boolean operators and uses a string of logically connected keywords to perform a new search. The search can also be limited to a particular search field, e.g. gene name or accession number. The History option provides a record of previous searches so that the user can review, revise or combine the results of earlier searches.
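Combining keywords with Boolean operators and restricting each keyword to a search field can be sketched as simple string building. The helper `build_term` is hypothetical. The NCBI E-utilities ESearch address used to wrap the term in a URL is not mentioned in these notes and is included only as an assumed way of submitting such a query programmatically; the code builds the URL but does not contact NCBI.

```python
from urllib.parse import urlencode


def build_term(clauses, operator="AND"):
    """Join (value, field) pairs into one query string with field tags.

    Each value is followed by its field name in square brackets, and the
    clauses are connected with a Boolean operator (AND, OR or NOT).
    """
    parts = [f"{value}[{field}]" for value, field in clauses]
    return f" {operator} ".join(parts)


# Hypothetical search: protein records from one organism with a title keyword.
term = build_term(
    [("Paracoccus denitrificans", "Organism"), ("ubiquinol oxidase", "Title")]
)

# Assumption: the E-utilities ESearch endpoint with its "db" and "term" parameters.
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + urlencode(
    {"db": "protein", "term": term}
)

print(term)
print(url)
```

The same term string can be pasted directly into the Entrez search box, so the sketch also shows what an Advanced Search produces behind the form.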
There is also a “Send to” option, where you can select “Clipboard”, which stores search results for later viewing for a limited time. Alternatively, for a complex search, field tags can be used to obtain the search results more efficiently. The tags are identifiers of each field and are placed in brackets. For example, [Author] limits the search to the author name, [Journal] to the journal name, [Organism] to the organism and [Title] to the publication title.

2.4. Sequence databases

2.4.1 Redundant databases: sequence repositories

NCBI offers several repositories, for instance for literature (PubMed), gene sequences (Nucleotide, GenBank) and protein sequences (Protein). NCBI uses a relational model for its databases. These databases are growing dramatically. Approximately 200,000 users make 4 million database queries per day.

GenBank is the nucleotide sequence repository of NCBI. New sequences can be submitted to GenBank through a submission page. GenBank also accepts sequences with a high error rate. ESTs, for example, represent first-pass sequences with an error rate as high as 1 in 100, including incorrectly identified bases and insertions (GenBank database).

GenBank contains several repository databases, which can all be accessed via the Nucleotide database results of Entrez.

dbEST (http://www.ncbi.nlm.nih.gov/dbEST/) is a division of GenBank that contains sequence data of "single-pass" cDNA sequences, or "Expressed Sequence Tags (ESTs)", from a number of organisms. ESTs are short, unedited, single-pass sequence reads derived from randomly selected complementary DNA (cDNA) libraries, which are built by reverse transcribing extracted mRNA.
ESTs can be generated rapidly, in a high-throughput manner and at low cost, from either the 5' or the 3' end of a cDNA clone from a particular cell, tissue or organism of interest, to get quick insight into transcriptionally active regions. dbEST contains a great deal of redundancy. ESTs were first used to construct expression maps of the human genome, then to assess the gene coverage obtainable from EST sequencing alone and to develop and map gene-based site markers. RNA-seq data are now increasingly used for these applications.

dbGSS is also a division of GenBank. It stores unannotated, short, “single-pass”, primarily genomic sequences, including random survey sequences, clone-end sequences and exon-trapped sequences. Genome Survey Sequences (GSS) are nucleotide sequences similar to ESTs, except that most of them are genomic in origin rather than derived from mRNA. GSS are typically generated and submitted to NCBI by labs performing genome sequencing and are used, amongst other things, as a framework for the mapping and sequencing of genome-size pieces.

The Sequence Read Archive (SRA) is an INSDC repository for raw data from sequencing projects that use the new massively parallel sequencing technologies, often called next-generation sequencing. The SRA is accessible at http://www.ncbi.nlm.nih.gov/Traces/sra from NCBI, at http://www.ebi.ac.uk/ena from EBI and at http://trace.ddbj.nig.ac.jp from DDBJ. SRA data include short read sequences from the sequencing of new genomes, re-sequencing of targeted genomic regions, sequencing of complete genomes of multiple individuals to mine for variations, transcriptome sequencing to sample splice variants and expression levels, environmental samples and other metagenome sequencing, and chromatin DNA binding protein analysis.
The SRA not only provides a place where researchers can archive their short read data, but also enables them to quickly and precisely access known data and the associated experimental descriptions (metadata). The preservation of experimental data is an important part of the scientific record, and increasing numbers of journals and funding agencies require that next-generation sequence data be deposited in the SRA. In September 2010, the SRA contained >500 billion reads consisting of 60 trillion base pairs. Almost 80% of the sequencing data were derived from the Illumina GA platform. The SOLiD™ and Roche/454 platforms account for 15% and 5% of submitted base pairs, respectively.

A problem with the general sequence databanks is that they keep growing and often contain redundant entries as well as fragments of varying length, so that it becomes difficult to find the sequence you need for your research. For a particular gene, many independent redundant records may therefore exist in GenBank. Such redundant GenBank entries may, for example, represent distinct pieces of evidence for the transcript of a gene (see exercises).

2.4.2 From redundant databases to comprehensive databases

Going from a redundant database to a comprehensive database requires curation. This curation can be partially automated, as is the case in UniGene or Ensembl. Extensively curated databases rely on manual curation by experts (VEGA, SwissProt).

2.4.2.1 Unigene

UniGene is an experimental system for automatically partitioning GenBank sequences into a nonredundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene is expressed and its map location. The sequences in a cluster are taken to represent the same gene based on the alignment of EST sequences with each other and with the genome sequences of the organism.
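The principle of grouping overlapping sequences into gene-oriented clusters can be illustrated with a toy example. This is not the UniGene procedure: real clustering relies on sequence alignment, whereas the sketch below simply links two reads when they share an exact substring of length k, and merges linked reads with a union-find structure. The reads, the function names and the value of k are all assumptions made for the illustration.

```python
def shared_kmer(a, b, k):
    """True if sequences a and b share at least one exact substring of length k."""
    kmers = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in kmers for i in range(len(b) - k + 1))


def cluster(seqs, k=6):
    """Group sequence indices into clusters of directly or indirectly linked reads."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if shared_kmer(seqs[i], seqs[j], k):
                parent[find(j)] = find(i)  # merge the two clusters

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical reads: the first two overlap, the third is unrelated.
reads = ["ACGTTGCAAGGCTA", "AAGGCTATTCCGGA", "TTTTTTCCCCCCGG"]
print(cluster(reads))

# A fourth read that overlaps both groups merges them into one cluster.
print(cluster(reads + ["CCCCGGACGTTG"]))
```

Adding a single bridging read reduces the number of clusters, because reads that never overlap each other directly still end up in the same group once a chain of overlaps connects them.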
As more overlapping sequences are added, the number of clusters for an organism decreases. In addition to sequences of well-characterized genes, hundreds of thousands of novel expressed sequence tag (EST) sequences have been included. Consequently, the collection may be of use to the community as a resource for gene discovery. It should also be noted that no attempt has been made to produce contigs or consensus sequences. There are several reasons why the sequences of a set may not actually form a single contig. For example, all of the splicing variants of a gene are put into the same set. Moreover, EST-containing sets often contain 5' and 3' reads from the same cDNA clone, but these sequences do not always overlap.

Currently, sequences from the animals human, rat, mouse, cow, zebrafish, clawed frog, fruitfly and mosquito have been processed. The plant organisms are wheat, rice, barley, maize and cress. These species were chosen because they have the greatest amounts of EST data available and represent a variety of species. Additional organisms may be added in the future.

2.4.2.2 RefSeq database of NCBI

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, nonredundant set of sequences, including genomic DNA, transcript (RNA) and protein products, for major research organisms. The RefSeq database is the counterpart of the Ensembl database.

For a particular gene, many independent redundant records may exist in the nucleotide database of GenBank (entries for the coding sequence, for ESTs, and so on). All this information is integrated so that for a particular locus in the genome a complete description is given that is no longer redundant: the GeneID. As indicated in the scheme below, redundant GenBank entries, e.g. entries representing distinct pieces of evidence for the transcript of a gene (incomplete cDNA sequences, ESTs), are unified into a single RefSeq record that represents the complete transcript. A RefSeq sequence can be, for example, a protein (accession starting with NP_) or a genomic sequence (accession starting with NG_). All RefSeq sequences that belong to the same locus on the genome receive the same locus link. Additional links are made to other databases containing functional annotation or further information (e.g. to Gene Ontology, KEGG).

2.4.2.3 The Ensembl database

A nice example of a database that combines all the available information in an attempt to reliably annotate vertebrate genomes is the Ensembl database. Ensembl is a carefully crafted database that takes into account a complete body of information and includes essential cross-references to other important databases. The Ensembl database is an effort of the EBI (European Bioinformatics Institute) and is the European counterpart of Gene at NCBI.

Genes are annotated by the Ensembl automatic analysis pipeline (see figures below) using either a GeneWise model from a human/vertebrate protein, a set of aligned human cDNAs followed by GenomeWise for ORF prediction, or Genscan exons supported by protein, cDNA and EST evidence. GeneWise models are further combined with available aligned cDNAs to annotate UTRs. As this automated process is fast and not tedious, it is repeated each time a better draft of a genome becomes available. This results in frequent new releases of a genome annotation in Ensembl.

After the automated curation process, each chromosome is subjected to a manual curation process. This process is much slower and is only performed once the genomic sequence release is stable and no longer needs updates. The results of this manual curation process can be found in the VEGA database.
[Figure: ENSEMBL data model and automated curation process. Pipeline elements shown: human proteins (Swiss-Prot) aligned with Genewise; other proteins (Blast); cDNA (exonerate); EST (exonerate); add UTR; ab initio gene prediction (GeneScan); cluster and merge steps; add variants; outputs: genes and EST genes.]

Fig. Automatic pipeline of annotation used by Ensembl. Manual curation is performed by the Havana group (Sanger Center, http://www.sanger.ac.uk/HGP/havana/) and released in the VEGA database.

2.1.5.1 Microarray databases
2.1.5.2 SAGE Database
2.1.5.3 EST based expression analysis
2.1.5.4 Proteome database
2.1.6.1 KEGG database
2.1.6.2 ECOcyc database
2.1.6.3 Gene ontology