BL2165 Practical

Biological Databases
By : Lim Yun Ping
E mail : [email protected]
National University of Singapore
Overview
• Introduction
• What is a database
• What type of databases can we access
• What roles do they play
• What type of information can we get from them
• How do we access this information
What is a database ?
• Convenient method of storing vast amounts of
information
• Allows for proper storing, searching &
retrieving of data.
• Before analyzing them we need to assemble
them into central, shareable resources
Why databases ?
• Means to handle and share large volumes of
biological data
• Support large-scale analysis efforts
• Make data access easy and updated
• Link knowledge obtained from various
fields of biology and medicine
Different Database Types
• depends on the nature of information stored
(sequences, 2D gel or 3D structure images)
• manner of storage (flat files, tables in a relational
database, etc)
• In this course we are concerned more about the
different types of databases rather than the
particular storage
Features
• Most of the databases have a web-interface to
search for data
• Common mode to search is by Keywords
• Users can choose to view the data or save it to
their computer
• Cross-references help to navigate from one
database to another easily
Biological Databases
Type of database                                     Information they contain
Bibliographic databases                              Literature
Taxonomic databases                                  Classification
Nucleic acid databases                               DNA information
Genomic databases                                    Gene level information
Protein databases                                    Protein information
Protein families, domains and functional sites       Classification of proteins and identifying domains
Enzymes / metabolic pathways                         Metabolic pathways
Types Of Biological Databases Accessible
There are many different types of databases,
but for routine sequence analysis the
following are initially the most important:
Primary databases
Secondary databases
Composite databases
Primary databases
• Contain sequence data such as nucleic acid
or protein
• Examples of primary databases include:
Nucleic Acid Databases     Protein Databases
• EMBL                     • SWISS-PROT
• GenBank                  • TrEMBL
• DDBJ                     • PIR
Secondary databases
• Sometimes known as pattern databases
• Contain results from the analysis of the
sequences in the primary databases
• Examples of secondary databases include:
  • PROSITE
  • Pfam
  • BLOCKS
  • PRINTS
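Pattern databases describe conserved motifs in a compact notation. As a rough sketch, a PROSITE-style pattern can be converted into a regular expression and matched against a protein sequence. The function names and the test sequence below are hypothetical, and only a subset of the pattern syntax is handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern to a Python regex (subset only)."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                      # x means any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {P} means any residue except P
        # [ST] (one of several residues) and single letters pass through unchanged.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

def find_motif(pattern: str, sequence: str) -> list:
    """Return 0-based start positions of non-overlapping matches."""
    return [m.start() for m in re.finditer(prosite_to_regex(pattern), sequence)]

# N-{P}-[ST]-{P} is the N-glycosylation site pattern; the sequence is made up.
hits = find_motif("N-{P}-[ST]-{P}", "MKNGSAQNPTL")
```

Here the second N is followed by P, so only the first N is reported. Real pattern databases such as Pfam or PRINTS use richer descriptions (profiles, fingerprints) that a single regular expression cannot capture.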
Composite databases
• Combine different sources of primary
databases.
• Make querying and searching efficient,
without the need to visit each of the
primary databases.
• Examples of composite databases include:
  • NRDB – Non-Redundant DataBase
  • OWL
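The idea of a composite, non-redundant database can be sketched as a merge that keeps one entry per distinct sequence and records where each copy came from. This is only an illustration: the function name, the record layout and the identifiers are invented, and real composite databases may apply more elaborate redundancy criteria.

```python
def merge_non_redundant(*sources):
    """Merge (source_name, records) pairs, keeping one entry per distinct sequence.

    Each record is an (identifier, sequence) tuple.
    """
    merged = {}
    for source_name, records in sources:
        for identifier, sequence in records:
            key = sequence.upper()
            # The first occurrence creates the entry; later duplicates
            # only add a cross-reference to their source.
            entry = merged.setdefault(key, {"sequence": key, "ids": []})
            entry["ids"].append(f"{source_name}:{identifier}")
    return list(merged.values())

# Invented records for illustration only.
swissprot = ("SWISS-PROT", [("A1", "MKT"), ("A2", "MAAL")])
pir = ("PIR", [("B1", "mkt")])
composite = merge_non_redundant(swissprot, pir)
```

After the merge, `composite` holds two entries, and the first one lists both `SWISS-PROT:A1` and `PIR:B1` as sources of the same sequence.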
NCBI : http://www.ncbi.nlm.nih.gov/
NCBI, at the NIH campus, USA
EMBL : http://www.embl-heidelberg.de/
European Molecular Biology Laboratory (EMBL database maintained at the EBI, UK)
DDBJ : http://www.ddbj.nig.ac.jp
DNA Databank of Japan
Nucleic acid Databases
The International Sequence Database Collaboration
GenBank
EMBL
DDBJ
The International Sequence Database
Collaboration
• These three databases have collaborated since 1982. Each
database collects and processes new sequence data and relevant
biological information from scientists in their region e.g. EMBL
collects from Europe, GenBank from the USA.
• These databases automatically update each other with the new
sequences collected from each region, every 24 hours. The result is
that they contain exactly the same information, except for any
sequences that have been added in the last 24 hours.
• This is an important consideration in your choice of database. If you
need accurate and up to date information, you must search an up to
date database.
Amount Of Data Grows Rapidly
As of June 2003, there were 32,528,249,295 bases
in 25,592,865 sequences
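As a quick worked calculation from these two figures, the average entry length comes to roughly 1,271 bases. The variable names are illustrative.

```python
# Figures quoted above for June 2003.
total_bases = 32_528_249_295
total_sequences = 25_592_865

# Mean number of bases per sequence entry.
average_length = total_bases / total_sequences
print(f"Average sequence length: {average_length:.0f} bases")
```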
How to access them
Main Sites
NCBI : http://www.ncbi.nlm.nih.gov/
EMBL : http://www.embl-heidelberg.de/
DDBJ : http://www.ddbj.nig.ac.jp
• Full release every two months
• Incremental and cumulative updates daily
• Available only through the Internet
ftp://ftp.ncbi.nih.gov/genbank/
• 66.3 Gigabytes of data
The Internet and WWW
NCBI : http://www.ncbi.nlm.nih.gov/
NCBI, a division of NLM at the NIH campus, USA
EXPASY : http://www.expasy.org
Swiss Institute of Bioinformatics
Kyoto Encyclopedia of Genes and Genomes
http://www.genome.ad.jp/kegg/kegg2.html
National Center for Biotechnology Information
Established in 1988 as a national resource for
molecular biology information, NCBI creates
public databases, conducts research in
computational biology, develops software tools
for analyzing genome data, and disseminates
biomedical information all for the better
understanding of molecular processes affecting
human health and disease.
http://www.ncbi.nlm.nih.gov/
Entrez
Entrez is a search and retrieval
system that integrates information
from databases at NCBI.
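The slides use the Entrez web interface. NCBI also exposes Entrez programmatically through its E-utilities service, which is not covered in the original material. As a hedged sketch, a keyword search can be expressed as a URL like the one built below. The helper name `entrez_search_url` is hypothetical, the search term is the BNIP example used in these slides, and no request is actually sent.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search endpoint (ESearch).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def entrez_search_url(database: str, term: str, max_results: int = 5) -> str:
    """Build an ESearch URL for a keyword query (no network access here)."""
    query = urlencode({"db": database, "term": term, "retmax": max_results})
    return f"{ESEARCH_URL}?{query}"

url = entrez_search_url("nucleotide", "BNIP")
```

Opening the resulting URL in a browser should return a list of matching record identifiers. NCBI publishes usage guidelines for this service that are worth checking before scripting many requests.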
BNIP
[Figure: annotated GenBank record; the callouts read as follows]
• Brief description of the sequence.
• Accession Number: unique identifier.
• Source: organism's common name, followed by the formal scientific name.
• Contains information on the publications, such as the authors and the
  titles of the journal articles that discuss the data reported in the record.
• Contains the contact information of the submitter.
• Contains information about the genes, gene products and regions of
  biological significance reported in the sequence, and:
  • length of sequence
  • scientific name of the source organism
  • Taxon ID number, map location
• Region of biological interest.
• Coding sequence (the region of nucleotides that corresponds to the
  sequence of amino acids). This location also contains the start and
  stop codons.
• The amino acid translation corresponding to the nucleotide coding
  sequence.
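The relationship between a coding sequence and its amino acid translation, as described in the callouts above, can be sketched with the standard genetic code. The function name `translate_cds` and the short test sequences are made up for illustration; real records may specify a different translation table or a partial coding region.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_cds(cds: str) -> str:
    """Translate a DNA coding sequence from its start codon to the first stop codon."""
    cds = cds.upper()
    if not cds.startswith("ATG"):
        raise ValueError("coding sequence should begin with the start codon ATG")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":   # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)

peptide = translate_cds("ATGAAATGGTGA")  # ATG AAA TGG TGA
```

In this toy example the four codons give Met, Lys, Trp and a stop, so `peptide` is `"MKW"`.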
How to understand the output
Unique Identifiers :
Each entry in a database must have a unique
identifier
EMBL Identifier (ID)
GENBANK Accession Number (AC)
Other information is stored along with the sequence.
Each piece of information is written on its own line,
with a code defining the line. For example,
DE, description;
OS, organism species;
AC, accession number.
Relevant biological information is usually described
in the feature table (FT).
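Because every line starts with a two-letter code, an entry in this style can be split into fields with very little code. The following is a minimal sketch: the function name and the sample entry are invented for illustration, and it does not attempt to interpret the feature table or the sequence itself.

```python
def parse_line_codes(entry_text: str) -> dict:
    """Group the lines of an EMBL-style entry by their two-letter line code."""
    fields = {}
    for line in entry_text.splitlines():
        if len(line) < 2 or line.startswith("//"):
            continue                      # blank line or end-of-entry marker
        code, content = line[:2], line[2:].strip()
        if code.strip() == "" or code == "XX":
            continue                      # sequence data or spacer line
        fields.setdefault(code, []).append(content)
    return fields

# Invented entry, for illustration only.
EXAMPLE_ENTRY = """ID   EXAMPLE1
AC   X00000;
DE   Hypothetical example
DE   gene
OS   Homo sapiens
//"""

fields = parse_line_codes(EXAMPLE_ENTRY)
description = " ".join(fields["DE"])
```

A code that appears on several lines, such as DE here, is collected into a list so that multi-line descriptions can be joined afterwards.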
Genbank Flat File Format
Refer to Summary Description of the
Genbank Flat File Format
Or
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
ExPASy
• Expert Protein Analysis System proteomics server of
the Swiss Institute of Bioinformatics (SIB)
• dedicated to the analysis of protein sequences and
structures
http://www.expasy.org/
Databases on the ExPASy server
• SWISS-PROT and TrEMBL - Protein knowledgebase
• PROSITE - Protein families and domains
• SWISS-2DPAGE - Two-dimensional polyacrylamide gel
electrophoresis
• ENZYME - Enzyme nomenclature
• SWISS-3DIMAGE - 3D images of proteins and other biological
macromolecules
• SWISS-MODEL Repository - Automatically generated protein
models
SWISS-PROT
A curated protein sequence database which
strives to provide a high level of annotations
(such as the description of the function of a
protein, its domain structure, post-translational modifications, variants, etc.), a
minimal level of redundancy and high level of
integration with other databases
http://tw.expasy.org/sprot/
TrEMBL
• Computer-annotated supplement to
SWISS-PROT
ENZYME
Enzyme nomenclature
database
http://tw.expasy.org/enzyme/
ENZYME Database
• A repository of information relative to
the nomenclature of enzymes
• Describes each type of characterized
enzyme for which an EC (Enzyme
Commission) number has been
provided
Access to ENZYME
• by EC number
• by enzyme class
• by description (official name) or
alternative name(s)
• by chemical compound
• by cofactor
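Access by EC number or enzyme class relies on the hierarchical structure of EC numbers, where the first of the four fields gives the enzyme class. A small sketch of splitting an EC number follows. The function name and return layout are hypothetical, and class 7 was added to the nomenclature long after these slides were written.

```python
# Top-level EC classes (first field of an EC number).
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",  # added in 2018
}

def parse_ec_number(ec: str) -> dict:
    """Split an EC number such as 'EC 3.4.21.4' into its four levels."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec!r}")
    class_number = int(parts[0])
    return {
        "class": class_number,
        "class_name": EC_CLASSES.get(class_number, "unknown"),
        "subclass": parts[1],
        "sub_subclass": parts[2],
        "serial": parts[3],
    }

info = parse_ec_number("EC 3.4.21.4")
```

The lower levels are kept as strings because partial EC numbers use a dash in place of unassigned fields.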
KEGG
Kyoto Encyclopedia of Genes
and Genomes
http://www.genome.ad.jp/kegg/kegg2.html
A structured database containing
information about metabolic
pathways in many organisms.
KEGG
• Part of the GenomeNet database
system
• Linked to all accessible databases by
search engines; LIGAND & BRITE
[Figure: KEGG pathway map, with callouts for links to other pathways, enzymes and compounds]
Summary
• Biological databases represent an invaluable
resource in support of biological research.
• We can learn much about a particular
molecule by searching databases and using
available analysis tools.
• A large number of databases are available
for that task. Some databases are very
general while some are very specialised. For
best results we often need to access multiple
databases.
• Common database search methods include
keyword matching, sequence similarity, motif
searching, and class searching
• The problems with using biological databases
include incomplete information, data spread
over multiple databases, redundant
information, various errors, sometimes
incorrect links, and constant change.
• Database standards, nomenclature, and naming
conventions are not clearly defined for many aspects
of biological information. This makes information
extraction more difficult
• Retrieval systems help extract rich information from
multiple databases. Examples include Entrez and
SRS.
• Formulating queries is a serious issue in biological
databases. Often the quality of results depends on
the quality of the queries.
• Access to biological databases is so important that
today virtually every molecular biological project
starts and ends with querying biological databases.
The End