Bioinformatics
Ayesha M. Khan
Spring 2013
Lec-3
Introduction to databases
If we are to derive the maximum benefit from
the deluge of sequence information, we must
deal with it in a concerted way by doing the
following with the information contained in databases:
• Establish
• Maintain
• Disseminate
Introduction to Databases
• Databases are effectively electronic filing cabinets,
a convenient and efficient method of storing vast
amounts of information.
• Central, shareable resources
• Many different types of databases exist, depending on the
-Nature of the information being stored
-Manner of data storage
Primary & Secondary databases
• Primary and secondary databases are used to
address different aspects of sequence analysis,
because they store different levels of protein
sequence information
Primary or derived databases
• Primary databases: experimental results are deposited directly into the database
• Secondary databases: hold the results of analysis of primary databases
• Composite databases: an aggregate of many databases
-Links to other data items
-Combination of data
-Consolidation of data
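The distinction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the accession strings (DEMO001 and so on), the sequences, and the motif pattern are all hypothetical and do not come from any real database. Primary entries hold the deposited sequences; the secondary entry is produced by analysing them and links back to the primary accessions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class PrimaryEntry:
    """A deposited sequence, as a primary database would store it (hypothetical)."""
    accession: str
    sequence: str


@dataclass
class SecondaryEntry:
    """A derived record: a pattern plus links back to matching primary entries."""
    pattern: str
    members: list = field(default_factory=list)


def derive_secondary(entries, pattern):
    """Scan primary entries and collect the accessions that contain the pattern."""
    derived = SecondaryEntry(pattern=pattern)
    regex = re.compile(pattern)
    for entry in entries:
        if regex.search(entry.sequence):
            derived.members.append(entry.accession)
    return derived


# Hypothetical primary data, for illustration only.
primary = [
    PrimaryEntry("DEMO001", "MKTAYCGAHKLLE"),
    PrimaryEntry("DEMO002", "MSTNPKPQRK"),
    PrimaryEntry("DEMO003", "GGCAAHRW"),
]

# Hypothetical motif: C, any two residues, then H.
family = derive_secondary(primary, r"C..H")
print(family.members)
```

Real secondary databases use far richer analyses (alignments, profiles, curated annotation); the point here is only that the derived record is computed from, and cross-referenced to, primary entries.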
Primary sequence databases
• Early 1980s
• Nucleic acid: EMBL (Europe), GenBank
(USA), DDBJ (Japan)
• Protein: PIR, SWISS-PROT, TrEMBL,
NRL-3D
EMBL:
EMBL is the nucleotide sequence database from the
European Bioinformatics Institute (EBI).
It has sequences from: direct author submissions,
genome sequencing groups, scientific literature and
patent applications.
DDBJ:
DNA Data Bank of Japan, produced, maintained and
distributed at the National Institute of Genetics.
GenBank:
DNA database from the National Center for
Biotechnology Information (NCBI).
Principal requirements of a database
The principal requirements on the public data services are:
• Data quality - data quality has to be of the highest priority. However, because the data
services in most cases lack access to supporting data, the quality of the data must
remain the primary responsibility of the submitter.
• Supporting data - database users will need to examine the primary experimental data,
either in the database itself, or by following cross-references back to network-accessible laboratory databases.
• Deep annotation - deep, consistent annotation comprising supporting and ancillary
information should be attached to each basic data object in the database.
• Timeliness - the basic data should be available on an Internet-accessible server within
days (or hours) of publication or submission.
• Integration - each data object in the database should be cross-referenced to
representations of the same or related biological entities in other databases. Data
services should provide capabilities for following these links from one database or data
service to another.
Exercise
Look for a gene of your interest in the three primary
nucleic acid databases: compare the information
given in each one of them.
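Once the three entries have been looked up by hand, the comparison itself can be organised with a small helper. The following is a minimal sketch, assuming the fields of interest have already been copied into dictionaries; the accession, field names, and values below are placeholders for illustration, not real records.

```python
def compare_records(records):
    """Return {field: {database: value}} for every field whose value is
    missing from, or different in, at least one database."""
    all_fields = sorted({name for rec in records.values() for name in rec})
    differences = {}
    for name in all_fields:
        values = {db: rec.get(name) for db, rec in records.items()}
        if len(set(values.values())) > 1:
            differences[name] = values
    return differences


# Placeholder values typed in by hand after looking up the same gene
# in each database (hypothetical, for illustration only).
records = {
    "GenBank": {"accession": "DEMO0001", "length": 1200,
                "organism": "Homo sapiens", "last_updated": "2012-05-01"},
    "EMBL": {"accession": "DEMO0001", "length": 1200,
             "organism": "Homo sapiens", "last_updated": "2012-05-03"},
    "DDBJ": {"accession": "DEMO0001", "length": 1200,
             "organism": "Homo sapiens", "last_updated": "2012-05-01"},
}

diff = compare_records(records)
for name, values in diff.items():
    print(name, values)
```

Only the fields that disagree are reported, which keeps attention on where the three entries actually differ.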
Figure: Flowchart of sequence data from labs and literature to primary
sequence database and subsequent secondary databases
• Sources: sequencing centers, researchers, literature
• Primary Sequence Database: Nucleic Acid (e.g. GenBank, EMBL, DDBJ) and Amino Acid (SwissProt and PIR)
• Secondary Sequence Database: Protein Domains & Families, Metabolic Pathways
(e.g. RefSeq and Conserved Domain Database (CDD) within NCBI)
Always remember that:
• The data within primary databases is as reliable as the data
submitted.
• This depends primarily on the methods used to produce it.
• Regardless of who obtains the sequence data, nucleic acid
and amino acid sequencing results are subject to errors.
Protein Sequence databases
• The Protein Sequence Database was developed at
the National Biomedical Research Foundation
(NBRF) in the early 1960s by Margaret Dayhoff to
investigate evolutionary relationships among proteins
• Since 1988, maintained collectively by:
Protein Information Resource (PIR) at NBRF,
International Protein Information Database of
Japan (JIPID), and the Martinsried Institute for
Protein Sequences (MIPS).
Examples of molecular sequence types in NCBI records
Genome
Type: Description
• Sequence tagged site (STS): A unique segment of DNA that occurs only once in a genome and marks a particular location. Can be generated from genomic DNA or cDNA.
• Draft sequences: Pieces of a genome that are compiled from a DNA or cDNA library. Usually a large collection of contigs that are in the process of being ordered and catalogued.
• Genome: The complete genome of an organism.
Chromosome
Type: Description
• Locus: A known location on a chromosome for a particular gene or collection of genes that codes for a specific function.
• Contig: A contiguous segment of a chromosome made by joining overlapping clones or sequences.
• Chromosome: The whole sequence of a single chromosome.
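The idea of building a contig by joining overlapping sequences can be sketched as follows. This is a deliberately simplified illustration, assuming the fragments are error-free, already in the correct order, and all on the same strand; the fragment sequences and function names are made up, and real assemblers handle far more.

```python
def merge_overlap(left, right, min_overlap=3):
    """Join two sequences when a suffix of `left` equals a prefix of `right`.

    The longest such overlap is used. Returns None when no overlap of at
    least `min_overlap` bases exists.
    """
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None


def build_contig(fragments, min_overlap=3):
    """Chain ordered fragments into one contiguous sequence."""
    contig = fragments[0]
    for fragment in fragments[1:]:
        merged = merge_overlap(contig, fragment, min_overlap)
        if merged is None:
            raise ValueError(f"no overlap found for fragment {fragment!r}")
        contig = merged
    return contig


# Made-up fragments: each one overlaps the next by four bases.
fragments = ["ATGGCGTAC", "GTACCTTGA", "TTGACCGA"]
contig = build_contig(fragments)
print(contig)
```

Each overlapping region is kept only once, so the contig is shorter than the fragments placed end to end.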
Gene
Type: Description
• Domain: A discrete portion of a protein assumed to fold independently of the rest of the protein and which possesses its own function.
• Complete CDS: A complete coding sequence for a protein.
• Gene: Whole gene sequence for a protein.
mRNA
Type: Description
• Expressed sequence tag (EST): A partial sequence of cDNA in mRNA form from either the 5' or 3' end of a gene sequence.
• Complementary DNA sequence (cDNA): A cDNA sequence in mRNA form.
• Complete CDS: A complete mRNA sequence for a protein coding region.
Protein Sequence databases
• SWISS-PROT
Started in 1986 by the University of Geneva and EMBL.
It is now maintained by the Swiss Institute of
Bioinformatics (SIB) and EBI/EMBL.
• TrEMBL
Started in 1996. Follows the SWISS-PROT format
and contains translations of coding sequences in
EMBL.
It also includes: synthetic sequences, short
amino acid fragments, and codon translations that do not
encode real proteins.
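Translating a coding sequence, the operation TrEMBL entries are derived from, can be sketched as below. This is a minimal illustration using the standard genetic code only; it is not the pipeline used to build TrEMBL, and the function name and example sequence are assumptions made for this sketch.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a coding sequence in frame 1, stopping at the first stop codon.

    Codons containing unrecognised characters are translated as 'X'.
    Trailing bases that do not form a full codon are ignored.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for start in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[start:start + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up example: ATG GCC AAA TGA
print(translate_cds("ATGGCCAAATGA"))
```

Organisms and organelles that use a different genetic code would need a different table, which this sketch does not cover.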
Composite protein sequence databases
• A composite database merges a variety of different
primary sources.
• Composite databases obviate the need to interrogate multiple
resources.
• They can eliminate identical sequence copies, or
eliminate both identical and highly similar
sequences.
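The simpler of the two redundancy criteria, removing identical copies, can be sketched as follows. The source names, accessions, and sequences are hypothetical, and the sketch covers exact matches only; removing highly similar sequences would additionally need a similarity measure and a threshold, which are not shown.

```python
def merge_nonredundant(sources):
    """Merge several sequence collections, keeping one copy of each sequence.

    `sources` is a list of (source_name, {accession: sequence}) pairs in
    priority order: the first source containing a sequence provides the
    retained entry, and later identical copies are recorded as links.
    """
    merged = {}
    for source_name, entries in sources:
        for accession, sequence in entries.items():
            key = sequence.upper()
            if key in merged:
                merged[key]["also_in"].append((source_name, accession))
            else:
                merged[key] = {
                    "sequence": key,
                    "source": (source_name, accession),
                    "also_in": [],
                }
    return list(merged.values())


# Hypothetical sources and entries, for illustration only.
sources = [
    ("source_A", {"A001": "MKTAYIAK", "A002": "MSTNPKPQ"}),
    ("source_B", {"B001": "mktayiak", "B002": "GGHHWWLL"}),
]

composite = merge_nonredundant(sources)
for entry in composite:
    print(entry["source"], entry["also_in"])
```

Keeping the links to the discarded copies preserves the cross-references while still returning each sequence only once.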