Introduction to databases
Tuomas Hätinen
Topics
 File formats
 Databases
- Primary structure: UniProt
- Tertiary structure: PDB
 Database integration system
- Sequence retrieval system (e.g. SRS, hands-on session)
File formats
Fasta
 FASTA format is very common.
 Can be written by hand when in a hurry
 A straightforward way to store multiple sequences: just concatenate the records
FASTA files
 Contents:
- Line 1: > followed by all identifiers and descriptors
- Remaining lines: sequence
>1NJR:A 32.1 KDA PROTEIN IN ADH3-RCA1 INTERGENIC REGION
XTGSLNRHSLLNGVKKXRIILCDTNEVVTNLWQESIPHAYIQNDKYLCIHHGHLQSLXDS
XRKGDAIHHGHSYAIVSPGNSYGYLGGGFDKALYNYFGGKPFETWFRNQLGGRYHTVGSA
TVVDLQRCLEEKTIECRDGIRYIIHVPTVVAPSAPIFNPQNPLKTGFEPVFNAXWNALXH
SPKDIDGLIIPGLCTGYAGVPPIISCKSXAFALRLYXAGDHISKELKNVLIXYYLQYPFE
PFFPESCKIECQKLGIDIEXLKSFNVEKDAIELLIPRRILTLDL
Example of a FASTA sequence for PDB entry 1njr. Note that X stands for ’any’ amino acid.
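The layout described above (a header line starting with >, followed by sequence lines, with records simply concatenated) can be read with a few lines of Python. This is a minimal sketch, not a reference parser: the function name parse_fasta and the two short records in the example are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are joined; their line breaks carry no meaning.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record file, made up for this example.
example = """>seq1 first hypothetical record
ACDEFGHIK
LMNPQRSTV
>seq2 second hypothetical record
WYACD
"""

records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

Because a record ends only where the next > line begins, appending one FASTA file to another yields a valid multi-sequence file.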
SwissPROT, EMBL, TrEMBL, UniProt format
 Each line begins with a two-letter identifier
 UniProt format closely resembles EMBL format, except that considerably more information about physical and biochemical properties is provided
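Since every line starts with a two-letter identifier, an entry can be split into its line types mechanically. The sketch below assumes the usual flat-file conventions (the code occupies the first two characters, sequence lines start with blanks, and // ends the entry). The function name group_by_line_code and the entry text are invented for illustration and do not correspond to a real record.

```python
from collections import defaultdict


def group_by_line_code(entry_text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    grouped = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):
            break  # assumed end-of-entry marker
        code = line[:2]
        if not code.strip():
            continue  # blank code: a sequence data line, skipped in this sketch
        grouped[code].append(line[2:].strip())
    return dict(grouped)


# Invented entry, for illustration only.
entry = """ID   EXAMPLE_HUMAN   Reviewed;   10 AA.
AC   P00000;
DE   Hypothetical example protein.
SQ   SEQUENCE   10 AA;
     ACDEFGHIKL
//
"""

grouped = group_by_line_code(entry)
for code, values in grouped.items():
    print(code, values)
```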
SwissPROT format
Example of SwissProt entry. Line types are fully explained in: http://au.expasy.org/sprot/userman.html#linetypes
Databases
Key concepts
 Experimental database
- Contains experimental measurements
- E.g. EMBL, PDB
 Derived database
- Derived from experimental databases
- E.g. UniProtKB
 Database stability
 Accession numbers
 Non-redundancy
 Annotation
Nucleic sequence databases – experimental data
[Diagram: the nucleotide sequence databases GenBank (NCBI, NIH), EMBL (EBI) and DDBJ (CIB, NIG), each receiving submissions and updates]
Raw Protein sequence databases
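[Diagram, labels recoverable: DNA sequence DBs (GenBank at NCBI/NIH, EMBL at EBI, DDBJ) receiving submissions and updates (Sub/Up); protein sequence DBs (GenPept, PIR-PSD, TrEMBL, SwissPROT, UniPROT); translation steps (Trans) leading to GenPept and TrEMBL; retrieval systems Entrez and SRS]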
UniProt
 Universal Protein Resource
 Protein sequence database
 UniProt Consortium
- European Bioinformatics Institute
- Swiss Institute of Bioinformatics
- PIR, Georgetown University
 Mission
- Maintain a high-quality, stable, comprehensive, fully classified and annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces
Organization of UniProt databases
 UniProt Archive (UniParc)
- All available protein sequences
 UniProt Knowledgebase (UniProtKB)
- Annotated protein sequences
 UniProt Reference Clusters (UniRef)
- Reduced redundancy for faster searching
Database size comparison
[Bar chart: number of sequences, in millions (axis from 0 to 9), for UniRef50, UniRef90, UniRef100, UniProtKB and UniParc]
UniProtKB
 Annotated entries
 UniParc => UniProtKB
 UniProt/TrEMBL
- Automated annotation
 UniProt/SwissProt
- Manual annotation
SWISSPROT
 Started as part of a PhD thesis; first version released in 1986. Now a collaboration between the Swiss Institute of Bioinformatics and EBI.
 Rich source of protein sequence data
 A well-annotated source of sequences
 Largely non-redundant
 Updated daily, cross-referenced with more than 30 different databases.
 Let us view a sample entry
TrEMBL
 1996: TrEMBL (Translation of EMBL) released
 Computer-annotated entries derived from the translation of all coding sequences in the EMBL database except those already in SWISS-PROT
 complement to Swiss-Prot and sequence
 Sequences are included in Swiss-Prot by annotators
Errors in databases
 Be aware of errors in the databases:
 Sequence errors:
- genome projects’ error rate is 1/10,000 nt;
- ESTs’ error rate is 1/100 nt.
 Annotation errors:
- Automated computer programs do not always give correct annotations.
- SwissProt is a protein database curated and annotated manually by biologists.
- It is the most reliable database, but is not up to date.
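To get a feel for the two sequence error rates quoted above, the expected number of erroneous nucleotides in a sequence is simply its length divided by the number of nucleotides per error. The sketch below is plain arithmetic; the 500 nt length is a hypothetical value chosen for the example, and the function and constant names are made up.

```python
def expected_errors(length_nt, nt_per_error):
    """Expected number of erroneous nucleotides in a sequence of length_nt."""
    return length_nt / nt_per_error


# Rates quoted on the slide: one error per 10,000 nt and one per 100 nt.
GENOME_NT_PER_ERROR = 10_000
EST_NT_PER_ERROR = 100

length = 500  # hypothetical sequence length in nucleotides

genome_errors = expected_errors(length, GENOME_NT_PER_ERROR)
est_errors = expected_errors(length, EST_NT_PER_ERROR)

print(f"genome project: about {genome_errors} expected errors in {length} nt")
print(f"EST:            about {est_errors} expected errors in {length} nt")
```

At these rates an EST read of that length is expected to contain several wrong bases, while a genome-project sequence of the same length is usually error-free.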