Protein sequence retrieval and other database information
Databases
• Protein sequence (primary)
– SWISS-PROT
– PIR-International
• Protein sequence (composite)
– OWL
– NRDB
• Protein sequence (secondary)
– PROSITE
– PRINTS
– Pfam
• Macromolecular structures
– Protein Data Bank (PDB)
– Nucleic Acids Database (NDB)
– HIV Protease Database
– ReLiBase
– PDBsum
– CATH
– SCOP
– FSSP
• Nucleotide sequences
– GenBank
– EMBL
– DDBJ
• Genome sequences
– Entrez genomes
– GeneCensus
– COGs
• Integrated databases
– InterPro
– Sequence retrieval system (SRS)
– Entrez
Protein Sequence Alignment and Database Searching
• Alignment of Two Sequences (Pair-wise Alignment)
– The Scoring Schemes or Weight Matrices
– Techniques of Alignments
– DOTPLOT
• Multiple Sequence Alignment (Alignment of > 2 Sequences)
– Extending Dynamic Programming to more sequences
– Progressive Alignment (Tree or Hierarchical Methods)
– Iterative Techniques
   • Stochastic Algorithms (SA, GA, HMM)
   • Non-Stochastic Algorithms
• Database Scanning
– FASTA, BLAST, PSI-BLAST, ISS
• Alignment of Whole Genomes
– MUMmer (Maximal Unique Match)
An Overview of BLAST
• Input query: amino acid sequence
– blastp: compares against a protein sequence database
– tblastn: compares against a translated nucleotide sequence database
• Input query: DNA sequence
– blastn: compares against a nucleotide sequence database
– blastx: compares against a protein sequence database
– tblastx: compares against a translated nucleotide sequence database
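The BLAST overview amounts to a lookup from query type and database type to a program name. The sketch below encodes that lookup as a small table. The names `BLAST_PROGRAMS`, `programs_for_query` and `choose_program` are hypothetical helpers for illustration only and are not part of any BLAST distribution.

```python
# Program name -> (query type, database it is compared against),
# transcribed from the overview above.
BLAST_PROGRAMS = {
    "blastp": ("protein", "protein"),
    "tblastn": ("protein", "translated nucleotide"),
    "blastn": ("dna", "nucleotide"),
    "blastx": ("dna", "protein"),
    "tblastx": ("dna", "translated nucleotide"),
}


def programs_for_query(query_type):
    """List the programs that accept the given query type."""
    return [name for name, (query, _) in BLAST_PROGRAMS.items()
            if query == query_type]


def choose_program(query_type, database_type):
    """Return the program matching a query type and a database type."""
    for name, (query, database) in BLAST_PROGRAMS.items():
        if query == query_type and database == database_type:
            return name
    raise ValueError(f"no BLAST program for {query_type!r} vs {database_type!r}")


if __name__ == "__main__":
    print(programs_for_query("dna"))
    print(choose_program("protein", "translated nucleotide"))
```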
Comparison of Whole Genomes
• MUMmer (Salzberg group, 1999, 2002)
– Pair-wise sequence alignment of genomes
– Assumes that the sequences are closely related
– Detects repeats, inverted repeats and SNPs
– Detects inserted or deleted domains
– Identifies the exact matches
• How it works
– Identify the maximal unique matches (MUMs) in the two genomes
– Because the two genomes are similar, long MUMs will be present
– Sort the MUMs found and extract the longest set of matches that occur in the same order in both genomes (ordered MUMs)
– A suffix tree is used to identify the MUMs
– Close the gaps between MUMs (SNPs, large inserts)
– Align the regions between MUMs by Smith-Waterman
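The first steps of the procedure above can be sketched on toy sequences. This is an illustration under stated assumptions, not MUMmer's implementation: a naive quadratic scan stands in for the suffix tree, the ordered set is chosen by number of matches (a plain longest increasing subsequence) without weighting by match length, overlaps between matches are ignored, and the function names are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len):
    """Return (pos_in_a, pos_in_b, length) for each maximal unique match.

    Naive scan for illustration only; a suffix tree does this far faster.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip matches that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the sequences agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k < min_len:
                continue
            match = a[i:i + k]
            # Keep the match only if it occurs exactly once in each sequence.
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, k))
    return mums


def ordered_mums(mums):
    """Longest chain of MUMs whose positions increase in both sequences."""
    mums = sorted(mums)  # sort by position in the first sequence
    if not mums:
        return []
    best = [1] * len(mums)   # best[x]: longest chain ending at mums[x]
    prev = [-1] * len(mums)  # back-pointers for reconstructing the chain
    for x in range(len(mums)):
        for y in range(x):
            if mums[y][1] < mums[x][1] and best[y] + 1 > best[x]:
                best[x] = best[y] + 1
                prev[x] = y
    end = max(range(len(mums)), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(mums[end])
        end = prev[end]
    return chain[::-1]


if __name__ == "__main__":
    genome_a = "CCGATTACATTGGCGCAA"
    genome_b = "GGCGCAATTGATTACACC"
    found = find_mums(genome_a, genome_b, min_len=4)
    print(found)
    print(ordered_mums(found))
```

The regions left between consecutive ordered MUMs are the gaps that the final steps classify and align.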
Secondary protein databases
• SWISS-PROT (1986)
– Best annotated, least redundant
• PIR (Protein Information Resource)
– More automated annotation
– Collaborations with MIPS and JIPID
• UniProt (2003)
– UniProt (Universal Protein Resource) is a central
repository of protein sequence and function
created by joining the information contained in
Swiss-Prot, TrEMBL, and PIR.
Databases
• Primary (archival)
– GenBank/EMBL/DDBJ
– UniProt
– PDB
– Medline (PubMed)
– BIND
• Secondary (curated)
– RefSeq
– Taxon
– UniProt
– OMIM
– SGD
Organismal Divisions
Code   Division                  Used in which database?
BCT    Bacterial                 DDBJ - GenBank
FUN    Fungal                    EMBL
HUM    Homo sapiens              DDBJ - EMBL
INV    Invertebrate              all
MAM    Other mammalian           all
ORG    Organelle                 EMBL
PHG    Phage                     all
PLN    Plant                     all
PRI    Primate (also see HUM)    all (not same data in all)
PRO    Prokaryotic               EMBL
ROD    Rodent                    all
SYN    Synthetic and chimeric    all
VRL    Viral                     all
VRT    Other vertebrate          all
Functional Divisions
Code   Division
PAT    Patent
EST    Expressed Sequence Tags
STS    Sequence Tagged Site
GSS    Genome Survey Sequence
HTG    High Throughput Genome (unfinished)
HTC    High throughput cDNA (unfinished)
CON    Contig assembly instructions
Organismal divisions: BCT, FUN, INV, MAM, PHG, PLN, PRI, ROD, SYN, VRL, VRT
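The division codes can be kept in two small lookup tables. The sketch below transcribes the codes listed on these slides. The function name `describe_division` is a hypothetical helper, not part of any GenBank, EMBL or DDBJ tool.

```python
# Division codes and descriptions as listed on the slides.
ORGANISMAL_DIVISIONS = {
    "BCT": "Bacterial",
    "FUN": "Fungal",
    "HUM": "Homo sapiens",
    "INV": "Invertebrate",
    "MAM": "Other mammalian",
    "ORG": "Organelle",
    "PHG": "Phage",
    "PLN": "Plant",
    "PRI": "Primate",
    "PRO": "Prokaryotic",
    "ROD": "Rodent",
    "SYN": "Synthetic and chimeric",
    "VRL": "Viral",
    "VRT": "Other vertebrate",
}

FUNCTIONAL_DIVISIONS = {
    "PAT": "Patent",
    "EST": "Expressed Sequence Tags",
    "STS": "Sequence Tagged Site",
    "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genome (unfinished)",
    "HTC": "High throughput cDNA (unfinished)",
    "CON": "Contig assembly instructions",
}


def describe_division(code):
    """Return (kind, description) for a three-letter division code."""
    key = code.strip().upper()
    if key in ORGANISMAL_DIVISIONS:
        return ("organismal", ORGANISMAL_DIVISIONS[key])
    if key in FUNCTIONAL_DIVISIONS:
        return ("functional", FUNCTIONAL_DIVISIONS[key])
    raise KeyError(f"unknown division code: {code!r}")


if __name__ == "__main__":
    for code in ("BCT", "est", "HTG"):
        print(code, describe_division(code))
```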
EST: Expressed Sequence Tag
Expressed Sequence Tags are short (300-500 bp) single reads from mRNA (cDNA)
that are produced in large numbers.
They represent a snapshot of what is expressed
in a given tissue at a given developmental stage.
Also see:
http://www.ncbi.nlm.nih.gov/dbEST/
http://www.ncbi.nlm.nih.gov/UniGene/
STS: Sequence Tagged Site
A Sequence Tagged Site is an operationally
unique sequence that identifies the
combination of primer pairs used in a PCR
assay. The assay generates a mapping reagent that
maps to a single position within the genome.
Also see:
http://www.ncbi.nlm.nih.gov/dbSTS/
http://www.ncbi.nlm.nih.gov/genemap/
GSS: Genome Survey Sequences
Genome Survey Sequences are similar in nature
to ESTs, except that the sequences are genomic
in origin, rather than cDNA (mRNA).
The GSS division contains:
• random "single pass read" genome survey sequences.
• single pass reads from cosmid/BAC/YAC ends (these could
be chromosome specific, but need not be)
• exon trapped genomic sequences
• Alu PCR sequences
Also see:
http://www.ncbi.nlm.nih.gov/dbGSS/
HTG: High Throughput Genome
High Throughput Genome Sequences are
records from unfinished genome sequencing efforts.
Unfinished records have gaps in the
nucleotide sequence, low accuracy, and no
annotations.
Also see:
http://www.ncbi.nlm.nih.gov/HTGS/
Ouellette and Boguski (1997) Genome Res. 7:952-955
Which tool?
[Flowchart: choosing a submission tool by sequence type (mRNA, EST, Genomic, Other, STS/GSS, HTGS)]
• EST: dbEST, by E-mail or FTP
• STS/GSS: dbSTS, dbGSS, by E-mail or FTP
• Simple submissions: BankIt (WWW)
• Other submissions (better control of annotations, pop/phylo, segmented sets): Sequin or tbl2asn, by E-mail
• HTGS: customized software or tbl2asn, by E-mail or FTP