Databases
Where to get data?
• GenBank
– http://www.ncbi.nlm.nih.gov
• Protein Databases
– SWISS-PROT: http://www.expasy.ch/sprot
– PDB: http://www.pdb.gov/
• And many others
Bibliography
Growth in genome sequencing
Working Draft Sequence
gaps
The reagent: databases
• Organized array of information
• Place where you put things in, and (if all is
well) you should be able to get them out
again.
• Resource for other databases and tools.
• Simplify the information space by
specialization.
• Bonus: Allows you to make discoveries.
• Contains files or tables, each containing numerous records and fields
• Flat file: the simplest form, either a single large text file or a collection of text files
• Relational database model: the commonest type; stores the data within a number of tables (with records and fields). The tables are linked to each other by a shared field called a key
• The operators are written in query-specific languages based on relational algebra
• Structured Query Language (SQL) is commonly used
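A minimal sketch of the relational model described above, using Python's built-in sqlite3 module. The table names (gene, sequence) and the column layout are assumptions made for illustration; the two accessions are the RBP4 examples used elsewhere in these slides.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two tables linked by a shared key."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene (
            gene_id INTEGER PRIMARY KEY,   -- the key shared with 'sequence'
            symbol  TEXT NOT NULL
        );
        CREATE TABLE sequence (
            accession TEXT PRIMARY KEY,
            gene_id   INTEGER NOT NULL REFERENCES gene(gene_id),
            molecule  TEXT NOT NULL
        );
        """
    )
    conn.execute("INSERT INTO gene VALUES (1, 'RBP4')")
    conn.executemany(
        "INSERT INTO sequence VALUES (?, ?, ?)",
        [("NM_006744", 1, "mRNA"), ("NP_007635", 1, "protein")],
    )
    return conn


def accessions_for(conn, symbol):
    """Join the two tables on the shared key and return (accession, molecule) rows."""
    query = (
        "SELECT s.accession, s.molecule "
        "FROM gene AS g JOIN sequence AS s ON s.gene_id = g.gene_id "
        "WHERE g.symbol = ? ORDER BY s.accession"
    )
    return conn.execute(query, (symbol,)).fetchall()


if __name__ == "__main__":
    for row in accessions_for(build_demo_db(), "RBP4"):
        print(row)
```

The JOIN on gene_id is the "link by a key" step; a flat file would need a text scan (for example with grep) to answer the same question.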
XML (eXtensible Markup Language) is now a general tool for storage of
data and information. XHTML is an application of XML; HTML is a closely related markup language.
The key feature is the use of identifiers called tags
<title> Understanding Bioinformatics </title>
A <publisher> tag can be defined and used to identify book publishers
Extraction from an XML file is similar to database querying.
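To illustrate how extraction from XML resembles a database query, here is a small sketch using Python's standard xml.etree.ElementTree module. The <catalogue> and <book> wrapper tags, the second book, and both publisher names are invented for this example; only the <title> and <publisher> tags come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue; publisher names are placeholders, not real data.
CATALOGUE = """<catalogue>
  <book>
    <title>Understanding Bioinformatics</title>
    <publisher>Example Press</publisher>
  </book>
  <book>
    <title>A Second Title</title>
    <publisher>Another Press</publisher>
  </book>
</catalogue>"""


def titles_by_publisher(xml_text, publisher):
    """Return the titles of all <book> elements with the given <publisher>."""
    root = ET.fromstring(xml_text)
    return [
        book.findtext("title")
        for book in root.iter("book")
        if book.findtext("publisher") == publisher
    ]


if __name__ == "__main__":
    print(titles_by_publisher(CATALOGUE, "Example Press"))
```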
Databases
• Data: GenBank flat file, PDB file, interaction record, title of a book, book
• Storage system: boxes, Oracle, MySQL, PC binary files, Unix text files, bookshelves
• Query system: a list you look at, a catalogue, indexed files, SQL, grep
• Information system: the UBC library, Google, Entrez, SRS
Bioinformatics Information Space
July 17, 1999
• Nucleotide sequences: 4,456,822
• Protein sequences: 706,862
• 3D structures: 9,780
• Human Unigene Clusters: 75,832
• Maps and Complete Genomes: 10,870
• Different species node: 52,889
• dbSNP: 6,377
• RefGenes: 515
• human contigs > 250 kb: 341 (4.9MB)
• PubMed records: 10,372,886
• OMIM records: 10,695
The challenge of the information space:
Feb 10 2004
Nucleotide records: 36,653,899
Protein sequences: 4,436,362
3D structures: 19,640
Interactions & complexes: 52,385
Human Unigene Cluster: 118,517
Maps and Complete Genomes: 6,948
Different taxonomy Nodes: 283,121
Human dbSNP: 13,179,601
Human RefSeq records: 22,079
bp in Human Contigs > 5,000 kb (116): 2,487,920,000
PubMed records: 12,570,540
OMIM records: 15,138
From a CBW student course
evaluation:
“I could probably live the rest of
my life happily without ever
seeing the ‘growth of GenBank’
curve … again.”
Databases
• Primary (archival)
– GenBank/EMBL/DDBJ
– UniProt
– PDB
– Medline (PubMed)
– BIND
• Secondary (curated)
– RefSeq
– Taxon
– UniProt
– OMIM
– SGD
http://nar.oupjournals.org/content/vol31/issue1/
Tools of trade
for the “armchair scientist”
• Databases
– PubMed and other NCBI databases
– Biochemical databases
– Protein domain databases
– Structural databases
– Genome comparison databases
• Tools
– CDD / COGs
– VAST / FSSP
Distribution of the type of databases as classified at the
NAR database web site
Types of databases
• Archival or Primary Data
– Text: PubMed
– DNA Sequence: GenBank
– Protein Sequence: Entrez Proteins, TREMBL
– Protein Structures: PDB
• Curated or Processed Data
– DNA sequences : RefSeq, LocusLink, OMIM
– Protein Sequences: SWISS-PROT, PIR
– Protein Structures : SCOP, CATH, MMDB
– Genomes: Entrez Genomes, COGs
Nucleic Acids Research: Database Issue each January 1 Articles
on ~100 different databases
4 ways to access protein and DNA sequences
[1] LocusLink with RefSeq
[2] Entrez
[3] UniGene
UniGene collects expressed sequence tags (ESTs)
into clusters, in an attempt to form one gene per cluster.
Use UniGene to study where your gene is expressed
in the body, when it is expressed, and see its abundance.
[4] ExPASy SRS
4 ways to access protein and DNA sequences
[1] LocusLink with RefSeq
[2] Entrez
[3] UniGene
[4] ExPASy SRS
There are many bioinformatics servers outside NCBI.
Try ExPASy’s sequence retrieval system at
http://www.expasy.ch/
(ExPASy = Expert Protein Analysis System)
Or try ENSEMBL at www.ensembl.org for a premier
human genome web browser.
National Center for Biotechnology
Information (NCBI)
www.ncbi.nlm.nih.gov
Page 24
The National Center for Biotechnology Information (NCBI)
• Created as a part of the National Library of Medicine,
National Institutes of Health in 1988
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
• Tools: BLAST(1990), Entrez (1992)
• GenBank (1992)
• Free MEDLINE (PubMed, 1997)
• Other databases: dbEST, dbGSS, dbSTS,
MMDB, OMIM, UniGene, Taxonomy,
GeneMap, SAGE, LocusLink, RefSeq
What is GenBank?
• Archival nucleotide sequence database
• Sample slogans:
“Easy deposits, unlimited withdrawals, high interest”,
“All bases covered”,
“Billions and billions served”
• Data are shared nightly among three
collaborating databases:
• GenBank at NCBI - Bethesda, Maryland, USA
• DNA Data Bank of Japan (DDBJ) at NIG - Mishima, Japan
• European Molecular Biology Laboratory
Database (EMBL) at EBI - Hinxton, UK
Some guiding principles of
working with GenBank
• GenBank is a nucleotide-centric
view of the information space
• GenBank is a repository of all
publicly available sequences
• In GenBank, records are grouped
for various reasons
• Data in GenBank is only as good
as what you put in
NCBI databases and their links
[Figure: links among NCBI databases. Medline (article abstracts, word weight); 3-D Structure (MMDB, VAST); Taxonomy (phylogeny); Genomes; Nucleotide Sequences (BLAST); Protein Sequences (BLAST)]
www.ncbi.nlm.nih.gov
Fig. 2.5
Page 25
PubMed is…
• National Library of Medicine's search service
• 16 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via “Education” on side bar)
Page 24
Entrez integrates…
• the scientific literature;
• DNA and protein sequence databases;
• 3D protein structure data;
• population study data sets;
• assemblies of complete genomes
Page 24
Entrez is a search and retrieval system
that integrates NCBI databases
Page 24
Entrez: An integrated search and retrieval system
BLAST is…
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
• 100,000 searches per day
Page 25
OMIM is…
•Online Mendelian Inheritance in Man
•catalog of human genes and genetic disorders
•edited by Dr. Victor McKusick, others at JHU
Page 25
OMIM record for Presenilin 1 (PSEN1)
• Contents
• Additional info in OMIM
• Associated LocusLink record
• Each record provides a state-of-the-art summary of current knowledge
• External resources
• Extensive references to literature
OMIM Search Results by Titles
alzheimer AND presenilin 1
Entrez Genome: Gene Location
• View of chromosome 14
• Multiple maps: STSs, ESTs, etc.
• Gene name
Integrated View of Chromosome 7
• Entrez Genomes Map Viewer: chromosome 7
• GenBank map, contig map, STS map
• Multiple maps: STSs, ESTs, etc.
Entrez Genome: Gene Location
• Entrez Genomes Map Viewer: chromosome 14 cytogenetic map
• Location of PSEN1 and surrounding genes
Books is…
• searchable resource of on-line books
Page 26
TaxBrowser is…
• browser for the major divisions of living organisms
(archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms
Page 26
Structure site includes…
• Molecular Modelling Database (MMDB)
• biopolymer structures obtained from
the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• vector alignment search tool (VAST)
Page 26
PDB
• Protein Data Bank
– Protein and NA 3D structures
– Sequence present
– YAFFF
PDB
• HEADER
• COMPND
• SOURCE
• AUTHOR
• DATE
• JRNL
• REMARK
• SEQRES
• ATOM COORDINATES
JRNL        TITL 3 FLEXIBILITY
JRNL        REF    J.MOL.BIOL.                   V. 233   139 1993
JRNL        REFN   ASTM JMOBAK  UK ISSN 0022-2836                 0070
REMARK   1
REMARK   2
REMARK   2 RESOLUTION. 3.0  ANGSTROMS.
REMARK   3
REMARK   3 REFINEMENT.
REMARK   3   PROGRAM                        X-PLOR
REMARK   3   AUTHORS                        BRUNGER
REMARK   3   R VALUE                        0.216
REMARK   3   RMSD BOND DISTANCES            0.020  ANGSTROMS
REMARK   3   RMSD BOND ANGLES               3.86   DEGREES
REMARK   3
REMARK   3   NUMBER OF REFLECTIONS          3296
REMARK   3   RESOLUTION RANGE               10.0 - 3.0 ANGSTROMS
REMARK   3   DATA CUTOFF                    3.0    SIGMA(F)
REMARK   3   PERCENT COMPLETION             98.2
REMARK   3
REMARK   3   NUMBER OF PROTEIN ATOMS        456
REMARK   3   NUMBER OF NUCLEIC ACID ATOMS   386
REMARK   4
REMARK   4 GCN4: TRANSCRIPTIONAL ACTIVATOR OF GENES ENCODING FOR AMINO
REMARK   4 ACID BIOSYNTHETIC ENZYMES.
REMARK   5
REMARK   5 AMINO ACIDS NUMBERING (RESIDUE NUMBER) CORRESPONDS TO THE
REMARK   5 281 AMINO ACIDS OF INTACT GCN4.
REMARK   6
REMARK   6 BZIP SEQUENCE 220 - 281 USED FOR CRYSTALLIZATION.
REMARK   7
REMARK   7 MODEL FROM AMINO ACIDS 227 - 281 SINCE AMINO ACIDS 220 -
REMARK   7 226 ARE NOT WELL ORDERED.
REMARK   8
REMARK   8 RESIDUE NUMBERING OF NUCLEOTIDES:
REMARK   8 5' T G G A G A T G A C G T C A T C T C C
REMARK   8 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 1 2 3 4 5 6 7 8 9
REMARK   9
REMARK   9 THE ASYMMETRIC UNIT CONTAINS ONE HALF OF PROTEIN/DNA
REMARK   9 COMPLEX PER ASYMMETRIC UNIT.
REMARK  10
REMARK  10 MOLECULAR DYAD AXIS OF PROTEIN DIMER AND PALINDROMIC HALF
REMARK  10 SITES OF THE DNA COINCIDES WITH CRYSTALLOGRAPHIC TWO-FOLD
REMARK  10 AXIS. THE FULL PROTEIN/DNA COMPLEX CAN BE OBTAINED BY
REMARK  10 APPLYING THE FOLLOWING TRANSFORMATION MATRIX AND
REMARK  10 TRANSLATION VECTOR TO THE COORDINATES X Y Z:
REMARK  10
REMARK  10       0 -1  0     X     117.32     X SYMM
REMARK  10      -1  0  0     Y  +  117.32  =  Y SYMM
REMARK  10       0  0 -1     Z      43.33     Z SYMM
SEQRES   1 A   62  ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG
SEQRES   2 A   62  ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG
SEQRES   3 A   62  LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU
SEQRES   4 A   62  GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL
SEQRES   5 A   62  ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG
SEQRES   1 B   19    T   G   G   A   G   A   T   G   A   C   G   T   C
SEQRES   2 B   19    A   T   C   T   C   C
HELIX    1   A ALA A  228  LYS A  276  1
CRYST1   58.660   58.660   86.660  90.00  90.00  90.00 P 41 21 2     8
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.017047  0.000000  0.000000        0.00000
SCALE2      0.000000  0.017047  0.000000        0.00000
SCALE3      0.000000  0.000000  0.011539        0.00000
ATOM      1  N   PRO A 227      35.313 108.011  15.140  1.00 38.94
Accessing information on
molecular sequences
Page 26
Accession numbers are labels
for sequences
NCBI includes databases (such as GenBank) that contain
information on DNA, RNA, or protein sequences.
You may want to acquire information beginning with a
query such as the name of a protein of interest, or the
raw nucleotides comprising a DNA sequence of interest.
DNA sequences and other molecular data are tagged with
accession numbers that are used to identify a sequence
or other record relevant to molecular data.
Page 26
What is an accession number?
An accession number is a label used to identify a sequence. It is a string of
letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
DNA
• X02775: GenBank genomic DNA sequence
• NT_030059: Genomic contig
• Rs7079946: dbSNP (single nucleotide polymorphism)
RNA
• N91759.1: An expressed sequence tag (1 of 170)
• NM_006744: RefSeq DNA sequence (from a transcript)
Protein
• NP_007635: RefSeq protein
• AAC02945: GenBank protein
• Q28369: SwissProt protein
• 1KT7: Protein Data Bank structure record
Page 27
Four ways to access DNA and
protein sequences
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI)
and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System
(separate from NCBI)
Note: LocusLink at NCBI was recently retired.
The third printing of the book has updated
these sections (pages 27-31).
Page 27
4 ways to access protein and DNA sequences
[1] Entrez Gene with RefSeq
Entrez Gene is a great starting point: it collects
key information on each gene/protein from
major databases. It covers all major organisms.
RefSeq provides a curated, optimal accession
number for each DNA (NM_006744)
or protein (NP_007635)
Page 27
From the NCBI home
page, type “rbp4”
and hit “Go”
Pevsner
Fig. 2.7
Page 29
revised
Fig. 2.7
Page 29
By applying limits, there are now just two entries
GenBank Record
Locus Name
Accession Number
gi Number
Medline ID
Protein Sequence
[rest of protein sequence deleted for brevity]
GenPept ID
[rest of nucleotide sequence deleted for brevity]
Nucleotide Sequence
LOCUS, Accession, NID and protein_id
LOCUS: Unique string of 10 letters and numbers in
the database. Not maintained amongst databases,
and is therefore a poor sequence identifier.
ACCESSION: A unique identifier to that record, citable
entity; does not change when record is updated. A good
record identifier, ideal for citation in publication.
VERSION: New system where the accession and version play the
same function as the accession and gi number.
Nucleotide gi: Geninfo identifier (gi), a unique integer
which will change every time the sequence changes.
PID: Protein Identifier: g, e or d prefix to gi number.
Can have one or two on one CDS.
Protein gi: Geninfo identifier (gi), a unique integer which
will change every time the sequence changes.
protein_id: Identifier which has the same
structure and function as the nucleotide Accession.version
numbers, but a slightly different format.
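The Accession.version convention described above can be handled with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes the version is whatever integer follows the final dot.

```python
def split_accession_version(identifier):
    """Split 'ACCESSION.version' into (accession, version).

    The version is returned as an int, or None when no '.version' suffix
    is present. Hypothetical helper, not part of any NCBI toolkit.
    """
    accession, dot, version = identifier.strip().rpartition(".")
    if not dot:
        # No dot at all: rpartition puts the whole string in 'version'.
        return version, None
    if not accession or not version.isdigit():
        raise ValueError(f"unexpected identifier format: {identifier!r}")
    return accession, int(version)


def same_record(identifier_a, identifier_b):
    """True when two identifiers share an accession, ignoring the version."""
    return (
        split_accession_version(identifier_a)[0]
        == split_accession_version(identifier_b)[0]
    )


if __name__ == "__main__":
    print(split_accession_version("N91759.1"))
    print(same_record("N91759.1", "N91759"))
```

Comparing on the accession alone is appropriate for citing a record; comparing the full Accession.version pair is appropriate when the exact sequence matters.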
Entrez Gene (top of page)
Note that links to
many other RBP4
database entries
are available
revised
Fig. 2.8
Page 30
Entrez Gene (middle of page)
Entrez Gene (bottom of page)
Fig. 2.9
Page 32
FASTA format
Fig. 2.10
Page 32
NCBI’s important RefSeq project:
best representative sequences
RefSeq (accessible via the main page of NCBI)
provides an expertly curated accession number that
corresponds to the most stable, agreed-upon “reference”
version of a sequence.
RefSeq identifiers include the following formats:
Complete genome: NG_######
Complete chromosome: NC_######
Genomic contig: NT_######
mRNA (DNA format): NM_###### e.g. NM_006744
Protein: NP_###### e.g. NP_006735
Page 29-30
NCBI’s RefSeq project: accession for
genomic, mRNA, protein sequences
Accession         Molecule   Method            Note
AC_123456         Genomic    Mixed             Alternate complete genomic
AP_123456         Protein    Mixed             Protein products; alternate
NC_123456         Genomic    Mixed             Complete genomic molecules
NG_123456         Genomic    Mixed             Incomplete genomic regions
NM_123456         mRNA       Mixed             Transcript products; mRNA
NM_123456789      mRNA       Mixed             Transcript products; 9-digit
NP_123456         Protein    Mixed             Protein products;
NP_123456789      Protein    Curation          Protein products; 9-digit
NR_123456         RNA        Mixed             Non-coding transcripts
NT_123456         Genomic    Automated         Genomic assemblies
NW_123456         Genomic    Automated         Genomic assemblies
NZ_ABCD12345678   Genomic    Automated         Whole genome shotgun data
XM_123456         mRNA       Automated         Transcript products
XP_123456         Protein    Automated         Protein products
XR_123456         RNA        Automated         Transcript products
YP_123456         Protein    Auto. & Curated   Protein products
ZP_12345678       Protein    Automated         Protein products
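Because the RefSeq prefix encodes the molecule type, an accession can be classified with a simple lookup. The sketch below copies the prefix-to-molecule pairs from the table above; the function name and the decision to return None for non-RefSeq identifiers are assumptions made for illustration.

```python
# Prefix -> molecule type, taken from the RefSeq accession table.
REFSEQ_MOLECULE = {
    "AC": "Genomic", "AP": "Protein", "NC": "Genomic", "NG": "Genomic",
    "NM": "mRNA", "NP": "Protein", "NR": "RNA", "NT": "Genomic",
    "NW": "Genomic", "NZ": "Genomic", "XM": "mRNA", "XP": "Protein",
    "XR": "RNA", "YP": "Protein", "ZP": "Protein",
}


def refseq_molecule(identifier):
    """Return the molecule type for a RefSeq accession, or None otherwise."""
    accession = identifier.strip().split(".", 1)[0]  # drop any .version suffix
    prefix, underscore, rest = accession.partition("_")
    if not underscore or not rest:
        return None  # no 'XX_' prefix, so not a RefSeq-style accession
    return REFSEQ_MOLECULE.get(prefix.upper())


if __name__ == "__main__":
    for acc in ("NM_006744", "NP_007635", "NT_030059", "X02775"):
        print(acc, refseq_molecule(acc))
```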
Four ways to access DNA and
protein sequences
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI)
and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System
(separate from NCBI)
Page 31
DNA
RNA
protein
complementary DNA
(cDNA)
UniGene
In genetics, complementary DNA (cDNA) is DNA synthesized from a mature mRNA
template in a reaction catalyzed by the enzyme reverse transcriptase.
Fig. 2.3
Page 23
Expressed Sequence Tag
What Are ESTs and How Are They Made?
ESTs are small pieces of DNA sequence (usually 200
to 500 nucleotides long) that are generated by
sequencing either one or both ends of an expressed
gene. The idea is to sequence bits of DNA that
represent genes expressed in certain cells, tissues,
or organs from different organisms and use these
"tags" to fish a gene out of a portion of chromosomal
DNA by matching base pairs. The challenge
associated with identifying genes from genomic
sequences varies among organisms and is
dependent upon genome size as well as the
presence or absence of introns, the intervening DNA
sequences interrupting the protein coding sequence
of a gene.
STS
Sequence-tagged sites (STSs) are operationally
unique sequences. Each one identifies the
combination of primer pairs used in a PCR
assay that generates a mapping reagent which
maps to a single position within the genome.
Also see: http://www.ncbi.nlm.nih.gov/dbSTS/
http://www.ncbi.nlm.nih.gov/genemap/
UniGene: unique genes via ESTs
• Find UniGene at NCBI:
www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many expressed sequence
tags (ESTs), which are DNA sequences (typically
500 base pairs in length) corresponding to the mRNA
from an expressed gene. ESTs are sequenced from a
complementary DNA (cDNA) library.
• UniGene data come from many cDNA libraries.
Thus, when you look up a gene in UniGene
you get information on its abundance
and its regional distribution.
Pages 20-21
Cluster sizes in UniGene
This is a gene with
1 EST associated;
the cluster size is 1
Fig. 2.3
Page 23
Cluster sizes in UniGene
This is a gene with
10 ESTs associated;
the cluster size is 10
Cluster sizes in UniGene (human)
UniGene build 194, 8/06
Cluster size (ESTs)   Number of clusters
1                     42,800
2                     6,500
3-4                   6,500
5-8                   5,400
9-16                  4,100
17-32                 3,300
500-1000              2,128
2000-4000             233
8000-16,000           21
16,000-30,000         8
UniGene: unique genes via ESTs
Conclusion: UniGene is a useful tool to look up
information about expressed genes. UniGene
displays information about the abundance of a
transcript (expressed gene), as well as its regional
distribution of expression (e.g. brain vs. liver).
We will discuss UniGene further later
(gene expression).
Page 31
Four ways to access DNA and
protein sequences
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI)
and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System
(separate from NCBI)
Page 31
Ensembl to access protein and DNA sequences
Try Ensembl at www.ensembl.org for a premier
human genome web browser.
We will encounter Ensembl as we study the human genome,
BLAST, and other topics.
click
human
enter
RBP4
Four ways to access DNA and
protein sequences
[1] Entrez Gene with RefSeq
[2] UniGene
[3] European Bioinformatics Institute (EBI)
and Ensembl (separate from NCBI)
[4] ExPASy Sequence Retrieval System
(separate from NCBI)
Page 33
ExPASy to access protein and DNA sequences
ExPASy sequence retrieval system
(ExPASy = Expert Protein Analysis System)
Visit http://www.expasy.ch/
Page 33
Fig. 2.11
Page 33
Example of how to access sequence data:
HIV-1 pol
There are many possible approaches. Begin at the main
page of NCBI, and type an Entrez query: hiv-1 pol
Page 34
Searching for HIV-1 pol:
Following the “genome” link yields
a manageable three results
Page 34
Example of how to access sequence data:
HIV-1 pol
For the Entrez query: hiv-1 pol
there are about 40,000 nucleotide or protein records
(and >100,000 records for a search for “hiv-1”),
but these can easily be reduced in two easy steps:
--specify the organism, e.g. hiv-1[organism]
--limit the output to RefSeq!
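The two narrowing steps above can also be expressed as a query string for NCBI's E-utilities ESearch service. The sketch below only builds the URL and does not contact NCBI. The database name "nuccore" and the RefSeq filter "srcdb_refseq[prop]" are assumptions about current Entrez conventions and should be checked against the E-utilities documentation; the function name is hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nuccore", refseq_only=False, retmax=20):
    """Build (but do not send) an ESearch request URL.

    refseq_only appends an assumed RefSeq property filter to the term,
    mirroring the "limit the output to RefSeq" step on the slide.
    """
    if refseq_only:
        term = f"({term}) AND srcdb_refseq[prop]"
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{ESEARCH_URL}?{query}"


if __name__ == "__main__":
    # Step 1: restrict by organism; step 2: restrict to RefSeq.
    print(build_esearch_url("hiv-1[organism]"))
    print(build_esearch_url("hiv-1[organism]", refseq_only=True))
```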
Page 34
over 100,000
nucleotide entries
for HIV-1
only 1 RefSeq
Examples of how to access sequence data:
histone
query for “histone”          # results
protein records              21847
RefSeq entries               7544
RefSeq (limit to human)      1108
NOT deacetylase              697
At this point, select a reasonable candidate (e.g.
histone 2, H4) and follow its link to Entrez Gene.
There, you can confirm you have the right gene/protein.
8-12-06
Access to Biomedical Literature
Page 35
PubMed at NCBI
to find literature
information
PubMed is the NCBI gateway to MEDLINE.
MEDLINE contains bibliographic citations
and author abstracts from over 4,600 journals
published in the United States and in 70 foreign
countries.
It has >14 million records dating back to 1966.
Page 35
MeSH is the acronym for "Medical Subject Headings."
MeSH is the list of the vocabulary terms used
for subject analysis of biomedical literature at NLM.
MeSH vocabulary is used for indexing journal articles
for MEDLINE.
The MeSH controlled vocabulary imposes uniformity
and consistency to the indexing of biomedical literature.
Page 35
PubMed search strategies
Try the tutorial (“education” on the left sidebar)
Use boolean queries (capitalize AND, OR, NOT)
lipocalin AND disease
Try using “limits”
Try “Links” to find Entrez information and external resources
Obtain articles on-line via Welch Medical Library
(and download pdf files):
http://www.welch.jhu.edu/
Page 35
1 AND 2: lipocalin AND disease (60 results)
1 OR 2: lipocalin OR disease (1,650,000 results)
1 NOT 2: lipocalin NOT disease (530 results)
Fig. 2.12
Page 34
8/04
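The three boolean operators correspond to set intersection, union, and difference. The sketch below uses small invented sets of article IDs (not real PubMed identifiers) to show why AND narrows a search while OR widens it.

```python
# Hypothetical article IDs matching each term; not real PubMed data.
LIPOCALIN = {101, 102, 103, 104}
DISEASE = {103, 104, 105, 106, 107}


def boolean_query(set_1, operator, set_2):
    """Combine two result sets the way a boolean search combines two terms."""
    if operator == "AND":
        return set_1 & set_2   # articles matching both terms
    if operator == "OR":
        return set_1 | set_2   # articles matching either term
    if operator == "NOT":
        return set_1 - set_2   # articles matching term 1 but not term 2
    raise ValueError(f"unknown operator: {operator!r}")


if __name__ == "__main__":
    for op in ("AND", "OR", "NOT"):
        print(op, sorted(boolean_query(LIPOCALIN, op, DISEASE)))
```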
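Search result versus article contents:
• “globin” is found, “globin” is present in the article: true positive
• “globin” is found, “globin” is absent from the article: false positive (article does not discuss globins)
• “globin” is not found, “globin” is present in the article: false negative (article discusses globins)
• “globin” is not found, “globin” is absent from the article: true negative
8/06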
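The four outcomes can be tallied mechanically once each article is labelled with two booleans: whether the search returned it and whether it actually discusses the topic. The sketch below uses invented labels; the function names are hypothetical.

```python
from collections import Counter


def classify_hit(found, discusses_topic):
    """Name the outcome for one article given the search result and the truth."""
    if found:
        return "true positive" if discusses_topic else "false positive"
    return "false negative" if discusses_topic else "true negative"


def tally(results):
    """Count outcomes for an iterable of (found, discusses_topic) pairs."""
    return Counter(classify_hit(found, truth) for found, truth in results)


if __name__ == "__main__":
    # Invented example: five articles and how a "globin" search treated them.
    example = [(True, True), (True, True), (True, False), (False, True), (False, False)]
    print(tally(example))
```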
Protein sequence motif
is a descriptor of a protein family
• Glutamine amidotransferase class I
[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]
[C is the active site residue]
• Glutamine amidotransferase class II
<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]
[C is the active site residue]
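Motif patterns written in this PROSITE-style notation can be translated into regular expressions and matched against a sequence. The converter below is a minimal sketch covering only the syntax that appears in the two patterns above plus {..} exclusions: x for any residue, [..] for alternatives, (n) or (n,m) for repeats, and < or > for sequence ends. It is not a full implementation of the PROSITE syntax, and the test sequences are invented.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        prefix = suffix = repeat = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if match:                            # repetition such as x(0,11)
            element, repeat = match.group(1), "{" + match.group(2) + "}"
        if element == "x":
            core = "."                       # any residue
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"  # any residue except these
        else:
            core = element                   # single residue or [..] set
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


CLASS_I = "[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]"
CLASS_II = "<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]"

if __name__ == "__main__":
    print(prosite_to_regex(CLASS_I))
    print(prosite_to_regex(CLASS_II))
```

Usage: re.search(prosite_to_regex(CLASS_II), sequence) returns a match object when the motif occurs within the first residues of the sequence, and None otherwise.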
Searching MMDB
Principles of structural alignment
• Dali: http://www.ebi.ac.uk/dali/
Looks for minimal RMSD between Cα atoms.
Calculates Cα-Cα distance matrices, then
identifies the longest alignable segments
• VAST (Vector Alignment Search Tool)
http://www.ncbi.nlm.nih.gov/Structure/
Looks for pairs of secondary structure
elements (α-helices, β-strands) that have
similar orientation and connectivity
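The two quantities named above, the intramolecular distance matrix and the RMSD, are straightforward to compute. The sketch below is not Dali or VAST: it performs no alignment or superposition, the coordinates are invented, and it assumes the atoms of the two structures are already paired in order. It shows one useful property, namely that distance matrices are unchanged by moving a structure, whereas a raw RMSD is not.

```python
import math


def distance_matrix(coords):
    """All pairwise distances between points (e.g. C-alpha atoms) of one structure."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def max_matrix_difference(coords_a, coords_b):
    """Largest absolute difference between two structures' distance matrices."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of atoms")
    matrix_a, matrix_b = distance_matrix(coords_a), distance_matrix(coords_b)
    return max(
        (abs(x - y) for row_a, row_b in zip(matrix_a, matrix_b)
         for x, y in zip(row_a, row_b)),
        default=0.0,
    )


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired atoms, without superposition."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty structures of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Invented coordinates; the second structure is the first shifted along x.
    structure = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
    shifted = [(x + 10.0, y, z) for x, y, z in structure]
    print(max_matrix_difference(structure, shifted))
    print(rmsd(structure, shifted))
```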
Dali alignment of Tyr phosphatase
VAST Structure Neighbors
Structure Summary
BLAST neighbors
VAST neighbors
Cn3D viewer
Cn3D : Displaying Structures
Chloroquine
Structure Neighbors
Use of structural alignments
Chloroquine
NADH
PDB
• HEADER
• COMPND
• SOURCE
• AUTHOR
• DATE
• JRNL
• REMARK
• SEQRES
• ATOM COORDINATES
HEADER    LEUCINE ZIPPER                          15-JUL-93   1DGC
COMPND    GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC
COMPND   2 ATF/CREB SITE DNA
SOURCE    GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC
AUTHOR    T.J.RICHMOND
REVDAT   1   22-JUN-94 1DGC    0
JRNL        AUTH   P.KONIG,T.J.RICHMOND
JRNL        TITL   THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO
JRNL        TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA
JRNL        TITL 3 FLEXIBILITY
JRNL        REF    J.MOL.BIOL.                   V. 233   139 1993
JRNL        REFN   ASTM JMOBAK  UK ISSN 0022-2836                 0070
REMARK   1
REMARK   2
REMARK   2 RESOLUTION. 3.0  ANGSTROMS.
REMARK   3
REMARK   3 REFINEMENT.
REMARK   3   PROGRAM                        X-PLOR
REMARK   3   AUTHORS                        BRUNGER
REMARK   3   R VALUE                        0.216
REMARK   3   RMSD BOND DISTANCES            0.020  ANGSTROMS
REMARK   3   RMSD BOND ANGLES               3.86   DEGREES
REMARK   3
REMARK   3   NUMBER OF REFLECTIONS          3296
REMARK   3   RESOLUTION RANGE               10.0 - 3.0 ANGSTROMS
REMARK   3   DATA CUTOFF                    3.0    SIGMA(F)
REMARK   3   PERCENT COMPLETION             98.2
REMARK   3
REMARK   3   NUMBER OF PROTEIN ATOMS        456
REMARK   3   NUMBER OF NUCLEIC ACID ATOMS   386
REMARK   4
REMARK   4 GCN4: TRANSCRIPTIONAL ACTIVATOR OF GENES ENCODING FOR AMINO
REMARK   4 ACID BIOSYNTHETIC ENZYMES.
REMARK   5
REMARK   5 AMINO ACIDS NUMBERING (RESIDUE NUMBER) CORRESPONDS TO THE
REMARK   5 281 AMINO ACIDS OF INTACT GCN4.
REMARK   6
REMARK   6 BZIP SEQUENCE 220 - 281 USED FOR CRYSTALLIZATION.
REMARK   7
REMARK   7 MODEL FROM AMINO ACIDS 227 - 281 SINCE AMINO ACIDS 220 -
REMARK   7 226 ARE NOT WELL ORDERED.
REMARK   8
REMARK   8 RESIDUE NUMBERING OF NUCLEOTIDES:
REMARK   8 5' T G G A G A T G A C G T C A T C T C C
REMARK   8 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 1 2 3 4 5 6 7 8 9
REMARK   9
REMARK   9 THE ASYMMETRIC UNIT CONTAINS ONE HALF OF PROTEIN/DNA
REMARK   9 COMPLEX PER ASYMMETRIC UNIT.
REMARK  10
REMARK  10 MOLECULAR DYAD AXIS OF PROTEIN DIMER AND PALINDROMIC HALF
REMARK  10 SITES OF THE DNA COINCIDES WITH CRYSTALLOGRAPHIC TWO-FOLD
REMARK  10 AXIS. THE FULL PROTEIN/DNA COMPLEX CAN BE OBTAINED BY
REMARK  10 APPLYING THE FOLLOWING TRANSFORMATION MATRIX AND
REMARK  10 TRANSLATION VECTOR TO THE COORDINATES X Y Z:
REMARK  10
REMARK  10       0 -1  0     X     117.32     X SYMM
REMARK  10      -1  0  0     Y  +  117.32  =  Y SYMM
REMARK  10       0  0 -1     Z      43.33     Z SYMM
SEQRES   1 A   62  ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG
SEQRES   2 A   62  ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG
SEQRES   3 A   62  LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU
SEQRES   4 A   62  GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL
SEQRES   5 A   62  ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG
SEQRES   1 B   19    T   G   G   A   G   A   T   G   A   C   G   T   C
SEQRES   2 B   19    A   T   C   T   C   C
HELIX    1   A ALA A  228  LYS A  276  1
CRYST1   58.660   58.660   86.660  90.00  90.00  90.00 P 41 21 2     8
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.017047  0.000000  0.000000        0.00000
SCALE2      0.000000  0.017047  0.000000        0.00000
SCALE3      0.000000  0.000000  0.011539        0.00000
ATOM      1  N   PRO A 227      35.313 108.011  15.140  1.00 38.94
ATOM      2  CA  PRO A 227      34.172 107.658  15.972  1.00 39.82
...
ATOM    842  C5    C B   9      57.692 100.286  22.744  1.00 29.82
ATOM    843  C6    C B   9      58.128 100.193  21.465  1.00 30.63
TER     844        C B   9
MASTER       46    0    0    1    0    0    0    6  842    2    0    7
END
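ATOM records in a PDB flat file are fixed-width, so fields are read by column position rather than by splitting on whitespace. The sketch below parses the first two ATOM records of 1DGC. The column ranges follow the published PDB format description as far as I am aware and should be checked against the wwPDB format guide; the function name and dictionary keys are hypothetical.

```python
def parse_atom_line(line):
    """Parse one fixed-width PDB ATOM record into a dictionary."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "occupancy": float(line[54:60]),  # columns 55-60
        "b_factor": float(line[60:66]),   # columns 61-66
    }


# The first two ATOM records of entry 1DGC, as shown on the slide.
ATOM_LINES = [
    "ATOM      1  N   PRO A 227      35.313 108.011  15.140  1.00 38.94",
    "ATOM      2  CA  PRO A 227      34.172 107.658  15.972  1.00 39.82",
]

if __name__ == "__main__":
    for record in map(parse_atom_line, ATOM_LINES):
        print(record)
```

Column slicing matters because adjacent numeric fields may touch with no space between them, in which case a whitespace split would merge them.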
UniProt
• New protein sequence database resulting from the
merger of SWISS-PROT and PIR. It will be the
annotated, curated protein sequence database.
• Data in UniProt is primarily derived from coding
sequence annotations in EMBL (GenBank/DDBJ)
nucleic acid sequence data.
• UniProt is a Flat-File database just like EMBL and
GenBank
• Flat-File format is SwissProt-like, or EMBL-like
Swiss-Prot
• SWISS-PROT incorporates:
– Function of the protein
– Post-translational modification
– Domains and sites
– Secondary structure
– Quaternary structure
– Similarities to other proteins
– Diseases associated with deficiencies in the protein
– Sequence conflicts, variants, etc.