INTRODUCING
BIOINFORMATICS DATABASES
Tan Tin Wee/Victor Tong/Susan Moore
Dept of Biochemistry
NUS
Mohammad Asif Khan
Perdana University Graduate School of Medicine
Sources of Biological Knowledge
Past: textbooks, monographs, books, journals.
 Today: online accessible databases
Keyword searchable, e.g. Google.
 Every class of biological molecule has at least a
few databases associated with it.
 Every area of biology, biotechnology, medicine
and life science research will have some kind of
database associated with it.
 Must be aware and familiar with MAJOR
databases
 Must be able to discover NEW databases and
master them as and when they appear.

BIOLOGICAL KNOWLEDGE TODAY!




STORED digitally
Almost all critical biological data, information and knowledge is
currently stored in computers
ACCESSIBLE globally
All current critical biological knowledge is publicly
accessible via the Internet network of computers
SHARED extensively
Most research data is exchanged via the Internet today; if
not publicly and freely, then it is shared among international
collaborators
PUBLISHED online
Most scientific journals are now published with a digital
version accessible online, free open access or for a
subscription fee paid by the individual or by the institution
10 years ago, this was not so.
There has been tremendous change.
UNSTOPPABLE DATA GROWTH
Growth of GenBank: DNA Sequence (2005 – 2009)
>100,000,000 sequences
Exponential Increase
Next Gen Sequencing Technologies
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Growth of PDB: Protein and Macromolecular Structures
Driven by various Structural Genomics initiatives such as
Protein Structure Initiative
http://www.nigms.nih.gov/Initiatives/PSI
JCSG
http://www.jcsg.org/
http://www.pdb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
RELENTLESS INCREASE IN DATABASES
Michael Y. Galperin and Guy R. Cochrane (2009) Nucl. Acids Res. 37:D1-D4 . Nucleic Acids Research annual
Database Issue and the NAR online Molecular Biology Database Collection in 2009 (doi:10.1093/nar/gkn942)
http://nar.oxfordjournals.org/cgi/content/full/37/suppl_1/D1
A lot of data
A lot of databases
What do they mean?
Most of the data begin to
make sense if they are
integrated
But many plans to integrate
these databases have failed
Biological Databases – examples and
general considerations
Biological databases – what they are; purpose
 Some general considerations
 Sample databases

BIOLOGICAL DATABASES
Many (but not all) definitions of “database” include:
- Storage of data on a computer in an organized way
- Provision for searching and data extraction.

By these definitions web pages, books, journal articles, text files,
and spreadsheet files cannot be considered as databases
Purposes of biological databases:
1. To disseminate biological data and information
2. To provide biological data in computer-readable form
3. To allow analysis of biological data
But first…a few terms

www.d.umn.edu/lib/reference/skills/vocab.html

Field: “the part of a record reserved for a
particular type of data…”
www.amberton.edu/VL_terms.htm
Database Record: “A collection of related data,
arranged in fields and treated as a unit. The data
for each [item] in a database make up a record.”
Example from the “Grocery Shopping
Database”:
Fields
A different view of the
first “record”:
A record
Field Values
Date: 18/08/2006
Item: White bread
Store: Dover Provision
Price: $1.29
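As a rough illustration of the terms "record" and "field", the grocery record above can be written as a small Python data structure. This is only a sketch: the class name `GroceryRecord` and its attribute names are invented for this example and do not come from any real database.

```python
from dataclasses import dataclass, fields

# Hypothetical record type for the "Grocery Shopping Database" example.
# Each attribute is one field; one instance is one record.
@dataclass
class GroceryRecord:
    date: str
    item: str
    store: str
    price: float

# The first record from the slide, with its field values.
record = GroceryRecord(date="18/08/2006", item="White bread",
                       store="Dover Provision", price=1.29)

# List the field names, then print each field with its value.
field_names = [f.name for f in fields(record)]
print(field_names)
for name in field_names:
    print(f"{name}: {getattr(record, name)}")
```

Because every record has the same fields, a program can ask for "the price of each item" without reading any free text.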
SOME FEATURES OF BIOLOGICAL
DATABASES

Data/information…





Stored in records according to some predetermined
structure/format
+/- evidence
+/- unique identifiers
+/- additional annotation
+/- DB Xrefs (cross references)
AUTHORITATIVE AND RELIABLE





Most biological databases are from authoritative and
reliable sources, however…
Not all Websites and Databases are reliable.
Not all data and information stored in authoritative
and reliable websites or databases are accurate or
correct, or up-to-date
Nevertheless, most of them are useful and instructive
Many of them contain valuable information and
knowledge
Identification of authority and
Evaluation of reliability – very important
Every serious scientist must be critical of the
information they read, whether online or not.
DISCOVERABILITY
Most publications, books and courses include
online references – Web address (URL)
e.g. http://www.pdb.org/ for protein structural
data
 Most useful resources are also listed and taught
in courses, or spread by word of mouth.
 Most databases are searchable by appropriate
keywords and their authority determined by
their web addresses, the institutions behind the
databases or the authors’ reputation
Most databases have full details of their content and
how to use them.

NAR DATABASE CATEGORIES LIST
From: http://nar.oxfordjournals.org
TABLE OF NAR DATABASES ISSUE
http://en.wikipedia.org/wiki/Biological_database
http://www.oxfordjournals.org/nar/database/c/















Nucleotide Sequence Databases
RNA sequence databases
Protein sequence databases
Structure Databases
Genomics Databases (non-vertebrate)
Metabolic and Signaling Pathways
Human and other Vertebrate Genomes
Human Genes and Diseases
Microarray Data and other Gene Expression Databases
Proteomics Resources
Other Molecular Biology Databases
Organelle databases
Plant databases
Immunological databases
Bibliographic databases
DATABASE OF BIOLOGICAL DATABASES
Alphabetical order
http://www.oxfordjournals.org/nar/database/a/
 Category
http://www3.oup.co.uk/nar/database/cap/

Information flow in
Biology





Human Genome Project –
DNA sequence
Microarray – RNA
expression and levels
Proteomics – protein
expression and
concentration in cells
Structural proteomics or
genomics – protein
structure (and function)
Functional genomics – protein function
EXAMPLES OF
MAJOR BIOINFORMATICS RESOURCES

Browsing databases
NCBI Entrez
http://www.ncbi.nlm.nih.gov/sites/gquery
 EBI Ensembl
http://www.ensembl.org/index.html


Retrieving sequences


SRS - Sequence Retrieval System
http://srs.ebi.ac.uk/
ExPASy – Expert Protein Analysis System –
Proteomics server

http://au.expasy.org/
BIBLIOGRAPHIC INFORMATION

PubMed and Medline

Recent National Institutes of Health USA policy
Google Scholar
 Web of Science and Science Citation Index
 Online journals



SuperTier Top Journals – Nature, Science, Cell,
PNAS, etc.
Open access journals
Public Library of Science
PLoS
 Biomed Central

LITERATURE - PUBMED
Citations and abstracts for articles from approx. 5000
(not all!) biomedical journals
 Text searching to identify citations of interest
 Links to full-text articles (free or otherwise)
 More than 16,000,000 records*

* 16,000,000 as of Dec 29 2005. PubMed News.
http://www.ncbi.nlm.nih.gov/feed/rss.cgi?ChanKey=PubMedNews
Literature – PubMed
p53 cancer
Authors
Article Title
Bibliographic Information
(Journal name, date, volume, issue, page
numbers)
PMID: Unique ID for this record
ABSTRACTPLUS VIEW - PUBMED
STORING YOUR OWN BIBLIOGRAPHIC INFORMATION
Online Wizfolio: http://www.wizfolio.com
Software: ENDNOTE or REFMAN
GENETIC AND GENOMIC DATABASES
From sequencing of specific genes or genomic
sequence of entire genomes
 Data are prepared, annotated and stored in
databases

Genbank, NCBI
 DDBJ, NIG
 EBI/EMBL


Making Deposits
http://www.ncbi.nlm.nih.gov/Genbank/update.html


Bankit
Sequin
NUCLEIC ACID DATABASES

Include:
 GenBank
 DDBJ
 EMBL
•Archives of Primary data
•Exchange data amongst themselves
RefSeq
Summary/Integration of primary data
GENBANK

Data from:
• Individual laboratories
• Sequencing centres
Any organism
Individual records may be incomplete or inaccurate
• Eg: sequencing errors
• Eg: incomplete sequences
NCBI Handbook

SEARCHING ENTREZ NUCLEOTIDE FOR
HUMAN P53
P53
GENBANK RECORD: GI 48094186
P53
GENBANK RECORD: HEADER
Organismal Source
Data sources
Identifiers, Version, Definition Line
P53
GENBANK RECORD: FEATURES
Cross-References to Other DBs
Protein product
P53
GENBANK RECORD: SEQUENCE
THE LINKED PROTEIN RECORD:
GENBANK → GENPEPT
LINKS FROM P53 GENPEPT RECORD
Available links vary from one
record to another
WITH SO MANY RECORDS HOW DO WE
KNOW WHICH ONE TO WORK WITH?


They may:
• Come from different source databases
  eg DDBJ, GenBank, EMBL (nucleotide)
• Have the same or different sequence information
  Single changes in nucleotides/amino acids
  Incomplete sequence
• Have variable extra annotation
  Eg: Signal peptide; domains; DB XRefs etc
THE REFSEQ PROJECT

http://www.ncbi.nlm.nih.gov/RefSeq/index.html

Info from:
Predictions from genomic sequence
 Analysis of GenBank Records
 Collaborating databases

Goal: a “comprehensive, integrated, nonredundant set of sequences, including genomic
DNA, transcript (RNA), and protein products,
for major research organisms.”
REFSEQ:
EXAMPLE: P53 REFSEQ MRNA RECORD
P53 REFSEQ MRNA FEATURES
P53 REFSEQ MRNA FEATURES CONTINUED
p53 RefSeq mRNA features include…

Links:
• GeneID – locus and display of genomic, mRNA and protein sequences; extensive additional annotation
• OMIM – Online Mendelian Inheritance in Man – disease information
• CDD – conserved protein domain
• HGNC – official nomenclature for human genes
• HPRD – Human Protein Reference Database

CDS (CoDing Sequence)
• Gene Ontology terms applied to the protein
• Nucleotide sequence range of translated product
• Translation – the protein sequence
• Link to RefSeq Protein record

Other features – sequence ranges refer to the nucleotide
• Nuclear Localization Signal
• Polyadenylation site etc
P53 REFSEQ PROTEIN
P53 REFSEQ PROTEIN CONTINUED
Sequence ranges in features refer to the amino acid sequence
INTERPRETING REFSEQ IDENTIFIERS
Genomic DNA
NC_123456 - complete genome, complete chromosome, complete
plasmid

NG_123456 - genomic region

NT_123456 - genomic contig
mRNA - NM_123456
Protein - NP_123456
Gene and protein models from genome annotation
projects:

XM_123456 - mRNA

XR_123456 - RNA (non-coding transcripts)

XP_123456 - protein
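The prefix rules above are regular enough to check in code. The sketch below copies only the prefixes listed on this slide into a lookup table; the function name `describe_refseq` is made up for illustration, and prefixes not shown on the slide are deliberately reported as unknown rather than guessed.

```python
# Prefix meanings taken from the list above; nothing beyond the slide is assumed.
REFSEQ_PREFIXES = {
    "NC": "genomic DNA: complete genome, chromosome or plasmid",
    "NG": "genomic DNA: genomic region",
    "NT": "genomic DNA: genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "XM": "model mRNA from genome annotation",
    "XR": "model RNA (non-coding transcript) from genome annotation",
    "XP": "model protein from genome annotation",
}

def describe_refseq(accession: str) -> str:
    """Return the molecule type implied by a RefSeq accession prefix."""
    prefix, separator, _rest = accession.partition("_")
    if not separator or prefix.upper() not in REFSEQ_PREFIXES:
        return "not a RefSeq prefix covered by this table"
    return REFSEQ_PREFIXES[prefix.upper()]

print(describe_refseq("NM_123456"))
print(describe_refseq("XP_123456"))
print(describe_refseq("48094186"))  # a GI number has no RefSeq prefix
```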

REFSEQ STATUS
Most confident
• Validated
• Reviewed
• Provisional
--------------
• Predicted
• Model
• Inferred
• Genome Annotation
Least confident
Protein Database – Swiss-Prot
SWISS-PROT
• A curated database of protein sequences
• Trained biologists extract and analyze relevant evidence from scientific publications
  Post translational modifications, sequence variations, functions, etc
TrEMBL = Translated EMBL
UniProtKB = Swiss-Prot + TrEMBL
STRUCTURES: PDB

Three-dimensional structures of biomolecules
Image: Eric Martz
RasMol Gallery. http://www.umass.edu/microbio/rasmol/galmz.htm
(Accessed Aug 16, 2006)
PDB
RESULTS SUMMARY PAGE
PDB – Structure Summary
PDB STRUCTURE SUMMARY CONTINUED
INTERACTIONS: BIND

Physical and genetic interaction data



p53
AP2Alpha

Curated from published experimental evidence
All organisms
Physical interactions span all molecule types:
 Protein-Protein
 Protein-RNA
 Protein-DNA
 Protein-Small Molecule
 Etc
Details characterizing the interaction – eg binding
sites
p53 protein-protein interactions in
BIND – query results
A BIND INTERACTION RECORD
BIND INTERACTION STATISTICS - PROTEIN
FUNCTION AND PATHWAYS DATABASES - KEGG
Several interconnected databases including:
• PATHWAY contains info on metabolic and regulatory networks.
  • 40,568 pathways generated from 301 reference pathways
• GENES contains information on genes and proteins.
• LIGAND contains information on chemical compounds and reactions involved in cellular processes.
SEARCHING KEGG GENES
LINKING FROM GENE TO PATHWAYS
KEGG HUMAN CELL CYCLE PATHWAY
SUMMARY: Biological databases –
examples and general considerations


Scope and sample records from selected databases:
Pubmed, Genbank, RefSeq, PDB, BIND, KEGG
Primary archival databases vs. derived databases
Relative numbers of database records

Pubmed > RefSeq > Interactions > Structures > Reference Pathways

EXTRACTING DATA FROM THE DATABASES
Databases have variable means of accessing and
working with the data







Keyword (simple) searches
+/- query by ID (eg PMID)
+/- advanced queries – Boolean; field-specific
+/- different views of the data
+/- ways to export or store your results
+/- visualization
Getting the data
A PROBLEM WITH KEYWORD SEARCHES
Matches in potentially irrelevant parts of the
record
 Eg: if we ONLY want records describing the
sequence of p53 and we do a keyword search
of Entrez Nucleotide with p53:

p53 is mentioned in a GeneRIF.
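The problem can be reproduced on a toy dataset. In the sketch below the two records, their field names and their contents are all invented for illustration; they only mimic the situation where "p53" appears in a comment rather than in the description of the sequence itself.

```python
# Two made-up records; field names and contents are illustrative only.
records = [
    {"id": "rec1",
     "definition": "Homo sapiens tumor protein p53 (TP53), mRNA",
     "comment": ""},
    {"id": "rec2",
     "definition": "Homo sapiens hypothetical gene X, mRNA",
     "comment": "GeneRIF: expression is regulated by p53"},
]

def keyword_search(recs, term):
    """Match the term anywhere in any field of the record."""
    term = term.lower()
    return [r["id"] for r in recs
            if any(term in str(value).lower() for value in r.values())]

def field_search(recs, field, term):
    """Match the term only inside one named field."""
    term = term.lower()
    return [r["id"] for r in recs if term in r[field].lower()]

print(keyword_search(records, "p53"))              # both records match
print(field_search(records, "definition", "p53"))  # only the p53 sequence record
```

Restricting the search to one field is the idea behind the field-specific queries shown next.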
Biological databases:
PDB Advanced Query – Field Specific
PDB Advanced Query – Boolean Field Specific Query
“Match ALL of the following
conditions”
Molecule Name: p53 AND Ligand Name: zinc
PDB CUSTOM REPORTS
SELECTING FIELDS FOR PDB CUSTOM REPORTS
PDB CUSTOM REPORT
PDB Custom Report
– Save report in CSV (comma separated
value)
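Once a report is saved as CSV it can be read and written with ordinary tools. The following is a minimal sketch using Python's standard `csv` module; the rows and column names are placeholders invented here and are not the actual fields of a PDB custom report.

```python
import csv
import io

# Made-up report rows; the column names are placeholders.
rows = [
    {"structure_id": "XXXX", "molecule_name": "example protein A", "ligand_name": "zinc"},
    {"structure_id": "YYYY", "molecule_name": "example protein B", "ligand_name": ""},
]

def to_csv(report_rows, columns):
    """Write the rows as comma separated values and return the text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(report_rows)
    return buffer.getvalue()

csv_text = to_csv(rows, ["structure_id", "molecule_name", "ligand_name"])
print(csv_text)

# Reading the text back recovers one dictionary per row.
loaded = list(csv.DictReader(io.StringIO(csv_text)))
print(len(loaded))
```

An in-memory buffer is used here so the example has no side effects; replacing it with an opened file would save the report to disk.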
VIEWS: GENBANK FLAT FILE VS
FASTA
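The FASTA view is the simpler of the two: each entry is a header line starting with ">" followed by one or more lines of sequence. A minimal parser, written here as a sketch with made-up sequences (the function name `parse_fasta` is ours), turns such text into a small table of headers and sequences.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    entries = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries

# Made-up sequences, not real database entries.
example = """>seq1 made-up example sequence
ATGGAGGAGC
CGCAGTCAGA
>seq2 another made-up sequence
TTGACC
"""

table = parse_fasta(example)
for header, sequence in table:
    print(header.split()[0], len(sequence))
```

The flat file view carries far more annotation and needs a correspondingly richer parser, which is why FASTA is the usual choice when only the sequence is needed.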
GETTING THE DATA


Large-scale analyses → large-scale data retrieval
• Querying through the web interface may be ineffective
• Some DBs also have programming interfaces
• Many DBs also store their data at their FTP sites
  • can download entire datasets for programmatic manipulation
  • Eg: Flat Files → parse → into tables
EG OF KEGG API

http://www.genome.jp/kegg/soap/
EXTRACTING THE DATA - SUMMARY
Understanding database records allows us to
query more effectively
 Saving our results allows us to manipulate them
offline
 Different views are suited for different purposes
 The web interface is not the only way to extract
data

Limitations of databases…
May have redundant information
 May be incomplete
 May have errors
 May not be actively updated



Including new data
Including corrections to old data
Including updates of info from other DBs

WHERE DOES THE DATA AND
ADDITIONAL ANNOTATION COME FROM?

Direct deposition of data
• eg PDB, Genbank
Manually extracted from the literature
• eg BIND, SwissProt, old Genbank
Text-mining
• Automatically extracting biological information from the literature using computer programs
Electronic Annotation
• Eg automated assignment of GO terms to proteins based on sequence similarity
All can be +/- human validation

REDUNDANCY AND INCOMPLETENESS IN
BIOLOGICAL DATABASES

We’ve already seen redundancy WITHIN
databases…

Eg multiple entries for human p53 in Genbank
And…there can be multiple databases for a
single data type.
Eg: SwissProt – RefSeq Protein
 Eg: BIND – MINT – DIP – HPRD etc


REDUNDANCY AND INCOMPLETENESS:

Overlap of human protein-protein interactions between
2 databases
HPRD: Human
Protein Reference
Database
24385 interactions
• 73% overlap

DIP: the Database of
Interacting Proteins
1049 interactions
(Gandhi et al, Nature Genetics 38, 285-293 (2006))
It is likely that NEITHER of these databases is
complete
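How such an overlap is counted can be sketched with Python sets. The interaction lists below are invented and tiny; they are not HPRD or DIP data, and the choice to express the overlap as a fraction of the smaller database is an assumption made for this example.

```python
def interaction(a, b):
    """An undirected interaction: A-B is the same pair as B-A."""
    return frozenset((a, b))

# Made-up contents for two hypothetical databases.
db_large = {interaction("P1", "P2"), interaction("P2", "P3"),
            interaction("P3", "P4"), interaction("P4", "P5")}
db_small = {interaction("P2", "P1"), interaction("P3", "P4"),
            interaction("P6", "P7")}

shared = db_large & db_small        # present in both databases
only_small = db_small - db_large    # missing from the larger database
fraction_of_small = len(shared) / len(db_small)

print(len(shared), len(only_small), round(fraction_of_small, 2))
```

Even in this toy case the larger database does not contain everything in the smaller one, which is the point of the slide: size alone does not make a database complete.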
INCOMPLETE PRIMARY DATABASES
Not all published observations have been curated
into databases
 Not all experimental observations have been
published
 Not all experiments have been conducted yet
 Not all experiments can make all possible
observations

STRIVING FOR COMPLETE DATASETS
AND MINIMAL REDUNDANCY
Eg GenBank – DDBJ – EMBL
Eg for interaction data:
• IMEx – International Molecular Exchange consortium
  BIND, DIP, MINT, IntAct, MPact, BioGRID
• Share curation (data entry + checking) workload in a non-redundant fashion
• Curate according to common standards
• Exchange data using the PSI MI exchange format
ERRORS
Can include, but are not limited to:

Typographical errors
Impact on keyword searches??
Incorrect interpretation of source data
 Experimental errors
 Text mining not validated by a human
 Incorrect automated analysis


(eg predicting mRNAs from genomic sequence)

ERROR EXAMPLE: A RETRACTED RECORD

GI 4504946, RefSeq mRNA for water buffalo
alpha-lactalbumin:
RETRACTED RECORD:
NM_002289.1 VS NM_002289.2
Re-interpretation of genomic sequence → different mRNA, same protein
SO:
1) Search multiple databases
2) Where possible, verify by consulting the evidence
   Eg: For crucial information – go back to the original publication
3) Keep a record of unique IDs, DB versions used
4) Search using appropriate identifiers (eg from CVs – see next section) where possible
STANDARD NOMENCLATURE AND
CONTROLLED VOCABULARIES
Standard Nomenclature
 Limited computational value of free text
 CVs and Ontologies - Definitions

Grocery Shopping example
 Gene Ontology
 NCBI Taxonomy

STANDARD NOMENCLATURE
A plague of biology: many names for the same unique biological
objects
MAPK14: CSBP1, CSBP2, CSPB1, EXIP…SAPK2A, p38, p38ALPHA
MAPK1: ERK, ERK2, ERT1….PRKM2, p38, p40, p41, p41mapk
GRAP2: RP3-370M22.1, GADS, GRAP-2, GRB2L, GRBLG, GRID, GRPL,
GrbX, Grf40, Mona, P38
AHSA1: AHA1, C14orf3, p38
- Imagine a PubMed search on p38.
- Imagine a sequence database search on p38.
TP53: LFS1, TRP53, p53
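The effect of shared aliases on a search can be made concrete. The sketch below uses only the alias lists printed on this slide (the elided parts marked "…" are left out, not filled in); the function name `genes_for_alias` is made up for illustration.

```python
# Alias lists as printed on the slide; the slide's "…" gaps are not filled in.
ALIASES = {
    "MAPK14": ["CSBP1", "CSBP2", "CSPB1", "EXIP", "SAPK2A", "p38", "p38ALPHA"],
    "MAPK1": ["ERK", "ERK2", "ERT1", "PRKM2", "p38", "p40", "p41", "p41mapk"],
    "GRAP2": ["RP3-370M22.1", "GADS", "GRAP-2", "GRB2L", "GRBLG", "GRID",
              "GRPL", "GrbX", "Grf40", "Mona", "P38"],
    "AHSA1": ["AHA1", "C14orf3", "p38"],
    "TP53": ["LFS1", "TRP53", "p53"],
}

def genes_for_alias(alias):
    """Return every official symbol whose alias list contains this name.

    The comparison ignores case, so "P38" and "p38" count as the same alias.
    """
    wanted = alias.lower()
    return sorted(symbol for symbol, names in ALIASES.items()
                  if wanted in (name.lower() for name in names))

print(genes_for_alias("p38"))  # four different genes share this alias
print(genes_for_alias("p53"))  # a single gene
```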
STANDARD NOMENCLATURE
1) Use identifiers!
   When referring to database entries that you have used
   When querying databases.
2) Include the “official gene name” when describing your research
   eg HUGO Gene Nomenclature Committee approved name: HGNC:1189, “AHSA1”
LIMITED COMPUTATIONAL VALUE OF
FREE TEXT

Text mining vs human interpretation of free text:
Humans win!
Which would be easier for a computer to assess?

1.
MoleculeA was found in the cytoplasm, whereas
MoleculeB was not; rather it continued to accumulate in
the nucleus.
2.
MoleculeACellPlace: Cytoplasm
MoleculeBCellPlace: Nucleus
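The second form is the easy one for a computer, as a few lines of Python show. This sketch parses the two "name: value" lines exactly as written above; the function name `parse_fields` is ours. No comparably short program would reliably extract the same facts from sentence 1.

```python
# The structured form from the slide, one "name: value" pair per line.
structured = """MoleculeACellPlace: Cytoplasm
MoleculeBCellPlace: Nucleus"""

def parse_fields(text):
    """Turn 'name: value' lines into a dictionary, skipping other lines."""
    result = {}
    for line in text.splitlines():
        name, separator, value = line.partition(":")
        if separator:
            result[name.strip()] = value.strip()
    return result

locations = parse_fields(structured)
print(locations)

# A simple question that is now trivial to answer by program.
in_nucleus = [name for name, place in locations.items() if place == "Nucleus"]
print(in_nucleus)
```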
CONTROLLED VOCABULARIES AND
ONTOLOGIES
 Define:
controlled vocabulary (CV)
“a set of official descriptors assigned to a particular entry
in a database, illustrating the relationship between
synonyms and preferred usage terms.”
truncated from: www.library.appstate.edu/tutorial/glossary/glossary.html
Define:
ontology
“specification of a conceptualisation of a knowledge
domain. An ontology is a controlled vocabulary that
describes objects and the relations between them in a
formal way…”
truncated from:
members.optusnet.com.au/~webindexing/Webbook2Ed/glossary.htm
These terms are sometimes used interchangeably in bioinformatics
We will think of ontologies as “hierarchical CVs that specify
relationships”
Example from the “grocery shopping
database”

CV: we might use different words for the same
thing

Bread – le pain – das Brot
Ontology: we can formally classify our concepts of
bread

A sample (and simple) ontology for bread
others are possible…
Breakfast cereal
Grain product
Bread
Synonyms:
Le pain, das Brot
Loaf bread
Flat bread
White bread
Synonym:
WonderBread
Roti Prata
Pita
Naan
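The bread example can be written down as two small tables: one mapping synonyms to a preferred term (the CV part) and one linking each term to its parent (the ontology part). The parent links below are an assumed reading of the diagram, which is one of several possible, and the function names are made up for this sketch.

```python
# Synonym -> preferred term (the controlled vocabulary part).
SYNONYMS = {"le pain": "Bread", "das Brot": "Bread", "WonderBread": "White bread"}

# Child -> parent links (the ontology part); an assumed reading of the diagram.
PARENT = {
    "Breakfast cereal": "Grain product",
    "Bread": "Grain product",
    "Loaf bread": "Bread",
    "Flat bread": "Bread",
    "White bread": "Loaf bread",
    "Roti Prata": "Flat bread",
    "Pita": "Flat bread",
    "Naan": "Flat bread",
}

def preferred(term):
    """Replace a synonym by its preferred term; other terms pass through."""
    return SYNONYMS.get(term, term)

def ancestors(term):
    """List the parents of a term, from the nearest up to the root."""
    chain = []
    current = preferred(term)
    while current in PARENT:
        current = PARENT[current]
        chain.append(current)
    return chain

print(preferred("das Brot"))
print(ancestors("WonderBread"))
print(ancestors("Naan"))
```

With this in place, "WonderBread" and "Naan" can both be recognised as kinds of bread even though neither record contains the word.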
THE GENE ONTOLOGY

What: A database of terms to describe gene (or
gene product) information

Terms are applied to gene products

Why:
1) So we can use a common language to describe the
same biological observations
 2) So that we can compute on these observations

GO – the 3 aspects of describing genes
and their products

Cellular Component
~ where it is in the cell (can also be extracellular) – eg
nucleus
Molecular Function
~ actions of the gene product at a molecular level – eg
catalysis, binding
Biological Process
~ biological events mediated by ordered assemblies of
molecular functions – eg signal transduction
An Introduction to the Gene Ontology.
http://www.geneontology.org/GO.doc.shtml
Increasingly specific terms within
each “aspect”
Parent terms
(less granular)
Child terms
(more granular)
P53
CELLULAR COMPONENT ANNOTATION
FROM REFSEQ PROTEIN RECORD







In general, the most specific (most “granular”)
term possible is applied, given the evidence.

cytoplasm [pmid 7720704];
insoluble fraction [pmid 12915590];
mitochondrion [pmid 12667443];
nuclear matrix [pmid 11080164];
nucleolus [pmid 12080348];
nucleoplasm [pmid 11080164] [pmid 12915590];
nucleus [pmid 7720704]
A GO TERM RECORD: NUCLEAR MATRIX
“Nuclear matrix” in tree view
(QuickGO)
Less granular
term
GROUPING ENTRIES BY LESS GRANULAR
TERMS
Eg: ProteinA → nucleus
ProteinB → nuclear matrix
ProteinC → mitochondria
Can group by A COMMON PARENT TERM:
“Intracellular Membrane-Bound Organelle”
EG PDB: BROWSE BY CELLULAR
COMPONENT
“Show me all of the structures where one or more of the molecules can
reside in an intracellular membrane-bound organelle”
NCBI TAXONOMY: HOMO SAPIENS
NCBI TAXONOMY: CLASS MAMMALIA
GROUPING ENTRIES BY LESS GRANULAR
TERMS: NCBI TAXONOMY

Eg2: Trying to draw global patterns about host-virus interactions
“Find me all of the protein-protein interactions
where one protein is from a virus and the other
protein is from a mammal”
VALUE OF CVS/ONTOLOGIES IN
QUERYING

Grouping observations at multiple levels
• Eg: return all the records that have the GO term
  “nucleus”, or any of its child terms
Increased ability to retrieve specific records
• Eg: return all the records that have the exact GO
term “Nuclear Matrix” in them
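Both kinds of query can be sketched on a toy dataset. The parent links below are a simplified assumption: real GO distinguishes several relationship types and allows more than one parent per term, whereas here each term has a single parent. The protein annotations and function names are hypothetical.

```python
# Simplified, assumed parent links between cellular component terms.
PARENT = {
    "nuclear matrix": "nucleus",
    "nucleolus": "nucleus",
    "nucleoplasm": "nucleus",
    "nucleus": "intracellular membrane-bound organelle",
    "mitochondrion": "intracellular membrane-bound organelle",
}

# Hypothetical annotations: protein -> most specific term applied.
ANNOTATIONS = {
    "ProteinA": "nucleus",
    "ProteinB": "nuclear matrix",
    "ProteinC": "mitochondrion",
}

def is_same_or_descendant(term, ancestor):
    """Walk up the parent links from term until ancestor or the root is reached."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False

def exact_query(term):
    """Records annotated with exactly this term."""
    return sorted(p for p, t in ANNOTATIONS.items() if t == term)

def grouped_query(term):
    """Records annotated with this term or any of its child terms."""
    return sorted(p for p, t in ANNOTATIONS.items()
                  if is_same_or_descendant(t, term))

print(exact_query("nuclear matrix"))
print(grouped_query("nucleus"))
print(grouped_query("intracellular membrane-bound organelle"))
```

A plain keyword search for "nucleus" would miss ProteinB, whose record only says "nuclear matrix"; the hierarchy is what lets the grouped query find it.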
SUMMARY OF L2

Contents of some databases:
PubMed, GenBank, RefSeq, etc

Databases have limitations
 Understanding database records allows us to query
more effectively
 Controlled Vocabularies can make queries more
powerful
