Essential Bioinformatics and Biocomputing
(LSM2104: Section I)
Lecture 3: More biological databases, retrieval systems and database searching
Biological databases
Function and pathways databases - KEGG
KEGG (http://www.genome.ad.jp/kegg/kegg2.html) links genetic information
with cellular functions. It supports keyword searches and searches over
pre-calculated sequence comparisons.
It consists of several interconnected databases:
1. PATHWAY contains information on metabolic and regulatory networks.
2. GENES contains information on genes and proteins.
3. LIGAND contains information on chemical compounds and reactions involved in
   cellular processes.
4. EXPRESSION and BRITE contain micro-array gene expression data.
5. SSDB helps identify protein-coding genes.
It has an integrated database retrieval system: DBGET.
KEGG web-page
KEGG Human TCA Pathway
Biological databases: BIND
Biomolecular Interaction Network Database http://www.bind.ca/
1. Stores descriptions of interactions, molecular complexes and
   pathways.
2. Provides search tools.
PreBIND locates literature sources. It answers requests such as:
"Show me a list of all of the papers in PubMed that are about my protein of
interest. Then classify all of these papers and tell me which ones are likely
to contain interaction information.
Finally, identify all of the other proteins mentioned in these papers and
indicate whether these proteins might interact with my protein of interest."
Bader GD, et al. BIND The Biomolecular Interaction Network Database.
Nucleic Acids Res. 2001 29(1):242-5.
3. Its BLAST tool searches the BIND database for similarity to a query
   sequence.
BIND is at the forefront of proteomics efforts and is expected
to grow as large-scale proteomic data become available.
BIND website
BIND Statistics
Database                          Record Count
Interaction Database              11255
Biomolecular Pathway Database     8
Molecular Complex Database        851
Organisms represented             12
GI Database                       4651
DI Database                       0
Publication Database              428
Protein family/domain databases: Sequence alignment
Pfam (http://www.sanger.ac.uk/Software/Pfam/)
• Pfam is a collection of multiple protein sequence
  alignments and statistical models that can be
  used to classify protein families and domains.
• Descriptions of protein domains:
  – Given an established SWISSPROT sequence, Pfam
    shows the pre-computed domain structure of the protein.
  – Given a completely new protein sequence, Pfam
    computes a domain structure.
Pfam Web-site
Protein family/domain databases: Sequence patterns
PROSITE (http://ca.expasy.org/prosite/)
1. PROSITE is a database of protein families and domains. It consists of
   biologically significant sites, patterns and profiles that help to
   reliably identify to which known protein family (if any) a new
   sequence belongs.
2. It currently contains patterns and profiles specific for more
   than a thousand protein families or domains.
An example of a pattern (motif):
W-x(9,11)-[VFY]-[FYW]-x(6,7)-[GSTNE]-[GSTQCR]-[FYW]-x(2)-P
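The pattern notation above maps closely onto regular expressions: `x` stands for any residue, square brackets list the allowed residues, and a number in parentheses gives a repeat count or range. The following is a minimal sketch of that translation, not PROSITE's own scanning software. The function name `prosite_to_regex` and the test sequence are made up for illustration.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles the common elements only: x (any residue), [..] (allowed
    residues), {..} (forbidden residues), (n) / (n,m) repeats and the
    < and > sequence-end anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2) -> x{2}
        element = element.replace("x", ".")                      # x -> any residue
        element = element.replace("<", "^").replace(">", "$")   # end anchors
        parts.append(element)
    return "".join(parts)

motif = "W-x(9,11)-[VFY]-[FYW]-x(6,7)-[GSTNE]-[GSTQCR]-[FYW]-x(2)-P"
regex = prosite_to_regex(motif)

# Synthetic sequence built by hand to satisfy the pattern (not a real protein).
synthetic = "W" + "A" * 10 + "VF" + "A" * 6 + "GSW" + "AA" + "P"
print(regex)
print(bool(re.search(regex, synthetic)))
```

A match found this way only says that the sequence fits the pattern; whether the match is biologically meaningful still has to be judged separately.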
Protein sequence motif databases: PROSITE
• A profile is a matrix derived from multiple alignments
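To make the idea of a matrix derived from a multiple alignment concrete, the sketch below counts how often each residue occurs in each alignment column. It is a simplification: real profiles also use residue weights and gap penalties, which are omitted here. The function name `frequency_profile` and the four-sequence alignment are invented for illustration.

```python
from collections import Counter

def frequency_profile(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):          # iterate column by column
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile

# Toy alignment, invented for illustration only.
toy_alignment = ["ACDE", "ACDF", "GCDE", "ACEE"]
profile = frequency_profile(toy_alignment)
for position, frequencies in enumerate(profile, start=1):
    print(position, frequencies)
```

A fully conserved column (position 2 here) ends up with a single residue at frequency 1.0, while variable columns spread the frequency over several residues.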
PROSITE web-site
Biological data retrieval systems: Entrez
http://www.ncbi.nlm.nih.gov/Database/index.html
1. A retrieval system for searching a number of inter-connected databases
at the NCBI. It provides access to:
– PubMed: The biomedical literature (Medline)
– Genbank: Nucleotide sequence database
– Protein sequence database
– Structure: three-dimensional macromolecular structures
– Genome: complete genome assemblies
– PopSet: population study data sets
– OMIM: Online Mendelian Inheritance in Man
– Taxonomy: organisms in GenBank
– Books: online books
– ProbeSet: gene expression and microarray datasets
– 3D Domains: domains from Entrez Structure
– UniSTS: markers and mapping data
– SNP: single nucleotide polymorphisms
– CDD: conserved domains
2. Entrez allows users to perform various searches.
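Entrez searches can also be issued programmatically. The sketch below only builds a query URL for NCBI's E-utilities `esearch` endpoint, which is an assumption on our part, since the lecture itself refers only to the web interface. No network request is made. The helper name `esearch_url` and the example search term are illustrative.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base address of the NCBI E-utilities (assumed here; not given in the lecture).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an esearch URL for one Entrez database and one search term."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"

# Example term using Entrez field tags; adjust to the question being asked.
term = "IL10[Gene Name] AND Homo sapiens[Organism]"
url = esearch_url("protein", term)
print(url)
```

Changing the `db` argument (for example to `pubmed` or `nucleotide`) sends the same kind of query to a different Entrez database.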
Entrez interface
Biological data Retrieval systems: SRS
http://srs.ebi.ac.uk/
• SRS is a retrieval system for searching several linked databases at the
EBI. Like Entrez, it provides access to various databases and
supports keyword, sequence similarity and class searches.
SRS interface
Biological databases: Database searching
Database searching can be used to answer questions such as:
1. What is the sequence of human IL-10?
2. What is the gene coding for human IL-10?
3. Is the function of human IL-10 known? What is it?
4. Are there any variants of human IL-10?
5. Who sequenced this gene?
6. What are the differences between IL-10 in human and in other species?
7. Which species are known to have IL-10?
8. Is the structure of IL-10 known?
9. What are structural and functional domains of the IL-10?
10. Are there any motifs in the sequence that explain their properties?
11. What is an upstream region of IL-10 containing transcriptional regulation
    sites?
12. …
Biological databases: Database searching
• For a well-studied molecule such as IL-10, we expect to
retrieve many of the well-known facts.
• These searches are also useful for characterizing newly
identified sequences.
Notes:
• Multiple errors can be found in database entries. Some of these
errors are introduced with the submission of sequences to
databases. Some errors are due to naming conventions (or lack
of these). Some errors are due to poor links between databases.
• Users should treat data extracted from databases with care and
compare these results with information from other databases,
journal articles, and other sources.
Biological databases: Keyword searching
Search DNA and protein databases with keywords (10-July-2002)
Biological databases: Keyword searching
Notes:
• GenPept is the protein translation of GenBank. SPTR is
SWISS_PROT plus the protein translation of EMBL sequences.
• Different databases contain different, but overlapping, sets of
entries. The same sequence may have entries in different
databases.
• Some databases have non-redundant sections. An example is the
UniGene system, which automatically partitions GenBank
sequences into a non-redundant set of gene-oriented clusters.
• For complete results it is usually necessary to search
multiple databases.
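Because the same sequence can appear under different entries in different databases, results from several searches usually need to be merged. The sketch below groups hits by identical sequence. The database names, accessions and sequences are all made up for illustration, and `merge_hits` is a hypothetical helper, not part of any retrieval system.

```python
def merge_hits(hits_by_database):
    """Group (database, accession) pairs by identical sequence."""
    merged = {}
    for database, hits in hits_by_database.items():
        for accession, sequence in hits:
            key = sequence.upper()  # compare sequences case-insensitively
            merged.setdefault(key, []).append((database, accession))
    return merged

# Made-up accessions and sequences, for illustration only.
toy_hits = {
    "DB_A": [("A001", "MKTLLV"), ("A002", "GGSWPA")],
    "DB_B": [("B107", "mktllv")],
    "DB_C": [("C9", "GGSWPA"), ("C10", "TTQRRA")],
}
merged = merge_hits(toy_hits)
for sequence, sources in merged.items():
    print(sequence, sources)
```

Sequences found in only one database (here `TTQRRA`) show why a single-database search can miss entries, while sequences with several sources show the redundancy between databases.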
Biological databases: Database coverage
• Example: Scorpion KALIOTOXIN 2 (SwissProt:P45696)
Biological databases: Database errors
• Our scorpion study (Srinivasan et al., 2002) also revealed numerous errors and
missing data in the major databases. For one entry, the sequence printed in the
journal publication contained an error, while the sequence in the databases was correct.
Biological Databases: Sequence Similarity Searching
Proteins that have similar sequences often have similar structure and similar
function. If we have only a protein sequence, we can often infer its
structural and functional properties by analyzing sequences that are similar to it.
Sequence – Structure – Function Relationship:
Similar sequence = Similar structure = Similar function
Why this relationship?
Evolution: involves sequence variation
Laws of physics and chemistry: define the sequence-structure relationship
Function as defined by molecular interaction: structure-based
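One simple way to quantify "similar sequence" is percent identity over an existing alignment. The sketch below assumes the two sequences are already aligned to equal length, with `-` marking gaps. It does not perform the alignment itself, and the example strings are invented.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over aligned, non-gap positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == "-" or res_b == "-":
            continue  # skip positions where either sequence has a gap
        compared += 1
        if res_a == res_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Invented example: five of six aligned residues are identical.
print(percent_identity("ACDEFG", "ACDQFG"))
```

Identity alone does not prove a shared structure or function; it is a starting point for the statistical assessment discussed in the cautionary notes.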
Database Searching: Cautionary Notes
• Some database matches arise from chance similarity. This applies
to keyword searches and sequence similarity searches alike.
Distinguishing chance matches from biologically significant
matches is one of the most important issues for effective
use of biological databases.
• We searched GenBank with the sequence similarity tool BLAST,
set to find short, nearly exact matches, using the names of
last semester's lecturers of this module as query sequences.
The search returned two imperfect matches to “VLADIMIR” and seven
perfect matches to “TINWEE”.
Database Searching: Cautionary Notes
• If we interpreted these results blindly, we would erroneously conclude
that the motif VLADIMIR may be important for the
structure or function of the strawberry vein banding virus, and that
TINWEE has to do with calcium-dependent protein kinase in rice and
possibly in Legionella pneumophila.
• We can avoid conclusions like this by looking at the similarity
scores. This will be covered in more detail later in the course. For now it
is important to know that the lower the expected value (E-value), the better the
match. Any match with an expected value close to or greater than 1 should be
viewed with suspicion. However, matches that are not statistically
significant can sometimes still have biological significance. If we suspect that
this might be the case, further analysis is necessary.
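A back-of-the-envelope calculation shows why short words match by chance. The sketch below estimates the expected number of exact occurrences of a word in a database, assuming all 20 amino acids are equally likely and independent, and assuming a database of 500 million residues. Both are simplifications chosen for illustration; the database size is not taken from the lecture, and this is not how BLAST computes its expected values.

```python
def expected_exact_matches(word_length: int,
                           database_residues: int,
                           alphabet_size: int = 20) -> float:
    """Expected chance occurrences of one specific word of given length.

    Assumes uniform, independent residues and ignores edge effects at
    sequence boundaries.
    """
    return database_residues / alphabet_size ** word_length

ASSUMED_DB_RESIDUES = 500_000_000  # hypothetical database size

for word in ("TINWEE", "VLADIMIR"):
    expected = expected_exact_matches(len(word), ASSUMED_DB_RESIDUES)
    print(f"{word}: about {expected:.3g} exact matches expected by chance")
```

Under these assumptions a six-letter word is expected to occur several times purely by chance, while an exact eight-letter match is expected far less than once, which is why match length and expected value must be considered before drawing biological conclusions.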
Database Searching: Cautionary Notes
• Examples of chance matches: virtually any string
or keyword can show “matches” to database
entries. We are interested only in the real ones.
• The same search with GenBank. Fortunately, we
have statistical measures that indicate the quality
of matches. However, matches with low statistical
significance sometimes have real biological
significance nevertheless. More about this
will be taught later in the course.
Biological databases: Concluding remarks
1. Biological databases represent an invaluable
resource in support of biological research.
2. We can learn much about a particular molecule by
searching databases and using available analysis
tools.
3. A large number of databases are available for that
task. Some databases are very general, some are
more specialized, while some are very specialized.
For best results we often need to access multiple
databases.
4. The major types of databases covered in this course are
general nucleotide, general protein, structure, pathway,
molecular interaction, protein motif, publication, and
specialized databases.
5. Common database search methods include keyword matching,
sequence similarity, motif searching, and class searching.
6. The problems with using biological databases include
incomplete information, data spread over multiple databases,
redundant information, various errors, sometimes incorrect
links, and constant change.
7. Database standards, nomenclature, and naming
conventions are not clearly defined for many aspects of
biological information. This makes information
extraction more difficult.
8. Retrieval systems help extract rich information from
multiple databases. Examples include Entrez and SRS.
9. Formulating queries is a serious issue in biological
databases. Often the quality of results depends on the
quality of the queries.
10. Statistical measures indicate the quality of matches.
Statistical and biological significance are often
related. Sometimes, however, matches of real biological
significance have low statistical scores.
11. Access to biological databases is so important that
today virtually every molecular biological project starts
and ends with querying biological databases.
Biological databases
Summary of Today’s lecture
• Popular databases: KEGG, BIND, Pfam, PROSITE, PUBMED
• Data retrieval systems: Entrez, SRS
• Database searching: capability, potential problems.
• Statistics:
  – Protein families (> 5K)
  – Sequence patterns (> 1.5K)
  – Interactions (> 11K, or about 110 × 110, which is relatively few)
  – Relatively small amount of data for function (e.g. Pathways < 200)