Module 5
Protein domains
Aims

To introduce the concept of multidomain proteins

To define the terms associated with analysis of multidomain proteins

To introduce the major secondary databases
Objectives
The student should be able to:

Select an appropriate secondary database for analysis of protein domains

Carry out an analysis to establish the domain structure of a protein

Ascribe likely biological functions to protein domains
Introduction
We need to consider something about the evolutionary history of genes and the proteins they
encode before we can look at how we can study protein domains. When the amino acid
sequences of two proteins are compared and found to exhibit significant similarity they are
assumed to be evolutionarily related, i.e. they are homologues. We can distinguish two classes
of homologue (orthologue and paralogue) by considering the genes that encode them.
Orthologous genes are descended from a single ancestral gene, and their divergence in
different organisms simply parallels speciation. In contrast, paralogous genes are
descended from copies of a gene that duplicated within a single ancestral genome.
It is now widely accepted that a substantial proportion of all proteins are composed of
more than one domain. A domain is defined as sequentially consecutive residues in a protein
that can fold up independently of other parts of the protein. Crystallographers commonly refer
to domains as folds and the term module is also sometimes used. Some people would go as far
as saying that the domain is the fundamental unit of protein structure, since inter-domain
splicing, fusion, deletion, duplication and shuffling have occurred frequently during
evolution, whereas intra-domain rearrangements have occurred rarely (Saier, 1996).
It was clear from Module 4 that when two homologous proteins are aligned, there are
one or more regions where sequence identity is particularly high, and these regions frequently
enable the definition of motifs or signature sequences that are diagnostic. Any particular
domain may have one or more characteristic motifs. Domains, motifs and signature sequences
constitute the content of many secondary databases and are of enormous value in attempting
to predict the function and structure of new proteins.
Low complexity regions
The individual domains of multidomain proteins are frequently separated from each other by
regions of low complexity, also referred to as linker sequences. Long stretches of repeated
residues, particularly proline, glutamine, serine or threonine, often indicate linker sequences.
The program SEG (see below) is designed to detect such low complexity regions and can be
used as part of BLAST to mask off segments of the query sequence that have low
compositional complexity. Filtering can eliminate statistically significant but biologically
uninteresting reports from the BLAST output (e.g. hits against common acidic-, basic- or
proline-rich regions), leaving the more biologically interesting regions of the query sequence
available for specific matching against database sequences.
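The idea of masking low complexity regions can be illustrated with a minimal Python sketch. It is not the SEG algorithm: it simply computes the Shannon entropy of the residue composition in a sliding window and replaces every residue in a low-entropy window with X. The function names, the window length, the threshold and the example sequence are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(sequence, window=12, threshold=2.2):
    """Replace every residue that falls in a low-entropy window with 'X'.

    window and threshold are illustrative values, not tuned parameters.
    """
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) <= threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# Made-up sequence: two mixed-composition regions joined by a poly-glutamine run.
example = "MKVLAAGIVGLCDEFHRW" + "Q" * 14 + "TYNDWKLMCEVRGHIA"
masked = mask_low_complexity(example)
print(example)
print(masked)
```

Because a window is masked as a whole, a few residues flanking the repeat are masked as well. A masked sequence of this kind is what is then compared against the database, so that the repeat itself cannot generate hits.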
Secondary (pattern) databases
Two approaches which frequently help with establishing the function and/or structure of an
unknown protein involve the identification of motifs or the production of profiles from the
primary databases. Analysis of the primary protein sequence databases, primarily through the
generation of multiple sequence alignments, has led to the identification of sequence patterns
(or motifs) common to homologous proteins. These motifs, typically 10-20 amino acids in
length, usually correspond to key functional or structural elements, often
domains, and are extremely useful in identifying such features in new uncharacterized
proteins. There are a number of such secondary databases, in which the information has been
derived from different primary databases by different analytical methods. All these databases
are based, though, on the same principle. The sequence of an unknown protein is often too
distantly related to any protein of known sequence to detect its resemblance by overall
sequence alignment, but it can potentially be identified by the occurrence in its sequence of a
particular cluster of amino acid residues, variously known as patterns, motifs,
signatures, blocks or fingerprints. Usually the motifs do not overlap, but are separated along a
sequence, though they may be contiguous in 3D space.
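Searching a sequence for a motif can be sketched in a few lines of Python. The example below translates a simple PROSITE-style pattern into a regular expression and scans a sequence with it. The pattern used is the widely quoted ATP/GTP-binding site motif A (P-loop) pattern; the function names and the toy sequence are assumptions made for illustration, and only the basic pattern elements are handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression.

    Handles single residues (G), allowed sets ([ST]), excluded sets ({P}),
    the wildcard x, and repeat counts such as x(4) or x(2,4). Terminal
    anchors (< and >) are not handled in this sketch.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        repeat = ""
        if element.endswith(")"):
            # Split off a repeat count, e.g. "x(4)" -> "x" and "4".
            element, counts = element[:-1].split("(")
            repeat = "{" + counts + "}"
        if element == "x":
            core = "."  # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"  # any residue except those listed
        else:
            core = element  # a single residue or a bracketed set
        parts.append(core + repeat)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


p_loop = "[AG]-x(4)-G-K-[ST]"
toy_sequence = "MSLGPPGTGKTLLAR"  # made-up fragment, not a database entry
print(prosite_to_regex(p_loop))
print(find_motif(p_loop, toy_sequence))
```

A short pattern like this will also match many unrelated proteins by chance, which is one reason why the secondary databases combine several motifs or use profiles.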
Analysis of the primary protein sequence databases and the production of multiple
sequence alignments can also lead to the construction of profiles. Profiles are scoring tables
that summarize the information in an alignment. The profile determines which residues are
allowed at each point, which residues are conserved or degenerate, which positions can
tolerate insertions etc. Unknown proteins can then be scored against the profile to see if they
fit.
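A profile of the kind described above can be sketched as a table of log-odds scores, one column per alignment position. The Python example below is a simplified illustration, not the method of any particular database: it assumes an ungapped alignment, a uniform background frequency for the 20 amino acids and a fixed pseudocount, and it ignores insertions and deletions. All names and the toy alignment are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build a log-odds scoring table from equal-length, ungapped aligned sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for position in range(length):
        column = [seq[position] for seq in alignment]
        denominator = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for residue in AMINO_ACIDS:
            frequency = (column.count(residue) + pseudocount) / denominator
            scores[residue] = math.log2(frequency / background)
        profile.append(scores)
    return profile


def score_against_profile(profile, segment):
    """Sum the per-position scores of a segment as long as the profile."""
    return sum(column[residue] for column, residue in zip(profile, segment))


def best_match(profile, sequence):
    """Return (score, 0-based start) of the best-scoring window in the sequence."""
    width = len(profile)
    return max(
        (score_against_profile(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Toy alignment of a conserved region (made-up sequences).
toy_alignment = ["GPPGTGKT", "GPPGVGKT", "GAPGTGKS", "GPPGSGKT"]
toy_profile = build_profile(toy_alignment)
best_score, best_start = best_match(toy_profile, "MSLAGPPGTGKTLLAR")
print(best_score, best_start)
```

Conserved positions give high scores to the conserved residue and low scores to everything else, whereas degenerate positions score several residues similarly. The sketch accepts only the 20 standard one-letter codes.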
There are a number of programs that allow an unknown protein to be searched
against databases of motifs, of profiles, or of both. Some commonly used programs
are listed below:
Pfam is a collection of multiple alignments and profile hidden Markov models of protein
domain families, which is based on proteins from both SWISS-PROT and SP-TrEMBL.
SMART (a Simple Modular Architecture Research Tool) allows the identification and
annotation of genetically mobile domains and the analysis of domain architectures. More than
400 domain families found in signalling, extracellular and chromatin-associated proteins are
detectable. These domains are extensively annotated with respect to phyletic distributions,
functional class, tertiary structures and functionally important residues. Each domain found in
a non-redundant protein database as well as search parameters and taxonomic information are
stored in a relational database system. User interfaces to this database allow searches for
proteins containing specific combinations of domains in defined taxa.
PROSITE is a database of protein families and domains. It consists of biologically significant
sites, patterns and profiles that help to reliably identify to which known protein family (if any)
a new sequence belongs.
PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs
used to characterise a protein family; its diagnostic power is refined by iterative scanning of a
SWISS-PROT/TrEMBL composite database. Fingerprints can encode protein folds and
functionalities more flexibly and powerfully than can single motifs, full diagnostic potency
deriving from the mutual context provided by motif neighbours.
BLOCKS is a database of blocks: short, multiply aligned, ungapped segments corresponding to the most
highly conserved regions of proteins. The rationale behind searching a database of blocks is
that information from multiply aligned sequences is present in a concentrated form, reducing
background and increasing sensitivity to distant relationships. This information is represented
in a position-specific scoring table or "profile", in which each column of the alignment is
converted to a column of a table representing the frequency of occurrence of each of the 20
amino acids.
IDENTIFY uses motifs derived from the BLOCKS and PRINTS databases.
INTERPRO SEARCH provides an integrated resource for protein families, domains and
functional sites, which amalgamates the resources of the PROSITE, Pfam, ProDom and
PRINTS databases.
CD-SEARCH at NCBI employs the reverse position-specific BLAST algorithm to search the
Conserved Domain Database (CDD), which is at present composed of SMART and Pfam, plus
contributions from colleagues at NCBI.
Exercises
The slr0228 gene of the cyanobacterium Synechocystis sp. PCC 6803 encodes a multidomain
homologue of the E. coli protein FtsH. Carry out the following tasks:
1. Retrieve the protein sequence from NCBI using Entrez.
2. Analyse the FtsH sequence for transmembrane segments using PredictProtein.
3. Analyse the FtsH sequence for coiled-coils using the COILS server and MULTICOIL.
Are there any differences between the predictions? Is this protein likely to have any
coiled-coil regions?
4. Analyse the domain structure of the FtsH homologue using Pfam, IDENTIFY and
SMART. Are there any differences between the predictions?
5. Compare the predictions obtained in (4) with those obtained using INTERPRO SEARCH.
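As a rough cross-check for task 2, candidate transmembrane segments can be looked for with a sliding-window hydropathy average. The Python sketch below uses the Kyte-Doolittle hydropathy scale; it is a much simpler technique than the method used by the prediction server and is offered only as an illustration. The function names, the window length of 19 residues, the threshold of 1.6 and the toy sequence are assumptions.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every full-length window along the sequence."""
    return [
        sum(KYTE_DOOLITTLE[residue] for residue in sequence[i:i + window]) / window
        for i in range(len(sequence) - window + 1)
    ]


def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """1-based start positions of windows whose mean hydropathy reaches the threshold."""
    return [
        i + 1
        for i, value in enumerate(hydropathy_profile(sequence, window))
        if value >= threshold
    ]


# Made-up sequence: a hydrophobic stretch flanked by charged residues.
toy_protein = "DEKR" * 5 + "LIVALIVALIVALIVALIVAL" + "DEKR" * 5
print(candidate_tm_windows(toy_protein))
```

Runs of consecutive start positions indicate one hydrophobic stretch rather than several separate segments, so the output needs to be read alongside the server prediction rather than in place of it.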
References and Useful Links
Bateman A, Birney E, Durbin R, Eddy SR, Finn RD, Sonnhammer ELL (1999) Pfam 3.1:
1313 multiple alignments match the majority of proteins. Nucleic Acids Research 27, 260-262.
Saier MH (1996) Phylogenetic approaches to the identification and characterization of protein
families and superfamilies. Microbial and Comparative Genomics 1, 129-150.
Schultz J, Copley RR, Doerks T, Ponting CP, Bork P (2000) SMART: a web-based tool for the
study of genetically mobile domains. Nucleic Acids Research 28, 231-234.