Protein function prediction
Swiss Institute of Bioinformatics
Institut Suisse de Bioinformatique
LF-2003.10
Introduction
Signal peptides
Transmembrane regions and topology
PTM (post-translational modifications)
Low complexity and biased regions
Repeats
Coils
Secondary structure
Antigenic peptides
Domain/Motifs
Tools
The EMBOSS package
Different techniques
Algorithms
Sliding window, Nearest Neighbor
Patterns, regular expression
Weight matrices
HMM, profiles
Neural Networks
Rules
Sliding window
THISISATESTSEQVENCETHATDISPLAYSTHESLIDINGWINDQW
Score1
Score2
Scoren
Width or Size=11, Step=5
Results are usually displayed as a graph (see example).
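The sliding-window idea above can be sketched in a few lines. This is only an illustration, not the implementation of any particular tool: the function name `sliding_window_scores` is hypothetical, and the Kyte-Doolittle hydropathy scale is assumed purely as an example of a per-residue score, since the slide does not say which score is plotted.

```python
# Kyte-Doolittle hydropathy values, used here only as an example scale (assumption).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sliding_window_scores(seq, width=11, step=5, scale=KYTE_DOOLITTLE):
    """Return a list of (start, mean_score) pairs, one per window.

    `start` is the 1-based position of the first residue in the window.
    Windows that would run past the end of the sequence are skipped.
    """
    results = []
    for start in range(0, len(seq) - width + 1, step):
        window = seq[start:start + width]
        score = sum(scale[aa] for aa in window) / width
        results.append((start + 1, score))
    return results


if __name__ == "__main__":
    seq = "THISISATESTSEQVENCETHATDISPLAYSTHESLIDINGWINDQW"
    # Score1, Score2, ... Scoren from the slide, one per window.
    for start, score in sliding_window_scores(seq, width=11, step=5):
        print(f"{start:3d} {score:6.2f}")
```

Plotting the score against the window start position gives the kind of graph the slide refers to. A smaller step gives a smoother curve at the cost of more windows.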
Patterns / regular expression
Pattern: <A-x-[ST](2)-x(0,1)-{V}
Regexp: ^A.[ST]{2}.?[^V]
Text: The sequence must start with an alanine, followed by any amino acid, followed by two residues that are each a serine or a threonine, followed by any amino acid or nothing, followed by any amino acid except a valine.
Only the syntax differs…
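To illustrate that the two notations differ only in syntax, here is a rough sketch of a converter from the PROSITE-style pattern to a regular expression. The function name `prosite_to_regex` is hypothetical, and the sketch handles only the constructs shown on this slide (`<`, `>`, `x`, `[..]`, `{..}` and the `(n)` / `(n,m)` repeats), not the complete PROSITE syntax.

```python
import re

# One pattern element: a core (letter, x, [..] or {..}) plus an optional repeat.
_ELEMENT = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""   # N-terminal anchor
        suffix = "$" if element.endswith(">") else ""     # C-terminal anchor
        core, low, high = _ELEMENT.fullmatch(element.strip("<>")).groups()
        if core == "x":                     # any amino acid
            core = "."
        elif core.startswith("{"):          # any amino acid except these
            core = "[^" + core[1:-1] + "]"
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        elif (low, high) == ("0", "1"):
            repeat = "?"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("<A-x-[ST](2)-x(0,1)-{V}")
    print(regex)
    print(bool(re.match(regex, "AGSTKL")))
```

Applied to the pattern on the slide, the sketch yields the same regular expression as the one shown there.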
Weight matrices (PSSM)
HMM / profiles
Neural Networks
General principle:
Example:
Signals found in proteins
N-ter
exportation - secretion
mitochondria
chloroplast
internal
NLS (nuclear localization signal)
C-ter
GPI-anchor (Glycosyl Phosphatidyl Inositol)
other membrane anchors (see PTM)
other unknown ?
Signal detection tools
SignalP
MitoProt
ChloroP
Predotar
PSort
TargetP
Sigcleave (EMBOSS)
Big-PI
DGPI
Transmembrane regions
Detection (signal peptide, hydropathy, helices)
Organisation (topology)
Transmembrane detection tools
TMHMM
TMPred
TopPred2
DAS
HMMTop
Tmap (EMBOSS)
Post translational modifications
Phosphorylation: S - T - Y
Acetylation, methylation: D - E - K
O-glycosylation: S - T - (HO)K
N-glycosylation: N
Sulfation: Y
Farnesylation, myristylation, palmitoylation, geranylgeranylation, GPI-anchor: C - Nter - Cter
Ubiquitination and family: K - Nter
Inteins (protein splicing)
Pre-translational
Selenoprotein: C
PTM detection
Pattern prediction
(PROSITE)
Short or weak signal
Frequent hit producer
Best method is experimental
MS/MS detection
Most methods use « rules » joining pattern detection and knowledge to predict sites.
NetOGlyc - Prediction of mucin type O-glycosylation sites in mammalian proteins
DictyOGlyc - Prediction of GlcNAc
O-glycosylation sites in
Dictyostelium
YinOYang - O-beta-GlcNAc
attachment sites in eukaryotic
protein sequences
NetPhos - Prediction of Ser, Thr
and Tyr phosphorylation sites in
eukaryotic proteins
NMT - Prediction of N-terminal N-myristoylation
Sulfinator - Prediction of tyrosine
sulfation sites
Low complexity regions
repeats
compositional bias
PEST
Low complexity / Repeats
DUST (DNA) / SEG
search collection
search collection
REPRO, Radar
REP
de novo detection
EMBOSS (DNA)
RepeatMasker (DNA)
de novo detection
einverted
equicktandem
etandem
palindrome
EMBOSS (protein)
oddcomp
PEST, PESTFind
de novo detection
Coils
Helix of helix
coiled-coil
Leu-zipper
Coils detection
COILS: Weight matrices
Paircoil, Multicoil: Pairwise correlation
Marcoil: HMM
Pepcoil (EMBOSS): Weight matrices
Secondary structure
Structure to predict
Alpha-helices
Beta-sheets
Turns
Random coil
Garnier (EMBOSS)
PHD
DSC
PREDATOR
NNSSP
Jpred
Jnet
Many others
Antigenic peptide
Peptides binding to MHC
class I: 8, 9, 10 mers
class II: 15 mers (3+9+3)
Depend highly on MHC type
Use of experimental knowledge
Databases of known peptides
SYFPEITHI
HLA_Bind (BIMAS)
MAPPP combined expert
Antigenic (EMBOSS)
Many more
Prediction of proteasome
cleavage sites
NetChop
PaProc
Domain / Motif
All the protein domain
descriptors
PROSITE
PFAM
SMART
PRODOM
BLOCKS
PRINTS
…
Federation: InterPro
Many techniques
Patterns, Regexp
PSSM (PSI-BLAST)
Profiles
HMM
Other Tools
You can find some of them on our servers: www.ch.embnet.org
Or on the ExPASy server: www.expasy.org/tools
Or ask Google!! www.google.com
European Molecular Biology Open Software Suite
How to use EMBOSS/Jemboss at SIB
Free Open Source (for most Unix platforms)
GCG successor (compatible with GCG file format)
More than 150 programs (ver. 2.7.1)
Easy to install locally
but no interface, requires local databases
Interfaces
Unix command-line only
Jemboss, www2gcg, w2h, wemboss… (with account)
Pise, EMBOSS-GUI, SRSWWW (no account)
Staden, Kaptain, CoLiMate, Jemboss (local)
Access: www.emboss.org
Some details
Format USA
'asis' :: Sequence [start : end : reverse]
Format :: '@' ListFile [start : end : reverse]
Format :: 'list' : ListFile [start : end : reverse]
Format :: Database : Entry [start : end : reverse]
Format :: Database - SearchField : Word [start : end : reverse]
Format :: File : Entry [start : end : reverse]
Format :: File : SearchField : Word [start : end : reverse]
Format :: Program Program-parameters '|' [start : end : reverse]
Example: fasta::Swissprot:UBP5_HUMAN[200:300]
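As an illustration of how such an address breaks down, here is a rough sketch that splits the simplest form, `Format::Database:Entry[start:end:reverse]`, into its parts. It is not EMBOSS code: the function name `parse_usa` and the returned dictionary keys are hypothetical, the other forms (list files, search fields, programs, 'asis') are not handled, and treating any third field inside the brackets as "reverse" is an assumption.

```python
import re

# Only the "Format::Source:Entry[start:end:reverse]" form is covered (assumption).
_USA = re.compile(
    r"(?:(?P<format>\w+)::)?"                       # optional format prefix
    r"(?P<source>[^:\[\]]+):(?P<entry>[^:\[\]]+)"   # database or file, then entry
    r"(?:\[(?P<start>-?\d+):(?P<end>-?\d+)"         # optional sub-range
    r"(?::(?P<reverse>\w+))?\])?"                   # optional reverse flag
)


def parse_usa(usa):
    """Split a simple USA string into a dictionary of its components."""
    match = _USA.fullmatch(usa)
    if match is None:
        raise ValueError(f"not a simple USA: {usa!r}")
    parts = match.groupdict()
    return {
        "format": parts["format"],
        "source": parts["source"],
        "entry": parts["entry"],
        "start": int(parts["start"]) if parts["start"] else None,
        "end": int(parts["end"]) if parts["end"] else None,
        "reverse": parts["reverse"] is not None,
    }


if __name__ == "__main__":
    print(parse_usa("fasta::Swissprot:UBP5_HUMAN[200:300]"))
```

For the example on the slide, the parts are the format `fasta`, the database `Swissprot`, the entry `UBP5_HUMAN`, and the sub-range 200 to 300.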
Databases
Any can be added, use showdb to display the available databases
databases
showdb
Displays information on the currently available databases
# Name          Type  ID  Qry  All  Comment
# ====          ====  ==  ===  ===  =======
ipr_fetch       P     OK  OK   OK   InterPro current by fetch
ipi_fetch       P     OK  OK   OK   IPI current by fetch
refseq_fetch    P     OK  OK   OK   refseq current by fetch
repbase_fetch   P     OK  OK   OK   repbase current by fetch
swiss_fetch     P     OK  OK   OK   SwissProt current by fetch
swissprot       P     OK  OK   OK   SWISSPROT sequences
trembl          P     OK  OK   OK   TREMBL sequences
trembl_fetch    P     OK  OK   OK   trembl current by fetch
tremblnew       P     OK  OK   OK   TREMBL New sequences
ug_fetch        P     OK  OK   OK   Unigene by fetch
embl            N     OK  OK   OK   EMBL release
emhum           N     OK  OK   OK   EMBL release, Human section by emboss index
emrod           N     OK  OK   OK   EMBL release, Rodent section by emboss index
emvrt           N     OK  OK   OK   EMBL release, Vertebrate (nonhuman, nonrodent)
seqret (seqretall, seqretset, seqretsplit)
entret (for complete untouched entry, e.g., for unigene, interpro, swissprot…)
Possible to define your own « .embossrc » file
Some tools for DNA
redata     Search REBASE for enzyme name, references, suppliers etc
remap      Display a sequence with restriction cut sites, translation etc
restover   Finds restriction enzymes that produce a specific overhang
restrict   Finds restriction enzyme cleavage sites
showseq    Display a sequence with features, translation etc
silent     Silent mutation restriction enzyme scan
cirdna     Draws circular maps of DNA constructs
lindna     Draws linear maps of DNA constructs
revseq     Reverse and complement a sequence
…
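As a small illustration of what "reverse and complement" means for a DNA sequence, here is a minimal sketch. It is not the revseq program itself, the function names are hypothetical, and only the four unambiguous bases are handled (ambiguity codes are left unchanged).

```python
# Translation table for the four unambiguous bases, upper and lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq):
    """Return the complementary strand, read in the same direction."""
    return seq.translate(_COMPLEMENT)


def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return complement(seq)[::-1]


if __name__ == "__main__":
    print(reverse_complement("GACACCATCG"))
```

Applying the operation twice gives back the original sequence, which is a quick sanity check for any implementation.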
Example: remap
ECLAC
E.coli lactose operon with lacI, lacZ, lacY and lacA genes.
[Restriction map: cut sites for AciI, Bsc4I, BsiSI, BssKI, Bsu6I, HhaI, Hin6I and TaqI are marked above and below the two strands]
GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGT
        10        20        30        40        50        60
----:----|----:----|----:----|----:----|----:----|----:----|
CTGTGGTAGCTTACCGCGTTTTGGAAAGCGCCATACCGTACTATCGCGGGCCTTCTCTCA

# Enzymes that cut   Frequency   Isoschizomers
AciI                 1
Bsc4I                1
BsiSI                1
BssKI                1
Bsu6I                1
HhaI                 2
Hin6I                2           HinP1I,HspAI
TaqI                 1

# Enzymes that do not cut
AclI BamHI BceAI Bse1I BshI ClaI EcoRI EcoRII Hin4I HindII HindIII
HpyCH4IV KpnI NotI
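The core of such a scan can be sketched as a search for each recognition sequence on both strands. This is only an illustration, not how remap is implemented: the function names are hypothetical, the recognition sequences below come from general knowledge rather than from the slide, and the positions returned are 0-based starts of the recognition site on the top strand, not the actual cut positions.

```python
# Recognition sequences for a few enzymes (assumed from general knowledge).
SITES = {
    "TaqI": "TCGA",
    "HhaI": "GCGC",
    "AciI": "CCGC",
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sites(seq, site):
    """Return sorted 0-based starts of `site` found on either strand."""
    hits = set()
    # A site on the bottom strand appears as its reverse complement on top.
    for target in {site, reverse_complement(site)}:
        pos = seq.find(target)
        while pos != -1:
            hits.add(pos)
            pos = seq.find(target, pos + 1)
    return sorted(hits)


if __name__ == "__main__":
    seq = "GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGT"
    for enzyme, site in SITES.items():
        print(enzyme, find_sites(seq, site))
```

On the 60 bases shown above, the number of sites found per enzyme agrees with the frequencies in the remap output for the enzymes included in this small table.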
Example: cirdna
File: ../../data/data.cirp
Start 1001
End 4270
group
label
Block 1011 1362 3
ex1
endlabel
label
Tick 1610 8
EcoR1
endlabel
label
Block 1647 1815 1
endlabel
label
Tick 2459 8
BamH1
endlabel
label
Block 4139 4258 3
ex2
endlabel
endgroup
group
label
Range 2541 2812 [ ] 5
Alu
endlabel
label
Range 3322 3497 > < 5
MER13
endlabel
endgroup
Example: plotorf
EMBOSS format input/output
UFO Universal Feature Object
Alignments
Multiple and pairwise, many flavors (FASTA, MSF, SRS…)
Reports
gff, swissprot, embl, pir, nbrf (with or without sequence)
Feature (UFO), SRS, motif, seqtable, excel, diffseq, listfile (USA),
etc…
Sequences (compatible with USA)
Many!!! E.g., fasta, clustal, gcg, paup, gff, embl, swissprot, acedb,
abi, etc…
Web interfaces
PISE (Pasteur Institute Software Environment)
http://www-alt.pasteur.fr/~letondal/Pise/
EMBOSS-GUI (Canada) (not yet at SIB)
http://bioinfo.pbi.nrc.ca/~lukem/EMBOSS/
Pise
http://emboss.ch.embnet.org/Pise
a tool to generate Web interfaces for Molecular Biology programs
GUI (Canada)
http://bioinfo.pbi.nrc.ca:8090/EMBOSS/
Launch Jemboss
http://emboss.ch.embnet.org/Jemboss
Launch Jemboss
First time only…
Each time…
Jemboss windows
Jemboss windows other systems
Summary
Anonymous web access through Pise
Registered access through Jemboss
Registered access through command-line
(requires UNIX skills)
Please report problems!
Exercises
DEA Exercises: web-based sequence analysis
The goal of this exercise is to use web-based tools for protein sequence analysis.
a) Take this TrEMBL sequence (Q9X252) and try a BLAST against swissprot with the complete protein or with the first 70 residues. Explain the difference. Use TMPred, SignalP, and COILS to help you.
b) Pass this sequence through PFSCAN and search all databases. Compare with this command on ludwig-sun1/2: hits -b "prf pat pfam" tr:Q9X252
c) Use the different profile, motif, and pattern databases to get more information about the domain(s) you found.
d) How do you evaluate the PRINTS tropomyosin annotation in this TrEMBL entry (Q9WZH0)?
List of useful links:
basic BLAST or advanced BLAST or PSI-BLAST
TMPred prediction tool for transmembrane regions (or TMHMM)
COILS prediction tool for coiled-coil regions
SignalP prediction tool for signal-peptide cleavage site
Profile, domain, motifs databases and search sites:
PFSCAN
InterPro (Pfam, PRINTS, PROSITE, SMART)
HITS