Bioinformatics and Protein Sequence Analysis
With the sequencing of large numbers of proteins and the subsequent storage of the data, it has become easier for researchers to study proteins. These studies provide preliminary insights into the structural and functional aspects of proteins without conducting experiments.
Surabhi Agarwal
Master Layout (Part 1)
This animation consists of 2 parts:
Part 1: Protein Sequence Alignment
Part 2: Alignment analysis and interpretations
Schematic of Part 1:
Extract the newly determined amino acid sequence for your query peptide.
Select the relevant algorithm and its associated parameters: Pair-wise Sequence alignment or Multiple Sequence alignment (Seq 1, Seq 2, Seq 3).
Assess the significance of the result with its alignment score.
Definitions of the components
Part 1 – Protein sequence alignment
1. Query Peptide: The unknown protein or peptide that is provided as input to the sequence analysis server. Its sequence is determined first, before it is analyzed for similarity matches with other proteins.
2. Relevant Algorithm: An algorithm is the sequence of logical steps used to compare the query peptide with other given protein sequences. The nature of the query, such as “Local” or “Global” and “Pair-wise alignment” or “Multiple Sequence Alignment”, determines which algorithm is used.
3. Local Alignment: “Local” alignment matches individual blocks of the protein sequences rather than their full lengths. An aligned block ends at the point where extending it further would lower the score. The aim of such alignment studies is to find the best-scoring blocks of similarity between the aligned protein sequences.
4. Global Alignment: “Global” alignment is an end-to-end alignment of two or more sequences, in which gaps are introduced wherever they are needed to keep the sequences aligned over their entire length.
5. Pair-wise sequence alignment: This procedure compares and aligns two given sequences. The comparison can be either global or local, and the quality of the alignment is judged by the alignment score.
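The difference between global and local pair-wise alignment can be sketched with a small dynamic-programming routine. This is a minimal illustration only: the function name `alignment_score` and the toy scores (match +1, mismatch -1, gap -2) are assumptions made for this example, not the scoring scheme of any particular server, and a single linear gap cost is used instead of separate existence and extension penalties.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of sequences a and b by dynamic programming.

    local=False: global, end-to-end alignment (Needleman-Wunsch style).
    local=True:  local, best-scoring block (Smith-Waterman style).
    The default scores are toy values chosen for illustration.
    """
    cols = len(b) + 1
    # prev[j] is the best score for the previous row of the DP table.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score = max(prev[j - 1] + sub,   # align the two residues
                        prev[j] + gap,       # gap in b
                        curr[j - 1] + gap)   # gap in a
            if local:
                score = max(score, 0)        # a local block may restart anywhere
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


if __name__ == "__main__":
    print(alignment_score("ACGT", "AGT"))                       # global score
    print(alignment_score("TTTACGTTT", "GGACGGG", local=True))  # local score
```

The global call has to pay for a gap because the sequences differ in length, whereas the local call only scores the shared block and ignores the dissimilar flanks.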
Definitions of the components
Part 1 – Protein sequence alignment
6. Multiple Sequence Alignment: The end-to-end alignment of several given sequences that are provided to the search engine. Multiple alignment tends to introduce as few gaps as possible and finds the regions of similarity shared by all the given sequences.
7. Word length: The minimum length of an amino acid sequence that needs to match exactly in order to initiate an alignment, which is then extended in either direction. The sensitivity and speed of the alignment depend on the word length provided by the user.
8. Scoring Matrix: The matrix of values that is referred to when assigning a score to each aligned pair of residues. The matrix used for a BLAST search is selected depending on the type of sequences being searched. The two commonly used families are the PAM series and the BLOSUM series.
a) PAM: PAM stands for Point Accepted Mutation. It is a log-odds scoring matrix constructed from the amino acid replacements observed in a set of closely related proteins. The PAM value defines the percentage of mutations accepted in a given set of proteins: 1 PAM refers to a change at an average of 1% of amino acid positions.
b) BLOSUM: This stands for “BLOcks SUbstitution Matrix” and is constructed from a set of distantly related proteins. BLOSUM provides a comprehensive biological insight into proteins when the evolutionary distance is not known beforehand. It is based on the relative frequencies of amino acid residues and the probabilities of their substitution in a set of highly conserved blocks of residues in proteins that are evolutionarily distant.
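The word-length idea defined above can be sketched as a search for exact shared words between two sequences. This is a simplified illustration: the function name `shared_words` is hypothetical, and it follows the exact-match definition given in this document, whereas production search tools use more elaborate seeding rules.

```python
from collections import defaultdict


def shared_words(query, subject, word_size=3):
    """Return (query_pos, subject_pos, word) for every exact shared word.

    Positions are 0-based. Each hit is a candidate seed from which an
    alignment could be extended in both directions.
    """
    # Index every word of the subject by its starting position.
    index = defaultdict(list)
    for j in range(len(subject) - word_size + 1):
        index[subject[j:j + word_size]].append(j)

    hits = []
    for i in range(len(query) - word_size + 1):
        word = query[i:i + word_size]
        for j in index.get(word, []):
            hits.append((i, j, word))
    return hits


if __name__ == "__main__":
    # Short fragments taken from the two collagen sequences used in Step 1.
    print(shared_words("SLRRIAFFGVA", "SLRKVAFFGIA", word_size=3))
```

A smaller word size yields more seeds (higher sensitivity, slower search); a larger one yields fewer seeds (faster, less sensitive).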
Definitions of the components
Part 1 – Protein sequence alignment
9. Threshold: The threshold is a measure of the statistical significance of the results of an alignment study. It represents the expected number of matches that are allowed to occur by chance.
10. Gap Penalty and Gap Extension: In an alignment of two or more given protein sequences, a gap is introduced where one sequence has no residue corresponding to a residue in the other. In this context, “Gap penalty” refers to the deduction from the overall alignment score when a gap is introduced, while “Gap Extension” is the deduction for extending an already existing gap.
11. Alignment Score: The raw alignment score is computed from the scoring matrix values of the aligned residue pairs and the gap penalties. The Bit Score is the raw score normalized with respect to the scoring system, and provides a comparative quantification of the quality of alignment. The score increases when a higher number of residue matches and a lower number of mismatches are encountered. The alignment with the higher bit score is the better match.
12. Percentage Identity: The percentage of amino acid residues that are identical to each other in the comparison of two sequences.
13. E-value: The E-value quantifies how likely an alignment between two or more sequences is to be a chance alignment rather than a biologically significant match. For a similarity match against a database, this value depends on the size of the database against which the sequence is compared. The closer the E-value is to zero, the higher the biological significance of the match.
14. Hit: The results of a search are called ‘Hits’, and the term ‘best Hit’ refers to the best result for that particular query.
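As a small worked example of the Percentage Identity definition above, the value can be computed directly from two aligned sequences. This is a minimal sketch: the function name `percent_identity` is hypothetical, and dividing by the full alignment length (gap columns included) is an assumption, since tools differ on the denominator.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding the same residue in both rows.

    Both inputs must already be aligned (equal length, '-' for gaps).
    Gap columns never count as identical but do count toward the length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    identical = sum(1 for x, y in zip(aligned_a, aligned_b)
                    if x == y and x != "-")
    return 100.0 * identical / len(aligned_a)


if __name__ == "__main__":
    print(percent_identity("AC-E", "ACDE"))  # 3 identical columns out of 4
```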
Step 1: Pair-wise sequence alignment for two given sequences - INPUT
SEQUENCE DATABASE
Enter sequence 1:
>gi|268576797|ref|XP_002643378.1| C. briggsae CBR-COL-186 protein [Caenorhabditis briggsae]
MKSTEKKSTELDLELEAQSLRRIAFFGVAMSTVATFV
CIITVPLAYNKMQQMQSNMIDQYMASARGIRVA …
Enter sequence 2:
>gi|6682|emb|CAA35955.1| collagen [Caenorhabditis elegans]
MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVP
MLYNYMQHVQSSLQSEVEFCQHRSNGLWDEYK …
Word Size: 3 (Length of initial set of amino acids that needs to be matched before alignment begins)
Threshold: 10 (Expected number of matches that are allowed to occur by chance)
Gap penalty: Existence 11, Extension 1 (Values deducted from overall alignment score on introduction and extension of gaps)
Scoring Matrix: BLOSUM62, selected from the options PAM30 and BLOSUM62 (The reference matrix used to assign scores to matches of residues)
ALIGNMENT ALGORITHM (BLAST)
Action
Schematic of the process of pair-wise alignment
Description of the action
Follow the animation steps. Re-draw all figures. Show all definitions first by highlighting the parameter. Follow it with input of 2 sequences and the parameter values one by one. The drop-down list after scoring matrix should look like the drop-down lists seen on web-pages. Click on the drop-down list and show the BLOSUM62 Matrix getting selected. Click on BLAST tool.
Audio Narration
Alignment algorithms are computer algorithms that take two protein sequences and align them residue by residue. Here we depict an alignment between 2 given sequences. To align two sequences, enter them in the input boxes. We took the example of the CBR-COL-186 protein of Caenorhabditis briggsae and the collagen of Caenorhabditis elegans. The sequences are abridged for the purpose of animation. To carry out the exact study, users can download the sequences corresponding to the Gene ID. Enter the parameters as per the nature of the query and the purpose of the search, and finally click on the BLAST tool.
Step 2: Pair-wise sequence alignment for two given sequences - OUTPUT
ALIGNMENT: Shows the match or mismatch between each of the residues. Gaps are introduced in sequence 2 due to lack of similar residues in sequence 1.
Sequence 1  LELEAQSLRRIAFFGVAMSTVATFVCIITVPLAYNKMQQMQSNMIDQYMASARGIRVARR
            +  E +SLR++AFFG+A+ST+AT   II VP+ YN MQ +QS++  +
Sequence 2  IAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSE----------VEF
SCORE: 77.4 bits. Bit scores are the normalized scores which are found after normalization of the raw scores based on the scoring matrix used in the algorithm.
PERCENTAGE IDENTITY: 34%. The percentage of residues which were identical in the two sequences.
DOT-PLOT: Dot-Plot is the graphical visualization of the two given sequences to find approximate overlaps to identify regions of close similarity.
E-VALUE: 6e-19. The statistical measure of the biological significance. The closer the e-value is to 0, the higher is the biological significance.
Action
Shows the various output formats for pairwise alignment
Description of the action
Show the smaller image of the server with every output and definitions coming out of it one at a time as shown in the powerpoint animation
http://blast.ncbi.nlm.nih.gov/Blast.cgi
Audio Narration
Pair-wise alignment with the help of BLOSUM 62 matrix gives various kinds of results
after alignment. These are alignment, alignment score, dot-plot, percentage identity
and e-value. The raw score from BLOSUM62 matrix is 189 and from PAM30 matrix is
178. Bit score for alignment of the exact same study done using BLOSUM62 is 77.4 and
for PAM30 matrix is 78.7. Therefore, the Bit scores give a uniform and normalized
measure of the overall quality of alignment irrespective of the scoring system. The
biological significance of this result is very high as the e value is very near to 0. For a
more detailed study on the types of BLAST tools available, visit
http://blast.ncbi.nlm.nih.gov/Blast.cgi
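The normalization from raw score to bit score described in the narration can be sketched with the standard Karlin-Altschul relations, bit score = (lambda x raw - ln K) / ln 2 and E = m x n x 2^(-bit score). The function names below are hypothetical, and the constants lambda = 0.267 and K = 0.041 are assumed here for BLOSUM62 with gap costs of existence 11 and extension 1; they are not stated on the slide, so treat them as assumptions and take the real values from the search output. The E-value form also ignores the length corrections that real search tools apply.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score to bits.

    lam and k are the statistical parameters of the scoring system
    (they differ for every matrix and gap-cost combination).
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least `bits`.

    Simplified form using raw lengths with no edge-effect correction.
    """
    return query_len * db_len * 2.0 ** (-bits)


if __name__ == "__main__":
    # Raw BLOSUM62 score quoted in the narration; lam and k are assumptions.
    print(round(bit_score(189, lam=0.267, k=0.041), 1))
```

Because each scoring system has its own lambda and K, raw scores from different matrices are not directly comparable, while the resulting bit scores are.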
Step 3: Pair-wise alignment of sequences against database - INPUT
SEQUENCE DATABASE
Enter sequence 1:
MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYM
QHVQSSLQSEVEFCQHRSNGLWDEYKRFQGVSGVEGRIKRDAYH
RSLGVSGASRKARRQSYGNDAAVGGFGGSSGGSCCSCGSGAAGP
AGSPGQDGAPGNDGAPGAPGNPGQDASEDQTAGPDSFCFDCPAG
PPGPSGAPGQKGPSGAPGAPGQSGGAALPGPPGP
SELECT DATABASE: PROTEIN, NUCLEOTIDE, GENE, PROTEOME, GEO, EST, SNP
Word Size: 3
Threshold: 10
Gap penalty: Existence 11, Extension 1
Scoring Matrix: PAM30, selected from the options PAM30 and BLOSUM62
ALIGNMENT ALGORITHM (BLAST)
Action
Schematic of the process of pair-wise alignment
Description of the action
Follow the animation steps. Re-draw all figures. Show all definitions first by highlighting the parameter. Follow it with input of 1 sequence. The drop-down lists after “Select Database” and “Scoring Matrix” should look like the drop-down lists seen on web-pages. Select “Protein” under the “Select Database” options box as shown in the animation. Follow this by inputting the parameter values one by one. Click on the drop-down list against “Scoring Matrix” and show the PAM30 Matrix. Click on BLAST tool.
Audio Narration
Alignment can also be done by matching a sequence against a
related database of sequences to identify it. Input the unknown
sequence, and then select the database against which the
sequence is to be matched. Fill the parameter values as per the
purpose of the search and the nature of the query sequence. In
this case we study the hits using PAM30 scoring Matrix. Click on
the BLAST tool once all parameters have been entered.
Step 4: Pair-wise alignment of sequences against database - OUTPUT
SEQUENCE DATABASE
Enter sequence 1:
MPSSVSWGILLLAGLCCLVPVSLAEDPQGDAAQKTDTSHH
DQDHPTFNKITPNLAEFAFSLYRQLAHQSNSTNIFFSPVSIA
TAFAML
SELECT DATABASE: PROTEIN, NUCLEOTIDE, GENE, PROTEOME, GEO, EST, SNP
Word Size: 3
Threshold: 10
Gap penalty: Existence 11, Extension 1
Scoring Matrix: BLOSUM, PAM
ALIGNMENT ALGORITHM (BLAST)
IDENTIFICATION: GENE ID: 179452 col-13 | Collagen [Caenorhabditis elegans]. Identifies the protein and the source organism for the unknown sequence.
TOTAL SCORE: 624 bits. Measure of the quality of the alignment when compared to bit scores of other hits of the search.
E-Value: 1e-176. In the case of database searches, the E-value is found by the multiplication of the pair-wise e-value and the number of sequences in the database.
Percentage Identity: 100%. Percentage of residues exactly matching in the query sequence and the selected hit.
Domain Identified (if any): Pfam ID: pfam01484; Domain Name: Col_cuticle_N; Description: Nematode cuticle collagen N-terminal domain. The query is scanned to find domains from the Pfam Database. In case such a domain is identified, it is shown as part of the result. (Query ruler: 1, 50, 100, 150, 200, 250, 300; domain marked from 17 to 69.)
ALIGNMENT: Alignment shows 100% matching with the identified sequence.
Query     MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSEVEFCQH
          MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSEVEFCQH
Database  MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSEVEFCQH
Action
Shows the various output formats for pairwise alignment
Description of the action
Show the smaller image of the server with every output and definitions coming out of it one at a time as shown in the powerpoint animation
Audio Narration
Pair-wise alignment gives various kinds of results
after alignment. These are alignment views,
alignment score, dot-plot, e-value, percentage
identity amongst many others. When compared to
bit scores from other hits of the result, the bit score
turns out to be the highest for collagen proteins in
Caenorhabditis elegans
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html; http://pfam.sanger.ac.uk/
Step 5: Multiple Sequence Alignment - INPUT
SEQUENCE DATABASE
Enter sequence 1:
>gi|268574584|ref|XP_002642271.1| Hypothetical protein CBG18259 [Caenorhabditis briggsae]
MDEKQRLQAYRFVAYSAVTFSTVAVFSLCITLPLVYNYVDGIKTQINHEIKFCKHSARDIF
AEVNHIRANPKNASRFARQAGYGTDEAVSGGS
Enter sequence 2:
>gi|32565788|ref|NP_871711.1| COLlagen family member (col-96) [Caenorhabditis elegans]
MDEITRRNAYRFVAYSAVTFSVVAVFSLCITLPMVYNYVHGIKSQINHQISFCKHSARD
IFSEVNHIRASPNNATLREKRQAGDCSGCCL
Enter sequence 3:
>gi|17559060|ref|NP_505677.1| COLlagen family member (col-13) [Caenorhabditis elegans]
MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSSLQSEVEFCQHRS
NGLWDEYKRFQGVSGVEGRIKRDAYH
ADD MORE SEQUENCES
Word Size: 3 (The word-size is the length of the initial seed set of amino acids, which needs to match exactly to get the alignment extended in both directions)
Window length: 10 (Window Length is the length of the residues on either side of the initial matched sequence, till which the alignment will be extended)
Gap penalty: Existence 11, Extension 1
Score type: ABSOLUTE, selected from the options ABSOLUTE and PERCENTAGE (Users can choose to see absolute scores for comparing or percentage value of the scores)
MULTIPLE SEQUENCE ALIGNMENT (CLUSTAL-W)
Action
Schematic of the process of multiple sequence alignment
Description of the action
Follow the animation steps. Enter the first 2 sequences. Click on “Add more sequences”. Open the 3rd input box for entering the 3rd sequence. Show the input of the 3rd sequence. Show the input of parameters. Select “Absolute” in the “Score Type” drop-down list. The drop-down list should look like the drop-down lists seen on web-pages.
Audio Narration
Multiple Sequence Alignment tools are used to compare the amino
acid sequences of more than two proteins. The word-size is the
length of the seed set of amino acids, which needs to match exactly
to get extended in both directions. Window Length is the length of
the residues on either side, till which the alignment will be
extended. The Gap penalty and extension hold the same meaning
as in pair-wise alignment. In the scores, users can choose to see
absolute scores for comparing or percentage value of the scores.
Step 6: Multiple Sequence Alignment - OUTPUT
SEQUENCE DATABASE
Enter sequence 1:
MPSSVSWGILLLAGLCCLVPVSLAEDP
QGDAAQKTDTSHHDQDHPTFNKITP
Enter sequence 2:
MKLLKLTGFIFFLFFLTESLTLPTQPRDIE
NFNSTQKFIEDNIEYITIIAFAQYVQEA
Enter sequence 2:
MKLLKLTGFIFFLFFLTESLTLPTQPRDIE
NFNSTQKFIEDNIEYITIIAFAQYVQEA
Word Size: 3
Threshold: 10
Gap penalty: Existence 11, Extension 1
Scoring Matrix: BLOSUM
MULTIPLE SEQUENCE ALIGNMENT (CLUSTAL-W)
MULTIPLE SEQUENCE ALIGNMENT: Text alignment of query sequences.
Sequence 1  MDE-----KQRLQAYRFVAYSAVTFSTVAVFSLCITLPLVYNYVDGIKTQ
Sequence 2  MDE-----ITRRNAYRFVAYSAVTFSVVAVFSLCITLPMVYNYVHGIKSQ
Sequence 3  MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSS
COLOR CODED ALIGNMENT: Color coded alignment of query sequences (Sequence 1, Sequence 2, Sequence 3), with a mapping of colors to amino acid groups.
ALIGNMENT SCORE: 5269. Alignment score which can be compared with other scores to measure the quality of alignment.
Action
Shows the various output formats for multiple sequence alignment
Description of the action
Show the smaller image of the server with every output coming out of it one at a time
http://www.ebi.ac.uk/Tools/es/cgi-bin/clustalw2/
Audio Narration
Multiple sequence alignment gives various kinds of results after alignment. The alignment view in text format displays the residue-wise matching for the input sequences. The color coded alignment gives a better graphical picture, as the amino acid residues are assigned colors based on their physico-chemical properties. Here we depict one of the many color codings available. The alignment score is an absolute term, as selected previously. It can be compared with other scores to measure the quality of alignment. Users obtain a .output file with the summary of the result, .aln files which contain the text alignment and .dnd files which contain the distance based information. For detailed understanding of these outputs, kindly visit
http://www.ebi.ac.uk/Tools/clustalw2/index.html
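One simple way to read a text alignment such as the one described above is to list the columns in which every sequence carries the same residue. The sketch below is illustrative only: the function name `conserved_columns` is hypothetical, the alignment rows are passed in as plain strings rather than parsed from a .aln file, and only strict identity is reported (not the physico-chemical groups used for color coding).

```python
def conserved_columns(alignment):
    """Return 0-based indices of columns where all rows share one residue.

    `alignment` is a list of equal-length aligned strings with '-' for gaps.
    Columns containing a gap are never reported as conserved.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    return [i for i, column in enumerate(zip(*alignment))
            if len(set(column)) == 1 and column[0] != "-"]


if __name__ == "__main__":
    rows = [
        "MDE-----KQRLQAYRFVAYSAVTFSTVAVFSLCITLPLVYNYVDGIKTQ",
        "MDE-----ITRRNAYRFVAYSAVTFSVVAVFSLCITLPMVYNYVHGIKSQ",
        "MSEDLKQIAQETESLRKVAFFGIAVSTIATLTAIIAVPMLYNYMQHVQSS",
    ]
    print(conserved_columns(rows))
```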
Master Layout (Part 2)
This animation consists of 2 parts:
Part 1: Protein Sequence Alignment
Part 2: Alignment analysis and interpretations
Schematic of Part 2:
Phylogram representing evolutionary relationships
Protein secondary structures
Structural features that decide function
Definitions of the components
Part 2 – Alignment analysis and interpretations
1. Computational Phylogenetic Predictions: Sequence alignment studies of proteins can reveal the conserved and variable residues among the aligned sequences. Protein sequences derived from different organisms but having a high degree of similarity are assumed to come from the same ancestor. Such predictions, which can now be carried out computationally with the help of various algorithms, provide an insight into evolutionary processes.
2. Phylogram: A phylogram is a pictorial representation that provides a visualization of evolutionary relationships or phylogeny. In it, the lengths of the branches in the tree are considered to be proportional to the evolutionary distance.
3. Cladogram: A cladogram is another form of pictorial representation that also gives a visual insight into evolutionary relationships or phylogeny. Unlike the phylogram, the branches of a cladogram are of equal length irrespective of the evolutionary distance.
4. Maximum Parsimony: A method used for alignments which show very strong sequence similarity. It is usually applied to fewer than twelve sequences.
Definitions of the components
Part 2 – Alignment analysis and interpretations
5. Distance methods: These predict the evolutionary distance when there is any sequence variation present, and can be used on a large number of sequences. As the distance between two sequences increases, the uncertainty of the alignment also increases.
6. Maximum likelihood: This method is useful for prediction of evolutionary distance when sequence variability is high. It can be used for alignments with any amount of variability.
7. Protein structure prediction: The three dimensional structure of a protein is largely specified by its amino acid sequence. Protein structures can be predicted with an accuracy of about 70-75% when provided with the sequence.
8. Functional annotation: The function(s) of proteins can be predicted for those proteins having a well-described homology. Gene Ontology terms (GO terms) provide a unique identification of the function that the gene is involved in. These functions are categorized at different levels of a functional hierarchy.
9. Protein motif: A common pattern of residues in a set of protein sequences is known as a motif.
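To make the idea of a protein motif concrete, the sketch below scans a sequence for one well-known pattern, the N-glycosylation site written in PROSITE notation as N-{P}-[ST]-{P}. This is a pattern search for an already known motif, which is a different task from discovering new motifs in an alignment; the function name `find_motif` and the regular-expression translation are choices made for this illustration.

```python
import re

# N-{P}-[ST]-{P}: Asn, any residue except Pro, Ser or Thr, any residue except Pro.
# The lookahead makes overlapping occurrences visible as separate matches.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(sequence, pattern=N_GLYCOSYLATION):
    """Return (0-based start, matched residues) for every motif occurrence."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    # One of the example sequences shown on the multiple alignment slides.
    seq = "MKLLKLTGFIFFLFFLTESLTLPTQPRDIENFNSTQKFIEDNIEYITIIAFAQYVQEA"
    print(find_motif(seq))
```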
Step 1: Phylogenetic analysis from alignment - Input
SEQUENCE DATABASE
Enter a sequence alignment for 2 or more sequences:
Seq1 -------------- LLFLFSSAYSRGVFRRDTHK
Seq2 MKWVTFISLLFLFSSAYSRGVFRRDAH
Seq3 MKWVTFLLLLFVSGSAFSRGVFRREA
Select a method: MAXIMUM PARSIMONY, selected from the options below
MAXIMUM PARSIMONY: USED FOR SEQUENCES WITH HIGHLY CONSERVED RESIDUES
DISTANCE METHODS: USED FOR SEQUENCES WITH MODERATELY CONSERVED RESIDUES
MAXIMUM LIKELIHOOD: USED FOR SEQUENCES WITH HIGHLY VARIABLE RESIDUES
PHYLOGENETIC ANALYSIS (PHYLIP)
Action
Schematic of the process of analysis of alignment
Description of the action
Follow the animation steps. Show the description of each of the methods as the mouse hovers over them. Finally select the “Maximum Parsimony” method. The drop-down list should look like the drop-down lists seen on web-pages.
Audio Narration
Multiple sequence alignment produces alignment files (.aln),
which can be used to determine the evolutionary distances of
a set of given protein sequences. This can be achieved by
many server-based and stand-alone programs. The user
needs to select the method for calculating the distance. Here
we depict the usage of alignment files for phylogenetic
analysis.
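As a rough illustration of how distances can be derived from an alignment, the sketch below computes the simplest distance measure, the p-distance (the fraction of compared positions that differ), for every pair of aligned sequences. The function names are hypothetical, columns with a gap in either sequence are skipped by assumption, and phylogeny packages generally apply more refined evolutionary models than this.

```python
def p_distance(a, b):
    """Fraction of differing residues over columns with no gap in either row."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(1 for x, y in pairs if x != y) / len(pairs)


def distance_matrix(alignment):
    """Pairwise p-distances for a {name: aligned_sequence} mapping."""
    names = list(alignment)
    return {(m, n): p_distance(alignment[m], alignment[n])
            for i, m in enumerate(names) for n in names[i + 1:]}


if __name__ == "__main__":
    toy = {"Seq1": "PGFPPLVAPE", "Seq2": "PNLPRLVRPE", "Seq3": "PKLK-PDPNT"}
    for pair, dist in distance_matrix(toy).items():
        print(pair, round(dist, 3))
```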
Step 2: Phylogenetic analysis from alignment - Output
SEQUENCE DATABASE
Enter a sequence alignment for 2 or more sequences:
PGFPPLVAPEPDALCAAFQDN
PNLPRLVRPEVDVMCTAFHDN
PKLK-PDPNTLCDEFKADEKKF
Select a method: MAXIMUM PARSIMONY
PHYLOGENETIC ANALYSIS (PHYLIP)
DND FILES: DND files give the distance measure of the aligned sequences from their common ancestral node.
( seq 1:0.13525,
Seq 2:0.09868,
seq 3:0.09868);
CLADOGRAM: Branching diagram depicting evolutionary relationships or phylogeny.
PHYLOGRAM: Phylogram is a branching diagram depicting evolutionary relationships or phylogeny. In this, the length of branches in the tree are considered to be proportional to the evolutionary distance.
Action
Schematic of the process of analysis of alignment
Description of the action
Follow the animation steps. The server on the previous slide gives the following outputs
Audio Narration
The outputs from the analysis are the distance file, known as the DND file, and the Cladogram and Phylogram, which are evolutionary trees. In the DND file, there is a common node, and the value against each sequence is its distance from that common node. DND files thus give the distance measure of the aligned sequences from their common ancestral node. Cladograms are the graphical representation of the branching during evolution of the proteins that were aligned. Cladograms do not represent the evolutionary distances or the common ancestral node. Phylograms also represent the evolutionary distance tree in a graphical format. In these, the branch lengths correspond to the evolutionary distance between the two proteins. All branches will converge to a common ancestral root.
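The distances in a DND file like the one shown on this slide can be read out with a few lines of code. The sketch below is deliberately minimal: the function name `leaf_distances` is hypothetical, and it only extracts `name:length` pairs for leaves, so it does not rebuild the tree structure or report lengths attached to internal nodes.

```python
import re


def leaf_distances(newick):
    """Map each leaf name to its branch length in a Newick (.dnd) string.

    A leaf is any run of characters other than '(', ')', ',', ':' or ';'
    followed by ':' and a number. Surrounding whitespace is stripped.
    """
    pairs = re.findall(r"([^(),:;]+):([0-9.eE+-]+)", newick)
    return {name.strip(): float(length) for name, length in pairs}


if __name__ == "__main__":
    dnd = "( seq 1:0.13525,\nSeq 2:0.09868,\nseq 3:0.09868);"
    for name, dist in leaf_distances(dnd).items():
        print(name, dist)
```

In this example the second and third sequences sit at the same distance from the common node, while the first sequence is further away.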
Step 3: Structural and Functional prediction from alignment - Input
SEQUENCE DATABASE
Enter a sequence alignment for 2 or more sequences:
Seq 1 PGFPPLVAPEPDALCAAFQDN
Seq 2 PNLPRLVRPEVDVMCTAFHDN
Seq 3 PKLK-PDPNTLCDEFKADEKKF
Range for width of the motifs to be found: 6-50
Maximum number of motifs to be found: 3
Structural and Functional prediction (MeMe server)
Action
Schematic of the process for structural and functional analysis
Description of the action
Follow the animation steps. Input the alignment. Input the parameters. Click on the server tool.
http://meme.sdsc.edu/meme4_4_0/intro.html
Audio Narration
Alignment files can also be used for a variety of structural and functional analyses. Here we represent the functioning of such programs and servers by taking a simple example of protein motif prediction. The range of the width and the maximum number of motifs to be found are defined by the user.
Step 4: Structural and Functional prediction from alignment - Output
SEQUENCE DATABASE
Enter a sequence alignment for 2 or more sequences:
PGFPPLVAPEPDALCAAFQDN
PNLPRLVRPEVDVMCTAFHDN
PKLK-PDPNTLCDEFKADEKKF
Range for width of the motifs to be found: 6-50
Maximum number of motifs to be found: 3
Structural and Functional prediction (MeMe server)
Color coded block diagram for motifs: Block diagram of motif prediction is the schematic used to visualize the positions and kinds of motifs in the alignment of two or more sequences.
Residue-wise sites for motifs: The color coded diagram shows the positions of the motifs in the text alignment of the compared sequences.
Action
Schematic of the process for structural and functional analysis
Description of the action
Follow the animation steps. The server on the previous slide gives the following outputs
http://meme.sdsc.edu/meme4_4_0/intro.html
Audio Narration
The outputs obtained are
1. Block Diagram of protein motifs, which is the
schematic used to visualize the positions and kinds
of motifs in the alignment of two or more
sequences. The color coding varies from server to
server.
2. Sites of the blocks on a residue-by-residue basis.
Step 5: Structural and Functional prediction from alignment - Further Analysis
Pie chart: Functions that can be predicted from sequence data (Protein Motif)
Finding Enzyme Active Site (Enzyme active sites, Subtilisin)
Epitope Prediction in Antigen (Epitope prediction in antigens)
Finding Transmembrane domains
Identify DNA binding residues
Action
Description of the action
Animator needs to re-draw all the images shown as they have been retrieved from web-resources. Show the pie chart. Highlight one quarter of it one at a time and depict the diagram next to it along with narrating it.
Audio Narration
Once the protein motifs are detected, they can be used for further analysis, such as
1. Epitope Prediction
2. Active site determination
3. Determination of trans-membrane domains
4. Identification of DNA binding residues
http://qwickstep.com/search/the-active-site-of-an-enzyme.html
http://www.science.uva.nl/research/its/molsim/research/TMsignalling_lizhe/index.html
https://www.uzh.ch/oci/ssl-dir/group/files/14_roverview.jpg
http://medgadget.com/archives/2008/03/3d_imaging_of_bleomycindna_binding.html
Interactivity option 1: Find the evolutionary distance between insulin chain A of human and mouse
Steps (numbered in the order to be performed):
1. Input the term “insulin chain A” in the protein database of your choice
2. Choose the protein sequences corresponding to insulin A
3. Check the source organism for the protein sequence.
4. Store the FASTA sequences mentioned against Human and mouse in separate locations
5. Input the two sequences in a multiple alignment server
6. Run the server to obtain output
7. Check for the .aln file and input it into programs for finding Phylogenetic distances such as phylip
8. Check the .dnd file to find evolutionary distance
Interactivity Type
Arrange the steps in the order to be performed.
Options
Remove the step number from the bottom of the tab. Show all the steps in the mixed order. The user must click on the tabs order wise. If the user clicks at a tab which is not in the right order, then flash a message saying “try again”
Results
All the tabs must be arranged in right order.
Interactivity option 2.a : Match the following
Left column:
PAM MATRIX
DOMAIN IDENTIFICATION
PHYLOGRAM
BIT SCORE
E-VALUE
BLOSUM MATRIX
Right column (in mixed order):
SIMILARITY BASED SCORING MATRIX
EVOLUTIONARY TREE
MEASURE OF BIOLOGICAL SIGNIFICANCE
DISTANCE BASED SCORING MATRIX
MEASURE OF QUALITY OF ALIGNMENT, NORMALIZED ACCORDING TO SCORING MATRIX
BLAST RESULT LINKED TO PFAM
Interactivity Type
Match the left column to the right
Options
Match the meaning of the parameter on the right to the name of the parameter on the left. If the matching is correct, turn the tab green, else flash “Try Again”
Results
Results on next slide
Interactivity option 2.b : Match the following
PAM MATRIX : SIMILARITY BASED SCORING MATRIX
DOMAIN IDENTIFICATION : BLAST RESULT LINKED TO PFAM
PHYLOGRAM : EVOLUTIONARY TREE
BIT SCORE : MEASURE OF QUALITY OF ALIGNMENT, NORMALIZED ACCORDING TO SCORING MATRIX
E-VALUE : MEASURE OF BIOLOGICAL SIGNIFICANCE
BLOSUM MATRIX : DISTANCE BASED SCORING MATRIX
Interactivity Type
Match the left column to the right
Options
Match the meaning of the parameter on the right to the name of the parameter on the left. If the matching is correct, turn the tab green, else flash “Try Again”
Boundary/limits
Results
Correct Matching
Questionnaire
1. Which is a scoring matrix based on distantly related proteins?
Answers: a) PAM b) BLOSUM c) Both d) None
2. Which parameter signifies whether the match between two sequences is a chance alignment?
Answers: a) word-length b) e-value c) dot-plot d) none
3. Which evolutionary tree has the branch length corresponding to the evolutionary distances?
Answers: a) Phylogram b) Cladogram c) both d) none
4. Which is NOT a ClustalW output file extension?
Answers: a) .dnd b) .txt c) .aln d) .output
5. Phylogenetic method for most variable sequence is
Answers: a) Distance method b) Maximum Distance c) Maximum Parsimony d) Maximum Likelihood
Links for further reading
Reference websites:
http://blast.ncbi.nlm.nih.gov/Blast.cgi
http://www.ebi.ac.uk/Tools/clustalw2/index.html
http://www.pdb.org/pdb/home/home.do
http://expasy.org/sprot/
http://expasy.org/prosite/
http://pfam.sanger.ac.uk/
http://www.psc.edu/general/software/packages/phylip/
Following URLs are used for animations:
http://www.ncbi.nlm.nih.gov/
http://blast.ncbi.nlm.nih.gov/Blast.cgi
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
http://pfam.sanger.ac.uk/
http://www.ebi.ac.uk/Tools/es/cgi-bin/clustalw2/
http://meme.sdsc.edu/meme4_4_0/intro.html
http://www.ebi.ac.uk/Tools/clustalw2/index.html
http://qwickstep.com/search/the-active-site-of-an-enzyme.html
http://www.science.uva.nl/research/its/molsim/research/TMsignalling_lizhe/index.html
https://www.uzh.ch/oci/ssl-dir/group/files/14_roverview.jpg
http://medgadget.com/archives/2008/03/3d_imaging_of_bleomycindna_binding.html
Books:
Bioinformatics: Sequence and Genome Analysis by David Mount