Protein Sequence Analysis
By
Rashmi Shrivastava
Lecturer
School of Biotechnology
Devi Ahilya Vishwavidyalaya
Indore
Introduction
• The genomes of many organisms have been sequenced.
• The next step is to identify key regions, especially
protein-coding regions.
• Assigning functions to individual proteins.
• Predicting the molecular structures of the proteins.
• Developing protein interaction networks.
• Utilizing the information obtained for structure-based
drug design, discovering new drug targets, creating
mutations to alter properties or create a desired
property in proteins, and so on.
• By themselves the letters (amino acid sequence/
genome sequence) have no meaning.
Our aim is to recognize sentences (proteins) and
words (motifs: patterns and signatures).
• To investigate the meaning of sequences there are
two approaches:
- Pattern recognition techniques: detect similarity
between sequences.
- Ab initio prediction methods: prediction of
structure and thus the function.
Protein databases
(The source of information)
Primary and Secondary databases
Primary sequence databases
- Entrez Protein
- PIR: developed at NBRF
- Swiss-Prot
- TrEMBL
Secondary databases
- Results of analysis of primary databases
- PROSITE/InterPro: protein families characterized
by the presence of the single most conserved motif
(domain), identified by multiple sequence alignment
- PRINTS: protein families are characterized by
several conserved motifs that together form a fingerprint
or signature for a particular family
- BLOCKS and Pfam
- Profiles: variable regions between conserved motifs
contain information about insertions and
deletions, which helps detect distant sequence relationships
- ENZYME and KEGG: functional classification
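A motif of the kind stored in PROSITE can be searched for with an ordinary regular expression. The sketch below is only an illustration: the helper names prosite_to_regex and find_motif are hypothetical, the test sequence is made up, and only the basic pattern syntax is handled (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats). The pattern used is the PROSITE N-glycosylation site pattern N-{P}-[ST]-{P}.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            token = "."                        # any residue
        elif core.startswith("{"):
            token = "[^" + core[1:-1] + "]"    # any residue except those listed
        else:
            token = core                       # single residue or [ABC] class
        if lo:
            token += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(token)
    return "".join(parts)

def find_motif(sequence, pattern):
    """Return (1-based start, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]

# Made-up sequence: the first N fits the pattern, the second is followed by P.
hits = find_motif("MKNGSAANPTL", "N-{P}-[ST]-{P}")
print(hits)
```

Anchors for the N- or C-terminus and other less common pattern features are deliberately left out of this sketch.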
Structure classification databases
• SCOP (Structural Classification of Proteins):
classifies proteins in a hierarchy of family, superfamily and
fold
• CATH (Class, Architecture, Topology, Homology):
hierarchical domain classification of proteins
C - gross secondary structure content
A - arrangement of secondary structure elements
T - overall shape and connectivity
H - homologous superfamily (for example, >= 35% sequence identity)
• Protein Data Bank (PDB)
Sequence alignment
- Pairwise
- Multiple
Pairwise Sequence Alignment
- Global sequence alignment
- Local sequence alignment
Algorithm
• Global sequence alignment: Needleman-Wunsch
• Local sequence alignment: Smith-Waterman
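The difference between the two algorithms shows up clearly in a score-only sketch. The code below is a minimal illustration, not a full implementation: it uses a simple match/mismatch score and a linear gap penalty (all three values are assumptions chosen for the example), and it omits the traceback that would recover the alignment itself. The function names are hypothetical.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: best score for aligning all of a with all of b."""
    prev = [j * gap for j in range(len(b) + 1)]   # first row: leading gaps cost
    for i in range(1, len(a) + 1):
        cur = [i * gap]                            # first column: leading gaps cost
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s,        # align a[i-1] with b[j-1]
                           prev[j] + gap,          # gap in b
                           cur[j - 1] + gap))      # gap in a
        prev = cur
    return prev[-1]

def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman: best score over all pairs of subsequences of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)                      # borders are zero, not gap costs
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0,                          # restart the alignment here
                       prev[j - 1] + s,
                       prev[j] + gap,
                       cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best

print(global_score("ATCAGAGTC", "TTCAGTC"))
print(local_score("ATCAGAGTC", "TTCAGTC"))
```

The only structural differences are the zero floor in each cell and the zero borders of the local version, which let an alignment start and end anywhere.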
Identity & Similarity:
In an alignment, the sequence that is already in the database is
known as the subject, and the sequence being aligned against it
is termed the query or probe sequence.
If the aligned probe residue is the same as the subject residue,
the pair is identical; if the two residues differ but are of the same
chemical nature (e.g. glutamate and aspartate), they are similar.
VLSPADKTNVKAAWGKVGAHAGYEG
|||  .   |   | || |     |
VLSEGEWQLVLHVWAKVEADVAGHG
Total residues: 25
Identical residues: 9
Similar (not identical): 1
Gaps: 0
Percent similarity: 40.000 (| and .) (identity + similarity)
Percent identity: 36.000 (| only)
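The percentages above can be reproduced with a few lines of code. This is a sketch under one stated assumption: the similarity groups (D/E, K/R, I/L/V) are chosen only for illustration; real tools derive similarity from a substitution matrix. The function name is hypothetical.

```python
# Assumed similarity groups, for illustration only.
SIMILAR_GROUPS = [set("DE"), set("KR"), set("ILV")]

def percent_identity_similarity(query, subject, gap="-"):
    """Return (percent identity, percent similarity) over all alignment columns."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    total = len(query)
    identical = similar = 0
    for q, s in zip(query, subject):
        if gap in (q, s):
            continue                       # gap columns count toward the total only
        if q == s:
            identical += 1
        elif any(q in g and s in g for g in SIMILAR_GROUPS):
            similar += 1
    return 100 * identical / total, 100 * (identical + similar) / total

identity, similarity = percent_identity_similarity(
    "VLSPADKTNVKAAWGKVGAHAGYEG",
    "VLSEGEWQLVLHVWAKVEADVAGHG",
)
print(identity, similarity)
```

For the pair shown above, the nine identical columns and the single D/E column give 36% identity and 40% similarity.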
Alignment
The same two sequences can be aligned with the gap placed at different positions:
ATCAGAGTC
TTC--AGTC

ATCAGAGTC
TTCAG--TC

ATCAGAGTC
TTCA--GTC
Aligning Sequences
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact
Gap Insertion:
Without a gap:
VKLAWAAKGNEAAPAKAAVDHYVAA
VKAWAAKGNEAEGLSAAPDJKVAAP
Total residues: 25
Identical residues: 4
Gaps: 0
Percent identity: 16.00
With one gap inserted:
VKLAWAAKGNEAAPAKAAVDHYVAA
VK_AWAAKGNEAEGLSAAPDJKVAA
Total residues: 25
Identical residues: 18
Gaps: 1
Percent identity: 72.00
Scoring System
Homologous proteins can differ even between closely related organisms.
Some substitutions are more frequent than
others.
Chemically similar amino acids can be
substituted for one another without severely affecting the
protein's function and structure.
Matrices formed to score alignment:
Sparse matrices (identity matrices):
Based on identical residue matching only
Problems faced:
1. Diagnostic power is relatively poor, as all
identical matches carry equal weighting.
2. A match may be mathematically significant but biologically
insignificant.
To solve this problem:
Scoring matrices have been devised that weight matches between
non-identical residues according to observed substitution rates
across large evolutionary distances.
These scoring matrices can pick up relationships that are mathematically
weak but biologically significant, especially when aligning sequences
of very low identity.
Point Accepted Mutation (PAM or
Dayhoff) Matrices
• Similar sequences are organized into phylogenetic
trees
• The number of amino acid changes is counted
• Relative mutabilities are evaluated
• A 20 x 20 amino acid substitution matrix is calculated
• PAM 1: 1 accepted mutation event per 100
amino acids; PAM 250: 250 mutation events
per 100 …
• The PAM 1 matrix can be multiplied by itself N
times to give transition matrices for
sequences separated by N accepted mutations per 100 residues
• Derived from global alignments of closely related
sequences.
• Matrices for greater evolutionary distances are
extrapolated from those for lesser ones.
• The number after the matrix name (PAM40, PAM100)
refers to the evolutionary distance; greater
numbers mean greater distances.
• Does not take into account the different evolutionary
rates of conserved and non-conserved
regions.
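The extrapolation step, multiplying the PAM 1 matrix by itself, can be sketched with a toy example. Everything numeric here is hypothetical: a made-up 3-letter alphabet and made-up mutation probabilities stand in for the real 20 x 20 PAM 1 probability matrix, and the function names are illustrative. The published PAM scoring matrices are log-odds values derived from such probability matrices, a step not shown here.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, n):
    """Return m multiplied by itself n times (identity matrix for n = 0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Hypothetical "PAM 1"-like matrix for a 3-letter alphabet: each row gives the
# probability that a residue stays the same or changes after one PAM unit.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.004, 0.990, 0.006],
    [0.003, 0.007, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Each row still sums to 1 after 250 steps, but the diagonal has dropped well below 0.99: at large distances a residue is much less likely to be found unchanged.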
[Figures: PAM 1 and PAM 250 matrices]
Scoring:
AKWTNLK----WAKV-ADVAGH-G
AK-TNVKAKLPWGKVGGHVAGEYG
The score of the alignment in this system is:
  sum of matrix values at (A,A) + (K,K) + (T,T) + (N,N) + (L,V) + (K,K) + (W,W) + (A,G) + …
- (penalty for gap insertion/deletion) * (number of gaps)
- (penalty for gap extension) * (total length of all gaps)
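The scoring rule above can be written out as a short function. This is a sketch with assumed numbers: the substitution function (+2 for identical residues, -1 otherwise) and the penalties (5 per gap, 1 per gap position) are placeholders for a real matrix such as PAM or BLOSUM and for real gap costs. The function names are hypothetical.

```python
def count_gaps(a, b, gap="-"):
    """Return (number of gap runs, total gap length) for two aligned strings."""
    n_gaps = 0
    total = 0
    prev = None                      # which row held the gap in the previous column
    for x, y in zip(a, b):
        cur = "a" if x == gap else "b" if y == gap else None
        if cur is not None:
            total += 1
            if cur != prev:          # a new run starts here
                n_gaps += 1
        prev = cur
    return n_gaps, total

def alignment_score(a, b, sub, gap_open, gap_extend, gap="-"):
    """Sum of pair scores minus per-gap and per-position gap penalties."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pair_sum = sum(sub(x, y) for x, y in zip(a, b) if x != gap and y != gap)
    n_gaps, total = count_gaps(a, b, gap)
    return pair_sum - gap_open * n_gaps - gap_extend * total

def toy_sub(x, y):
    """Assumed placeholder for a substitution matrix lookup."""
    return 2 if x == y else -1

top = "AKWTNLK----WAKV-ADVAGH-G"
bottom = "AK-TNVKAKLPWGKVGGHVAGEYG"
print(count_gaps(top, bottom))
print(alignment_score(top, bottom, toy_sub, gap_open=5, gap_extend=1))
```

With these placeholder values the alignment has 4 gaps covering 7 positions, 12 identical columns and 5 mismatched ones.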
BLOSUM
• Henikoff, S. & Henikoff, J.G. (1992)
• Uses blocks of protein sequence fragments from
different families (the BLOCKS database)
• Amino acid pair frequencies are calculated by
summing over all possible pairs in each block
• Different evolutionary distances are incorporated
into this scheme with a clustering procedure
(identity over a particular threshold = same cluster)
• Target frequencies are identified directly
instead of by extrapolation.
• Sequences more than x% identical within
the block where substitutions are being
counted are grouped together and treated as
a single sequence
– BLOSUM 50: >= 50% identity
– BLOSUM 62: >= 62% identity
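The pair-counting step behind BLOSUM can be sketched as follows. This is a simplified illustration: the block is a made-up single column, the clustering of highly identical sequences is omitted, and the function names are hypothetical. The log-odds scaling (2 x log2, rounded) is the half-bit scaling commonly quoted for BLOSUM matrices.

```python
from collections import Counter
from itertools import combinations
from math import log2

def count_pairs(block):
    """Count unordered residue pairs, column by column, in an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            counts[tuple(sorted((x, y)))] += 1
    return counts

def log_odds(counts):
    """Turn pair counts into rounded log-odds scores (half-bit units)."""
    total = sum(counts.values())
    q = {pair: c / total for pair, c in counts.items()}   # observed pair frequencies
    p = Counter()                                         # residue frequencies
    for (x, y), f in q.items():
        if x == y:
            p[x] += f
        else:
            p[x] += f / 2
            p[y] += f / 2
    scores = {}
    for (x, y), f in q.items():
        expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = round(2 * log2(f / expected))
    return scores

# Made-up block: one column holding nine A residues and one S residue.
block = ["A"] * 9 + ["S"]
pairs = count_pairs(block)
print(pairs)
print(log_odds(pairs))
```

A column with nine A and one S contributes 36 A-A pairs and 9 A-S pairs; in a real matrix the counts are summed over every column of every block before the log-odds step.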
A   4
B  -2   6
C   0  -3   9
D  -2   6  -3   6
E  -1   2  -4   2   5
F  -2  -3  -2  -3  -3   6
G   0  -1  -3  -1  -2  -3   6
H  -2  -1  -3  -1   0  -1  -2   8
I  -1  -3  -1  -3  -3   0  -4  -3   4
K  -1  -1  -3  -1   1  -3  -2  -1  -3   5
L  -1  -4  -1  -4  -3   0  -4  -3   2  -2   4
M  -1  -3  -1  -3  -2   0  -3  -2   1  -1   2   5
N  -2   1  -3   1   0  -3   0   1  -3   0  -3  -2   6
P  -1  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7
Q  -1   0  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5
R  -1  -2  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5
S   1   0  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4
T   0  -1  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5
V   0  -3  -1  -3  -2  -1  -3  -3   3  -2   1   1  -3  -2  -2  -3  -2   0   4
W  -3  -4  -2  -4  -3   1  -2  -2  -3  -3  -2  -1  -4  -4  -2  -3  -3  -2  -3  11
X  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
Y  -2  -3  -2  -3  -2   3  -3   2  -1  -2  -1  -1  -2  -3  -1  -2  -2  -2  -1   2  -1   7
Z  -1   2  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -1  -2   5
    A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z
Rules of thumb
Lower PAM and higher BLOSUM matrices find short local
alignments of highly similar sequences.
Higher PAM and lower BLOSUM matrices find longer, weaker
local alignments.
PAM vs. BLOSUM
Based on the basic assumptions and the
construction of each matrix:
The PAM model is designed to track the evolutionary
origins of proteins.
The BLOSUM model is designed to find conserved
domains of proteins.
Protein Structure
• Primary structure: the linear sequence
of amino acids in a protein molecule
• Secondary structure: regions of local
regularity within a protein fold (α-helices, β-strands, turns,
etc.)
• Supersecondary structure: the arrangement of α-helices
and/or β-strands into discrete folding units (β-barrels,
β-α-β units, Greek key motifs, etc.)
• Tertiary structure: the overall fold of a protein sequence,
formed by packing of its secondary and/or supersecondary structure elements
• Quaternary structure: the arrangement of separate protein
chains in a protein molecule
From the Primary sequence to
protein properties
• Predicting protein localization or secretory
nature from the presence of a signal peptide and
localization signals
• Transmembrane helix prediction to identify
membrane proteins
• Calculation of physicochemical properties: pI, molecular weight
• Identification of coiled-coil regions
Post-translational modification
prediction
www.expasy.org
Kyte-Doolittle hydrophobicity
plot
• Nature of amino acids: hydrophilic or
hydrophobic
• A window of 9-20 amino acids is taken
• A value greater than 0 means hydrophobic
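The plot is built from a sliding-window average of per-residue hydropathy values. The sketch below uses the standard Kyte-Doolittle scale; the function name, the default window of 9 and the test sequence are assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=9):
    """Return the mean hydropathy of every full window along the sequence."""
    seq = seq.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KD[residue] for residue in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(seq) - window + 1)]

# Made-up sequence: a hydrophobic stretch followed by a charged stretch.
profile = hydropathy_profile("IIIIIIIIIKKKKKKKKK", window=9)
print([round(v, 2) for v in profile])
```

Each value is usually plotted at the centre of its window; stretches where the average stays above 0 are read as hydrophobic, in line with the rule above.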
From Sequence to Structure
• Secondary structure prediction: GOR,
PredictProtein, nnpredict
• Domain prediction: SBASE, ProDom
Importance of protein secondary
structure prediction
Basis of Secondary structure
prediction
• Conservation in the multiple sequence
alignment
• Hidden Markov Models and Neural
networks
• 70-80% accuracy is achieved.
Method used
Key features of secondary structure
prediction
Chou-Fasman algorithm
GOR
Multiple Sequence
Some sites
• Predator