Aligning protein sequences by hand
One of the most powerful tools in the bioinformaticist's toolbox is sequence alignment. Let's see why this
is so with a few examples.
• Suppose we have cloned and sequenced a protein that we believe to be a protease. Which protease
could it be? Search PubMed to find out more about proteases and protease families. A
database search using BLAST tells us that this protein is a remote member of the serine
protease family. So we make an alignment against the most similar serine protease. This tells us that
the overall sequence identity is about 29%. That is not very much, but the local sequence identity
around the three active site residues is considerably higher, and because of that we can be confident that
our new protein is a serine protease.
• Suppose we have a protein that we can easily obtain in large quantities, and we want to use it in a
bioreactor. Unfortunately, the industrial process requires a temperature of 65 °C, but our protein is
heat labile and denatures at temperatures higher than 52 °C. What will you do? Introducing some
mutations may make the protein more stable. Simple, but which of the 298 amino acids should be
mutated? The only thing we know is that we should not mutate in or near the active site because that
could alter the specificity. This is where sequence alignments come in. We align our protein against
a series of family members that have been purified from thermophilic members of the domains Bacteria
and Archaea. We look at the multiple sequence alignment, and if we see positions where all the
thermostable proteins have one type of residue, and our protein another, we may have a site
that we could mutate. If one such position is also far away from the active site, and not in an
unpleasant position (like the first residue because of cleavage of the pro-peptide, or just in the middle
of the epitope that our monoclonal antibody recognizes), we have potentially found a stabilizing
mutation.
• A third example will pop up in due time.
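The column scan described in the thermostability example can be sketched in a few lines of Python. This is only an illustration: the function name candidate_sites, the toy sequences, and the excluded position are all made up for the sketch, and the rows are assumed to come from an existing multiple sequence alignment (equal length, gaps written as "-").

```python
def candidate_sites(ours, thermostable, excluded=()):
    """Return (column, our residue, thermostable residue) for every
    alignment column where all thermostable sequences agree on one
    residue and our protein carries a different one.

    Columns are numbered from 1. `excluded` holds columns we refuse to
    touch (e.g. near the active site); it is a hypothetical input.
    """
    sites = []
    for col, our_res in enumerate(ours):
        column = {seq[col] for seq in thermostable}
        if len(column) != 1:
            continue  # thermostable homologs disagree here
        (consensus,) = column
        if "-" in (consensus, our_res):
            continue  # ignore gap columns
        if consensus != our_res and (col + 1) not in excluded:
            sites.append((col + 1, our_res, consensus))
    return sites


# Toy, invented aligned sequences (not real proteins).
ours = "MKTAYIAGSD"
thermostable = ["MKTPYIAGRD", "MRTPYLAGRD", "MKTPYIAGRD"]

# Pretend column 9 sits next to the active site and must stay as it is.
print(candidate_sites(ours, thermostable, excluded={9}))
# prints [(4, 'A', 'P')]
```

A hit from such a scan is only a suggestion; whether the position is far from the active site or inside an epitope still has to be checked against the structure.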
But let's start with some examples of alignments.
The question for the first example given below is: "Write down in your own words why the green
alignment is better than the red one, and why this seems to be wrong at first."
If we have two sequences, with two different alignments:
A   TVTVTGNSITIT        A   TVTVTGNSITIT
B1  TVTVTG--ITIT        B2  TVTVT--GITIT
then the left alignment looks much better, but look at the corresponding structures that are shown below:
Structure A
TVTVTGNSITIT
the structure that would lead to alignment B1
TVTVTGNSITIT
TVTVTG--ITIT
the structure that would lead to alignment B2
TVTVTGNSITIT
TVTVT--GITIT
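Why alignment B1 "looks much better" can be made concrete by counting identical aligned residues. The helper below is a minimal sketch (the name identities is ours, not part of any library); it only counts matches and knows nothing about the structures.

```python
def identities(ref, other):
    """Count columns where both aligned rows carry the same residue."""
    if len(ref) != len(other):
        raise ValueError("aligned rows must have equal length")
    return sum(1 for a, b in zip(ref, other) if a == b and a != "-")


a = "TVTVTGNSITIT"
b1 = "TVTVTG--ITIT"
b2 = "TVTVT--GITIT"

print(identities(a, b1), identities(a, b2))
# prints 10 9
```

B1 wins by one identity, which is exactly the kind of evidence that the structural comparison puts into question.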
1 Aligning sequences by hand
The alignment given below is very straightforward to achieve and does not require software.
-ASTRGFHILTYHGVCIPPYILRTSA
AATTKGFHVISYHGICLPPYMIRT--
However, the following alignment of two sequences was not straightforward and required some
thinking.
-ASTRGFHILTYHGVCIPPYILRTSA
AATTQPF--ISFHSICLGNFMIRS--
Nevertheless, I think that this alignment is the best that can be achieved for these two sequences. How can
I know that? How did I make this alignment?
Let's think about an alignment. An alignment is a representation of a whole series of events that took place
during evolution and that left their traces in the sequence. So, the more likely it is that something happens
(or does not happen!) during evolution, the more important it is to have this "something" show up in the
alignment.
What kinds of "something" are important? Let's give a few examples:
• It is much easier to mutate than to insert or delete (indel).
• Once nature has decided on an indel, its length is less important, but longer indels are more difficult
to make than shorter ones.
• Active site residues don't mutate.
• Residues tend to mutate into similar residues (e.g. V <-> I; S <-> T; etc.).
• Residues mutate more easily to residues encoded by similar codons.
• Cysteines that sit in cysteine bridges don't mutate easily.
• Surface residues mutate more easily than core residues.
• Core residues mutate more easily when they make fewer contacts.
• It is hard to mutate a glycine that sits somewhere with torsion angles that other residues cannot
have.
• Etc.
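Some of these rules can be turned into a crude score. The sketch below covers only the first, second, and fourth rules: mismatches are cheap, opening an indel is expensive, extending it is cheap, and similar residues score better than dissimilar ones. The residue groups and all numbers are arbitrary assumptions chosen for illustration, not values from any published matrix.

```python
# Assumed similarity groups; any sensible grouping would do for the sketch.
SIMILAR_GROUPS = ["VILM", "ST", "FYW", "KR", "DE", "NQ"]


def pair_score(a, b):
    """Identity scores best, a 'similar' pair a little, anything else costs."""
    if a == b:
        return 2
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return 1
    return -1


def alignment_score(row1, row2, gap_open=-4, gap_extend=-1):
    """Score two aligned rows with an affine gap penalty.

    Opening a gap costs gap_open; each further gap column in the same
    row costs only gap_extend. End gaps are treated like internal ones.
    """
    score = 0
    prev_gap = None  # which row held a gap in the previous column
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            continue  # meaningless column, skip it
        gap = 1 if a == "-" else 2 if b == "-" else None
        if gap is None:
            score += pair_score(a, b)
        elif gap == prev_gap:
            score += gap_extend
        else:
            score += gap_open
        prev_gap = gap
    return score


ref = "TVTVTGNSITIT"
print(alignment_score(ref, "TVTVTG--ITIT"))  # one indel of length 2: 15
print(alignment_score(ref, "TVTVT-G-ITIT"))  # two indels of length 1: 9
```

The remaining rules (active sites, cysteine bridges, surface versus core, special glycines) need structural knowledge that a score based on the sequence alone cannot supply.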
We will now start working on sequence alignments. We will slowly add one rule after the other, and learn
a few new physicochemical properties of amino acids while we are doing this.
2 Hydrophobicity in sequence alignment
For each of the following examples, work out which is the better alignment, the one at the right or the one
at the left.
CPISRTWASIFRCW        CPISRTWASIFRCW
CPISRT---LFRCW        CPISRTL---FRCW

CPISRTSASIFRCW        CPISRTSASIFRCW
CPISRT---TFRCW        CPISRTT---FRCW

CPISRTGASIFRCW        CPISRTGASIFRCW
CPISRTA---FRCW        CPISRT---AFRCW

CPISRTRASEFRCW        CPISRTRASEFRCW
CPISRTK---FRCW        CPISRT---KFRCW

CPISRTIASNFRCW        CPISRTIASNFRCW
CPISRTH---FRCW        CPISRT---HFRCW

CPISRTEASDFRCW        CPISRTEASDFRCW
CPISRT---NFRCW        CPISRTN---FRCW

CPISRTEASNFRCW        CPISRTEASNFRCW
CPISRTQ---FRCW        CPISRT---QFRCW

CPISRTFASTFRCW        CPISRTFASTFRCW
CPISRT---YFRCW        CPISRTY---FRCW
3 Secondary structure and sequence alignment
Sometimes the secondary structure of at least one of the sequences is known. This can either be the
secondary structure as derived from a PDB file (remember, those are the files in which coordinates are
stored) or it can be a predicted secondary structure.
Before we use this information, let's look at some aspects of secondary structure. By now we know that
secondary structure elements fall into four categories:
1. Helix
2. Strand
3. Turn
4. The rest
And if you look at the Chou and Fasman parameters (and some other very useful data) you see that there
is a relation between residue type and secondary structure.
Of course, as always in bioinformatics, the rules that are suggested by these parameters are not hard-and-fast,
and exceptions are everywhere. Nevertheless, they make some sense. So we will study them.
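To get a feeling for what such parameters say, the sketch below averages helix and strand propensities over a short peptide and reports the larger one. It is a deliberate simplification, not the Chou-Fasman procedure itself (which searches for nucleation windows and also handles turns). The numbers are the helix and strand propensities as they are commonly tabulated from Chou and Fasman's 1978 work; please verify them against the table you are using. The function names are ours.

```python
# Propensities as commonly quoted; check against your own table.
P_HELIX = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16, "F": 1.13,
    "Q": 1.11, "W": 1.08, "I": 1.08, "V": 1.06, "D": 1.01, "H": 1.00,
    "R": 0.98, "T": 0.83, "S": 0.77, "C": 0.70, "Y": 0.69, "N": 0.67,
    "P": 0.57, "G": 0.57,
}
P_STRAND = {
    "V": 1.70, "I": 1.60, "Y": 1.47, "F": 1.38, "W": 1.37, "L": 1.30,
    "C": 1.19, "T": 1.19, "Q": 1.10, "M": 1.05, "R": 0.93, "N": 0.89,
    "H": 0.87, "A": 0.83, "S": 0.75, "G": 0.75, "K": 0.74, "P": 0.55,
    "D": 0.54, "E": 0.37,
}


def mean_propensity(seq, table):
    """Average propensity of all residues in seq for one structure type."""
    return sum(table[res] for res in seq) / len(seq)


def crude_guess(seq):
    """Return 'helix' or 'strand', whichever has the higher mean propensity."""
    helix = mean_propensity(seq, P_HELIX)
    strand = mean_propensity(seq, P_STRAND)
    return "helix" if helix > strand else "strand"


print(crude_guess("ALLELAMK"))  # helix
print(crude_guess("VTVTV"))     # strand
```

A whole-peptide average cannot find a turn in the middle of a sequence, so for peptides containing G or P it pays to look at the residues one by one.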
Using these rules, 'predict' the secondary structure of the following sequences:
1. ELMKIAQLAKRGP
2. VVICETTWYVEVT
3. VTITVEGPKITVE
4. SRGGEPTRHEAKE
5. ELLALKLLTVTVT
And select from each of these pairs the better helix:
ALLKAMEAALL        ALLNAMQAAGL
KRAAEALLEAE        DEAAEALLKAR
ALLLAALLLAL        AAEALAKALLR
And which are the better strands in:
VVKISVTIKSG        LLKISLTIILI
VVTTVVTTVVTT       VTVTVTVTVTV
VVICFFWIIFVI       VKICFKSIYVR
4 Using secondary structure information in sequence alignment
Now, how do we use this information? Well, let's start with an example. Predict and sketch the structure
of:
VTVTVTGNTVTVTV
and make the alignment with:
VTVTVSGVTVTV
That alignment requires two deletions in the middle. After you have made the alignment, predict and
sketch the secondary structure of VTVTVSGVTVTV as well. Finally, compare the secondary structure
predictions (and sketches) with the alignment. Do you now see how secondary structure can help?
Align the two sequences:
LLAELALAAMKGSTPNGS
LLLEALMRGTTPNGG
Now predict the secondary structure of the first sequence and look at the alignment again. What is the
problem? How do we solve this?
5 The last example
In this last example, we show everything in pictures again. The question with this example is again:
"Write down in your own words why the green alignment is better than the red one, and why that seems
funny at first."
If we have two sequences, with two different alignments:
A   ALLELAMKLAIGNSGP        A   ALLELAMKLAIGNSGP
B1  ALLELAMK--IGNSGP        B2  ALLELAMKIG--NSGP
then the left alignment looks much better, but look at the corresponding structures that are shown below:
Structure A
ALLELAMKLAIGNSGP
the structure that would lead to alignment B1
ALLELAMKLAIGNSGP
ALLELAMK--IGNSGP
the structure that would lead to alignment B2
ALLELAMKLAIGNSGP
ALLELAMKIG--NSGP
And, if by now it does not seem clear that knowledge about the structure can help with the fine-tuning of
the alignment, you are in trouble.
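The idea of using structure to fine-tune an alignment can be sketched as a check on where the gaps fall. The secondary structure string below is purely our assumption for illustration (a helix over the first ten residues of A, followed by a loop); the actual structures are in the pictures. The function name gaps_in_structure is ours as well.

```python
def gaps_in_structure(ref_ss, row, structured="HE"):
    """Count gap columns of `row` that fall inside a helix (H) or strand (E)
    of the reference. Assumes the reference row itself contains no gaps,
    so ref_ss lines up with the alignment columns.
    """
    if len(ref_ss) != len(row):
        raise ValueError("structure string and aligned row differ in length")
    return sum(1 for ss, res in zip(ref_ss, row)
               if res == "-" and ss in structured)


ref    = "ALLELAMKLAIGNSGP"
ref_ss = "HHHHHHHHHH------"  # ASSUMED: helix, then loop
b1     = "ALLELAMK--IGNSGP"
b2     = "ALLELAMKIG--NSGP"

print(gaps_in_structure(ref_ss, b1), gaps_in_structure(ref_ss, b2))
# prints 2 0
```

Under this assumed structure, B1 deletes two residues from the helix while B2 puts the deletion in the loop, even though B1 has more identical residues.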