Protein Sequence Analysis
By Rashmi Shrivastava, Lecturer, School of Biotechnology, Devi Ahilya Vishwavidyalaya, Indore

Introduction
• The genomes of many organisms have been deciphered.
• The next step is to identify key regions, especially protein-coding regions.
• Assigning functions to individual proteins.
• Predicting the molecular structures of the proteins.
• Developing protein interaction networks.
• Using the information obtained for structure-based drug design, discovering new drug targets, creating mutations to alter properties or introduce a desired property in proteins, and so on.
• By themselves the letters (amino acid sequence or genome sequence) have no meaning. Our aim is to read them as a language: sentences correspond to proteins, words to motifs (recognizable patterns and signatures).
• There are two approaches to investigating the meaning of sequences:
  - Pattern recognition techniques: detect similarity between sequences.
  - Ab initio prediction methods: predict the structure and thus the function.

Protein databases (the source of information)
Primary and secondary databases

Primary sequence databases
- Entrez Protein
- PIR (developed at NBRF)
- Swiss-Prot
- TrEMBL

Secondary databases (results of analysis of primary databases)
- PROSITE/InterPro: protein families characterized by the presence of the single most conserved motif (domain), found by multiple sequence alignment.
- PRINTS: protein families characterized by several conserved motifs, which together form a fingerprint or signature for a particular family.
- BLOCKS and Pfam
- Profiles: the variable regions between conserved motifs carry information about insertions and deletions, which helps reveal distant sequence relationships.
- ENZYME and KEGG: functional classification.

Structure classification databases
• SCOP (Structural Classification of Proteins): hierarchical classification into family, superfamily and fold.
• CATH (Class, Architecture, Topology, Homology): hierarchical domain classification of proteins.
  C: gross secondary structure content
  A: arrangement of secondary structure elements
  T: overall shape and connectivity
  H: >= 35% sequence identity
• Protein Data Bank (PDB)

Sequence alignment
• Pairwise
• Multiple

Pairwise Sequence Alignment
• Global sequence alignment
• Local sequence alignment

Algorithm
• Global sequence alignment: Needleman-Wunsch
• Local sequence alignment: Smith-Waterman

Identity & Similarity:
In an alignment, the sequence that is already in the database is called the subject, and the sequence being aligned against it is called the query or probe sequence. If an aligned probe residue is the same as the subject residue, the pair is identical; if the two residues differ but are of the same chemical nature (for example glutamate and aspartate), they are similar.

VLSPADKTNVKAAWGKVGAHAGYEG
|||  .   |   | || |     |
VLSEGEWQLVLHVWAKVEADVAGHG

Total residues: 25
Identical residues: 9
Similar (not identical): 1
Gaps: 0
Percent similarity: 40.000 (| and ., identity + similarity)
Percent identity: 36.000 (| only)

Alignment
The same pair of sequences can be aligned with the gap placed in different positions:

ATCAGAGTC
TTC----AGTC

ATCAGAGTC
TTCAG----TC

ATCAGAGTC
TTCA----GTC

Aligning Sequences
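The global alignment method named above (Needleman-Wunsch) can be sketched as a small dynamic-programming routine. This is a minimal illustration, not the slides' own material: the function name `nw_score`, the scores (match +1, mismatch -1) and the linear gap penalty of -1 are all assumptions chosen for readability, and the function returns only the optimal score, not the alignment itself.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of strings a and b.

    Assumed scoring (hypothetical values): +1 match, -1 mismatch,
    -1 per gap position (linear gap cost).
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(nw_score("ACGT", "AGT"))  # three matches and one gap
```

A local (Smith-Waterman style) variant differs mainly in that each cell is not allowed to drop below zero and the best score is taken over the whole table rather than from the last cell.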
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact

Gap Insertion:

Without a gap:
V K LA W AA K G N E AA PA K AA V D H Y V AA
V K A W AA K G N E A E G L S AA P D J K V AA P
Total residues: 25
Identical residues: 4
Gaps: 0
Percent identity: 16.00

With one gap inserted:
V K LA W AA K G N E AA PA K AA V D H Y V AA
V K _ A W AA K G N E A E G L S AA P D J K V AA
Total residues: 25
Identical residues: 18
Gaps: 1
Percent identity: 72.00

Inserting a single gap raises the identity from 16% to 72%.

Scoring System
Proteins can differ even between closely related organisms. Some substitutions are more frequent than others. Chemically similar amino acids can replace one another without severely affecting the protein's function and structure.

Matrices formed to score alignments:
Sparse matrices: based on identical residue matching only.
Problems faced:
1. Diagnostic power is relatively poor, as all identical matches carry equal weighting.
2. Mathematically significant but biologically insignificant.

To solve this problem, scoring matrices have been devised that weight matches between non-identical residues according to observed substitution rates across large evolutionary distances. These scoring matrices are mathematically insignificant but biologically significant, especially for aligning sequences of very low identity.
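The difference between counting only identical residues and also crediting chemically similar ones can be sketched in a few lines. This is an illustrative sketch only: the function name `alignment_stats`, the similarity groups in `SIMILAR_GROUPS`, and the choice to divide by the alignment length are assumptions, not definitions from the slides. Real tools derive similarity from a substitution matrix rather than from fixed groups.

```python
# Hypothetical similarity groups, chosen only for illustration.
SIMILAR_GROUPS = [set("DE"), set("KR"), set("ILVM"),
                  set("FYW"), set("ST"), set("NQ")]


def alignment_stats(probe, subject, gap_chars="-_"):
    """Count identical, similar and gap columns in a pairwise alignment.

    Both strings must already be aligned (same length, gaps included).
    Percentages use the alignment length as denominator (an assumption).
    """
    if len(probe) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = gaps = 0
    for p, s in zip(probe, subject):
        if p in gap_chars or s in gap_chars:
            gaps += 1
        elif p == s:
            identical += 1
        elif any(p in g and s in g for g in SIMILAR_GROUPS):
            similar += 1
    n = len(probe)
    return {
        "length": n,
        "identical": identical,
        "similar": similar,
        "gaps": gaps,
        "percent_identity": 100 * identical / n,
        "percent_similarity": 100 * (identical + similar) / n,
    }


if __name__ == "__main__":
    stats = alignment_stats("VLSPADKTNVKAAWGKVGAHAGYEG",
                            "VLSEGEWQLVLHVWAKVEADVAGHG")
    print(stats)
```

With these assumed groups, the 25-residue example pair gives 9 identical and 1 similar position (the D/E pair), that is 36% identity and 40% similarity.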
Percent Accepted Mutation (PAM or Dayhoff) Matrices
• Similar sequences are organized into phylogenetic trees.
• The number of amino acid changes is counted.
• Relative mutabilities are evaluated.
• A 20 x 20 amino acid substitution matrix is calculated.
• PAM 1: 1 accepted mutation event per 100 amino acids; PAM 250: 250 mutation events per 100 …
• The PAM 1 matrix can be multiplied by itself N times to give transition matrices for sequences that have undergone N accepted mutations per 100 amino acids.
• Derived from global alignments of closely related sequences.
• Matrices for greater evolutionary distances are extrapolated from those for lesser ones.
• The number attached to the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers mean greater distances.
• Does not take into account the different evolutionary rates of conserved and non-conserved regions.

[Figures: PAM 1 and PAM 250 matrices]

Scoring:

AKWTNLK----WAKV-ADVAGH-G
AK-TNVKAKLPWGKVGGHVAGEYG

The score of the alignment in this system is:
  matrix value at (A,A) + (K,K) + (T,T) + (K,K) + (W,W) + (A,G) + …
  - (penalty for gap insertion/deletion) * (number of gaps)
  - (penalty for gap extension) * (total length of all gaps)

BLOSUM Matrices
• Henikoff, S. & Henikoff, J.G. (1992)
• Use blocks of protein sequence fragments from different families (the BLOCKS database).
• Amino acid pair frequencies are calculated by summing over all possible pairs in each block.
• Different evolutionary distances are incorporated into this scheme with a clustering procedure (identity over a particular threshold = same cluster).
• Target frequencies are identified directly instead of by extrapolation.
• Sequences that are more than x% identical within the block where substitutions are being counted are grouped together and treated as a single sequence:
  - BLOSUM 50: >= 50% identity
  - BLOSUM 62: >= 62% identity

BLOSUM

A   4
B  -2   6
C   0  -3   9
D  -2   6  -3   6
E  -1   2  -4   2   5
F  -2  -3  -2  -3  -3   6
G   0  -1  -3  -1  -2  -3   6
H  -2  -1  -3  -1   0  -1  -2   8
I  -1  -3  -1  -3  -3   0  -4  -3   4
K  -1  -1  -3  -1   1  -3  -2  -1  -3   5
L  -1  -4  -1  -4  -3   0  -4  -3   2  -2   4
M  -1  -3  -1  -3  -2   0  -3  -2   1  -1   2   5
N  -2   1  -3   1   0  -3   0   1  -3   0  -3  -2   6
P  -1  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7
Q  -1   0  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5
R  -1  -2  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5
S   1   0  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4
T   0  -1  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5
V   0  -3  -1  -3  -2  -1  -3  -3   3  -2   1   1  -3  -2  -2  -3  -2   0   4
W  -3  -4  -2  -4  -3   1  -2  -2  -3  -3  -2  -1  -4  -4  -2  -3  -3  -2  -3  11
X  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
Y  -2  -3  -2  -3  -2   3  -3   2  -1  -2  -1  -1  -2  -3  -1  -2  -2  -2  -1   2  -1   7
Z  -1   2  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -1  -2   5
    A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z

Rules of thumb
Lower PAMs and higher BLOSUMs find short local alignments of highly similar sequences.
Higher PAMs and lower BLOSUMs find longer, weaker local alignments.

PAM vs. BLOSUM
Based on the basic assumptions and the construction of each matrix:
The PAM model is designed to track the evolutionary origin of proteins.
The BLOSUM model is designed to find conserved domains of proteins.

Protein Structure
• Primary structure: the linear sequence of amino acids in a protein molecule.
• Secondary structure: regions of local regularity within a protein fold (α helices, β strands, turns, etc.).
• Supersecondary structure: the arrangement of α helices and/or β strands into discrete folding units (β-barrels, βαβ units, Greek key motifs, etc.).
• Tertiary structure: the overall fold of a protein sequence, formed by packing of its secondary and/or supersecondary structure elements.
• Quaternary structure: the arrangement of separate protein chains in a protein molecule.

From the Primary Sequence to Protein Properties
• Predicting protein localization or secretory nature from the presence of a signal peptide and localization signals.
• Transmembrane helix prediction to identify membrane proteins.
• Calculation of physicochemical properties: pI, molecular weight.
• Identification of coiled-coil regions.
• Post-translational modification prediction.
www.expasy.org

Kyte-Doolittle hydrophobicity plot
• Based on the nature of the amino acids: hydrophilic or hydrophobic.
• A window of 9-20 amino acids is taken.
• A value greater than 0 means hydrophobic.

From Sequence to Structure
• Secondary structure prediction: GOR, PredictProtein, nnpredict
• Domain prediction: SBASE, ProDom

Importance of protein secondary structure prediction

Basis of secondary structure prediction
• Conservation in the multiple sequence alignment.
• Hidden Markov models and neural networks.
• 70-80% accuracy is achieved.

Methods used and key features of secondary structure prediction
• Chou-Fasman algorithm
• GOR
• Multiple Sequence

Some sites
• Predator
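The sliding-window idea behind the Kyte-Doolittle hydrophobicity plot described above can be sketched as follows. This is a minimal illustration: the function name `hydropathy_profile` and the default window of 9 are assumptions (the slides give a range of 9-20), the per-residue values are the published Kyte-Doolittle hydropathy scale, and non-standard residue letters are not handled.

```python
# Kyte-Doolittle hydropathy scale (positive = hydrophobic).
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=9):
    """Average hydropathy over each window of `window` residues.

    Returns one value per window position (len(seq) - window + 1 values).
    Raises KeyError for letters outside the 20 standard amino acids.
    """
    seq = seq.upper()
    if not 1 <= window <= len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KD_SCALE[aa] for aa in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


if __name__ == "__main__":
    # Hypothetical toy sequence: a hydrophobic stretch flanked by charged residues.
    profile = hydropathy_profile("DDKKEELLIIVVLLAAIIKKDDEE", window=9)
    for position, value in enumerate(profile, start=1):
        label = "hydrophobic" if value > 0 else "hydrophilic"
        print(position, round(value, 2), label)
```

Windows whose average is above 0 are read as hydrophobic, matching the rule stated above; longer windows smooth the profile and are often preferred when looking for membrane-spanning stretches.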