* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Document
Western blot wikipedia , lookup
G protein–coupled receptor wikipedia , lookup
Multilocus sequence typing wikipedia , lookup
Community fingerprinting wikipedia , lookup
Metalloprotein wikipedia , lookup
Magnesium transporter wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Genetic code wikipedia , lookup
Proteolysis wikipedia , lookup
Point mutation wikipedia , lookup
Protein–protein interaction wikipedia , lookup
Ancestral sequence reconstruction wikipedia , lookup
Protein structure prediction wikipedia , lookup
Protein signatures, classification and functional analysis 1 Menu • Introduction: some definitions • How to model domains ? – Pattern – Profile – HMM • Domain/family databases (InterPro…) Protein domain/family: some definitions • Most proteins have « modular » conserved structures • Estimation: ~ 3 domains / protein -> Prediction of domain content of a unkown protein sequence may help to find a ‘function’ …Estimation: ~ 80% of protein have at least a ‘known’ domain Number of domains per protein ~100 protein sequences with 50 domains http://prodom.prabi.fr/prodom/current/archives/2006.1/stat.html CSA_PPIASE Cys 181: active site residue Binding cleft (motif) Example of conserved regions (PPID family) - 1 CSA_PPIASE (cyclophilin-type peptydil-prolyl cis-trans isomerase) (domain) - 3 TPR repeats (tetratrico peptide repeat). - 1 active site - Binding cleft (motif) InterPro scan results ? General definitions of conserved sequence signatures Conserved regions in biological sequences can be classified into 5 different groups: • Domains: specific combination of secondary structures organized into a characteristic three dimensional structure or fold. • Families: groups of proteins that have the same domain arrangement or that are conserved along the whole sequence. • Repeats: structural units always found in two or more copies that assemble in a specific fold. Assemblies of repeats might also be thought of as domains. • Motifs: region of domains containing conserved active or binding residues, or short conserved regions present outside domains that may adopt folded conformation only in association with their binding ligands. • Sites: functional residues (active sites, disulfide bridges, post-translation modified residues). CSA_PPIASE Cys 181: active site residue Binding cleft (motif) Example of conserved regions (PPID family) - 1 CSA_PPIASE (cyclophilin-type peptydil-prolyl cis-trans isomerase) (domain) - 3 TPR repeats (tetratrico peptide repeat). - 1 active site - Binding cleft (motif) What makes Bee special? Measures of Conservation • Identity: Proportion of pairs of identical residues between two aligned sequences. Generally expressed as a percentage. This value depends on how the two sequences are aligned. • Similarity: Proportion of pairs of similar residues between two aligned sequences. If two residues are similar can determined by a substitution matrix (e.g. BLOSUM62). This value depends strongly on the scoring system used. • !!! But not Homology: Two sequences are homologous if and only if they have a common ancestor. This is not a measure of conservation and there is no percentage of homology! (It's either yes or no). Homologous sequences do not necessarily serve the same function, nor are they always highly similar: structure may be conserved while sequence is not. How to measure ‘conservation’ ? Pairwise vs multiple sequence alignments Blast vs modelled MSA Detect conservation using pairwise alignments A popular way to identify similarities between proteins is to perform a pairwise alignment (Blast, Fasta). When the identity is higher than 40% this method gives good results. However, the weakness of the pairwise alignment is that no distinction is made between an amino acid at a crucial position (like an active site) and an amino acid with no critical role (not enough information). Domain Family databases Murcia 2011 13 Pairwise alignment Detect conservation using MSA • A multiple sequence alignment (MSA) gives a more general view of a conserved region by providing a better picture of the most conserved residues, which are usually essential for the protein function. • MSA contains higher information content than pairwise alignments How to use MSA to look for conservation ? -> 1- Model MSA using various methods -> 2- ‘Align’ the model with your sequence (InterPro scan…) Methods to Build Models of MSA • Consensus: – Consensus, Patterns • Profile: – – – – Position Speficic Scoring Matrices (PSSMs), Generalized Profiles, Hidden Markov Models (HMMs), PSI-BLAST. …pattern or PSSM/profile specific is called descriptor, descriptor motif, discriminator or predictor Domain Family databases Murcia 2011 19 Why do we need models of MSA? Why do we need classifiers ? • to resume in a single “descriptor" the differences and similarities observed in each column of the MSA; • to use the model/descriptor to search for similar sequences; • to classify similar sequences; • to align correctly important residues and detect variations in active sites and other important regions of one protein (i.e. SNP); • to build databases of models/descriptors which can be used to annotate new proteomes… • MSA models are more sensitive than Blast (pairwise alignment) • … Consensus - pattern Consensus Sequences • Useful to detect protein belonging to a specific family or a protein domain; much less useful at the DNA level due to the small alphabet (4 letters) and the low sequence conservation of DNA sequence elements (except for the detection of enzyme restriction sites). • Patterns do not attempt to describe a complete domain or protein family, but simply try to identify the most important residue combinations, such as the catalytic site of an enzyme. • They focus on the most highly conserved residues in a protein family (motifs, sites). Domain Family databases Murcia 2011 22 Use of pattern • Patterns are used to describe small functional regions: – – – – – Enzyme catalytic sites; Prosthetic group attachment sites (heme, PLP, biotin, etc.); Amino acids involved in binding a metal ion; Cysteines involved in disulfide bonds; Regions involved in binding a molecule (ATP, calcium, DNA etc.) or a protein. – N-glycosylation sites Domain Family databases Murcia 2011 23 How to Build a PROSITE Pattern • Start with a multiple sequence alignment (MSA) Domain Family databases Murcia 2011 24 Consensus Sequences: PROSITE Patterns syntax The PROSITE patterns are described using the following conventions: ex: <M-R-[DE]-x(2,4)-[ALT]-{AM} 1. The standard IUPAC one-letter codes for the amino acids are used. 2. The symbol `x' is used for a position where any amino acid is accepted. 3. Ambiguities are indicated by listing the acceptable amino acids for a given position, between square parentheses `[ ]'. For example: [ALT] stands for Ala or Leu or Thr. 4. Ambiguities are also indicated by listing between a pair of curly brackets `{ }' the amino acids that are not accepted at a given position. For example:{AM} stands for any amino acid except Ala and Met. 5. Each element in a pattern is separated from its neighbor by a ‘-’. 6. Repetition of an element of the pattern can be indicated by following that element with a numerical value or, if it is a gap ('x'), by a numerical range between parentheses. Examples: x(3) corresponds to x-x-x x(2,4) corresponds to x-x or x-x-x or x-x-x-x A(3) corresponds to A-A-A Note: You can only use a range with 'x', i.e. A(2,4) is not a valid pattern element. 7. When a pattern is restricted to either the N- or C-terminal of a sequence, that pattern either starts with a `<' symbol or respectively ends with a `>' symbol. Domain Family databases Murcia 2011 25 You can also automatically build a pattern (from MSA) by using Pratt or Splash softwares: http://www.expasy.org/tools/pratt/ http://www.research.ibm.com/splash/ Automatic discovered patterns are usually different from those designed by a human expert with knowledge of the biochemical literature • http://www.expasy.org/tools/scanpro site/ http://www.expasy.org/tools/scanprosite/ Advantage and Limitation of PROSITE Patterns • Advantages: – efficient for the identification of sites or short motifs. – Intelligible to any user, you don’t need to be an expert in bioinformatic to read or build a consensus sequence. • Limitation: – The regular expression syntax is too rigid to represent highly divergent domains. (one mismatch is enough to eliminate a match). Domain Family databases Murcia 2011 29 PSSM Profile specific scoring matrix Position Specific Scoring Matrix (PSSM) • A PSSM or a profile is based on the frequencies of each residue at a specific position in a MSA. • The MSA is converted into a matrix where a score is given to each amino acid at each position of the MSA according to the observed frequency (positive scores for expected amino acids and negative scores for unexpected ones). Domain Family databases Murcia 2011 32 Construction of a PSSM 1: weight sequences of the MSA (i.e. algorithms based on phylogenetic tree) 2: count the number of occurrence of the different amino acids (or bases) at each position of the alignment 3: derivation of the preliminary matrix (calculate the frequency) 4: correction of the sample bias (use substitution matrix (PAM, Blosum etc.) In proteins some mismatches are more acceptable than others. Domain Family databases Murcia 2011 33 Profiles Sequence alignment Sequence 1: Sequence 2: Sequence 3: Sequence 4: Sequence 5: Sequence 6: Sequence 7: Profile (or weight matrix) (residue frequency at each position in alignment) Profiles Sequence alignment Sequence 1: Sequence 2: Sequence 3: Sequence 4: Sequence 5: Sequence 6: Sequence 7: F most frequent Profile (or weight matrix) (residue frequency at each position in alignment) Phenylalanine has highest score Profiles Sequence alignment Sequence 1: Sequence 2: Sequence 3: Sequence 4: Sequence 5: Sequence 6: Sequence 7: L and Y equal frequency Profile (or weight matrix) (residue frequency at each position in alignment) Different scores Profiles Sequence alignment Sequence 1: Sequence 2: Sequence 3: Sequence 4: Sequence 5: Sequence 6: Sequence 7: L and Y equal frequency Profile (or weight matrix) (residue frequency at each position in alignment) Leucine is aliphatic (dissimilar from F) Tyrosine and phenylalanine both aromatic (similar) Profiles Sequence 1: Sequence 2: Sequence 3: Sequence 4: Sequence 5: Sequence 6: Sequence 7: Profiles score frequency • Highest frequency aa highest score • Lower frequency aa lower score • Similar aa not in alignment even lower score • Dissimilar aa not in alignment very low score ** In a pattern would be [FLY] equal frequency Search a Database With a PSSM • The sequence (MCFVNRFYSFCMP) is ‘aligned’ to the PSSM: M C F V N R F Y S F C M P A 1 2 3 4 5 6 7 C D E F G H I K L M N P Q R S T V W Y 12,-41,-20, 5,-25,-42,-18,-18, 33,-12,-12,-19,-41, 42, 9, 2, 9, 16,-61,-11; -23,-54, -5,-24,-37,-19,-45, -3, 7,-35,-38, 59,-41,-12,-42, 10, 65,-17,-68,-15; -13,-62,-14, 4,-53, 78,-36,-65,-15,-64,-49,-14,-48, 9, 5,-10,-11,-63,-61,-42; -36,-68,-63,-36, 60,-63,-38,-14,-47, 3,-21,-52,-53,-34,-58,-39,-45,-26,138, 36; -22,-60,-54,-24, 6,-43, 0, 30, 13, 0,-22,-27,-59, 55, -9,-38,-11, 37,-57, 12; -35,-46,-18, 14, -9,-51,-12,-19, 34,-39,-28, 36,-45, 44, -9, -3, 41,-27,-24, 17; -33,-58, 37, -6,-16,-39,-21, 61,-23, -1,-28, -6,-58,-17,-54,-20, -9, 14,-12, 11; Searching algorithm: sliding windows. At each position of the sliding window the score is obtained by summing the score of all columns Best score: 16+59+5+60+12-3-16=133 Domain Family databases Murcia 2011 39 Avantages and limitations of PSSMs • Advantages: – The score produced permits to estimate the quality of the match produced. – The method is relatively fast and simple to implement • Limitations: – Indels are forbidden: long region can not be implement. PSSM: Fingerprints • To overcome the gap limitation of PSSMs, two or more PSSMs can be used to describe long regions. The combination of various PSSMs is called ‘fingerprints’ • PRINTS database is a collection o annotated fingerprints (usefull to define sub-families) Generalized profiles • A generalized profile is an extension of the PSSM, in which we introduce position specific deletion and insertion penalties. Generalized Profiles The following information is stored in any generalized profile: • Each position is called a match state. A score for every residue is defined at every match states (M), just as in the PSSM. • Each match state can be ommitted in the alignment, by what is called a deletion state (D) and receives a positiondependent penalty. • Insertion of variable lenght are possible between any two adjacent match (or deletion) states. These insertion states (I) are given a position-dependent penalty that might also depend upon the inserted residues. • A couple of additional parameters allow to adapt the behaviour of the profile on its extremities which can force to match the whole domain or produce partial matches. Domain Family databases Murcia 2011 44 Example of a Generalized Profile ID AC DT DE MA MA . . . MA MA MA MA MA MA MA MA MA MA MA . . . // ZF_RING_2; MATRIX. PS50089; DEC-2001 (CREATED); DEC-2001 (DATA UPDATE); DEC-2004 (INFO UPDATE). Zinc finger RING-type profile. /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=43; /DISJOINT: DEFINITION=PROTECT; N1=5; N2=39; /DEFAULT: D=-20; I=-20; B1=0; E1=-10; MI=-105; MD=-105; IM=-105; DM=-105; M0=-5; /I: B1=0; BI=-105; BD=-105; /M: SY='C'; M=-10,-20,119,10,0,-20,-30,10,-30,-30,-20,-20,-20,-40,-30,-30,-10,-10,-10,-50,-30,-30; /M: SY='P'; M=-1,-9,-21,-10,-4,-17,-14,-10,-11,-8,-14,-8,-6,4,-5,-10,0,-1,-10,-27,-14,-6; /M: SY='I'; M=-7,-27,-24,-32,-25,-1,-32,-25,32,-22,16,15,-21,-23,-19,-21,-17,-7,25,-21,-3,-24; D=-3; /I: I=-3; DM=-16; /M: SY='C'; M=-10,-20,119,-30,-30,-20,-30,-30,-30,-30,-20,-20,-19,-40,-30,-30,-10,-10,-10,-50,-30,30; /M: SY='L'; M=-10,-12,-17,-14,-9,-1,-19,-7,-7,-9,2,2,-11,-21,-8,-7,-12,-8,-7,-17,1,-9; /M: SY='E'; M=-8,9,-22,12,17,-24,-13,-3,-23,2,-20,-15,5,-11,6,-2,3,-2,-19,-29,-15,11; /M: SY='E'; M=-7,-4,-23,-4,1,-16,-17,-8,-12,-2,-12,-8,-2,-5,-3,-3,-3,-2,-11,-25,-10,-2; /M: SY='F'; M=-10,-19,-24,-21,-13,7,-24,-11,4,-15,6,7,-16,-13,-12,-13,-15,-9,-2,-12,6,-13; Domain Family databases Murcia 2011 46 Align the generalized profile with a sequence…. (Dynamic programming, ~Smith Waterman algorithm) a sequence Algorithm and Software to buid and use Generalized Profiles • Pftools is a package to perform the different steps of the construction of a profile and to search a database of protein (or DNA) with a profile. – http://www.isrec.isb-sib.ch/ftp-server/pftools • Searching algorithm: dynamic programming (similar to SmithWaterman algorithm). -> guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme) Domain Family databases Murcia 2011 49 • http://www.expasy.org/tools/scanpro site/ http://www.expasy.org/tools/scanprosite/ Statistical Significance of Sequence Similarities • • • Each method (except patterns) gives a score of similarity between the query sequence and the subject sequence or the method. Ones need to estimate if this raw score can occure by chance. This is done by the E-value or expected value The E-value is the number of matches with a score equal to or greater than the observed score that are expected to occur by chance. An E-value of 1 is considered not to be significant. An E-value of 0.1 possibly to be significant. An E-value of 0.01 most likely to be significant. • Pitfall: The E-value depends on the size of the searched database, as the number of false positives expected above a given score threshold usually increases proportionally with the size of the database. Domain Family databases Murcia 2011 51 Advantage and Limitation of Generalized Profiles • Strenghs: – Very sensitive to detect similarities (close to the twilight zone). – Good scoring system. • Weaknesses: – Require some expertise to use efficiently. – Very CPU expensive. Domain Family databases Murcia 2011 52 HMM Generalized Profiles can be represented in a probabilistic framework named Hidden Markov Models (HMMs). HMM profiles • Each position in an HMM consists of a Match, Insert and Deletion state • Parameters describing a HMM profile: – Emission probability: the probability of emitting an amino acid ‘x’ being in state q (Amino acid emission probabilities are evaluated from observed frequencies as for PSSM). – Transition probability: 3 states: Match (M), Deletion (D), Insertion (I). Transitions: M->I, M->D, I->M, I->D … Transition probabilities are evaluated from observed transition frequencies. Domain Family databases Murcia 2011 56 Hidden Markov Models (HMM) I2 I1 M1 I3 I4 I5 I6 I7 I8 I9 M2 M3 M4 M5 M6 M7 M8 M9 D2 D3 D4 D5 D6 D7 D8 D9 M10 M = match state I = insert state D = delete state Each position in an HMM consists of a Match, Insert and Deletion state HMMER HMM Profile NAME ig ACC PF00047.15 LENG 65 GA 25.1 13.4 TC 25.1 13.4 NC 25.0 25.0 XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4 NULT -4 -8455 NULE 595 -1558 85 338 -294 453 -1158 197 249 902 -1085 EVD -28.914425 0.238245 HMM A C D E F G H I K L M m->m m->i m->d i->m i->i d->m d->d b->m m->e -16 * -6461 M 1 -2647 -5115 -567 223 -5436 3047 164 -5186 -1236 -2912 -4204 I -149 -500 233 43 -381 399 106 -626 210 -466 -720 -1 -11609 -12651 -894 -1115 -701 -1378 -16 * 2 -972 -498 831 1649 -5434 884 766 -2367 62 -5129 -1 -149 -500 233 43 -381 399 106 -626 210 -466 -720 -1 -11609 -12651 -894 -1115 -701 -1378 * * 3 -1011 -5113 411 -343 -1695 -2365 989 -5184 60 -699 50 -149 -500 233 43 -381 399 106 -626 210 -466 -720 -1 -11609 -12651 -894 -1115 -701 -1378 * * Domain Family databases -142 -21 -313 N P Q 524 275 -2643 394 765 275 -278 275 45 531 201 384 R S T V W Y -554 45 -178 96 -319 359 -622 117 -4737 -369 -5298 -294 -4615 -249 1 -1114 394 1363 45 -2178 96 904 359 -1046 117 -4735 -369 -5296 -294 -1449 -249 2 850 394 -148 45 400 96 1625 359 1230 117 -856 -369 -5296 -294 -4613 -249 3 Murcia 2011 -1998 58 -644 HMM Profile softwares • HMMER is a package to build and use HMMs (http://hmmer.janelia.org/) Used by Pfam, SMART and TIGRfam databases. • SAM is a similar package (http://www.cse.ucsc.edu/research/compbio/sam.html). Used by SCOP superfamily and gene3D. Domain Family databases Murcia 2011 59 Advantage and Limitation of HMM Profiles • Advantage: Solid theoretical basis: more efficient than generalized profile to estimate insertion and deletion penalties. Other advantages and limitations just like generalized profiles. Domain Family databases Murcia 2011 60 Generalized Profiles and HMM Profiles • The format of generalized profiles is equivalent to the one of HMM profiles. • It is easy to convert a generalized profile in a HMM profile without loosing information: – htop program: convert a HMM profile (HMMER) in generalized profile. – ptoh program: convert a generalized profile in HMM profile (HMMER). Domain Family databases Murcia 2011 61 Domain/Family databases MSA models are stored in databases (Prosite, PRINTS, Pfam …and InterPro…) Signatures Methods • Pattern • Fingerprint • Sequence clustering • Profile • HMM InterPro scan results ? Part of the protein sequence wich has been ‘recognized’ by different modelled MSA protein folding InterPro hits InterPro domain architecture PROSITE • PROSITE is a database containing patterns and generalized profiles. • http://www.expasy.org/prosite • Contains ~1300 patterns and ~1000 generalized profiles. • Good documentation. • PROSITE is also use to annotate UniProtKB/Swiss-Prot. Domain Family databases Murcia 2011 67 PROSITE Documentation Page Domain Family databases Murcia 2011 68 PROSITE Pattern Page ID AC DT DE PA NR NR NR CC CC DR DR DR DR DR ... DR DR DR 3D DO // ZF_RING_1; PATTERN. PS00518; DEC-1991 (CREATED); JUN-1994 (DATA UPDATE); DEC-2005 (INFO UPDATE). Zinc finger RING-type signature. C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA]. /RELEASE=48.7,204086; /TOTAL=354(354); /POSITIVE=352(352); /UNKNOWN=0(0); /FALSE_POS=2(2); /FALSE_NEG=375; /PARTIAL=2; /TAXO-RANGE=??E?V; /MAX-REPEAT=1; /VERSION=1; Q02084, A33_PLEWA , T; Q09654, ARD1_CAEEL , T; P36406, ARD1_HUMAN , Q8BGX0, ARD1_MOUSE , T; P36407, ARD1_RAT , T; O76924, ARI2_DROME , O95376, ARI2_HUMAN , T; Q9Z1K6, ARI2_MOUSE , T; Q99728, BARD1_HUMAN, O70445, BARD1_MOUSE, T; Q9QZH2, BARD1_RAT , T; Q9NZS9, BFAR_HUMAN , Q8R079, BFAR_MOUSE , T; Q5PQN2, BFAR_RAT , T; Q96CA5, BIRC7_HUMAN, P18541, ZNFP_LYCVA , N; Q88470, ZNFP_TACV , N; Q6UY11, EGFL9_HUMAN, F; 1BOR; 1CHC; 1FBV; 1G25; PDOC00449; T; T; T; T; T; P19326, ZNFP_LYCVP , N; P19325, ZNFP_LYCVT , N; Q8NEG5, ZSWM2_HUMAN, N; Q9D9X6, ZSWM2_MOUSE, N; P30735, VE6_MNPV , F; 1JM7; 1RMD; Domain Family databases Murcia 2011 69 PROSITE profile Page Domain Family databases Murcia 2011 70 Scanprosite Web Page Domain Family databases Murcia 2011 71 Scan Prosite Output The PROSITE database is now complemented by a series of rules that can give more precise information about specific residues. Domain Family databases Murcia 2011 72 ProRule Domain Family databases Murcia 2011 73 Pfam • The largest collection of curated domains and families (~10000). • Very good descriptors (Few false positives and false negatives). • But ~3000 motifs have less than 10 matches on UniProtKB. • Uses HMM profiles (HMMER3). • http://pfam.sanger.ac.uk/ Domain Family databases Murcia 2011 74 Pfam entry page Domain Family databases Murcia 2011 75 SMART • ~ 800 descriptors. • Concentrates on large domain families and the identification of new domains. • Uses HMM profiles (HMMER2). • Weak annotation. • Good tools for genomic analysis. • http://smart.embl.de/smart/set_mode.cgi?NORMAL=1 Domain Family databases Murcia 2011 76 SMART homepage Domain Family databases Murcia 2011 77 ProDom • ProDom is a database of protein domain families generated automatically from the global comparison of all available protein sequences (last release in 2008 !!). • Descriptors are built with PSI-BLAST • No annotation • http://prodom.prabi.fr/prodom/current/html/home.php • Used to defined new pfam families Domain Family databases Murcia 2011 78 Family databases: PRINTS • Fingerprints are combination of ungapped PSSM. As gaps are not allowed they are usually directed against well conserved short motifs. • The PRINTS database is specialised in subfamily classification. (The GPCR family was divided in more than 100 sub-families) • • Contains 12’000 motifs. http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php Domain Family databases Murcia 2011 79 PRINTS homepage Domain Family databases Murcia 2011 80 Other family databases • PANTHER: was developed to annotate the human genome. Contains a lot of models for mammalian proteins, but very few for plant, fungi or bacteria. Family/subfamily classification, more than 5000 families and 25 000 subfamilies. Automatically generated. http://www.pantherdb.org • PIRSF: good annotation for functional residues. ~30000 automatically generated HMM profiles. http://pir.georgetown.edu/pirsf/ • TIGRFAM only for prokaryotic proteins. 3500 HMM profiles http://www.tigr.org/TIGRFAMs/ Domain Family databases Murcia 2011 81 Scop superfamily and CATH • Scop Superfamily and CATH are structural domain database using HMM profiles. Hierarchical classification of domains. • Use HMM profiles (SAM). • Domain boundaries are semi-automatically extracted. • Very sensitive methods (often more matches for a given domain than Pfam or PROSITE). • Usefull for structure prediction but dangerous for functional prediction. Tends to group structurally related domains but with no functional relationship. (ex: tpr repeat: only alpha helices. SCOP or CATH tpr repeat profiles picked-up a lot of conserved regions rich in alpha helices but not evolutively link to tpr) • • http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/ http://www.cathdb.info/ Domain Family databases Murcia 2011 82 InterPro integrates MSA models from various databases and organize them and their annotation so relationships emerge. InterPro • Interpro is an attempt to group a number of protein databases:Pfam, PROSITE, PRINTS, ProDom, SMART TIGRFAM, SCOP superfamily, Gene3D. • http://www.ebi.ac.uk/interpro • InterPro tries to have and maintain a high quality annotation. • The database and a stand-alone package are available to locally run a complete InterPro analysis. • ftp://ftp.ebi.ac.uk/pub/databases/interpro/ Domain Family databases Murcia 2011 84 InterProScan Domain Family databases Murcia 2011 85 InterProScan Output Domain Family databases Murcia 2011 86 InterPro protein coverage 96.0% of UniProtKB/SwissProt 78.6% of UniProtKB/TrEMBL Protein Sequence Databases Murcia, Protein Sequence Databases Murcia, Protein Sequence Databases Murcia, Never forget that: • The computational sequence analysis tools are naïve about real biology and the complex relationships between molecular elements and proteins. • Therefore we should be critical about what we can achieve with such computational sequence analysis tools. • So, again, be critical… and understand the biology. Many thanks to • Lorenzo Cerruti • Nicolas Hulo • Jennifer McDonald • And you ! Further Reading • Durbin, Eddy, Mitchison, Krog. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic acids. Cambridge University Press, 1998. • Attwood TK, Parry-Smith DJ. Introduction to bioinformatics. Addison Wesley Longman Limited, 1999 • Krogh A, Brown M, Mian IS, Sjolander K, Haussler D. Hidden Markov models in computational biology. Applications to protein modeling. J Mol Biol. 1994 Feb 4;235(5):1501-31. • Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997 Sep 1;25(17):3389-402. Domain Family databases Murcia 2011 94