Tutorial: Sequence-Based Analysis
Sequence Based Analysis Tutorial
NIH Proteomics Workshop
Lai-Su L. Yeh, Ph.D.
Protein Information Resource at Georgetown University Medical Center

Retrieval, Sequence Search & Classification Methods
• Retrieval: Retrieve protein info by text / UID
• Sequence Similarity Search: BLAST, FASTA, Dynamic Programming
• Family Classification: Patterns, Profiles, Hidden Markov Models, Sequence Alignments, Neural Networks
• Integrated Search and Classification System

Sequence Similarity Search (I)
Based on Pair-Wise Comparisons
• Dynamic Programming Algorithms
  - Global Similarity: Needleman-Wunsch
  - Local Similarity: Smith-Waterman
• Heuristic Algorithms
  - FASTA: Based on K-Tuples (2-Amino Acid)
  - BLAST: Triples of Conserved Amino Acids
  - Gapped-BLAST: Allow Gaps in Segment Pairs
  - PHI-BLAST: Pattern-Hit Initiated Search
  - PSI-BLAST: Position-Specific Iterated Search

Sequence Similarity Search (II)
• Similarity Search Parameters
  - Scoring Matrices: Based on Conserved Amino Acid Substitution
    • Dayhoff Mutation Matrix, e.g., PAM250 (~20% Identity)
    • Henikoff Matrix from Ungapped Alignments, e.g., BLOSUM 62
  - Gap Penalty
• Search Time Comparisons
  - Smith-Waterman: 10 Min
  - FASTA: 2 Min
  - BLAST: 20 Sec

Feature Representation
• Features of Amino Acids: Physicochemical Properties, Context (Local & Global) Features, Evolutionary Features
• Alternative Amino Acids: Classification of Amino Acids To Capture Different Features of Amino Acid Residues

Alphabet         Size  Features                Membership
AA Identity      20    Sequence Identity       A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y
Exchange Group   6     Evolution/Substitution  {HRK}{DENQ}{C}{STPAG}{MILV}{FYW}
Charge/Polarity  4     Charge and Polarity     {HRK} {DE} {CTSGNQY} {PMLIVFW}
Hydrophobicity   3     Hydrophobicity          {DENQRK} {CSTPGHY} {AMILVFW}
Structural       3     Surface Exposure        {DENQHRK} {CSTPAGWY} {MILVF}
2D Propensity    3     Secondary Structure     {AEQHKMLR} {CTIVFYW} {SGPDN}

Substitution Matrix
• Likelihood of One Amino Acid Mutated into Another Over Evolutionary Time
• Negative Score: Unlikely to Happen (e.g., Gly/Trp, -7)
• Positive Score: Conservative Substitution (e.g., Lys/Arg, +3)
• High Score for Identical Matches: Rare Amino Acids (e.g., Trp, Cys)

BLAST
• BLAST (Basic Local Alignment Search Tool)
• Extremely fast
• Robust
• Most frequently used
• It finds very short segment pairs (“seeds”) between the query and the database sequence
• These seeds are then extended in both directions until the maximum possible score for extensions of this particular seed is reached

BLAST Search
• From BLAST Search Interface
• Table-Format Result with BLAST Output and SSEARCH (Smith-Waterman) Pair-Wise Alignment
• Links to iProClass and UniProtKB reports
• Link to NCBI taxonomy
• Link to PIRSF report
• Click to see SSearch alignment
• Click to see alignment

Blast Result & Pairwise Alignment
• BLAST Alignment

Classification
• What is classification?
• Why do we need protein classification?
• Different levels of classification
• Basis for functional protein classification
• How to classify a protein of unknown function?

Classification Databases
• Protein motif: C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H (the 2 C's and the 2 H's are zinc ligands)
• Protein domain: Group proteins according to the presence of a common domain
• 3-D structure: Group proteins according to common 3D structure
• Whole-protein: Group proteins according to common domain architecture and length

Family Classification Methods
Based on Other Classification Information
• Multiple Sequence Alignment (ClustalW)
• ProSite Pattern Search
• Profile Search
• Hidden Markov Models (HMMs): Domain (Pfam); Whole protein (PIRSF)
• Neural Networks

How do you build a tree?
• Pick sequences to align
• Align them
• Verify the alignment
• Keep the parts that are aligned correctly
• Build and evaluate a phylogenetic tree
Integrated Analysis

Multiple Sequence Alignment
ClustalW: Progressive Pairwise Approach
• Based on Exhaustive Pairwise Alignments
• Neighbor Joining
• Joining Order Corresponding to a Tree
• Alignment Varies Dependent on Joining Order

Multiple Alignment and Tree
• From Text/Sequence Search Result or ClustalW Alignment Interface

Motif Patterns (Regular Expressions)
Signature Patterns for Functional Motifs

PCM_AC   PCM00836
PCM_ID   ALADH_PNT_1; MOTIF
PS_DE    Alanine dehydrogenase & pyridine nucleotide transhydrogenase signature 1
PS_PA    G-[LIVM]-P-x-E-x(3)-N-E-x(1,3)-R-V-A-x-[ST]-P-x-[GST]-V-x(2)-L-x-[KRH]-x-G.
PROSITE  PS00836; PDOC00654
LENGTH   Conserve = 16aa; Maximum = 29aa; Minimum = 27aa
COUNT    PST = 5 (5); PSN = 2 (2); PCT = 2 (2); PCN = 3 (3);

Predicted
Entry               Code  Start  Motif region
NNTM_BOVIN+DEBOXM   PST   60     GVPKEIFQNEK--RVALSPAGVQALVKQG
+G02257             PCT   60     GVPKEIFQNEK--RVALSPAGVQNLVKQG
DHA_BACSH+A34261    PST   4      GIPKEIKNNEN--RVAMTPAGVVSLTHAG
DHA_BACST+B34261    PST   4      GIPKEIKNNEN--RVAITPAGVMTLVKAG
DHA_MYCTU+A43830    PST   4      GIPTETKNNEFQFRVAITPAGVAELTRRG
PNTA_ECOLI+DEECXA   PST   4      GIPRERLTNET--RVAATPKTVEQLLKLG

Not Predicted
Entry               Code  Start  Motif region
DHA_BACSU+A49337    PSN   4      GVPKEIKNNEN--RVALTPGGVSQLISNG
PNTA_HAEIN+E64119   PSN   4      GVPRELLENES--RVAATPKTVQQILKLG
+S74638             PCN   4      GVPKEIKDQEF--RVGLTPSSVRALLSQG
+S77433             PCN   23     GVPRESFDQEC--RVAMTPDTAQKLQKLG
+F64694             PCn   4      GLVKESMDLES--RVALVPDDVALIVQKG
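The PS_PA signature above is written in PROSITE pattern syntax, which maps closely onto ordinary regular expressions. The following is a minimal sketch of that translation, not the PIR or PROSITE implementation: the function name `prosite_to_regex` and the helper `has_motif` are hypothetical, and only the common syntax elements (x, [..], {..}, repeat counts, < and > anchors) are handled. The test sequences are the motif regions listed on the slide with gap characters removed.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; covers x, [ABC], {ABC},
    repeat counts such as x(3) or x(1,3), and the < / > anchors.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        m = re.fullmatch(r"(<?)([^()<>]+)(?:\((\d+)(?:,(\d+))?\))?(>?)", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":                      # x = any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # {ABC} = any residue except A, B, C
        # [ABC] classes and single letters are already valid regex syntax.
        if low is not None:                  # (n) or (n,m) repeat count
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + ("$" if end else ""))
    return "".join(parts)

def has_motif(pattern: str, aligned_sequence: str) -> bool:
    """Search one sequence for the pattern, ignoring alignment gap characters."""
    return re.search(prosite_to_regex(pattern), aligned_sequence.replace("-", "")) is not None

PS_PA = "G-[LIVM]-P-x-E-x(3)-N-E-x(1,3)-R-V-A-x-[ST]-P-x-[GST]-V-x(2)-L-x-[KRH]-x-G."

PREDICTED = [
    "GVPKEIFQNEK--RVALSPAGVQALVKQG",
    "GVPKEIFQNEK--RVALSPAGVQNLVKQG",
    "GIPKEIKNNEN--RVAMTPAGVVSLTHAG",
    "GIPKEIKNNEN--RVAITPAGVMTLVKAG",
    "GIPTETKNNEFQFRVAITPAGVAELTRRG",
    "GIPRERLTNET--RVAATPKTVEQLLKLG",
]
NOT_PREDICTED = [
    "GVPKEIKNNEN--RVALTPGGVSQLISNG",
    "GVPRELLENES--RVAATPKTVQQILKLG",
    "GVPKEIKDQEF--RVGLTPSSVRALLSQG",
    "GVPRESFDQEC--RVAMTPDTAQKLQKLG",
    "GLVKESMDLES--RVALVPDDVALIVQKG",
]

if __name__ == "__main__":
    print(prosite_to_regex(PS_PA))
    for seq in PREDICTED + NOT_PREDICTED:
        print(seq, has_motif(PS_PA, seq))
```

Under this translation the six sequences in the Predicted group match the signature and the five in the Not Predicted group do not, each because of a single position that falls outside the allowed residues.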
ProClass Motif Alignments
• Member: True Positive (“T”), False Negative (“N”)
• Non-Member: False Positive (“F”), True Negative

PIR Pattern Search
• From Text/Sequence Search Result or Pattern Search Interface
• One Query Sequence Against PROSITE Pattern Database
• One Query Pattern (PROSITE or User-Defined) Against Sequence DB

Pattern Search Result (I)
• One Query Sequence Against PROSITE Pattern Database

Pattern Search Result (II)
• One Query Pattern Against Sequence Database
• Display the query pattern
• Sorting arrows
• Links to iProClass and UniProtKB reports
• Link to NCBI taxonomy
• Link to PIRSF report

Profile Method
• Profile: A Table of Scores to Express Family Consensus Derived from Multiple Sequence Alignments
  - Num of Rows = Num of Aligned Positions
  - Each row contains a score for the alignment with each possible residue.
• Profile Searching
  - Summation of Scores for Each Amino Acid Residue along Query Sequence
  - Higher Match Values at Conserved Positions

PIRSF scan
• Search One Query Protein Against all the Full-length and Domain HMM models for the fully curated PIRSFs by HMMER
• The matched regions and statistics will be displayed.
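The Profile Method slide describes a table with one row per aligned position and one score per residue, searched by summing scores along the query. A minimal sketch of that idea follows. It is an illustration only and not the scoring used by PIR or HMMER: the log-odds formula, the uniform background frequency, the pseudocount of 1, and the names `build_profile`, `score_window`, and `best_hit` are all assumptions, and the alignment must be ungapped.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one row per aligned column; each row maps residue -> log-odds score."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequency
    profile = []
    for col in range(length):
        column = [seq[col] for seq in alignment]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            row[aa] = math.log2(freq / background)
        profile.append(row)
    return profile

def score_window(profile, segment):
    """Sum the per-position scores of a segment as long as the profile."""
    return sum(row[aa] for row, aa in zip(profile, segment))

def best_hit(profile, query):
    """Slide the profile along the query; return (best score, start index)."""
    width = len(profile)
    return max(
        (score_window(profile, query[i:i + width]), i)
        for i in range(len(query) - width + 1)
    )

if __name__ == "__main__":
    # Toy ungapped alignment: the R-V-A-x-[ST]-P stretch of four motif sequences.
    alignment = ["RVALSP", "RVAMTP", "RVAITP", "RVAATP"]
    profile = build_profile(alignment)
    score, start = best_hit(profile, "GIPTETKNNEFQFRVAITPAGV")
    print(f"best window starts at index {start} with score {score:.2f}")
```

In this toy profile the fully conserved first column gives R a higher score than any residue receives in the variable fourth column, which is the "higher match values at conserved positions" behaviour named on the slide.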
PIRSF scan result:
• Shows PIRSF that the query belongs to
• Statistical data for all domains
• Statistical data per domain
• Alignment with consensus sequence

Secondary Structure Features
• α Helix: Patterns of Hydrophobic Residue Conservation Showing I, I+3, I+4, I+7 Pattern Are Highly Indicative of an α Helix (Amphipathic)
• β Strands That Are Half Buried in the Protein Core Will Tend to Have Hydrophobic Residues at Positions I, I+2, I+4, I+6

3D Structure
• Proteins share the same fold, suggesting homology
• Gamma Crystallin C
• Beta B1 Crystallin

Creation and Curation of PIRSFs

Integrated Bioinformatics System for Function and Pathway Discovery
• Data Integration
• Associative Analysis
• User Input (Local Data, Search Criteria, Report Format)
• Input (Gene/Protein Expression Data)
• Output (Analysis Results, Biological Interpretation)
• Integrated Bioinformatics System
  - Data Mining Tools (Retrieval, Visualization, Analysis, Correlation)
  - Sequence Analysis Pipeline (Family Classification & Feature Identification)
  - Graphical User Interface (Browsing, Querying, Navigation)
  - Data Warehouse (Gene, Protein, Family, Function, Structure, Pathway, Interaction)

Family Classification & Functional Analysis
Analytical Pipeline
• Query Sequence (UniProt)
• BLAST Search, HMM Domain Search: Top-Matched Superfamilies/Domains
• HMM Motif Search, Pattern Search, SignalP/TMHMM: Predicted Superfamilies/Domains/Motifs/Sites/Signal Peptides/TMHs
• SSEARCH, CLUSTALW: Superfamily/Domain/Motif Alignments
• Family Relationships & Functional Features

Integrated Bioinformatics System
• Global Bioinformatics Analysis of 1000’s of Genes and Proteins
• Pathway Discovery, Target Identification
• Gene Expression Data, Proteomic Data
• Integrated Protein Knowledge System
• Clustering
• Gene/Peptide-Protein Mapping
• Expression Pattern
• Protein List
• Functional Analysis (Sequence Analysis & Information Retrieval)
• Comprehensive Protein Information Matrix
• Visualization & Statistical Analysis
• Pathway Discovery (Browsing, Sorting, Visualization & Statistical Analysis)
• Clustered Matrix, Clustered Graph, Pathway Map, Process Hierarchy

Lab Section

Text Search

Text Search Result (I)
• Extend your search or start over
• Choose columns to be displayed
• Expand view
• Pre-computed BLAST Results
• Links to iProClass and UniProtKB reports
• Link to NCBI taxonomy
• Link to PIRSF report

Text Search Result (III)
• Number of Related Seq. at 3 different E-value cut-offs

Text Search Result (II)
• Extend your search or start over
• Choose columns to be displayed
• Link to PIRSF report
• Curated domain architecture with links to Pfam database
• Extent of family curation

Peptide Search

Peptide Search & Results
• Sorting arrows
• Links to iProClass and UniProtKB reports
• Link to NCBI taxonomy
• Link to PIRSF report
• Matching peptide highlighted in the sequence

Batch Retrieval Results (I)
• Retrieve more sequences
• Choose columns to be displayed
• Links to iProClass and UniProtKB reports

Batch Retrieval Results (II)
• Retrieve more families
• Choose columns to be displayed
• Links to PIRSF reports
• Curated domain architecture (N- to C-termini) with links to Pfam database

Blast Similarity Search

Blast / Related Sequences Results

Blast Result & Pairwise Alignment
• BLAST Alignment

Pairwise Alignment

Multiple Alignment
• Interactive Phylogenetic Tree and Alignment

Phylogenetic Tree and Alignment View

Pattern Search (I)

Pattern Search (II)
• Display the query pattern
• Sorting arrows
• Links to iProClass and UniProtKB reports
• Link to NCBI taxonomy
• Link to PIRSF report

PIRSF scan

PIRSF Report

PIRSF Family Hierarchy

Taxonomic Distribution & Phylogenetic Pattern

Rabbit Alpha Crystallin A Chain
• An iProClass View of the entry
• Pre-computed BLAST results
• See protein synonyms
• See IDs from different databases

alpha-Crystallin and Related Proteins
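The pairwise alignments in the lab come from SSEARCH, which the tutorial identifies as Smith-Waterman local alignment. As a companion to those exercises, here is a minimal sketch of the Smith-Waterman dynamic programming recurrence. It is not the SSEARCH implementation: the function name `smith_waterman` is hypothetical, and the flat match/mismatch scores and linear gap penalty are simplifying assumptions standing in for a substitution matrix such as BLOSUM 62 and the gap penalties discussed earlier.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment.

    Assumed scoring: flat match/mismatch values and a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

if __name__ == "__main__":
    score, top, bottom = smith_waterman("XXRVAITPYY", "GGRVAITPGG")
    print(score)
    print(top)
    print(bottom)
```

The full matrix fill is what makes this approach slower than the seed-and-extend heuristics, which is the trade-off behind the search time comparison earlier in the tutorial.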