Protein families, domains and motifs in functional prediction
May 31, 2016

Outline
• Usefulness of protein domain analysis
• Types of protein domain databases
• InterPro integrated protein domain database
• SMART database
• Predicting post-translational modifications

Protein families
• Groups of homologous sequences (within and across species) that share similar functions and domains
• Examples:
– Carbonic anhydrases (14 in humans)
– Chitin synthases (8 in C. neoformans)
– Ser/Thr kinases

Protein domains
• Conserved part of a protein sequence that can evolve, function and exist independently of the rest of the protein chain
• Often independently stable and folded
• Can recombine or evolve from gene duplications into proteins with different combinations of domains

Protein motifs
• Short linear peptide sequences that serve a specific function for the protein, but will not be stable or fold independently of the rest of the chain
• Protein-protein interaction, ligand interactions, cleavage sites, targeting
• Examples:
– 14-3-3: Interaction with kinases
– KELCH: ubiquitin targeting
– SUMO: site recognized for modification by SUMO

Predicting function for unknown proteins
• Do they belong (by sequence homology) to a protein family?
• Do they contain known protein domains?
• Do they have motifs that suggest a specific function?

When annotation is NOT enough
• You’ve got a list of genes, most of which have been annotated with gene ontology and a potential protein function
• Why would you want to go on and look more specifically at the protein domains?

Limitations of annotation
• Even in a model organism with a large amount of resources, most genes are still annotated by similarity
• Often, the name given is based on the BEST match to a particular domain or known protein
• But…

Limitations of BLAST
• Likelihood of finding a homolog to a sequence:
– >80% bacteria
– >70% yeast
– ~60% animal
• The rest are truly novel sequences
• ~900/6500 proteins in yeast are without a known function
• NAME: “Similar to yeast protein YAL7400” is not very informative

Limitations of similarity
• Proteins with more than one domain cause problems
– Numerous matches to one domain can mask matches to other domains
• Increased size of protein databases
– The number of related sequences rises, and hits to less related sequences may be lost
• Low-complexity regions can mask domain matches

Proteins are modular
• Individual domains can and often do fold independently of other domains within the same protein
• Domains can function as an independent unit (or truncation experiments would never work)
• Thus the identity of ALL protein domains within a sequence can provide further clues about its function

Proteins can have >1 domain
• The name “protein kinase receptor UFO” doesn’t necessarily tell you that this protein also contains IgG and fibronectin domains or that it has a transmembrane domain

Domains are not always functional
• If a critical residue is missing in an active site, the domain is not likely to be functional
• A similarity score won’t pick that up

Protein signature databases
• Identify domains or classify proteins into families to allow inference of function
• Approaches include:
– regular expressions and profiles
– position-specific scoring matrix-based fingerprints
– automated sequence clustering
– Hidden Markov Models (HMMs)

PROSITE
• Regular expression patterns describing functional motifs
– Enzyme catalytic sites
– Prosthetic group attachment sites
– Ligand or metal binding sites
• A sequence either matches the pattern or it does not
• Some families/domains are defined by co-occurrence of patterns
• Example pattern: M-x-G-x(3)-[IV](2)-x(2)-{FWY}
• Citrate synthase: G-[FYAV]-[GA]-H-x-[IV]-x(1,2)-[RKTQ]-x(2)-[DV]-[PS]-R

PRINTS
• Similar to PROSITE patterns
• Multiple-motif approach using either identity or a weight matrix as its basis
• Groups of conserved motifs provide diagnostic protein family signatures
• Can be created at super-family, family and sub-family level
http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php

Profile-HMMs
• Models are generated from alignments of many homologues by counting the frequency of occurrence of each amino acid in each column of the alignment (the profile)
• Profile-HMMs turn these counts into probabilities of occurrence against a background evolutionary model that accounts for possible substitutions
• Provide a convenient and powerful way of identifying homology between sequences
• Find domains in sequences that would never be found by BLAST alone

HMM domain databases
• PFAM
– Classify novel sequences into protein domain profiles
– Most comprehensive; >16,000 protein families (v29)
• SMART
– Signaling, extracellular and chromatin proteins
– Identification of catalytic site conservation for enzymes
• TIGRFAMs
– Families of proteins from prokaryotes
• PANTHER
– Classification based on function using literature evidence

PFAM
• Manually curated profiles
• E-value: a statistical measure of the likelihood that an alignment occurred by chance alone
• Does not indicate functionality

PFAM Summary

PFAM Domain Organization

SMART database
• SMART: Simple Modular Architecture Research Tool
– Focus on signaling, extracellular and chromatin-associated proteins
– Curated models for >1200 domains
• Use?
– I have several kinase domains in my protein list and want to know which ones are functional.
– What other domains are found in signaling proteins?
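
Aside: the "either matches or not" behaviour of PROSITE patterns can be seen by translating a pattern into an ordinary regular expression. The sketch below is a minimal illustration, not PROSITE's own scanning tool: the function name `prosite_to_regex` and the test peptide are hypothetical, and the N-/C-terminal anchors (`<`, `>`) of the full PROSITE syntax are not handled. It assumes `x` means any residue, `[...]` lists allowed residues, `{...}` lists forbidden residues, and `(n)` or `(n,m)` gives a repeat count.

```python
import re

# One PROSITE element: x, a single residue, [allowed], or {forbidden},
# optionally followed by a repeat such as (3) or (1,2).
ELEMENT = re.compile(
    r"^(?P<core>x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})"
    r"(?:\((?P<lo>\d+)(?:,(?P<hi>\d+))?\))?$"
)

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string."""
    pieces = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = ELEMENT.match(element)
        if m is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        core = m.group("core")
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        lo, hi = m.group("lo"), m.group("hi")
        if lo and hi:
            core += "{" + lo + "," + hi + "}"
        elif lo:
            core += "{" + lo + "}"
        pieces.append(core)
    return "".join(pieces)

# Citrate synthase pattern from the PROSITE slide.
citrate = "G-[FYAV]-[GA]-H-x-[IV]-x(1,2)-[RKTQ]-x(2)-[DV]-[PS]-R"
citrate_regex = prosite_to_regex(citrate)

# Made-up peptide (not a real citrate synthase) that contains a match.
toy_sequence = "MSTGFGHAIAKAADPRLL"
hit = re.search(citrate_regex, toy_sequence)
print(citrate_regex)
print(hit.start() + 1 if hit else "no match")  # 1-based start of the match
```

A single substitution at a fixed position (for example the final R) makes the match fail outright, which is why a pattern hit says nothing about weaker, more distant similarity.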

SMART: Search interface
• UniProt or Ensembl protein accession number
• Add other searches

SMART Output

InterPro Scan
• Combines search methods from several protein databases
• Uses tools provided by member databases
– Uses threshold scores for profiles & motifs
• InterPro is a convenient means of deriving a consensus among signature methods
• InterPro records are integrated with UniProt. If you have a UniProt accession number, you can access the InterPro information from UniProt

MAPK14 InterPro record

MAPK14 UniProt record

Function from sequence
• Membrane bound or secreted?
• GPI anchored?
• Cellular localization?
• Post-translational modification sites?

CBS prediction services
• Protein sorting
– SignalP, TargetP, others
• Post-translational modification
– Acetylation, phosphorylation, glycosylation
• Immunological features
– Epitopes, MHC allele binding, etc.
• Protein function & structure
– Transmembrane domains, co-evolving positions

Transmembrane domain prediction

Phosphorylation prediction

O-glycosylation

EMBOSS
Open source software for molecular biology
• Predict antigenic sites
– Useful if you want to design a peptide antibody
• Look for specific motifs, even degenerate ones
– Known phosphorylation motifs
– Find motifs in multiple sequences with one submission
• Get stats on protein/nucleic acid sequences
• Sequence manipulation of all kinds

Today in lab
• Tutorial on protein information sites
• From a sublist generated using DAVID, generate a list of protein IDs and obtain the sequences
• Obtain protein accession numbers for the cluster
• Submit to the SMART database to characterize/analyze the domains
• Pick 2 proteins to do additional predictions
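
Aside for the lab: once sequences are in hand, a degenerate motif search across several proteins at once (the kind of task listed under EMBOSS) can be approximated offline in a few lines. This is only a sketch and does not call EMBOSS. The helper names `parse_fasta` and `find_motif`, the two toy sequences, and the motif `[RK]{2}.[ST]` (used here as an assumed stand-in for a basophilic kinase phosphorylation consensus) are all illustrative choices rather than anything from the lecture.

```python
import re

def parse_fasta(text):
    """Return {identifier: sequence} from FASTA-formatted text."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # identifier = first word of the header
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records

def find_motif(records, regex):
    """Return {identifier: [1-based start positions]}, overlapping hits included."""
    scanner = re.compile("(?=(" + regex + "))")  # lookahead allows overlaps
    return {
        name: [m.start() + 1 for m in scanner.finditer(seq)]
        for name, seq in records.items()
    }

# Made-up sequences for illustration only.
TOY_FASTA = """>protA hypothetical
MKRRASLLGT
AAKKLTPEE
>protB hypothetical
MGGGSPLLAV
"""

hits = find_motif(parse_fasta(TOY_FASTA), "[RK]{2}.[ST]")
for name, positions in hits.items():
    print(name, positions)
```

As with any short motif, hits like these are expected by chance in many proteins, so they are candidates to check against a dedicated predictor rather than evidence of modification.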