Identification of Protein Domains

Orthologs and Paralogs
Describing evolutionary relationships among genes (proteins):
• The two major ways of creating homologous genes are gene duplication and speciation.
• Homology alone is not sufficiently well-defined, so additional terms are used.
[Figure: gene trees with branches labeled ortho, para, co-ortho, in-para and out-para]
• Orthologs are two genes from two different species that derive from a single gene in the last common ancestor of the species.
• Paralogs are genes that derive from a single gene that was duplicated within a genome.
• Co-orthologs are paralogs produced by duplications of orthologs subsequent to a given speciation event.
• Inparalogs are paralogs in a given lineage that all evolved by gene duplications that happened after the speciation event.
• Outparalogs are paralogs in the given lineage that evolved by gene duplications that happened before the speciation event.

Orthologs and Paralogs
• Orthologs – evolutionary functional counterparts in different species.
• Inparalogs – important for detecting lineage-specific adaptations.

Proteins
• Protein sequence databases are growing rapidly because of genome sequencing projects.
• Many new proteins belong to protein families with known functions (as shown by significant sequence similarity).
• Only a small fraction of known proteins have functions determined by experiment.
• Databases providing computational sequence analysis allow us to assign new proteins to known families, and thus infer their function.

Protein Domains
• A domain is an independent structural unit which can be found alone or in conjunction with other domains or repeats.
• Module = mobile domain.
• Different domains have distinct functions.
• Many eukaryotic proteins have multiple domains.

Protein Domains
[Figure: PX domain with ligand; SH3 domain with ligand]

Identifying Protein Domains
Problems:
– Defining the members of each family.
– Building multiple alignments of the members.
– Finding the boundaries of the domain.
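As an illustration of the ortholog and paralog definitions given earlier, here is a minimal Python sketch. It assumes a gene tree whose internal nodes are already labeled as speciation or duplication events. The Node class, the relationship function and the toy human/mouse tree are hypothetical and do not come from the presentation. Under this assumption, two genes are called orthologs when their last common ancestor in the gene tree is a speciation node, and paralogs when it is a duplication node.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Gene-tree node (hypothetical structure, for illustration only)."""
    name: str
    # "speciation" or "duplication" for internal nodes, None for leaves (genes)
    event: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def path_to(root: Node, leaf_name: str) -> List[Node]:
    """Return the nodes from root down to the named leaf, or [] if absent."""
    if not root.children:
        return [root] if root.name == leaf_name else []
    for child in root.children:
        sub = path_to(child, leaf_name)
        if sub:
            return [root] + sub
    return []


def relationship(root: Node, gene_a: str, gene_b: str) -> str:
    """Classify two distinct genes by the event at their last common ancestor."""
    if gene_a == gene_b:
        raise ValueError("need two distinct genes")
    path_a, path_b = path_to(root, gene_a), path_to(root, gene_b)
    if not path_a or not path_b:
        raise ValueError("gene not found in tree")
    lca = root
    for x, y in zip(path_a, path_b):
        if x is not y:
            break
        lca = x  # deepest node shared by both root-to-leaf paths so far
    return "orthologs" if lca.event == "speciation" else "paralogs"


# Toy tree (assumed): an old duplication creates gene copies A and B,
# then human and mouse diverge, then copy B duplicates again in mouse only.
tree = Node("root", "duplication", [
    Node("A", "speciation", [Node("humanA"), Node("mouseA")]),
    Node("B", "speciation", [
        Node("humanB"),
        Node("mouseB", "duplication", [Node("mouseB1"), Node("mouseB2")]),
    ]),
])

for pair in [("humanA", "mouseA"), ("humanA", "humanB"),
             ("mouseB1", "mouseB2"), ("humanB", "mouseB1")]:
    print(pair, relationship(tree, *pair))
```

In this toy tree, mouseB1 and mouseB2 are inparalogs with respect to the human/mouse split (their duplication came after it) and are co-orthologs of humanB, while humanA and humanB are outparalogs (their duplication came before it). The sketch only separates orthologs from paralogs; telling inparalogs from outparalogs would also require comparing the duplication node against the chosen speciation event.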
Identifying Protein Domains
• Little structural data → identification by sequence analysis.
• Even when the structure of the domain is not known, it may be possible to define its boundaries from sequence alone.
• Sequence characterization of families determines 3D structure and molecular functions.

Identifying Protein Domains
Motif matches are often useful to indicate functional sites, however:
• They do not give a clear picture of the domain boundaries.
• They lack sensitivity.

Identifying Protein Domains
Automatic methods:
• Fast, effective, deal with a lot of information.
• Might fragment domain families.
• Might cause fusion of domain families.
Manual methods:
• Put the knowledge of protein experts to use.
• Slow, require a lot of manpower.

SMART (Simple Modular Architecture Research Tool)
Web-based resource used for:
– rapid annotation of protein domains.
– analysis of domain architectures.

Domain Architecture
[Figure: domain architectures of three proteins]
• Protein: PA-3427CG, Species: Drosophila melanogaster
• Protein: ENSMUSP00000023109, Species: Mus musculus
• Protein: ENSANGP00000009529, Species: Anopheles gambiae

SMART (Simple Modular Architecture Research Tool)
• There are over 600 domain families.
• Provides information about:
– function.
– subcellular localization.
– phyletic distribution.
– tertiary structure.
• Based on HMMs (Hidden Markov Models).

SMART (Simple Modular Architecture Research Tool)
• Each HMM is based on a seed alignment.
• Threshold values are used to determine homology of domains.

SMART (Simple Modular Architecture Research Tool)
• Proteins are aligned by:
– Minimizing insertions/deletions in conserved alignment blocks.
– Optimizing amino acid property conservation.
– Closing unnecessary gaps.
• Gapped alignments are preferred over ungapped ones:
– prediction of domain boundaries.
– greater information content.
• Alignment of entire structural domains.

PROSITE - database of protein families and domains
• Database of biologically significant sites and patterns. Contains 1,609 profiles.
• Pattern – conserved sequence of a few amino acids.
• Identifies to which known family of proteins (if any) the new sequence belongs.
• Used to determine the function of uncharacterized proteins translated from genomic or cDNA sequences.

PROSITE - database of protein families and domains
• A protein too distant from any other for its resemblance to be detected by overall sequence alignment can still be classified according to a pattern.
• Patterns arise because the requirements of binding sites impose very tight constraints on the evolution of portions of the protein.

PROSITE – how is a pattern developed?
• As short as possible.
• Detects all or most of the sequences it describes.
• As few false results as possible.
In other words: high sensitivity and high specificity.

PROSITE – how is a pattern developed?
First, study reviews on a protein family. Then build an alignment table, with particular attention to residues and regions important to the biological function of that family:
- Enzyme catalytic sites.
- Prosthetic group attachment sites (heme).
- Amino acids involved in binding a metal ion.
- Cysteines involved in disulfide bonds.
- Regions involved in binding a molecule (ADP/ATP, GDP/GTP, calcium, DNA, etc.) or another protein.

PROSITE – steps in the development of a pattern
• Find a core pattern: 4-5 biologically significant residues.
• Test the pattern on a large database.
• If lucky, there is correlation in this region, which indicates a good pattern.
• Mostly, there is no correlation:
– Gradually increase the size of the pattern.
– Search over other patterns.

PROSITE – An example
This pattern is small and would probably pick up too many false positive results:
ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS

Patterns – small regions, high sequence similarity.
Profiles – characterize a protein family or domain over its entire length.

Research: Finding new domain families
Automatic methods
• The team started with 107 nuclear domains.
• Using SMART, get all proteins with at least one of these domains and characterize their complete domain structure.
• Regions not annotated using known SMART domain models were extracted with their domain context.

Finding new domain families: Automatic methods
• Grouping proteins by region similarity.
• Finding homologs using PSI-BLAST on the longest region of every group (threshold E-value < 0.001).
• Finding domain organization via SMART.
• Homologous regions – candidates for a novel domain family.

Finding new domain families
[Flowchart: 107 nuclear domains → finding proteins (SMART) → regions not known by SMART → group regions → PSI-BLAST: finding homologs → domain architecture (SMART) → manual inspection → more searches]

Finding new domain families: Manual confirmation
• Different context – novel module family.
• Proteins with nuclear AND extracellular domains excluded.
• Multiple alignments and known locations of domains – definition of domains' borders.
• Automatic searches to find more members (E-value < 0.1), followed by manual checks.
• Marginal similarity to a domain family – possible divergent family.

Prediction of Function: Chromatin-Binding Domains
• Protein SPT6, which contains a CSZ domain, regulates transcription through a histone-binding capability.
• It also contains two other types of domains, which are unlikely to bind histones.
• Therefore it was predicted that the CSZ domain has that function.

Research
• Arabidopsis protein – UBA domain at the N-terminus.
• A PSI-BLAST search with the C-terminal region (E-value < 10^-5) found UBX-containing proteins and metazoan homologs of PNGases.
• PNGases – proteins involved in the UPR.
• UPR – unfolded protein response.
• PUG – the homologous regions.
• PUG domains are found in proteins with domains central to ubiquitin-mediated proteolysis (UBA and UBX).
Conclusion: PUG-containing proteins might link the UPR to ubiquitin-mediated protein degradation.
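Two steps of the automatic method described above lend themselves to a short Python sketch: extracting the regions of a protein that no known domain model covers, and keeping only search hits below an E-value threshold. This is a minimal illustration under stated assumptions, not the team's actual pipeline. The function names, the 1-based inclusive coordinates, the min_len cutoff and the example data are all hypothetical; only the E-value threshold of 0.001 comes from the slides.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive (assumed convention)


def unannotated_regions(length: int, domains: List[Interval],
                        min_len: int = 30) -> List[Interval]:
    """Return stretches of a protein not covered by any annotated domain.

    min_len is a hypothetical cutoff: shorter gaps are treated as linkers
    and skipped, since they are unlikely to hold a whole domain.
    """
    regions = []
    pos = 1  # first residue not yet covered by a domain
    for start, end in sorted(domains):
        if start - pos >= min_len:
            regions.append((pos, start - 1))
        pos = max(pos, end + 1)  # max() handles overlapping annotations
    if length - pos + 1 >= min_len:
        regions.append((pos, length))
    return regions


def significant_hits(hits: List[Tuple[str, float]],
                     threshold: float = 1e-3) -> List[str]:
    """Keep identifiers of hits whose E-value is below the threshold."""
    return [name for name, evalue in hits if evalue < threshold]


# Hypothetical protein of 500 residues with two annotated domains.
example_regions = unannotated_regions(500, [(50, 120), (300, 380)])
print(example_regions)

# Hypothetical (identifier, E-value) pairs standing in for search results.
example_hits = significant_hits([("protA", 2e-8), ("protB", 0.05), ("protC", 9e-4)])
print(example_hits)
```

Sorting the domain intervals and sweeping once from the N-terminus keeps the extraction linear after the sort and tolerates overlapping annotations. Each extracted region keeps its coordinates, so the flanking domains (its "domain context") can still be looked up afterwards.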
[Figure: domain architectures of PUG-containing proteins: PUG with UBA, PUG with UBX, PUG with UBCc, PNGases with PUG. Annotations: PNGases are believed to have a role in the UPR; UBA and UBX are domains central to ubiquitin-mediated proteolysis.]
[Figure: structures of the UBX domain from human FAF1 (apoptosis) and the C-terminal UBA domain of the human homologue of RAD23A (hHR23A) (DNA-binding protein).]

• Orthologs of PNGases in metazoans are present singly (not as multiple paralogs), so they are likely to have similar cellular localization.
• The ortholog in Saccharomyces cerevisiae is known to be localized mainly in the nucleus. It is likely that PNGases are localized in the nucleus too.
• An HMM built from the PUG domain shows marginal similarity to IRE1p-like kinases, which are known to initiate the UPR as well.
• The team suggests the presence of divergent PUG domains in the C-termini of these proteins.
• Analysis revealed a conserved region in metazoan PNGases. It was named PAW and added to SMART.

• The team found 28 novel nuclear domain families.
• Most of them have representatives in diverse molecular contexts in different species.
• Some are specific to a single species.
• Others are divergent members of previously recognized families.

The End
