* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Slide 1
Protein design wikipedia , lookup
Circular dichroism wikipedia , lookup
Homology modeling wikipedia , lookup
Protein folding wikipedia , lookup
Protein structure prediction wikipedia , lookup
Bimolecular fluorescence complementation wikipedia , lookup
RNA-binding protein wikipedia , lookup
Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup
Protein mass spectrometry wikipedia , lookup
Protein purification wikipedia , lookup
Western blot wikipedia , lookup
Intrinsically disordered proteins wikipedia , lookup
List of types of proteins wikipedia , lookup
P-type ATPase wikipedia , lookup
Protein–protein interaction wikipedia , lookup
Protein moonlighting wikipedia , lookup
Ankita Sarangi School of Informatics, IUB Capstone Presentation, May 11, 2009 Advisor : Yuzhen Ye Sequence based approaches Structure-based approaches Motif-based approaches (sequence motifs, 3D motifs) ◦ Protein A has function X, and protein B is a homolog (ortholog) of protein A; Hence B has function X ◦ Protein A has structure X, and X has specific structural features; Hence X’s function sites are used to assign function to the Protein A ◦ A group of genes have function X and they all have motif Y; protein A has motif Y; Hence protein A’s function might be related to X “Guilt-by-association” ◦ Gene A has function X and gene B is often “associated” with gene A, B might have function related to X ◦ Associations Domain fusion, phylogenetic profiling, PPI, etc. A protein domain is a part of protein sequence and structure that can evolve, function, and exist independently of the rest of the protein chain. ◦ Each domain forms a compact three-dimensional structure and often can be independently folded. ◦ Many proteins consist of several structural domains. Among relevant sequence features of a protein, domains occupy a key position. They are sequential and structural motifs found independently in different proteins, in different combinations, and as such seem to be the building blocks of proteins However, it is also known that certain sets of independent domains are frequently found together, which may indicate functional cooperation. Supra- Domains : A supra-domain is defined as a domain combination in a particular N-to-C-terminal orientation that occurs in at least two different domain architectures in different proteins with: (i) different types of domains at the N and C-terminal end of the combination; or (ii) different types of domains at one end and no domain at the other. ` A type of Supra-domain are ones whose activity is created at the interface between the two domains of a protein ◦ (Ref: JMB, 2004, 336:809–823) We may make mistakes if we do function prediction based on individual domains ◦ We know proteins that have domain A and B have function F, what about proteins having domain A or domain B only? A survey of mis-annotation based on single domains ◦ We are interested to know how serious this problem is in the current annotation system ◦ There is no systematic survey on this so far Function annotation using domain patterns (domain combinations) instead of individual domains ◦ Utilize the relationship of the predicted functions (as shown in the GO directed acyclic graph of functions) ◦ Provide a web-tool and visualization of the predicted functions and their relationship with domain patterns SUPERFAMILY is a database of structural and functional protein annotations for all completely sequenced organisms. The SUPERFAMILY web site and database provides protein domain assignments, at the SCOP 'superfamily' and 'family' levels, for the predicted protein sequences in over 900 organisms We made a local copy of this database The GENE ONTOLOGY(GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases. Consists of three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. We looked at several supra-domains listed in this paper: Supra-domains: Evolutionary Units Larger than Single Protein Domains; Voget etal.; J. Mol. Biol. (2004) 336, 809–823 Superfamily Use the SCOP ID of the domains to obtain Gene Identifiers associated with the supra-domains as well as their individual domains UniProt ID mapping file (is a tab-delimited table, which includes mappings for 20 different sequence identifier types(example: ENSGALP000, AN7518.2, Afu1g003, gi|41409236|ref|NP_962072.1|)and gene_association.goa_uniprot (GO assignments for the UniProt KnowledgeBase (UniProtKB)) To obtain Swiss prot ID, GO ID Find Gene Ontology functions that are associated with proteins which contain both the domains and the individual domains (SCOP ID - 63380) (SCOP ID – 52343) The N-terminal domain binds FAD and the C-terminal domain binds NADPH. The FAD acts as an intermediate in electron transfer between NADPH and substrate, and this domain combination is used by many different enzymes GO:0046872 GO:0050661 GO:0051536 GO:0006810 GO:0016491 GO:0016021 GO:0055114 GO:0004517 GO:0004783 GO:0005506 GO:0004497 GO:0010181 GO:0005488 GO:0003824 GO:0008152 GO:0020037 GO:0009055 GO:0050381 GO:0005737 GO:0050660 GO:0016020 120 100% IEA 80 % IEA 100 80 60 Riboflavin Synthase domain-like 40 Ferredoxin Reductase-like Supra-domain 20 0 10 proteins with Supra domains annotated to GO:0016491---- 2491proteins with Supra domains 3 proteins with Riboflavin Synthase domain-like annotated to GO:0016491 --- 42 proteins with Riboflavin Synthase domain-like 1 protein with reductase-like, C-terminal NADPlinked domain annotated to GO:0016491--- 47 protein with reductase-like, C-terminal NADP-linked domain Specific proteins searched and presence and absence of the combined domain was confirmed along with GO ID as well as annotation evidence which was found to be Inferred Electronic Annotation Supra –Domains: Riboflavin Synthase domain-like, Ferredoxin reductase-like, C-terminal NADP-linked domain Protein Name : Oxidoreductase FAD-binding domain protein Gene Ontology : Biological Process: GO:005514 is_a child of GO:0008152 molecular function: GO:0016491 PFAM domains: PF00970. FAD_binding PF00175 NAD_binding Evidence : IEA (Inferred Electronic Annotation) Proteins: A4FHX1 , A1UCP3, A4T5V2, A3PWD0 ,Q1BCA1 Protein Name : Sulfide dehydrogenase (Flavoprotein) subunit SudA sulfide dehydrogenase (Flavoprotein) subunit SudB Gene Ontology: Biological Process: GO:005514 is_a child of GO:0008152 molecular function: GO:0016491 PFAM Domains : PF00175. NAD_binding PF07992. Pyr_redox (FAD_pyr_nucl-diS_OxRdtase.) Evidence : IEA (Inferred Electronic Annotation) Proteins: Q2J1U9, Q13CJ3, Q5PB24 Riboflavin Synthase domain-like Protein Name : Putative uncharacterized protein Gene Ontology: Biological Process: GO:005514 is_a child of GO:0008152 molecular function: GO:0016491 PFAM Domains : PF07992 - Pyr_redox (Q0A5G3) OR PF08021. FAD_binding (A4FEM2, A1WVX7 ) Evidence : IEA (Inferred Electronic Annotation) Proteins: Q0A5G3, A4FEM2, A1WVX7 Ferredoxin reductase-like, C-terminal NADPlinked domain Protein Name : Dihydroorotate dehydrogenase electron transfer subunit, putative Gene Ontology: Biological Process: GO:005514 is_a child of GO:0008152 molecular function: GO:0016491 molecular function: GO:0016491 PFAM domains: PF00970. FAD_binding PF00175. NAD_binding Evidence : IEA (Inferred Electronic Annotation) Proteins: A3CN91, Q73P17 Protein Name: Protein-P-II uridylyltransferase Gene Ontology: Biological Process: GO:0008152 is a parent of GO:0006807 PFAM: PF01966 - NAD Binding Evidence : IEA (Inferred Electronic Annotation) Protein: Q6MLQ2 Ref: http://www.uniprot.org/uniprot/ PreATP-grasp domain (SCOP ID = 52440) Glutathione synthetase ATP-binding domain-like (SCOP ID = 56059) Lots of different enzymes forming carbon–nitrogen bonds have this combination of domains. Both domains contribute to substrate binding and the active site, and the C-terminal domain binds ATP as well as the other substrate; GO:0006750 GO:0004086 GO:0005618 GO:0009374 GO:0003824 120 GO:0008152 GO:0009317 GO:0004075 GO:0008716 GO:0003989 GO:0016491 GO:0006633 GO:0006807 GO:0009252 GO:0016874 GO:0005737 GO:0004363 GO:0005524 75% IEA 100% IEA 100 80 Pre-ATP Grasp domain 60 Glutathione Synthetase ATP-Binding 40 Domain Like Supra-Domain 20 0 Functional annotations were found to be shared by proteins having the Supra-domains as well as the single domains. The percentage of proteins having Supradomains were much higher than single domains. Since, both domains are required for the function of the protein, the functions assigned to single domain proteins may be said to be misannotated. This study gave us motivation of developing a computational tool for function annotation based on domain combinations (domain patterns) instead of individual domains Utilize the relationship of the predicted functions (as shown in the GO directed acrylic graph of functions) Provide a web-tool and visualization of the predicted functions and their relationship with domain patterns Functional annotation term F (in this case a Gene Ontology) and a domain set D. The probability that a protein exhibiting D would possess F is modeled as P(F|D)=P(D|F)P(F)/P(D) (i.e., posterior probability of a function given a set of domains D; P(D|F), P(F), and P(D) can be learned from proteins with known functions) Ref: Predicting protein function from domain content; Forslund et al;Bioinformatics, Vol. 24 no. 15 2008, pages 1681–1687 Gene Ontology database gene_association.goa_uniprot Swisspfam For an input domain pattern (pfam domains): All the Pfam pattern containing the given pattern are extracted (e.g., if input domain pattern is A + B, all the domain patterns that contain this domain pattern will be considered, such as A + B + C, etc) GO function associated with all the domain patterns are extracted Calculate the probability using P(F|D)=P(D|F)P(F)/P(D) number of proteins that occurs with the domain pattern possessing the function If the percentage probabilities lie close to one another than the parent GO function is found and a diagram depicting a sum of the distance of the parent from the two children is printed; otherwise the GO terms that have P(F|D) >= 0.9 * Max{P(F|D)} are extracted Summary graph providing all the GO functions associated with the pattern search A survey of annotation based on single domains Function annotation using domain patterns (domain combinations) instead of individual domains To DO: ◦ Do a more thorough survey with the annotation studies of single domains ◦ Define all the relationships between the GO ID’s in the Summary Graph ◦ Refine and test the computational tool. I would like to thank: Dr. Yuzhen Ye Faculty of the Department of Bioinformatics Drs. Dalkilic, Kim, Hahn, Radivojac, Tang Linda Hostetter and Rachel Lawmaster Family and Friends