Graph preprocessing
• Introduction
• Noise removal and data enhancement problem
• Noise removal and data enhancement on binary data
• Noise removal and data enhancement on graph data
• Noise removal and data enhancement tools
• Current research problems
• Future directions

Protein Function and Interaction Data
• Proteins usually interact with other proteins to perform their function(s)
• Interaction data provides a glimpse into the mechanisms underlying biological processes
  o Networks of pairwise protein-protein interactions
  o Protein complexes
  o Neighboring proteins in an interaction network tend to perform similar functions
  o Several computational approaches have been proposed for predicting protein function from interaction networks [Pandey et al, 2006]
• A group of proteins occurring in many complexes may represent a functional module that consists of proteins involved in similar biological processes

Problems with Available Interaction Data (I)
• Noise: Spurious or false positive interactions [Hart et al, 2006]
• Leads to a significant fall in the performance of protein function prediction algorithms [Deng et al, 2003]

Problems with Available Interaction Data (II)
• Incompleteness: Unavailability of a major fraction of the interactomes of major organisms [Hart et al, 2006]
• Yeast: 50%, Human: 11%
• May delay the discovery of important knowledge

Pre-processing of Protein Interaction Networks
• Overall Objective: Accurate inference of protein function from interaction networks
• Complexity: Noise and incompleteness in interaction networks adversely impact the accuracy of functional inferences [Deng et al, 2003]
• Potential Approach: Pre-processing of interaction networks

Our Approach
• Transform graph G=(V,E,W) into G’=(V,E’,W’), where Pi and Pj are connected if (Pi,Pj) is a hyperclique pattern
[Figure: Input PPI graph and Transformed PPI graph]
• Tries to meet three objectives:
  – Addition of potentially biologically valid edges
  – Removal of potentially noisy edges
  – Assignment of weights to the resultant set of edges that indicate their reliability

Sparsification to remove spurious edges
[Figure: panels labeled "Common neighbor-based transformation" and "Pruning to remove spurious edges"; edge counts shown: # edges = 6490, # edges = 95739, # edges = 6874]

Related Approaches: Neighborhood-based Similarity
[Figure: protein pairs i and j]
• Motivation: Two proteins sharing several common neighbors are likely to have a valid interaction
• Probability (p-value) of having m common neighbors given the degrees of the two proteins, n1 and n2, and the size of the network, N [Samanta et al, 2003]
• Handles the problem of high-degree nodes
• Number of common neighbors, or Jaccard similarity (m/(n1+n2-m)) [Brun et al, 2003]
• Min(fractions of common neighbors) = Min(m/n1, m/n2)
  – Identical to pairwise h-confidence
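
The neighborhood-based measures and the edge transformation described above can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the toy edge list, and the `threshold` parameter are assumptions made for illustration, and any minimum-support condition that a full hyperclique miner would apply is left out. A pair (Pi,Pj) is kept when its pairwise h-confidence, Min(m/n1, m/n2), reaches the threshold, and that value is stored as the edge weight.

```python
from itertools import combinations


def neighbor_sets(edges):
    """Build {node: set of neighbors} from a list of undirected edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def jaccard(adj, i, j):
    """Jaccard similarity m / (n1 + n2 - m), m = number of common neighbors."""
    m = len(adj[i] & adj[j])
    union = len(adj[i]) + len(adj[j]) - m
    return m / union if union else 0.0


def h_confidence(adj, i, j):
    """Pairwise h-confidence: min(m / n1, m / n2)."""
    m = len(adj[i] & adj[j])
    n1, n2 = len(adj[i]), len(adj[j])
    if n1 == 0 or n2 == 0:
        return 0.0
    return min(m / n1, m / n2)


def transform(edges, threshold):
    """Return {(i, j): weight} for all pairs with h-confidence >= threshold.

    Pairs need not be edges in the input graph, so this step can add
    edges as well as remove them. `threshold` is a hypothetical parameter.
    """
    adj = neighbor_sets(edges)
    weighted = {}
    for i, j in combinations(sorted(adj), 2):
        w = h_confidence(adj, i, j)
        if w > 0 and w >= threshold:
            weighted[(i, j)] = w
    return weighted


# Toy network (hypothetical data): A and B share neighbors C and D.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"),
             ("B", "C"), ("B", "D"), ("D", "E")]
toy_adj = neighbor_sets(toy_edges)
new_graph = transform(toy_edges, threshold=0.5)
for pair, weight in new_graph.items():
    print(pair, round(weight, 3))
```

On this toy input the pair (C, D), which is not an input edge, is added because C and D share both of their neighbors A and B, while weakly supported edges such as (D, E) are dropped. Checking all node pairs is quadratic in the number of proteins, so a real network would need a more selective candidate generation step.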
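
The p-value idea above can also be sketched in code. The version below assumes a hypergeometric null model, in which the n2 neighbors of one protein are drawn at random from the N proteins and m of them happen to fall among the n1 neighbors of the other; the exact formulation in [Samanta et al, 2003] may differ in detail, and the function name is hypothetical.

```python
from math import comb


def common_neighbor_pvalue(N, n1, n2, m):
    """Probability of at least m common neighbors under a hypergeometric null.

    N: number of proteins in the network
    n1, n2: degrees of the two proteins
    m: observed number of common neighbors
    Assumption: neighbors are drawn uniformly at random from all N proteins.
    """
    upper = min(n1, n2)
    # Sum the hypergeometric probabilities for k = m, ..., min(n1, n2).
    tail = sum(comb(n1, k) * comb(N - n1, n2 - k) for k in range(m, upper + 1))
    return tail / comb(N, n2)


# Same number of shared neighbors, very different degrees (hypothetical values).
p_low_degree = common_neighbor_pvalue(N=1000, n1=3, n2=3, m=2)
p_high_degree = common_neighbor_pvalue(N=1000, n1=200, n2=200, m=2)
print(p_low_degree < p_high_degree)
```

Two shared neighbors are far less surprising for a pair of high-degree proteins than for a pair of low-degree ones, which is one way to read the remark that this measure handles the problem of high-degree nodes.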
