Download BIBE06_kaushik - Ohio State Computer Science and Engineering

Exploratory Tools for Follow-up Studies to Microarray Experiments Kaushik Sinha Ruoming Jin Gagan Agrawal Helen Piontkivska Ohio State and Kent State Universities 1 Overall Motivation    Biological literature is vast Need tools to find interesting patterns from literature Specific Example   Identify genes from DNA microarray and other gene and protein assays Next step    What is known about these genes? How are these genes related to each other or other genes identified in similar studies? Which other genes are most similar 2 Outline    Hypergraph Mining Similarity Measures Evaluation and Observations 3 Hypergraph Mining: Motivating Example     Micro array experiment - suspects that a small set of genes are related to a disease Confirm by searching existing literature - expect related genes to appear together in literature However, suppose Gene A and C are related and both of them are weakly related to another term B In literature, one would expect     A,C appear together OR/AND A,B appear together B,C appear together How do we efficiently conclude that A,C are actually related? 4 Hypergraph Mining  Basic Motivation   Example (Gene-Disease Relationship)     Gene A is related to a term B Term B is related to a gene C Is Gene A related to Gene C ? Gene Source   To find useful “Transitive Relation” (hyperedges) among genes Microarray Experiments Information Source  Online Literature abstracts 5 Formal Problem Definition  Given     A dictionary KT A set KM of user provided keywords (KT‫כ‬KM) Collection of literature abstracts - each abstract is represented as a set of words from dictionary Task  To find hyperedges exceeding user defined threshold, each of which involves a set of key words from KM and are potentially connected by another set of linking words from KT-KM 6 Relationship to Work on Frequent Pattern Mining  Frequent itemset mining     Can represent each document abstract as a transaction with several keywords Find sets of keywords that appear together and often Cannot capture cross relationships Differences   How do we define support ? How do we prune search space 7 Solution Approach  Define    total weight=support + cross support Support: set of keywords appear together in one document Cross support: set of keywords can be partitioned    each partition appears in different document Common linking words Issues  Since downclosure property does not hold for total weight modified downclosure property can be defined 8 Idea  Support satisfies downclosure property    Cross support can be designed to be restricted below a particular value, i.e., it is bounded Form a function h as addition of two functions h=f+g    f satisfies downclosure property g is bounded h satisfies modified down closure property   Let X be a set, Ω be its power set. A function f : Ω →R+ satisfies downclosure property if for all A,B ∈ Ω , A ‫ כ‬B ,f(B)>f(A) For any θ≥0, if h(Kn) ≥θ then f(Kn-1) ≥ max{0,(θ-sup(g))} This property can be used to devise efficient algorithm 9 Outline    Hypergraph Mining Similarity Measures Evaluation and Observations 10 Similarity Measure among Sets of Genes    Given two list of gene names Need to find most similar genes, based on literature abstract occurrences Standard statistics approach      Each file containing gene names can be considered as a Discrete Random Variable (DRV) Each such DRV can take several values (gene names) For two such files X,Y and for any pair (x,y), joint probability mass function p(x,y)=P(X=x,Y=y) Compute from online abstracts based on co-occurrence 11 Probability Computation  Assume,     File X has n gene names xi, i ∈{1,…,n} File Y has m gene names yj, j ∈{1,…,m} M(i,j) is the number of times (xi,yj) appears together in transactions (article abstracts) Then,  p(xi,yj)=M(i,j)/{∑i∑jM(i,j)} 12 Expectation Computation  Now define,    Expectation of Z is,     Z=g(X,Y), where g: X x Y →[0,∞) Clearly, Z is a random variable E(Z)=E(g(X,Y))=∑i∑j (g(xi,yj)M(i,j)/Mt) Where, Mt=∑i∑jM(i,j) Expected value of Z can directly be used as a similarity measure Different choices of g, give rise to different similarity measures 13 Some Choices of function g  First Choice,    Choose g=M(i,j) This choice leads to similarity measure, se1= ∑i∑j M(i,j)2 /Mt Second Choice,    Choose g=tot_length(xi,yj), where tot_length (xi,yj) is the sum of transaction lengths where (xi,yj) co-occur The idea is longer the transaction length, higher the chance of having related linking key words This choice leads to similarity measure, se2= ∑i∑j tot_length(xi,yj)*M(i,j) /Mt 14 Extending the notion towards gene ranking  Extend to rank genes from a list Y     Most similar to the genes from list X Here, instead of Y as a random variable, for each yj ∈Y, consider Uj as a random variable taking value only yj Find the similarity measure between X and Uj for all j∈{1,…,m} Sort the genes from list Y according to decreasing similarity measure 15 Datasets Used two sets of 21 and 31 genes    A standard dictionary, as reported in literature, containing 300 genes was used   These genes are differentially expressed between prostate epithelial and stromal cells in prostate cancer patients Dr Gail Frazer’s lab, Kent State University These genes were significantly up or down regulated in tumor and adjacent normal tissues when compared with a normal donor tissue Each literature abstract was represented in a bag of word format containing words,  where each word comes from a dataset or the dictionary or is a GO term 16 Results: Hypergraph Mining  Results show the linking GO terms and linking genes from the dictionary for 21 and 31 dataset obtained by hypergraph mining 17 Results: Similarity Measures  4 sets of 300 genes each ,- A,B,C,D were formed     The task is to identify which of A,B,C,D is most similar to the 21 or 31 dataset As one would expect, A is most similar to the 21 dataset as shown below   A is the dictionary of 300 genes as mentioned before B,C,D were randomly chosen from superarray’s DNA micro-array experiments It also shows that some naïve similarity measure, such as s1, fails to capture this Sometimes, this tool discovers some interesting result,  For 31 dataset, randomly chosen list C was most similar This has been justified by checking the functionalities of top ranked genes from list C 18 Results: Ranking   Results of the ranked genes from the most similar list to either 21 or 31 data set Linking words from hypergraph mining were also found within top 20 genes 19 Summary     Biological Literature is large and complex Need data mining tools to summarize interesting patterns Proposed hypergraph mining and similarity metrics Initial results are promising 20

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download BIBE06_kaushik - Ohio State Computer Science and Engineering