* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download BIBE06_kaushik - Ohio State Computer Science and Engineering
Copy-number variation wikipedia , lookup
Cancer epigenetics wikipedia , lookup
Heritability of IQ wikipedia , lookup
Genetically modified crops wikipedia , lookup
Transposable element wikipedia , lookup
Long non-coding RNA wikipedia , lookup
Epigenetics in learning and memory wikipedia , lookup
X-inactivation wikipedia , lookup
Gene therapy wikipedia , lookup
Epigenetics of diabetes Type 2 wikipedia , lookup
Genetic engineering wikipedia , lookup
Vectors in gene therapy wikipedia , lookup
Oncogenomics wikipedia , lookup
Epigenetics of neurodegenerative diseases wikipedia , lookup
Public health genomics wikipedia , lookup
Gene nomenclature wikipedia , lookup
Pathogenomics wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Quantitative trait locus wikipedia , lookup
Gene desert wikipedia , lookup
Polycomb Group Proteins and Cancer wikipedia , lookup
Essential gene wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Nutriepigenomics wikipedia , lookup
History of genetic engineering wikipedia , lookup
Genome evolution wikipedia , lookup
Genomic imprinting wikipedia , lookup
Gene expression programming wikipedia , lookup
Ridge (biology) wikipedia , lookup
Microevolution wikipedia , lookup
Minimal genome wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Genome (book) wikipedia , lookup
Designer baby wikipedia , lookup
Exploratory Tools for Follow-up Studies to Microarray Experiments Kaushik Sinha Ruoming Jin Gagan Agrawal Helen Piontkivska Ohio State and Kent State Universities 1 Overall Motivation Biological literature is vast Need tools to find interesting patterns from literature Specific Example Identify genes from DNA microarray and other gene and protein assays Next step What is known about these genes? How are these genes related to each other or other genes identified in similar studies? Which other genes are most similar 2 Outline Hypergraph Mining Similarity Measures Evaluation and Observations 3 Hypergraph Mining: Motivating Example Micro array experiment - suspects that a small set of genes are related to a disease Confirm by searching existing literature - expect related genes to appear together in literature However, suppose Gene A and C are related and both of them are weakly related to another term B In literature, one would expect A,C appear together OR/AND A,B appear together B,C appear together How do we efficiently conclude that A,C are actually related? 4 Hypergraph Mining Basic Motivation Example (Gene-Disease Relationship) Gene A is related to a term B Term B is related to a gene C Is Gene A related to Gene C ? Gene Source To find useful “Transitive Relation” (hyperedges) among genes Microarray Experiments Information Source Online Literature abstracts 5 Formal Problem Definition Given A dictionary KT A set KM of user provided keywords (KTכKM) Collection of literature abstracts - each abstract is represented as a set of words from dictionary Task To find hyperedges exceeding user defined threshold, each of which involves a set of key words from KM and are potentially connected by another set of linking words from KT-KM 6 Relationship to Work on Frequent Pattern Mining Frequent itemset mining Can represent each document abstract as a transaction with several keywords Find sets of keywords that appear together and often Cannot capture cross relationships Differences How do we define support ? How do we prune search space 7 Solution Approach Define total weight=support + cross support Support: set of keywords appear together in one document Cross support: set of keywords can be partitioned each partition appears in different document Common linking words Issues Since downclosure property does not hold for total weight modified downclosure property can be defined 8 Idea Support satisfies downclosure property Cross support can be designed to be restricted below a particular value, i.e., it is bounded Form a function h as addition of two functions h=f+g f satisfies downclosure property g is bounded h satisfies modified down closure property Let X be a set, Ω be its power set. A function f : Ω →R+ satisfies downclosure property if for all A,B ∈ Ω , A כB ,f(B)>f(A) For any θ≥0, if h(Kn) ≥θ then f(Kn-1) ≥ max{0,(θ-sup(g))} This property can be used to devise efficient algorithm 9 Outline Hypergraph Mining Similarity Measures Evaluation and Observations 10 Similarity Measure among Sets of Genes Given two list of gene names Need to find most similar genes, based on literature abstract occurrences Standard statistics approach Each file containing gene names can be considered as a Discrete Random Variable (DRV) Each such DRV can take several values (gene names) For two such files X,Y and for any pair (x,y), joint probability mass function p(x,y)=P(X=x,Y=y) Compute from online abstracts based on co-occurrence 11 Probability Computation Assume, File X has n gene names xi, i ∈{1,…,n} File Y has m gene names yj, j ∈{1,…,m} M(i,j) is the number of times (xi,yj) appears together in transactions (article abstracts) Then, p(xi,yj)=M(i,j)/{∑i∑jM(i,j)} 12 Expectation Computation Now define, Expectation of Z is, Z=g(X,Y), where g: X x Y →[0,∞) Clearly, Z is a random variable E(Z)=E(g(X,Y))=∑i∑j (g(xi,yj)M(i,j)/Mt) Where, Mt=∑i∑jM(i,j) Expected value of Z can directly be used as a similarity measure Different choices of g, give rise to different similarity measures 13 Some Choices of function g First Choice, Choose g=M(i,j) This choice leads to similarity measure, se1= ∑i∑j M(i,j)2 /Mt Second Choice, Choose g=tot_length(xi,yj), where tot_length (xi,yj) is the sum of transaction lengths where (xi,yj) co-occur The idea is longer the transaction length, higher the chance of having related linking key words This choice leads to similarity measure, se2= ∑i∑j tot_length(xi,yj)*M(i,j) /Mt 14 Extending the notion towards gene ranking Extend to rank genes from a list Y Most similar to the genes from list X Here, instead of Y as a random variable, for each yj ∈Y, consider Uj as a random variable taking value only yj Find the similarity measure between X and Uj for all j∈{1,…,m} Sort the genes from list Y according to decreasing similarity measure 15 Datasets Used two sets of 21 and 31 genes A standard dictionary, as reported in literature, containing 300 genes was used These genes are differentially expressed between prostate epithelial and stromal cells in prostate cancer patients Dr Gail Frazer’s lab, Kent State University These genes were significantly up or down regulated in tumor and adjacent normal tissues when compared with a normal donor tissue Each literature abstract was represented in a bag of word format containing words, where each word comes from a dataset or the dictionary or is a GO term 16 Results: Hypergraph Mining Results show the linking GO terms and linking genes from the dictionary for 21 and 31 dataset obtained by hypergraph mining 17 Results: Similarity Measures 4 sets of 300 genes each ,- A,B,C,D were formed The task is to identify which of A,B,C,D is most similar to the 21 or 31 dataset As one would expect, A is most similar to the 21 dataset as shown below A is the dictionary of 300 genes as mentioned before B,C,D were randomly chosen from superarray’s DNA micro-array experiments It also shows that some naïve similarity measure, such as s1, fails to capture this Sometimes, this tool discovers some interesting result, For 31 dataset, randomly chosen list C was most similar This has been justified by checking the functionalities of top ranked genes from list C 18 Results: Ranking Results of the ranked genes from the most similar list to either 21 or 31 data set Linking words from hypergraph mining were also found within top 20 genes 19 Summary Biological Literature is large and complex Need data mining tools to summarize interesting patterns Proposed hypergraph mining and similarity metrics Initial results are promising 20