Download BIBE06_kaushik - Ohio State Computer Science and Engineering

Exploratory Tools for Follow-up Studies
to Microarray Experiments
Kaushik Sinha
Ruoming Jin
Gagan Agrawal
Helen Piontkivska
Ohio State and Kent State Universities
1
Overall Motivation



Biological literature is vast
Need tools to find interesting patterns from
literature
Specific Example


Identify genes from DNA microarray and other
gene and protein assays
Next step



What is known about these genes?
How are these genes related to each other or other
genes identified in similar studies?
Which other genes are most similar?
2
Outline



Hypergraph Mining
Similarity Measures
Evaluation and Observations
3
Hypergraph Mining: Motivating
Example




A microarray experiment suggests that a small set of genes is related to a disease
Confirm by searching existing literature: related genes are expected to appear together in the literature
However, suppose genes A and C are related and both of them are weakly related to another term B
In the literature, one would expect
  A,C appear together, AND/OR
  A,B appear together and B,C appear together
How do we efficiently conclude that A,C are actually related?
4
Hypergraph Mining

Basic Motivation
  To find useful "Transitive Relations" (hyperedges) among genes
Example (Gene-Disease Relationship)
  Gene A is related to a term B
  Term B is related to a gene C
  Is Gene A related to Gene C?
Gene Source
  Microarray Experiments
Information Source
  Online literature abstracts
5
Formal Problem Definition

Given
  A dictionary KT
  A set KM of user-provided keywords (KT ⊃ KM)
  A collection of literature abstracts, each represented as a set of words from the dictionary
Task
  To find hyperedges exceeding a user-defined threshold, each of which involves a set of keywords from KM that are potentially connected by another set of linking words from KT−KM
6
Relationship to Work on Frequent
Pattern Mining

Frequent itemset mining
  Can represent each document abstract as a transaction with several keywords
  Find sets of keywords that frequently appear together
  Cannot capture cross relationships
Differences
  How do we define support?
  How do we prune the search space?
7
Solution Approach

Define
  total weight = support + cross support
  Support: the set of keywords appears together in one document
  Cross support: the set of keywords can be partitioned so that
    each partition appears in a different document
    the documents share common linking words
Issues
  The downclosure property does not hold for total weight, but a modified downclosure property can be defined
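The definitions on this slide can be sketched in code. This is a minimal illustration, not the authors' implementation: the slide does not say exactly how cross support is counted, so the sketch assumes that a linking word counts once if the keyword set can be split into two non-empty parts that occur in two different abstracts, both of which contain that linking word. The function names and the toy abstracts are hypothetical.

```python
from itertools import combinations

def support(keywords, abstracts):
    """Number of abstracts that contain every keyword in the set."""
    ks = set(keywords)
    return sum(1 for doc in abstracts if ks <= doc)

def _bridges(ks, abstracts, doc_ids):
    """True if some two-way split of ks occurs in two different abstracts."""
    for r in range(1, len(ks)):
        for part in combinations(ks, r):
            left = set(part)
            right = set(ks) - left
            left_docs = {i for i in doc_ids if left <= abstracts[i]}
            right_docs = {i for i in doc_ids if right <= abstracts[i]}
            # The two parts must come from two *different* abstracts.
            if any(a != b for a in left_docs for b in right_docs):
                return True
    return False

def cross_support(keywords, abstracts, linking_words):
    """Assumed definition: count the linking words that bridge a split."""
    ks = sorted(set(keywords))
    count = 0
    for w in linking_words:
        docs_with_w = [i for i, doc in enumerate(abstracts) if w in doc]
        if _bridges(ks, abstracts, docs_with_w):
            count += 1
    return count

def total_weight(keywords, abstracts, linking_words):
    """total weight = support + cross support."""
    return (support(keywords, abstracts)
            + cross_support(keywords, abstracts, linking_words))

# Toy data: each abstract is a set of words; B is the only linking word.
abstracts = [{"A", "B"}, {"B", "C"}, {"A", "C"}, {"C", "D"}]
linking_words = {"B"}
print(total_weight({"A", "C"}, abstracts, linking_words))
```

In the toy data, A and C co-occur directly in one abstract and are also bridged by B across two other abstracts, so both terms contribute to the total weight.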
8
Idea

Support satisfies the downclosure property
Cross support can be designed to be restricted below a particular value, i.e., it is bounded
Form a function h as the sum of two functions, h = f + g
  f satisfies the downclosure property
  g is bounded
  h satisfies a modified downclosure property
Let X be a set and Ω be its power set. A function f : Ω → R+ satisfies the downclosure property if for all A, B ∈ Ω with A ⊃ B, f(B) ≥ f(A)
For any θ ≥ 0, if h(K_n) ≥ θ then f(K_{n-1}) ≥ max{0, θ − sup(g)}
This property can be used to devise an efficient algorithm
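One way the modified downclosure property could drive pruning is a level-wise search in the style of Apriori, sketched below. This is an assumption about how the property might be used, not the authors' algorithm. It reads K_{n-1} as any (n-1)-element subset of K_n, keeps a candidate only if f reaches the relaxed threshold max{0, θ − sup(g)}, and reports it only if h = f + g reaches θ. The function `mine`, its parameters, and the toy f and g are hypothetical.

```python
from itertools import combinations

def mine(items, f, g, g_max, theta, max_size=3):
    """Level-wise search using the relaxed threshold max(0, theta - g_max).

    f must satisfy the downclosure property; g must be bounded above by g_max.
    Returns the sets of size >= 2 whose total weight f + g reaches theta.
    """
    relaxed = max(0.0, theta - g_max)
    survivors = [frozenset([x]) for x in sorted(items)
                 if f(frozenset([x])) >= relaxed]
    results = []
    size = 1
    while survivors and size < max_size:
        size += 1
        prev = set(survivors)
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == size}
        survivors = []
        for c in sorted(candidates, key=sorted):
            # Every (size-1)-subset must have passed the relaxed threshold.
            if any(frozenset(s) not in prev
                   for s in combinations(c, size - 1)):
                continue
            if f(c) < relaxed:
                continue
            survivors.append(c)
            if f(c) + g(c) >= theta:
                results.append(c)
    return results

# Toy data: f is plain support, g is a hand-made bounded cross support.
abstracts = [{"A", "B"}, {"B", "C"}, {"A", "C"}, {"A", "C"}, {"C", "D"}]

def f(keys):
    return sum(1 for doc in abstracts if keys <= doc)

def g(keys):
    return 1 if keys == frozenset({"A", "C"}) else 0

found = mine({"A", "B", "C", "D"}, f, g, g_max=1, theta=3)
print(found)
```

In this toy run the pair {A, C} has support 2, so a plain frequent-itemset search with threshold 3 would discard it, while the relaxed threshold keeps it long enough for the cross-support term to be added.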
9
Outline



Hypergraph Mining
Similarity Measures
Evaluation and Observations
10
Similarity Measure among Sets of
Genes



Given two lists of gene names
Need to find most similar genes, based on literature
abstract occurrences
Standard statistics approach





Each file containing gene names can be considered as a
Discrete Random Variable (DRV)
Each such DRV can take several values (gene names)
For two such files X,Y and for any pair (x,y),
joint probability mass function p(x,y)=P(X=x,Y=y)
Compute from online abstracts based on co-occurrence
11
Probability Computation

Assume,
  File X has n gene names x_i, i ∈ {1,…,n}
  File Y has m gene names y_j, j ∈ {1,…,m}
  M(i,j) is the number of times (x_i, y_j) appear together in transactions (article abstracts)
Then,
  p(x_i, y_j) = M(i,j) / ∑_i ∑_j M(i,j)
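The probability computation can be illustrated with a small sketch. It assumes each abstract is available as a set of words, and the helper names and toy gene names are hypothetical.

```python
def cooccurrence_matrix(xs, ys, abstracts):
    """M[i][j] = number of abstracts containing both xs[i] and ys[j]."""
    return [[sum(1 for doc in abstracts if x in doc and y in doc)
             for y in ys]
            for x in xs]

def joint_pmf(M):
    """p(x_i, y_j) = M(i,j) divided by the sum of all entries of M."""
    total = sum(sum(row) for row in M)
    if total == 0:
        raise ValueError("the two gene lists never co-occur")
    return [[m / total for m in row] for row in M]

# Toy data: two gene lists and four abstracts (bags of words).
xs = ["g1", "g2"]
ys = ["h1", "h2"]
abstracts = [{"g1", "h1"}, {"g1", "h1", "h2"}, {"g2", "h2"}, {"g2"}]

M = cooccurrence_matrix(xs, ys, abstracts)
p = joint_pmf(M)
print(M)
print(p)
```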
12
Expectation Computation

Now define,
  Z = g(X,Y), where g: X × Y → [0,∞)
  Clearly, Z is a random variable
Expectation of Z is,
  E(Z) = E(g(X,Y)) = ∑_i ∑_j g(x_i, y_j) M(i,j) / M_t
  where M_t = ∑_i ∑_j M(i,j)
Expected value of Z can directly be used as a similarity measure
Different choices of g give rise to different similarity measures
13
Some Choices of function g

First Choice,
  Choose g(x_i, y_j) = M(i,j)
  This choice leads to the similarity measure
  se1 = ∑_i ∑_j M(i,j)² / M_t
Second Choice,
  Choose g(x_i, y_j) = tot_length(x_i, y_j), where tot_length(x_i, y_j) is the sum of the lengths of the transactions in which (x_i, y_j) co-occur
  The idea is that the longer the transaction, the higher the chance of it containing related linking keywords
  This choice leads to the similarity measure
  se2 = ∑_i ∑_j tot_length(x_i, y_j) · M(i,j) / M_t
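Both choices of g can be computed with one generic expectation routine, as in the sketch below. It assumes abstracts are sets of words and that a transaction's length is its number of distinct words; returning 0 when the lists never co-occur is also an assumption. The helper names are hypothetical.

```python
def pair_stats(x, y, abstracts):
    """Return (M, tot_length) for one gene pair (x, y)."""
    lengths = [len(doc) for doc in abstracts if x in doc and y in doc]
    return len(lengths), sum(lengths)

def expected_similarity(xs, ys, abstracts, g):
    """E[g(X, Y)] under p(x_i, y_j) = M(i,j) / M_t."""
    stats = {(x, y): pair_stats(x, y, abstracts) for x in xs for y in ys}
    m_t = sum(m for m, _ in stats.values())
    if m_t == 0:
        return 0.0  # assumption: no co-occurrence means zero similarity
    return sum(g(m, tl) * m for m, tl in stats.values()) / m_t

def se1(xs, ys, abstracts):
    """First choice: g = M(i,j)."""
    return expected_similarity(xs, ys, abstracts, lambda m, tl: m)

def se2(xs, ys, abstracts):
    """Second choice: g = tot_length(x_i, y_j)."""
    return expected_similarity(xs, ys, abstracts, lambda m, tl: tl)

# Toy data: two gene lists and four abstracts (bags of words).
xs = ["g1", "g2"]
ys = ["h1", "h2"]
abstracts = [{"g1", "h1"}, {"g1", "h1", "h2"}, {"g2", "h2"}, {"g2"}]

print(se1(xs, ys, abstracts))  # (2² + 1² + 0² + 1²) / 4 = 1.5
print(se2(xs, ys, abstracts))  # (5·2 + 3·1 + 0·0 + 2·1) / 4 = 3.75
```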
14
Extending the notion towards
gene ranking

Extend to rank the genes from a list Y by how similar they are to the genes from list X
Here, instead of Y as a random variable, for each y_j ∈ Y, consider U_j as a random variable taking only the value y_j
Find the similarity measure between X and U_j for all j ∈ {1,…,m}
Sort the genes from list Y in order of decreasing similarity measure
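The ranking step can be sketched as follows, using the first similarity measure (g equal to the co-occurrence count) restricted to one gene of Y at a time. This is an illustration under assumptions: abstracts are sets of words, a gene that never co-occurs with list X gets score 0, and the function name and toy data are hypothetical.

```python
def rank_genes(xs, ys, abstracts):
    """Rank each y in ys by se1 between list X and the one-value variable U_j."""
    scored = []
    for y in ys:
        # Column of the co-occurrence matrix that belongs to this gene.
        column = [sum(1 for doc in abstracts if x in doc and y in doc)
                  for x in xs]
        total = sum(column)
        # Assumption: a gene that never co-occurs with X scores 0.
        score = sum(m * m for m in column) / total if total else 0.0
        scored.append((y, score))
    # Decreasing similarity; ties keep their input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy data: rank three candidate genes against the list X.
xs = ["g1", "g2"]
ys = ["h3", "h2", "h1"]
abstracts = [{"g1", "h1"}, {"g1", "h1", "h2"}, {"g2", "h2"}, {"g2"}]

ranking = rank_genes(xs, ys, abstracts)
print(ranking)
```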
15
Datasets
Used two sets of 21 and 31 genes



A standard dictionary, as reported in literature, containing
300 genes was used


These genes are differentially expressed between prostate
epithelial and stromal cells in prostate cancer patients
Dr Gail Frazer’s lab, Kent State University
These genes were significantly up or down regulated in tumor and
adjacent normal tissues when compared with a normal donor
tissue
Each literature abstract was represented in a bag-of-words format, where each word comes from a dataset, comes from the dictionary, or is a GO term
16
Results: Hypergraph Mining

Results show the linking GO terms and linking genes from the dictionary for the 21 and 31 datasets, obtained by hypergraph mining
17
Results: Similarity Measures

Four sets of 300 genes each (A, B, C, D) were formed
  A is the dictionary of 300 genes as mentioned before
  B, C, D were randomly chosen from SuperArray's DNA microarray experiments
The task is to identify which of A, B, C, D is most similar to the 21 or 31 dataset
As one would expect, A is most similar to the 21 dataset as shown below
It also shows that some naïve similarity measures, such as s1, fail to capture this
Sometimes, this tool discovers interesting results
  For the 31 dataset, the randomly chosen list C was most similar
  This has been justified by checking the functionalities of the top-ranked genes from list C
18
Results: Ranking


Ranked genes from the list most similar to the 21 or 31 dataset
Linking words from hypergraph mining were also found within the top 20 genes
19
Summary




Biological Literature is large and
complex
Need data mining tools to summarize
interesting patterns
Proposed hypergraph mining and
similarity metrics
Initial results are promising
20