ClueGene: An Online Search Engine for Querying Gene Regulation
David M. Ng
2008 January 16
System Overview
• Every operation generates a “working set” that can be modified and used as the query in the next search iteration
• Common structure for all search and test operations with no dead ends
New Features
• Coexpression test
• Dataset ranking and heat map
• Heat map for expression data
Coexpression Test
• Coexpression search is performed using half of the working set, selected at random
• AUC is computed based on finding the held-out half of the working set
• Coexpression test score is the average of ten such searches
• Test score is displayed as a “thermometer,” in the context of representative pathways whose scores are computed the same way
• Precision-recall curves are also displayed
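The coexpression test described on this slide can be sketched as follows. This is a minimal illustration, not ClueGene's actual code: the `search` callable (which takes query genes and returns all candidate genes ranked best-first) is a hypothetical stand-in for the coexpression search, and dropping the query genes from the ranking before scoring is an assumption.

```python
import random

def auc_for_held_out(ranking, held_out):
    """AUC for finding `held_out` genes in `ranking` (best-first).

    Equals the fraction of (held-out, other) gene pairs in which the
    held-out gene is ranked above the other gene.
    """
    positives = sum(1 for gene in ranking if gene in held_out)
    negatives = len(ranking) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("ranking needs both held-out and other genes")
    correct_pairs = 0
    negatives_seen = 0
    for gene in reversed(ranking):           # walk from worst to best
        if gene in held_out:
            correct_pairs += negatives_seen  # beats every non-held-out gene below it
        else:
            negatives_seen += 1
    return correct_pairs / (positives * negatives)

def coexpression_test(working_set, search, trials=10, seed=None):
    """Average AUC over `trials` random half/half splits of the working set."""
    genes = sorted(working_set)
    if len(genes) < 2:
        raise ValueError("working set needs at least two genes")
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        rng.shuffle(genes)
        half = len(genes) // 2
        query, held_out = set(genes[:half]), set(genes[half:])
        # Assumption: query genes are removed so only held-out genes count as hits.
        ranking = [g for g in search(sorted(query)) if g not in query]
        scores.append(auc_for_held_out(ranking, held_out))
    return sum(scores) / len(scores)
```

A score near 1.0 means the held-out half is consistently ranked at the top; a score near 0.5 is what a random ranking would give.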
Dataset Ranking and Heat Map
• Datasets are ranked by their contribution to the scores of the working set genes
• Ranking is displayed as a heat map
• Future work: allow user to provide dataset feedback
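One way to picture the ranking step is the sketch below. The slides do not say how a dataset's contribution is computed, so this assumes a hypothetical `contributions` table (dataset name, then gene id, then that dataset's share of the gene's score) is already available, and it simply sums those shares over the working set.

```python
def rank_datasets(contributions, working_set):
    """Rank datasets by total contribution to the working set genes' scores.

    `contributions[dataset][gene]` is an assumed, precomputed per-dataset
    share of a gene's score; genes missing from a dataset contribute 0.
    Returns (dataset, total) pairs, largest total first.
    """
    totals = {
        dataset: sum(per_gene.get(gene, 0.0) for gene in working_set)
        for dataset, per_gene in contributions.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical dataset names and values, for illustration only.
example = {
    "datasetA": {"g1": 1.0, "g2": 0.5},
    "datasetB": {"g1": 2.0},
    "datasetC": {"g3": 5.0},
}
ranked = rank_datasets(example, {"g1", "g2"})
```

The per-gene values behind each total are what a heat map of datasets against working set genes would display.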
Expression Data Heat Map
• Displays the expression data for a dataset
• For the following genes
– Result genes
– Query genes
– Contrast genes
• Randomly selected non-query and non-result genes
• Same number as the number of result genes
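The contrast gene selection can be sketched in a few lines. The function name and the `all_genes` argument are illustrative assumptions, as is capping the count when too few candidates remain; the slides only state that contrast genes are random, exclude query and result genes, and match the result genes in number.

```python
import random

def pick_contrast_genes(all_genes, query_genes, result_genes, seed=None):
    """Randomly pick as many non-query, non-result genes as there are result genes."""
    excluded = set(query_genes) | set(result_genes)
    candidates = sorted(g for g in set(all_genes) if g not in excluded)
    # Assumption: if too few candidates remain, return as many as exist.
    count = min(len(set(result_genes)), len(candidates))
    return random.Random(seed).sample(candidates, count)
```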
Expression Data Heat Map Script
• Generate a heat map as a Web page for specified query, result, and contrast genes for a given dataset.
• Usage:
– Invoke as a URL: http://sysbio.soe.ucsc.edu/cgibin/ClueGeneProd/cluegene_heatmap.pl
– Specify parameters following a ?
– Parameters are name-value pairs separated by ampersands
Expression Data Heat Map Script Parameters
• Parameters
– species=<species code>
– ds=<dataset name>
– transactionId=<transaction id>
– <result gene id>=resultGene
– <query gene id>=queryGene
– <contrast gene id>=contrastGene
Expression Data Heat Map Example
• http://sysbio.soe.ucsc.edu/cgibin/ClueGeneProd/cluegene_heatmap.pl?ds=Segal03&species=sce&transactionId=1200474871417.4&YJR123W=resultGene&YLR340W=resultGene&YNL301C=resultGene&YJR123W=queryGene&YLR340W=queryGene&YBL072C=queryGene&YNL232W=contrastGene&YDL175C=contrastGene&YDL104C=contrastGene
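A URL like the example above could be assembled programmatically along these lines. This is a sketch: the helper name `heatmap_url` is hypothetical, and the base URL is copied exactly as it appears in the slides. Gene ids are the parameter names and the role is the value, so a gene that is both a query and a result gene appears twice; a list of pairs (rather than a dict) preserves those repeats.

```python
from urllib.parse import urlencode

# Base URL as given in the slides.
BASE_URL = "http://sysbio.soe.ucsc.edu/cgibin/ClueGeneProd/cluegene_heatmap.pl"

def heatmap_url(species, dataset, transaction_id,
                result_genes, query_genes, contrast_genes):
    """Build a heat map script URL from name-value pairs joined by ampersands."""
    params = [("ds", dataset), ("species", species),
              ("transactionId", transaction_id)]
    params += [(gene, "resultGene") for gene in result_genes]
    params += [(gene, "queryGene") for gene in query_genes]
    params += [(gene, "contrastGene") for gene in contrast_genes]
    return BASE_URL + "?" + urlencode(params)

url = heatmap_url(
    "sce", "Segal03", "1200474871417.4",
    result_genes=["YJR123W", "YLR340W", "YNL301C"],
    query_genes=["YJR123W", "YLR340W", "YBL072C"],
    contrast_genes=["YNL232W", "YDL175C", "YDL104C"],
)
```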
Invoking ClueGene via URL
• ClueGene provides a GET interface
Future Work
• Dataset selection
• Reimplement
• Set-based user model
Reimplement ClueGene
• Current ClueGene
– 10,000+ lines of Perl in 20 files
– 800+ lines of HTML and JavaScript
• Hard to maintain
• Old CGI technology
Set-Based User Model
• Generalization of Greg’s Gene Sets and Gene Set Families
– Set members can be atomic or sets
– Set members have attributes
• Intrinsic to the element
• Dependent on the set under consideration
• Issue: combining duplicate attributes
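A data structure along the following lines could express this model. It is only a sketch under assumptions: the class names `Gene` and `GeneSet`, the `add` and `lookup` methods, and the rule that a set-specific attribute shadows an intrinsic one of the same name are all illustrative choices, not part of the proposal in the slides.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Union

@dataclass
class Gene:
    """Atomic member; its attributes are intrinsic (the same in every set)."""
    uid: str
    attributes: Dict[str, Any] = field(default_factory=dict)

@dataclass
class GeneSet:
    """A set whose members may be atoms or other sets."""
    uid: str
    attributes: Dict[str, Any] = field(default_factory=dict)  # intrinsic to this set
    members: Dict[str, Union[Gene, "GeneSet"]] = field(default_factory=dict)
    # Attributes that depend on the set under consideration, keyed by member uid.
    per_member: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def add(self, member, **set_specific):
        self.members[member.uid] = member
        self.per_member[member.uid] = dict(set_specific)

    def lookup(self, uid, key):
        """Set-specific value if present, else the member's intrinsic value."""
        if key in self.per_member[uid]:
            return self.per_member[uid][key]
        return self.members[uid].attributes[key]

# Illustrative values only.
gene = Gene("YJR123W", {"display_name": "example gene"})
result = GeneSet("result")
result.add(gene, rank=1, score=0.9)   # rank and score belong to this set only
cluster = GeneSet("cluster1")
cluster.add(gene)
dataset = GeneSet("Segal03")
dataset.add(cluster)                  # a set can be a member of another set
```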
Benefits of Set Model
• A single, consistent model for all aspects of gene search engines
– Easier understanding of inputs, operations, and results
– More straightforward user interface implementation
– More general manipulation of sets supports
• saving/loading of sets
• combining result sets via set operations such as intersection and union
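Combining result sets might look like the sketch below, which represents each result set as a plain mapping from gene id to score. When a gene appears in both sets its score must be combined somehow; the `combine` argument (defaulting to `max`) is an assumed policy chosen for illustration.

```python
def union(a, b, combine=max):
    """Genes in either result set; shared genes get combine(score_a, score_b)."""
    merged = dict(a)
    for gene, score in b.items():
        merged[gene] = combine(merged[gene], score) if gene in merged else score
    return merged

def intersection(a, b, combine=max):
    """Genes in both result sets, with combined scores."""
    return {gene: combine(a[gene], b[gene]) for gene in a.keys() & b.keys()}

# Illustrative scores only.
first = {"YJR123W": 0.9, "YLR340W": 0.4}
second = {"YLR340W": 0.7, "YNL301C": 0.2}
either = union(first, second)
both = intersection(first, second, combine=min)
```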
ClueGene Sets
• Gene: atom
– Attributes such as unique id, display name, aliases
• Cluster: set of genes
• Dataset: set of cluster sets
• Cluster compendium: set of dataset sets
• Query set: set of genes
• Expected set: set of genes
ClueGene Query
• Inputs
– Cluster compendium set
– Query set
• Output
– Set of all genes in the genome
• Set-specific attributes for rank and score
• Computing AUC
– Additional input: expected set
– Result AUC: attribute of result set
Other Operations
• Known and Novel Motif Search
– Input: Working set
– Output: Set of {set for each result motif containing the genes with the motif}
• GO Category Search
– Input: Working set
– Output: Set of {set for each result GO category containing the genes in the category}
Clustering
• Expression data: set of genes
– Set-specific attributes for expression data for each gene
• Clustering
– Input expression data: set of genes with set-specific expression data attributes
– Output dataset: set of cluster sets
– Issue: handling operations that take a really long time