Advanced Gene Selection Algorithms
Designed for Microarray Datasets
• Limitations of current feature selection methods:
– They ignore gene-gene interactions; this applies to both
single-gene discriminative scores and correlation
(redundancy) based algorithms
• Virtual Gene Algorithm
– Using correlations between genes
• Gene Ontology Based Gene Selection
– Integrating domain knowledge
• Boost Selection
– Feature selection based on bootstraps
Virtual Gene Algorithm
• Gene-to-gene correlations are generally
ignored in feature selection algorithms. In
this work, we examine exploiting such
correlations, rather than ignoring them, for
the purpose of gene selection.
• Motivating examples from both synthetic and
real datasets are shown on the next two
slides.
Virtual Gene: Motivating Example
Virtual Gene: Motivating Example (continued)
Virtual Gene Algorithm
• The expression level of any single gene
does not capture the class label distinction.
• However, a combination of the expression
levels of two genes captures the class label
distinction well.
• Virtual gene: a linear combination of
genes.
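The point above can be illustrated with a small synthetic sketch. Everything in it is an assumption made for illustration: the data are simulated, the linear combination is simply the difference of the two genes, and `fisher_score` is a generic two-class separation score, not necessarily the discriminative score used in this work.

```python
import numpy as np

def fisher_score(x, y):
    """Two-class score: squared gap between class means over summed class variances."""
    a, b = x[y == 0], x[y == 1]
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var() + 1e-12)

rng = np.random.default_rng(0)
n_per_class = 200
y = np.repeat([0, 1], n_per_class)

# Both simulated genes share a strong common signal; only their
# difference depends on the class label.
shared = rng.normal(0.0, 3.0, size=2 * n_per_class)
offset = np.where(y == 0, -0.5, 0.5)
g1 = shared + offset + rng.normal(0.0, 0.2, size=shared.size)
g2 = shared - offset + rng.normal(0.0, 0.2, size=shared.size)

# One particular linear combination of the two genes (assumed: g1 - g2).
virtual = g1 - g2

score_g1 = fisher_score(g1, y)
score_g2 = fisher_score(g2, y)
score_virtual = fisher_score(virtual, y)
print(score_g1, score_g2, score_virtual)
```

In this simulation each gene on its own is dominated by the shared signal, while the difference cancels it, so the combined score is far larger than either single-gene score.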
Virtual gene definitions
Systematically examining all possible virtual genes
• There are 2^n possible virtual genes that can be
constructed from a set of n genes.
• Pairwise virtual genes are virtual genes whose
constituent gene set is limited to size 2.
This reduces computation enormously.
• Clustering is further used to reduce the number
of gene pairs to be considered. A clustering
algorithm identifies genes that potentially
interact or share similar functions.
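To give a feel for the size of the reduction, the arithmetic can be sketched as follows. The gene count (2000) and the 20 equally sized clusters are hypothetical values chosen for illustration, not figures from the experiments.

```python
from math import comb

def candidate_counts(n_genes, cluster_sizes):
    """Count candidate virtual genes under three search strategies."""
    all_subsets = 2 ** n_genes                      # every subset of the gene set
    all_pairs = comb(n_genes, 2)                    # pairwise virtual genes
    within_cluster_pairs = sum(comb(s, 2) for s in cluster_sizes)
    return all_subsets, all_pairs, within_cluster_pairs

# Hypothetical setting: 2000 genes split into 20 clusters of 100 genes each.
subsets, pairs, clustered = candidate_counts(2000, [100] * 20)
print(len(str(subsets)), pairs, clustered)
```

Under these assumed sizes, restricting to pairs brings the count down to 1,999,000, and restricting further to pairs within a cluster brings it down to 99,000.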
Pairwise virtual gene algorithm
• Our experiments show that limiting
pairwise virtual gene computation to
genes in the same cluster greatly reduces
computational complexity while
preserving classification accuracy.
Pairwise virtual gene algorithm
• The pairwise virtual gene algorithm runs in
three stages:
1. Cluster genes into gene clusters using the k-means algorithm.
2. Compute pairwise virtual genes within
clusters, along with their virtual gene
expressions and their discriminative power.
3. Select the top-ranked virtual gene, then
degrade the discriminative power of the remaining
virtual genes that share constituent genes or a
cluster with it, using α and β (parameters
supplied by the user).
Pairwise virtual gene algorithm
• Parameters of the pairwise virtual gene
algorithm:
α: in the range [0,1], the likelihood that virtual
genes with the same constituent genes are selected
β: in the range [0,1], the likelihood that virtual
genes whose constituent genes come from the same
cluster are selected
k: the number of virtual genes to be selected
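The three stages and their parameters can be sketched roughly as below. This is a minimal sketch under several assumptions that the slides do not confirm: the virtual gene of a pair is taken to be the difference of the two expression profiles, the discriminative power is a generic Fisher-style score, and "degrading" is read as multiplying the scores of the remaining candidates by α (shared constituent gene) or β (same cluster). The helper names (`kmeans_labels`, `fisher_score`, `select_pairwise_virtual_genes`) are hypothetical.

```python
import numpy as np
from itertools import combinations

def kmeans_labels(X, n_clusters, n_iter=20, seed=0):
    """Minimal Lloyd's k-means over the rows of X (one row per gene)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):  # keep the old center if a cluster empties
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def fisher_score(x, y):
    """Assumed discriminative power: squared mean gap over summed variances."""
    a, b = x[y == 0], x[y == 1]
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var() + 1e-12)

def select_pairwise_virtual_genes(expr, y, n_clusters, k, alpha, beta):
    """expr: genes x samples matrix; y: 0/1 class label per sample."""
    # Stage 1: cluster the genes.
    labels = kmeans_labels(expr, n_clusters)
    # Stage 2: score every within-cluster pair (assumed combination: g_i - g_j).
    scores = {}
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        for i, j in combinations(members, 2):
            scores[(int(i), int(j))] = fisher_score(expr[i] - expr[j], y)
    # Stage 3: greedy selection, degrading related candidates after each pick.
    selected = []
    while scores and len(selected) < k:
        best = max(scores, key=scores.get)
        selected.append(best)
        del scores[best]
        for pair in scores:
            if set(pair) & set(best):
                scores[pair] *= alpha      # shares a constituent gene
            elif labels[pair[0]] == labels[best[0]]:
                scores[pair] *= beta       # comes from the same cluster
    return selected

# Simulated data: 12 genes, 40 samples, two classes.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
expr = rng.normal(size=(12, 40))
selected = select_pairwise_virtual_genes(expr, y, n_clusters=2, k=2,
                                         alpha=0.0, beta=0.5)
print(selected)
```

Under this reading, α = 0 prevents a constituent gene from being reused, while α = β = 1 reduces the procedure to plain top-k ranking.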
Experiments: Virtual Gene
• Extensive experiments are performed on three
publicly available datasets: colon cancer,
leukemia and multi-class cancer.
• We briefly discuss the performance on
these datasets and report more detailed results
on the colon cancer dataset.
• Performance is measured by cross-validation;
three classifiers (SVM, KNN, DLD)
are used.
• The performance of four feature subset
selection (FSS) algorithms is compared.
Experiments: Virtual Gene
• Summary of classification performance of
virtual gene algorithm.
Experiments: Virtual Gene
• More detailed results on the colon cancer
dataset
– Study how the choice of number of clusters in
the pairwise virtual gene algorithm affects
classification performance.
– Study how the choice of initial cluster centers
in the pairwise virtual gene algorithm affects
gene selection performance.
Experiments: Virtual Gene, number of clusters
Experiments: Virtual Gene, initial cluster centers
The limits of the pairwise virtual gene algorithm
• A biological process can involve more than
two genes at a time. The pairwise virtual
gene algorithm may be too restrictive in
this sense.
• Our goal is to investigate the relative
expression values of biologically related
genes.
• Using domain knowledge enables us to
do that, to some degree.
Different levels of feature selection
• Single gene based discriminative scores
ignore feature correlations completely.
• Exhaustive search of the power set is too
slow.
• The GO based virtual gene algorithm uses
domain knowledge to decide intelligently
which sets to explore.
More on GO and GO annotation
• Gene Ontology (GO) consists of GO terms, which form
a shared biological vocabulary.
• GO terms are connected by is-a or is-part-of
relationships.
• Combined, GO terms and the relationships between them
form a DAG (directed acyclic graph).
• Genes are annotated with GO terms by GO collaborators.
• Gene annotations are assumed to be transitive in this
thesis: if a gene is annotated by a GO term, it is also
considered to be annotated by all the parent GO terms
of that GO term.
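The transitivity assumption amounts to closing each gene's annotation set under the parent relation of the DAG. A minimal sketch follows, reading "parent" transitively (all ancestors); the term and gene names are placeholders, not real GO identifiers, and the function names are hypothetical.

```python
def ancestors(term, parents, cache=None):
    """All terms reachable from `term` by following parent links in the DAG."""
    if cache is None:
        cache = {}
    if term not in cache:
        found = set()
        for p in parents.get(term, ()):
            found.add(p)
            found |= ancestors(p, parents, cache)
        cache[term] = found
    return cache[term]

def propagate_annotations(direct, parents):
    """Extend each gene's direct annotations with every ancestor term."""
    cache = {}
    full = {}
    for gene, terms in direct.items():
        closed = set(terms)
        for t in terms:
            closed |= ancestors(t, parents, cache)
        full[gene] = closed
    return full

# Hypothetical toy DAG: T4 has two parents (T2, T3), which share parent T1.
parents = {"T4": ["T2", "T3"], "T2": ["T1"], "T3": ["T1"], "T1": []}
direct = {"geneA": {"T4"}, "geneB": {"T2"}}
full = propagate_annotations(direct, parents)
print(full)
```

The cache avoids re-walking shared ancestors, which matters in a DAG because a term can be reached along several paths.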
Domain knowledge in the form of gene ontology annotations
Some definitions
Explanation of Definitions
• The GO distance between genes measures
how close two genes are according to the
information embedded in GO annotations.
• The gene connectivity graph shows the overall
gene affinity.
• We want to examine correlations in gene
expression between tightly related genes.
• Our algorithm is best demonstrated using the
graph on the next slide.
GO based virtual gene algorithm
• First, GO distances between genes are
computed. Genes that are close to each
other are identified by finding cliques in
gene connectivity graph.
• Each small gene clique is used to create a
virtual gene. Virtual genes are then ranked
using single gene based discriminative
scores.
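The two steps above can be sketched as follows. The definitions of GO distance and of the connectivity graph are not reproduced in this transcript, so the sketch substitutes stand-ins that are purely assumptions: Jaccard distance between annotation sets, an edge whenever that distance is at most a threshold, the mean expression of a clique as its virtual gene, and a generic Fisher-style score for ranking. All names and data are hypothetical.

```python
import numpy as np

def jaccard_distance(a, b):
    """Stand-in for GO distance: 1 - |a & b| / |a | b| on annotation sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0

def connectivity_graph(annotations, threshold):
    """Connect two genes when their (assumed) GO distance is small enough."""
    genes = sorted(annotations)
    adj = {g: set() for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if jaccard_distance(annotations[g], annotations[h]) <= threshold:
                adj[g].add(h)
                adj[h].add(g)
    return adj

def maximal_cliques(adj):
    """Basic Bron-Kerbosch enumeration (no pivoting); fine for small graphs."""
    cliques = []
    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    expand(set(), set(adj), set())
    return cliques

def fisher_score(x, y):
    a, b = x[y == 0], x[y == 1]
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var() + 1e-12)

def rank_clique_virtual_genes(expr, y, cliques):
    """Virtual gene per clique = mean expression of its genes (assumed)."""
    scored = []
    for clique in cliques:
        if len(clique) < 2:
            continue  # a single gene is not a combination
        v = np.mean([expr[g] for g in clique], axis=0)
        scored.append((fisher_score(v, y), clique))
    return sorted(scored, reverse=True)

# Toy annotations with placeholder term names.
annotations = {
    "g1": {"T1", "T2", "T3"},
    "g2": {"T1", "T2", "T3"},
    "g3": {"T1", "T2"},
    "g4": {"T9"},
}
adj = connectivity_graph(annotations, threshold=0.5)
cliques = maximal_cliques(adj)

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 10)
expr = {g: rng.normal(size=20) for g in annotations}
ranking = rank_clique_virtual_genes(expr, y, cliques)
print(cliques, ranking)
```

Enumerating maximal cliques is exponential in the worst case, so a sketch like this is only practical when the threshold keeps the connectivity graph sparse and the cliques small.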
Experiment Setup
• Two publicly available microarray
expression datasets are used: colon
cancer and leukemia.
• Three gene ontology branches are used
separately.
• Three classifiers are used.
• GO annotations are extracted from
Stanford's online database SOURCE.
Experiments: GO Virtual Gene
• Experiment result on Colon Cancer data set.
Experiment: GO Virtual Gene
• Experiment result on Leukemia data set.
Conclusion: GO Virtual Gene
• The use of domain knowledge embedded in
GO annotations enables us to examine
expression correlations among a large
set of genes.
• GO based virtual gene algorithm
sometimes improves gene selection
performance significantly.