Download Introduction to Molecular Biology and Genomics

Alexandros Kanterakis
17-5-2005
Heraklion
Crete
Presentation Outline
• DNA and Microarray Experiments
• From Genomic to Post-Genomic Informatics
• Combined Clinico-Genomic Knowledge Discovery
• Towards Reliable Gene-Markers: Supervised Gene Selection
• Discovery of Co-Regulated Genes: A Clustering Approach
• The MineGene System and Implementation Issues
• Future Work
DNA Microarrays
• Devices that can estimate, in parallel, the expression of many thousands of genes.
• Their invention in 1995 brought a revolution in molecular biology and medicine, as well as in the pharmaceutical and biotechnology industries.
• They are mainly used to estimate the differential expression of genes acquired from tissues in various states and conditions, making it practical to compare a sample's genotype profile with an arbitrary phenotype attribute or clinical observation.
DNA Microarray Experiments
• Microarray experiments consist of numerous steps, each of which includes a variety of procedures, protocols and data.
• Most of these steps and procedures are governed by specific guidelines, annotations and ontologies that need to be followed.
• It is crucial for a laboratory to record, maintain and publish these data in modern information systems.
• The final outcome of this procedure is the gene expression matrix, a 2D matrix containing the expression of each gene per sample. Genes and samples are accompanied by covariate information.
From Genomic to Post-Genomic
Informatics
Forms of Genomic Informatics:
• Sequence Databases
There are three major co-operating DBs (EMBL, GenBank, DNA Data Bank of Japan) containing millions of sequences with billions of nucleotides from several organisms, growing exponentially.
• Secondary Sequence Databases
Suitable for microarray experiments. Contain better annotation and meta-information. Examples: UniGene, TIGR, RefSeq.
• Genomic Databases
Examine sequences for microarrays from a genomic perspective. Contain gene names and annotations (rather than gene sequences) organized per organism. Examples: Ensembl, CMR (Microbial Genomes).
• Gene Expression Databases 
Gene Expression Databases
Provide data management for data generated by gene expression experiments. Their main purposes are to:
• Handle gene expression data:
– Store, retrieve and update data.
– Analyze data.
• Publish:
– Verify, compare, expand and improve findings.
– Develop novel data analysis methods.
• Provide a Laboratory Information Management System (LIMS):
– Record every step of the experimental process as it happens (experiments, dates, protocols used, experimental parameters).
– Provide data reproducibility.
– Standardize microarray experiments.
• Flow data seamlessly between the different components. Ideally it should be possible to replace any component without affecting the other parts of the flow.
In many respects gene expression databases are inherently more complex than sequence databases.
The Microarray Gene Expression
Data Society (MGED)
MGED is a group of researchers whose aim is to establish standards for microarray data annotation and to enable the creation of public databases for microarray data.
MGED’s work is arranged into four working groups:
• MIAME (Minimum Information About a Microarray Experiment). Formulates the information that must be recorded about a microarray experiment in order to describe and share it.
• Ontologies. Determines ontologies for describing microarray experiments and the samples used with microarrays (available in RDF, OWL and DAML).
– Other ontologies used in gene expression databases (GEDs) are taxonomic and gene ontologies.
• MAGE. Formulates the object model (MAGE-OM), exchange language (MAGE-ML) and software modules (MAGE-stk) for implementing microarray software.
• Transformations. Issues recommendations on how to describe the methods used for transformation, normalization and standardization of microarray data.
Expression Database Comparison
Objective: Analyze existing microarray gene expression databases for their ability to serve as an integrated environment for a laboratory, as part of the PrognoChip project. The selected candidates are widely known, open source systems: BASE and ArrayExpress (cooperation with FORTH-ISL).
Criterion | BASE (selected) | ArrayExpress
Supporting standards | Supports MAGE-ML extraction; did not support experiment MAGE-ML submission | Problems with MAGE-ML submission and extraction
Consensus / supporting community | Mailing list, active community; online documentation | Mailing list, active community; better online documentation
Installation / software maintenance | Light-weight and robust inherent RDBMS (MySQL); reasonable hardware requirements | Tricky and problematic installation and tuning (Oracle); extreme hardware requirements
Provided tools / extensions | Basic analysis tools; integrated plug-in schema (through the PHP language) | Perl language (obsolete?); analysis through Expression Profiler
Interface supplied / usability / security | Includes LIMS with graphic interface; basic security schema | No graphic submission tool; more sophisticated security schema
Applications of Genomics:
The “New Genomics”
• In the USA, projections suggest that 40% of those alive today will be diagnosed with some form of cancer at some point in their lives.
• By 2010, that number will have climbed to 50%.
• Today it is known that 9 of the 10 leading causes of mortality have genetic components.
• This aspect of genetics has to consider diseases caused partly by mutations in specific genes (e.g., breast cancer, colon cancer, diabetes, Alzheimer disease) or prevented by mutations in genes (e.g., HIV, atherosclerosis, some forms of cancer).
• These conditions are common enough to directly affect virtually everyone, giving genetics a large role in healthcare and in society.
Genomic Medicine and Healthcare
Genomic medicine will change healthcare by:
• providing knowledge of individual genetic predispositions via microarray and other technologies:
– individualized screening (e.g., mammography schedule).
– individualized behavior changes (e.g., informed dietary choices).
– presymptomatic medical therapies.
• creating pharmacogenomics:
– individualized medication based on genetically determined variation in effects and side effects.
– new medications for specific genotypic disease subtypes.
• allowing genetic engineering.
• providing a better understanding of non-genetic (environmental) factors in health and disease.
• emphasizing health maintenance rather than disease treatment.
• creating a fundamental understanding of the etiology of many diseases, even “non-genetic” diseases.
Integrating Clinical and Genomic
Information
• Most genetic contributions to common disease identified so far have come from low-frequency, high-penetrance alleles (e.g., BRCA1, BRCA2, HNPCC).
• On a population level, most genetic contributions to common disease come from high-frequency, low-penetrance alleles (e.g., APC, Alzheimer disease, HIV/AIDS resistance).
• What causes these low-penetrance alleles to be expressed appears to be complex and has to include environmental factors.
• Thus, clinical observations are strictly correlated with specific alleles during the expression of these diseases.
Integrated Clinico-Genomic
Knowledge Discovery: A Scenario
The conceptualization of individualized medicine is to be realized through respective procedures, protocols and guidelines in the context of integrated and synergic clinico-genomic decision-making scenarios. Such a scenario is presented for the case of cancer; the same scenario may be conceptualized and appropriately extended to other diseases.
The 5-step scenario illustrates the key processes, namely: collection of samples, phenotyping, genotyping and the transition from phenotypes to genotypes.
• Step 1. Collection of samples
A tissue sample is extracted from specific cancer patients. The tissue sample is appropriately treated and stored in order to preserve RNA expression.
[Figure: Step 1/5]
• Step 2. Phenotyping
– Characterization of samples:
Collected samples are assigned to various clinico-histopathological types and stages.
– Classification of samples:
Samples are assigned to different phenotypical profiles (e.g., phenotypes F1 and F2), which may include: age, habits & environmental factors, family history, tumour type, medical-imaging parameters, …
During this procedure we build various phenotypes, such as:
Domain | Phenotype F1 | Phenotype F2
Domain 1 | Good prognosis | Bad prognosis
Domain 2 | Responds to chemotherapy | Does not respond to chemotherapy
Domain 3 | Metastasis occurred | No metastasis occurred
[Figure: Step 2/5]
• Step 3. Genotyping.
– Using microarray technology, the molecular profiles of the samples are extracted.
– Using fundamental molecular biology knowledge we may assess relevant molecular pathways (e.g., genetic networks). Such knowledge will help in the identification of validated and more refined genotypes.
• Step 4. From Phenotypes to Genotypes.
– Apply data-mining operations (gene selection) to the acquired gene-expression matrix to identify potentially discriminatory genes, for example genes that distinguish between the two identified phenotypes.
– These genes compose the molecular signature (or gene markers) of the respective phenotypes.
[Figure: Step 3,4/5]
• Step 5. From Genotypes to Phenotypes.
– The decision-making process described above may be initiated the other way around, towards the establishment of more fundamental knowledge.
– Applying data-mining operations again (e.g., clustering), we are able to identify clusters of samples based on their gene-expression profiles.
– These clusters represent potentially interesting genotypes, e.g., genotypes G1 and G2.
– In the course of a diagnostic, prognostic or therapeutic decision-making process, each as yet untreated patient may be assigned to the corresponding genotypical class (i.e., to the discovered cluster genotype to which the patient belongs).
– Then, with the aid of a supervised predictive learning operation (e.g., decision trees), the disease can be re-classified on the phenotypical level, a fundamental task in clinical research for combating major diseases.
[Figure: Step 5/5]
Gene Expression Data Mining
• Gene expression database mining is used to identify
intrinsic patterns and relationships in gene expression
data.
• Traditionally, molecular biology has concentrated on the study of a single gene or very few genes per research project.
• With genomes being sequenced, this is now changing into a so-called systems approach, where new research questions can be studied, such as:
– How many genes are expressed in different cell types?
– Which genes are expressed in all cell types?
– What are the functional roles of these genes?
– How is a group of genes regulated?
– Which genes are involved in a specific phenotype?
• We make a distinction between two types of analysis tasks: gene selection and gene clustering.
Towards Reliable Gene-Markers:
Supervised Gene Selection
Microarray gene expression experiments are organized in four basic types:
• A comparison of two biological samples.
• A comparison of two biological conditions, each represented
by a set of replicate samples
• A comparison of multiple biological conditions
• Analysis of covariate information
Although biological experiments vary considerably in their design, the data generated by microarray experiments can be viewed as a matrix of expression levels, organized by samples versus genes.
A Novel Gene Selection Approach:
Methodology and Algorithms
We present a novel gene-selection methodology composed of four main modules and based on the discretisation of gene-expression data:
Discretization of Gene-Expression Data
• In most cases, we are confronted with the problem of selecting genes that discriminate between two classes (e.g., diseases, disease states, treatment outcome, recurrence of disease; in other words, phenotypes). It is convenient to follow a two-interval discretisation of gene-expression patterns.
• A general statement of the two-interval discretisation problem follows, together with a stepwise process to solve it.
Given: A sorted vector $V$ of $k$ numbers, $V = \langle n_1, n_2, \ldots, n_k \rangle$, where each number in $V$ is assigned to one of two classes.
Find: A number $\mu$, $n_1 < \mu < n_k$, that splits the numbers in $V$ into two intervals, $[n_1, \mu)$ and $[\mu, n_k]$, and best discriminates between the two classes. Best discrimination is decided according to a specified criterion.
• Step 1
For every pair of consecutive numbers $n_i, n_{i+1}$ in $V$, their midpoint $\mu_i = (n_i + n_{i+1}) / 2$ is computed, and the corresponding ordered vector of midpoints is formed: $M = \langle \mu_1, \mu_2, \ldots, \mu_{k-1} \rangle$.
• Step 2
For each $\mu \in M$ the well-known information gain metric is computed:
$$IG(V, \mu) = Entropy(V) - \sum_{u \in \{l, h\}} \frac{|V_u|}{|V|} \, Entropy(V_u)$$
where the sets $V_l$ and $V_h$ include the numbers from $V$ that are less than $\mu$ and higher than (or equal to) $\mu$, respectively.
• Step 3
The midpoint that exhibits the maximum information gain,
$$\mu_{max} = \arg\max_{\mu \in M} IG(V, \mu),$$
is considered as the gene’s expression value which, when used as a split point, exhibits the best discrimination between the classes.
This point is used to assign the gene’s expression values to the nominal values ‘l’ow or ‘h’igh (i.e., less than $\mu_{max}$, and higher than or equal to $\mu_{max}$, respectively).
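The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the MineGene implementation: the function names (`entropy`, `best_split`, `discretise`) are hypothetical, entropy is taken in base 2, and ties between midpoints are resolved in favour of the first one found, which is an assumption.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def best_split(values, labels):
    """Return (mu_max, information_gain) for one gene."""
    pairs = sorted(zip(values, labels))            # the sorted vector V
    base = entropy([lab for _, lab in pairs])
    best_mu, best_gain = None, -1.0
    for (a, _), (b, _) in zip(pairs, pairs[1:]):   # Step 1: midpoints
        mu = (a + b) / 2
        low = [lab for v, lab in pairs if v < mu]
        high = [lab for v, lab in pairs if v >= mu]
        gain = base - sum(len(part) / len(pairs) * entropy(part)
                          for part in (low, high))  # Step 2: information gain
        if gain > best_gain:                        # Step 3: keep the maximum
            best_mu, best_gain = mu, gain
    return best_mu, best_gain


def discretise(values, mu):
    """Map expression values to 'l' (below mu) or 'h' (at or above mu)."""
    return ["l" if v < mu else "h" for v in values]


# Toy gene: low in class N, high in class P.
expr = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
classes = ["N", "N", "N", "P", "P", "P"]
mu_max, gain = best_split(expr, classes)
print(mu_max, gain, discretise(expr, mu_max))
```

For this toy gene the best split falls halfway between 3.0 and 10.0, where both intervals are pure.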
[Figure: Discretization of Gene-Expression Data, an overview]
The aforementioned discretisation process is
applied independently on each gene in the
training set. The final result is a discretised
expression-value representation / transform of
each gene:
Gene Ranking
For each discretised gene we count the number of ‘h’s and ‘l’s that occur in the respective samples. Assume that each sample is assigned to one of two classes, P and N. The following quantities are computed:
$H_{g,P}$ = number of ‘h’ values for gene g assigned to class P
$L_{g,P}$ = number of ‘l’ values for gene g assigned to class P
$H_{g,N}$ = number of ‘h’ values for gene g assigned to class N
$L_{g,N}$ = number of ‘l’ values for gene g assigned to class N
The formula below computes a rank for each gene that measures the power of the gene to distinguish between the two classes:
$$r_g = (H_{g,P} \times L_{g,N}) - (H_{g,N} \times L_{g,P})$$
For a completely distinguishing gene, where all of its values for class P are ‘h’ and all of its values for class N are ‘l’, $L_{g,P} = H_{g,N} = 0$ and $r_g$ takes its maximum positive value. In this case the gene is considered to be descriptive of (associated with) class P.
The gene remains completely distinguishing in the inverse case, where $H_{g,P} = L_{g,N} = 0$ and $r_g$ takes its minimum negative value. In this case the gene is considered descriptive of class N.
The gene ranking formula encompasses and expresses:
(a) a polarity characteristic
(b) the descriptive power of the gene with respect to the present disease-state classes
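As a rough sketch of the ranking step, the counts and the rank can be computed as below. The rank is read here as the product form $H_{g,P} L_{g,N} - H_{g,N} L_{g,P}$; the operators are hard to make out in the slide, so that reading is an assumption, and the function name `gene_rank` is hypothetical.

```python
def gene_rank(discretised, classes):
    """Rank of one gene from its 'h'/'l' values and the 'P'/'N' sample classes.

    Assumes the rank is H_gP * L_gN - H_gN * L_gP (an assumed reading
    of the slide's formula).
    """
    pairs = list(zip(discretised, classes))
    h_p = sum(1 for d, c in pairs if d == "h" and c == "P")
    l_p = sum(1 for d, c in pairs if d == "l" and c == "P")
    h_n = sum(1 for d, c in pairs if d == "h" and c == "N")
    l_n = sum(1 for d, c in pairs if d == "l" and c == "N")
    return h_p * l_n - h_n * l_p


classes = ["P", "P", "N", "N"]
print(gene_rank(["h", "h", "l", "l"], classes))  # gene descriptive of class P
print(gene_rank(["l", "l", "h", "h"], classes))  # gene descriptive of class N
print(gene_rank(["h", "l", "h", "l"], classes))  # gene with no distinguishing power
```

Under this reading, the sign of the rank carries the polarity and its magnitude the descriptive power.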
Gene Grouping
By gene grouping we group genes that have similar rankings. First we estimate the value:
$$g = \frac{MaxRank - MinRank}{n - 1}$$
MaxRank and MinRank are the maximum and minimum rankings of the genes, respectively, as computed in the previous step.
Gene i is assigned to a group $O_i$ according to the formula:
$$O_i = \begin{cases} 1, & i = 1,\ k = 1 \\ k, & R_i - R_{i-1} \le g \\ k + 1, & R_i - R_{i-1} > g,\ k = k + 1 \end{cases}$$
$R_i$ is the ranking of gene i, and k is an integer variable.
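One possible reading of the grouping rule is sketched below. Several details are assumptions rather than facts from the slides: n is taken to be the number of genes, genes are visited in descending rank order, and a new group is opened whenever the absolute gap between consecutive ranks exceeds g. The function name `group_genes` is hypothetical.

```python
def group_genes(ranks):
    """Return a group number (1, 2, ...) for every gene, in input order.

    Assumptions: n is the number of genes, genes are visited from the
    highest to the lowest rank, and a gap larger than g starts a new group.
    """
    n = len(ranks)
    if n == 0:
        return []
    g = (max(ranks) - min(ranks)) / (n - 1) if n > 1 else 0.0
    order = sorted(range(n), key=lambda i: ranks[i], reverse=True)
    groups = [0] * n
    k = 1
    groups[order[0]] = k                  # the first gene opens group 1
    for prev, cur in zip(order, order[1:]):
        if abs(ranks[prev] - ranks[cur]) > g:
            k += 1                        # large gap: start the next group
        groups[cur] = k
    return groups


print(group_genes([10, 9, 9, 2, 1, -8]))
```

With these six ranks g is 3.6, so the gaps of 7 and 9 open new groups while the smaller gaps do not.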
Greedy gene-groups elimination
[Figure: the ranked gene groups, positive groups p1 to p5 followed by negative groups n5 to n1]
Step 1. Initialisation
During greedy gene-groups elimination, we initially consider all groups as identifiers and we assess the predictive power of the selected genes.
Step 2. Choose what to eliminate
We consecutively choose to eliminate:
A. The last Positive Group …
B. The last Negative Group …
C. Both of them …
Step 3. Estimation of prediction ability
We assess the predictive ability of the selected genes in cases A, B and C, choose the best predictive set (say C), and repeat steps 2 and 3 until accuracy no longer increases.
Greedy gene-groups addition
[Figure: the ranked gene groups, positive groups p1 to p5 followed by negative groups n5 to n1]
Step 1. Initialisation
During greedy gene-groups addition, we initially consider no groups of identifiers at all.
Step 2. Choose what to add
We consecutively choose to add:
A. The first Positive Group …
B. The first Negative Group …
C. Both of them
Step 3. Estimation of prediction ability
We assess the predictive ability of the selected genes in cases A, B and C, choose the best predictive set (say C), and repeat steps 2 and 3 until accuracy no longer increases.
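The greedy addition loop can be sketched as follows; greedy elimination is the mirror image, starting from all groups and dropping the last positive group, the last negative group, or both. The `accuracy` callback, the toy scoring function and the gene names are hypothetical stand-ins for whatever predictor and validation scheme is actually used.

```python
def greedy_group_addition(pos_groups, neg_groups, accuracy):
    """Greedily add gene groups while the estimated accuracy improves.

    pos_groups / neg_groups: lists of gene groups, strongest group first.
    accuracy: callable taking a list of genes and returning a score
              (a hypothetical stand-in for a real evaluation).
    Returns (number of positive groups, number of negative groups, score).
    """
    p = n = 0
    best = float("-inf")                       # Step 1: nothing selected yet
    while True:
        candidates = []                        # Step 2: cases A, B, C
        if p < len(pos_groups):
            candidates.append((p + 1, n))      # A: next positive group
        if n < len(neg_groups):
            candidates.append((p, n + 1))      # B: next negative group
        if p < len(pos_groups) and n < len(neg_groups):
            candidates.append((p + 1, n + 1))  # C: both
        if not candidates:
            break
        scored = []
        for a, b in candidates:                # Step 3: assess each case
            genes = [x for grp in pos_groups[:a] + neg_groups[:b] for x in grp]
            scored.append((accuracy(genes), a, b))
        top_score, a, b = max(scored)
        if top_score <= best:                  # no improvement: stop
            break
        best, p, n = top_score, a, b
    return p, n, best


# Toy evaluation: reward two informative genes, penalise every other gene.
informative = {"g1", "g4"}

def toy_accuracy(genes):
    hits = sum(1 for x in genes if x in informative)
    return hits - 0.1 * (len(genes) - hits)

result = greedy_group_addition([["g1"], ["g2"], ["g3"]],
                               [["g4"], ["g5"]], toy_accuracy)
print(result)
```

In the toy run the first round picks case C (both first groups), and the second round finds no improvement, so the loop stops.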
Sample Class Prediction
• During class prediction we have a set of selected genes along with their identifiers, as computed in the previous steps.
• A new unclassified sample enters.
• Keep only the values of the selected genes.
• Discretise the new sample according to the midpoints.
• Assess the predictive power of each selected gene. For positive genes it is (HighPos – LowPos) / #Pos; for negative genes it is (HighNeg – LowNeg) / #Neg.
• Estimate the sum of the products of the predictive power of each gene and the discretisation of the sample. The estimation is done for positive and negative genes separately.
• The unclassified sample is assigned to class Pos because C1 > C2, and the process continues with the next unclassified sample.
The previous process can be modeled by the following formula:
$$C_s = \arg\max \left\{ \sum_{g \in R_P} \operatorname{sign}(\mu_{max,g} - E_{s,g}) \, \frac{H_{g,P} - L_{g,P}}{|P|},\ \sum_{g \in R_N} \operatorname{sign}(\mu_{max,g} - E_{s,g}) \, \frac{H_{g,N} - L_{g,N}}{|N|} \right\}$$
$C_s$ is the class that will be assigned to unclassified sample s.
$R_P, R_N$ are the sets of positively ranked and negatively ranked genes, respectively.
$\mu_{max,g}$ is the midpoint of gene g.
$E_{s,g}$ is the expression value of unclassified sample s at gene g.
$|P|, |N|$ are the total numbers of positive and negative training samples.
• As with the gene-ranking formula, this formula also
encompasses a polarity characteristic. In addition, the
strength with which the sample is predicted to belong to one
of the two classes is also provided so that, strong (or, weak)
predictions could be made.
• This strength can be applied to tackle domains with more
than two classes (multi-class prediction):
Let S be an unclassified sample that belongs to a domain
with c classes. We also assume that we have selected g
genes to be our discriminant attributes. We apply the
predictor described above subsequently for each class. That
is, we estimate the prediction strength of S belonging to
each one of the c classes. Finally we assign the sample S to
the class that made the best prediction score.
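A minimal sketch of the two-class predictor follows. It encodes the discretised sample value as +1 for ‘h’ and -1 for ‘l’, so that a value above the midpoint supports the gene's own class; this sign convention, the tie-breaking towards N, and the names `predict_class`, `pos_genes` and `neg_genes` are assumptions, not details taken from the slides.

```python
def predict_class(sample, pos_genes, neg_genes):
    """Assign a sample to class 'P' or 'N' and return both class scores.

    sample:    dict gene -> expression value of the unclassified sample.
    pos_genes: dict gene -> (midpoint, power), power = (H_gP - L_gP) / |P|.
    neg_genes: dict gene -> (midpoint, power), power = (H_gN - L_gN) / |N|.
    """
    def score(genes):
        total = 0.0
        for gene, (midpoint, power) in genes.items():
            # Discretise the sample at this gene: 'h' -> +1, 'l' -> -1 (assumed).
            level = 1 if sample[gene] >= midpoint else -1
            total += level * power
        return total

    c1 = score(pos_genes)   # evidence from the positively ranked genes
    c2 = score(neg_genes)   # evidence from the negatively ranked genes
    label = "P" if c1 > c2 else "N"   # ties go to N (an arbitrary choice)
    return label, c1, c2


pos_genes = {"g1": (5.0, 1.0)}   # toy midpoints and predictive powers
neg_genes = {"g2": (3.0, 0.8)}
print(predict_class({"g1": 9.0, "g2": 1.0}, pos_genes, neg_genes))
print(predict_class({"g1": 2.0, "g2": 7.0}, pos_genes, neg_genes))
```

The two scores also give a crude prediction strength; for a multi-class domain the same scoring could be run once per class and the sample assigned to the class with the best score.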
Experimental Evaluation
We applied the introduced gene-selection and samples classification
methodology on eight real-world gene-expression domain studies that are
pioneers in their fields:
Experimental Evaluation
Summarization of the results of applying the introduced gene-selection
and sample classification/prediction method:
Discovery of Co-Regulated Genes:
A Clustering Approach
• By comparing gene-expression profiles and forming clusters, we can hypothesize that the respective genes are co-regulated and possibly functionally related.
• The discovery of genes’ functions may help in the identification of genes involved in particular molecular pathways, and thereby ease the modelling and exploration of metabolic pathways (i.e., metabolomics).
• Clustering of genes may reveal gene families, i.e., metagenes, and their potential linkage with combined clinical features, a task which is too difficult to achieve when we are confronted with the huge number of available genes (~25000-30000 for the human case).
A Graph Theoretic Clustering (GTC)
• We present a novel Graph Theoretic Clustering (GTC) approach to the clustering of microarray gene expression profile data. The approach is based on:
– The arrangement of the genes in a weighted graph
– The construction of the graph’s Minimum Spanning Tree
– An algorithm that recursively partitions the tree.
• Main advantages of the method:
– Domain background knowledge can be utilized in order to
compute distances between objects.
– No need to specify the number of clusters in advance.
– Hierarchical clustering.
Step 1: Fully Connected Graph
Compute the distances between all gene expression profiles and construct the fully connected graph:
• Distances may be simple or more domain specific (e.g., Euclidean, Pearson, Mahalanobis).
• Or they may come from a completely arbitrary, external source of information. This characteristic makes the whole data analysis process more ‘knowledgeable’ in the sense that established domain knowledge guides the clustering process.
Step 2: Minimum Spanning Tree
Construction
The minimum spanning tree (MST) of the fully connected weighted graph of the objects is constructed. For n objects, the resulting MST contains exactly n-1 edges:
• The MST preserves the shortest distances between the genes. This guarantees that objects lying in ‘close areas’ of the tree exhibit low distances.
• Finding the ‘right’ cuts of the tree could result in a reliable grouping of the genes.
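Steps 1 and 2 can be illustrated with Prim's algorithm, which builds the MST directly from a distance function. The slides do not say which MST algorithm is used, so Prim's is an assumption here, as are the Euclidean default and the function names.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minimum_spanning_tree(profiles, distance=euclidean):
    """Prim's algorithm on the fully connected graph of the profiles.

    Returns a list of edges (i, j, d); for n profiles there are n-1 edges.
    """
    n = len(profiles)
    if n == 0:
        return []
    # best[j] = (distance from j to the tree, tree node realising it)
    best = {j: (distance(profiles[0], profiles[j]), 0) for j in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda node: best[node][0])   # closest outside node
        d, i = best.pop(j)
        edges.append((i, j, d))
        for m in best:                                  # relax remaining nodes
            dm = distance(profiles[j], profiles[m])
            if dm < best[m][0]:
                best[m] = (dm, j)
    return edges


profiles = [[0, 0], [0, 1], [5, 5], [5, 6]]   # two tight pairs, far apart
mst = minimum_spanning_tree(profiles)
print(mst)
```

Passing a different `distance` callable is one way to plug in a domain-specific or externally supplied distance.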
Step 3: Iterative MST partition
At each node in the hierarchical tree formed so far, each of the edges in the corresponding node’s sub-MST is cut. With each cut a binary split of the genes is formed. If the current node includes n genes then n-1 such splits are formed. The two sub-clusters formed by the binary split, plus the clusters formed so far, compose a potential partition.
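Enumerating the n-1 candidate binary splits of a (sub-)MST can be sketched as below: each edge is removed in turn and the two resulting connected components are collected. The function name and the edge-list representation are illustrative assumptions.

```python
def binary_splits(n_nodes, edges):
    """For each tree edge, return the two node sets obtained by cutting it.

    n_nodes: number of nodes, labelled 0 .. n_nodes-1.
    edges:   list of (a, b) pairs forming a tree over those nodes.
    """
    splits = []
    for cut in range(len(edges)):
        adjacency = {v: [] for v in range(n_nodes)}
        for idx, (a, b) in enumerate(edges):
            if idx != cut:                      # leave the cut edge out
                adjacency[a].append(b)
                adjacency[b].append(a)
        start = edges[cut][0]
        seen, stack = {start}, [start]
        while stack:                            # depth-first traversal
            node = stack.pop()
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        other = set(range(n_nodes)) - seen
        splits.append((sorted(seen), sorted(other)))
    return splits


splits = binary_splits(4, [(0, 1), (1, 2), (2, 3)])
print(splits)
```

Each returned pair is one candidate split, to be scored before the best one is kept.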
Step 4. Best Split
For each binary split we compute a category utility (CU) that indicates the division ability of the split. The more compact the clusters formed, the higher the CU.
J. Yoo and S. Yoo. “Concept Formation in Numeric Domains”. Proceedings of Computer Science Conference, pp. 36-41, Nashville, TN, March 1995.
Where K is the number of clusters formed so far, $\sigma_{ik}$ is the standard deviation for sample i in class k, and $\sigma_{iP}$ is the standard deviation for attribute i of all the genes participating in the clustering.
The split that exhibits the highest CU is selected as the best partition of genes in the current node.
Step 5: Iteration and termination criterion
Each new cutting point found on the tree divides the tree into two sub-trees: the left and the right. The best cut of each of these two trees is found as described in steps 3 and 4.
In order to decide what the new cut will be, four potentials have to be examined. In order to decide which potential is the proper one, we estimate the CU of each.
Time Complexity
The time and space complexity of calculating all distances of n genes with F samples is $O(n^2 \cdot F)$. When dealing with real-domain problems the number of computed distances may reach the order of $10^{11}$.
Even though contemporary computers can handle this complexity in terms of time, it is very hard to handle in terms of space. In order to overcome this bottleneck we introduce a heuristic that significantly reduces the number of computed distances:
We assume that the maximum degree of the computed MST’s nodes is less than a constant value, say t. This hypothesis comes from the belief that the data has a minimum sparseness. Thus an MST of a fully connected graph cannot have a node with degree greater than t. This reduces the space complexity to $O(F \cdot t \cdot n)$, even though it increases the time complexity, as the burden of sorting the distances of each node has been added.
Experimental Evaluation on Gene-Expression Data Clustering
Large-scale temporal gene-expression mapping of Central Nervous System development (112 genes; 9 developmental time-points)
Wen et al., PNAS 95, 334-339, January 1998
[Figure: GTC clustering result with clusters c2 (w5), c1112 (w2), c1111 (w3), c112 (w4), c12 (w1)]
GTC: Comparison & Interpretation of Results
• GTC: clusters almost identical to Wen.
• The same using GTC-VDM.
• GTC is: well-formed, reliable, stable.
[Figure: (a) Wen clusters w1 to w5 and (b) GTC clusters c12 (w1), c1112 (w2), c1111 (w3), c112 (w4), c2 (w5), together with GTC-VDM clusters C22 (w1), C2112 (w2), C2111 (w3), C212 (w4), c1 (w5). Each panel plots values from 0.00 to 1.00 over the time points E11, E13, E15, E18, E21, P0, P7, P14, A. Indicative patterns: EARLY, EARLY_MID, EARLY_MID_C, LATE, Constant.]
The MineGene System:
Implementation Issues
• MineGene is a collection of Machine Learning /
Data Mining algorithms and heuristics for
intelligent processing of gene expression data
produced by DNA Microarray experiments.
• It is designed and implemented to serve as a plug-in to a gene expression database.
• It implements (among others) all the methods
presented.
MineGene’s Pathway
There is not yet any standard method for microarray gene expression data analysis, only some general guidelines that have recently started to form.
These guidelines represent a sequential procedure, a pathway that starts after data acquisition and ends with the construction of a predictor or a clustering mechanism, depending on whether we are performing supervised or unsupervised data analysis.
Class hierarchy of MineGene
MineGene should:
• Act as a plug-in in a gene expression database.
• Be composed of several components with certain correlations between them, as algorithms belonging to the same family share common attributes.
• Utilize a Graphical User Interface.
Thus, Object Oriented Programming via C++.
MineGene’s GUI
MineGene supports:
• Filtering methods:
– Remove NaN (Not a Number) values.
– Remove non-significant genes (according to the Wilcoxon rank-sum test).
– Read from external resource → study genes.
• Ranking methods:
– According to entropy (as presented).
– According to standard deviation (signal to noise): $\frac{\mu_a - \mu_b}{\sigma_a + \sigma_b}$
– According to significance (Wilcoxon rank-sum test).
– According to an external resource (file).
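The signal-to-noise ranking can be sketched as the difference of the class means divided by the sum of the class standard deviations. Using the sample standard deviation (`statistics.stdev`) rather than the population one is an assumption, and the function name is hypothetical.

```python
import math
import statistics


def signal_to_noise(values_a, values_b):
    """Signal-to-noise score of one gene between classes a and b.

    Computes (mean_a - mean_b) / (sd_a + sd_b) with sample standard
    deviations; each class needs at least two values, and the score is
    undefined when both standard deviations are zero.
    """
    mean_a = statistics.mean(values_a)
    mean_b = statistics.mean(values_b)
    sd_a = statistics.stdev(values_a)
    sd_b = statistics.stdev(values_b)
    return (mean_a - mean_b) / (sd_a + sd_b)


score = signal_to_noise([4.0, 6.0], [1.0, 3.0])
print(score)
```

A large positive score marks a gene that is higher in class a, a large negative one a gene that is higher in class b.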
MineGene Supports:
• Grouping Methods:
– According to the method presented.
– No grouping at all.
• Gene Selection Methods:
– ADD / DEL Methods Presented
– A priori gene or groups Selection:
MineGene Supports
• Prediction Methods:
– Discretisation (presented before) for dual or multiclass domains.
– Support Vector Machines (through libsvm).
– K-Nearest Neighbours (KNN).
– K-Means.
MineGene Supports
• Clustering through GTC (as presented)
– MST, distance and category utility method selection.
– Heuristics for a-priori cluster size.
– Options to cluster an arbitrary tree, to use external distances and to cluster an arbitrary graph (not fully connected).
– Option to visualize the clustering in .jpg format through GraphViz.
MineGene Supports
• Study comparison
A study contains the genes selected by an
external work. These are compared with
the genes found by our study and the
common genes are exported.
• Validation
Leave-One-Out Cross Validation is supported (currently being extended).
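Leave-one-out cross validation can be sketched generically: each sample is held out once, a model is trained on the rest, and the held-out sample is predicted. The `train` and `predict` callbacks and the toy nearest-mean classifier below are hypothetical illustrations, not MineGene's API.

```python
def leave_one_out_accuracy(samples, labels, train, predict):
    """Fraction of samples predicted correctly when each is held out once."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]   # everything except sample i
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Toy 1-D classifier: assign a value to the class with the nearest mean.
def train_nearest_mean(xs, ys):
    sums = {}
    for x, y in zip(xs, ys):
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}


def predict_nearest_mean(model, x):
    return min(model, key=lambda y: abs(model[y] - x))


samples = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["N", "N", "N", "P", "P", "P"]
loo = leave_one_out_accuracy(samples, labels,
                             train_nearest_mean, predict_nearest_mean)
print(loo)
```

In a gene-selection setting, the selection step itself would normally sit inside `train`, so that the held-out sample never influences which genes are chosen.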
MineGene Supports
• Study clustering
When we are performing clustering, our outcomes can be compared with an external clustering. The similarity of two clusterings can be assessed by:
$$E = \sum_{i=1}^{\#CL} \frac{\#CL_i}{n} \, E(CL_i)$$
$$E(CL_i) = -\sum_{j=1}^{\#Cl} P(Cl_{ij}) \log P(Cl_{ij})$$
$$P(Cl_{ij}) = \frac{\#Cl_{ij}}{\#CL_i}$$
Where #CL is the total number of clusters produced by our algorithm and #Cl is the total number of external clusters. $\#CL_i$ is the number of genes contained in cluster i of our algorithm and $\#Cl_{ij}$ is the number of genes contained in cluster j of the external clustering that also belong to cluster i of our algorithm.
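The entropy-based comparison can be sketched as follows. The slide does not define n or the base of the logarithm; here n is taken to be the total number of genes and the natural logarithm is used, both of which are assumptions, as is the function name.

```python
import math
from collections import Counter


def clustering_entropy(ours, external):
    """Weighted entropy of an external clustering within our clusters.

    ours, external: dicts mapping each gene to a cluster label.
    Returns 0 when every one of our clusters lies inside a single
    external cluster; larger values mean poorer agreement.
    """
    n = len(ours)                                   # total genes (assumed)
    members = {}
    for gene, cluster in ours.items():
        members.setdefault(cluster, []).append(external[gene])
    total = 0.0
    for ext_labels in members.values():
        size = len(ext_labels)                      # #CL_i
        e_i = -sum((c / size) * math.log(c / size)  # E(CL_i)
                   for c in Counter(ext_labels).values())
        total += size / n * e_i
    return total


external = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
same = clustering_entropy({"a": 1, "b": 1, "c": 2, "d": 2}, external)
mixed = clustering_entropy({"a": 1, "b": 1, "c": 1, "d": 1}, external)
print(same, mixed)
```

Note that the measure is asymmetric: splitting every gene into its own cluster would also give zero, so it is best read alongside the number of clusters produced.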
Future Work
• Porting to other well-known analysis tools, e.g. as an R package (standard in bioinformatics).
• Inclusion in an Integrated Clinico-Genomics Environment (not a standalone application or a Gene Expression Database).
• Include visualization methods.
• Support of clinico-genomic knowledge-discovery scenarios.
…
Integrated Clinico-Genomics with MineGene
A Multi-Strategy Data Mining Approach
• Clustering
Clusters of Genes → Means of Clusters = Meta-Genes
• Association Rules
Interesting associations between Clinical Parameters and Meta-Genes = Interesting Clinical Profiles/Categories
(ER+ & PR+ & AGE > 40 & GOOD-prognosis VS. ER+ & PR+ & AGE > 40 & BAD-prognosis)
• Gene-Selection
Select discriminant genes that distinguish between the discovered clinical profiles
ER+ & PR+ & AGE > 40 & MG-1=High & MG-2=Low → GOOD-prognosis (> 5 yrs)
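The first stage of this pipeline, turning clusters of genes into meta-genes, can be sketched as taking the per-sample mean of each cluster's expression profiles. The data layout (dicts keyed by gene name) and the function name `meta_genes` are illustrative assumptions.

```python
def meta_genes(profiles, clusters):
    """Mean expression profile of every gene cluster.

    profiles: dict gene -> list of expression values, one per sample.
    clusters: dict gene -> cluster id.
    Returns dict cluster id -> mean profile (the cluster's meta-gene).
    """
    sums, counts = {}, {}
    for gene, values in profiles.items():
        cid = clusters[gene]
        if cid not in sums:
            sums[cid] = [0.0] * len(values)
            counts[cid] = 0
        sums[cid] = [s + v for s, v in zip(sums[cid], values)]
        counts[cid] += 1
    return {cid: [s / counts[cid] for s in total]
            for cid, total in sums.items()}


profiles = {"a": [1.0, 3.0], "b": [3.0, 5.0], "c": [10.0, 0.0]}
clusters = {"a": 1, "b": 1, "c": 2}
mg = meta_genes(profiles, clusters)
print(mg)
```

The resulting meta-gene values could then be discretised to High/Low and combined with clinical parameters in the association-rule stage.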
THANK YOU!
?