Bioinformatics System for Gene Diagnostics and
Expression Studies
Justin Chan Shao Ling
National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260
[email protected]

Abstract-- This report is based on work undertaken on the CADRA (Coronary Artery Disease Risk Assessment) project. Pattern recognition techniques were employed in the classification of CAD (Coronary Artery Disease) cases using various risk factors and information from microarray gene-expression data. The machine learning algorithms used are the Naïve Bayes Classifier (NBC), the C4.5 Decision Tree and Support Vector Machines (SVM). The Chinese samples in the dataset were segregated and investigated using NBC and C4.5, allowing for comparison with 'all-race' classifiers. Lastly, the implementation, undertaken with my colleagues, of a web-based Bioinformatics system using Microsoft technology is described.
1 Introduction
Advances in the Life Sciences and various related technologies have fueled much interest amongst researchers in the study of genes and diseases. The Biosensors-Focused Interest Group (BFIG) of the National University of Singapore (NUS) was formed with the focus of harnessing DNA microarrays for genetic diagnosis and gene-expression studies. CADRA (Coronary Artery Disease Risk Assessment) was one of the projects undertaken, involving the study of 30 genes associated with atherosclerosis. This project involves the development and implementation, with Microsoft technologies, of a Web-based front-end for integrating the DNA information, as well as the understanding and implementation of machine learning algorithms for classifying CAD patients.
Coronary Artery Disease (CAD) is a major cause of death in developed countries. It is described as a complex disease due to its multifactorial nature, and it is further complicated by gene-environment and gene-gene interactions that result in risk-factor variability.

CADRA researchers are utilising microarray technology to study various candidate genes involved in CAD. Having obtained a large quantity of candidate gene-expression data, the bioinformatics group of the CADRA project has been involved in data mining and analysis in relation to CAD. The algorithms used for classification are the Naïve Bayes Classifier, the C4.5 Decision Tree and Support Vector Machines.

2 Experiment Issues and Methodology

Jonatrim 2 is the dataset used for classification. It consists of a total of 29 independent attributes and 1 dependent attribute (i.e. the CAD status). 19 of the independent attributes are categorical and pertain to genotype/phenotype information from microarray analysis (henceforth known as the A attributes).

Several issues and complications have arisen in the dataset. These include possible noise in the genotype/phenotype classification process, contaminated risk factors (TC, HDL, LDL and TG), missing data and class imbalance.

To address the issue of the contaminated attributes TC, HDL, LDL and TG, and to allow comparative studies with previous investigations [3], the Jonatrim 2 dataset has been split into several categorical types.

Type   Attributes
A      All attributes
B      All attributes except A
C      All attributes except TC, HDL, LDL, TG
D      All attributes except A, TC, HDL, LDL, TG
E      A attributes from Type A
F      A attributes
G      All attributes except Race
H      All attributes except A, Race
I      All attributes except TC, HDL, LDL, TG, Race
J      All attributes except A, TC, HDL, LDL, TG, Race
K      A attributes from Type A, with Race

Table 1: Description of Dataset Types used
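To make the type construction concrete, the following is a small sketch that derives the Type C and Type F subsets from a hypothetical table of Jonatrim 2 samples; the column names are illustrative assumptions, not the actual attribute names.

```python
import pandas as pd

# Hypothetical Jonatrim 2 layout: lipid risk factors, A (genotype/phenotype)
# attributes, Race, and the dependent CAD status.
df = pd.DataFrame({
    "TC": [5.2, 6.1], "HDL": [1.1, 0.9], "LDL": [3.4, 4.2], "TG": [1.5, 2.0],
    "A_gene_1": ["AA", "aa"], "A_gene_2": ["Aa", "AA"],
    "Race": ["Chinese", "Malay"], "cad_status": ["yes", "no"],
})

contaminated = ["TC", "HDL", "LDL", "TG"]
a_attributes = [c for c in df.columns if c.startswith("A_")]

type_c = df.drop(columns=contaminated)        # all attributes except TC, HDL, LDL, TG
type_f = df[a_attributes + ["cad_status"]]    # A attributes only
```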

Within each run of the algorithm concerned, the test set used is the same throughout. However, each run using Type F uses a different test set. The training and test sets also differ between algorithms.

2.1 Conduct of Training and Testing

The training and testing of models of the Naïve Bayes Classifier and C4.5 Decision Tree were carried out as described by Russell and Norvig [1]. The results of classification on the test set are quoted for the various algorithms used in their respective chapters. The average results are given across the entire training set size range (except for SVM). The following steps and measures were also taken in the conduct of the various training and testing of models.
• Sampling was carried out based on a uniform distribution. The test sets were sampled without replacement, while the training sets were sampled with replacement when insufficient examples arose.
• Training set sizes ranging from 100 to 800 instances (in increments of 100) and from 1000 to 2000 instances (in increments of 200) were used. The test set size was fixed at 100 samples throughout.
• SVM calls for a different method of training and testing. The test set is still fixed at 100 samples, but no resampling is carried out for the training set, which is fixed at a size of 500.
• The Chinese training sets were of size 280 and the test sets of size 100.
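A minimal sketch of this sampling protocol follows, with a placeholder dataset size standing in for the actual Jonatrim 2 sample count:

```python
import numpy as np

# Sketch of the protocol above: test set drawn without replacement,
# training set drawn with replacement when too few examples remain.
rng = np.random.default_rng(42)

def split(indices, train_size, test_size=100):
    test = rng.choice(indices, size=test_size, replace=False)
    remaining = np.setdiff1d(indices, test)
    # resort to sampling with replacement only when the pool is too small
    with_replacement = len(remaining) < train_size
    train = rng.choice(remaining, size=train_size, replace=with_replacement)
    return train, test

indices = np.arange(600)   # placeholder dataset size
for train_size in list(range(100, 900, 100)) + list(range(1000, 2200, 200)):
    train, test = split(indices, train_size)
```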
2.2 Performance Measure

To evaluate the performance of the algorithms, the various measures in the confusion matrix, as well as others (e.g. sensitivity and predictive value positive), are presented in tabular form for the various algorithms used. It should be noted that in the medical diagnosis of a condition like CAD, not all classification errors carry the same penalty [4]. Specifically, one would like to avoid classifying CAD cases as non-CAD cases.
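All of these tabulated measures derive from the four confusion-matrix counts. A minimal sketch of the standard definitions (the function name is illustrative):

```python
def diagnostic_measures(tp, tn, fp, fn):
    """Confusion-matrix summary measures used in the result tables (Section 7)."""
    total = tp + tn + fp + fn
    return {
        "sensitivity": tp / (tp + fn),                 # CAD cases correctly flagged
        "specificity": tn / (tn + fp),                 # non-CAD cases correctly cleared
        "predictive value positive": tp / (tp + fp),
        "predictive value negative": tn / (tn + fn),
        "success rate": (tp + tn) / total,
        "failure rate": (fp + fn) / total,
    }

# e.g. the SVM column of Table 2 below (counts on a 100-sample test set)
print(diagnostic_measures(tp=36, tn=33, fp=17, fn=14))
```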
3 Introduction to Naïve Bayes Classifier

The Naïve Bayes Classifier (NBC) is a simple yet effective method for pattern classification in cases involving both discrete and continuous attributes. Its basis is rooted in Probability Theory, particularly Bayes' Rule. Although it is less expressive and assumes attributes to be equally important and independent of one another, its simplicity has been known to rival and outperform more complicated classifiers like decision trees and instance-based learners [5].

The concept of conditional probability in Probability Theory provides the link in obtaining Bayes' Rule, which is

$$\Pr\{Y \mid X\} = \frac{\Pr\{Y\}\,\Pr\{X \mid Y\}}{\Pr\{X\}}$$

Bayes' Rule shows how to predict the class of a previously unseen example, given a training sample. The chosen class, y, should be the one which maximizes

$$\Pr\{Y = y \mid X = x\} = \frac{\Pr\{Y = y\}\,\Pr\{X = x \mid Y = y\}}{\Pr\{X = x\}}$$

where y represents the target class out of all the possible classes (i.e. $y \in Y$) and $x = (x_1, x_2, \ldots, x_n)$ the attributes used for classification. The probabilities Pr{Y} and Pr{X}, and the conditional probability Pr{X | Y}, are obtained from the training sample. The attributes $X_1, X_2, \ldots, X_n$ are statistically independent iff

$$\Pr\Bigl\{\bigcap_{i \in I} X_i\Bigr\} = \prod_{i \in I} \Pr\{X_i\} \qquad (3.6)$$

for every index set $I \subseteq \{1, \ldots, n\}$. Under this independence assumption, NBC predicts the class, y, that maximizes

$$\Pr\{Y = y \mid X = x\} = \frac{\Pr\{Y = y\}}{\Pr\{X = x\}} \prod_{i=1}^{n} \Pr\{X_i = x_i \mid Y = y\} \qquad (3.7)$$
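Equation (3.7) translates directly into a few lines of code. The following is a minimal sketch for categorical attributes only; the data, the attribute values and the add-one smoothing are my own illustrative assumptions (the report itself used Weka's NBC, with kernel density estimation for continuous attributes, as noted in Section 6).

```python
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, X, y):
        self.n = len(y)
        self.class_counts = Counter(y)          # estimates Pr{Y = y}
        self.cond = defaultdict(Counter)        # counts for Pr{X_i = x_i | Y = y}
        for xs, label in zip(X, y):
            for i, v in enumerate(xs):
                self.cond[(label, i)][v] += 1
        return self

    def predict(self, xs):
        # Pr{X = x} is the same for every class, so it can be dropped
        # when choosing the class that maximizes Eq. (3.7).
        def score(label):
            p = self.class_counts[label] / self.n
            for i, v in enumerate(xs):
                # add-one smoothing, assuming two possible values per attribute
                p *= (self.cond[(label, i)][v] + 1) / (self.class_counts[label] + 2)
            return p
        return max(self.class_counts, key=score)

# Hypothetical genotype/risk-factor attributes and CAD status
X = [("AA", "smoker"), ("aa", "non-smoker"), ("Aa", "smoker"), ("aa", "non-smoker")]
y = ["cad", "normal", "cad", "normal"]
print(NaiveBayes().fit(X, y).predict(("Aa", "smoker")))   # -> cad
```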
4 Introduction to C4.5 Decision Trees

The divide-and-conquer approach to decision tree induction, sometimes called top-down induction of decision trees, was developed and refined over many years by Ross Quinlan [6]. Its expressiveness and ability to generate rules are its major attractions to researchers.

A decision tree is a collection of branches (paths from the root to the leaves), leaves (indicating the class) and nodes (specifying the tests to be carried out). Decision trees classify cases as identified through attributes (or features). Essentially, they recursively partition regions of the attribute space into sub-regions according to the most 'informative' attribute [12]. The criterion used in C4.5 for selecting the most 'informative' attribute is the information gain ratio. Originally, only the information gain criterion was used (i.e. in ID3), but it was observed that this measure had a strong bias in favour of attributes with many outcomes. These criterion measures are largely based on Information Theory.

4.1 Information Measures

The following are some of the information measures used in C4.5. The entropy H(X) of a discrete random variable X is defined as

$$H(X) = \sum_{x:\,p_X(x) > 0} p_X(x) \log_2 \frac{1}{p_X(x)}$$

The conditional entropy H(X | Y) of a random variable X given the random variable Y is

$$H(X \mid Y) = \sum_{y:\,p_Y(y) > 0} p_Y(y)\,H(X \mid Y = y)$$

The information gain, or mutual information, between random variables X and Y is given as

$$I(X; Y) = H(X) - H(X \mid Y)$$

The information gain ratio criterion, or gain ratio as it is more commonly known, is thus defined as

$$\text{Gain ratio} = \frac{H(X) - H(X \mid Y)}{H(X)}$$
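As a check on these definitions, the measures can be computed directly. The sketch below uses hypothetical class labels and attribute values; the report itself relied on Weka's C4.5 implementation (Section 6).

```python
from collections import Counter
from math import log2

def entropy(xs):
    """H(X) = sum over x with p(x) > 0 of p(x) * log2(1 / p(x))."""
    n = len(xs)
    return sum((c / n) * log2(n / c) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """H(X|Y) = sum over y of p(y) * H(X | Y = y)."""
    n = len(ys)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return sum((len(g) / n) * entropy(g) for g in groups.values())

def gain_ratio(xs, ys):
    """(H(X) - H(X|Y)) / H(X), as defined in Section 4.1."""
    h = entropy(xs)
    return (h - conditional_entropy(xs, ys)) / h if h > 0 else 0.0

# e.g. class labels X split by a candidate genotype attribute Y
print(gain_ratio(["cad", "cad", "normal", "normal"], ["AA", "AA", "aa", "Aa"]))
```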
5 Introduction to Support Vector Machines

Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from Optimisation Theory that implements a learning bias derived from Statistical Learning Theory [7]. The motivation for the use of a higher-dimensional feature space is that there is a higher probability of encountering linearly separable patterns there. SVM incorporates many ideas, including quadratic programming and the use of kernels.

5.1 Linearly Separable Patterns

Consider a training sample

$$S = ((x_1, y_1), \ldots, (x_\ell, y_\ell))$$

that is linearly separable in the feature space implicitly defined by the kernel $K(x_i, x_j)$, and suppose the parameters $\alpha^*$ and $b^*$ solve the following quadratic optimisation problem:

$$\text{maximise} \quad W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} y_i y_j \alpha_i \alpha_j K(x_i, x_j)$$

$$\text{subject to} \quad \sum_{i=1}^{\ell} y_i \alpha_i = 0, \qquad \alpha_i \ge 0, \quad i = 1, \ldots, \ell$$

Then the decision rule given by $\mathrm{sgn}(f(x))$, where

$$f(x) = \sum_{i=1}^{\ell} y_i \alpha_i^* K(x_i, x) + b^*$$

is equivalent to the maximal margin hyperplane in the feature space implicitly defined by the kernel $K(x_i, x_j)$.

5.2 Linearly Non-Separable Patterns

In actual real-life situations, the data obtained may be contaminated or corrupted with noise, causing patterns to be linearly non-separable. SVM provides a 'soft' approach to such classification through the use of slack variables,

$$\xi_i \ge 0, \quad i = 1, \ldots, \ell$$

along with relaxed constraints,

$$y_i(\langle w \cdot x_i \rangle + b) \ge 1 - \xi_i, \quad i = 1, \ldots, \ell$$

A classifier that generalises well is then found by controlling both the complexity of the classifier (via $\langle w \cdot w \rangle$) and the number of training errors, minimising the objective function

$$\frac{1}{2} \langle w \cdot w \rangle + C \sum_{i=1}^{\ell} \xi_i$$

subject to the slack-variable constraints, the relaxed constraints and $C > 0$. The only modification in the dual formulation is the constraint on the $\alpha_i$ values, namely

$$0 \le \alpha_i \le C, \quad i = 1, \ldots, \ell$$

instead of $\alpha_i \ge 0$. The parameter C controls the tradeoff between the complexity of the machine and the number of non-separable points.
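For illustration, the soft-margin formulation above can be exercised with an off-the-shelf solver. The sketch below is a stand-in under stated assumptions: it uses scikit-learn's SVC rather than the report's own application (Section 6), with synthetic data, and sets gamma=1, coef0=1, degree=2 so that the kernel equals the degree-2 polynomial kernel (⟨xi · xj⟩ + 1)² adopted in the report.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic data: 200 samples, 5 numeric attributes, a deliberately
# non-linear labelling rule that a linear separator cannot capture.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (np.sum(X**2, axis=1) > 5).astype(int)

# gamma=1, coef0=1, degree=2 gives exactly K(x, x') = (<x . x'> + 1)^2;
# C trades off margin width against training errors (Section 5.2).
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0)
clf.fit(X[:150], y[:150])
print("test accuracy:", clf.score(X[150:], y[150:]))
```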
6 Software Used

The Weka package (Version 3.0.1), from the University of Waikato, was used to run NBC and C4.5. For NBC, the package offers the use of kernel density estimation for modelling continuous attributes. For C4.5, the confidence interval was set to 0.1 for pruning. The Jonatrim 2 dataset was pre-processed into the required ARFF format before running the software. References with regard to the ARFF format and software features are available [8].

An application was developed to create the required models for SVM classification. The software uses the CSV (comma-separated values) data file format, which is readily supported by major spreadsheet programs like Microsoft Excel. The polynomial kernel of degree 2 is the kernel used in building the classifier (i.e. $K(x_i, x_j) = (\langle x_i \cdot x_j \rangle + 1)^2$).
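For reference, ARFF is a plain-text format: a header declares the relation and its attributes, and the data rows follow an @data line. A minimal sketch with hypothetical attribute names (the actual Jonatrim 2 attributes are not reproduced here):

```
% Hypothetical sketch of the ARFF layout used for Weka; real attribute
% names and values in Jonatrim 2 differ.
@relation jonatrim2

@attribute age numeric
@attribute gene_1 {AA, Aa, aa}
@attribute cad_status {yes, no}

@data
54, Aa, yes
47, aa, no
```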
7 Comparing Results from Different Algorithms

The classification results from the three algorithms explored, based on the Type A to Type F datasets, are summarised in the tables below. Note that for the NBC and C4.5 classifiers, the run that gave the better results for the particular dataset type is quoted here. Also, the results stated here are still limited by the various issues and methodologies involved in training and testing the classifiers, and serve to indicate the estimated potential of the classifiers. The highest success rate and the lowest false negative rate for each type are marked with an asterisk for reference.

Statistics                  NBC        C4.5       SVM
True Positive (%)           44.64      44.57      36.00
True Negative (%)           42.36      43.57      33.00
False Positive (%)          7.64       6.43       17.00
False Negative (%)          5.36*      5.43       14.00
Sensitivity                 0.8929     0.8914     0.72
Specificity                 0.8471     0.8714     0.66
Predictive Value Positive   0.8548     0.8762     0.679245
Predictive Value Negative   0.8884     0.8916     0.702128
Success Rate                0.87       0.8814*    0.69
Failure Rate                0.13       0.1186     0.31

Table 2: Classification result comparison on Type A

Statistics                  NBC        C4.5       SVM
True Positive (%)           45.29      44.64      47.00
True Negative (%)           41.93      44.57      39.00
False Positive (%)          8.07       5.43       11.00
False Negative (%)          4.71       5.36       3.00*
Sensitivity                 0.9057     0.8929     0.94
Specificity                 0.8386     0.8914     0.78
Predictive Value Positive   0.8488     0.8938     0.810345
Predictive Value Negative   0.8996     0.8945     0.928571
Success Rate                0.8721     0.8921*    0.86
Failure Rate                0.1279     0.1079     0.14

Table 3: Classification result comparison on Type B

Statistics                  NBC        C4.5       SVM
True Positive (%)           41.36      42.57      36.00
True Negative (%)           43.07      44.86      35.00
False Positive (%)          6.93       5.14       15.00
False Negative (%)          8.64       7.43*      14.00
Sensitivity                 0.8271     0.85143    0.72
Specificity                 0.8614     0.89714    0.7
Predictive Value Positive   0.8629     0.8969     0.705882
Predictive Value Negative   0.8397     0.86299    0.714286
Success Rate                0.8443     0.87429*   0.71
Failure Rate                0.1557     0.12571    0.29

Table 4: Classification result comparison on Type C

Statistics                  NBC        C4.5       SVM
True Positive (%)           44.00      41.36      44.00
True Negative (%)           37.21      43.07      45.00
False Positive (%)          12.79      6.93       5.00
False Negative (%)          6.00*      8.64       6.00*
Sensitivity                 0.88       0.8271     0.88
Specificity                 0.7443     0.8614     0.9
Predictive Value Positive   0.7751     0.8629     0.897959
Predictive Value Negative   0.8624     0.8397     0.882353
Success Rate                0.8121     0.8443     0.89*
Failure Rate                0.1879     0.1557     0.11

Table 5: Classification result comparison on Type D

Statistics                  NBC        C4.5       SVM
True Positive (%)           28.57      26.79      24.00
True Negative (%)           31.71      27.29      31.00
False Positive (%)          18.29      22.71      19.00
False Negative (%)          21.43*     23.21      26.00
Sensitivity                 0.5714     0.5357     0.48
Specificity                 0.6343     0.5457     0.62
Predictive Value Positive   0.6101     0.5398     0.55814
Predictive Value Negative   0.5982     0.5471     0.54386
Success Rate                0.6029*    0.5407     0.55
Failure Rate                0.3971     0.4593     0.45

Table 6: Classification result comparison on Type E

Statistics                  NBC        C4.5       SVM
True Positive (%)           30.07      31.07      39.00
True Negative (%)           25.29      23.50      22.00
False Positive (%)          24.71      26.50      28.00
False Negative (%)          19.93      18.93      11.00*
Sensitivity                 0.601429   0.621429   0.78
Specificity                 0.505714   0.47       0.44
Predictive Value Positive   0.549736   0.54034    0.58209
Predictive Value Negative   0.55879    0.557516   0.666667
Success Rate                0.553571   0.545714   0.61*
Failure Rate                0.446429   0.454286   0.39

Table 7: Classification result comparison on Type F

As shown by the tables above, the learning algorithms generally perform better on the Type A, B, C and D datasets. The exception is the SVM, which performs only moderately on Types A and C. Performance on Types E and F, which use only the A attributes, is poor.
8 Classification based on Chinese samples
It was shown that for both algorithms explored (i.e. NBC and C4.5), better classification was generally observed when Chinese-trained classifiers were used on Chinese test cases. This may be due to smaller genetic and environmental variations within the race. The best result was achieved by a Type G C4.5 decision tree, with a success rate of 95%.
9 CADRA's Web-based Bioinformatics System

A web-based Bioinformatics system providing a patient database, data mining and delivery of microarray gene-expression data has been proposed, and its prototype has been implemented by my colleagues and myself using various Microsoft technologies.

A web server was set up at the Clinical Research Centre to house the website. The server runs on the Windows 2000 operating system with Internet Information Server 5. Microsoft's Active Server Pages 3.0 (ASP 3.0) was used, and Access 2000 served as the prototype database. Microsoft BackOffice 4.5 and SQL Server 7 have also been installed. Entry into the website can be found at http://bfigcad.nus.edu.sg/cad_ra/login.asp .

9.1 Variations on the CADRA Prototype

The prototype of the CADRA web-based system is a generic template for more elaborate structures. One such possible structure is described in the figure below.

[Figure 1: Alternative structure of service delivery with the Application and Internet Service Provider separate from the Microarray Lab. The figure connects a Clinic (Administrator, Clinic Assistant, Doctor) and a Microarray Lab (Lab-technician) through the Internet to the Application and Internet Service Provider.]

It was required that other technologies and products be explored for the project, and several discoveries were made. Java technology from Sun Microsystems offers a viable alternative to Microsoft's products. In combination with the Linux operating system, the Apache Web Server and the MySQL database, this solution offers a low-cost yet robust option for the Internet and Application Service Provider.

10 Conclusion

Generally good classification results were obtained for all three algorithms used, except when only microarray gene-expression data were used. It is felt that, for SVM, better results could be obtained using other kernels. Furthermore, better classification results were generally observed when Chinese test sets were shown to Chinese-trained classifiers as opposed to classifiers trained on all races, based on NBC and C4.5. Lastly, a web-based bioinformatics system, implemented with my colleagues for both medical information retrieval and CAD risk assessment, has been described.

References

[1] Stuart Russell & Peter Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 1995, p. 538.
[2] Gary M. Weiss & Foster Provost, The Effects of Class Distribution on Classifier Learning, Rutgers University, 2001.
[3] Chin W.C., Gene Identification of DNA Microarray Patterns Using Neural Network Techniques, National University of Singapore, 2000.
[4] Congdon, C. B., A Comparison of Genetic Algorithms and Other Machine Learning Systems on a Complex Classification Task from Common Disease Research, University of Michigan, 1995.
[5] Domingos, P. & Pazzani, M., Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier, Proceedings of the 13th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, 1996, pp. 105-112.
[6] Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993, p. vii.
[7] Nello Cristianini & John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000, pp. 7, 30-31, 93-110.
[8] Witten, I. & Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000, pp. 265-286.