Download Part 1

Document related concepts

Transformation (genetics) wikipedia , lookup

Promoter (genetics) wikipedia , lookup

Two-hybrid screening wikipedia , lookup

Molecular cloning wikipedia , lookup

Gel electrophoresis of nucleic acids wikipedia , lookup

DNA supercoil wikipedia , lookup

Genomic library wikipedia , lookup

Transcriptional regulation wikipedia , lookup

Real-time polymerase chain reaction wikipedia , lookup

Gene wikipedia , lookup

Endogenous retrovirus wikipedia , lookup

Biosynthesis wikipedia , lookup

Point mutation wikipedia , lookup

RNA-Seq wikipedia , lookup

Non-coding DNA wikipedia , lookup

Gene expression wikipedia , lookup

Silencer (genetics) wikipedia , lookup

Vectors in gene therapy wikipedia , lookup

Nucleic acid analogue wikipedia , lookup

Deoxyribozyme wikipedia , lookup

Community fingerprinting wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Transcript
Rule-Based Data
Mining Methods for
Classification
Problems in
Biomedical Domains
Jinyan Li
Limsoon Wong
Copyright © 2004 by Jinyan Li and Limsoon Wong
Rule-Based Data
Mining Methods for
Classification
Problems in
Biomedical Domains
Part 1:
Background and Introduction
Copyright © 2004 by Jinyan Li and Limsoon Wong
Outline
•
•
•
•
•
•
Practical Introduction to Biology
Brief Introduction to Bioinformatics
Overview of Gene Expression Profiling
Overview of Proteomic Profiling
A motivating Biomedical Application
Brief Introduction to Some Data Sets
Copyright © 2004 by Jinyan Li and Limsoon Wong
Practical Introduction
to Biology
Copyright © 2004 by Jinyan Li and Limsoon Wong
History
• 1866 Mendel discovered
genetics
• 1869 DNA discovered
• 1944 Avery & McCarty
demonstrated DNA as carrier of
genetic info
• 1953 Watson & Crick deduced
3D struct of DNA
• 1960 Elucidation of genetic
code, mapping DNA to protein
• 1970 Development of DNA
sequencing techniques:
sequence segmentation and
electrophoresis
Copyright © 2004 by Jinyan Li and Limsoon Wong
• 1980 Development of PCR:
exploiting natural replication,
amplify DNA samples so that
they are enough for doing expt
• 1990 Human Genome Project
• 2002 Human genome
published
• Now Understanding the detail
mechanism of the cell
Body
• Our body consists of a number of organs
• Each organ composes of a number of tissues
• Each tissue composes of cells of the same
type
Copyright © 2004 by Jinyan Li and Limsoon Wong
Cell
• Performs two types of function
– Chemical reactions necessary to maintain our life
– Pass info for maintaining life to next generation
• In particular
– Protein performs chemical reactions
– DNA stores & passes info
– RNA is intermediate between DNA & proteins
Copyright © 2004 by Jinyan Li and Limsoon Wong
Protein
• A sequence composed
from an alphabet of 20
amino acids
– Length is usually 20 to
5000 amino acids
– Average around 350
amino acids
• Folds into 3D shape,
forming the building
blocks & performing
most of the chemical
reactions within a cell
Copyright © 2004 by Jinyan Li and Limsoon Wong
Amino Acid
• Each amino acid consist of
– Amino group
– Carboxyl group
– R group
Amino
group
NH2
Carboxyl group
H
O
C
C
R
C
(the central carbon)
Copyright © 2004 by Jinyan Li and Limsoon Wong
OH
R group
Classification of Amino Acids
• Amino acids can be
classified into 4 types.
• Positively charged (basic)
– Arginine (Arg, R)
– Histidine (His, H)
– Lysine (Lys, K)
• Negatively charged
(acidic)
– Aspartic acid (Asp, D)
– Glutamic acid (Glu, E)
Copyright © 2004 by Jinyan Li and Limsoon Wong
Classification of Amino Acids
• Polar (overall uncharged, • Nonpolar (overall
but uneven charge
uncharged and uniform
distribution. can form
charge distribution. cant
hydrogen bonds with
form hydrogen bonds
water. they are called
with water. they are
hydrophilic)
called hydrophobic)
–
–
–
–
–
–
–
Asparagine (Asn, N)
Cysteine (Cys, C)
Glutamine (Gln, Q)
Glycine (Gly, G)
Serine (Ser, S)
Threonine (Thr, T)
Tyrosine (Tyr, Y)
Copyright © 2004 by Jinyan Li and Limsoon Wong
–
–
–
–
–
–
–
–
Alanine (Ala, A)
Isoleucine (Ile, I)
Leucine (Leu, L)
Methionine (Met, M)
Phenylalanine (Phe, F)
Proline (Pro, P)
Tryptophan (Trp, W)
Valine (Val, V)
Protein & Polypeptide Chain
• Formed by joining amino acids via peptide bond
• One end the amino group, called N-terminus
• The other end is the carboxyl group, called C-terminus
H
O
H
O
NH2 C
C
OH + NH2 C
C
R
R’
NH2
Peptide bond
OH
H
O
C
C
R
Copyright © 2004 by Jinyan Li and Limsoon Wong
H
O
N
C
C
H
R’
OH
DNA
• Stores instruction
needed by the cell to
perform daily life function
• Consists of two strands
interwoven together and
form a double helix
• Each strand is a chain of
some small molecules
called nucleotides
Francis Crick shows James Watson the model of DNA
in their room number 103 of the Austin Wing at the
Cavendish Laboratories, Cambridge
Copyright © 2004 by Jinyan Li and Limsoon Wong
Nucleotide
• Consists of three parts:
– Deoxyribose
– Phosphate (bound to the 5’ carbon)
– Base (bound to the 1’ carbon)
Base
(Adenine)
5`
4`
Phosphate
1`
3`
Copyright © 2004 by Jinyan Li and Limsoon Wong
2`
Deoxyribose
Classification of Nucleotides
• 5 diff nucleotides: adenine(A), cytosine(C), guanine(G),
thymine(T), & uracil(U)
• A, G are purines. They have a 2-ring structure
• C, T, U are pyrimidines. They have a 1-ring structure
• DNA only uses A, C, G, & T
A
C
Copyright © 2004 by Jinyan Li and Limsoon Wong
G
T
U
Watson-Crick rules
• Complementary bases:
– A with T (two hydrogen-bonds)
– C with G (three hydrogen-bonds)
C
A
T
10Å
Copyright © 2004 by Jinyan Li and Limsoon Wong
G
10Å
Orientation of a DNA
• One strand of DNA is generated by chaining together
nucleotides, forming a phosphate-sugar backbone
• It has direction: from 5’ to 3’, because DNA always
extends from 3’ end:
– Upstream, from 5’ to 3’
– Downstream, from 3’ to 5’
P
P
P
P
3’
5’
A
C
Copyright © 2004 by Jinyan Li and Limsoon Wong
G
T
A
Double Stranded DNA
• DNA is double stranded in a cell. The two
strands are anti-parallel. One strand is
reverse complement of the other
• The double strands are interwoven to
form a double helix
Copyright © 2004 by Jinyan Li and Limsoon Wong
Locations of DNAs in a Cell?
• Two types of organisms
– Prokaryotes (single-celled organisms with no nuclei. e.g., bacteria)
– Eukaryotes (organisms with single or multiple cells. their cells have
nuclei. e.g., plant & animal)
• In Prokaryotes, DNA swims within the cell
• In Eukaryotes, DNA locates within the nucleus
Copyright © 2004 by Jinyan Li and Limsoon Wong
Chromosome
• DNA is usually tightly wound around histone
proteins and forms a chromosome
• The total info stored in all chromosomes
constitutes a genome
• In most multi-cell organisms, every cell
contains the same complete set of
chromosomes
– May have some small different due to mutation
• Human genome has 3G base pairs, organized
in 23 pairs of chromosomes
Copyright © 2004 by Jinyan Li and Limsoon Wong
Gene
• A gene is a sequence of DNA that encodes a
protein or an RNA molecule
• About 30,000 – 35,000 (protein-coding) genes
in human genome
• For gene that encodes protein
– In Prokaryotic genome, one gene corresponds to
one protein
– In Eukaryotic genome, one gene can corresponds to
more than one protein because of the process
“alternative splicing”
Copyright © 2004 by Jinyan Li and Limsoon Wong
Complexity of Organism
vs. Genome Size
• Human Genome: 3G
base pairs
• Amoeba dubia (a single
cell organism): 600G
base pairs
Copyright © 2004 by Jinyan Li and Limsoon Wong
 Genome size has no
relationship with the
complexity of the
organism
Number of Genes vs. Genome Size
• Prokaryotic genome
(e.g., E. coli)
– Number of base pairs: 5M
– Number of genes: 4k
– Average length of a gene:
1000 bp
• Eukaryotic genome (e.g.,
human)
– Number of base pairs: 3G
– Estimated number of
genes: 30k – 35k
– Estimated average length
of a gene: 1000-2000 bp
Copyright © 2004 by Jinyan Li and Limsoon Wong
• ~ 90% of E. coli genome
are of coding regions.
• < 3% of human genome
is believed to be coding
regions
 Genome size has no
relationship with the
number of genes!
RNA
• RNA has both the
properties of DNA &
protein
• Nucleotide for RNA has
of three parts:
– Ribose Sugar (has an
extra OH group at 2’)
– Phosphate (bound to 5’
carbon)
– Base (bound to 1’ carbon)
– Similar to DNA, it can
store & transfer info
– Similar to protein, it can
form complex 3D structure
& perform some functions
Base
(Adenine)
5`
4`
Phosphate
1`
3`
Copyright © 2004 by Jinyan Li and Limsoon Wong
2`
Ribose Sugar
RNA vs DNA
• RNA is single stranded
• Nucleotides of RNA are similar to that of DNA,
except that have an extra OH at position 2’
– Due to this extra OH, it can form more hydrogen
bonds than DNA
– So RNA can form complex 3D structure
• RNA use the base U instead of T
– U is chemically similar to T
– In particular, U is also complementary to A
Copyright © 2004 by Jinyan Li and Limsoon Wong
Mutation
• Sudden change of
genome
• Basis of evolution
• Cause of cancer
• Can occur in DNA, RNA,
& Protein
Copyright © 2004 by Jinyan Li and Limsoon Wong
Central Dogma
• Gene expression
consists of two steps
– Transcription
DNA  mRNA
– Translation
mRNA  Protein
Copyright © 2004 by Jinyan Li and Limsoon Wong
Transcription
• Synthesize mRNA from
one strand of DNA
– An enzyme RNA
polymerase temporarily
separates doublestranded DNA
– It begins transcription at
transcription start site
– A  A, CC, GG, &
TU
– Once RNA polymerase
reaches transcription stop
site, transcription stops
Copyright © 2004 by Jinyan Li and Limsoon Wong
• Additional “steps” for
Eukaryotes
– Transcription produces
pre-mRNA that contains
both introns & exons
– 5’ cap & poly-A tail are
added to pre-mRNA
– RNA splicing removes
introns & mRNA is made
– mRNA are transported out
of nucleus
Translation
• Synthesize protein from
mRNA
• Each amino acid is
encoded by consecutive
seq of 3 nucleotides,
called a codon
• The decoding table from
codon to amino acid is
called genetic code
Copyright © 2004 by Jinyan Li and Limsoon Wong
• 43=64 diff codons
 Codons are not 1-to-1
corr to 20 amino acids
• All organisms use the
same decoding table
• Recall that amino acids
can be classified into 4
groups. A single-base
change in a codon is
usually not sufficient to
cause a codon to code
for an amino acid in
different group
Genetic Code
• Start codon: ATG (code for M)
• Stop codon: TAA, TAG, TGA
Copyright © 2004 by Jinyan Li and Limsoon Wong
Gene Structure
Coding region
Image credit: Bajic
Copyright © 2004 by Jinyan Li and Limsoon Wong
Ribosome
• Translation is handled by a molecular complex,
ribosome, which consists of both proteins &
ribosomal RNA (rRNA)
• Ribosome reads mRNA & the translation starts
at a start codon (the translation start site)
• With help of tRNA, each codon is translated to
an amino acid
• Translation stops once ribosome reads a stop
codon (the translation stop site)
Copyright © 2004 by Jinyan Li and Limsoon Wong
tRNA
• 61 diff tRNAs, each
corresponds to a nontermination codon
• Each tRNA folds to form
a cloverleaf-shaped
structure
– One side holds an
anticodon
– The other side holds the
appropriate amino acid
Copyright © 2004 by Jinyan Li and Limsoon Wong
Introns and exons
• Eukaryotic genes
contain introns & exons
– Introns are seq that are
ultimately spliced out of
mRNA
– Introns normally satisfy
GT-AG rule, viz. begin w/
GT & end w/ AG
– Each gene can have
many introns & each
intron can have thousands
bases
Copyright © 2004 by Jinyan Li and Limsoon Wong
• Introns can be very long
• An extreme example is a
gene associated with
cystic fibrosis in human:
– Length of 24 introns ~1Mb
– Length of exons ~1kb
Basic Biotechnology Tools
• Cutting & breaking DNA
– Restriction Enzymes
– Shortgun method
• Copying DNA
– Cloning
– Polymerase Chain Reaction (PCR)
• Measuring length of DNA
– Gel Electrophoresis
Copyright © 2004 by Jinyan Li and Limsoon Wong
Restriction Enzymes
• Recognize certain point,
called restriction site, in
DNA w/ particular pattern
& break it. This process
is called digestion
• In nature, restriction
enzymes are used to
break foreign DNA to
avoid infection
Copyright © 2004 by Jinyan Li and Limsoon Wong
• Example
– EcoRI is the 1st restriction
enzyme discovered that
cuts DNA wherever the
sequence GAATTC is
found
– Similar to most of the
other restriction enzymes,
GAATTC is a palindrome
• > 300 known restriction
enzymes have been
discovered
Shotgun Method
• Break DNA molecule into small pieces
randomly like this:
– Solution w/ a large amount of purified DNA
– Apply high vibration to break each molecule
randomly into small fragments
Copyright © 2004 by Jinyan Li and Limsoon Wong
Cloning by Plasmid Vector
• Insert DNA X into plasmid
vector w/ antibiotic-resistant
gene & a recombinant DNA
molecule is formed
• Insert recombinants into host
cells
• Grow host cells in presence of
antibiotic
– Only cells w/ antibioticresistant gene can grow
– When we duplicate host cell, X
is also duplicated
• Select cells w/ antibioticresistance genes
• Kill them & extract X
Copyright © 2004 by Jinyan Li and Limsoon Wong
Polymerase Chain Reaction (PCR)
• Allows rapid replication
of a selected region of a
DNA w/o need for a
living cell
• Inputs for PCR:
– Two oligonucleotides are
synthesized, each
complementary to the two
ends of the region. They
are used as primers.
– Thermostable DNA
polymerase TaqI
Copyright © 2004 by Jinyan Li and Limsoon Wong
• Repeats a cycle w/ 3
phases 25-30 times.
Each cycle takes ~5min
– Phase 1: separate double
stranded DNA by heat
– Phase 2: cool; add
synthesis primers
– Phase 3: add DNA
polymerase TaqI to
catalyze 5’ to 3’ DNA
synthesis
 Selected region has
been amplified
exponentially
Gel Electrophoresis
• Used to separate a mixture of DNA fragments
of different lengths
• Apply an electrical field to mixture of DNA
• Note that DNA is negative charged. Small
molecules travel faster than large molecules
• Mixture is separated into bands, each
containing DNA molecules of same length
Copyright © 2004 by Jinyan Li and Limsoon Wong
Sequencing by Gel Electrophoresis
• An application of gel electrophoresis is to
reconstruct DNA sequence of length 500-800
within a few hours
– Generate all sequences ending with A
– Separate sequences ending with A into diff bands
using gel electrophoresis
 Such info tells us positions of A’s in the DNA
– Similarly for C, G, & T
Copyright © 2004 by Jinyan Li and Limsoon Wong
Sequencing by Gel Electrophoresis:
Reading the Sequence
•
•
•
•
Four groups of fragments: A, C, G, & T
Fragments are placed in negative end
Fragments move to positive end
From relative distances of fragments,
reconstruct the sequence
……TCAACATGT
Copyright © 2004 by Jinyan Li and Limsoon Wong
Hybridization
• Among thousands of
DNA fragments,
biologists routinely need
to find a DNA fragment
that contains a particular
DNA subsequence
Copyright © 2004 by Jinyan Li and Limsoon Wong
• This can be done based
on hybridization
– Suppose we need to find
DNA fragments that
contain ACCGAT
– Make probes inversely
complement to ACCGAT
– Mix probes w/ DNA
fragments
– Due to hybridization rule
(A=T, CG), DNA
fragments containing
ACCGAT will hybridize
with probes
Brief Introduction to
Bioinformatics
Copyright © 2004 by Jinyan Li and Limsoon Wong
Some Bioinformatics Problems
•
•
•
•
•
•
•
•
•
•
Biological Data Searching
Gene/Promoter finding
Cis-regulatory DNA
Gene/Protein Network
Protein/RNA Structure Prediction
Evolutionary Tree reconstruction
Infer Protein Function
Disease Diagnosis
Disease Prognosis
Disease Treatment Optimization, ...
Copyright © 2004 by Jinyan Li and Limsoon Wong
Biological Data Searching
• Biological Data is
increasing rapidly
• Biologists need to locate
required info
• Difficulties:
–
–
–
–
–
Too much
Too heterogeneous
Too distributed
Too many errors
Due to mutation, need
approximate search
Copyright © 2004 by Jinyan Li and Limsoon Wong
Cis-Regulatory DNAs
• Cis-regulatory DNAs
control whether genes
should express or not
• Cis-regulatory may
locate in promoter region,
intron, or exon
• Finding and
understanding cisregulatory DNAs is one
of the key problem in
coming years
Copyright © 2004 by Jinyan Li and Limsoon Wong
Image credit: US DOE
Gene Networks
• Inside a cell is a complex
system
• Expression of one gene
depends on expression
of another gene
• Such interactions can be
represented using gene
network
• Understanding such
networks helps identify
association betw genes
& diseases
Copyright © 2004 by Jinyan Li and Limsoon Wong
Protein/RNA structure prediction
• Structure of Protein/RNA
is essential to its
functionality
• Important to have some
ways to predict the
structure of a
protein/RNA given its
sequence
• This problem is
important & it is always
considered as a “grand
challenge” problem in
bioinformatics
Copyright © 2004 by Jinyan Li and Limsoon Wong
Image credit: Kolatkar
Evolutionary Tree Reconstruction
• Protein/RNA/DNA mutates
• Evolutionary Tree studies
evolutionary relationship
among set of
protein/RNA/DNAs
• Figures out origin of
species
189, 217
Root
189, 217, 261
150000
years ago
African
100000
years ago
Asian
189, 217, 247, 261
Image credit: Sykes
Copyright © 2004 by Jinyan Li and Limsoon Wong
50000
years ago
Papuan
present
European
Gene Expression
Profiling
Copyright © 2004 by Jinyan Li and Limsoon Wong
What’s a Microarray?
• Idea of hybridization leads to DNA array tech
• In the past, “one gene in one experiment”
 Hard to get whole picture
• DNA array is a technology that contains large
number of “DNA molecules” spotted on glass
slides, nylon membranes, or silicon wafers
 Measure expression of thousands of genes
simultaneously
Copyright © 2004 by Jinyan Li and Limsoon Wong
Affymetrix GeneChip™ Array
Image credit: Affymetrix
Copyright © 2004 by Jinyan Li and Limsoon Wong
Making GeneChip™ Array
quartz is washed to ensure uniform
hydroxylation across its surface and
to attach linker molecules
exposed linkers become deprotected
and are available for nucleotide
coupling
Image credit: Affymetrix
Copyright © 2004 by Jinyan Li and Limsoon Wong
Gene Expression Measurement by
GeneChip™ Array
Image credit: Affymetrix
Copyright © 2004 by Jinyan Li and Limsoon Wong
A Sample GeneChip™ File
Copyright © 2004 by Jinyan Li and Limsoon Wong
Applications of DNA arrays
• Sequencing by
hybridization
– Promising alternative to
sequencing by gel
electrophoresis
– May be able to reconstruct
longer DNA sequences in
shorter time
 SNP discovery
• Profiling of gene
expression
– DNA arrays allow us to
monitor activities within a
cell
– Each spot contains
complement of a gene
– Due to hybridization, we
can measure
concentration of diff
mRNAs within cell
 disease diagnosis
 disease prognosis
 target discovery
Copyright © 2004 by Jinyan Li and Limsoon Wong
Proteomic Profiling
Copyright © 2004 by Jinyan Li and Limsoon Wong
What is Proteomics?
• It is proteins that are
directly involved in both
normal and diseaseassociated biochemical
processes
• A more complete
understanding of
disease may be gained
by looking directly at the
proteins present within a
diseased cell or tissue
• Achieved thru the
proteome & proteomics
Copyright © 2004 by Jinyan Li and Limsoon Wong
• Proteomics is scientific
discipline that detects
proteins associated with
a disease by means of
their altered levels of
expression betw control
& disease states.
• Proteome research
permits discovery of new
protein markers for
diagnostic purposes & of
novel molecular targets
for drug discovery
Expt Procedures in Proteomics
• Separation
– electrophoresis (1-D, 2-D)
– chromatography (SEC, ion
exchange, reversed phase)
• Digestion
– chemical (BrCN)
– enzymatic (trypsin, Lys-C, Asp-C)
– reduction (Di-Thio-Threitol, bMercapto-Ethanol)
– alkylation
(IodoAcAcid,
IodoAcAmide, Vynil Pyridine)
• Sample clean-up
– chromatography (rev phase)
– solid phase extraction (Zip Tip)
Copyright © 2004 by Jinyan Li and Limsoon Wong
• MS ANALYSIS
– protein identification (peptide
mass fingerprinting)
– peptide structural information
(post source decay)
Ciphergen Protein Chip® System
Image credit: Ciphergen
Copyright © 2004 by Jinyan Li and Limsoon Wong
Protein Chip® Processes
SELDITOF-MS
Image credit: Ciphergen
Copyright © 2004 by Jinyan Li and Limsoon Wong
Schematic of Protein Chip® Reader
Image credit: Ciphergen
Copyright © 2004 by Jinyan Li and Limsoon Wong
Protein Chip® Array Surfaces
Image credit: Ciphergen
Copyright © 2004 by Jinyan Li and Limsoon Wong
A sample proteomic profile
Image credit: Petricoin
Copyright © 2004 by Jinyan Li and Limsoon Wong
Applications of Proteomics
 Disease diagnosis
 Disease prognosis
 Target discovery
Copyright © 2004 by Jinyan Li and Limsoon Wong
A Motivating Application:
Optimizing Treatment
of Childhood ALL
Image credit: FEER
Copyright © 2004 by Jinyan Li and Limsoon Wong
Childhood ALL
• Major subtypes are: TALL, E2A-PBX, TEL-AML,
MLL genome
rearrangements,
Hyperdiploid>50, BCR-ABL
• Diff subtypes respond
differently to same Tx
• Over-intensive Tx
– Development of
secondary cancers
– Reduction of IQ
• Under-intensiveTx
– Relapse
Copyright © 2004 by Jinyan Li and Limsoon Wong
• The subtypes look
similar
• Conventional diagnosis
– Immunophenotyping
– Cytogenetics
– Molecular diagnostics
• Unavailable in most
ASEAN countries
Single-Test Platform of
Microarray & Machine Learning
Image credit: Affymetrix
Copyright © 2004 by Jinyan Li and Limsoon Wong
Impact
Conventional Tx:
• intermediate intensity to
everyone
 10% suffers relapse
 50% suffers side effects
 costs US$150m/yr
Our optimized Tx:
• high intensity to 10%
• intermediate intensity to 40%
• low intensity to 50%
• costs US$100m/yr
Copyright © 2004 by Jinyan Li and Limsoon Wong
•High cure rate of 80%
• Less relapse
• Less side effects
• Save US$51.6m/yr
Some Sample Data at
http://research.i2r.astar.edu.sg/rp
Copyright © 2004 by Jinyan Li and Limsoon Wong
Kent Ridge Biomedical Data Set
Repository
• http://research.i2r.a-star.edu.sg/rp
• Store high-dimensional biomedical data sets:
– gene expression data
– protein profiling data
– genomic sequence data
• All data are for classification purposes:
–
–
–
–
–
disease diagnosis
subtype classification
relapse study
genomic feature prediction
....
Copyright © 2004 by Jinyan Li and Limsoon Wong
Convenient File Formats
• Original raw data are of
formats diff from c’mon
ones used in machine
learning softwares
• All original raw data files
have been converted
into plain text data files
under the same schema,
where every row in the
new file is a commapunctuated string, like a
vector, representing a
labelled data sample
Copyright © 2004 by Jinyan Li and Limsoon Wong
• Re-formatted data files
have extension .data; &
feature names are saved
in a separate file w/
extension .names. Equiv
.arff format also provided
• Such transformed data
files can be directly fed
to c’mon machine
learning s/w packages
such as C5.0, MLC++, &
WEKA
Breast Cancer Outcome Prediction
Gene Expression Data
• Van't Veer et al., Nature
415:530-536, 2002
• Training set contains 78
patient samples
– 34 patients develop distance metastases in 5 yrs
– 44 patients remain healthy
from the disease after
initial diagnosis for >5 yrs
• Testing set contains 12
relapse & 7 non-relapse
samples
• No of genes is 24481
Copyright © 2004 by Jinyan Li and Limsoon Wong
Image credit: Veer
CNS Embryonal Tumour Outcome
Prediction Gene Expression Data
• Pomeroy et al., Nature
415:436-442, 2002
• Survivors are patients
who are alive after
treatment, while the
failures are those who
succumb to their disease
• 60 patient samples
– 21 are survivors
– 39 are failures
• There are 7129 genes in
the dataset
Copyright © 2004 by Jinyan Li and Limsoon Wong
Image credit: Pomeroy
Ovarian Cancer Diagnosis
Proteomic Profiling Data
• Petricoin et al., Lancet
359:572-577, 2002
• 253 serum samples of
proteomic spectra
generated by mass spec
– 91 controls (Normal)
– 162 ovarian cancers
• Each sample contains
15154 M/Z identities
Image credit: Petricoin
Copyright © 2004 by Jinyan Li and Limsoon Wong
Lung Cancer Diagnosis Gene
Expression Data
• Gordon et al., Cancer
Res 62:4963-4967, 2002
• Lung Malignant pleural
mesothelioma (MPM) vs
adenocarcinoma (ADCA)
• 149 testing samples
– 15 MPM
– 134 ADCA
• 32 training samples
– 16 MPM
– 16 ADCA
• Each sample described
by 12533 genes
Copyright © 2004 by Jinyan Li and Limsoon Wong
Image credit: Gordon
Translation Initiation Site Prediction
Data
• Liu & Wong, JBCB
1:139-168, 2003
• Used to find Translation
Initiation Site (TIS),
where translation from
mRNA to protein
initiates.
• 3312 seq in raw data
– 3312 true ATG
– 10063 false ATG
• 927 features
Copyright © 2004 by Jinyan Li and Limsoon Wong
Image credit: Liu
Childhood Acute Lymphoblastic
Leukemia Gene Expression Data
• Yeoh et al., Cancer Cell
1:133-143, 2002
• Classifying 6 subtypes of
childhood acute
lymphoblastic leukemia
• 215 training samples
• 112 testing samples
• Each sample has
expression value of
12558 genes
Copyright © 2004 by Jinyan Li and Limsoon Wong
Image credit: Yeoh
Any Question?
Copyright © 2004 by Jinyan Li and Limsoon Wong
Acknowledgements
• Many slides presented in this part of the tutorial
are derived from a course jointly taught at NUS
SOC with Ken Sung
Copyright © 2004 by Jinyan Li and Limsoon Wong