Array Based Cancer Diagnostics:
Gene Expression Profiling of DNA
Microarray Data
Abdoulaye Samb
DPS 2005
Proceedings Student Research
May 06, 2005
Outline
• Brief Overview of Bioinformatics
• Microarray Technology
• Motivation and Potential Impacts
• “Peano”
• Conclusion
Brief Overview of Bioinformatics
• The term was first coined in 1988 by Dr. Hwa Lim
• The original definition was: “a collective term for data compilation, organisation, analysis and dissemination”
• Using information technology to help solve biological problems by designing novel and incisive algorithms and methods of analysis
• It also serves to establish innovative software and to create new and maintain existing databases of information, allowing open access to the records held within them.
Brief Overview of Bioinformatics
• ‘Bioinformatics’ - the new ‘buzz word’ in the scientific community
• It is an umbrella term for genomics, proteomics, evolution and computer science
• It is now necessary for scientists to be inter-disciplinary
• The data is collected from a variety of sources
• The terminology is specific to its particular branch of
science
Overview of Bioinformatics
• To allow the effortless transfer of information gathered and
the interrogation of databases across the global interface
• i.e. to make the data easily and universally interpretable by
scientists.
• It is a discipline vital in the era of post-genomics.
• Biologists have been classifying data on species of plants
and animals since the 17th century
• The knowledge acquired has escalated in harmony with the
evolution of technology
In a nutshell….
• Bioinformatics will also serve to advance medical research
regarding the drug discovery process and therapeutic intervention.
• Implementing the information disclosed permits us to discern
biological systems well enough and at such a level to build
models reflecting how natural pathways/processes work, to
predict their response and behavior, to manipulate them, as well
as to identify defects in order to better understand and fight
disorders and disease.
What is Microarray?
• A way of studying how large numbers of genes interact
with each other and how a cell's regulatory networks
control vast batteries of genes simultaneously.
• The method uses a robot to precisely apply tiny droplets
containing functional DNA to glass slides.
• Fluorescent labels are attached to the DNA of the cells under study.
What is Microarray?
• Genetics began when Mendel proved his laws of heredity with varieties of peas and flowers in 1865
• The invention of the compound microscope in the 19th century
• The first protein to be sequenced – insulin
• The first complete sequencing of an enzyme, ribonuclease in
1960
• To the sequencing of the first complete genome (Haemophilus
influenzae) published in 1995
• Since then, we have moved on to technologies permitting the sequencing, recombination and cloning of DNA
Microarray Technology
• Referred to by other names:
– microchip, biochip, DNA chip, DNA microarray, gene
array, GeneChip®, and genome chip
• Microarray analysis encompasses:
– Data Capture
– Data Mining
• Making sense of gene expression data
– Visualization and Interfaces
• How to make all of this complicated data and
analysis software accessible
Microarray
• Two general types that are popular
– Spotted Arrays (Pat Brown, Stanford)
– Oligonucleotide Arrays (Affymetrix)
• Both based on the same basic principles
– Anchoring pieces of DNA to glass/nylon slides
– Complementary hybridization
Motivation
• Microarrays provide a tool for answering a wide variety of
questions about the dynamics of cells:
– In which cells is each gene active?
– Under what environmental conditions is each gene
active?
– How does the activity level of a gene change under
different conditions?
• Stage of a cell cycle?
• Environmental conditions?
• Diseases?
– What genes seem to be regulated together?
Motivation (2)
• Microarrays may be used to assay gene expression within a single sample or to compare gene expression in two different cell types or tissue samples, such as healthy and diseased tissue.
– Follow population of (synchronized) cells over time, to
see how expression changes (vs. baseline)
– Expose cells to different external stimuli and measure
their response (vs. baseline)
– Take cancer cells (or other pathology) and compare to
normal cells.
Potential Impacts
• Preventative medicine
• Ability to subtype disease and design drugs that treat
disease causes, rather than symptoms
• Specific genotype (population) targeted drugs
• Targeted drug treatments
Spotted Arrays
• 1. Control cells (left) and target cells (right)
• 2. Harvesting mRNA from both cell groups
• 3. Tagging the mRNA with green and red dye
• 4. Applying the mRNA to the cDNA microarray
• 5. Reading the result using a laser
• 6. A false-color composite representing the results
Microarray Animation
Oligonucleotide Arrays
• Gene Chips
– Instead of putting entire genes on array, put sets of
DNA 25-mers (synthesized oligonucleotides)
– Produced using a photolithography process similar to
the ones used to make semiconductor chips
– mRNA samples are processed separately instead of in
pairs
• Condition
– 1. Internal cellular physiology from
different cell lines
– 2. Diverse physiological conditions in an
intact organism
– 3. Pathological tissues specimens from
patients
– 4. Serial time points following a
stimulus to a cell or organism
• A profile is the list of measurements along each row or column.
• Features are the individual expression measurements within each profile.
• Where does computer science come into it?
• Bioinformaticians act as a bridge between the two sciences.
• The Human Genome Project (HGP) has brought to light the limitations of traditional lab work: although mostly automated, it is expensive and time-consuming
• We need to incorporate original techniques to allow greater
understanding of protein function, protein-protein
interactions and protein-DNA interactions and put it all in a
cellular context
• There is a gap between the data stored and its biological significance
“Peano” Method for Association Rule Mining
• “Peano” is a technology that employs Association Rule Mining as a means of mining microarray data.
• Association Rule Mining is an advanced data mining technique that is useful in deriving meaningful rules from a given data set.
• Our approach proposes a new microarray data mining
technology, which involves a "Data Mining Ready" data
structure, called Peano count tree (P-tree), to measure gene
expression levels.
• The method involves treating the microarray data as spatial data
“Peano” Method for Association Rule Mining
• Each spot on the microarray is represented as a pixel with corresponding red and green ratios.
• The microarray data is reorganized into an 8-bit bSQ file (where each attribute or band is stored as a separate file)
• Each bit is then converted into a quadrant-based tree structure, the P-tree, from which a data cube is constructed and meaningful rules are readily obtained.
• An association rule is a relationship of the form X ⇒ Y, where X is the antecedent itemset and Y is the consequent itemset
• An example of such a rule is "customers who purchase an item X are very likely to purchase another item Y at the same time."
• The rule X ⇒ Y has support s% in the transaction set T if s% of the transactions in T contain X ∪ Y.
• The rule has confidence c% if c% of the transactions in T that contain X also contain Y.
• The goal of association rule mining is to find all the rules with support and confidence exceeding some user-specified thresholds.
• The data mining model for a market research dataset can be treated as a relation R(Tid, i1, ..., in),
• where Tid is the transaction identifier and i1, ..., in denote the feature attributes: all the items available for purchase from the store.
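The support and confidence definitions above can be illustrated with a minimal Python sketch. The toy transactions and the function names `support` and `confidence` are hypothetical and chosen only for illustration; they are not part of the original method.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing X, the fraction that also contain Y."""
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    x_support = support(transactions, x)
    return support(transactions, xy) / x_support if x_support else 0.0


# Hypothetical market-basket data: one set of purchased items per Tid.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

rule_support = support(transactions, {"bread", "milk"})          # X ∪ Y
rule_confidence = confidence(transactions, {"bread"}, {"milk"})  # X ⇒ Y
```

A rule is reported only when both values exceed the user-specified thresholds.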
• Transactions constitute the rows in the data table, whereas items form the columns. The values of the items for the different transactions are in binary representation
• The microarray data is currently represented as a relation R(Gid, T1, ..., Tn),
• where Gid is the gene identifier for each gene and T1, ..., Tn are the various kinds of treatments to which the genes were exposed.
• The genes constitute the rows in the data table, whereas treatments are the columns.
• The values are in the form of normalized Red/Green color ratios representing the abundance of transcript for each spot on the microarray
• This table can be called a "Gene table". Currently, the data mining techniques of clustering and classification are being applied to the Gene table
Clustering Data format
• Divide dataset into clusters/classes by grouping on the
rows (genes).
• Design a data format called "Treatment table" formed by
flipping the gene table.
• The relation R of a Treatment table can be represented as R(Tid, G1, ..., Gn),
• where Tid represents the treatment ids and G1, ..., Gn are the gene identifiers
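Flipping the Gene table into a Treatment table amounts to a transpose. The following is a minimal sketch; the gene ids, treatment ids and ratio values are made up for illustration, and `to_treatment_table` is a hypothetical helper name.

```python
def to_treatment_table(gene_table):
    """Transpose rows (genes) and columns (treatments) of a rectangular table."""
    return [list(column) for column in zip(*gene_table)]


# Hypothetical Gene table R(Gid, T1..T4): one row per gene.
gene_ids = ["G1", "G2", "G3"]
treatment_ids = ["T1", "T2", "T3", "T4"]
gene_table = [
    [2.5, 0.8, 1.1, 3.0],  # G1 under T1..T4
    [0.4, 2.2, 2.9, 0.9],  # G2
    [1.0, 1.0, 2.1, 2.6],  # G3
]

# Treatment table R(Tid, G1..G3): one row per treatment.
treatment_table = to_treatment_table(gene_table)
```

After the flip, each row is one treatment and each column one gene, so associating columns means associating genes.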
• The goal here is to mine for rules among the genes by associating the columns (genes) in the Treatment table. The Treatment table can be viewed as a 2-dimensional array of gene expression values
• The Treatment table can be organized into a spatial format called bit Sequential (bSQ), proposed by Qin Ding, Qiang Ding and William Perrizo [1].
• Bit Sequential (bSQ) is a new data format for representing spatial data.
• There are several reasons to use the bSQ format. First,
different bits have different degrees of contribution to the
value. In some applications, the high-order bits alone
provide the necessary information.
• Second, the bSQ format facilitates the representation of a
precision hierarchy (from one bit precision up to eight bit
precision).
• Third, the bSQ format facilitates better compression through the creation of an efficient, rich data structure called the “Peano” Count Tree, and accommodates algorithm pruning based on a one-bit-at-a-time approach.
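The idea of splitting values into bit planes, and of a precision hierarchy built from the high-order bits, can be sketched as follows. This assumes unsigned 8-bit values held in a flat list; the helper names `to_bsq` and `from_bsq` are hypothetical, and the real bSQ format stores each plane as a separate file rather than as an in-memory list.

```python
def to_bsq(values, bits=8):
    """Split unsigned integers into `bits` bit planes, high-order plane first."""
    return [[(v >> b) & 1 for v in values] for b in range(bits - 1, -1, -1)]


def from_bsq(planes, keep=None):
    """Rebuild values from the first `keep` high-order planes (others read as 0)."""
    bits = len(planes)
    keep = bits if keep is None else keep
    values = [0] * len(planes[0])
    for i in range(keep):
        weight = 1 << (bits - 1 - i)  # place value of plane i
        for j, bit in enumerate(planes[i]):
            values[j] += bit * weight
    return values


values = [200, 13, 255, 0]          # hypothetical 8-bit expression levels
planes = to_bsq(values)
coarse = from_bsq(planes, keep=2)   # 2-bit precision: only the two high-order bits
```

Keeping fewer planes gives a coarser but cheaper view of the same data, which is the precision hierarchy mentioned above.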
Bit Sequence File
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1
P-tree
55
|-- 16
|-- 8
|   |-- 3   (leaf bits 1110)
|   |-- 0
|   |-- 4
|   |-- 1   (leaf bits 0010)
|-- 15
|   |-- 4
|   |-- 4
|   |-- 3   (leaf bits 1101)
|   |-- 4
|-- 16
PM-tree
m
|-- 1
|-- m
|   |-- m   (leaf bits 1110)
|   |-- 0
|   |-- 1
|   |-- m   (leaf bits 0010)
|-- m
|   |-- 1
|   |-- 1
|   |-- m   (leaf bits 1101)
|   |-- 1
|-- 1
Figure 1: P-tree and PM tree for 8x8 image
• Figure 1 can be considered as a set of 8-row-8-column
microarray data, representing the expression levels for 64
different treatments for a single gene.
• 55 is the number of 1's in the entire microarray data set.
This root level is labeled as level 0. The numbers at the
next level (level 1), 16, 8, 15 and 16, are the 1-bit counts
for the four major quadrants
• The first and last quadrants are composed entirely of 1-bits (each is called a "pure 1 quadrant").
• We do not need sub-trees for these two quadrants, so these
branches terminate.
• Similarly, quadrants composed entirely of 0-bits are called
"pure 0 quadrants" which also terminate the tree branches.
• This pattern is continued recursively using the Peano or Z-ordering of the four sub-quadrants at each new level.
• Every branch terminates eventually (at the "leaf" level, each quadrant is a pure quadrant).
• If we were to expand all sub-trees, including those for pure quadrants, then the leaf sequence is just the Peano-ordering of the original raster data.
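The recursive construction described above can be sketched in Python. This is an illustrative sketch, not the authors' implementation: the tuple-based node layout and the names `build_ptree` and `to_pmtree` are assumptions, and the 8x8 bit matrix is the one implied by the counts and leaf bits of Figure 1.

```python
# 8x8 bit matrix consistent with the counts in Figure 1 (55 one-bits in total).
IMAGE = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]


def build_ptree(img, r0=0, c0=0, size=None):
    """Return (one_bit_count, children); pure quadrants have no children."""
    if size is None:
        size = len(img)
    count = sum(img[r][c]
                for r in range(r0, r0 + size)
                for c in range(c0, c0 + size))
    if count == 0 or count == size * size:  # pure-0 or pure-1: branch terminates
        return (count, [])
    half = size // 2
    # Peano (Z) order: upper-left, upper-right, lower-left, lower-right.
    children = [
        build_ptree(img, r0, c0, half),
        build_ptree(img, r0, c0 + half, half),
        build_ptree(img, r0 + half, c0, half),
        build_ptree(img, r0 + half, c0 + half, half),
    ]
    return (count, children)


def to_pmtree(node, size):
    """Relabel counts as '1' (pure 1), '0' (pure 0) or 'm' (mixed)."""
    count, children = node
    label = "0" if count == 0 else "1" if count == size * size else "m"
    return (label, [to_pmtree(child, size // 2) for child in children])


ptree = build_ptree(IMAGE)
pmtree = to_pmtree(ptree, len(IMAGE))
level1_counts = [child[0] for child in ptree[1]]
```

The root count and the level-1 counts reproduce the numbers quoted for Figure 1, and the PM-tree is obtained from the same structure by replacing counts with pure/mixed labels.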
Binary Representation
         Gene
Array    A  B  C  D  E
1        0  1  0  1  0
2        1  0  1  0  0
3        1  0  1  0  1
4        0  1  1  0  0
5        1  1  0  0  0
6        1  0  1  0  1
7        0  1  1  0  0
8        0  0  0  0  0
Determine a cutoff and convert any value >= cutoff (2.0) to 1, others to 0.
Up-regulated genes have a value of 1.
Values could instead be converted to -1, 0, and 1 to look at both up-regulation and down-regulation.
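The cutoff step can be sketched as below. The 2.0 cutoff comes from the slide; the 0.5 down-regulation cutoff in `ternarize` is an assumption made here for illustration (the reciprocal of 2.0), and the ratio values are made up.

```python
def binarize(table, cutoff=2.0):
    """Map each ratio to 1 if it is >= cutoff (up-regulated), else 0."""
    return [[1 if value >= cutoff else 0 for value in row] for row in table]


def ternarize(table, up=2.0, down=0.5):
    """Map ratios to 1 (up-regulated), -1 (down-regulated) or 0 (unchanged).

    The `down` threshold is an assumed value, not taken from the source.
    """
    return [[1 if value >= up else -1 if value <= down else 0 for value in row]
            for row in table]


ratios = [            # hypothetical normalized red/green ratios
    [2.0, 1.9],
    [3.5, 0.2],
]
binary = binarize(ratios)
ternary = ternarize(ratios)
```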
• Setting rules can provide valuable information to the
biologist as to the gene regulatory pathways and identify
important relationships between the different gene
expression patterns.
• The biologist may be interested in some specific kinds of rules, which can be called "rules of significance"
• In gene regulatory pathways, a biologist may be interested
in identifying genes that govern the expression of other
sets of genes.
• These relationships can be represented as {G1, ..., Gn} ⇒ Gm, where G1, ..., Gn represent the antecedent and Gm represents the consequent of the rule.
• The intuitive meaning of this rule is that, for a given confidence level, the expression of genes G1, ..., Gn will result in the expression of gene Gm.
• The algorithm used here is described in Figure 2 as the P-ARM algorithm; it assumes a fixed precision in all bands, for example the 3-bit precision used for our experiment.
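Rules with a single consequent gene can be enumerated by brute force on a small binary table. The sketch below is not P-ARM; it only illustrates the rule form, using the Binary Representation table read as arrays 1 to 8 by genes A to E (that reading of the flattened table is an assumption). The thresholds and the function name are hypothetical.

```python
from itertools import combinations

GENES = ["A", "B", "C", "D", "E"]
ROWS = [  # one row per array, one column per gene (1 = up-regulated)
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]


def single_consequent_rules(rows, genes, min_support, min_confidence,
                            max_antecedent=2):
    """Return (antecedent, consequent, support, confidence) for each passing rule."""
    transactions = [frozenset(g for g, bit in zip(genes, row) if bit)
                    for row in rows]
    n = len(transactions)
    rules = []
    for k in range(1, max_antecedent + 1):
        for antecedent in combinations(genes, k):
            x = frozenset(antecedent)
            x_count = sum(1 for t in transactions if x <= t)
            if x_count == 0:
                continue  # antecedent never observed, confidence undefined
            for gm in genes:
                if gm in x:
                    continue
                xy_count = sum(1 for t in transactions if x <= t and gm in t)
                sup, conf = xy_count / n, xy_count / x_count
                if sup >= min_support and conf >= min_confidence:
                    rules.append((antecedent, gm, sup, conf))
    return rules


rules = single_consequent_rules(ROWS, GENES, min_support=0.25,
                                min_confidence=0.75)
```

With these thresholds the sketch reports, among others, {E} ⇒ A: every array in which E is up-regulated also has A up-regulated.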
Peano Algorithm
Procedure P-ARM
{
  Data Discretization;
  F1 = {frequent 1-Asets};
  For (k = 2; Fk-1 ≠ ∅; k++) do begin
    Ck = p-gen(Fk-1);
    Forall candidate Asets c ∈ Ck do
      c.count = rootcount(c);
    Fk = {c ∈ Ck | c.count >= minsup}
  end
  Answer = ∪k Fk
}
Figure 2: P-ARM algorithm
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Fk-1 p, Fk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2,
      p.itemk-1 < q.itemk-1,
      p.itemk-1.group <> q.itemk-1.group
Figure 3: Join step in p-gen function
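The level-wise loop of Figure 2 and the join step of Figure 3 can be sketched in Python. Two simplifications are assumed here and should be read as such: support is counted by scanning the transactions rather than by `rootcount` on P-trees, and no separate prune step is applied after the join. Items are modelled as hypothetical (group, value) pairs so that the group test of the join can be shown.

```python
def p_gen(prev_frequent):
    """Join (k-1)-itemsets that share their first k-2 items (Figure 3)."""
    prev = sorted(prev_frequent)
    candidates = []
    for p in prev:
        for q in prev:
            same_prefix = p[:-1] == q[:-1]
            ordered = p[-1] < q[-1]
            different_group = p[-1][0] != q[-1][0]  # item = (group, value)
            if same_prefix and ordered and different_group:
                candidates.append(p + (q[-1],))
    return candidates


def p_arm(transactions, minsup):
    """Return all itemsets whose absolute count is >= minsup, level by level."""
    def count(itemset):
        # Stand-in for rootcount(c): a plain scan over the transactions.
        return sum(1 for t in transactions if all(i in t for i in itemset))

    items = sorted({item for t in transactions for item in t})
    frequent = [(item,) for item in items if count((item,)) >= minsup]  # F1
    answer = list(frequent)
    while frequent:                                   # until F(k-1) is empty
        candidates = p_gen(frequent)                  # Ck
        frequent = [c for c in candidates if count(c) >= minsup]  # Fk
        answer.extend(frequent)
    return answer


# Hypothetical discretized data: each gene is a group with values "hi"/"lo".
transactions = [
    {("A", "hi"), ("B", "lo")},
    {("A", "hi"), ("B", "lo")},
    {("A", "lo"), ("B", "lo")},
    {("A", "hi"), ("B", "hi")},
]
frequent_itemsets = p_arm(transactions, minsup=2)
```

The group test keeps the join from pairing two values of the same attribute, such as ("A", "hi") with ("A", "lo"), which could never occur together in one row.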
Conclusions
• Controls yielded no rules or itemsets, which shows that the associations are not happening by chance.
• We are detecting order.
• Itemset and rule counts peak at n genes per set.
• Verifying that the biological associations are actually meaningful.
• There are few examples in the literature, yet the technique shows promise.
• A larger dataset is more likely to yield better results.