University of Tartu
Faculty of Mathematics and Computer Science
Institute of Computer Science
Exploring large sets of microarray data to identify
genes with lowest variation in expression
MTAT.03.277 Research Seminar in Data Mining
Author: Janno Jõgeva
Supervisors: Sten Ilmjärv & Hendrik Luuk
Tartu 2012
Table of Contents
Introduction
Data
Generating groups
    First approach
    Second approach
    Third approach
    Fourth approach
Conclusions
Introduction
This project addresses a problem arising from wet-lab gene expression
measurements, where gene expression is determined using DNA microarrays.
A microarray is a small solid surface carrying approximately 1 to 6 million distinct
covalently bound oligonucleotide sequences (called probes) that are used to detect
the abundance of labelled complementary DNA/RNA molecules through nucleic
acid hybridization. The working principle is that complementary sequences form a
fluorescent duplex molecule that can be detected with a scanner, which makes it
possible to measure the expression level of individual DNA/RNA molecules in the
sample of interest. The differential gene expression analysis package DEMI
(Differential Expression from Multiple Indicators) converts the obtained hybridization
signal intensities to an ordinal scale by relative ranking: a relative rank of 99 means
that 99% of the probes on the microarray have expression signals lower than or
equal to the current probe's signal. Using a three-step procedure, DEMI estimates
the differential expression of all genes that are represented by the probes on the
microarray.
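
As an illustration of the idea, a relative rank of this kind could be computed roughly
as follows (a minimal sketch in Python with made-up probe signals, not the actual
DEMI implementation):

# Convert raw hybridization signal intensities into relative ranks on a
# 0-100 scale: the rank of a probe is the percentage of probes whose signal
# is lower than or equal to its own. Illustrative sketch only, not DEMI code.
def relative_ranks(signals):
    n = float(len(signals))
    ranks = []
    for s in signals:
        below_or_equal = sum(1 for x in signals if x <= s)
        ranks.append(100.0 * below_or_equal / n)
    return ranks

# Hypothetical signals for five probes:
print(relative_ranks([12.1, 250.0, 7.3, 99.5, 250.0]))
# -> [40.0, 100.0, 20.0, 60.0, 100.0]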
The aim of this project is to identify the genes with the lowest variation in expression
across all experimental conditions. Such genes are commonly known as
housekeeping genes, and they can be used as internal references when measuring
gene expression by real-time PCR (qPCR). qPCR is considered the gold-standard
technique for measuring gene expression, but its performance depends on the
expression stability of the reference gene, and measurements typically need to
include several reference genes, which is costly and time-consuming.
Data
We used data from experiments performed on the Human Gene ST 1.0 array (an
Affymetrix microarray platform). Other platforms were not covered because of the
computational cost of adding that much additional data. The files containing the
experiment data had already been downloaded.

First, scripts were written to download the metadata from the ArrayExpress site.
The metadata is split into two files per experiment: an .sdrf.txt file that describes
every .CEL file used in the analysis, and an .idf.txt file that contains a general
description of the experiment. In total there are 250 experiments containing 4875
.CEL files; this number increased to 5066 after downloading the missing data from
ArrayExpress. Figure 1 shows the distribution of .CEL files across the experiments;
5 experiments containing more than 100 .CEL files are not shown on this graph.
The number of existing .CEL files per experiment has the following characteristics:
Minimum: 4.00
1st quartile: 8.00
Median: 13.00
Mean: 20.76
3rd quartile: 21.75
Maximum: 299.00
Figure 1. Count of experiments with a similar number of .CEL files.
The original dataset contained two experiments that were no longer available in
ArrayExpress, E-GEOD-31428 and E-GEOD-30074, so they were excluded from the
analysis.

The .sdrf files contain the metadata values for every experiment in the series. The
number of fields varies greatly between experiments; its main characteristics are as
follows:
Minimum: 6.00
1st quartile: 9.00
Median: 14.00
Mean: 23.20
3rd quartile: 23.25
Maximum: 351.00
Generating groups
Metadata was used to divide the experiment results into groups, and this became
the main topic of the project. The aim was to split each experiment into groups
according to the factors defined for that particular experiment. Python 2.6 was used
as the main toolset for this task; the decision was based on the wide variety of
libraries and documentation available for the language. The results of the groupings
were written out as a directory structure containing links to the original .CEL files,
so that the data is not duplicated. Experiments that did not contain any factor values
were split so that every .CEL file ended up in its own group; this ensures that we do
not group together data about which we have no information.
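
The output could be produced along the following lines (a minimal sketch; the actual
directory layout and naming used in the project may differ):

import os

# Write the grouping result as a directory tree in which every group is a
# directory containing symbolic links to the original .CEL files, so the
# files themselves are not duplicated. Paths and names are illustrative.
def write_groups(groups, out_dir):
    # groups maps a group name to a list of absolute .CEL file paths
    for group_name, cel_files in groups.items():
        group_dir = os.path.join(out_dir, group_name)
        if not os.path.exists(group_dir):
            os.makedirs(group_dir)
        for cel_path in cel_files:
            link = os.path.join(group_dir, os.path.basename(cel_path))
            if not os.path.exists(link):
                os.symlink(cel_path, link)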
First approach
For the initial grouping, a set of fields was selected empirically. The final fields were
"Comment [Sample_characteristics]", "FactorValue [ORGANISM]" and
"Comment [Sample_description]". Initially only one column per experiment was used
to form the groups. The comment fields carry a very inconsistent set of information
and are in many cases identical for all elements of an experiment.
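
For a single column the idea can be sketched as follows (the file name and column
choice are examples, and it is assumed, as is usual for ArrayExpress .sdrf files, that
the .CEL file name is listed in the "Array Data File" column):

import csv

# Group the samples of one experiment by the value of a single .sdrf column.
# .sdrf files are tab separated; the "Array Data File" column is assumed to
# hold the .CEL file name. Illustrative sketch only.
def group_by_column(sdrf_path, column):
    groups = {}
    reader = csv.DictReader(open(sdrf_path), delimiter='\t')
    for row in reader:
        key = row.get(column, '').strip()
        groups.setdefault(key, []).append(row['Array Data File'])
    return groups

# Example call (assumes such a file exists locally):
groups = group_by_column('example.sdrf.txt', 'Comment [Sample_description]')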
Second approach
For this grouping the same fields were eventually used as in the first approach, but
this time all of the fields available for an experiment were combined to generate the
groups, which resulted in smaller groups. An additional restriction was that a field
may not contain a unique value for every element, as this would simply reproduce
the default grouping (every file in a separate group).
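
Building on the previous sketch, combining several fields while enforcing this
restriction could look roughly like this (again only an illustration, with the same
assumptions about the .sdrf rows):

# Combine several metadata columns into one grouping key, but skip any column
# whose value is unique for every row, since such a column would only
# reproduce the default one-file-per-group split. Illustrative sketch only;
# "rows" are dicts as read from an .sdrf file.
def group_by_columns(rows, columns):
    usable = []
    for col in columns:
        values = [row.get(col, '').strip() for row in rows]
        if len(set(values)) < len(values):  # some value repeats
            usable.append(col)
    if not usable:
        # default grouping: every file in its own group
        return dict((row['Array Data File'], [row['Array Data File']])
                    for row in rows)
    groups = {}
    for row in rows:
        key = tuple(row.get(col, '').strip() for col in usable)
        groups.setdefault(key, []).append(row['Array Data File'])
    return groups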
Third approach
All in all there were 186 different column names containing the string "FactorValue";
after merging names that differed only in whitespace, 161 remained. The factor
columns were then cleaned up in several steps (a sketch of this filtering is given
after the list):

a) Using all of the factors as group separators produced 3707 groups.
b) Columns that contained nothing but whitespace characters were excluded from
   the final classification, reducing the number of groups to 3692.
c) Classifiers whose value was "none" for all fields were then removed, leading to
   3691 groups.
d) The same was done for "not specified", leading to 3714 groups.
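
A sketch of this filtering, again assuming that the rows are dicts read from the .sdrf
files, could look like this:

# Treat column names that differ only in whitespace as the same column, and
# drop factor columns that carry no information, i.e. whose value is empty,
# "none" or "not specified" for every row. Illustrative sketch only.
def normalise_column_name(name):
    return ''.join(name.split())  # remove all whitespace from the name

def informative_factor_columns(rows, factor_columns):
    uninformative = set(['', 'none', 'not specified'])
    keep = []
    for col in factor_columns:
        values = set(row.get(col, '').strip().lower() for row in rows)
        if not values.issubset(uninformative):
            keep.append(col)
    return keep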
This approach gave a good overview of the metadata available for grouping the
.CEL files. Out of all of the .CEL files, 2254 (~41%) had no factor value specified
and were therefore separated into groups containing only one .CEL file.
The number of factor values per experiment has the following characteristics:
Minimum: 0.0000
1st quartile: 0.0000
Median: 0.0000
Mean: 0.9839
3rd quartile: 1.0000
Maximum: 12.0000
The groups generated by this approach had the following size characteristics:
Minimum: 1.000
1st quartile: 1.000
Median: 1.000
Mean: 1.482
3rd quartile: 1.000
Maximum: 160.000
Fourth approach
The grouping could also be obtained by clustering the .CEL files within experiments;
one of the previous approaches could be used to estimate the number of groups.
This method is not covered by this report.
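
As a purely hypothetical illustration of what such a clustering might look like (this
was not implemented in the project; scikit-learn and random data are used only to
show the idea):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical sketch: every row stands for one .CEL file and every column
# for one probe signal. Random data is used in place of real signals.
signals = np.random.rand(20, 1000)  # 20 CEL files, 1000 probe signals

# The number of clusters could be estimated from one of the metadata-based
# groupings described above; 4 is an arbitrary example.
labels = KMeans(n_clusters=4, random_state=0).fit_predict(signals)
print(labels)  # cluster assignment for every CEL file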
Conclusions
The time needed for this project was greatly underestimated by the author. The
main data mining goals achieved were acquiring, characterising and cleansing the
data.

The methods used for finding groups within experiments did not work as well as
hoped, because the quantity and quality of the provided metadata vary greatly
between experiments. At the same time, many groups received a good classification
as far as we can tell; this quality can only be quantified after the full process has
been completed.

The next steps needed to complete the overall goal are to experiment with the
group size and to run the initially planned tests on the experiment results in order to
identify the most stable genes across all of them. This should be followed by
including experiments from other microarray platforms to obtain a more significant
result.