University of Tartu
Faculty of Mathematics and Computer Science
Institute of Computer Science

Exploring large sets of microarray data to identify genes with lowest variation in expression

MTAT.03.277 Research Seminar in Data Mining
Author: Janno Jõgeva
Supervisors: Sten Ilmjärv & Hendrik Luuk
Tartu 2012

Table of Contents
Introduction
Data
Generating groups
    First approach
    Second approach
    Third approach
    Fourth approach
Conclusions

Introduction

This project addresses a problem arising from wet-lab gene expression measurements: determining gene expression levels using DNA microarrays. A microarray is a small solid surface carrying approximately 1 to 6 million distinct covalently bound oligonucleotide sequences (called probes) that are used to detect the abundance of labeled complementary DNA/RNA molecules through nucleic acid hybridization. The working principle is that complementary sequences form a fluorescent duplex molecule that can be detected with a scanner.
This makes it possible to measure the expression level of individual DNA/RNA molecules in the sample of interest. The differential gene expression analysis package DEMI (Differential Expression from Multiple Indicators) converts the obtained hybridization signal intensities to an ordinal scale by relative ranking: a relative rank of 99 means that 99% of the probes on the microarray have expression signals lower than or equal to the current probe's signal. Using a three-step procedure, DEMI estimates the differential expression of all genes represented by probes on the microarray.

The aim of this project is to identify the genes with the lowest variation in expression across all experimental conditions. Such genes are commonly known as housekeeping genes, and they can be used as internal references when measuring gene expression by real-time PCR (qPCR). qPCR is regarded as the gold-standard technique for measuring gene expression, but its performance depends on the stability of expression of the reference gene, and measurements typically need to include several reference genes, which is costly and time-consuming.

Data

We used data from experiments performed on the Human Gene ST 1.0 platform (Affymetrix). Other platforms were not covered because of the computational cost of adding that much data. The files containing the experiment data had already been downloaded; scripts were then written to fetch the metadata from the ArrayExpress site. The metadata is divided between two files per experiment: an .sdrf.txt file that describes the analysis, with one row per .CEL file, and an .idf.txt file that gives a general description of the experiment. In total there are 250 experiments containing 4875 CEL files; this number increased to 5066 after downloading the missing data from ArrayExpress. Figure 1 shows the distribution of .CEL files across the experiments; the 5 experiments containing more than 100 CEL files are not shown on the graph.
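As an illustration of how summary statistics such as those reported for the CEL file counts can be computed, consider the sketch below. The function names and the sample counts are hypothetical; the real numbers come from the downloaded ArrayExpress metadata, and this is not the project's actual script.

```python
def quartile(sorted_vals, q):
    """Linear-interpolation quantile (R's default type-7),
    matching the convention used by R's summary()."""
    pos = (len(sorted_vals) - 1) * q
    lo, frac = int(pos), pos - int(pos)
    if frac == 0:
        return float(sorted_vals[lo])
    return sorted_vals[lo] + frac * (sorted_vals[lo + 1] - sorted_vals[lo])

def summarise(counts):
    """Return the six summary characteristics used in this report."""
    s = sorted(counts)
    return {
        "min": float(s[0]),
        "q1": quartile(s, 0.25),
        "median": quartile(s, 0.5),
        "mean": sum(s) / float(len(s)),
        "q3": quartile(s, 0.75),
        "max": float(s[-1]),
    }

# Hypothetical per-experiment CEL file counts, for illustration only.
cel_counts = [4, 8, 8, 13, 21, 299]
stats = summarise(cel_counts)
```

In the real analysis the input list would hold one count per experiment, parsed from the .sdrf.txt files.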
The number of existing CEL files per experiment has the following characteristics:

Minimum : 4.00
1st Quartile : 8.00
Median : 13.00
Mean : 20.76
3rd Quartile : 21.75
Maximum : 299.00

Figure 1. Counts of experiments grouped by similar numbers of CEL files.

The original dataset contained two experiments that were no longer available in ArrayExpress, E-GEOD-31428 and E-GEOD-30074, so they were excluded from the analysis. The SDRF files contain values for every sample in an experiment, and the number of values varies greatly between experiments. The main characteristics of the number of fields are as follows:

Minimum : 6.00
1st Quartile : 9.00
Median : 14.00
Mean : 23.20
3rd Quartile : 23.25
Maximum : 351.00

Generating groups

Metadata was used to divide the experiment results into groups, and this became the main topic of the project. The aim was to separate each experiment's samples into groups according to the factors recorded for that experiment. Python 2.6 was used as the main toolset for this task, a decision based on the wide variety of libraries and documentation available for the language. The groupings were generated as a directory structure containing links to the original .CEL files, to reduce data duplication. An experiment that did not contain any factor values was split so that every CEL file ended up in its own group; this ensures that we do not group data about which we have no information.

First approach

For the initial grouping, a set of fields was selected empirically. The final fields were "Comment [Sample_characteristics]", "FactorValue [ORGANISM]" and "Comment [Sample_description]". Initially, only one column per experiment was used to form the groups. The Comment fields contain a very inconsistent set of information and are in many cases identical for all elements of an experiment.

Second approach

This grouping eventually used the same fields as the first approach.
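The directory-of-links output described under "Generating groups" can be sketched roughly as follows. The function, its arguments, and the single-factor grouping logic are illustrative assumptions, not the project's actual script:

```python
import os

def group_by_factor(samples, out_dir):
    """Create one subdirectory per factor value and symlink each CEL
    file into its group, so the original files are not duplicated.
    `samples` maps a CEL file path to its factor value (or None).
    Uses os.symlink, so this sketch assumes a POSIX system."""
    groups = {}
    for idx, (cel_path, factor) in enumerate(sorted(samples.items())):
        # No factor value: put the file in its own group, as in the report.
        group = factor if factor is not None else "ungrouped_%d" % idx
        groups.setdefault(group, []).append(cel_path)
    for group, paths in groups.items():
        group_dir = os.path.join(out_dir, group)
        if not os.path.isdir(group_dir):
            os.makedirs(group_dir)
        for cel_path in paths:
            link = os.path.join(group_dir, os.path.basename(cel_path))
            if not os.path.exists(link):
                os.symlink(os.path.abspath(cel_path), link)
    return groups
```

Keeping symlinks rather than copies means the full set of CEL files exists only once on disk, regardless of how many alternative groupings are generated.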
This time, however, all of those fields that were available for an experiment were used together to generate the groups, which resulted in smaller groups. An additional restriction was that a field could not contain a unique value for every element, as this would reproduce the default grouping (every file in a separate group).

Third approach

In total there were 186 different column names containing the string "FactorValue"; after eliminating names that differed only in whitespace, 161 remained.

a) Using all of the factors as group separators generated 3707 groups.
b) Factor columns containing nothing but whitespace characters were excluded from the final classification, reducing the number of groups to 3692.
c) Removing all classifiers that had only "none" specified for every field led to 3691 groups.
d) Doing the same for "not specified" led to 3714 groups.

This approach gave a good overview of the metadata available for grouping the CEL files. Of all the CEL files, 2254 (~41%) had no factor value specified and were placed into groups containing only one CEL file.

Number of factor values per experiment:

Minimum : 0.0000
1st Quartile : 0.0000
Median : 0.0000
Mean : 0.9839
3rd Quartile : 1.0000
Maximum : 12.0000

Groups generated by this approach had the following size characteristics:

Minimum : 1.000
1st Quartile : 1.000
Median : 1.000
Mean : 1.482
3rd Quartile : 1.000
Maximum : 160.000

Fourth approach

The grouping could also be achieved by clustering the CEL files within experiments, with one of the previous approaches used to estimate the number of groups. This method is not covered by this report.

Conclusions

The time required for this project was greatly underestimated by the author. The main data mining goals achieved were acquiring, characterising and cleansing the data.
The methods used for finding groups within experiments did not perform as well as hoped, because the quantity and quality of the available metadata vary greatly between experiments. At the same time, many groups received a good classification as far as we can tell; this quality can only be quantified once the full process has been completed. The next steps towards the overall goal are to experiment with group sizes and to run the initially planned tests on the experiment results in order to identify the most stable genes across all experiments. This should be followed by including experiments from other microarray platforms to obtain a more significant result.