Chapter 2
Gene Expression Data Sets

DOI: 10.4018/978-1-60960-557-5.ch002
Copyright © 2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

BIOLOGICAL DATA AND THEIR CHARACTERISTICS

Before embarking on a long tour across different machine learning methods, it is useful to look at some popular gene expression data sets and their characteristics. First of all, a typical gene expression data set contains a matrix $X$ of real numbers. Let $D$ and $N$ be the number of its rows and columns, respectively. Then $X$ is represented as

\[
X = \begin{pmatrix}
x_{11} & \cdots & x_{1N} \\
\vdots & \ddots & \vdots \\
x_{D1} & \cdots & x_{DN}
\end{pmatrix},
\]

where $x_{ij}$ represents the value in the $i$th row and $j$th column.

As there are thousands of genes but only a few dozen samples, $D$ (the number of genes) is of order 1,000-10,000, while $N$ (the number of biological samples) is somewhere between 10 and 100. Such a condition makes many traditional statistical methods inapplicable, as those techniques were developed under the assumption that $N \gg D$. You may ask: what is the problem? The problem lies in an underdetermined system, where there are only a few equations but many more unknown variables (Kohane, Kho, & Butte, 2003). Hence, the solution of such a system is not unique; in other words, multiple solutions exist. Translated into the biological language of the applied problem treated in this book, this means that multiple subsets of genes may be equally relevant to cancer classification (Ein-Dor, Kela, Getz, Givol, & Domany, 2005; Díaz-Uriarte & Alvarez de Andrés, 2006). However, to reduce the chance that noisy and/or irrelevant genes are included in one of these subsets, one needs to eliminate irrelevant genes before the actual classification.

You may also wonder why it is impossible to increase $N$.
The answer is that this is difficult, because measuring gene expression requires a functionally relevant tissue taken under the right conditions, which is sadly rare since it is seldom possible to meet all requirements at once in practice (read more about these problems in Kohane, Kho, & Butte, 2003). So, we are left with the necessity to live with and handle high dimensional data. Below, five popular gene expression data sets are briefly described in order to give a realistic picture of what gene expression data are.

Examples of Gene Expression Data Sets

For each data set below, all the data are stored in a text file, though the file extension is not necessarily .txt. Often gene names are stored in a separate text file; hence, it is recommended to study the content of all text files associated with a given data set. Downloading such files is straightforward, but extracting numerical information is not, as different files store gene expression data mixed with textual headers and other text information. In order to access gene expression data, one needs to write a separate script or program for each data set after studying the structure of the file storing them. Any programming language or environment has commands/functions for file input/output operations. Sometimes, the entire file can be read into RAM in a single read operation. However, as textual and numerical data are mixed together, further effort is usually necessary to separate text from non-text.

If you care about a standard way of storing and exchanging gene expression data, then the MicroArray and Gene Expression (MAGE) group (http://www.mged.org/Workgroups/MAGE/mage.html) provides
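
The separation of text from numbers mentioned in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the format of any particular data set: the tab-delimited layout, the sample content, and the helper names to_float and split_text_and_numbers are all hypothetical. A line is treated as a data row only if its first field is a non-numeric label (for instance, a gene name) and every remaining field parses as a number; every other line is kept aside as text.

```python
import io


def to_float(token):
    """Return token as a float, or None if it does not parse as a number."""
    try:
        return float(token)
    except ValueError:
        return None


def split_text_and_numbers(handle, delimiter="\t"):
    """Separate labelled numeric rows from purely textual lines.

    Assumed (hypothetical) layout: a data row is "label<TAB>v1<TAB>v2...".
    Returns (row_labels, numeric_rows, text_lines).
    """
    labels, matrix, text_lines = [], [], []
    for raw in handle:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(delimiter)
        values = [to_float(field) for field in fields]
        is_data_row = (
            len(fields) > 1
            and values[0] is None                      # first field is a label
            and all(v is not None for v in values[1:])  # the rest are numbers
        )
        if is_data_row:
            labels.append(fields[0])
            matrix.append(values[1:])
        else:
            text_lines.append(line)
    return labels, matrix, text_lines


# Made-up file content: one comment line, one header line, two gene rows.
sample = (
    "# hypothetical header line\n"
    "Gene\tS1\tS2\tS3\n"
    "G1\t0.5\t1.2\t-0.3\n"
    "G2\t2.0\t0.0\t1.1\n"
)
labels, matrix, text_lines = split_text_and_numbers(io.StringIO(sample))
print(labels)
print(matrix)
print(text_lines)
```

With a real data set, the rule that decides what counts as a data row has to be adapted after inspecting the file, which is exactly why a separate script per data set is usually needed. In this sketch the rows of the resulting matrix correspond to genes and its columns to samples, matching the D-by-N layout used in this chapter.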
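
To make the earlier remark about underdetermined systems concrete, here is a toy sketch with made-up numbers (the matrix, the target values, and the function names predict and solution are illustrative assumptions, not taken from any real data set). It uses D = 3 genes and N = 2 samples, so a linear model that reproduces one target value per sample yields two equations in three unknown gene weights. The code shows that a whole family of weight vectors fits the samples exactly, and that different members of this family rely on different subsets of genes.

```python
# Toy expression matrix X with D = 3 genes (rows) and N = 2 samples (columns).
X = [
    [1, 0],  # gene 1
    [0, 1],  # gene 2
    [1, 1],  # gene 3
]
y = [1, -1]  # hypothetical target value for each of the two samples


def predict(weights, X):
    """Return sum_i weights[i] * X[i][j] for every sample (column) j."""
    n_samples = len(X[0])
    return [
        sum(w * row[j] for w, row in zip(weights, X))
        for j in range(n_samples)
    ]


def solution(t):
    """One-parameter family solving w1 + w3 = 1 and w2 + w3 = -1 exactly."""
    return [1 - t, -1 - t, t]


for t in (-1, 0, 1):
    w = solution(t)
    genes_used = [i + 1 for i, wi in enumerate(w) if wi != 0]
    fits = predict(w, X) == y
    print(f"t={t}: weights={w}, genes used={genes_used}, exact fit={fits}")
```

Each of the three weight vectors reproduces both target values exactly, yet each leaves out a different gene. Nothing in the two samples alone distinguishes these solutions, which is a miniature version of the situation where several gene subsets appear equally relevant.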