Chapter 2
Gene Expression Data Sets
BIOLOGICAL DATA AND THEIR CHARACTERISTICS
Before embarking on a long tour across different machine learning methods, it is
useful to look at some popular data sets and their characteristics.
First of all, a typical gene expression data set contains a matrix X of real numbers. Let D and N be the number of its rows and columns, respectively. Then X is represented as

$$
X = \begin{pmatrix}
x_{11} & \cdots & x_{1N} \\
\vdots & \ddots & \vdots \\
x_{D1} & \cdots & x_{DN}
\end{pmatrix},
$$

where $x_{ij}$ represents the value in the i-th row and j-th column.
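As a small illustration of this notation, the sketch below stores a toy matrix as a list of rows and reads one entry using the 1-based indices of the text. The numbers and the helper name `entry` are made up for the example; they are not taken from any real data set.

```python
# Hypothetical toy values: D = 3 rows, N = 2 columns.
X = [
    [0.5, 1.2],
    [2.0, 0.1],
    [1.5, 3.3],
]

D = len(X)      # number of rows
N = len(X[0])   # number of columns


def entry(matrix, i, j):
    """Return x_ij, where i and j are 1-based as in the text."""
    return matrix[i - 1][j - 1]


print(D, N, entry(X, 3, 2))
```

Python lists are 0-based, so the helper shifts both indices by one to match the mathematical convention.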
DOI: 10.4018/978-1-60960-557-5.ch002
Copyright ©2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
As there are thousands of genes and only a few dozen samples, D (the number of genes) is of order 1,000-10,000 while N (the number of biological samples) is somewhere between 10 and 100. Such a condition makes the application of many traditional statistical methods impossible, as those techniques were developed under the assumption that N ≫ D. You may ask: what is the problem? The problem is that the system is underdetermined: there are only a few equations but many more unknown variables (Kohane, Kho, & Butte, 2003). Hence, the solution of such a system is not unique; in other words, multiple solutions exist. Translated into the biological language of the applied problem treated in this book, this means that multiple subsets of genes may be equally relevant to cancer classification (Ein-Dor, Kela, Getz, Givol, & Domany, 2005; Díaz-Uriarte & Alvarez de Andrés, 2006). However, in order to reduce the chance that noisy and/or irrelevant genes are included in one of these subsets, one needs to eliminate irrelevant genes before the actual classification.
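The non-uniqueness argument can be made concrete with a deliberately tiny example. The sketch below assumes a linear model with one weight per gene, N = 2 samples (equations), and D = 3 genes (unknowns). All values, the names `samples`, `labels`, and `predict`, and the choice of a linear model are illustrative assumptions rather than anything prescribed by the text.

```python
import math

# Each row is one sample (one equation); each column is one hypothetical gene.
samples = [
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 0.0],
]
labels = [1.0, -1.0]  # made-up target values, one per sample


def predict(weights, sample):
    """Linear model: weighted sum of the gene expression values."""
    return sum(w * x for w, x in zip(weights, sample))


def fits_exactly(weights):
    """True if the weights reproduce every label."""
    return all(
        math.isclose(predict(weights, s), y) for s, y in zip(samples, labels)
    )


# Two different solutions of the same two equations in three unknowns.
solution_a = [-1.0, 1.0, 0.0]   # uses genes 1 and 2, ignores gene 3
solution_b = [-0.5, 0.0, 0.5]   # uses genes 1 and 3, ignores gene 2

print(fits_exactly(solution_a), fits_exactly(solution_b))
```

Both weight vectors satisfy the equations, yet they rely on different gene subsets. This mirrors, on a small scale, the observation that several subsets may appear equally relevant.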
You may also wonder why it is impossible to increase N. The answer is that this is difficult: measuring gene expression requires functionally relevant tissue taken under the right conditions, which is sadly rare because all requirements can seldom be met at once in practice (read more about these problems in Kohane, Kho, & Butte, 2003). So we are left with the necessity to live with, and to deal with, high-dimensional data.
Below, five popular gene expression data sets are briefly described in order to give a realistic picture of what gene expression data are.
Examples of Gene Expression Data Sets
For each data set below, all the data are stored in a text file, though the file extension is not necessarily .txt. Gene names are often stored in a separate text file; hence, it is useful and recommended to study the content of all text files associated with a given data set. Downloading such files is straightforward, but extracting numerical information is not, as different files store gene expression data mixed with textual headers and other text information. In order to access gene expression data, one needs to write a separate script or program for each data set after studying the structure of the file storing the data. Any programming language or environment has commands/functions for file input/output operations. Sometimes, the entire file can be read into RAM in a single read operation. However, as textual and numerical data are mixed together, further effort is usually necessary to separate text from non-text. If you care about the standard way of storing and exchanging gene expression data, then the MicroArray and Gene Expression (MAGE) group (http://www.mged.org/Workgroups/MAGE/mage.html) provides
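To give a feel for the effort of separating text from numbers, here is a minimal sketch of a parser for one hypothetical file layout. The layout is an assumption made for illustration: tab-delimited text, comment lines starting with `#`, one header line of sample names, then one line per gene. The function name `parse_expression_text` is likewise invented. Real data sets differ from this layout and from each other, which is why each one typically needs its own script.

```python
import io


def parse_expression_text(handle):
    """Split a tab-delimited text stream into sample names, gene names, numbers.

    Assumed (hypothetical) layout: lines starting with '#' are free-text
    comments, the first remaining line is a header of sample names, and every
    later line holds a gene name followed by one value per sample.
    """
    sample_names, gene_names, matrix = None, [], []
    for raw in handle:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and textual comments
        fields = line.split("\t")
        if sample_names is None:
            sample_names = fields[1:]  # first field only labels the gene column
            continue
        gene_names.append(fields[0])
        matrix.append([float(value) for value in fields[1:]])
    return sample_names, gene_names, matrix


# Made-up file content, held in memory so the example is self-contained.
example = io.StringIO(
    "# made-up file for illustration only\n"
    "gene\tsample_1\tsample_2\tsample_3\n"
    "GENE_A\t0.5\t1.25\t-0.75\n"
    "GENE_B\t2.0\t0.0\t3.5\n"
)
samples, genes, X = parse_expression_text(example)
print(len(genes), "genes x", len(samples), "samples")
```

The function returns the numbers as a list of rows, one per gene, so the result has the D x N shape used throughout the chapter. For a file on disk, the same function could be given the object returned by `open(path)` instead of the in-memory stream.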