Integrative Analysis of Multiple Large-Scale Molecular Biological Data
Sri Priya Ponnapalli
Genomic Signal Processing Laboratory
The University of Texas at Austin

Project Objectives
Specimen under analysis: the National Cancer Institute's 60 cell lines (NCI60).
Dataset #1: RNA expression profiles [Ross et al, 2000]
Dataset #2: Proteomic profiles [Nishizuka et al, 2003]
Dataset #3: Drug activity levels [Scherf et al, 2000]
Perceive relationships between three datasets, each containing a different attribute of the NCI60: genome-scale expression, sensitivities to more than 70,000 chemical compounds and chemotherapeutics, and proteomic profiles.
CHIEF OBJECTIVE: DEVELOP A METHOD TO ANALYSE THE RELATIONSHIPS BETWEEN MULTIPLE DATASETS.

Initial Analysis: SVD
• All three datasets were processed using singular value decomposition (SVD) [Alter et al, 2000]. The results look interesting, but they are difficult to interpret for each dataset on its own, let alone to integrate across all three datasets.
[Figure: plot of the first 5 sorted eigengenes; axis labels: Eigengenes, Tumor Samples]

Analysis using GSVD
Every pair of datasets was then processed using generalized singular value decomposition (GSVD):
Dataset1 = U1 E1 X
Dataset2 = U2 E2 X
If a dataset is thought of as a line, the GSVD of two datasets represents the point of intersection of these lines, i.e. it highlights the similarities and dissimilarities between the two datasets. This simple fact suggests a method to study the similarities and differences between multiple datasets.
Consider the case of finding similarities and dissimilarities between 3 datasets taken in pairs (this can be extended to multiple datasets).
• These 3 datasets may be thought of as representing 3 lines.
• Any two non-parallel lines intersect at a point.
• Three pairwise non-parallel lines form a triangle (unless they all share a common point, in which case all three vertices of the triangle converge to that point).
• The goal is to express the three datasets in the form
Dataset1 = U1 E1 X
Dataset2 = U2 E2 X
Dataset3 = U3 E3 X
If we compute the GSVD of every two datasets (that is, find the point of intersection of every two lines), we get three matrices that each correspond to a vertex of a triangle. We want a matrix that best approximates these three matrices, i.e. a point that is closest to all three vertices simultaneously. This point would be the centroid of the triangle. Given the coordinates of the vertices, the centroid is easily computed. All of this has to be interpreted in terms of matrices, which can be done by measuring the distance between matrices with the Frobenius distance.
• This method is an approximation, but the best possible one: it minimizes the error between the original dataset and the dataset obtained as the product of the three matrices.
• It has been tried on the three datasets under study and the results look promising.
• Please read the paper for further details.
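The centroid idea can be illustrated with a minimal NumPy sketch. It is not the implementation used in this project: the three "vertex" matrices are random placeholders rather than GSVD outputs from the NCI60 data, the function names are hypothetical, and "closest to all three vertices" is assumed here to mean the smallest sum of squared Frobenius distances.

```python
import numpy as np


def frobenius_distance(a, b):
    """Frobenius distance between two matrices of the same shape."""
    return np.linalg.norm(a - b, ord="fro")


def matrix_centroid(matrices):
    """Element-wise mean of the matrices (the centroid of the 'triangle')."""
    return np.mean(np.stack(matrices), axis=0)


def total_squared_distance(candidate, matrices):
    """Sum of squared Frobenius distances from candidate to each matrix."""
    return sum(frobenius_distance(candidate, m) ** 2 for m in matrices)


# Placeholder vertices: random matrices standing in for the three matrices
# obtained from the pairwise GSVDs (an assumption made for illustration).
rng = np.random.default_rng(0)
vertices = [rng.standard_normal((4, 4)) for _ in range(3)]

centroid = matrix_centroid(vertices)
best = total_squared_distance(centroid, vertices)

# Moving away from the centroid never lowers the total squared distance.
for _ in range(100):
    perturbed = centroid + 0.1 * rng.standard_normal(centroid.shape)
    assert total_squared_distance(perturbed, vertices) >= best
```

Under this assumed criterion the element-wise mean is the exact minimizer, which is why the perturbation check always passes; a different notion of "closest" (for example, the sum of unsquared distances) would in general give a different point.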