Integrative Analysis of Multiple Large-Scale Molecular Biological Data
Sri Priya Ponnapalli
Genomic Signal Processing Laboratory
The University of Texas at Austin
Project Objectives
Specimen Under Analysis : The National Cancer Institute’s 60 cell
lines (NCI60).
Dataset #1 RNA expression profiles [Ross et al., 2000]
Dataset #2 Proteomic profiles [Nishizuka et al., 2003]
Dataset #3 Drug activity levels [Scherf et al., 2000]
Identify relationships among the three datasets, each capturing a
different attribute of the NCI60: genome-scale RNA expression,
proteomic profiles, and sensitivities to more than 70,000 chemical
compounds and chemotherapeutics.
CHIEF OBJECTIVE : DEVELOP A METHOD TO ANALYSE THE
RELATIONSHIPS BETWEEN MULTIPLE DATASETS.
Initial Analysis : SVD
• Each of the three datasets was processed separately using singular
value decomposition (SVD) [Alter et al., 2000].
• The results look interesting but, as the eigengene plot shows, they
are difficult to interpret on their own, let alone to integrate
across all three datasets.
[Figure: plot of the first 5 sorted eigengenes; axes labelled "Eigengenes" and "Tumor Samples"]
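The SVD step can be sketched in a few lines of NumPy. This is only an illustration: the matrix below is random data standing in for one dataset (the shape, 200 genes by 60 cell lines, and all variable names are assumptions, not taken from the slides). Following the convention of Alter et al., the rows of the right factor are the eigengenes, already sorted by decreasing singular value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for one dataset: 200 genes (rows) x 60 cell lines (columns).
data = rng.standard_normal((200, 60))

# Thin SVD: data = U @ diag(s) @ Vt, with s sorted in decreasing order.
U, s, Vt = np.linalg.svd(data, full_matrices=False)

# Each row of Vt is an eigengene: a pattern across the 60 samples.
top5_eigengenes = Vt[:5]

# Fraction of the overall signal captured by each eigengene.
fraction = s**2 / np.sum(s**2)
```

Plotting the rows of `top5_eigengenes` against the sample index gives a figure of the kind shown on this slide.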
Analysis using GSVD
Every pair of datasets was then processed using generalized
singular value decomposition (GSVD).
Dataset1 = U1 E1 X
Dataset2 = U2 E2 X
Here X is shared by both datasets, while U1, U2 and the diagonal
matrices E1, E2 are specific to each dataset.
If a dataset is thought of as a line, the GSVD of two
datasets represents the point of intersection of these lines,
i.e. it highlights the similarities and dissimilarities between
the two datasets.
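NumPy has no built-in GSVD routine, so the sketch below builds one from a QR factorization of the stacked matrices followed by an ordinary SVD. It is a minimal illustration, not necessarily the procedure used in this project. The function name `gsvd_pair` and the random test matrices are hypothetical, and the code assumes that both matrices share the same number of columns, that each has at least as many rows as columns, and that each has full column rank.

```python
import numpy as np

def gsvd_pair(A, B):
    """Sketch of a GSVD for A (m1 x n) and B (m2 x n).

    Returns U1, e1, U2, e2, X such that
        A = U1 @ diag(e1) @ X   and   B = U2 @ diag(e2) @ X.
    Assumes m1 >= n, m2 >= n and full column rank for both A and B.
    """
    m1 = A.shape[0]
    # Reduced QR of the stacked matrix: [A; B] = Q @ R.
    Q, R = np.linalg.qr(np.vstack([A, B]))
    Q1, Q2 = Q[:m1], Q[m1:]
    # SVD of the top block: Q1 = U1 @ diag(e1) @ Wt.
    U1, e1, Wt = np.linalg.svd(Q1, full_matrices=False)
    # Since Q1.T @ Q1 + Q2.T @ Q2 = I, the columns of Q2 @ W are
    # mutually orthogonal with norms sqrt(1 - e1**2).
    Q2W = Q2 @ Wt.T
    e2 = np.linalg.norm(Q2W, axis=0)
    U2 = Q2W / e2          # normalize each column
    X = Wt @ R             # factor shared by both datasets
    return U1, e1, U2, e2, X

# Hypothetical data: two matrices measured over the same 6 samples.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 6))
B = rng.standard_normal((80, 6))
U1, e1, U2, e2, X = gsvd_pair(A, B)
```

With this normalization `e1**2 + e2**2` equals 1 in every component, so the ratio `e1 / e2` indicates whether a shared pattern (a row of `X`) is more significant in the first dataset or in the second.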
This simple fact suggests a method to study the similarities
and differences between multiple datasets.
Consider the case of finding similarities and dissimilarities
among 3 datasets, taken in pairs (this can be extended to
more datasets).
• These 3 datasets may be thought of as representing 3 lines.
• Any two non-parallel lines intersect at a point.
• Three pairwise non-parallel lines form a triangle (unless they all
pass through a common point, in which case all three vertices of
the triangle collapse to that point).
• The goal is to express the three datasets in the form
Dataset1 = U1 E1 X
Dataset2 = U2 E2 X
Dataset3 = U3 E3 X
with a single matrix X shared by all three.
If we compute the GSVD of every pair of datasets (find the
point of intersection of every two lines), we get three
matrices, each corresponding to a vertex of a triangle.
We want a matrix that best approximates these three
matrices, i.e. a point that is closest to all three vertices
simultaneously.
This point would be the centroid of the triangle.
Given the coordinates of the vertices, the centroid is
easily computed as their average.
All these results have to be interpreted in terms of matrices.
This is easily done by measuring the distance between two
matrices with the Frobenius norm of their difference.
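The centroid idea can be checked numerically. The sketch below uses three random matrices as hypothetical stand-ins for the three pairwise GSVD results (the names `X12`, `X13`, `X23` and both helper functions are assumptions for illustration). It assumes the three matrices have the same shape and are already directly comparable, and it takes "closest to all three vertices" to mean the smallest sum of squared Frobenius distances, which is the quantity the element-wise mean minimizes.

```python
import numpy as np

def centroid(matrices):
    """Element-wise mean of equally shaped matrices."""
    return sum(matrices) / len(matrices)

def total_sq_frobenius(candidate, matrices):
    """Sum of squared Frobenius distances from candidate to each matrix."""
    return sum(np.linalg.norm(candidate - M, "fro") ** 2 for M in matrices)

rng = np.random.default_rng(1)

# Hypothetical stand-ins for the three pairwise GSVD matrices (the "vertices").
X12, X13, X23 = (rng.standard_normal((6, 6)) for _ in range(3))
vertices = [X12, X13, X23]

X_bar = centroid(vertices)
best = total_sq_frobenius(X_bar, vertices)

# Any other candidate, such as a perturbed centroid, has a larger total distance.
perturbed = X_bar + 0.1 * rng.standard_normal((6, 6))
worse = total_sq_frobenius(perturbed, vertices)
```

Comparing `best` with `worse`, or with the value obtained by using any single vertex as the candidate, shows the centroid giving the smaller total.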
• This method is an approximation, but the best possible
approximation.
• It minimizes the error between each original dataset and the
dataset reconstructed as the product of its three matrices.
• It has been tried on the three datasets under study, and the
results look promising.
• Please read the paper for further details.