Chem452 :
Lecture 15
Week 8: Complex Data Analysis in Biology
Halil Bayraktar
What is common in all?
Singular Value Decomposition (SVD analysis)
It is a mathematical matrix decomposition method, a tool to
analyze complex data and answer important questions.
It is used extensively for
a) Gene expression analysis
b) Image analysis, compression and matching
c) Sound analysis and matching
d) Chemical sample analysis etc.
e) Time series analysis, signal processing and automation control
f) Google web site ranking
It is easy to use, and rich information can be obtained
from the data.
Gene expression
You would like to answer the following questions.
What is the best way of analyzing your data?
Which genes are expressed at the same time in
cancer cells?
Which genes are expressed in cancer cells but
not in normal cells?
How dramatically does the expression change over time?
[Figure: heat map of gene expression. Axes: gene type and time; colors define the intensity values.]
N genes are expressed at M different time points, giving an
N by M matrix.
Yeast cell cycle
PCA for DNA Microarrays
Genes with same pattern
might have similar function
Fig. 2 Two pattern-discovery techniques. Data for both figures measure
expression for 11 genes characterizing sensitivity to compound cytochalasin D
in 60 cancer cell lines (ref. 97). a, The first three principal components,
plotted using Matlab software (Mathworks). Apparent features include a tight
cluster of leukemia samples (red dots, nearly superimposed) and the more
scattered outlying cluster of CNS tumors (black dots). A single lung cancer
sample (NSCLC-NCIH226) also appears as an outlier: the solitary orange dot at
the top. b, Hierarchical clustering of the same data, using Cluster/TreeView
(http://rana.lbl.gov/EisenSoftware.htm). Names of samples extremely sensitive
or resistant to cytochalasin D (see Supplementary information) are prefixed
‘S’ and ‘R’ respectively. The samples fall into two main clusters, roughly,
but not perfectly, separating the sensitive and resistant samples. As in a,
fine structure shows a tight leukemia cluster (underlined in green) and a
tight CNS cluster (underlined in red), but does not suggest that the CNS
cluster or NSCLC-NCIH226 (underlined in blue) are outliers. Apparent in both
a and b is the relative heterogeneity of the breast cancer cell lines.
If genes are expressed
at the same time, they have
similar scores and
lie close together in the 3D
map.
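The claim that co-expressed genes land close together in the principal component map can be checked on synthetic data. The sketch below is only an illustration: the two gene groups, their sine and cosine time patterns, and the noise level are all made up, and NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 12)          # M = 12 hypothetical time points

# Two hypothetical groups of genes, each following its own temporal pattern.
group_a = np.sin(t) + 0.05 * rng.standard_normal((5, t.size))
group_b = np.cos(t) + 0.05 * rng.standard_normal((5, t.size))
A = np.vstack([group_a, group_b])          # N x M expression matrix (N = 10)

# Center each time point (column) across genes, then decompose.
A_centered = A - A.mean(axis=0)
U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)

# Scores of each gene on the first three principal components (the 3D map).
scores = U[:, :3] * s[:3]

def spread(points):
    """Largest distance of any point from the centroid of its group."""
    return np.linalg.norm(points - points.mean(axis=0), axis=1).max()

within_a = spread(scores[:5])
within_b = spread(scores[5:])
between = np.linalg.norm(scores[:5].mean(axis=0) - scores[5:].mean(axis=0))
print("spread within groups:", within_a, within_b)
print("distance between group centroids:", between)
```

Genes that share a pattern end up in a tight cloud, while the two clouds sit far apart, which is what the 3D map is meant to show.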
merging the two closest clusters is repeated until a single
cluster remains. This arranges
the data into a tree structure
that can be broken into the
desired number of clusters by
cutting across the tree at a
particular height. Tree structures are easily viewed and
understood (Fig. 2b), and the
hierarchical structure provides
potentially useful information about the relationships
between clusters. Trees are
known to reveal close relationships very well. However, as
What important information can be obtained?
1. Prediction of the class of these genes: The key genes can be
identified. The expression of these genes can be used to predict
the type (cancer or normal) of a cell sample.
[Figure: 3D plot with axes PC1, PC2 and PC3.]
2. Building a large network of genes: Complex data analysis is
performed to learn the gene expression patterns and construct a graph
that shows the dependencies among expressed genes.
Gene expression can also be probed for the effect of
toxic compounds, ions, different peptides etc.
[Figure: expression intensity plotted against time or against other conditions: toxic compounds, different ions, small molecules, different peptides, cell types, etc.]
SVD in biology
[Figure: expression data with a "Gene Type" axis.]
Which genes are most important, i.e. which correspond to
the largest eigenvalues?
What is the relative significance of these
genes?
Can we identify unknown genes?
Computer Science
Identifying faces:
We need a training matrix, some initial data to start our
analysis.
Is there anything special in these pictures?
A face database can be analyzed
the same way. One question that could be asked:
Can we find the important and
good features when we analyze
faces?
Eigenfaces for Recognition
Matthew Turk and Alex Pentland, MIT Media Lab
PCA analysis on Faces - Eigenfaces
[Figure panels: Original Image (1906x1372); First Principal Component; Total 400 Principal Components.]
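Rebuilding an image from only its leading components can be sketched with a truncated SVD. The example below does not use the lecture's photograph: it assumes a small synthetic grayscale image (a gradient plus a bright square), and the helper names `reconstruct` and `relative_error` are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for a grayscale photo: smooth gradient plus a bright square.
rows, cols = 120, 90
y, x = np.mgrid[0:rows, 0:cols]
image = 0.5 * x / cols + 0.5 * y / rows
image[40:80, 30:60] += 1.0

U, s, Vt = np.linalg.svd(image, full_matrices=False)

def reconstruct(k):
    """Rebuild the image from its first k singular triplets."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def relative_error(k):
    """Frobenius-norm error of the rank-k image relative to the original."""
    return np.linalg.norm(image - reconstruct(k)) / np.linalg.norm(image)

for k in (1, 2, 3):
    stored = k * (rows + cols + 1)   # numbers kept for k components
    print(k, stored, rows * cols, relative_error(k))
```

Each extra component costs only `rows + cols + 1` stored numbers, so a handful of components can stand in for the full pixel grid when the image has simple structure.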
GOOGLE WEB PAGE RANKING
Larry Page and Sergey Brin (Computer Science)
They treated the ranking of web pages as an eigenvector problem, which is closely related to
singular value decomposition. The entries of the leading eigenvector are used to rank the pages.
Google calculates this eigenvector for a very large matrix that represents the internet's links,
in order to rank which pages users will most likely (and probably should) end up on.
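A toy version of link-based ranking can be written as a power iteration. This is a minimal sketch, not Google's actual implementation: the four-page link structure is invented, and the damping value 0.85 is an assumption (it is the figure commonly quoted for PageRank).

```python
import numpy as np

# Hypothetical 4-page web: links[i] lists the pages that page i links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = len(links)

# Column-stochastic link matrix: M[j, i] is the chance of following a link from i to j.
M = np.zeros((n, n))
for src, targets in links.items():
    for dst in targets:
        M[dst, src] = 1.0 / len(targets)

damping = 0.85  # assumed value, not taken from the lecture
G = damping * M + (1.0 - damping) / n * np.ones((n, n))

# Power iteration converges to the eigenvector of G with eigenvalue 1.
rank = np.full(n, 1.0 / n)
for _ in range(200):
    rank = G @ rank

ranking = np.argsort(-rank)   # page indices from most to least important
print(rank, ranking)
```

Page 3 has no incoming links, so it ends up last; the pages that collect links from well-ranked pages rise to the top.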
PCA in social science - Sociology
[Figure: variables include reading score, job quality and years of education; groups labeled Cate 1, Cate 2 and Cate 3.]
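Eigenvalue Decomposition
$\lambda_i$ = eigenvalue
$b_i$, $x_i$ = eigenvectors
$A b_i = \lambda_i x_i$
Remember from linear algebra that an eigenvalue decomposition does not
always exist, for example when A is rectangular.
The columns $x_i$ and $b_i$ of X and B are called the left and right
singular vectors respectively, and the diagonal elements $\lambda_i$ of $\Lambda$
are called the singular values.
$A b_i$ is in the direction of $x_i$.
In matrix form, $A b_i = \lambda_i x_i$
becomes
$AV = US$
where the columns of V are the $b_i$, the columns of U are the $x_i$, and S holds the $\lambda_i$ on its diagonal.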
Singular Value Decomposition (SVD) of a rectangular m×n matrix A
is a decomposition of the form
$A = U S V^T$
where U and V have orthonormal columns, and S is a diagonal
matrix.
U is m×n with orthonormal columns
S is n×n and diagonal
V is n×n and orthonormal
Starting from $AV = US$ and using $V V^T = 1$:
$A V V^T = U S V^T$
$A = U S V^T$
This is the singular value decomposition of A. An SVD
can always be written for A.
$S = \mathrm{diag}(s_1, s_2, \dots, s_n)$
The $s_i$ are the singular values (written $\lambda_i$ above), and the $s_i^2$ are the
eigenvalues of $A A^T$ or $A^T A$.
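These relations are easy to verify numerically. The sketch below assumes NumPy and a random 6×4 matrix as a stand-in for real data; `np.linalg.svd` returns the singular values as a vector and V already transposed.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 4
A = rng.standard_normal((m, n))           # hypothetical rectangular data matrix

# Thin SVD: U is m x n, s holds the n singular values, Vt is V transposed (n x n).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)
V = Vt.T

print(np.allclose(A @ V, U @ S))          # AV = US
print(np.allclose(A, U @ S @ Vt))         # A = U S V^T
print(np.allclose(V @ V.T, np.eye(n)))    # V V^T = 1

# Squared singular values equal the eigenvalues of A^T A (sorted descending).
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(s**2, eigvals))
```

All four checks print `True` for any real matrix, which is the practical meaning of "an SVD can always be written".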
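Principal Component Analysis (PCA)
PCA uses the SVD in its calculation.
In PCA, we basically find the eigenvalues and
eigenvectors of the covariance matrix
$C = A A^T / N$
where the rows of A are mean-centered variables and N is the number of observations.
[Figure: scatter plots of two variables a and b.]
$\sigma_{ab}^2 = 0$: uncorrelated
$\sigma_{ab}^2 = \sigma_a^2$: highly correlated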
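The two limiting cases of the covariance can be reproduced with a short computation. This sketch assumes that the rows of A are variables and the columns are observations, so that C = AAT/N is the covariance matrix once each row is mean-centered; the helper `covariance` and the random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5000                                   # number of observations
a = rng.standard_normal(N)
b_independent = rng.standard_normal(N)     # unrelated to a
b_identical = a.copy()                     # perfectly correlated with a

def covariance(rows):
    """C = A A^T / N for a matrix A whose rows are mean-centered variables."""
    A = np.vstack(rows)
    A = A - A.mean(axis=1, keepdims=True)
    return A @ A.T / A.shape[1]

C_uncorr = covariance([a, b_independent])
C_corr = covariance([a, b_identical])

print("off-diagonal, independent variables:", C_uncorr[0, 1])
print("off-diagonal vs variance, identical variables:", C_corr[0, 1], C_corr[0, 0])
```

The off-diagonal entry is close to zero for independent variables and equals the variance when the two variables are the same.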
$$
A = U
\begin{pmatrix}
s_1 & 0 & 0 \\
0 & \ddots & 0 \\
0 & 0 & s_n
\end{pmatrix}
V^T
$$
Remember that the $s_i$ on the diagonal are called the singular
values of A.
The rank R is at most the smallest dimension of A; more precisely,
the number of nonzero singular values determines
the rank R of A.
1. Eigenvectors and eigenvalues always come in pairs.
2. An eigenvalue is the scaling factor of its eigenvector.
3. Every matrix has an SVD.
4. The singular values can be determined and ordered so that
$s_1 \ge s_2 \ge s_3 \ge \dots \ge s_n \ge 0$, with the first R of them strictly positive.
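The ordering of the singular values and their link to the rank can be seen on a small matrix. The 4×3 matrix below is made up so that its third column is the sum of the first two; the tolerance formula is one common convention, not something stated in the lecture.

```python
import numpy as np

# Hypothetical 4 x 3 matrix whose third column is the sum of the first two,
# so only two columns are linearly independent.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 3.0],
              [1.0, 3.0, 4.0]])

s = np.linalg.svd(A, compute_uv=False)    # singular values, largest first

# Count singular values above a small numerical tolerance.
tol = s.max() * max(A.shape) * np.finfo(float).eps
rank_from_svd = int(np.sum(s > tol))

print("singular values:", s)
print("rank from SVD:", rank_from_svd)
```

In floating point the third singular value is not exactly zero, which is why a tolerance is needed before counting.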
EIGENVALUES
Why are eigenvalues important?
They can be considered a characteristic of the matrix.
For example, you can tell if a large set of genes is expressed at
a certain time but not at another. Or you can say which
web sites are more important than others by looking at the
leading eigenvalue and its eigenvector.
Briefly, the eigenvalue for a given factor measures the
variance in all the variables which is accounted for by that
factor. The largest eigenvalue gives the principal axis, along which the
variance is largest.
The ratio of eigenvalues:
It is extremely important. If a factor has a low eigenvalue relative to the others,
it explains little of the variance in the variables.
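The ratio of each eigenvalue to their sum gives the fraction of variance that factor explains. The sketch below uses invented data with three variables of very different spread, arranged as rows of A so that C = AAT/N.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
# Hypothetical data: 3 variables (rows) with standard deviations 5, 1 and 0.1.
scales = np.array([[5.0], [1.0], [0.1]])
A = scales * rng.standard_normal((3, N))
A = A - A.mean(axis=1, keepdims=True)

C = A @ A.T / N
eigenvalues = np.sort(np.linalg.eigvalsh(C))[::-1]   # largest first
explained = eigenvalues / eigenvalues.sum()          # fraction of total variance

for i, (lam, frac) in enumerate(zip(eigenvalues, explained), start=1):
    print(f"factor {i}: eigenvalue {lam:.4f}, explains {100 * frac:.2f}% of the variance")
```

The factor with the smallest eigenvalue contributes almost nothing and could be dropped with little loss.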
The main idea in PCA is to reduce the dimensionality of our
data A by approximating A as a sum of rank-one matrices:
$A_n = \sum_{i=1}^{n} u_i s_i v_i^T$
Each term $u_i s_i v_i^T$ is a rank-one matrix, and $A_n$ approximates A.
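The sum of rank-one matrices can be built term by term. This sketch assumes a random 8×5 matrix and a hypothetical helper `partial_sum`; the error is measured in the Frobenius norm, which is NumPy's default for matrices.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 5))            # hypothetical data matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def partial_sum(n):
    """A_n: the sum of the first n rank-one terms u_i s_i v_i^T."""
    A_n = np.zeros_like(A)
    for i in range(n):
        A_n += s[i] * np.outer(U[:, i], Vt[i, :])
    return A_n

# Approximation error for n = 1, ..., 5 terms.
errors = [np.linalg.norm(A - partial_sum(n)) for n in range(1, 6)]
print(errors)
```

The error shrinks with every added term and vanishes once all terms are included; after n terms it equals the square root of the sum of the squared singular values that were left out.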
PCA by SVD
We can use SVD to perform PCA. We decompose A using SVD:
$A = U S V^T$
PCA seeks a linear combination of variables such that the
maximum variance is extracted from the variables.
It approximates a high-dimensional data set with a lower-dimensional linear subspace that still
contains most of the information in the large set.
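PCA through the SVD and PCA through the covariance matrix give the same variances. The sketch below assumes variables as rows of A and uses three invented variables driven by one hidden signal, so a single principal component should capture nearly everything.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical data matrix A: 3 variables (rows) x 500 observations (columns),
# all driven by one latent signal plus a little noise.
latent = rng.standard_normal(500)
A = np.vstack([latent, 2.0 * latent, -latent]) + 0.1 * rng.standard_normal((3, 500))
A = A - A.mean(axis=1, keepdims=True)
N = A.shape[1]

# Route 1: SVD of the centered data.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
variances_from_svd = s**2 / N

# Route 2: eigenvalues of the covariance matrix C = A A^T / N.
C = A @ A.T / N
variances_from_cov = np.sort(np.linalg.eigvalsh(C))[::-1]

# Project onto the first principal component: 3 variables reduced to 1.
pc1_scores = U[:, 0] @ A
retained = variances_from_svd[0] / variances_from_svd.sum()
print(variances_from_svd, variances_from_cov, retained)
```

Working from the SVD avoids forming the covariance matrix explicitly, which is one reason it is often preferred in practice.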
[Figure: Principal Component 1 and Principal Component 2 drawn against the original axis.]
Genes with same pattern might have similar function
Figure 3: A spectrum of possible redundancies in
data from the two separate recordings r1 and r2 (e.g.
xA, yB). The best-fit line r2 = k r1 is indicated by
the dashed line.
Take home message.
SVD is a great mathematical method to analyze
complex data.
Covariance can be considered to be a measure of how
well correlated two variables are.
The correlations among the observations become clear
after diagonalization of the covariance matrix.
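Diagonalizing the covariance matrix can be sketched in a few lines. The example assumes two invented recordings, r1 and r2, where r2 is mostly a scaled copy of r1; expressing the data in the eigenvector basis removes the off-diagonal covariance.

```python
import numpy as np

rng = np.random.default_rng(6)
r1 = rng.standard_normal(1000)
r2 = 0.8 * r1 + 0.2 * rng.standard_normal(1000)   # hypothetical correlated recordings
A = np.vstack([r1, r2])
A = A - A.mean(axis=1, keepdims=True)
N = A.shape[1]

C = A @ A.T / N
eigenvalues, P = np.linalg.eigh(C)      # columns of P are eigenvectors of C

Y = P.T @ A                             # data expressed in the eigenvector basis
C_Y = Y @ Y.T / N                       # covariance in the new basis

print("original covariance:\n", C)
print("covariance after diagonalization:\n", C_Y)
```

In the new basis the variables are uncorrelated, and the diagonal entries are the eigenvalues, i.e. the variances along the principal axes.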