A simple statistical model
for deciphering the cdc15-synchronized yeast
cell cycle-regulated gene expression
data
Ker-Chau Li, Robert Yuan
Statistics, UCLA
Ming Yan
Biochemistry, UCLA
The goal of this study is to
demonstrate how simple statistical
models can help
organize and explain
complex gene expression patterns
Outline
• Introduction: microarray and cell cycle
• Data: cdc15 experiment
• A statistical model
• Phase determination
• Comparison with Spellman et al. (1998)
• Regularly oscillating genes
• Further discussion
MicroArray
• Allows measuring the mRNA level of thousands
of genes in one experiment -- system level
response
• The data generation can be fully automated by
robots
• Common experimental themes:
– Time Course
– Mutation/Knockout Response
Time Course:
[Figure: expression level (0 to 1) against time, after a change of condition]
Or:
       A      B      C      D      E
A      --     2.1    0.8    1.3    0.5
B      0.2    --    -0.5    2.3    0.22
C      …     -1.2    --     0.3   -1.1
…..
MicroArray Technique:
Synthesize Gene
Specific DNA Oligos
Tissue or Cell
Attach oligo to
Solid Support
extract mRNA
Amplification
and Labeling
Hybridize
Scan and Quantitate
Yeast Cell Cycle
(adapted from Molecular Cell Biology, Darnell et al)
Getting a homogeneous
population of cells:
cell cycle
Cells at various
stages of cell cycle
Synchronization conditions:
-Temperature shift to 37 C for
CDC15 yeast ts-strain
-add pheromone
-Elutriation
Release back into cell cycle
Take sample
as cells progress
through cycle
simultaneously
The data set is available at
http://cellcycle-www.stanford.edu
We focus on one
experiment in
which a strain of yeast (cdc15-2)
was incubated at a high
temperature (35 degrees C) for a long
time,
causing cdc15 arrest. Cells were
then shifted back to a
low temperature (23 degrees C) and
gene expression
was monitored every 10 min for 300 min.
Data from some chips are not available.
We concentrate on the 19
consecutive time points from 70 mins
to 250 mins.
24 time points (mins):
10 30 50 70 80 ..... 240 250 270 290
(70 through 250 are 10 mins apart)
Use of the full data will be discussed later.
Genes with missing values are also
deleted.
There are 4530 genes remaining.
The data can be represented by a
4530 by 19 matrix.
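The selection step above can be sketched as follows. This is a minimal illustration, not the authors' code: the dictionary layout, the function name `complete_rows`, and the toy values are all hypothetical.

```python
import math

# The 19 consecutive time points (in minutes) kept for the analysis.
TIME_POINTS = list(range(70, 251, 10))

def complete_rows(rows):
    """Keep only genes whose every measurement is present (not None or NaN)."""
    def present(x):
        return x is not None and not (isinstance(x, float) and math.isnan(x))
    return {gene: vals for gene, vals in rows.items()
            if all(present(v) for v in vals)}

# Hypothetical toy input: gene name -> expression values at consecutive time points.
toy = {
    "geneA": [0.1, -0.2, 0.3],
    "geneB": [0.5, None, 0.1],          # dropped: missing value
    "geneC": [float("nan"), 0.0, 0.2],  # dropped: NaN
}
kept = complete_rows(toy)
```

Stacking the surviving rows in a fixed gene order gives the genes-by-time matrix used in the rest of the analysis.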
Example of the time curve:
Histone Genes: (HTT2)
ORF: YNL031C
Time course:
YKL164C
YNL082W
Preliminary study with two-way ANOVA
This investigates the constancy of the average expression
level over time for each gene and the constancy of
the average expression level over all genes at each time
point.
> cdc15
Factor   | df    | SS        | MS        | F
gene     | 4529  | 5.2408E+2 | 1.1572E-1 | 6.4169E-1
time     | 18    | 2.9745E+2 | 1.6525E+1 | 9.1638E+1
residual | 81522 | 1.4701E+4 | 1.8033E-1 |
total    | 86069 | 1.5522E+4 |           |
Gene effect: insignificant
Time appears statistically significant; But …………(next slide)
Column mean (Time) from ANOVA result
The values are small
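For readers who want to retrace the two-way ANOVA (one observation per gene and time point), here is a minimal sketch in plain Python. The function name and the 2 by 2 toy matrix are hypothetical; only the decomposition itself is standard.

```python
def two_way_anova(matrix):
    """Two-way ANOVA without replication for a rows (genes) x columns (times) table."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    grand = sum(sum(r) for r in matrix) / (n_rows * n_cols)
    row_means = [sum(r) / n_cols for r in matrix]
    col_means = [sum(r[j] for r in matrix) / n_rows for j in range(n_cols)]
    ss_row = n_cols * sum((m - grand) ** 2 for m in row_means)
    ss_col = n_rows * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in matrix for x in r)
    ss_res = ss_total - ss_row - ss_col
    df_row, df_col = n_rows - 1, n_cols - 1
    df_res = df_row * df_col
    ms_res = ss_res / df_res
    return {
        "df": (df_row, df_col, df_res),
        "ss": (ss_row, ss_col, ss_res, ss_total),
        "F": ((ss_row / df_row) / ms_res, (ss_col / df_col) / ms_res),
    }

# Hypothetical toy table, small enough to check by hand.
toy_result = two_way_anova([[1.0, 2.0], [3.0, 5.0]])
```

For a 4530 by 19 matrix the same formulas give 4529, 18 and 81522 degrees of freedom for gene, time and residual.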
The expression level is log_2 of the ratio red/green
Red = light intensity of red channel - “noise”
Green = light intensity of green channel - “noise”
Red channel = mRNA from cells at one time point
Green channel = mRNA from unsynchronized cells
A .5 fold increase gives log_2 1.5 = .585; a level of .15 gives 2^.15 = 1.11, i.e. a .11
fold increase
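The conversion between log ratios and fold increases can be checked with a short sketch. The function names and the intensity values are illustrative assumptions; the noise terms stand for whatever background correction was applied.

```python
import math

def log_ratio(red, green, red_noise=0.0, green_noise=0.0):
    """log_2 of the background-corrected red/green intensity ratio."""
    return math.log2((red - red_noise) / (green - green_noise))

def fold_increase(log2_value):
    """Fold increase implied by a log_2 expression level."""
    return 2 ** log2_value - 1

# Hypothetical intensities: red is 1.5 times green, i.e. a .5 fold increase.
level = log_ratio(150.0, 100.0)   # about 0.585
small = fold_increase(0.15)       # about 0.11
```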
A statistical model
• Motivation:
modeling each curve
with simple functions such as linear,
quadratic, sine, cosine appears reasonable
but inflexible;
• Parsimony and accuracy can be gained if
basis curves are chosen by the data themselves
• The model: each gene expression curve =
$c_0 + c_1 V_1 + c_2 V_2 + c_3 V_3 + \varepsilon$
$V_1$: 1st basis curve, $V_2$: 2nd basis curve,
$V_3$: 3rd basis curve, $\varepsilon$: error
The model - continued
The errors have mean zero, are uncorrelated,
and have the same variance across time;
but the variance may depend on the gene
(This is important)
It turns out that we can find the basis
functions from an application of PCA.
(see pdf file for pca)
Enhanced PCA for curve fitting
• Choose the number of basis curves by
eigenvalues
• Assess the goodness of each curve fit by
R-squared and by residual sum of squares
• Identify genes that comply well with the model
• Interactive plotting helps in resetting user-specified parameters

PCA:
For a list of vectors, PCA can be used to find a common basis from
the covariance matrix.
Covariance Matrix:
$\Sigma = (X - \mu)'(X - \mu)$
The directions found have the highest variance along them.
Find the directions by eigenvalue decomposition:
$\Sigma v_i = \lambda_i v_i$
Model the curves by the PCA directions:
$X_i = a_{1i} v_1 + a_{2i} v_2 + \cdots + a_{ki} v_k + \varepsilon_i$
Here, we chose the first three PCA directions as our basis.
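A minimal sketch of this step, assuming numpy is available. The function names are hypothetical, the random matrix only stands in for the 4530 by 19 data, and centering on the column means is an assumption about how the mean was removed.

```python
import numpy as np

def pca_basis(X, k=3):
    """Top-k eigenvalues and eigenvectors of the scatter matrix of X (rows = genes)."""
    centered = X - X.mean(axis=0)               # assumption: subtract the mean curve
    scatter = centered.T @ centered             # time-by-time scatter matrix
    eigvals, eigvecs = np.linalg.eigh(scatter)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
    return eigvals[order], eigvecs[:, order]

def fit_curve(x, basis):
    """Least-squares fit of one curve on a constant plus the basis curves."""
    design = np.column_stack([np.ones(len(x)), basis])
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    rss = float(np.sum((x - design @ coef) ** 2))
    tss = float(np.sum((x - x.mean()) ** 2))
    return coef, rss, 1.0 - rss / tss

# Stand-in data: 50 hypothetical genes measured at 19 time points.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 19))
eigenvalues, basis = pca_basis(X, k=3)
coef, rss, r_squared = fit_curve(X[0], basis)
```

With 19 time points and four fitted coefficients, the residual sum of squares has 15 degrees of freedom, matching the least-squares output shown on the example slides.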
1st PCA direction
2nd PCA direction
3rd PCA direction
Eigenvalues
1. Compliance Check:
H0: three-bases model holds
Reject if $R_i^2 < 0.56$ and $RSS_i > 7.25$
(Corr. coeff. between fit and observed < .75
and error s.d. bigger than .70, which is equivalent to .5
fold increase.)
2. Cycle Component Check: $H_0: a_{2i} = a_{3i} = 0$
Reject if
$\frac{(a_{2i}^2 + a_{3i}^2)/2}{RSS_i/15} > F_{2,15}(0.95) = 3.68$
3. Smoothness Check: $H_0: a_{1i} = 0$
Reject if
$\frac{|a_{1i}|}{\sqrt{RSS_i/15}} > t_{15}(0.975) = 2.131$
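The three checks can be chained into one classification routine. The sketch below is an interpretation, not the authors' code: the function name is hypothetical, the critical values are the ones quoted on the slide, and the reading that rejecting the smoothness hypothesis means "non-smooth" is inferred from the example genes later in the deck, whose coefficients are used here.

```python
import math

F_CRIT = 3.68   # F_{2,15}(0.95), as quoted on the slide
T_CRIT = 2.131  # t_{15}(0.975), as quoted on the slide
DF = 15         # 19 time points minus 4 fitted coefficients

def classify(a1, a2, a3, rss, r2):
    """Apply the compliance, cycle-component and smoothness checks in order."""
    if r2 < 0.56 and rss > 7.25:
        return "non-compliance"
    sigma2 = rss / DF
    f_stat = ((a2 ** 2 + a3 ** 2) / 2) / sigma2
    if f_stat <= F_CRIT:
        return "insignificant cycle component"
    t_stat = abs(a1) / math.sqrt(sigma2)
    return "non-smooth" if t_stat > T_CRIT else "smooth"

# Coefficients taken from the least-squares output on later slides;
# RSS is recovered as 15 * sigma_hat^2 where only sigma hat is reported.
gln1 = classify(-2.47649, 3.958405e-2, 1.01860, DF * 0.207573 ** 2, 0.917337)
ygr231c = classify(-0.156478, -1.59995, -0.623201, DF * 0.180082 ** 2, 0.859375)
yjl159w = classify(1.28464, -2.04016, -1.49779, 14.15322, 0.36273)
```

These three calls return "non-smooth", "smooth" and "non-compliance", which agrees with how the slides label GLN1, YGR231C and YJL159W.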
6178 genes
  missing values: 1648
  complete: 4530
    non-compliance: 41
    compliance: 4489
      insignificant cycle components: 2824
      significant cycle components: 1665
        smooth: 714
        non-smooth: 951
For the non-compliance group, each curve pattern is examined visually.
*** of these 41 have visible cycle patterns.
Noncompliance genes (41)
• High overall expression levels
• May or may not show cycle patterns
• Recommendation:
inspect each gene separately
Phase determination
• The second and the third basis curves show
clear cycle patterns. The third basis appears
to be a 40 min-delayed version of the
second basis, with an R-squared value of
.78
• Linear combinations of these two basis
curves show a variety of expression
patterns.
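One way to quantify "a 40 min-delayed version" is the R-squared between one curve and the other shifted by 4 samples (10 min apart). The sketch below uses synthetic sinusoids with a hypothetical 120 min period, since the basis curves themselves are not reproduced in this transcript.

```python
import math

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def lagged_r_squared(lead, lagged, lag):
    """R-squared between lead[t] and lagged[t + lag], lag counted in samples."""
    return r_squared(lead[:len(lead) - lag], lagged[lag:])

times = list(range(70, 251, 10))                     # the 19 sampled minutes
period = 120.0                                       # hypothetical cycle length
second = [math.sin(2 * math.pi * t / period) for t in times]
third = [math.sin(2 * math.pi * (t - 40) / period) for t in times]
match = lagged_r_squared(second, third, lag=4)       # exact delay, so close to 1
```

On real basis curves the value would fall below 1, as with the .78 reported above.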
Construction of a compass plot
• Use of known cycle-regulated genes
• Compliance checking with RSS/R^2 plot
• Cycle-exhibition checking with projection angles
• Coherent pattern checking by ANOVA
• (A list of 104 known genes in 6 groups)
Phases of genes:
Identify the phases of genes:
Prior Knowledge:
There were 104 known genes whose phases were
determined by traditional experimental methods.
Known genes:
There are 6 groups of genes.
SCB (G1 phase)
MCB (G1 phase)
Histone (S phase)
S/G2 phase
G2/M phase
M/G1 phase
Noncompliance genes and genes without significant cycle
components are excluded.
The SCB group is also excluded due to
inconsistent patterns within its expression vectors.
82 non-missing known phase genes
Remove genes with
insignificant cycle component
Points obtained by
normalizing the loading
coeff. for 2nd and 3rd
bases to unit length
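The normalization to unit length can be sketched as below. The function names are hypothetical, and the angle convention (2nd-basis coefficient on the horizontal axis, counter-clockwise radians) is an assumption; it may not reproduce the "Angle" values printed on the example slides, which could rely on a different orientation or scaling.

```python
import math

def compass_point(a2, a3):
    """Project the (2nd, 3rd) basis coefficients onto the unit circle."""
    norm = math.hypot(a2, a3)
    return a2 / norm, a3 / norm

def phase_angle(a2, a3):
    """Angle of the compass point in radians, in (-pi, pi]."""
    return math.atan2(a3, a2)

# Hypothetical coefficients chosen so the arithmetic is easy to follow.
point = compass_point(3.0, 4.0)   # (0.6, 0.8)
angle = phase_angle(0.0, 1.0)     # pi / 2
```

Genes of a known phase then cluster in a sector of the circle, and an unknown gene can be assigned the phase of the sector its angle falls in.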
Late G1, SCB regulated genes:
Compass plot for phase assignment
[Figure: compass plot; labels: Histone genes, S, G1, S/G2, M/G1, G2/M]
Phase Assignment
Phase    Smooth   Non-smooth
G1       103      108
S        27       31
S/G2     255      352
G2/M     239      295
M/G1     90       165
Total    714      951
Comparison
• We re-classified the 800 cell cycle-regulated genes classified
by Spellman et al. with our method. If a gene does not comply
with our model or does not have significant
second or third regression coefficients, we
do not assign a phase.
• Contingency tables of mismatched and
unclassified cases.
800 genes
  missing values
  complete: 654
    non-compliance: 9
    compliance: 645
      insignificant cycle components: 130
      significant cycle components: 515
        smooth: 293
        non-smooth: 222
The group of 130 genes with insignificant cycle components appears quite bumpy.
A non-compliance gene
YJL159W (M/G1):
Spellman et al.'s Score: 10.86
R2: 0.36273
RSS: 14.15322
Angle: -2.43803
Least Squares Estimates:
Constant     -4.794002E-16 (0.222846)
Variable 0    1.28464 (0.971364)
Variable 1   -2.04016 (0.971364)
Variable 2   -1.49779 (0.971364)
Black: data curve
Red: fitted curve (full model)
Blue: fitted curve (cyclic model)
Locus_info: Other_name PIR2
YJL159W
CCW7
ORE1
Gene_class HSP
Gene_Info HSP150
Gene_product Heat shock protein, secretory
glycoprotein
Function cell wall structural protein
Cellular_Component cell wall
Process cell wall organization and biogenesis
Phenotype Null mutant is viable
Locus_notes 14 HSP150 has also been called
gp400
Position_info: Chromosome X
ORF_name YJL159W
An example of our non-compliance gene
YDR055W (M/G1):
Spellman et al.'s Score: 7.266
R2: 0.30136
RSS: 7.94018
Angle: -2.81396 (Insig. Coef.)
Least Squares Estimates:
Constant     -5.428720E-16 (0.166914)
Variable 0    1.47329 (0.727561)
Variable 1   -1.07451 (0.727561)
Variable 2   -0.316032 (0.727561)
Black: data curve
Red: fitted curve (full model)
Blue: fitted curve (cyclic model)
Locus_info: Other_name YDR055W
Gene_class PST
Gene_Info PST1
Description Protoplasts-secreted
Gene_product The gene product has
been detected among the proteins
secreted by regenerating
protoplasts
Phenotype Viable
Position_info: Chromosome IV
ORF_name YDR055W
An example of non-compliance gene
YNL082W (G1):
Spellman et al.'s Score: 4.843
R2: 0.229191
RSS: 18.24748
Least Squares Estimates:
Constant     -6.087129E-16 (0.253035)
Variable 0    1.51725 (1.10295)
Variable 1   -1.74757 (1.10295)
Variable 2    0.263945 (1.10295)
Black: data curve
Red: fitted curve (full model)
Blue: fitted curve (cyclic model)
Top 10 scores and gene names from the insignificant
cycle component group
3.69 3.85 3.874 4.022 4.048 4.13 4.41 5.047 6.28 6.716
"YOR263C" "YOR320C" "YGR035C" "YCR042C" "YPR019W"
"YJL194W" "YJR010W" "YEL068C" "YGR124W" "YKL172W"
78 genes score higher than 6.716;
188 genes score higher than 4.022;
213 genes score higher than 3.69.
Yet these genes appear very bumpy; see next slide
An example of insignificant cycle
component gene
YGR124W (S/G2):
Spellman et al.'s Score: 6.28
R2: 0.364945 (small)
RSS: 0.812496 (small)
Angle: 3.13118
Locus_info: Other_name YGR124W
Gene_class ASN
Gene_Info ASN2
Description Asn1p and Asn2p are
isozymes
Gene_product asparagine synthetase
Phenotype Null mutant is viable; L-asparagine
auxotrophy occurs upon
mutation of both ASN1 and ASN2
Position_info: Chromosome VII
ORF_name YGR124W
[Figure: cdc15 time courses from 70 mins to 250 mins for EBP2: YKL172W, TSM1: YCR042C, and YOR263C]
Non-smooth group from 800 genes
Our\their   G1    S     S/G2   G2/M   M/G1   Total
G1          59    4     1      0      18     82
S           6     3     7      0      0      16
S/G2        0     0     31     3      0      34
G2/M        0     0     17     47     4      68
M/G1        0     0     0      1      21     22
Total       65    7     56     51     43     222
Smooth group from 800 genes
Low overall expression level
Our\their   G1    S     S/G2   G2/M   M/G1   Total
G1          74    7     5      0      43     129
S           8     10    11     0      0      29
S/G2        0     1     43     1      0      45
G2/M        0     0     17     39     3      59
M/G1        1     0     1      1      28     31
Total       83    18    77     41     74     293
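The agreement between the two classifications can be summarized from the diagonals of the two contingency tables. The sketch below copies the counts from the slides; the function name and the list layout are illustrative.

```python
PHASES = ["G1", "S", "S/G2", "G2/M", "M/G1"]

# Rows: our phase; columns: Spellman et al.'s phase (counts from the slides).
NON_SMOOTH = [
    [59, 4, 1, 0, 18],
    [6, 3, 7, 0, 0],
    [0, 0, 31, 3, 0],
    [0, 0, 17, 47, 4],
    [0, 0, 0, 1, 21],
]
SMOOTH = [
    [74, 7, 5, 0, 43],
    [8, 10, 11, 0, 0],
    [0, 1, 43, 1, 0],
    [0, 0, 17, 39, 3],
    [1, 0, 1, 1, 28],
]

def agreement(table):
    """Return (matched, total, fraction) where matched counts the diagonal cells."""
    total = sum(map(sum, table))
    matched = sum(table[i][i] for i in range(len(table)))
    return matched, total, matched / total

non_smooth_summary = agreement(NON_SMOOTH)  # 161 of 222 genes agree
smooth_summary = agreement(SMOOTH)          # 194 of 293 genes agree
```

Most disagreements fall in cells next to the diagonal (for example G2/M versus S/G2), with the G1 versus M/G1 cell as the largest single mismatch in both tables.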
[Figure: time courses of CLN2: YPL256C (G1), HTA1: YDR225W (S), CLB4: YLR210W (S/G2), and YJL091C (Phase ??)]
[Figure: time courses of CLN2: YPL256C (G1), HTA1: YDR225W (S), CLB4: YLR210W (S/G2), and FKS1: YLR342W (Phase ??)]
From 5 cell
From 1, total SS small
YOR264W
Least Squares Estimates:
Constant     -5.706461E-16 (4.704328E-2)
Variable 0   -0.170979 (0.205057)
Variable 1    0.479678 (0.205057)
Variable 2    0.762583 (0.205057)
R Squared: 0.571396
Sigma hat: 0.205057
Number of cases: 19
Degrees of freedom: 15
Oscillated genes
• The first basis curve oscillates in an
extremely regular way
• There are over 200 genes with such regular
oscillating patterns
• Role unknown: Systematic error?
Common upstream promoter region?
DIM1 (YPL266W)
Locus_info: Other_name YPL266W
Gene_class DIM
Gene_Info DIM1
Description Dimethyladenosine transferase,
(rRNA(adenine-N6,N6-)-dimethyltransferase), responsible for
m6[2]Am6[2]A dimethylation in 3'-terminal loop of 18S rRNA
Gene_product dimethyladenosine transferase
Function rRNA (adenine-N6,N6-)-dimethyltransferase
Cellular_Component nucleolus
Process 35S primary transcript processing
rRNA modification
Phenotype Null mutant is inviable
Position_info: Chromosome XVI
ORF_name YPL266W
RPS1A (YLR441C)
Locus_info: Other_name YLR441C
RP10A
Gene_class RPS
Gene_Info RPS1A
Description Homologous to rat S3A
Gene_product Ribosomal protein S1A (rp10A)
Function structural protein of ribosome
Cellular_Component cytosolic small ribosomal (40S)-subunit
Process 0006416
protein biosynthesis
Locus_notes 13 RP10A (RPS1A) and RP10B (RPS1B) are nearly identical; this
gene has also been called PLC1, but should not be confused
with PLC1 on chromosome XVI encoding a
phosphoinositide-specific phospholipase
Position_info: Chromosome XII
ORF_name YLR441C
GLN1: YPR035W
One gene from non-smooth group
Not in Spellman et al.'s list.
Least Squares Estimates:
Constant     -6.276471E-16 (4.762055E-2)
Variable 0   -2.47649 (0.207573)
Variable 1    3.958405E-2 (0.207573)
Variable 2    1.01860 (0.207573)
R Squared: 0.917337
Sigma hat: 0.207573
Further discussion
• Others who use PCA
• Clustering
• Other data set
• Use of SIR/PHD
• Without a time scale? B-cell lymphoma
data
• Pathway study
• Could genes with overall small expression levels have been
removed from the beginning?
YGR231C
One gene from smooth group
Not in Spellman et al.'s list.
Least Squares Estimates:
Constant     -5.803153E-16 (4.131369E-2)
Variable 0   -0.156478 (0.180082)
Variable 1   -1.59995 (0.180082)
Variable 2   -0.623201 (0.180082)
R Squared: 0.859375
Sigma hat: 0.180082
Total sum of squares equals 3.4591, which is about the
71.6th percentile among all genes.
The median of the total sum of squares is 2.27735.
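The total sum of squares quoted here can be recovered from the reported sigma hat and R-squared, since RSS = 15 × (sigma hat)^2 and R-squared = 1 − RSS/TSS. A short check, with a hypothetical function name:

```python
def total_ss(sigma_hat, r_squared, df=15):
    """Total sum of squares implied by the error s.d. and R-squared of a fit."""
    rss = df * sigma_hat ** 2       # residual sum of squares
    return rss / (1.0 - r_squared)  # from R^2 = 1 - RSS / TSS

# Values reported for YGR231C on the slide.
ygr231c_tss = total_ss(0.180082, 0.859375)  # about 3.4591
```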
THE END
YBL002W
YER124C
YDR224C
YJL159W
YKL163W
YKL164C
YKL185W
YMR003W
YMR011W
YNL160W
YDR055W