DW-lecture6

Explicit Definition of Concept Hierarchies
[Figure: entity-relationship schema connecting Disease, Gene Ontology, Patient, Anatomy Ontology, Gene Cluster, Gene Sequence, Array Probe, Clinical Sample, mRNA Expression, Project, Platform, Normalization, Measurement Unit, and Experiment, with 1:n and n:n cardinalities on the links.]
Sample Classification Hierarchy
[Figure: sample classification tree. All_diseases splits into Tumor and Normal. Tumor includes CNS_tumor (e.g., Glioblastoma), Leukemia (ALL, AML), and Adenocarcinoma (Colon tumor, Breast tumor). Normal includes Brain, Blood, Colon, and Breast. Below the disease types come the (Patients) level and then the (Clinical Samples) level.]
Aggregate Functions
• Simple: sum, average, max, min, etc.
• Statistical: variance, standard deviation, t-statistic, F-statistic, etc.
• User-defined: e.g., for aggregation of Affymetrix
gene expression data on the Measurement Unit
dimension, we may define the following function:
$$
\mathrm{Exp} =
\begin{cases}
\mathrm{Val} & \text{if PA = `P' or `M'} \\
0 & \text{if PA = `A'}
\end{cases}
$$
Here, Exp is summarized gene expression; Val
and PA are the numeric value and PA call given
by the Affymetrix platform, respectively.
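The user-defined aggregate above can be sketched in a few lines of Python. The function name `summarize_expression` and the sample measurements are hypothetical, chosen only for illustration.

```python
def summarize_expression(val, pa_call):
    """Return the summarized expression Exp for one Affymetrix measurement.

    val is the numeric value; pa_call is the PA call ('P', 'M', or 'A').
    """
    if pa_call in ("P", "M"):
        return val
    if pa_call == "A":
        return 0
    raise ValueError(f"unexpected PA call: {pa_call!r}")


# Hypothetical (Val, PA) measurements for one gene across three samples.
measurements = [(10, "P"), (14, "M"), (18, "A")]
summarized = [summarize_expression(val, pa) for val, pa in measurements]
print(summarized)  # [10, 14, 0]
```

Raising an error on an unknown call is a design choice of this sketch; the slide only defines the three calls P, M, and A.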
Conventional OLAP Operations
• Roll-up: aggregation on a data cube, either by
climbing up a concept hierarchy for a dimension
or by dimension reduction.
• Drill-down: the reverse of roll-up, navigation
from less detailed data to more detailed data.
• Slice: selection on one dimension of the given
data cube, resulting in a subcube.
• Dice: defining a subcube by performing a
selection on two or more dimensions.
• Pivot: a visualization operation that rotates the
data axes to provide an alternative presentation.
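As a rough illustration of these operations, the sketch below stores a tiny cube as a Python dictionary keyed by (gene, disease, patient). The data, the key layout, and the function names are all assumptions made for the example, not part of the lecture.

```python
from collections import defaultdict

# Hypothetical cube: (gene, disease, patient) -> expression value.
cube = {
    ("g1", "ALL", "p1"): 10, ("g1", "ALL", "p2"): 14,
    ("g1", "AML", "p3"): 18, ("g2", "ALL", "p1"): 5,
    ("g2", "ALL", "p2"): 7, ("g2", "AML", "p3"): 32,
}


def roll_up(cells):
    """Roll-up: aggregate the patient level away by summing per (gene, disease)."""
    totals = defaultdict(int)
    for (gene, disease, _patient), value in cells.items():
        totals[(gene, disease)] += value
    return dict(totals)


def slice_cube(cells, disease):
    """Slice: selection on one dimension (disease)."""
    return {key: v for key, v in cells.items() if key[1] == disease}


def dice(cells, genes, diseases):
    """Dice: selection on two dimensions (gene and disease)."""
    return {key: v for key, v in cells.items()
            if key[0] in genes and key[1] in diseases}


def pivot(rolled):
    """Pivot: swap the two axes of a rolled-up (gene, disease) table."""
    return {(disease, gene): v for (gene, disease), v in rolled.items()}


by_disease = roll_up(cube)
print(by_disease[("g1", "ALL")])  # 24
```

Drill-down needs no function of its own here: the original `cube` still holds the patient-level detail that `roll_up` summarizes.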
t Test
• The t-Test assesses whether the means of two
groups are statistically different from each other.
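• Given two groups of samples $X_1: \{n_1, \bar{x}_1, s_1^2\}$ and $X_2: \{n_2, \bar{x}_2, s_2^2\}$:
N: number of samples
$\bar{x}$: the mean of the samples
$s^2$: the variance of the samples,
$$
s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}
$$
The denominator N − 1 is the degrees of freedom; it is used instead of N to correct for the bias of the sample.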
• Assumption: the differences in the groups follow a normal distribution.
Degrees of Freedom (df)
Idea: Number of observations that are free to vary
after sample mean has been calculated
Example: Suppose the mean of 3 numbers is 8.0
Let X1 = 7
Let X2 = 8
What is X3?
If the mean of these three values is
8.0,
then X3 must be 9
(i.e., X3 is not free to vary)
Here, n = 3, so degrees of freedom = n – 1 = 3 – 1 = 2
(2 values can be any numbers, but the third is not free to
vary for a given mean)
Basic Business Statistics, 11e ©
2009 Prentice-Hall, Inc..
Chap 8-6
Student t-distribution
• It is a family of continuous probability
distributions that arises when estimating
the mean of a normally distributed
population in situations where the sample
size is small and population standard
deviation is unknown.
t Test
• Hypotheses: H0 (null hypothesis): µ1 = µ2
Hα (alternative hypothesis): µ1 ≠ µ2
• Choose the level of significance: α = 0.05 (the amount of uncertainty we are prepared to accept in the study).
• Test statistic:
$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}}
$$
• The t-value can be positive or negative (positive if the first mean is larger
than the second and negative if it is smaller).
• Calculate the p-value corresponding to t-value: look up a table.
• The t is a family of distributions
Student’s t Distribution
Note: t → Z as n increases.
t-distributions are bell-shaped and symmetric, but have ‘fatter’ tails than the normal.
[Figure: density curves centered at 0 for the standard normal (t with df = ∞), t (df = 13), and t (df = 5).]
Selected t distribution values
With comparison to the Z value
Confidence Level   t (10 d.f.)   t (20 d.f.)   t (30 d.f.)   Z (∞ d.f.)
0.80               1.372         1.325         1.310         1.28
0.90               1.812         1.725         1.697         1.645
0.95               2.228         2.086         2.042         1.96
0.99               3.169         2.845         2.750         2.58
Note: t → Z as n increases
Example of t distribution confidence
interval
A random sample of n = 25 has $\bar{X} = 50$ and S = 8. Form a 95% confidence interval for μ.
– d.f. = n − 1 = 24, so $t_{\alpha/2} = t_{0.025} = 2.0639$
The confidence interval is
$$
\bar{X} \pm t_{\alpha/2} \frac{S}{\sqrt{n}} = 50 \pm (2.0639) \frac{8}{\sqrt{25}}
$$
46.698 ≤ μ ≤ 53.302
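The interval above can be checked with a short Python sketch. The function name `t_confidence_interval` is hypothetical, and the critical value 2.0639 is taken from the example rather than computed.

```python
import math


def t_confidence_interval(sample_mean, s, n, t_crit):
    """Return (lower, upper) for sample_mean +/- t_crit * s / sqrt(n)."""
    half_width = t_crit * s / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# Values from the example: n = 25, sample mean 50, S = 8, t(0.025, 24 d.f.) = 2.0639.
lower, upper = t_confidence_interval(sample_mean=50, s=8, n=25, t_crit=2.0639)
print(f"{lower:.3f} <= mu <= {upper:.3f}")  # 46.698 <= mu <= 53.302
```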
P - Value
• The p-value is the upper-tail (or lower-tail) area of the t curve.
• Steps to accept/reject the null hypothesis H0:
– Calculate the t-statistic.
– Look up the table to find the p-value.
– Given the significance level α, if the p-value is smaller than α, then reject H0; otherwise, accept H0.
[Figure: the t curve with 25 degrees of freedom; the tail area beyond the t-statistic value is the p-value.]
New OLAP Operation: Compare
• Compare two random variables by computing
ratios, differences or t-statistics.
• Example:
Question: Is gene X expressed differently between two groups?
Solution:
(1) Compute the mean and variance.
(2) Compute t and p: t = 3.120; one-tail p = 0.013 (assuming unequal variances) or 0.007 (assuming equal variances).
Answer: Yes (at 5% significance level)

Different measurements of gene X
            Disease 1   Disease 2
            100         70
            90          72
            105         81
            83          74
            78          75
Mean        91.2        74.4
Variance    127.7       17.3
N           5           5
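$$
\bar{X}_1 = 91.2, \quad n_1 = 5, \quad
\mathrm{Var}X_1 = \frac{\sum_{i=1}^{n_1} (X_{1i} - \bar{X}_1)^2}{n_1 - 1}
= \frac{(100 - 91.2)^2 + \cdots + (78 - 91.2)^2}{5 - 1} = 127.7
$$
$$
\bar{X}_2 = 74.4, \quad n_2 = 5, \quad
\mathrm{Var}X_2 = \frac{\sum_{i=1}^{n_2} (X_{2i} - \bar{X}_2)^2}{n_2 - 1}
= \frac{(70 - 74.4)^2 + \cdots + (75 - 74.4)^2}{5 - 1} = 17.3
$$
$$
t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\mathrm{Var}X_1}{n_1} + \dfrac{\mathrm{Var}X_2}{n_2}}}
= \frac{91.2 - 74.4}{\sqrt{\dfrac{127.7}{5} + \dfrac{17.3}{5}}} = 3.120
$$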
Assuming unequal variances, the degrees of freedom are:
$$
df = \frac{\left( \dfrac{\mathrm{Var}X_1}{n_1} + \dfrac{\mathrm{Var}X_2}{n_2} \right)^2}
{\dfrac{(\mathrm{Var}X_1 / n_1)^2}{n_1 - 1} + \dfrac{(\mathrm{Var}X_2 / n_2)^2}{n_2 - 1}}
= \frac{\left( \dfrac{127.7}{5} + \dfrac{17.3}{5} \right)^2}
{\dfrac{(127.7 / 5)^2}{4} + \dfrac{(17.3 / 5)^2}{4}}
= 5.06 \approx 5
$$
p = 0.013 (calculated using Excel's TDIST function, one-tail)
With $\bar{X}_1 = 91.2$, $\mathrm{Var}X_1 = 127.7$, $n_1 = 5$ and $\bar{X}_2 = 74.4$, $\mathrm{Var}X_2 = 17.3$, $n_2 = 5$:
Pooled sample variance (assuming equal variance):
$$
\mathrm{Var}X_{12} = \frac{(n_1 - 1)(\mathrm{Var}X_1) + (n_2 - 1)(\mathrm{Var}X_2)}{n_1 + n_2 - 2} = 72.5
$$
$$
t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{(\mathrm{Var}X_{12}) \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}}
= \frac{91.2 - 74.4}{\sqrt{72.5 \left( \dfrac{1}{5} + \dfrac{1}{5} \right)}} = 3.120
$$
Degrees of freedom: $df = n_1 + n_2 - 2 = 5 + 5 - 2 = 8$
p = 0.007 (calculated using Excel's TDIST function, one-tail)
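The Compare calculation can be reproduced with the Python standard library alone. This is a minimal sketch, not the method used in the slides: the function names are hypothetical, and because the standard library has no equivalent of Excel's TDIST, the one-tail p-value is approximated here by numerically integrating the t density with Simpson's rule.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def upper_tail_p(t, df, steps=20000):
    """One-tail p-value: 0.5 minus the Simpson's-rule area over [0, |t|]."""
    t = abs(t)
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def compare(x1, x2):
    """Return (t, Welch df, pooled df) for two groups of measurements."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = variance(x1), variance(x2)  # sample variances (n - 1 divisor)
    se1, se2 = v1 / n1, v2 / n2
    t_welch = (mean(x1) - mean(x2)) / math.sqrt(se1 + se2)
    df_welch = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t_pooled = (mean(x1) - mean(x2)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    # With equal group sizes the two t values coincide, as in the slides.
    assert math.isclose(t_welch, t_pooled)
    return t_welch, df_welch, n1 + n2 - 2


disease1 = [100, 90, 105, 83, 78]
disease2 = [70, 72, 81, 74, 75]
t, df_welch, df_pooled = compare(disease1, disease2)
print(f"t = {t:.3f}, Welch df = {df_welch:.2f}, pooled df = {df_pooled}")

# The slides round the Welch df of 5.06 down to 5 before the table lookup.
p_unequal = upper_tail_p(t, round(df_welch))
p_equal = upper_tail_p(t, df_pooled)
print(f"one-tail p: {p_unequal:.3f} (unequal var), {p_equal:.3f} (equal var)")
```

Both p-values fall below α = 0.05, which is consistent with the answer given in the example.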
New OLAP Operation: ANOVA
• Analysis of variance (ANOVA) tests if there are
differences between any pair of variables.
• Example:
Is there a significant difference between the expression of gene X in the various disease conditions?

Different measurements of gene X
          Disease 1   Disease 2   Disease 3
          100         70          95
          90          72          93
          105         81          79
          83          74          85
          78          75          90
mean      91.2        74.4        88.4
st dev    11.3        4.2         6.5
ANOVA
• ANalysis Of VAriance (ANOVA) is used to find
significant genes in more than two conditions:
        Disease A         Disease B         Disease C
Gene    A1    A2    A3    B1    B2    B3    C1    C2    C3
g1      0.9   1.1   1.4   1.9   2.1   2.5   3.1   2.9   2.6
g2      4.2   3.9   3.5   5.1   4.6   4.3   1.8   2.4   1.5
g3      0.7   1.2   0.9   1.1   0.9   0.6   1.2   0.8   1.4
g4      2.0   1.2   1.7   4.0   3.2   2.8   6.3   5.7   5.1
∙∙∙     ∙∙∙   ∙∙∙   ∙∙∙   ∙∙∙   ∙∙∙   ∙∙∙   ∙∙∙   ∙∙∙   ∙∙∙
• For each gene, compute the F statistic.
• Calculate the p value for the F statistic.
One-way Analysis of Variance (ANOVA)
• Decide whether there are any differences
between the values from k conditions (groups).
– H0: µ1 = µ2 = …. = µk
– Hα: There is at least one pair of means that are
different from each other.
• Assumptions:
– All k populations have the same variance
– All k populations are normal.
• ANOVA can be applied to any number of
samples. If there are only two groups, the
ANOVA will provide the same results as a t-test.
• Problem with multiple t-tests: accumulated
error may be large.
Idea of ANOVA
• The measurements of each group vary
around their mean – within group variance.
• The means of each condition will vary
around an overall mean – inter-group
variability.
• ANOVA studies the relationship between
the inter-group and the within-group
variance.
Number of groups: k; number of measurements for group i: $n_i$
Total number of measurements: $N = \sum_{i=1}^{k} n_i$
The overall mean: $\bar{X} = \dfrac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} X_{ij}}{N}$
The mean for group i: $\bar{X}_i = \dfrac{\sum_{j=1}^{n_i} X_{ij}}{n_i}$
Sum of squares between groups: $SS_{between\,group}\ (SS_{Cond}) = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (\bar{X}_i - \bar{X})^2$
Sum of squares within groups: $SS_{within\,group}\ (SS_{Error}) = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2$
Degrees of freedom for the conditions: k − 1
Condition mean squares: $MS_{Cond} = \dfrac{SS_{Cond}}{k - 1}$
Degrees of freedom for the error: N − k
Error mean squares: $MS_{Error} = \dfrac{SS_{Error}}{N - k}$
F statistic: $F = \dfrac{MS_{Cond}}{MS_{Error}}$
Calculate p (F distribution with $v_1 = k - 1$, $v_2 = N - k$)
Number of diseases: k = 3
Number of measurements for disease i: $n_i = 5$
Total number of measurements: $N = \sum_{i=1}^{k} n_i = 15$
The overall mean: $\bar{X} = \dfrac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} X_{ij}}{N} = 84.67$
The mean for disease i: $\bar{X}_i = \dfrac{\sum_{j=1}^{n_i} X_{ij}}{n_i}$
Disease sum of squares: $SS_{Disease} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (\bar{X}_i - \bar{X})^2 = 810.13$
Error sum of squares: $SS_{Error} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2 = 747.20$
Degrees of freedom for the diseases: k − 1 = 3 − 1 = 2
Disease mean squares: $MS_{Disease} = \dfrac{SS_{Disease}}{k - 1} = \dfrac{810.13}{2} = 405.06$
Degrees of freedom for the error: N − k = 15 − 3 = 12
Error mean squares: $MS_{Error} = \dfrac{SS_{Error}}{N - k} = \dfrac{747.20}{12} = 62.27$
F statistic: $F = \dfrac{MS_{Disease}}{MS_{Error}} = \dfrac{405.06}{62.27} = 6.50$
p = 0.012 (from Excel's single-factor ANOVA output)
At the 5% significance level, gene X is expressed differently
between some of the disease conditions (p = 0.012).
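The one-way ANOVA for gene X can be sketched in Python as follows. The function name `one_way_anova` is hypothetical. For the p-value, the sketch relies on a special case rather than a general F-distribution routine: when the numerator degrees of freedom equal 2 (three groups), the upper-tail probability has the closed form $(1 + 2F/v_2)^{-v_2/2}$. For any other number of groups a statistics library would be needed.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (SS_between, SS_within, F, df_between, df_within)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Each measurement contributes (group mean - overall mean)^2.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ss_between, ss_within, ms_between / ms_within, k - 1, n_total - k


groups = [
    [100, 90, 105, 83, 78],  # Disease 1
    [70, 72, 81, 74, 75],    # Disease 2
    [95, 93, 79, 85, 90],    # Disease 3
]
ss_b, ss_w, f_stat, df1, df2 = one_way_anova(groups)

# Closed form valid only because df1 == 2 (k = 3 groups).
assert df1 == 2
p_value = (1 + 2 * f_stat / df2) ** (-df2 / 2)
print(f"SS_between = {ss_b:.2f}, SS_within = {ss_w:.2f}")
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```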
New OLAP Operation: Correlate
• Computing the Pearson correlation coefficient
between two variables (e.g., between a clinical
variable and a gene expression variable).
• Example:
Is the gene expression correlated with the drug use?
$$
\rho_{xy} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{(\mathrm{Var}\,X)(\mathrm{Var}\,Y)}}
$$

Expression of gene X   Dosage of Drug Y
50                     15
205                    50
45                     0
83                     20
155                    40
78                     20
The Covariance
• The covariance measures the strength of the linear
relationship between two numerical variables (X & Y)
• The sample covariance:
$$
\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
$$
• Only concerned with the strength of the relationship
• No causal effect is implied
Coefficient of Correlation
• Measures the relative strength of the
linear relationship between two numerical
variables
• Sample coefficient of correlation:
$$
r = \frac{\mathrm{cov}(X, Y)}{S_X S_Y}
$$
where
$$
\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}, \quad
S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}, \quad
S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}
$$
Pearson's Correlation Coefficient
• Given two groups of samples X = {x1, …, xn} and Y = {y1, …, yn}.
• Pearson's correlation coefficient r is given by
$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$
Features of the
Coefficient of Correlation
• The population coefficient of correlation is referred to as ρ.
• The sample coefficient of correlation is referred to as r.
• Both ρ and r have the following features:
– Unit free
– Ranges between –1 and 1
– The closer to –1, the stronger the negative linear relationship
– The closer to 1, the stronger the positive linear relationship
– The closer to 0, the weaker the linear relationship
Scatter Plots of Sample Data with
Various Coefficients of Correlation
[Figure: five scatter plots of Y against X, with r = −1, r = −.6, r = 0, r = +.3, and r = +1.]
Calculation of the Correlation Coefficient
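$$
\bar{X} = 102.67, \quad \bar{Y} = 24.17, \quad n = 6
$$
$$
\mathrm{Var}X = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}
= \frac{(50 - 102.67)^2 + \cdots + (78 - 102.67)^2}{6 - 1} = 4061.07
$$
$$
\mathrm{Var}Y = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}
= \frac{(15 - 24.17)^2 + \cdots + (20 - 24.17)^2}{6 - 1} = 324.17
$$
$$
\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
= \frac{(50 - 102.67)(15 - 24.17) + \cdots}{6 - 1} = 1106.67
$$
$$
\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{(\mathrm{Var}X)(\mathrm{Var}Y)}}
= \frac{1106.67}{\sqrt{4061.07 \times 324.17}} = \frac{1106.67}{1147.38} = 0.96
$$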
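The correlation for the gene X / drug Y data can be recomputed with a short Python sketch; the function name `pearson_r` is hypothetical. It applies the n − 1 divisor to the covariance and to both variances, as in the sample formulas above, which gives a covariance of about 1106.67 and r of about 0.96. Dividing the same cross-product sum by n = 6 would give 922.22, and pairing that with the n − 1 variances would understate r as about 0.80.

```python
import math
from statistics import mean


def pearson_r(xs, ys):
    """Return (cov, var_x, var_y, r) using the n - 1 divisor throughout."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    return cov, var_x, var_y, cov / math.sqrt(var_x * var_y)


expression = [50, 205, 45, 83, 155, 78]  # expression of gene X
dosage = [15, 50, 0, 20, 40, 20]         # dosage of drug Y
cov, var_x, var_y, r = pearson_r(expression, dosage)
print(f"Var X = {var_x:.2f}, Var Y = {var_y:.2f}, Cov = {cov:.2f}, r = {r:.2f}")
```

Because the divisor cancels between numerator and denominator, r is the same whether n or n − 1 is used, provided the choice is consistent.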
New OLAP Operation: Select
• Given a threshold, select the entries that meet
the minimum requirement.
• Example:
For a threshold of p < 0.05, gene 2 and gene 6 are selected.

Gene   p value
1      0.561
2      0.004
3      0.160
4      0.335
5      0.083
6      0.025
7      0.532
8      0.476
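The Select operation on this table amounts to a filter. A minimal Python sketch, with the hypothetical function name `select_below`:

```python
# p values from the example table, keyed by gene number.
p_values = {1: 0.561, 2: 0.004, 3: 0.160, 4: 0.335,
            5: 0.083, 6: 0.025, 7: 0.532, 8: 0.476}


def select_below(values, threshold):
    """Return the keys whose value is strictly below the threshold."""
    return [gene for gene, p in values.items() if p < threshold]


selected = select_below(p_values, 0.05)
print(selected)  # [2, 6]
```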
Discovery of Differentially Expressed Genes (1)
Roll-up the microarray data over the Measurement Unit
dimension using the user-defined aggregate function.
[Figure: two Gene × Sample (patient) cubes for genes D13626, D13627, D13628, J04605, L37042, S78653, X60003, and Z11518 over samples 1 to 7. Before the roll-up, each cell holds a PA call and a Val; after the roll-up, each cell holds a single summarized expression value.]
Discovery of Differentially Expressed Genes (2)
Roll-up the data over the Clinical Sample dimension
from the patient level to disease level (or normal tissue
level). After the roll-up, each cell contains mean,
variance and the number of values aggregated.
[Figure: the Gene × Sample (patient) cube, with samples 1 to 7, is rolled up to disease level, giving a Gene × Sample (disease) cube with disease groups a, b, c, and d for the same genes (D13626, D13627, D13628, J04605, L37042, S78653, X60003, Z11518).]
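A roll-up from patient level to disease level that keeps the mean, the variance, and the count in each cell might look like the sketch below. The patient IDs, the `gene_x` key, and the function name are hypothetical; the values are the gene X measurements from the Compare example.

```python
from collections import defaultdict
from statistics import mean, variance

# Patient-level cells: (gene, patient) -> summarized expression value.
patient_values = {
    ("gene_x", "p1"): 100, ("gene_x", "p2"): 90, ("gene_x", "p3"): 105,
    ("gene_x", "p4"): 83, ("gene_x", "p5"): 78, ("gene_x", "p6"): 70,
    ("gene_x", "p7"): 72, ("gene_x", "p8"): 81, ("gene_x", "p9"): 74,
    ("gene_x", "p10"): 75,
}
# Hypothetical classification of each patient into a disease group.
patient_disease = {f"p{i}": "Disease 1" for i in range(1, 6)}
patient_disease.update({f"p{i}": "Disease 2" for i in range(6, 11)})


def roll_up_to_disease(cells, disease_of):
    """Group patient-level values by (gene, disease) and summarize each cell."""
    grouped = defaultdict(list)
    for (gene, patient), value in cells.items():
        grouped[(gene, disease_of[patient])].append(value)
    # variance() needs at least two values per cell.
    return {
        key: {"mean": mean(vals), "variance": variance(vals), "n": len(vals)}
        for key, vals in grouped.items()
    }


disease_cube = roll_up_to_disease(patient_values, patient_disease)
print(disease_cube[("gene_x", "Disease 1")])
```

Storing the three summary values per cell is what allows a later t-test to be computed from the rolled-up cube without returning to the patient-level data.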
Discovery of Differentially Expressed Genes (3)
Compare a particular disease type with its corresponding
normal tissue type. Compute the t statistic and p value
for each gene. Select the genes that have a p value less
than a given threshold (e.g., p < 0.05).
[Figure: in the Gene × Sample (disease) cube, disease group a is compared with group c for each gene (D13626, D13627, D13628, J04605, L37042, S78653, X60003, Z11518), producing one p value per gene. The p values shown are 0.561, 0.004, 0.160, 0.335, 0.083, 0.025, 0.532, and 0.476.]
Discovery of Informative Genes
Roll-up the microarray data over the Measurement Unit dimension
Roll-up the data over the Clinical Sample dimension from the
patient level to disease type or normal tissue level
Slice the data for a particular disease type and its
corresponding normal tissue type
t-test on each pair of the selected cells for each gene
(p-values are computed and adjusted)
p-select the genes that have p-values less than a given threshold