Download Statistical Analysis of Gene Expression Data (A Large Number of

Getting the story – biological model based on
microarray data
• Once the differentially expressed genes are identified
(sometimes hundreds of them), we need to figure out what it
all means
• This is not easy, since we don't know much about the function
of most genes
• It is complicated further by the fact that gene function is
context-specific: it depends on the tissue, the developmental stage
of the organism and multiple other factors
• "Functional clustering": grouping genes with respect to their
known function (ontology)
• Establishing the statistical significance of the association between
groups of genes identified in the analysis and "Functional clusters"
Analyzing Microarray Data
• Data Normalization – reducing technical variability
• Experimental Design
• Statistical Analysis (ANOVA):
  •Identifying differentially expressed genes
  •Factoring out variability sources
• Data Mining
• Data Integration and Interpretation
[Figures: normalization plot (C4-3 vs AB4); experimental design diagram with Treated / Not Treated samples C1 to C4 and a Universal Control; cluster plots of Expression Level vs. Time across cell-cycle phases G1, S, G2, M]
Gene Ontology (GO)
http://www.geneontology.org/
The Gene Ontology (GO) project is a collaborative effort to address the need
for consistent descriptions of gene products in different databases. The GO
collaborators are developing three structured, controlled vocabularies
(ontologies) that describe gene products in terms of their associated
biological processes, cellular components and molecular functions in a
species-independent manner.
Molecular Function
Biochemical activity or action of the gene product.
MF describes a capability that the gene product has and there is no
reference to where or when this activity or usage actually occurs.
Examples: enzyme, transporter, ligand
cytochrome c: electron transporter activity
Biological process
A biological objective to which the gene product
contributes.
A biological process is accomplished via one or more
ordered assemblies of molecular functions.
There is generally some temporal aspect to the process
and it will often involve the transformation of some
physical thing.
Examples: cell growth and maintenance
cytochrome c: oxidative phosphorylation, induction of cell
death
Cellular Component
A component of a cell that is part of some larger object
or structure.
Examples: chromosome, nucleus, ribosome
cytochrome c: mitochondrial matrix, mitochondrial
inner membrane
First step of making a story: Statistical significance
of a particular "Functional cluster"
•Suppose we have analyzed a total of N genes, n of which turned out to be
differentially expressed/co-expressed (experimentally identified; call them
significant)
•Suppose that x out of the n significant genes and y out of the N total genes were
classified into a specific "Functional group"
•Q1: Is this "Functional group" significantly correlated with our group of
significant genes?
•Q2: Are significant genes overrepresented in this functional group when
compared to their overall frequency among all analyzed genes?
•Q3: What is the chance of getting x or more significant genes if we
randomly draw y out of N genes "out of a hat", assuming that each
gene remaining in the hat has an equal chance of being drawn?
•H0: p(significant gene belonging to this category) = y/N
•Q3A: What is the p-value for rejecting this null hypothesis?
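The four counts N, n, y and x fully determine a two-by-two table of gene memberships. The sketch below is only an illustration: the function names are hypothetical, and the row/column layout (rows = significant / not significant, columns = in the group / not in the group) is an assumption. The example counts are read off the first two-by-two matrix reported later in this transcript under that assumption.

```python
def contingency_table(N, n, y, x):
    """Build the 2x2 table of gene memberships.

    Assumed layout: rows = significant / not significant,
    columns = in the functional group / not in the group.
    """
    if not (0 <= x <= min(n, y) and max(n, y) <= N and N - n - y + x >= 0):
        raise ValueError("inconsistent counts")
    return [[x, n - x],
            [y - x, N - n - y + x]]


def enrichment(N, n, y, x):
    """Return (observed, expected) fractions.

    observed = x/n, the share of significant genes that fall in the group;
    expected = y/N, the share predicted under H0.
    """
    return x / n, y / N


# Counts taken from the "muscle contraction" matrix, read with the assumed layout.
table = contingency_table(N=9316, n=15, y=36, x=3)
print(table)  # [[3, 12], [33, 9268]]

observed, expected = enrichment(N=9316, n=15, y=36, x=3)
print(observed > expected)  # True: the group looks over-represented (Q2)
```

Whether that over-representation is larger than chance alone would produce is the question Q3 and Q3A address.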
Fisher's tests
(http://eh3.uc.edu/teaching/cfg/2006/R/NickelFunctionalClusteringClean.R)
Strategy for finding "Statistically Significant" GO categories:
•Identify all categories that contain at least 5 genes from the microarray
(about 1800 in our case)
•Perform a Fisher's exact test for each category to test for statistically
significant over-representation of differentially expressed genes
•Adjust individual Fisher's p-values for the fact that we are testing 1800
hypotheses by calculating FDRs
•Repeat this for different levels of the statistical significance used to select
differentially expressed genes (FDR<0.01, 0.05, 0.1, 0.2) and observe the
statistical significance of the two most significant GO categories
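The strategy above can be sketched in a few lines. This is not the course's R script. It is a minimal Python illustration that assumes a one-sided (over-representation) Fisher test and assumes the FDRs are computed with the Benjamini-Hochberg adjustment, which the slides do not specify. The category names and counts are made up.

```python
from math import comb


def fisher_upper_tail(N, n, y, x):
    """One-sided Fisher (hypergeometric) p-value:
    probability of x or more significant genes among y genes drawn from N,
    when n of the N genes are significant."""
    total = comb(N, y)
    tail = sum(comb(n, r) * comb(N - n, y - r) for r in range(x, min(y, n) + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR method)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_categories(categories, N, n, min_genes=5):
    """categories maps a name to (y, x): genes on the array in the category,
    and how many of those are significant. Returns rows sorted by FDR."""
    kept = {name: yx for name, yx in categories.items() if yx[0] >= min_genes}
    names = list(kept)
    pvals = [fisher_upper_tail(N, n, *kept[name]) for name in names]
    fdrs = benjamini_hochberg(pvals)
    return sorted(zip(names, pvals, fdrs), key=lambda row: row[2])


# Hypothetical toy data: 100 genes on the array, 10 of them significant.
toy = {"cat_A": (20, 6), "cat_B": (8, 1), "cat_C": (3, 3)}
for name, p, fdr in significant_categories(toy, N=100, n=10):
    print(name, p, fdr)  # cat_C is skipped: fewer than 5 genes on the array
```

Repeating the call with the significant-gene lists obtained at each FDR cutoff (0.01, 0.05, 0.1, 0.2) would reproduce the last step of the strategy.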
Statistically Significant GO Categories
Top 2 GO Categories for genes with FDR< 0.01
GO Term 1
FDR for the category= 0.0442416
GOID = GO:0006936
Term = muscle contraction
Definition = A process leading to shortening and/or development of
tension in muscle tissue. Muscle contraction occurs by a sliding
filament mechanism whereby actin filaments slide inward among the
myosin filaments.
Ontology = BP
Two By Two matrix of gene memberships in this category
[,1] [,2]
[1,] 3 12
[2,] 33 9268
GO Term 2
FDR for the category= 0.1769315
GOID = GO:0006937
Term = regulation of muscle contraction
Definition = Any process that modulates the frequency, rate or extent
of muscle contraction.
Ontology = BP
Two By Two matrix of gene memberships in this category
[,1] [,2]
[1,] 2 13
[2,] 11 9290
Statistically Significant GO Categories
Top 2 GO Categories for genes with FDR< 0.05
GO Term 1
FDR for the category= 0.006130206
GOID = GO:0005576
Term = extracellular region
Synonym = extracellular
Definition = The space external to the outermost structure of a cell.
For cells without external protective or external encapsulating
structures this refers to space outside of the plasma membrane.
This term covers the host cell environment outside an
intracellular parasite.
Ontology = CC
Two By Two matrix of gene memberships in this category
[,1] [,2]
[1,] 160 544
[2,] 1381 7231
GO Term 2
FDR for the category= 0.006130206
GOID = GO:0005615
Term = extracellular space
Synonym = intercellular space
Definition = That part of a multicellular organism outside the cells
proper, usually taken to be outside the plasma membranes, and
occupied by fluid.
Ontology = CC
Two By Two matrix of gene memberships in this category
[,1] [,2]
[1,] 149 555
[2,] 1266 7346
Statistically Significant GO Categories
Top 2 GO Categories for genes with FDR< 0.1
GO Term 1
FDR for the category= 0.1196382
GOID = GO:0001568
Term = blood vessel development
Definition = Processes aimed at the progression of the blood vessel
over time, from its formation to the mature structure. The blood
vessel is the vasculature carrying blood.
Ontology = BP
Two By Two matrix of gene memberships in this category
[,1] [,2]
[1,] 25 1731
[2,] 40 7520
GO Term 2
FDR for the category= 0.1196382
GOID = GO:0048514
Term = blood vessel morphogenesis
Definition = Processes by which the anatomical structures of blood
vessels are generated and organized. Morphogenesis pertains to
the creation of form. The blood vessel is the vasculature
carrying blood.
Ontology = BP
Two By Two matrix of gene memberships in this category
[,1] [,2]
[1,] 23 1733
[2,] 34 7526
Statistically Significant GO Categories
Top 2 GO Categories for genes with FDR< 0.2
GO Term 1
FDR for the category= 0.1717101
GOID = GO:0001568
Term = blood vessel development
Definition = Processes aimed at the progression of the blood vessel
over time, from its formation to the mature structure. The blood
vessel is the vasculature carrying blood.
Ontology = BP
Two By Two matrix of gene memberships in this category
[,1] [,2]
[1,] 37 3193
[2,] 28 6058
GO Term 2
FDR for the category= 0.1717101
GOID = GO:0048514
Term = blood vessel morphogenesis
Definition = Processes by which the anatomical structures of blood
vessels are generated and organized. Morphogenesis pertains to
the creation of form. The blood vessel is the vasculature
carrying blood.
Ontology = BP
Two By Two matrix of gene memberships in this category
[,1] [,2]
[1,] 33 3197
[2,] 24 6062
Statistical significance of a particular "Functional
cluster" - cont
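[Diagram: the N genes drawn as boxes; red boxes are significant genes and a blue border marks membership in the functional group]
Observed: g1,...,gx (significant, in the group); gx+1,...,gy (in the group, not significant); gy+1,...,gn+y-x (significant, not in the group); gn+y-x+1,...,gN (neither)
Removing Functional Classification: g1,...,gn (significant); gn+1,...,gN (not significant)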
Q: By randomly drawing y boxes to color their border blue, what is the chance of
drawing x or more red ones?
Outcome (o1,...,oT): A set of y genes selected from the list of N genes
Event of interest (E): Set of all outcomes for which the number of red boxes
among the y boxes drawn is equal to x
Since the drawing is random, all outcomes are equally probable
\[ P(o_1) = P(o_2) = \dots = P(o_T), \qquad \sum_{t=1}^{T} P(o_t) = 1 \;\Rightarrow\; P(o_t) = \frac{1}{T}, \text{ for all } t = 1, \dots, T \]
Statistical significance of a particular "Functional
cluster" - cont
Outcome (o1,...,oT): A set of y genes selected from the list of N genes
Event of interest (E): Set of all outcomes for which the number of red boxes
among the y boxes drawn is equal to x
\[ E = \{ o^E_1, \dots, o^E_M \} \]
\[ P(E) = \sum_{m=1}^{M} P(o^E_m) = \sum_{m=1}^{M} \frac{1}{T} = \frac{M}{T} \]
All we have to do is calculate M and T, where:
T = number of different sets of y genes that can be drawn out of the total of N genes
\[ T = \binom{N}{y} = \frac{N (N-1)(N-2) \cdots (N-y+1)}{1 \cdot 2 \cdots y} \]
The division by 1*2*...*y comes from the fact that the order in which
we pick genes does not matter
M = number of different ways to obtain x red boxes (significant genes) when
drawing y boxes (genes) out of the total of N boxes (genes), n of which are red
(significant)
\[ M = \binom{n}{x} \binom{N-n}{y-x} \]
First pick x red boxes. For each such set
of x red boxes, pick a set of y-x non-red
boxes
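The counting argument can be checked by brute force on a small example. The sketch below enumerates every possible draw and compares the counts with the formulas for T and M. The function names and the tiny values of N, n, y and x are chosen only for illustration.

```python
from itertools import combinations
from math import comb


def count_by_enumeration(N, n, y, x):
    """Enumerate every set of y genes out of N.

    Genes 0..n-1 play the role of the red (significant) boxes.
    Returns (T, M): all outcomes, and outcomes with exactly x red boxes.
    """
    T = M = 0
    for drawn in combinations(range(N), y):
        T += 1
        if sum(1 for g in drawn if g < n) == x:
            M += 1
    return T, M


def count_by_formula(N, n, y, x):
    """T = C(N, y) and M = C(n, x) * C(N - n, y - x)."""
    return comb(N, y), comb(n, x) * comb(N - n, y - x)


print(count_by_enumeration(6, 3, 3, 2))  # (20, 9)
print(count_by_formula(6, 3, 3, 2))      # (20, 9)
```

With these counts, P(E) = M/T = 9/20 for drawing exactly 2 red boxes in this toy setting.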
Statistical significance of a particular "Functional
cluster" - p-value
P-value: Probability of observing x or more significant genes under the
null hypothesis
\[ p\text{-value} = p(x) + p(x+1) + \dots + p(\min(y,n)) = \sum_{r=x}^{\min(y,n)} \frac{\binom{n}{r} \binom{N-n}{y-r}}{\binom{N}{y}} \]
Fisher's exact test or the "hypergeometric" test
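The p-value sum can be evaluated exactly with integer binomial coefficients. The following is a minimal sketch with hypothetical function names, using exact fractions so that no rounding enters the result. The input values are a toy example, not data from the lecture.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(N, n, y, r):
    """p(r): probability of exactly r significant genes among y genes drawn at random."""
    return Fraction(comb(n, r) * comb(N - n, y - r), comb(N, y))


def p_value(N, n, y, x):
    """Probability of x or more significant genes: p(x) + ... + p(min(y, n))."""
    terms = (hypergeom_pmf(N, n, y, r) for r in range(x, min(y, n) + 1))
    return sum(terms, Fraction(0))


# Toy example: 10 genes, 5 significant, a group of 5 genes containing 4 significant ones.
p = p_value(N=10, n=5, y=5, x=4)
print(p)         # 13/126
print(float(p))  # roughly 0.103
```

For array-sized inputs the same function still works, since Python integers have arbitrary precision; converting with float(p) at the end gives the usual decimal p-value.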
[Figure: expression of 381 genes across the samples Dex_Day1, Dex_Day2, Dex_Day3, E2_Day4, E2_Day7, E2_Day10, Irr_Day1, Irr_Day2, Irr_Day3; panel label: Up]
381 genes that were differentially expressed after treating a
cell line with three different carcinogens:
Dex, E2 and Irradiation
Finding important functional groups for up-regulated
genes
Using the "Ease" annotation tool (http://david.niaid.nih.gov/david/),
we obtained the following significant gene ontologies:
Up_DexANDNE2ANDirr_381_GO.htm
Homework:
1) Download and install Ease
2) Select the top 20 most significantly up-regulated genes in our W-C dataset and
identify significantly over-represented categories (using the three-way ANOVA
analysis)
3) Repeat the analysis with 30, 40, 50 and 100 up-regulated and down-regulated genes
4) Prepare questions for the next class regarding problems you run into
Regulating Transcription – a transcription factor itself does
not need to be transcriptionally regulated
Modeling Microarray Data
[Figure: panels of cluster plots (Expression Level vs. Time, 0 to 150, across cell-cycle phases G1, S, G2, M), linked to "Mathematical/Statistical Models" and "Computer Algorithms/Software"]