4_Diff_Analysis_and_Samp_Features_Mar2011

Differential Analysis
Marker selection
Given phenotypically distinct classes, find “markers”
that distinguish these classes from one another
[Figure: samples labeled Tumor and Normal]
Gene Marker Selection
Hierarchy of difficulty (degree of difficulty increases from I to IV)

Problem                                                  Gene Markers   Error     Example
I. Tissue or Cell Type (Normal vs. Abnormal)             ~1000-2000     ~0%       Normal vs. Renal carcinoma
II. Morphological Type                                   ~200-500       ~0-5%     Leukemia ALL vs. AML
III. Morphological Subtype (Multiclass Classification)   ~50-100        ~0-15%    ALL B- vs. T-Cell
IV. Treatment Outcome (Drug Sensitivity)                 ~1-20          ~5-50%    AML Treatment Outcome

adapted from P. Tamayo
Gene Marker Selection
Compute score for each gene
Dataset + Phenotype/class labels → Compute score (t-test, SNR, etc.) → Ranked gene list (by score)
Scores:
• T-test
• Signal-to-Noise Ratio (SNR)
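The score formulas themselves did not survive in this transcript. The sketch below assumes the commonly used definitions: a two-sample t-score, (mean_A - mean_B) / sqrt(var_A/n_A + var_B/n_B), and a signal-to-noise ratio, (mean_A - mean_B) / (sd_A + sd_B). The toy matrix, the 0/1 label coding, and the function names are illustrative assumptions only.

```python
import numpy as np

def t_score(x, labels):
    """Two-sample t-score per gene (rows = genes, columns = samples)."""
    a, b = x[:, labels == 0], x[:, labels == 1]
    var_a = a.var(axis=1, ddof=1) / a.shape[1]
    var_b = b.var(axis=1, ddof=1) / b.shape[1]
    return (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(var_a + var_b)

def snr_score(x, labels):
    """Signal-to-noise ratio per gene: difference of means over sum of SDs."""
    a, b = x[:, labels == 0], x[:, labels == 1]
    return (a.mean(axis=1) - b.mean(axis=1)) / (
        a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1)
    )

# Toy data (made up): 4 genes x 6 samples, 3 samples per class.
x = np.array([
    [9.0, 8.5, 9.5, 2.0, 2.5, 1.5],   # high in class 0
    [5.0, 5.5, 4.5, 5.2, 4.8, 5.0],   # no difference
    [1.0, 1.5, 0.5, 7.0, 7.5, 6.5],   # high in class 1
    [3.0, 6.0, 4.5, 5.0, 3.5, 4.5],   # noisy, small difference
])
labels = np.array([0, 0, 0, 1, 1, 1])

scores = snr_score(x, labels)
ranked = np.argsort(-scores)  # gene indices, strongest class-0 marker first
print(ranked, scores[ranked])
```

Genes at the top of the ranked list are candidate markers for the first class, and genes at the bottom are candidate markers for the second.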
Gene Marker Selection
Challenges
• Small sample size.
• Each gene tested is a separate hypothesis → increased
likelihood of false positives.
• Gene interaction not taken into account.
Gene Marker Selection
Small Sample Size
• Generate a 10,000x100 matrix from a Gaussian (mean=0, SD=0.5)
• Pick n columns (6, 14, 30, 100)
• Assign sample labels yellow and green
• Select top 25 markers for yellow, top 25 markers for green
[Figure: the selected yellow and green markers for 6, 14, 30, and 100 samples]
With a small sample size, it is easy to find genes that appear correlated with phenotype, even in random data
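The experiment above can be reproduced roughly as follows. This is a sketch, not the original simulation: it assumes the labels are split evenly between the two classes, uses the signal-to-noise ratio as the score, and fixes an arbitrary random seed.

```python
import numpy as np

rng = np.random.default_rng(0)
# Pure noise: no gene is truly related to any label.
data = rng.normal(loc=0.0, scale=0.5, size=(10_000, 100))

def snr(x, labels):
    a, b = x[:, labels == 0], x[:, labels == 1]
    return (a.mean(axis=1) - b.mean(axis=1)) / (
        a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1)
    )

top_mean = {}
for n in (6, 14, 30, 100):
    cols = rng.choice(100, size=n, replace=False)   # pick n samples
    labels = np.repeat([0, 1], n // 2)              # half "yellow", half "green" (assumed split)
    s = snr(data[:, cols], labels)
    order = np.argsort(s)
    top = np.concatenate([order[:25], order[-25:]])  # 25 markers per class
    top_mean[n] = np.abs(s[top]).mean()
    print(f"{n:>3} samples: mean |SNR| of the 50 selected markers = {top_mean[n]:.2f}")
```

The selected "markers" score far higher with few samples than with many, although the data contain no real signal.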
P-value calculation
• If a gene's expression values are normally distributed, the t-score follows the t-distribution
– What if they aren’t normally distributed?
• Permutation Test:
– shuffle labels (class membership)
– compute score for each gene (t-score, SNR, ...)
– repeat many times
→ Empirical null distribution of scores for each gene
• Compare observed score to empirical distribution.
[Figure: distribution of permuted scores for a given gene, with the gene's observed score marked on the score axis]
No distributional assumptions are made; gene-specific p-values are computed
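A minimal sketch of the permutation test follows. It assumes a two-sided comparison on the absolute signal-to-noise ratio and adds one to the numerator and denominator so that no p-value is exactly zero; both are illustrative choices, and the toy data are simulated.

```python
import numpy as np

def snr(x, labels):
    a, b = x[:, labels == 0], x[:, labels == 1]
    return (a.mean(axis=1) - b.mean(axis=1)) / (
        a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1)
    )

def permutation_pvalues(x, labels, n_perm=500, seed=0):
    """Gene-specific empirical p-values from label shuffling."""
    rng = np.random.default_rng(seed)
    observed = np.abs(snr(x, labels))
    exceed = np.zeros(x.shape[0])
    for _ in range(n_perm):
        permuted = rng.permutation(labels)               # shuffle class membership
        exceed += np.abs(snr(x, permuted)) >= observed   # permuted score at least as extreme?
    return (exceed + 1) / (n_perm + 1)

# Simulated data: 5 genes x 20 samples; only gene 0 differs between classes.
rng = np.random.default_rng(1)
x = rng.normal(size=(5, 20))
labels = np.repeat([0, 1], 10)
x[0, labels == 1] += 3.0

p = permutation_pvalues(x, labels)
print(np.round(p, 3))
```

Each gene is compared only with its own permuted scores, so no assumption about the shape of its distribution is needed.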
Permutation test and P-value
To determine how significant a gene’s statistical score is
[Figure: a grid of example values for the known class A and known class B samples. The first row shows the “true” classes; the following rows (Permutation 1, Permutation 2, ..., Permutation n) show the same samples with shuffled “called” class A and “called” class B labels, each with its score.]
Generates a “null distribution” of values for this gene
Compare with “real” score for this gene
Marker Selection Process
Dataset + Phenotype/class labels
→ Compute score: t-test, SNR, etc. → Ranked gene list (by score)
→ Measure significance: permutation test
→ Correct for multiple hypotheses: FDR, FWER, etc.
→ Markers
Multiple Hypotheses
What to control
• Bonferroni Correction:
– Most conservative metric
– Divides the significance threshold by the number of hypotheses
(equivalently, multiplies each p-value by the number of hypotheses)
• FWER (Family-Wise Error Rate): probability of calling one or more
hypotheses significant given that they are all null
• FDR (False Discovery Rate): expected proportion of true null
hypotheses among the results called significant
• Try to reduce the number of hypotheses tested in the first place
(i.e. filtering)
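To make the difference between the corrections concrete, here is a small sketch of Bonferroni-adjusted p-values next to FDR values from the Benjamini-Hochberg procedure. Benjamini-Hochberg is one common way to control the FDR; the slides do not say which procedure is used, so treat it as an assumption. The p-values are made up.

```python
import numpy as np

def bonferroni(p):
    """Multiply each p-value by the number of hypotheses, capped at 1."""
    p = np.asarray(p, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(scaled, 1.0)
    return q

pvals = [0.001, 0.008, 0.02, 0.035, 0.60]
bonf = bonferroni(pvals)
fdr = benjamini_hochberg(pvals)
print("Bonferroni:", bonf, "significant:", int((bonf <= 0.05).sum()))
print("BH FDR:    ", fdr, "significant:", int((fdr <= 0.05).sum()))
```

At a 0.05 threshold, Bonferroni keeps two of these five hypotheses while the FDR correction keeps four, which illustrates why Bonferroni is described as the most conservative.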
Exercise
ComparativeMarkerSelection Module
1. Choose module:
• Gene List Selection → ComparativeMarkerSelection
2. Choose input file:
Next to “input file”, choose “Specify URL”
View datasets window in Web browser
Click and drag all_aml_train.preprocessed.gct
3. Choose class file:
Next to “cls file”, choose “Specify URL”
View datasets window in Web browser
Click and drag all_aml_train.cls
4. Click Run
Viewing Analysis Results
Differential Analysis Cookbook
• Reduce number of hypotheses/genes by variation filtering (attempt
at reducing false negatives)
• Choose test statistic (e.g., SNR, t-score, ...)
• If enough samples, compute p-values by permutation test
(otherwise, compute asymptotic test using the standard t-distribution).
• Control for Multiple Hypothesis Testing by using the FDR correction
– Remember: if you choose FDR ≤ 0.05, you’re willing to accept that about 5% of
the genes called significant are false positives.
– If number of significant hypotheses/genes “too large” even for very small
threshold values, either:
• use the maxT correction (possible w/ empirical p-values only).
• use additional criteria (e.g., min fold-change, min expression value, etc.)
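The filtering step at the top of the cookbook can be sketched as below. The fold-change and difference thresholds are hypothetical values chosen for illustration, not recommended settings or the defaults of any module, and the sketch assumes strictly positive expression values.

```python
import numpy as np

def variation_filter(x, min_fold=3.0, min_delta=100.0):
    """Boolean mask of genes whose max/min ratio and max-min difference
    across samples both reach the (hypothetical) thresholds."""
    hi, lo = x.max(axis=1), x.min(axis=1)
    return (hi / lo >= min_fold) & (hi - lo >= min_delta)

# Toy expression matrix (made up): 3 genes x 4 samples.
x = np.array([
    [100.0, 120.0, 110.0, 105.0],  # nearly flat: fails both criteria
    [50.0, 400.0, 60.0, 500.0],    # varies strongly: kept
    [20.0, 25.0, 90.0, 30.0],      # fold change is fine, absolute change too small
])
keep = variation_filter(x)
filtered = x[keep]
print(keep, filtered.shape)
```

Only the genes that pass the filter go on to be scored, so fewer hypotheses need to be corrected for.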
Differential Analysis
GenePattern modules
• Create expression data set – ExpressionFileCreator
• Reduce number of hypotheses/genes by variation filtering –
PreprocessDataset
• Make class file
• Run Differential Analysis – ComparativeMarkerSelection
– Choose test statistic (say, t-score)
• View results with ComparativeMarkerSelectionViewer
– If enough samples, compute p-values by permutation test (otherwise, use
asymptotic test).
– Control for multiple hypothesis testing (MHT) by using the FDR correction
– Use HeatMapViewer to view results for top genes
• Use GSEA to find gene sets (or pathways) that are enriched in your
dataset.
Working with Samples and Features
Overview
• Extracting a set of samples
• Computing co-expressed genes
• Converting probe set ids to gene names
• Computing overlap between gene sets
Working with Samples and Features
1. From a combined dataset of cancer and normal samples, select the normal samples.
2. Within the normal samples, find the genes coexpressed with LRPPRC (Affymetrix probe M92439_at), a gene with mitochondrial function.
3. Compare these genes and those coexpressed with LRPPRC in another expression dataset to determine the coexpressed genes common to both datasets.
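Steps 2 and 3 can be illustrated outside GenePattern with a small sketch. It assumes that "coexpressed" means highest Pearson correlation with the target gene's profile, which is one common choice and not necessarily the measure used in the exercise. Apart from LRPPRC, the gene names (G1 to G3) and all expression values are hypothetical.

```python
import numpy as np

def neighbors(x, gene_names, target, k=2):
    """Return the k genes whose profiles correlate best with `target`."""
    idx = gene_names.index(target)
    corr = np.corrcoef(x)[idx]  # Pearson correlation with every gene
    order = [i for i in np.argsort(-corr) if i != idx][:k]
    return [gene_names[i] for i in order]

genes = ["LRPPRC", "G1", "G2", "G3"]  # G1-G3 are placeholder names

dataset_a = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 11.0],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 5.0, 4.0],
])
dataset_b = np.array([
    [2.0, 1.0, 4.0, 3.0, 6.0],
    [1.0, 1.0, 3.0, 3.0, 5.0],
    [2.0, 1.0, 4.0, 3.0, 7.0],
    [5.0, 5.0, 1.0, 2.0, 1.0],
])

near_a = neighbors(dataset_a, genes, "LRPPRC")
near_b = neighbors(dataset_b, genes, "LRPPRC")
common = set(near_a) & set(near_b)  # overlap between the two gene sets
print(near_a, near_b, common)
```

The set intersection plays the role of the overlap region in a Venn diagram of the two neighbor lists.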
GCM_Total.res
SelectFeaturesColumns
GCM_Normals.res
GeneNeighbors
GCM_Normals.markerdata.gct
GCM_Normals.markerlist.odf
GeneListSignificanceViewer
CollapseDataset
GCM_Total_Normals.markerdata.collapsed.gct
ExtractRowNames
GCM_Total_Normals.markerdata.collapsed.row.names.txt
VennDiagram
Exercise