Some statistical musings
Naomi Altman
Penn State
2015 Dagstuhl Workshop
Some topics that might be interesting
• Feature matching across samples and platforms
• Preprocessing
• number of features >> number of samples
• feature screening
• replication and possibly other design issues
• PCA and relatives
• mixture modeling
Feature Matching
• e.g. (simple) should we match RNA-seq with a
gene expression microarray by “gene” or by
“oligo”?
• protein MS with RNA-seq or Ribo-seq
• how should we match features such as
methylation sites, protein binding regions, SNPs,
transcripts and proteins?
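As a rough illustration of the “gene” versus “oligo” choice, the sketch below collapses probe-level microarray intensities to one value per gene and then pairs them with gene-level RNA-seq values. The median summary, the function names, and all identifiers and numbers are assumptions made up for this example; they are not taken from the talk.

```python
from statistics import median

def collapse_probes_to_genes(probe_values, probe_to_gene):
    """Summarise probe (oligo) intensities as one median value per gene."""
    by_gene = {}
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # probes without a gene annotation are dropped
            by_gene.setdefault(gene, []).append(value)
    return {gene: median(values) for gene, values in by_gene.items()}

def match_by_gene(array_by_gene, rnaseq_by_gene):
    """Keep only genes measured on both platforms."""
    shared = sorted(set(array_by_gene) & set(rnaseq_by_gene))
    return {g: (array_by_gene[g], rnaseq_by_gene[g]) for g in shared}

# Hypothetical toy data: three probes map to geneA, one to geneB, one is unannotated.
probe_values = {"p1": 7.0, "p2": 9.0, "p3": 8.0, "p4": 5.0, "p5": 6.0}
probe_to_gene = {"p1": "geneA", "p2": "geneA", "p3": "geneA", "p4": "geneB"}
rnaseq_by_gene = {"geneA": 120, "geneC": 15}

matched = match_by_gene(
    collapse_probes_to_genes(probe_values, probe_to_gene), rnaseq_by_gene
)
print(matched)
```

Matching by oligo instead would keep p1, p2 and p3 as separate features, so the RNA-seq side would need a comparably fine-grained summary to pair them with.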
Preprocessing
These plots show the concordance of 3 normalizations of the same Affymetrix microarray.
Dozens of methods are available for each platform.
Matching features across platforms is going to be very dependent on which set of normalizations is selected.
p>>n
When the number of features > number of samples:
• correlations of magnitude very close to 1 are common
• we can always obtain multiple “perfect” predictors, so selecting “interesting” features is difficult
• “extreme” p-values, Bayes factors, etc. become common
• singular matrices occur in optimization algorithms
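The first point can be checked with a small simulation. The sketch below draws a random response and many random features that are independent of it, then reports the largest absolute Pearson correlation found. The sample sizes, the seed and the function names are arbitrary choices for illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def max_abs_null_correlation(n_samples, n_features, seed=0):
    """Largest |correlation| between a random response and pure-noise features."""
    rng = random.Random(seed)
    response = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, response)))
    return best

# Every feature here is noise, so any large correlation is spurious.
print(max_abs_null_correlation(n_samples=5, n_features=2000))
print(max_abs_null_correlation(n_samples=200, n_features=2000))
```

With only 5 samples the largest spurious correlation is close to 1; raising the number of samples to 200 with the same number of features shrinks it considerably.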
p>>n
New statistical methods for feature selection such as “sparse” and “sure screening” selectors may be useful.
The idea of “sure screening” selectors is that prescreening brings us to p<n-1, while with high probability all the “important” features are still selected (along with others which we will screen out later).
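One simple prescreening rule of this kind ranks features by their absolute marginal correlation with the response and keeps the top few. Whether this is the particular selector meant here is an assumption; the sketch, its function names and the simulated data (where only features 0 and 1 carry signal) are illustrative only.

```python
import math
import random

def abs_correlation(x, y):
    """Absolute Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy)

def screen_features(columns, response, keep):
    """Return indices of the `keep` features most correlated with the response."""
    scores = [abs_correlation(col, response) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:keep]

def simulate(n_samples=50, n_features=500, seed=1):
    """Hypothetical data: the response depends on features 0 and 1 only."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_samples)]
               for _ in range(n_features)]
    response = [3 * columns[0][i] - 3 * columns[1][i] + rng.gauss(0, 1)
                for i in range(n_samples)]
    return columns, response

columns, response = simulate()
selected = screen_features(columns, response, keep=10)  # 10 < n - 1 = 49
print(sorted(selected))
```

The retained set contains the two signal features plus some noise features, which a later, more careful selection step would be expected to remove.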
Experimental Design
• Randomization, replication and matching enhance
our ability to reproduce research
• In particular, replication ensures the results are not sample-specific, while blocking allows variability in
the samples without swamping the effects
• Multi-omics is best done on single samples
measured on multiple platforms
• Technical replication is seldom worth the cost
compared to taking more biological replicates
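Randomization combined with blocking can be sketched as a balanced random assignment of treatments within each block. The block labels (for example a processing batch), the sample names and the function name below are hypothetical.

```python
import random

def randomize_within_blocks(blocks, treatments, seed=0):
    """Randomly assign treatments to units separately within each block.

    `blocks` maps a block label to its list of unit labels. Each block size
    must be a multiple of the number of treatments so the design is balanced.
    """
    rng = random.Random(seed)
    assignment = {}
    for block, units in blocks.items():
        if len(units) % len(treatments) != 0:
            raise ValueError(f"block {block!r} cannot be balanced")
        labels = list(treatments) * (len(units) // len(treatments))
        rng.shuffle(labels)  # random order within this block only
        for unit, label in zip(units, labels):
            assignment[unit] = label
    return assignment

# Hypothetical example: two batches of four biological samples each.
blocks = {
    "batch1": ["s1", "s2", "s3", "s4"],
    "batch2": ["s5", "s6", "s7", "s8"],
}
design = randomize_within_blocks(blocks, ["control", "treated"])
for unit in sorted(design):
    print(unit, design[unit])
```

Because every batch contains each treatment equally often, batch-to-batch variability does not get confused with the treatment effect.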
Dimension Reduction
PCA (or SVD) has many relatives that can be used to reduce the number of features using projections onto
a lower-dimensional space
• The components are often not interpretable.
• Many variations are available from both the machine learning and statistics communities.
• Machine learning stresses fitting the data.
• Statistics stresses fitting the data generating process.
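As a minimal sketch of projecting onto a lower-dimensional space, the code below finds the first principal direction by power iteration and returns the samples' scores along it. Power iteration is used only to keep the example free of external libraries; in practice an SVD routine would normally be used. The toy data and function name are illustrative.

```python
import math

def first_principal_component(rows, iterations=100):
    """Leading principal direction of `rows` (samples x features) and the
    scores of each sample along it, computed by power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Arbitrary start vector; it must not be orthogonal to the leading direction.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        # scores = X v, then w = X^T scores. The p x p covariance matrix is
        # never formed, which helps when p is much larger than n.
        scores = [sum(x * vj for x, vj in zip(row, v)) for row in centered]
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(wj * wj for wj in w))  # zero only for constant data
        v = [wj / norm for wj in w]
    scores = [sum(x * vj for x, vj in zip(row, v)) for row in centered]
    return v, scores

# Toy data: the second feature is twice the first; the third is small noise.
rows = [[t, 2 * t, c] for t, c in zip([-2, -1, 0, 1, 2],
                                      [0.1, -0.1, 0.1, -0.1, 0.1])]
direction, scores = first_principal_component(rows)
print(direction)
print(scores)
```

Here the three features are summarized by a single score per sample, with the direction weighting the first two features in a 1:2 ratio.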
Mixture Modeling
• In many cases we can think of a sample as a
mixture of subpopulations
• We can use the EM algorithm or Bayesian
methods to deconvolve into the components.
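The EM approach can be sketched for the simplest case, a one-dimensional mixture of two Gaussian subpopulations. The initialization, the fixed number of iterations and the simulated data are assumptions for illustration, not a recommended analysis.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, iterations=100):
    """Fit a two-component 1-D Gaussian mixture with the EM algorithm."""
    n = len(data)
    # Crude initialisation (an assumption): means at the data extremes.
    means = [min(data), max(data)]
    centre = sum(data) / n
    spread = sum((x - centre) ** 2 for x in data) / n
    variances = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each point belongs to each component.
        resp = []
        for x in data:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k])
                    for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from those probabilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor guards against collapse
    return weights, means, variances

# Simulated sample: two subpopulations centred at 0 and 5.
rng = random.Random(0)
data = ([rng.gauss(0, 1) for _ in range(200)]
        + [rng.gauss(5, 1) for _ in range(200)])
weights, means, variances = em_two_gaussians(data)
print(weights, means, variances)
```

The estimated means land near the two subpopulation centres and the weights near one half each, which is the deconvolution into components described above.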
Some other statistical topics already mentioned
• missing features (present but not detected) which differ
between samples
• mis-identified features
• do p-values (or FDR estimates) matter?
• multiple times; multiple cells; multiple individuals
• biological variation vs measurement noise & error
propagation
• how can we enhance reproducibility (statistical issues)
• can we fit complex models? should we?
• the data are too big for most statistically trained folks
• how are we going to train the current and next generation?