Some statistical musings
Naomi Altman, Penn State
2015 Dagstuhl Workshop

Some topics that might be interesting
• Feature matching across samples and platforms
• Preprocessing
• number of features >> number of samples
• feature screening
• replication and possibly other design issues
• PCA and relatives
• mixture modeling

Feature Matching
• e.g. (simple) should we match RNA-seq with a gene expression microarray by “gene” or by “oligo”?
• protein MS with RNA-seq or ribo-Seq
• how should we match features such as methylation sites, protein binding regions, SNPs, transcripts and proteins?

Preprocessing
These plots show the concordance of 3 normalizations of the same Affymetrix microarray.
Dozens of methods are available for each platform.
Matching features across platforms is going to be very dependent on which set of normalizations is selected.

p>>n
When the number of features > number of samples:
• correlations of magnitude very close to 1 are common
• we can always obtain multiple “perfect” predictors, so selecting “interesting” features is difficult
• “extreme” p-values, Bayes factors, etc. become common
• singular matrices occur in optimization algorithms

p>>n
New statistical methods for feature selection such as “sparse” and “sure screening” selectors may be useful.
The idea of “sure screening” selectors is that prescreening brings us to p<n-1. But … we still have a high probability that all the “important” features are selected (along with others, which we will screen out later).
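As a rough illustration of the p>>n points above, the following Python sketch simulates a small data set with far more features than samples. It shows that least squares reproduces the response exactly, and then prescreens the features down to fewer than n-1. Everything in it is an assumption made for illustration: the simulated data, the sizes n and p, the helper name marginal_screen, and the use of absolute marginal correlation as the screening score, which is one common choice and not necessarily the selector intended in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 500                       # many more features than samples
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 3.0                       # hypothetical: only three features matter
y = X @ beta + rng.standard_normal(n)

# With p >= n the design matrix has rank n, so least squares can
# interpolate the response: a "perfect" but meaningless fit.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
perfect_fit_residual = np.linalg.norm(X @ coef - y)


def marginal_screen(X, y, d):
    """Return indices of the d features with the largest absolute
    marginal (Pearson) correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:d]


d = n - 2                            # keep fewer than n - 1 features
kept = marginal_screen(X, y, d)
X_screened = X[:, kept]              # now an ordinary n > p problem
```

After screening, X_screened has more rows than columns, so standard regression or a sparse selector can be applied to it. Whether the truly relevant features survive the screen depends on the signal strength and on n, which is why the guarantee is only probabilistic.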
Experimental Design
• Randomization, replication and matching enhance our ability to reproduce research
• In particular, replication ensures the results are not sample specific, while blocking allows variability in the samples without swamping the effects
• Multi-omics is best done on single samples measured on multiple platforms
• Technical replication is seldom worth the cost compared to taking more biological replicates

Dimension Reduction
PCA (or SVD) has many relatives that can be used to reduce the number of features using projections onto a lower dimensional space.
The components are often not interpretable.
Many variations are available from both the machine learning and statistics communities.
Machine learning stresses fitting the data.
Statistics stresses fitting the data generating process.

Mixture Modeling
• In many cases we can think of a sample as a mixture of subpopulations
• We can use the EM algorithm or Bayesian methods to deconvolve into the components.

Some other statistical topics already mentioned
• missing features (present but not detected) which differ between samples
• mis-identified features
• do p-values (or FDR estimates) matter?
• multiple times; multiple cells; multiple individuals
• biological variation vs measurement noise & error propagation
• how can we enhance reproducibility (statistical issues)
• can we fit complex models? should we?
• the data are too big for most statistically trained folks
• how are we going to train the current and next generation?
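To make the mixture modeling point concrete, here is a minimal Python sketch of the EM algorithm deconvolving a one-dimensional sample into two Gaussian subpopulations. The simulated data, the function name em_two_gaussians, the quartile-based starting values and the fixed number of iterations are all assumptions chosen for illustration; real omics mixtures would usually need more components, more dimensions and a convergence check.

```python
import numpy as np


def em_two_gaussians(x, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture.

    Returns the weights, means, standard deviations, the responsibilities
    from the final E-step, and the log-likelihood at each iteration.
    """
    x = np.asarray(x, dtype=float)
    w = np.array([0.5, 0.5])
    # Crude starting values (an assumption): lower and upper quartiles.
    mu = np.array([np.quantile(x, 0.25), np.quantile(x, 0.75)])
    sd = np.array([x.std(), x.std()])
    loglik = []
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from each component.
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        total = dens.sum(axis=1)
        resp = dens / total[:, None]
        loglik.append(np.log(total).sum())
        # M-step: responsibility-weighted updates of the parameters.
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return w, mu, sd, resp, np.array(loglik)


# Simulated sample: a mixture of two subpopulations (hypothetical values).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.0, 200)])
w, mu, sd, resp, loglik = em_two_gaussians(x)
```

The rows of resp give, for each observation, the estimated probability of belonging to each subpopulation. The log-likelihood trace should never decrease from one iteration to the next, which is a useful sanity check on any EM implementation.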