Statistical Analyses of Life Science and
Pathology from Genomic Perspective
Unit of Statistical Genetics, Kyoto University
Ryo Yamada
[email protected]
[Overview figure, "Molecular genetics at a glance": the life course of an individual (fertilized egg, birth, childhood, adolescence, adulthood, aging, death; development from gametes), the omics layers (DNA/genome, RNA/transcriptome, proteins/proteome, metabolites/metabolome) across molecules, cells, tissues, and organs, and the path from heredity, stochastics, and genetic diversity to phenotypes and phenotypic diversity; big data and statistics connect them.]
[Figure: the genotype-to-phenotype cascade. The DNA/genome is consistent across the individual; DNA/chromosome modification, the transcriptome, and the protein/proteome are intermediate phenotypes; the phenotype/phenome is the terminal phenotype, unfolding over the time x space of the individual. Germline mutations enter via the parents and the fertilized egg; somatic mutations and somatic mosaicism (e.g., functional somatic mutations in cancer) arise afterwards.]
Diversity in Phenotypes
• Easy to measure vs. not easy to measure
• Representatives vs. Distribution itself
• Many but Mutually Independent vs. Mutually Correlated
Representatives vs. Distribution itself
• Temperature: a representative of a molecular population
• Observations as independent and identically distributed variables
• Well-shaped distribution → representatives → parametric approach
• Badly-shaped distribution → the distribution itself → non-parametric approach
• One sample is a set of observations
• One sample gives a distribution → are representatives enough?
(Image: Thermo Fisher Scientific)
Many but Mutually Independent vs. Mutually Correlated
• Multiple items are mutually correlated.
(Image: Yokogawa Electric)
• Chronological data (time line)
• Shape data (spatial continuity)
• Movement data (time x space)
• Pattern data (informational axes)
Nature 465, 918–921 (17 June 2010)
https://ja.wikipedia.org/wiki/胚
Summary for Genotypes and Phenotypes
• To start your analysis
• Record "values"
• "Values" take various shapes
• "Simple value": a number
• "Number"
• Natural numbers, integers, rationals, reals, complex numbers, vectors, matrices, ...
• "Values" for analysis that are not "numbers"
• Mathematical models
• Biological phenomena have random errors
• Stochastic models, statistical models
• Models have parameters, and then
• "Values" for parameters are "numbers" again
• "Simple values" are values of parameters in simple models.
• Complex models and their parameters can also be values for your analysis.
Roles of statistics/data science for genome/omics
• Quality Control of Noisy High-Throughput Data
• Tests, Estimation/Inference, Classification/Clustering
• Multi-dimensional/High-dimensional Data
• Random value-based approaches
• Others : Experimental Designs
Quality Control of Noisy High-Throughput Data
• Systematic errors/biases: samples, reagents, date/machine/personnel effects
• How to correct or control the noise:
• Outlier detection
• Transformation of all records with a function
• Normalization for "locational effects"
• "Control samples"
Transformation of all records with a function
• Genomic control for GWAS
• Preprocessing microarray data: median-based correction, log-transformation
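A minimal Python sketch of the two transformations named on this slide, assuming their standard forms; the data here are simulated placeholders, not real measurements:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# --- Genomic control for GWAS ---
# chisq: 1-df association statistics, one per SNP (simulated placeholder)
chisq = rng.chisquare(df=1, size=100_000)
lam = np.median(chisq) / stats.chi2.ppf(0.5, df=1)  # inflation factor lambda
chisq_gc = chisq / max(lam, 1.0)                    # deflate only if lambda > 1

# --- Microarray preprocessing: log-transformation + median-based correction ---
# X: raw intensities, rows = probes, columns = arrays (simulated placeholder)
X = rng.lognormal(mean=8.0, sigma=1.0, size=(1_000, 6))
logX = np.log2(X)
logX -= np.median(logX, axis=0)  # align each array's median to zero
```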
Normalization for “locational effects”
• Trends across locations should be considered.
• Batch effects should be considered.
• Non-data-driven approaches
• Data-driven approaches
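The slide does not name a specific method; quantile normalization is one common data-driven choice, sketched here as an illustration:

```python
import numpy as np

def quantile_normalize(X):
    """Force every array (column) to share the same empirical distribution.
    Rows = probes, columns = arrays; a common data-driven normalization."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank within each column
    target = np.sort(X, axis=0).mean(axis=1)           # mean value at each rank
    return target[ranks]                               # map ranks back to values
```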
Tests, Estimation/Inference, Classification/Clustering
• Tests
• Significance, Error Controlling, Multiple-testing issue
• Estimation/Inference
• Interval, Models, Bayes
• Classification/Clustering
• Unsupervised Learning vs. Supervised Learning
Multiple Comparison
P-value vs. Q-value
Multiple Comparison
• Almost all hypotheses are NULL
[Figure: under the null, p-values follow the uniform distribution. With 2^10 tests, the distribution of the minimum p-value is shown with its mean marked: the minimum p-value can take values considerably larger than the mean, but in many cases it is smaller than the mean, and such small values are not rare. Further panels show the minimum p-value distribution as the number of tests grows 1, 2, 4, 8, ..., up to 10^6.]
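A small simulation of the point the figure makes; the number of tests (2^10) comes from the slide, everything else is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tests, n_sims = 2**10, 10_000

# Under the global null, every p-value is Uniform(0, 1)
p = rng.uniform(size=(n_sims, n_tests))
min_p = p.min(axis=1)

print(min_p.mean())                   # close to 1 / (n_tests + 1) ~ 0.001
print((min_p < min_p.mean()).mean())  # most minima fall below the mean
# Exact CDF for comparison: P(min p <= x) = 1 - (1 - x)**n_tests
```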
NON-NULL, FDR (False Discovery Rate)
• Many hypotheses are NON-NULL, or almost all hypotheses are NON-NULL
• The observed p-values are a combination of two distributions:
• Uniform p-values (null)
• Small p-values (non-null)
• Pick the smaller p-values.
• The threshold value should be changed according to the rank of each p-value.
• The fraction of false discoveries among those picked is controlled.
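The rank-dependent threshold just described is what the Benjamini-Hochberg procedure implements; a minimal sketch:

```python
import numpy as np

def benjamini_hochberg(pvals, fdr=0.05):
    """Boolean mask of discoveries, controlling FDR at the given level."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    # Rank-dependent thresholds: compare the i-th smallest p with (i/m) * fdr
    below = p[order] <= (np.arange(1, m + 1) / m) * fdr
    keep = np.zeros(m, dtype=bool)
    if below.any():
        last = np.nonzero(below)[0].max()  # largest rank still below its threshold
        keep[order[: last + 1]] = True     # call everything up to that rank
    return keep
```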
Large-scale inference
• When you observe many things at once, their joint distribution is informative.
• Estimates that use this information differ from estimates that do not.
• The q-value of FDR is one such estimate.
• Use the information in the distribution when many things are observed together:
• Empirical Bayes
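A sketch of BH-style q-values, read as the smallest FDR level at which each test would be called; this is a common construction, though the q-value has a more refined definition in Storey's work:

```python
import numpy as np

def q_values(pvals):
    """BH-style q-value: smallest FDR at which each test becomes a discovery."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    q = p[order] * m / np.arange(1, m + 1)    # raw rank-adjusted values
    q = np.minimum.accumulate(q[::-1])[::-1]  # enforce monotonicity in rank
    out = np.empty(m)
    out[order] = np.minimum(q, 1.0)
    return out
```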
Estimation/Inference
• Models, Parameters, Interval, Bayes
• Uniform p-values
• Small p-values
• Assuming a mixture of these two distributions: this is a model.
Estimation/Inference
• Samples → Point estimates, Interval estimates
• Sampling distribution, theoretical estimates, unbiased estimates, ...
• Frequentist: the statement "The star's weight is between a and b" will be right 9 times out of 10.
Estimation/Inference
• Frequentists vs. Bayesians
• Frequentist approaches are difficult for students who are not good at mathematics, and their lines of reasoning are not easy to follow.
• Bayesian lines of reasoning, in contrast, tend to be easy for many to follow.
Estimation/Inference
• Bayesian
• Model has parameter(s)
• Data + Model → estimation of parameter values
• Likelihood-based: maximum-likelihood estimates; interval estimates based on likelihood
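A minimal sketch of both ideas for a success probability; the 7-out-of-10 data and the log-likelihood drop of 1.92 (half of 3.84, giving a roughly 95% likelihood interval) are illustrative assumptions, not from the slide:

```python
import numpy as np

# Illustrative data: 7 successes in 10 trials (not from the slide)
s, n = 7, 10
grid = np.linspace(1e-4, 1 - 1e-4, 9_999)
log_lik = s * np.log(grid) + (n - s) * np.log(1 - grid)

p_hat = grid[log_lik.argmax()]            # maximum-likelihood estimate (~0.7)
inside = log_lik >= log_lik.max() - 1.92  # likelihood-based ~95% interval
interval = grid[inside][[0, -1]]
```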
Estimation/Inference
• Frequentists vs. Bayesians
• Use both; do not select just one of them. That is the way in the 21st century.
• Bayesian approaches seem to be used more and more, because:
• Models have become more complicated.
• Computers assist: complicated distributions can be handled by simulation.
• Large-scale data: empirical Bayes approaches can be applied.
Estimation/Inference
• Frequentists vs. Bayesians
• “Prior” distribution is necessary
• What is the “appropriate prior”?
Success rate: no information at all
• Somebody you don't know at all will take an exam about which you have no information at all. How likely do you think (s)he will pass it?
• Jeffreys prior: one of the non-subjective priors
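For a success probability, the Jeffreys prior is Beta(1/2, 1/2); a sketch of how it answers the "no information" question, with SciPy (the 3-out-of-10 data are illustrative):

```python
from scipy import stats

# Jeffreys prior for a success probability: Beta(1/2, 1/2)
prior = stats.beta(0.5, 0.5)
print(prior.mean())           # 0.5: pass and fail equally likely a priori
print(prior.interval(0.95))   # very wide: nearly the whole (0, 1)

# After observing s passes in n trials, the posterior is Beta(s + 1/2, n - s + 1/2)
s, n = 3, 10                  # illustrative data, not from the slide
posterior = stats.beta(s + 0.5, n - s + 0.5)
print(posterior.mean())
```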
Estimation/Inference
• Frequentists vs. Bayesians
• Use both; do not select just one of them. That is the way in the 21st century.
• Large-scale inference
• The prior can be set based on the data set itself ~ empirical Bayes
Multi-dimensional/High-dimensional Data
• There is no way to visualize high-dimensional data directly.
• It is almost impossible for us to understand high-dimensional data themselves.
Multi-dimensional/High-dimensional Data
• How many dimensions can we handle?
• 2D space or 3D space
• Extra dimensions
• Gray/Color scale
• Arrows
• Time
Multi-dimensional/High-dimensional Data
• Dimension reduction
• Pick 2 or 3 dimensions that seem important; then visualization is easy and we feel we can understand them.
• Only a few dimensions are truly meaningful and all the others are noise; pick the true dimensions.
• LASSO, compressed sensing
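The slide does not name a method for picking 2-3 dimensions; principal component analysis (PCA) is the standard one, sketched here via the SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))     # 200 samples, 50 features (placeholder)

Xc = X - X.mean(axis=0)            # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T                 # project onto the top 2 components
explained = S**2 / (S**2).sum()    # variance explained per component
```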
Multi-dimensional/High-dimensional Data
• The space is high-dimensional but the data are low-dimensional
• Manifold learning
• Put the data into a higher-dimensional space, then pull them back to a low-dimensional space.
High-dimensionality
• Many genes, many biomarkers, many features
Multi-dimensional/High-dimensional Data
• Life-science data are high-dimensional: the number of observed items is huge. [Figure: FACS]
• But the items are strongly mutually correlated, and their effective dimension is much smaller in reality. [Figure: ethnic diversity]
Multi-dimensional/High-dimensional Data
• Low-dimensional objects in a higher-dimensional space
• Topology
• Graph, network and topology
Multi-dimensional/High-dimensional Data
• Graph: itemize, and connect items with relations
• Only pairwise relations are captured.
• Triple-wise or higher-order relations are not handled.
Multi-dimensional/High-dimensional Data
• A graph, its matrix representation, and linear algebra
• Graphs tend to be sparse
• ... sparse analysis
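A sketch of the matrix view: a small hypothetical graph, its adjacency matrix, and the graph Laplacian that linear-algebra methods work on:

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]   # hypothetical 4-node graph
n = 4

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0                # undirected: symmetric adjacency

L = np.diag(A.sum(axis=1)) - A             # graph Laplacian
eigvals = np.linalg.eigvalsh(L)            # spectrum; basis of spectral methods
# Real graphs have few edges relative to n**2, hence sparse-matrix analysis
```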
Multi-dimensional/High-dimensional Data
• Two important features
• No “common” individuals
• Sparse
High-dimensionality
• No commons
• Central area: the sphere inscribed in a square covers pi/4 = 3.14/4 ≈ 0.785 of it in 2D, but the inscribed ball's share shrinks toward zero as the dimension grows, so almost nobody sits in the "common" central region.
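The shrinkage is easy to verify: the volume of the ball of diameter 1 inside the unit cube, per dimension:

```python
import math

# Fraction of the unit cube occupied by its inscribed ball, per dimension
for d in range(1, 11):
    ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * 0.5 ** d
    print(d, round(ball, 6))   # d=2: 0.785398 (pi/4); d=10: ~0.0025
```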
High-dimensionality
• Sparse
• To estimate density, you need a reasonable number of samples per small cubic cell, but the fraction of uniformly spread samples falling in a cell of side 0.1 is 0.1^d:
• Dim = 1: 0.1
• Dim = 2: 0.01
• Dim = 3: 0.001
• ...
• Dim = 6: 0.000001
High-dimensionality
• The space is quite spacious, yet real distributions are reasonably dense.
• Hence the distribution must be low-dimensional.
Low-dimensional distribution in a higher-dimensional space and its local density
• Regular density-estimation methods do not work.
• Small cubic cells are still too spacious in a high-dimensional space.
• How to estimate local density:
• k-nearest-neighbor method
• In graph theory, a similar idea is applicable:
• Minimum spanning tree
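A sketch of the k-nearest-neighbor density estimate: the density at a point is taken as k over (n times the volume of the ball that just reaches the k-th neighbor); k and the simulated data are illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def knn_density(X, k=5):
    """k-NN density estimate: k / (n * volume of the ball to the k-th neighbor)."""
    n, d = X.shape
    dist, _ = cKDTree(X).query(X, k=k + 1)  # k+1: the nearest hit is the point itself
    r = dist[:, -1]                         # distance to the k-th true neighbor
    vol = np.pi ** (d / 2) / gamma(d / 2 + 1) * r ** d
    return k / (n * vol)

X = np.random.default_rng(0).normal(size=(500, 3))
dens = knn_density(X)                       # higher near the origin, as expected
```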
Sparse in high dimensional space
• How sparse?
• One-dimensional manifolds
• But significant variance
Clustering
Two types of clustering methods
• Non-hierarchical
• Hierarchical
Hierarchical
• Tree structure: a graph, again
• Its structure carries information
• Its structure is related to dimension
• On the tree, a distance is defined (see the sketch below).
• Some phenomena have reasons to be analyzed hierarchically.
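A minimal hierarchical-clustering sketch with SciPy; the linkage method and the cut into three clusters are illustrative choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, cophenet

X = np.random.default_rng(0).normal(size=(30, 4))  # 30 samples, 4 features
Z = linkage(X, method="average")                   # build the tree (a graph, again)
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
coph_dist = cophenet(Z)                            # distances defined "on the tree"
```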
Classification
• Separate things that are difficult to segregate.
(J. Med. Imag. 1(3), 034501 (2014). doi:10.1117/1.JMI.1.3.034501)
Classification/Clustering
• Unsupervised learning
• Supervised learning
• No teacher, but we want to check whether the classification criterion is reliable or not.
• Cross-validation: one of the resampling methods (sketched below)
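A minimal k-fold cross-validation sketch; the fold count and the placeholder fit/evaluate steps are illustrative:

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Shuffle n sample indices and split them into k folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

n = 100
for test in kfold_indices(n, k=5):
    train = np.setdiff1d(np.arange(n), test)
    # model.fit(X[train], y[train]); evaluate on (X[test], y[test]), then
    # average the k held-out scores to judge the classifier's reliability
```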
Small n Large p
• Sample size 100; test the association between a trait and the expression of ONE gene.
• N = 100, p = 1 → large n, small p
• Sample size 100; test the association between a trait and the expression of MANY genes.
• N = 100, p = 25,000 → small n, large p
n << p
• One set of variables gives a perfect answer.
• Another set of variables gives a perfect but different answer.
• Which answer is the truth?
• A closer fit is not always the best.
• AIC ~ a simpler model is better
• LASSO, sparsity (a LASSO sketch follows this list)
• The assumption that k << n variables should be the answer is a "prior" belief: Bayesian
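A tiny LASSO sketch via coordinate descent with soft-thresholding; the penalty lam and the iteration count are illustrative assumptions:

```python
import numpy as np

def lasso(X, y, lam=0.1, n_iter=100):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # residual excluding feature j
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z  # soft-threshold
    return beta                                    # most entries end up exactly 0
```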
Resampling
• Estimation based on samples
• Jackknife (subsets), bootstrap (sampling with replacement)
• Statistical significance
• Permutation ~ Exact probability
• Cross-validation
• Pseudo-random generators from computers
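A minimal bootstrap sketch (resampling with replacement) for an interval estimate of a mean; the data are a simulated placeholder:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=50)          # observed sample (placeholder)

boot_means = np.array([
    rng.choice(x, size=x.size, replace=True).mean()  # resample WITH replacement
    for _ in range(10_000)
])
ci = np.quantile(boot_means, [0.025, 0.975])         # percentile bootstrap interval
```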
Pseudo-random number sequences
• From uniform distribution
• From other known distributions
• From arbitrary distributions … Gibbs sampling
• Using a Gibbs sampler:
• based on a stochastic model, estimate the parameters of distributions and generate random values from the estimated distributions...
• BUGS (Bayesian inference Using Gibbs Sampling)
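A minimal Gibbs-sampling sketch: drawing from a bivariate normal with correlation rho by alternating the two one-dimensional full conditionals; rho and the sample count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
rho, n_samples = 0.8, 5_000
sd = np.sqrt(1 - rho**2)         # conditional standard deviation

x = y = 0.0
samples = np.empty((n_samples, 2))
for i in range(n_samples):
    x = rng.normal(rho * y, sd)  # draw x | y
    y = rng.normal(rho * x, sd)  # draw y | x
    samples[i] = x, y
# np.corrcoef(samples.T)[0, 1] approaches rho as the chain runs
```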
Example
• The fraction of red vs. green is repeatedly estimated.
• Based on the assumption that the red component follows a non-central chi-square distribution, its non-centrality parameter value is repeatedly estimated.
• Eventually both unknown parameter values are estimated.
Pseudo-random number sequences
• From the uniform distribution
• From other known distributions
• From arbitrary distributions ... Gibbs sampling
• Using a Gibbs sampler:
• based on a stochastic model, estimate the parameters of distributions and generate random values from the estimated distributions...
• BUGS (Bayesian inference Using Gibbs Sampling)
• MCMC (Markov chain Monte Carlo)
• With Stan (a Bayesian estimation application)
Pseudo-random numbers, Monte-Carlo
• Computer-driven methods
Experimental designs
• Various data sets
• Using all kinds of them, what can we state?
• Individual analysis/interpretation is tough enough; integration of them is tougher.
• Construct a model/assumption to integrate multiple sets.
• There are variations in how to combine: the order of combination, the structure of combination, ...
• Use raw data sets from multiple resources.
• Integrate primary outputs from each data set: so-called meta-analyses.
• Narrow-sense meta-analyses only combine outputs from analyses of similar data sets.
• The difficulties are rooted in the heterogeneity among the various data resources.
• Each data resource has its own way of being analyzed, and the variation among analyses makes integration difficult. Should each analysis method be unified, then???
Some resources
• http://statgenet-kyotouniv.wikidot.com/master-class-lecture2016statistical-methods-for-life-scienc
• Its linked sites should help broaden and deepen your understanding of today's lecture.