Statistical Analyses of Life Science and Pathology from a Genomic Perspective
Unit of Statistical Genetics, Kyoto University
Ryo Yamada  [email protected]

Molecular Genetics at a Glance
[Diagram: life course (birth, childhood, adolescence, adulthood, aging, death); biological scales (molecules: DNA, RNA, proteins, metabolites; cell, tissue, organ, individual); omics layers (genome, transcriptome, proteome, metabolome, phenome); heredity, stochastics, genetic diversity]

[Diagram: phenotypes and phenotypic diversity; diversity, big data, statistics]

[Diagram: from genotype to phenotype; DNA/genome (consistent), DNA/chromosome modification, transcriptome and protein/proteome as intermediate phenotypes, terminal phenotype/phenome, over the time x space of an individual]

[Diagram: germline mutations (parents to fertilized egg) vs. somatic mutations; functional somatic mutations in cancer, somatic mosaicism]

Diversity in Phenotypes
• Easy to measure vs. not easy to measure
• Representatives vs. the distribution itself
• Many but mutually independent vs. mutually correlated

Representatives vs. the Distribution Itself
• Temperature: a representative of a molecular population
• Independent and identically distributed variable observation
• Good-shaped distribution → representatives → parametric approach
• Bad-shaped distribution → the distribution itself → non-parametric approach
• One sample is a set of observations
• One sample gives a distribution → are representatives enough?
[Image: Thermo Fisher Scientific]

Many but Mutually Independent vs. Mutually Correlated
• Multiple items are mutually correlated.
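What "mutually correlated items" means for measured data can be sketched with a small simulation. This is an illustrative example on synthetic NumPy data (the shared-latent-factor setup and all variable names are assumptions for the sketch, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of samples

# Independent items: each column is drawn separately
independent = rng.normal(size=(n, 3))

# Correlated items: columns share a common latent factor,
# as strongly inter-correlated omics measurements often do
latent = rng.normal(size=(n, 1))
correlated = latent + 0.3 * rng.normal(size=(n, 3))

# Off-diagonal correlations are near 0 for independent items
# and near 1 for items driven by the shared factor
r_indep = np.corrcoef(independent, rowvar=False)
r_corr = np.corrcoef(correlated, rowvar=False)
print(round(float(r_indep[0, 1]), 2), round(float(r_corr[0, 1]), 2))
```

In the correlated case the effective dimension is close to one even though three items are recorded, which is the situation the following slides return to.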
[Image: Yokogawa Electric]
• Chronological data (time line)
• Shape data (spatial continuity)
• Movement data (time × space)
• Pattern data (informational axes)
Nature 465, 918–921 (17 June 2010)
https://ja.wikipedia.org/wiki/胚

Summary for Genotypes and Phenotypes
• To start your analysis, record "values"
• "Values" take various shapes
• A "simple value" is a number
• "Number": natural number, integer, rationals, reals, complex, vector, matrix, …
• "Values" for analysis that are not "numbers": mathematical models
• Biological phenomena have random errors → stochastic models, statistical models
• Models have parameters, and the "values" of parameters are "numbers" again
• "Simple values" are the parameter values of simple models
• Complex models and their parameters can also be the values for your analysis

Roles of Statistics/Data Science for Genome/Omics
• Quality control of noisy high-throughput data
• Tests, estimation/inference, classification/clustering
• Multi-dimensional/high-dimensional data
• Random-value-based approaches
• Others: experimental designs

Quality Control of Noisy High-Throughput Data
• Systematic errors/biases: samples, reagents, date/machine/personnel effects
• How to correct or control the noise:
• Outlier detection
• Transformation of all records with a function
• Normalization for "locational effects"
• "Control samples"

Transformation of All Records with a Function
• Genomic control for GWAS
• Preprocessing microarray data: median-based correction, log-transformation

Normalization for "Locational Effects"
• Tendency should be considered
• Batch effects should be considered
• Non-data-driven vs. data-driven approaches

Tests, Estimation/Inference, Classification/Clustering
• Tests: significance, error control, the multiple-testing issue
• Estimation/inference: intervals, models, Bayes
• Classification/clustering: unsupervised learning vs. supervised learning

Multiple Comparison: P-value vs.
Q-value

Multiple Comparison
• Almost all hypotheses are NULL → p-values follow a uniform distribution
• Minimum p-value distribution: with 2^10 tests, the minimum p-value is typically far smaller than the mean of the uniform distribution, and such small values are not rare under the null
[Plot: distribution of the minimum p-value for 1, 2, 4, 8, …, 10^6 tests]

NON-NULL and FDR (False Discovery Rate)
• Many, or almost all, hypotheses are NON-NULL
• The observed p-values combine two distributions: uniform p-values (null) and small p-values (non-null)
• Pick the smaller p-values; the threshold value should change with the rank of the p-value
• The fraction of false discoveries among the reported positives is controlled

Large-Scale Inference
• When you observe many things at once, their overall distribution is informative
• Estimates of each observation that use this information differ from estimates that do not
• The q-value of FDR is one type of such estimates
• Use the information of the distribution when many are observed together → empirical Bayes

Estimation/Inference
• Models, parameters, intervals, Bayes
• Uniform p-values + small p-values: assuming a mixture of the two distributions; this is a model

Estimation/Inference
• Samples → point estimates, interval estimates
• Sampling distribution, theoretical estimates, unbiased estimates, …
• Frequentist: the statement "the star's weight is between a and b" will be right 9 times out of 10

Estimation/Inference: Frequentists vs. Bayesians
• Frequentist approaches are difficult for students who are not good at mathematics, and their reasoning is not easy to follow; Bayesian reasoning tends to be easier to follow for many
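The multiple-comparison ideas above, the behavior of the minimum p-value under the null and a rank-dependent (Benjamini–Hochberg-style) threshold that controls the FDR, can be sketched as follows. The mixture proportions, the Beta parameters for the non-null p-values, and alpha are illustrative assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)

# 1) Under the null, p-values are uniform, but the minimum of m
#    p-values concentrates near 1/(m+1), far below the mean of 0.5.
m = 1024  # 2^10 tests, all null
min_ps = [rng.uniform(size=m).min() for _ in range(2000)]
print(np.mean(min_ps))  # small, nowhere near 0.5

# 2) Mixture of uniform (null) and small (non-null) p-values.
#    Benjamini-Hochberg compares the i-th smallest p-value to (i/m)*alpha,
#    so the threshold grows with the rank of the p-value.
null_p = rng.uniform(size=900)
signal_p = rng.beta(0.05, 10.0, size=100)  # concentrated near 0 (illustrative)
p = np.sort(np.concatenate([null_p, signal_p]))
m_total = p.size
alpha = 0.05
passed = p <= np.arange(1, m_total + 1) / m_total * alpha
# reject everything up to the largest rank that passes
n_discoveries = int(passed.nonzero()[0].max()) + 1 if passed.any() else 0
print(n_discoveries)
```

The step-up rule is what makes the threshold rank-dependent: late ranks are allowed larger p-values, while the expected fraction of false discoveries among the rejections stays controlled at alpha.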
Estimation/Inference: Bayesian
• A model has parameter(s)
• Data + model → estimation of parameter values
• Likelihood-based: maximum-likelihood estimates; interval estimates based on likelihood

Estimation/Inference: Frequentists vs. Bayesians
• Use both rather than selecting one of them; that is the way in the 21st century
• Bayesian approaches seem to be used more and more, because:
• Models became more complicated
• Computers assist: complicated distributions can be handled by simulation
• Large-scale data: empirical Bayes approaches can be applied

Estimation/Inference: Frequentists vs. Bayesians
• A "prior" distribution is necessary
• What is the "appropriate prior"?
• Success rate with no information at all: somebody you do not know at all will take an exam about which you have no information; how likely do you think (s)he will pass it?
• Jeffreys prior: one of the non-subjective priors

Estimation/Inference: Frequentists vs.
Bayesians
• Use both rather than selecting one of them
• Large-scale inference: the prior can be set based on the data set itself ~ empirical Bayes

Multi-dimensional/High-dimensional Data
• There is no way to visualize high-dimensional data directly
• It is almost impossible for us to understand high-dimensional data themselves

Multi-dimensional/High-dimensional Data
• How many dimensions can we handle?
• 2D or 3D space
• Extra dimensions: gray/color scale, arrows, time

Multi-dimensional/High-dimensional Data
• Dimension reduction
• Pick 2 or 3 seemingly important dimensions; then visualization is easy and we feel we can understand them
• Only a few dimensions are truly meaningful and all the others are noise; pick the true dimensions
• LASSO, compressed sensing

Multi-dimensional/High-dimensional Data
• The space is high-dimensional but the data are low-dimensional
• Manifold learning: put the data into a higher-dimensional space and pull them back into a low-dimensional space

High-Dimensionality
• Many genes, many biomarkers, many features

Multi-dimensional/High-dimensional Data
• Life-science data are high-dimensional: the number of observed items is huge (e.g. FACS)
• But the items are strongly mutually correlated, so their effective dimension is much smaller in reality (e.g. ethnic diversity)

Multi-dimensional/High-dimensional Data
• Objects with low dimension in a higher-dimensional space
• Topology: graphs, networks and topology

Multi-dimensional/High-dimensional Data
• Graph: itemize, and connect items that have a relation
• Only pairwise relations are handled
• No care for trio-wise or higher relations
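Building a graph from pairwise relations alone can be sketched as below: items are connected whenever their pairwise correlation exceeds a threshold, and nothing above pairs is recorded. The block-structured synthetic data and the 0.7 threshold are illustrative assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data: 50 items (e.g. genes) over 30 samples,
# organized into a few blocks of strongly correlated items
n_items, n_samples = 50, 30
latent = rng.normal(size=(5, n_samples))
block = rng.integers(0, 5, size=n_items)
x = latent[block] + 0.3 * rng.normal(size=(n_items, n_samples))

# Pairwise relations only: connect items whose |correlation| exceeds 0.7
r = np.corrcoef(x)
adjacency = (np.abs(r) > 0.7) & ~np.eye(n_items, dtype=bool)

# The resulting graph is typically sparse: most pairs stay unconnected
density = adjacency.sum() / (n_items * (n_items - 1))
print(round(float(density), 3))
```

The adjacency matrix is the bridge to the linear-algebra view of graphs, and its sparsity is exactly the property that sparse analysis exploits.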
Multi-dimensional/High-dimensional Data
• A graph, its matrix representation, and linear algebra
• Graphs tend to be sparse → sparse analysis

Multi-dimensional/High-dimensional Data
• Two important features of high-dimensional data:
• No "common" individuals
• Sparse

High-Dimensionality
• No commons
• Central area: the circle inscribed in a square fills π/4 ≈ 0.785 of it in two dimensions; in higher dimensions this fraction shrinks toward zero, so almost nobody sits near the center

High-Dimensionality
• Sparse: to estimate density you need a reasonable number of samples per small cubic volume, but the volume of a cube with side 0.1 is:
• Dim = 1: 0.1
• Dim = 2: 0.01
• Dim = 3: 0.001
• …
• Dim = 6: 0.000001

High-Dimensionality
• The space is quite spacious, yet the observed distribution is reasonably dense
• Therefore the distribution should be low-dimensional

Low-Dimensional Distribution in Higher-Dimensional Space and Its Local Density
• Regular density-estimation methods do not work: small cubes are still spacious in high-dimensional space
• How to estimate local density: the k-nearest-neighbor method
• In graph theory a similar idea is applicable: the minimum spanning tree

Sparse in High-Dimensional Space
• How sparse?
• One-dimensional manifolds, but with significant variance

Clustering: Two Types of Clustering Methods
• Non-hierarchical
• Hierarchical

Hierarchical
• Tree structure: a graph, again
• Its structure has information
• Its structure is related to dimension
• On the tree, a distance is defined
• Some phenomena have reasons to be analyzed hierarchically

Classification
• Separate things that are difficult to segregate
J. Med. Imag. 1(3), 034501 (Oct 09, 2014). doi:10.1117/1.JMI.1.3.034501

Classification/Clustering
• Unsupervised learning vs. supervised learning
• No teacher, but we want to check whether the classification criteria are reliable
• Cross-validation: one of the resampling methods

Small n, Large p
• Large n, small p: sample size 100; test the association between a trait and the expression of ONE gene → N = 100, p = 1
• Small n, large p: sample size 100; test the association between a trait and the expression of MANY genes → N = 100, p = 25,000

n << p
• One set of variables gives a perfect answer; another set of variables gives a perfect but different answer; which answer is the truth?
• Closer fitting is not always the best
• AIC: a simpler model is better
• LASSO, sparsity
• The assumption that k << n variables should be the answer is itself a "prior" belief → Bayesian

Resampling
• Estimation based on samples: jackknife (subsets), bootstrap (sampling with replacement)
• Statistical significance: permutation ~ exact probability
• Cross-validation
• Pseudo-random generators on computers

Pseudo-Random Number Sequences
• From the uniform distribution
• From other known distributions
• From arbitrary distributions … Gibbs sampling

Gibbs Sampling
• Using a Gibbs sampler, based on a stochastic model, estimate the parameters of distributions and generate random values from the estimated distributions …
• BUGS (Bayesian inference Using Gibbs Sampling)
• MCMC (Markov-chain Monte Carlo)
• With Stan (a Bayesian estimation application)

Example
• The fraction of red vs. green is repeatedly estimated
• Based on the assumption that the red follows a non-central chi-square distribution, its non-centrality parameter value is repeatedly estimated
• Eventually both unknown parameter values are estimated

Pseudo-Random Numbers, Monte Carlo
• Computer-driven methods

Experimental Designs
• Various data sets
• Using all kinds of them, what can we state?
• Individual analysis/interpretation is tough enough; integration of them is tougher
• Construct a model/assumption to integrate multiple data sets
• There are variations in how to combine them: the order of combination, the structure of combination, …
• Use raw data sets from multiple resources, or
• Integrate the primary outputs from each data set: so-called meta-analyses
• Narrow-sense meta-analyses only combine the outputs of analyses of similar data sets
• The difficulties are rooted in the heterogeneity among the various data resources
• Each data resource has its own way of analysis, and the variations among analyses make integration difficult. Then, should every analysis method be unified???

Some Resources
• http://statgenet-kyotouniv.wikidot.com/master-class-lecture2016statistical-methods-for-life-scienc
• Its linked sites would be helpful to broaden and deepen your understanding of today's lecture.