3. Introductory Statistical Principles
Sihua Peng, PhD
Shanghai Ocean University
2016.9

Contents
1. Introduction to R
2. Data sets
3. Introductory Statistical Principles
4. Sampling and experimental design with R
5. Graphical data presentation
6. Simple hypothesis testing
7. Introduction to Linear models
8. Correlation and simple linear regression
9. Single factor classification (ANOVA)
10. Nested ANOVA
11. Factorial ANOVA
12. Simple Frequency Analysis

3. Introductory Statistical Principles
Statistics is a branch of the mathematical sciences concerned with the collection, analysis, presentation and interpretation of data, and is therefore central to most scientific fields.

Four Fundamental Terms
Fundamental to statistics is the concept that samples are collected and statistics are calculated in order to estimate populations and their parameters.

Terminology
Population parameters are characteristics of the population, such as the population mean and variability. Since it is usually not possible to observe an entire population, the population parameters must be estimated from the corresponding statistics calculated from a subset of the population known as a sample.

Population, Target population, and Sample
Sample observations are drawn randomly from a population. A target population is the particular subgroup of the population about which we wish to draw conclusions.

3.1 Distributions
The set of observations in a sample can be represented by a sampling or frequency distribution. A frequency distribution (or just distribution) represents how often observations in certain ranges occur (see Figure 3.1a).

Discrete Probability Distributions
For discrete random variables, the probability distribution is fully defined by the probability mass function.

Bernoulli Distribution
A binary random variable X with possible values 0 and 1 has a Bernoulli distribution with parameter θ, where P(X = 1) = θ and P(X = 0) = 1 − θ. We denote this as X ~ Bernoulli(θ), where 0 ≤ θ ≤ 1.
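As a quick sketch in R (the variable names here are illustrative, not from the text), a Bernoulli(θ) variable can be simulated as a binomial with size = 1:

```r
# Simulate 1000 Bernoulli(0.8) trials as single-trial binomials
set.seed(42)                               # for reproducibility
x <- rbinom(1000, size = 1, prob = 0.8)

p_hat <- mean(x)                           # sample proportion estimates theta
p_hat                                      # should be close to 0.8

dbinom(1, size = 1, prob = 0.8)            # P(X = 1) = theta
```

The sample proportion mean(x) is the natural estimate of θ, and dbinom() with size = 1 gives the Bernoulli probability mass function directly.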
[Figure: plot of the Bernoulli(0.8) distribution]

Bernoulli trial
In probability theory and statistics, a Bernoulli trial (or binomial trial) is a random experiment with exactly two possible outcomes, "success" and "failure", in which the probability of success is the same every time the experiment is conducted. It is named after Jacob Bernoulli, a Swiss mathematician of the 17th century.

Binomial Distribution
The random variable representing the number of times the outcome of interest occurs in n Bernoulli trials (i.e., the sum of n Bernoulli trials) has a Binomial(n, θ) distribution, where θ is the probability of the outcome of interest (a.k.a. the probability of success). A binomial distribution is therefore defined by two parameters: the number of Bernoulli trials n and the probability θ of the outcome of interest in the underlying Bernoulli trials.

[Figure: plot of the Binomial(50, 0.8) distribution]

Poisson Distribution
In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space, if these events occur with a known average rate and independently of the time since the last event. A Poisson distribution is specified by a single parameter λ, interpreted as the rate of occurrence within a period of time or region of space. We write X ~ Poisson(λ), where λ is a positive real number (λ > 0). The mean and variance of a random variable with a Poisson(λ) distribution are both equal to λ; that is, μ = λ and σ² = λ.

Probability of events for a Poisson distribution
An event can occur 0, 1, 2, … times in an interval. The average number of events in an interval is designated λ; lambda is the event rate, also called the rate parameter.
The probability of observing k events in an interval is given by
P(X = k) = (λ^k e^(−λ)) / k!,  k = 0, 1, 2, …

[Figure: plot of the Poisson(2.5) distribution]

Continuous Probability Distributions
For continuous random variables, we use probability density functions to specify the distribution. For example:
> MACNALLY <- read.table("macnally.csv", header = TRUE, sep = ",")
> plot(density(MACNALLY$EYR))

The normal distribution
It has been a long-observed mathematical phenomenon that the accumulation of a set of independent random influences tends to converge on a central value (the central limit theorem), and that the distribution of such accumulated values follows a specific "bell-shaped" curve called the normal or Gaussian distribution (see Figure 3.1b). The normal distribution is symmetrical. Many biological measurements are likewise influenced by an almost infinite number of factors, and thus many biological variables also follow a normal distribution.

How to test whether a data set follows a normal distribution?
One method is the Shapiro-Wilk test, performed with the R function shapiro.test():
> mydata <- c(3.4, 4.2, 1.9, 5.2, 3.5, 4.2, 3.7, 3.2)
> shapiro.test(mydata)

        Shapiro-Wilk normality test

data:  mydata
W = 0.95509, p-value = 0.7623

If the p-value > 0.05, we do not reject the hypothesis that the data follow a normal distribution.

Density plots and Q-Q plots provide complementary visual checks:
## Generate two data sets
## First normal, second from a t-distribution
words1 <- rnorm(100); words2 <- rt(100, df = 3)
## Have a look at the densities
plot(density(words1)); plot(density(words2))
## Perform the test
shapiro.test(words1); shapiro.test(words2)
## Plot using a qqplot
qqnorm(words1); qqline(words1, col = 2)
qqnorm(words2); qqline(words2, col = 2)

The boxplot() function can also be used to visually assess whether a data set is approximately normal:
> VAR1 <- rlnorm(15, 4, .5)
> boxplot(VAR1)
> VAR2 <- rnorm(25, 2, .5)
> boxplot(VAR2)
> VAR3 <- log(VAR1)
> boxplot(VAR3)

Student's t-distribution
Another continuous probability distribution that is used very often in statistics is Student's t-distribution, or simply the t-distribution. As we will see in later chapters, the t-distribution plays an especially important role in testing hypotheses about the population mean. A t-distribution is specified by only one parameter, the degrees of freedom df. The t-distribution with df degrees of freedom is usually denoted t(df) or t_df, where df is a positive real number (df > 0). The mean of this distribution is μ = 0, and the variance is determined by the degrees-of-freedom parameter, σ² = df/(df − 2), which is of course defined only when df > 2.

[Figure: comparison of a standard normal distribution with t-distributions with 1 and 4 degrees of freedom]

How to obtain random data sets with various distributions?
Normal distribution: rnorm(n, mean = 0, sd = 1)
Chi-square distribution: rchisq(n, df, ncp = 0)
t distribution: rt(n, df, ncp = 0)
F distribution: rf(n, df1, df2, ncp = 0)

Parameter Estimation
Estimation refers to the process of guessing the unknown value of a parameter (e.g., the population mean) using the observed data. For this we use an estimator, which is a statistic. A statistic is a function of the observed data only: it does not depend on any unknown parameter, and given the observed data we can always compute its value. For example, the sample mean is a statistic. Given a sample of data, we find the sample mean by adding the observed values and dividing the result by the sample size; no unknown parameter is involved in this process.

Population Mean
For a population of size N, μ is calculated as
μ = (1/N) Σ_{i=1}^{N} x_i,
where x_i is the value of the random variable for the i-th member of the population. Given n observed values X_1, X_2, …, X_n from the population, we can estimate the population mean μ with the sample mean:
X̄ = (1/n) Σ_{i=1}^{n} X_i.
In this case, we say that X̄ is an estimator for μ.

Point Estimation of the Population Mean
We usually have only one sample of size n from the population, x_1, x_2, …, x_n. Therefore, we have only one value for X̄, which we denote x̄:
x̄ = (1/n) Σ_{i=1}^{n} x_i,
where x_i is the i-th observed value of X in our sample and x̄ is the observed value of X̄.

Population Variance
The population variance is denoted σ² and calculated as
σ² = (1/N) Σ_{i=1}^{N} (x_i − μ)².
This is the average squared deviation of each observation x_i from the population mean μ.

Sample Variance
Given n randomly sampled values X_1, X_2, …, X_n from the population and their corresponding sample mean X̄, we can estimate the variance. A natural estimator for the variance is
S² = Σ_{i=1}^{n} (X_i − X̄)² / (n − 1).

Point Estimation of the Population Variance
Again, we regard the estimator S² as a random variable, since it changes as we change the sample. In practice, however, we usually have one set of observed values, x_1, x_2, …, x_n, and therefore only one value for S², which we denote s²:
s² = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1).

Four important distributions
1. Let X ~ N(μ, σ²), let (X_1, X_2, …, X_n) be a sample from the population, and let X̄ be the sample mean. Then
X̄ ~ N(μ, σ²/n),  so  (X̄ − μ) / (σ/√n) ~ N(0, 1).
2. Let X ~ N(μ, σ²), let (X_1, X_2, …, X_n) be a sample from the population with sample mean X̄ and sample variance S². Then
(X̄ − μ) / (S/√n) ~ t(n − 1).
3. Let X ~ N(μ_1, σ_1²) and Y ~ N(μ_2, σ_2²) be independent, and let (X_1, …, X_{n1}) and (Y_1, …, Y_{n2}) be samples from the populations X and Y, with means X̄ and Ȳ, respectively. Then
X̄ − Ȳ ~ N(μ_1 − μ_2, σ_1²/n_1 + σ_2²/n_2),  so  ((X̄ − Ȳ) − (μ_1 − μ_2)) / √(σ_1²/n_1 + σ_2²/n_2) ~ N(0, 1).
4.
Let X ~ N(μ_1, σ_1²) and Y ~ N(μ_2, σ_2²) be independent, and let (X_1, …, X_{n1}) and (Y_1, …, Y_{n2}) be samples from the populations X and Y, with means X̄, Ȳ and sample variances s_1², s_2². Then
((X̄ − Ȳ) − (μ_1 − μ_2)) / (S_p √(1/n_1 + 1/n_2)) ~ t(n_1 + n_2 − 2),
where S_p² = ((n_1 − 1)s_1² + (n_2 − 1)s_2²) / (n_1 + n_2 − 2) is the pooled variance.

Log-normal distribution
Many biological variables have a lower limit of zero. Such circumstances can result in asymmetrical distributions that are highly truncated towards the left with a long right tail (see Figure 3.1c). In such cases, the mean and median take different values (see Figure 3.1d). These distributions can often be described by a log-normal distribution. Consequently, when such data are collected on a linear scale, they might be expected to follow a non-normal distribution.
In probability theory and statistics, the log-normal distribution is the continuous probability distribution of a random variable whose logarithm is normally distributed: if X is a random variable with a normal distribution, then exp(X) follows a log-normal distribution; conversely, if Y follows a log-normal distribution, then ln(Y) follows a normal distribution. If a variable can be seen as the product of many small independent factors, it can be modelled as log-normal.

3.2 Scale transformations
Data transformation is the process of converting the scale in which the observations were measured into another scale.

Scale transformations: an example
Fig 3.2: Fictitious illustration of scale transformations. Leaf length measurements collected on a linear (a) and logarithmic (b) scale yield log-normal and normal sampling distributions, respectively. Leaf length measurements collected on a linear scale can be normalized by applying a logarithmic function (inset) to each measurement. Such a scale transformation only alters the relative spacing of measurements (c); the largest leaf has the largest value on both scales.
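The relationship between the normal and log-normal distributions can be sketched with a small R simulation (the variable names are illustrative):

```r
set.seed(1)
z <- rnorm(2000, mean = 4, sd = 0.5)   # normal on the log scale
y <- exp(z)                            # log-normal: positive, right-skewed

mean(y) > median(y)                    # the long right tail pulls the mean above the median
mean(log(y))                           # back on the log scale, centred near 4
```

Exponentiating a normal variable produces exactly the zero-bounded, right-skewed shape described above, and taking logs recovers the symmetric normal shape, which is why the log transformation normalizes such data.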
Scale transformations
The purpose of a scale transformation is to normalize the data so as to satisfy the underlying assumptions of a statistical analysis. In principle, any function can be applied to the data; nevertheless, certain data types respond more favourably to certain transformations because of the characteristics of those data types. Common transformations and their R syntax are provided in Table 3.2.
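Table 3.2 is not reproduced here; as a sketch of the kinds of transformations it typically covers (our assumption, with hypothetical data), the base-R syntax looks like this:

```r
counts <- c(0, 1, 3, 7, 12, 20)            # hypothetical count data
props  <- c(0.05, 0.20, 0.50, 0.80, 0.95)  # hypothetical proportions

log(counts + 1)     # logarithmic transform (adding 1 handles zeros)
sqrt(counts)        # square-root transform, often used for counts
asin(sqrt(props))   # arcsine-square-root, often used for proportions
```

Each function is applied element-wise, so a whole column of a data frame can be transformed in one call before analysis.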