Introduction to Biostatistical Analysis
Practical statistics course for first-year PhD students

Session 1: 7/11/2012 Lecture: Basic concepts
Session 2: 8/11/2012 Lecture: Introduction to statistical hypothesis testing
Session 3: 9/11/2012 Lecture: Analysis of Variance
Session 4: 14/11/2012 Lecture: Regression
Session 5: 16/11/2012 Lecture: Synthesis and applications

Lecturer: Lorenzo Marini, DAFNAE, University of Padova
E-mail: [email protected], Tel.: +39 0498272807
http://www.biodiversity-lorenzomarini.eu/

5 sessions: Lecture (theory) + Practical (use of R)
Assessment (at least three sessions attended): analysis of an assigned data set and writing a report.
Your report should consist of the following 2 items:
1. A 1-page R script fully documenting your analysis.
2. A Word document presenting the aims, the method of analysis, the results, and an interpretation.

Statistics: Definition
STATISTICS
- Statistical methods can be used to summarize or describe a collection of data; this is called descriptive statistics.
- In addition, patterns in the data may be modeled from samples and then used to draw inferences about the process or population being studied; this is called inferential statistics.
Statistics = techniques of
• Collecting
• Analysing
• Drawing conclusions from DATA
“A mode of thought” that will change the way you do science

Statistics: Population and samples
Population: a set of entities of interest (e.g. PhD students, farms, bees, fishes, dogs…)
Sample: a subset of entities randomly drawn from the population
[Figure: samples drawn from a population over a spatial extent]
Statistics answers RESEARCH questions using samples

Statistics: Population and samples
Example: population of PhD students
Question: do female students perform better in stats than male students?
Spatial extent? How to draw samples?
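One possible answer to "How to draw samples?" is to pick students at random within each sex. The sketch below is only an illustration: the population of 200 students, the column names `id` and `sex`, and the choice of 10 students per sex are all assumptions made for the example, not part of the course material.

```r
set.seed(2012)   # arbitrary seed, for reproducibility only

# Hypothetical population: 200 PhD students, each with an ID and a sex
population <- data.frame(
  id  = 1:200,
  sex = rep(c("F", "M"), each = 100)
)

# Draw 10 students per sex at random, without replacement
picked <- unlist(lapply(split(population$id, population$sex), sample, size = 10))
smp <- population[population$id %in% picked, ]

table(smp$sex)   # number of sampled students per sex
```

Sampling within each sex guarantees that both groups are represented, which matters when the research question compares them.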
[Figure: population of male (♂) and female (♀) students]
Statistics answers RESEARCH questions using samples

Inferential statistics: logic
Statistical testing in five steps:
1. Construct a null hypothesis (H0) (RESEARCH QUESTION)
   E.g. Question: do female students perform better in stats than male students?
2. Choose a statistical analysis
   E.g. t-test to detect a difference between males and females
3. Collect the data (sampling)
   E.g. sample male and female students
4. Calculate the test statistic and the P-value
   E.g. perform the t-test
5. Reject H0 if P is small; do not reject H0 if P is large (ANSWER THE QUESTION)
Common error: sampling before (1) constructing the hypothesis and (2) choosing the statistical analysis

Strong vs. weak inference
Weak inference (observational)
1. Many hypotheses
2. Correlational
SEVERAL alternative hypotheses for explaining the process
Strong inference (experimental)
1. A clear hypothesis
2. A specified test
ONE clear testable hypothesis
What is a hypothesis?
• A statement that is testable
• Testable: can be falsified (K. Popper)

KEEP IN MIND!
1. The clearer the question, the easier the sampling and the simpler the statistics to be applied
2. If you have blurry, foggy questions and a bad design, you will have to apply complex analyses
Where are you?
Without statistics you cannot write scientific papers (almost…)

1. Construct hypotheses: types of variables
Which is/are your response variable/s?
Object of your research (e.g. animal fitness, yield, healing time, cow milk production…)
Which is/are your explanatory variable/s?
Variables for explaining your response variable (e.g. age, risk factors, temperature, hormones, fertilization, diet...)
Which are your research questions (hypotheses)?
HYPOTHESIS: a statement that can be falsified

1. Construct hypotheses
Response variable
- Continuous: weight, height, length, temperature, concentration
- Count: number of individuals, days, cells; zero is a common value
- Proportion: percentage mortality, infection rate, proportion responding to a treatment, percent leaf area eaten
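The five testing steps listed earlier can be sketched in R. This is a minimal sketch, assuming the 12 student scores from the course's Gender/Score example table (six males, six females), a Welch two-sample t-test (the default of `t.test`), and a significance level of 0.05. The object names `scores` and `tt` are illustrative only.

```r
# Step 1: H0 = the mean score does not differ between female and male students.
# Step 2: chosen analysis = two-sample t-test (Welch version, R's default).
# Step 3: the data (12 scores from the course example; object names are illustrative).
scores <- data.frame(
  gender = factor(rep(c("Male", "Female"), each = 6)),
  score  = c(10, 9, 5, 4, 3, 4,   # males
             8, 9, 6, 7, 8, 5)    # females
)

# Step 4: compute the test statistic and the P-value.
tt <- t.test(score ~ gender, data = scores)
print(tt)

# Step 5: compare the P-value with the significance level (0.05 assumed here).
if (tt$p.value < 0.05) {
  message("Reject H0: the mean scores differ")
} else {
  message("Do not reject H0: no evidence of a difference")
}
```

Note that the hypothesis and the test are fixed in the comments before any data are entered, which mirrors the order of the five steps.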
1. Construct hypotheses
Explanatory variable
- Categorical: species, clone, genotype, treatment, diet, growth chamber, breed… Each categorical factor has two or more levels.
  [Figure: box-plot of the response for factor levels A to H]
- Continuous: weight, height, length, temperature
  [Figure: scatter plot of the response against log(factor)]

1. Construct hypotheses: prepare the data
Each column is one variable

Explanatory 1: Age   Explanatory 2: Gender   Response: Score
26                   Male                    10
24                   Male                    9
26                   Male                    5
28                   Male                    4
32                   Male                    3
24                   Female                  4
26                   Female                  8
28                   Female                  9
28                   Female                  6

Do not confuse the levels of a categorical factor with the factor itself.
E.g. male and female are not factors: they are the two levels of the factor Gender!

2. Choose a statistical analysis
Univariate analysis
- One response variable (Y) (e.g. Y = hormone concentration)
- One or more explanatory variables (Xi) (e.g. age, size, breed…)
Multivariate analysis
- More than one response variable (Y1, Y2, Y3, Y4, Y5, …) (e.g. Yi = species composition)
- One or more explanatory variables (Xi) (e.g. age, training, breed, sex)

2. Choose a statistical analysis
[Figure: number of response variables: one (univariate: continuous, count, or proportion, each with its distribution) or more (multivariate)]

2. Choose a statistical analysis
Response variable and its distribution:
- Continuous (normal)
- Count (normal or Poisson)
- Proportion (normal or binomial)
Explanatory variables and statistical analyses:
- Continuous: e.g. regression
- Categorical: e.g. ANOVA
- Categorical + continuous: e.g. ANCOVA, GLMs

2. Choose a statistical analysis
Parametric statistics
The population is assumed to follow a parameterized probability distribution.
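To illustrate how the type of explanatory variable maps onto the analysis, the sketch below enters the Age/Gender/Score table as a data frame, one column per variable, and fits the three model types named earlier. The object names (`d`, `m_reg`, `m_aov`, `m_anc`) are illustrative, and nine rows are far too few for real inference; the point is only the model formulas.

```r
# One column per variable, one row per observation (values from the table).
d <- data.frame(
  Age    = c(26, 24, 26, 28, 32, 24, 26, 28, 28),
  Gender = factor(c(rep("Male", 5), rep("Female", 4))),
  Score  = c(10, 9, 5, 4, 3, 4, 8, 9, 6)
)
str(d)             # Gender is ONE factor...
levels(d$Gender)   # ...with two levels

# Continuous explanatory variable -> regression
m_reg <- lm(Score ~ Age, data = d)

# Categorical explanatory variable -> ANOVA
m_aov <- aov(Score ~ Gender, data = d)

# Categorical + continuous explanatory variables -> ANCOVA
m_anc <- lm(Score ~ Age + Gender, data = d)

summary(m_reg)
summary(m_aov)
anova(m_anc)
```

The response stays on the left of `~` in every formula; only the explanatory terms on the right change, and with them the name of the analysis.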
A probability distribution describes the values that a random variable can take and their probabilities
- Normal
- Poisson
- Gamma
- Binomial…
General Linear Models (ANOVA, regression, ANCOVA)
Generalized Linear Models (GLMs)
NB: the distribution depends on the nature of your response variable
Non-parametric statistics
Nonparametric methods are often referred to as ‘distribution-free’ methods, as they do not rely on the assumption that the data are drawn from a given probability distribution.

2. Choose a statistical analysis
Statistical analysis
- Assumptions: each analysis (even a nonparametric one) requires some background assumptions
- Sampling design: each analysis requires an appropriate sampling design
If both these conditions are met:
5. We can reject or fail to reject our hypotheses

3. Collect the data (sampling)
[Figure: samples drawn from a population]
SAMPLING: key step in any research
Sampling is that part of statistical practice concerned with the selection of individual observations intended to yield some knowledge about a population of concern
Key concepts:
- Randomization
- Replication (no pseudo-replication!!!)
- Independence

3. Collect the data: randomization
Population (N), sample (n < N): units 1, 2, 3, …, n drawn without replacement
A simple random sample is selected so that every possible sample has an equal chance of being drawn from the population
Copy and paste in R
    ### Draw randomly one student among you
    students <- seq(1, 30)   ## assign an ID to each student
    students                 ## show the ID numbers
    sample(students, 1)      ## select one of the 30 students at random
    hist(replicate(1000, sample(students, 1)), breaks = 8)   ## distribution of 1000 random draws

3. Collect the data: sufficient replication!
True replication vs. pseudoreplication
Replication means having replicate observations at a spatial and temporal scale that matches the application of the experimental treatments
True replicates must be independent
[Figure: n replicates, degrees of freedom, P]
Replicates MUST NOT:
- Come from a time series
- Be grouped in space
- Be repeated measures on the same individuals (but it depends: wait for mixed models!!!)

3. Collect the data: independence
[Figure: population with 4 replicates (random sampling) and 4 replicates (independence)]
Independence: no meaningful relation between the sampling units
3 MAIN PROBLEMS IN BIOSTATISTICAL ANALYSES
1. Spatial dependence (e.g. spatial autocorrelation)
2. Temporal dependence (e.g. repeated measures)
3. Biological dependence (e.g. siblings)

3. Collect the data: independence

ID   Gender   Score
1    Male     10
2    Male     9
3    Male     5
4    Male     4
5    Male     3
6    Male     4
7    Female   8
8    Female   9
9    Female   6
10   Female   7
11   Female   8
12   Female   5

Each row is one observation or measurement
MAIN QUESTION: are your observations true replicates?
Do not confuse observations with replicates!

3. Collect the data: spatial dependence
Which is your replicate and which is the right scale?
- Birds (♂, ♀): 10 birds per sex, 15 feathers per bird, 300 measurements
- Rats: 18 rats, 18 livers, 3 samples per liver, 2 analyses per tissue sample, 108 measurements

3. Collect the data: spatial dependence
Which is your replicate and which is the right scale?
Response variable: feather length
Explanatory variable: sex (♀ and ♂)
15 measures per bird

3. Collect the data: spatial dependence
Which is your replicate and which is the right scale? E.g. ANOVA

> Bird_level <- aov(feather ~ sex)
            df    SS        MS      F value   P
sex           1     3.4279   3.42    5.887    0.025 *
Residuals    18    10.4813   0.58

> Feather_level <- aov(feather ~ sex)
            df    SS        MS      F value   P
sex           1    51.41    51.41   63.19     3.9e-14 ***
Residuals   298   242.48     0.81

Pseudo-replication!!! Do you have a solution?

3. Collect the data: temporal dependence
Temporal dependence: time series (not covered)
When we measure experimental units repeatedly over time, we are collecting longitudinal data, also called repeated measures data.
Time 1: measure 1; Time 2: measure 2; Time 3: measure 3; Time 4: measure 4

3. Collect the data: biological dependence
Biological dependence
- Unknown genetic relations between sampled units
- Individuals belonging to the same litter or brood
[Figure: litters A and B assigned to drugs A and B; biased sampling (factor + noise) vs. proper sampling (2 litter + diet)]

3. Collect the data (sampling)
If temporal or spatial dependence exists
SIMPLEST SOLUTION
Have an appropriate number of replicates at the proper scale [P.A. Murtaugh 2007. Simplicity and complexity in ecological data analysis. Ecology, 88, 56–62]
ALTERNATIVE SOLUTION
Wait for mixed models (they can deal with non-independent data)
Agricultural research: easy to control possible confounding factors
Ecological research: more difficult to apply traditional models. Modern mixed models with REML estimation (extremely dangerous analyses!!!)

3. Collect the data (sampling)
Examples of ‘bad’ designs
- No replication (you cannot use statistics)
- Clumped segregation (totally uninformative)
- Isolative segregation (growth chambers etc.)
- Systematic (problem: periodic variations)
Examples of ‘good’ designs
- Randomized block (paired or block design)
- Completely randomized (if enough time, space, money)

3. Collect the data (sampling): one example
Factor: irrigation
Response: maize yield
Step i: identify your sampling unit
Step ii: identify your replicate and the sample size
Step iii: decide the spatial distribution
Step iv: one or repeated measurements?
[Figure: arable field with a ditch]

3. Collect the data (sampling)
‘Good’ designs (multifactorial ANOVA)
Latin square (in case of two gradients):
    A C D B
    D A B C
    B D C A
    C B A D
Split-plot (very common!): e.g. agronomic trials, greenhouses, Petri dishes
Nested (especially in medicine): e.g. liver samples from individual rats

3. Collect the data (sampling)
Before sampling you MUST know which is your REPLICATE and the right SCALE of your study!!!
Sampling with appropriate replication!!!

3. Collect the data (sampling)
- Manipulative experiments
- Natural experiments
- Observational studies
If we know exactly which analysis we will use, the choice of the sampling design is straightforward
Spatial patterns in the samples: if you don’t work with experiments, you should always consider the spatial patterns in your sampling design

Basic concepts: mean and variance
[Figure: distribution of body size]
MEAN AND VARIANCE
Whereas the mean is a way to describe the location of a distribution, the variance is a way to capture its scale or degree of being spread out
$\text{mean} = \bar{y} = \frac{\sum_i y_i}{n}$
$\text{deviance (SS)} = \sum_i (y_i - \bar{y})^2$
$\text{var} = \frac{\sum_i (y_i - \bar{y})^2}{n - 1}$
$SD = \sqrt{\frac{\sum_i (y_i - \bar{y})^2}{n - 1}}$

Basic concepts: Residuals
RESIDUALS
$\text{residual}_i = y_i - \bar{y}$
A residual is an observable estimate of the unobservable statistical error. The simplest case involves a random sample of n men whose heights are measured. The difference between the height of each man in the sample and the observable sample average is a residual.
Residuals represent what we cannot explain

Basic concepts: Uncertainty
We normally compute means using samples. It is not feasible to measure all the individuals in a population.
[Figure: histogram of the milk production of single cows (x-axis: milk production, y-axis: frequency); whole population mean = 40, sample means = 42 and 39.5]
As we work with samples, there is always a degree of UNCERTAINTY in the estimation.
How can we reduce the degree of uncertainty?

Basic concepts: Law of Large Numbers
As the sample size (n) grows, the sample mean approaches the population mean
When is a sample large enough?
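One way to explore that question with base R alone is to track the running mean of a growing sample. The sketch below assumes a standard normal population (mean 0, SD 1); the seed and the maximum sample size of 1000 are arbitrary choices for illustration.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
n <- 1000                         # largest sample size considered (assumed)
y <- rnorm(n, mean = 0, sd = 1)   # draws from a population with mean 0, SD 1

# Sample mean after 1, 2, ..., n observations
running_mean <- cumsum(y) / seq_len(n)

plot(seq_len(n), running_mean, type = "l",
     xlab = "Sample size n", ylab = "Sample mean")
abline(h = 0, lty = 2)            # population mean
```

The line wanders widely for small n and settles close to the dashed population mean as n grows; how close is "close enough" depends on the precision the study needs.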
[Figure: sample mean against sample size n (up to n = 100) for a population with mean = 0 and SD = 1]
Run the simulation to answer
    library(animation)
    ani.options(ani.height = 480, ani.width = 600, outdir = getwd(), nmax = 100,
                interval = 0.1, title = "Demonstration of the Law of Large Numbers",
                description = "The sample mean approaches to the population mean as the sample size n grows.")
    ani.start()
    par(mar = c(3, 3, 1, 0.5), mgp = c(1.5, 0.5, 0))
    lln.ani(FUN = rnorm, mu = 0, np = 50, pch = 20, col.poly = "grey")
    ani.stop()

Basic concepts: Uncertainty
Standard error (SE)
$SE = \frac{SD}{\sqrt{n}}$
SE is simply the SD of the sampling distribution of a specific statistic. E.g. SE of the mean
Confidence intervals (CI)
A CI is an interval estimate of a population parameter. How likely the interval is to contain the parameter is determined by the confidence level (95%)
t distribution (n < 30)
$CI_{0.975} = \text{mean} + t_{0.975,df} \times SE$
$CI_{0.025} = \text{mean} + t_{0.025,df} \times SE$
Normal distribution (n > 30)
$CI_{0.975} = \text{mean} + z_{0.975} \times SE$
$CI_{0.025} = \text{mean} + z_{0.025} \times SE$
(The 0.025 quantiles are negative, so $CI_{0.025}$ is the lower limit.)

Basic concepts: Degrees of freedom
DEGREES OF FREEDOM
The number of INDEPENDENT measurements minus the number of parameters estimated from the data
The df do not correspond to the total number of single measurements but must be computed on our replicates
The number of df defines the scale of our analyses!!!
Look at the error df to spot pseudo-replication

Basic concepts: Why distributions?
We use distributions in statistics mainly in 2 ways:
1. To model our response variable we need to know its distribution (assumptions)
2. Once we know the distribution we can run a statistical test
There are loads of tests (F test, t test etc.). No matter what test we do, the principle is always the same:
-> Calculate a test statistic (e.g. z, t, F) which follows a DEFINED DISTRIBUTION
-> Look up the critical value related to our level of significance
-> Compare the calculated value with the critical value
-> We can then associate a PROBABILITY to our decision

Basic concepts: Normal distribution
Standardized normal (mean = 0, sd = 1)
$z = \frac{y - \text{mean}}{sd}$
$f(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$
E.g. Suppose we have measured the heights of 100 people. The mean height was 170 cm and the sd was 8 cm. What is the probability that a person is
• shorter than a particular height?
• taller than a particular height?
• between one specified height and another?

Basic concepts: Poisson distribution
The Poisson distribution describes counts of events that are individually unlikely but have a very large number of opportunities to happen (count data)
- Non-negative integer values
- Variance = mean
- Right skewed
1 parameter: λ (mean = variance)
Use: count data
[Figure: sample from a Poisson distribution (n = 1000, mean = variance = 0.2)]

Basic concepts: Binomial distribution
The binomial distribution describes the number of successes in a finite series of independent Yes/No experiments.
2 parameters: sample size, probability
Use: proportion data and power analysis

What’s R?
R is a system for statistical computation and graphics. It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files.
R has a home page at http://www.R-project.org/. It is free software distributed under a GNU-style copyleft, and an official part of the GNU project (“GNU S”).

Why R?
The benefits of R are:
+ R is free. R is open-source and runs on UNIX, Windows and Macintosh
+ R has an excellent built-in help system
+ R has excellent graphing capabilities
+ Students can easily migrate to the commercially supported S-Plus program if commercial software is desired
+ R's language has a powerful, easy to learn syntax with many built-in statistical functions
+ The language is easy to extend with user-written functions
+ R is a computer programming language
What is R lacking compared to other software solutions?
- It has a limited graphical interface (S-Plus has a good one). This means it can be harder to learn at the outset.
- There is no commercial support (although one can argue the international mailing list is even better)
- The command language is a programming language, so students must learn to appreciate syntax issues etc.

Appendix 1
Box-plot explanation
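As a hedged illustration of what a box-plot displays, the sketch below draws one in R and extracts the numbers behind it with `boxplot.stats`. The simulated milk-production values (mean 40, SD 5) and the object names are made up for the example and do not come from the course data.

```r
set.seed(42)                            # arbitrary seed, for reproducibility only
milk <- rnorm(100, mean = 40, sd = 5)   # simulated data, not from the course

boxplot(milk, ylab = "Milk production")

# The five numbers drawn by the box-plot, from bottom to top:
# lower whisker, lower hinge (about the 1st quartile), median,
# upper hinge (about the 3rd quartile), upper whisker
bs <- boxplot.stats(milk)
bs$stats

# Points lying more than 1.5 times the box length beyond the hinges
# are plotted individually as potential outliers
bs$out
```

The box therefore spans roughly the middle half of the data, the thick line marks the median, and the whiskers reach the most extreme observations that are not flagged as outliers.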