Download outline - Ohio University
1/14/13

Welcome to PBIO 3150-5150
Statistical Methods in Plant Biology (aka Biostatistics)
Spring 2013

OUTLINE
Review syllabus & introduction
Use & misuse of statistics
Statistics and biological data explained
Samples, populations, estimation
Intro to sampling design
Accuracy vs. precision
Types of variables
Frequency & probability distributions
Getting started: an example
Standard measures of central tendency
Mean, median, mode
Other means
Weighted, geometric, harmonic

Course Goals & Objectives
1. To provide you with an overview of the statistical tools and procedures required to:
   a. Evaluate the biological literature
   b. Conduct original research
      i. Design experiments
      ii. Collect & interpret data
2. To familiarize you with the state-of-the-art software (R) used to conduct, analyse, and report scientific data.
   a. ... and manipulation
   b. Graphics & presentation
   c. Statistics

Statistical Pedagogy
What literature exists on the subject suggests that applied statistics is learned best when:
a) The material is subject specific (so we will use only biological examples).
b) Students have the opportunity to work through the material in a step-by-step fashion (so you will do coded examples with R, as well as exams).
c) There is a practical element to learning (i.e., in practice you will only ever do statistics on computers, so we will use computers for everything, including exams).
This course has been designed entirely around these core principles!

Personal Goals
1. To provide you with the best possible overview of descriptive and experimental statistics within the confines of a 10-wk academic quarter.
2. To heighten your awareness and critical thinking with respect to statistical designs and biological questions, and ultimately to help you do better research.

Hmm… is this good statistics?

Use of Statistics
To what extent has the importance of statistics in the biological sciences changed over the last 100 years?
A survey was conducted examining 11 decennial volumes of The American Naturalist. This journal has wide coverage and is presumably a good indicator.
96% (Sokal and Rohlf 1995).

Why Such an Increase in the Use of Statistics in Biology?
• Realization that most biological systems are not deterministic but rather probabilistic.
• Statistical thinking parallels ordinary scientific thinking.
• We wish to quantify observations.
• We express phenomena as a statement of probability rather than as a vague general statement.

The Future…
• The use of quantitative data and major mathematical models will only continue to increase (in all sub-disciplines of biology).
• The R programming language is being increasingly used towards this end (and is therefore widely incorporated into this course).
• We have many of the major biological patterns described, but because of the variability inherent in the natural world we do not yet understand many of the underlying processes. This will require increasingly specialized quantitative skills.
• Biology is not as straightforward as physics or math: the rules are different!

Misuse of Statistics
Statistics have frequently been used to hide or obfuscate important information (usually where economic or political gain was at stake). This led to the well-known quote attributed to British Prime Minister Disraeli: there are three forms of falsehood in the universe: lies, damned lies, and statistics.

Misuse of Statistics
- Example -
U.S. Economy Post-Depression
Two graphs, same data, two diametrically opposed conclusions!
Source: Huff (1954), How to Lie With Statistics

Don't Underestimate Incompetence
Important Take-Home Point: A statistical test is only as good as the data it is supposed to test!
Virtually any experiment can yield data, sophisticated statistics can be employed, fanciful computer software applied, and erudite conclusions drawn… but of what biological relevance?

What is Statistics?
• Statistics is a technology that describes and measures aspects of nature from samples.
• Statistics allows us to quantify the uncertainty of these measurements (i.e., what is their departure from the truth?).
• Statistics is about estimation, the process of inferring an unknown quantity of a target population using sample data.

Statistics
• A population is all the individuals of interest (what we are trying to describe).
• A sample is the subset of observations that we select from the population to describe it.
• Parameters are quantities describing the population (unknown most of the time).
• Estimates (or statistics) are the measures used to approximate the parameters.

Observations, Samples, & Populations
[Diagram: observations taken from the Population (µ, σ) make up the Sample (X̄, s); Statistics are computed from the sample, and Inference (Estimation) leads from the sample back to the population.]

Good Samples
• Obviously then, the sample completely controls our view of the population.
• Chance alone influences sampling error (the difference between estimates and parameters).
• We need to collect a sample that is both accurate and precise.
• Bias is another form of error. It is a systematic discrepancy between estimates and parameters.
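The difference between parameters and estimates can be sketched in R. This is only an illustration: the simulated population, its size, its mean and standard deviation, and the seed are all assumed values, not data from the course.

```r
# Hypothetical population: heights (cm) of 10,000 seedlings (assumed values)
set.seed(42)
population <- rnorm(10000, mean = 5, sd = 1.5)

# Parameters describe the whole population (normally unknown in practice).
# sd() divides by n - 1; with 10,000 values the difference is negligible.
mu    <- mean(population)
sigma <- sd(population)

# A simple random sample of 12 individuals, drawn without replacement
smp <- sample(population, size = 12)

# Estimates (statistics) approximate the parameters
y_bar <- mean(smp)
s     <- sd(smp)

# Sampling error: the chance difference between estimate and parameter
sampling_error <- y_bar - mu
print(c(mu = mu, y_bar = y_bar, sampling_error = sampling_error))
```

Re-running the sketch with a different seed gives a different sampling error each time, which is the sense in which chance alone drives it.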

Random Sampling
• In order for a sample to be random, two criteria must be met:
  – Each member of the population has an equal chance of being part of the sample, and
  – Each observation is independent of every other observation.
• Random sampling does two things:
  – Minimizes bias
  – Permits measurement of sampling error

How to Take a Random Sample

Beware!
• Be vigilant as to how the sample is collected.
• Samples must be random and representative.
• Avoid the sample of convenience (individuals that are easily available to the researcher), which is invariably biased.

Variables & Characters
Variable: The characteristics that differ among individuals. The actual property measured on the individuals selected for the sample. Most general term commonly used in biological statistics.
Character: Synonym for variable. Used most by evolutionary biologists & systematists.

Data Structure
Univariate: One variable is measured per observation
Bivariate: Two variables are measured per observation
Multivariate: Three or more variables are measured per observation

Types of Variables
1. Measurement (Numerical) Variables
   a. Continuous variables
   b. Discontinuous variables
2. Ranked Variables
3. Categorical (Attribute) Variables

Measurement Variables
Those whose differing states can be expressed in a numerically ordered fashion.
Continuous variables are those that have a theoretically infinite number of finer gradations between any two points (e.g., length, mass, size).
Discontinuous (a.k.a. meristic) variables are those with certain fixed discrete values with no intermediates possible (e.g., number of leaves or teeth).

Ranked & Attribute Variables
Ranked variables are those that cannot be measured, but can be ordered (e.g., rank order of pupa emergence or seed germination: 1, 2, 3, etc.).
Attribute variables (also known as categorical or nominal variables) are those that cannot be measured, but can be scored for certain criteria (e.g., individual dead/alive, or flower color red/white/pink).

Frequency Distributions
• Different individuals (observations) in a sample will have different measurements.
• This variability is most easily seen with a frequency distribution (histogram).

Probability Distributions
• The frequency distribution describes the number of times each value occurs in a sample.
• The distribution of a variable in the whole population is called the probability distribution.
• For a continuous measurement variable, the distribution is usually approximated by the theoretical distribution known as the normal distribution.

Normal Distribution
The Probability Density Function
The probability distribution can also be looked at as a density function. We can use calculus to describe the areas under each part of the curve.
[Figure: probability density f(y) plotted against y (time), with y running from 1 to 16.]
What proportion of measurements exists between time intervals 4 & 6?

Probability Density Function
• Our knowledge of the PDF for a standard normal curve (SNC) permits much of the inference we desire by allowing us to assign levels of uncertainty (probability) to our sample estimates.
• We will describe the detailed properties of the SNC and hypothesis testing in a subsequent lecture.

Getting Started
• Circumscribe the population
• Collect a sample of observations
• Measure one or more variables
• Now what?
Key to Success:

Frequency Histogram
- Example -
• From a population of oak seedlings in a tree nursery
• Collect a sample of N = 12 observations
• The variable of interest is the height of seedlings (cm)
• Record the frequency of occurrences of heights by category & construct a histogram…

Frequency Histogram
- Example -
Sample: N = 12 seedlings

Height category (rounded to nearest cm):   2   3   4   5   6   7   8
Frequency count by group:                  1   1   2   4   2   1   1

[Figure: frequency histogram of the seedling sample; x-axis: Bin (2 to 8 cm), y-axis: Frequency (0 to 5).]

Frequency Histogram
- Distributions -
Sample distributions may take a variety of forms, some of which include:
• NORMAL (seedling example)
• SKEWED
• BIMODAL
We'll discuss these at length later.
[Figure: three example histograms, one for each form.]

Plot The Data!
IMPORTANT: Failure to plot the data represents the single most routine error that biometricians encounter amongst students & professionals analysing data!
There is no better way to understand the structure & distribution of your data.

Describing the Distribution
After visualizing your data (in this case producing a histogram), you can begin the process of estimation.
Now, assuming your data are normally distributed (later we will discuss the ramifications of this and what to do when this is not the case), you need a method to describe the center of the distribution:
Central tendency (location of peak)

Measures of Central Tendency
The standard measures of central tendency are:
MODE: the most frequent observation (from French, la mode, "most fashionable")
MEDIAN: the observation in the middle when the observations are rank ordered; 50% above & 50% below
MEAN: the sum of all members of a sample divided by the sample size, N.

Averages
CAUTION: The word "average" is NOT a term rooted in the statistical sciences and should arguably not be part of your scientific vocabulary.
The mean, median, and mode are ALL averages!
The word "average" is a synonym for "central tendency".
The most commonly employed average in the biological sciences is the arithmetic mean.

The Symbology of Statistics
$Y$: Each observation is referred to as a variate, Y (or X, depending upon source).
$\sum Y$: The Greek letter sigma is used as shorthand to denote "the sum of".
$\sum_{i=1}^{N} Y_i$: i is an iterator; for N = 10 the variates are $Y_1, Y_2, \ldots, Y_{10}$. This syntax is read as "the sum of the $Y_i$'s from i = 1 to N".

Mean
We can now define the arithmetic mean using our new statistical lexicon:

$$\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N}$$

Which is a lot easier than saying the mean (Y-bar) is equal to the sum of the variates, Y, from i = 1 to N, divided by N.

Calculating the Mean
- Turning Symbols into Numbers -
Returning to the data from our frequency histogram:

$$\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N} = \frac{2 + 3 + 4 + 4 + 5 + 5 + 5 + 5 + 6 + 6 + 7 + 8}{12} = 5$$

Example
- Measures of Central Tendency -
Again, using the data from the histogram example:
Mean = 5
Median = 5
Mode = 5
[Figure: frequency histogram of the seedling sample; x-axis: Bin (2 to 8), y-axis: Frequency (0 to 5).]
This equality of averages is one characteristic of a bell-shaped or normal distribution.
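The seedling calculation above can be reproduced in R. This is a minimal sketch: the heights are the twelve values from the mean calculation, while `stat_mode` is a hypothetical helper written for this illustration, since base R has no function that returns the statistical mode.

```r
# Seedling heights (cm) from the frequency histogram example
heights <- c(2, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8)

# Frequency count for each height category
freq <- table(heights)
print(freq)

# Histogram with one bin per whole-centimetre category
hist(heights, breaks = seq(1.5, 8.5, by = 1),
     xlab = "Height (cm)", main = "Oak seedling heights")

# Mean and median
y_bar <- mean(heights)     # same as sum(heights) / length(heights)
y_med <- median(heights)

# Hypothetical helper: the most frequent value(s) in y
stat_mode <- function(y) {
  counts <- table(y)
  as.numeric(names(counts)[counts == max(counts)])
}
y_mode <- stat_mode(heights)

print(c(mean = y_bar, median = y_med, mode = y_mode))
```

All three measures come out at 5 for these data, matching the slide. The helper returns every tied value if a sample has more than one mode.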

Example 3.1 Glide Snakes (Using R)

    > Hertz <- c(0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6)
    > Hertz
    [1] 0.9 1.4 1.2 1.2 1.3 2.0 1.4 1.6
    > hist(Hertz, col = "red")
    > mean(Hertz)
    [1] 1.375
    > median(Hertz)
    [1] 1.35
    > mode(Hertz)
    [1] "numeric"

Note that R's mode() reports the storage mode of the object ("numeric"), not the statistical mode of the data.

Types of Means
• Arithmetic Mean (what we have done so far)
• Weighted Mean
• Geometric Mean
• Harmonic Mean

Weighted Mean

$$\bar{Y}_w = \frac{\sum_{i=1}^{N} w_i Y_i}{\sum_{i=1}^{N} w_i}$$

The larger N is, the more reliable the mean becomes as an estimator of central tendency.
If you have two or more samples of markedly different N that you want to combine to find a grand mean, you need to adjust for the Ns using a weighting factor (w).

Weighted Mean
- Example -
Suppose you are interested in the mean height of dogwood trees in Ohio's forests. You go to 3 stands, set out a 500 m² plot in each area, and record the heights of all dogwoods present, thus:

Stand   Mean (m)   N
1       3.85       12
2       5.21       25
3       4.70        8

Your initial instinct might be to take the mean of the arithmetic means. No, don't do it!

Weighted Mean
- Example -
Here each stand mean is a variate $Y_i$ and its sample size is the weight $w_i$:

$$\bar{Y}_w = \frac{\sum_{i=1}^{N} w_i Y_i}{\sum_{i=1}^{N} w_i} = \frac{(12)(3.85) + (25)(5.21) + (8)(4.70)}{12 + 25 + 8} = 4.76$$

Thus, the mean height of Ohio dogwoods is 4.76 m.

Weighted Mean
- Example -
Notes:
1. The result of 4.76 is the same as if you had taken an arithmetic mean approach, adding all of the original variates together and dividing by N = 45 (as if they were all one sample).
2. Had you taken the arithmetic mean of the three separate means, you would have obtained an incorrect result of 4.59 (confirm this for yourself).
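The dogwood example can be checked in R, including the "confirm this for yourself" step. This is a sketch under the assumption that each stand's sample size is used as its weight; the variable names are illustrative only.

```r
# Stand means (m) and sample sizes from the dogwood example
stand_mean <- c(3.85, 5.21, 4.70)
stand_n    <- c(12, 25, 8)

# Weighted mean written out from the formula: sum(w * Y) / sum(w)
by_formula <- sum(stand_n * stand_mean) / sum(stand_n)

# The same quantity from the built-in stats function
by_builtin <- weighted.mean(stand_mean, w = stand_n)

# Unweighted mean of the three stand means (the approach to avoid)
unweighted <- mean(stand_mean)

print(round(c(formula = by_formula, builtin = by_builtin,
              unweighted = unweighted), 2))
```

The two weighted results agree at 4.76 m, while the unweighted mean of means is 4.59 m because it gives the 8-tree stand as much influence as the 25-tree stand.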

Geometric Mean
Suppose you transformed your original variates from a linear to a log10 scale prior to calculating the mean (we will discuss why you might wish to do this in a subsequent lecture).
If you calculate the mean of these transformed values and then back-transform the mean to a linear scale, this value will differ from the arithmetic mean of the original variates.
The back-transformed mean of a logarithmically transformed variable is called the geometric mean.

Geometric Mean

$$GM_Y = \sqrt[N]{\prod_{i=1}^{N} Y_i}$$

The capital pi is read "the product of", just as the sigma is read "the sum of".
The geometric mean is equal to the Nth root of the product of the Ys from i = 1 to N.

Geometric Mean
- Example -
Suppose you had a data set: 2, 3, 3, 4, 15 (N = 5) and wanted to know the central tendency.
The straight-up arithmetic mean of these observations would be 5.4 (and incorrect).
These values would be better log10 transformed first, then "averaged". Thus, the data become: 0.301, 0.477, 0.477, 0.602, 1.176.
The arithmetic mean of the logs is 0.607, which when back-transformed ($10^{0.607}$) gives 4.043 (not 5.4!).

Geometric Mean
- Example -
Alternatively:

$$\sqrt[5]{(2)(3)(3)(4)(15)} = 4.043$$

NB: this is the same result as if we had back-transformed the mean of the logs.
In this case, the geometric mean is the preferred mean to report rather than the arithmetic mean. Many people feel uncomfortable that this is not the logical or best mean to report. I will try to allay these fears in a subsequent lecture on data transformations.

Harmonic Mean
Suppose that the transformation of choice is not log10 but rather the reciprocal (1/Y); the mean of choice would then be the harmonic mean:

$$\frac{1}{H_Y} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{Y_i}$$

Harmonic Mean
- Example -
Using the same data set as for the GM, the sum of the reciprocals divided by N is 0.297. Therefore $1/H_Y = 0.297$, and the harmonic mean is $H_Y = 1/0.297 = 3.37$.
Recall arithmetic mean = 5.4.
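Both worked examples can be verified with a few lines of R. This is a minimal sketch using only base functions; the variable names are illustrative, and the geometric mean is computed two ways to show that the routes agree.

```r
# Data set from the geometric and harmonic mean examples
y <- c(2, 3, 3, 4, 15)

# Arithmetic mean of the raw variates
arith <- mean(y)

# Geometric mean, route 1: back-transform the mean of the log10 values
gm_logs <- 10^mean(log10(y))

# Geometric mean, route 2: Nth root of the product of the variates
gm_root <- prod(y)^(1 / length(y))

# Harmonic mean: reciprocal of the mean of the reciprocals
hm <- 1 / mean(1 / y)

print(round(c(arithmetic = arith, geometric = gm_logs, harmonic = hm), 3))
```

The geometric mean comes out near 4.04 by either route and the harmonic mean near 3.37, both below the arithmetic mean of 5.4, which is pulled upward by the single large value of 15. For long or large-valued data the log route is the safer one, since the product in route 2 can overflow.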