Biostatistics I
Descriptive statistics and some things related to the normal distribution

Outline
- Descriptive statistics
  - Frequency distribution: relative, cumulative, histograms
  - Measures of central tendency (mean, median, mode)
  - Deviations and measures of variation
- Some thoughts on the normal distribution
  - Z-score, CV, confidence interval
  - Standard error of the mean
  - How many samples?

Population and sample 1
Population: a finite number of separate objects defined in space and time. Examples:
- All boats operating in a country's EEZ in the year 2009
- Boats of a particular type operating in a country's waters in January 2009
- The number of queen conch in a country's EEZ
Sample: a subset of a population. Usually a sample is an order of magnitude smaller than the population.

Population and sample 2
We use information from a sample to make inferences about the population. The population is unknown; the sample is known. We can only make inferences about the population from the sample if the sample is representative of the population.

Frequency distributions

Frequency distributions 1
The objective of frequency tabulation is to condense the raw data into a more useful form that allows some visual interpretation of the data. How can we make a quick summary of the data below? Let's say that the data contain length measurements of 30 fish (n = 30). We can quickly see that the smallest fish is 3.4 cm and that the largest is 15.3 cm.

Measurement number (i) and length of fish i (cm):

 i   Length     i   Length     i   Length
 1   13.1      11   13.0      21    8.1
 2   11.7      12   11.6      22    9.4
 3    9.0      13    8.7      23    5.6
 4    7.0      14   12.8      24   12.6
 5    9.9      15    7.5      25   12.4
 6    5.1      16   12.1      26    3.4
 7   11.6      17   10.8      27    4.1
 8    6.4      18   11.5      28   15.3
 9    8.0      19   10.3      29    7.3
10    8.7      20    3.4      30   10.8

Frequency distributions 2
How it's done:
- Decide on the number of classes to include in the frequency distribution.
- Find the class width: determine the range of the data, divide the range by the number of classes and round up to the next convenient number.
  Range: 15.3 cm - 3.4 cm = 11.9 cm; 11.9 cm / 7 = 1.7 cm, rounded up to 2 cm.
- Find the class limits: start with the lowest value (rounded down) as the lower limit of the first class, then add the class width to obtain the lower limit of the second class, and so on. Here there are 7 length classes: lowest class limit = 2 cm, next one 2 + 2 = 4 cm, etc.
- Count the number of fish in each length class, using pencil and paper or a computer program.

Sorted lengths (cm): 3.4, 3.4, 4.1, 5.1, 5.6, 6.4, 7.0, 7.3, 7.5, 8.0, 8.1, 8.7, 8.7, 9.0, 9.4, 9.9, 10.3, 10.8, 10.8, 11.5, 11.6, 11.6, 11.7, 12.1, 12.4, 12.6, 12.8, 13.0, 13.1, 15.3

Length class (cm)   Number of fish
 2-4                 2
 4-6                 3
 6-8                 5
 8-10                6
10-12                7
12-14                6
14-16                1
Sum                 30

Note that these counts place a value that falls exactly on a class boundary (8.0 cm) in the lower class (6-8).

Relative and cumulative frequency

Class (cm)   Frequency   Relative   Cumulative
 2-4          2          0.067      0.067
 4-6          3          0.100      0.167
 6-8          5          0.167      0.333
 8-10         6          0.200      0.533
10-12         7          0.233      0.767
12-14         6          0.200      0.967
14-16         1          0.033      1.000
Sum          30          1.000

Relative frequency is the proportion of the observations that fall within a class. Cumulative frequency is the sum of the relative frequencies of all classes up to and including the class indicated.
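The counting steps above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name `frequency_table` is made up here, and the classes are assumed to be right-closed (lower < x <= upper), because that is the convention that reproduces the counts in the table.

```python
# Lengths (cm) of the 30 fish, in measurement order.
lengths = [
    13.1, 11.7, 9.0, 7.0, 9.9, 5.1, 11.6, 6.4, 8.0, 8.7,
    13.0, 11.6, 8.7, 12.8, 7.5, 12.1, 10.8, 11.5, 10.3, 3.4,
    8.1, 9.4, 5.6, 12.6, 12.4, 3.4, 4.1, 15.3, 7.3, 10.8,
]


def frequency_table(data, lowest, width, n_classes):
    """Return one row per class: (lower, upper, count, relative, cumulative).

    Assumption: classes are right-closed (lower < x <= upper), so a fish of
    exactly 8.0 cm is counted in class 6-8, as in the table above.
    """
    n = len(data)
    rows = []
    running = 0  # cumulative count up to and including the current class
    for j in range(n_classes):
        lower = lowest + j * width
        upper = lower + width
        count = sum(1 for x in data if lower < x <= upper)
        running += count
        rows.append((lower, upper, count, count / n, running / n))
    return rows


table = frequency_table(lengths, lowest=2, width=2, n_classes=7)
for lower, upper, count, rel, cum in table:
    print(f"{lower:>2}-{upper:<2}  {count:>2}  {rel:.3f}  {cum:.3f}")
```

The cumulative column is computed from the running count rather than by adding rounded relative frequencies, so the last class ends at exactly 1.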
Various ways of displaying frequency
[Figure: four panels showing the same data against length (cm): histogram of observations (n), relative frequency (proportion), cumulative frequency (cumulative n) and relative cumulative frequency (cumulative %).]
How would one verbally describe:
1) the general characteristics of the data?
2) the different forms of presentation of the same data?

Some mathematical bookkeeping

GENERAL                                  OUR EXAMPLE
n: number of measurements                n = 30 fish
Lowest value: Xmin                       Xmin = 3.4 cm
Highest value: Xmax                      Xmax = 15.3 cm
Range: Xmax - Xmin                       Range = 15.3 - 3.4 = 11.9 cm
j: class number                          j = 1, 2, ..., 7
Class boundaries: L1, L2, ..., Lj        2, 4, ..., 16 cm
Class range: dL = Lj+1 - Lj              4 - 2 = 2 cm
Class midpoint: (Lj+1 + Lj)/2            (4 + 2)/2 = 3 cm
nj: number of fish in class j            nj = 2, 3, 5, 6, 7, 6, 1
Relative frequency: nj / n               0.067, 0.100, 0.167, ..., 0.033
Cumulative frequency: hmm ...,           0.067, 0.167, 0.333, ..., 1.000
  let's wait for that one

Number of classes?
- Generally no fewer than 5 and no more than 15.
- Depends in part on:
  - The number of observations: the more observations, the greater the number of classes.
  - The nature of the data: if the sample is a composite of many different elements we need a high number of classes. But that also means we need a lot of measurements.
Some general guidelines:
- Square root of n
- Sturges' rule: number of classes = 1 + 1.44 ln(n), so class width = (Xmax - Xmin) / (1 + 1.44 ln(n)). For our example (n = 30) this gives 1 + 1.44 × 3.40 ≈ 5.9 classes and a class width of 11.9 / 5.9 ≈ 2.0 cm.

Measure of central tendency
A value that is supposed to describe the most typical or central point of the measurements:
- Arithmetic mean
- Median
- Mode
[Figure: a frequency distribution (number of observations against value) with the positions of the mode, median and mean marked.]

The arithmetic mean
In mathematical notation:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$

n: the total number of measurements
i: the ith measurement
x_i: the value of the ith measurement

Example

Measurement    Data set 1    Data set 2
number (i)     value (x_i)   value (x_i)
1               40            40
2               20            20
3               10            10
4               30            30
5               50           100
n                5             5
Sum            150           200
Mean            30            40

Note the effect of "outliers" on the mean value. How well does the mean describe the most typical value?

The median
- Sort the measurements from the lowest to the highest (ranked).
- Find the median position: (n+1)/2 of the ordered data.
- The median value is the value of the observation in the median position.
Note: if n is an even number, the median is the average of the two central values. E.g. 10, 20, 30, 40, 50, 60: median = 35.

Example

Measurement    Ordered     Data set 1    Data set 2
number (i)     position    value (x_i)   value (x_i)
3              1            10            10
2              2            20            20
4              3            30            30
1              4            40            40
5              5            50           100
n                            5             5
Median position (n+1)/2      3             3
Median value                30            30

Note that the median is not affected by the "outlier".

The mode
- Mode = the value that occurs most often
- Not sensitive to outliers
- Problem: there may be no mode or many modes
E.g.
10, 20, 30, 30, 30, 40, 50, 60 (mode = 30)

Shapes of distributions
[Figure: three frequency distributions (left skewed, symmetrical, right skewed), each with the positions of the mode, median and mean and the tail marked.]
- Left skewed (long tail to the left): Mean < Median < Mode
- Symmetrical: Mode = Median = Mean
- Right skewed (long tail to the right): Mode < Median < Mean

Measure of variability
A value that is supposed to describe the distribution of the measurements around the central value.

Fractiles: General definitions
- Range: difference between the maximum and minimum value, Range = xmax - xmin. Sensitive to outliers.
- Quartiles: Q1, Q2 and Q3 divide a data set into four equal parts.
  - Q1: 25th percentile
  - Q2: 50th percentile = median
  - Q3: 75th percentile
  - Interquartile range = Q3 - Q1. Less sensitive to outliers.
- Percentiles: P1, P2, ..., P99 divide a data set into 100 equal parts.
- Note the relationship: Q1 = P25, Q2 = P50 = median, Q3 = P75.

Fractiles: Box and whisker plots
[Figure: box and whisker plot of the measurement values below.]

Minimum                             4
Q1 = 25th percentile              100
Q2 = 50th percentile (median)     200
Q3 = 75th percentile              400
Maximum                           600

Range = 600 - 4 = 596
Interquartile range = 400 - 100 = 300
Note: 25% of the observations are ≤ Q1, 50% ≤ Q2 and 75% ≤ Q3; 50% of the observations lie between Q1 and Q3.

Box and whisker plots and distributions
[Figure: the left skewed, symmetrical and right skewed distributions again, each with its box and whisker plot.]
Box and whisker plots give an indication of the central value (here the median), the spread of the data and the shape of the distribution.

Example of a quartile plot
Plot shows the median catch rate (CPUE) as a function of time.
Plot shows the median and the interquartile range of the catch rate as a function of time.
What additional information does the lower graph provide?

Example of a percentile plot
[Figure: proportion of observations less than a given value (0 to 1) against measurement value (0 to 45), with P90 marked.]
P90: 90% of the observations have values less than 19.

Deviations from the mean 1
In mathematical notation:

$$\text{Deviation}_i = X_i - \bar{X}$$

i: the ith measurement
n: the total number of measurements
X_i: the value of the ith measurement

Example

Measurement    Measurement    Deviation_i
number (i)     value (X_i)
1               40             10
2               20            -10
3               10            -20
4               30              0
5               50             20
n                5              5
Sum            150              0
Mean            30              0

Deviation from the mean 2
[Figure: the five observations (measurement number against value X_i) with the mean marked and each deviation drawn as a distance from the mean.]
How can we characterize the average deviation? The plain average of the deviations is always zero.

Variance & standard deviation 1
The variance:

Whole population: $$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

Sample from population: $$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$

Standard deviation: the square root of the variance:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \qquad S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$$

X_i: ith measurement of the variable X
Xbar: sample mean
μ: population mean
σ: population standard deviation
S: sample standard deviation
N: population size
n: sample size

Variance and standard deviation 2

Measurement    Measurement    Deviation       Squared deviation
number (i)     value (X_i)    X_i - Xbar      (X_i - Xbar)^2
1               40             10             100
2               20            -10             100
3               10            -20             400
4               30              0               0
5               50             20             400

n = 5
Mean: Xbar = 30
Sum of squares: SS = Σ (X_i - Xbar)^2 = 1000
Variance: S^2 = SS / (n - 1) = 250
Standard deviation: S = √(S^2) = 15.8
Relative standard deviation: CV = S / Xbar = 0.53

[Figure: the five observations with the mean marked and each deviation drawn as a distance from the mean.]
Do you think that the value of 15.8 is a reasonable measure of the average deviation in the data?
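The worked example above (deviations, sum of squares, variance, standard deviation and CV) can be retraced with a short Python sketch. The variable names are illustrative only; `statistics.stdev` from the standard library is printed as a cross-check, since it also uses the n - 1 divisor.

```python
import math
import statistics

values = [40, 20, 10, 30, 50]  # X_i from the example table

n = len(values)
mean = sum(values) / n                       # Xbar
deviations = [x - mean for x in values]      # X_i - Xbar, these sum to zero
sum_of_squares = sum(d ** 2 for d in deviations)  # SS
variance = sum_of_squares / (n - 1)          # sample variance S^2
std_dev = math.sqrt(variance)                # sample standard deviation S
cv = std_dev / mean                          # relative standard deviation

print("mean:", mean)
print("sum of deviations:", sum(deviations))
print("SS:", sum_of_squares)
print("variance:", variance)
print("standard deviation:", round(std_dev, 1))
print("CV:", round(cv, 2))
print("statistics.stdev cross-check:", round(statistics.stdev(values), 1))
```

Squaring the deviations before averaging is what keeps the positive and negative deviations from cancelling out.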
Variance and standard deviation 3
[Figure: the five observations against value (X_i), with each squared deviation from the mean drawn as a square.]

$$SS = \sum_{i=1}^{n} (X_i - \bar{X})^2 = 400 + 100 + 0 + 100 + 400 = 1000$$

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \frac{1000}{4} = 250 \qquad S = \sqrt{S^2} = \sqrt{250} = 15.8$$

Coefficient of variation

$$CV_P = \frac{S}{\bar{X}} \quad \text{or} \quad CV_\% = 100\,\frac{S}{\bar{X}}$$

- A measure of relative variation
- Expressed as a percentage (%) or as a proportion
- CV = "relative standard deviation"
- Can be higher than 100%
- Can be used to compare two or more sets of data

        Data    Data    Data    Data    Data    Data
        set 1   set 2   set 3   set 4   set 5   set 6
Xbar     50      50      50      50     100     1000
s         5      10      20      10      20      200
CVP     0.10    0.20    0.40    0.20    0.20    0.20
CV%     10%     20%     40%     20%     20%     20%

The normal distribution
The normal distributions are a very important class of statistical distributions. All normal distributions are symmetric and have bell-shaped density curves with a single peak.

Common distribution of measurements 1
[Figure: number of fish against length (mm), 20 to 80 mm, n = 7073.]
Example: 7073 length measurements of Icelandic cod larvae taken in August 2002. Since we have many fish we can use a length bin of 1 mm to generate a frequency distribution.
- Most fish fall within a certain narrow size range.
- The number of fish of a given length decreases the further one goes from the centre of the distribution.
- The distribution is close to symmetrical.

Common distribution of measurements 2
Let's make a rough eyeball drawing through the points.
[Figure: the same frequency distribution (number of fish against length, mm), shown as points and with a red line drawn by eye through them.]
Can we describe this red line mathematically?
Normal distribution

$$n_{L_i} = n \, dL_i \, \frac{1}{s\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{X_i - \bar{X}}{s}\right)^2}$$

n_Li: number in length class L_i
dL_i: width of the length interval
[Figure: numbers against length (mm), 20 to 80 mm, with the normal curve drawn through the observations.]

The normal distribution

$$pdf = \frac{1}{s\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{X_i - \bar{X}}{s}\right)^2}$$

pdf: probability density function
X_i: the measured variable (here length of fish)
Xbar: the mean
s: the standard deviation
The model that describes the normal distribution looks complex at first sight ...

What matters?
What parameters are in the equation?
- Xbar is the sample mean
- s is the standard deviation
- The rest (2, π, e) are constants
The normal distribution is "controlled" only by Xbar and s, often written as pdf = f(Xbar, s). In words, we say that the normal distribution is a function of Xbar and s.

pdf = f(Xbar, s): keep the mean (Xbar) = 50, change s
[Figure: number of fish against length for s = 5, s = 10 and s = 20.]
The central position (Xbar) remains the same. The higher the value of s, the greater the spread of the curve.
Q: Is the mean on its own a useful measure?

pdf = f(Xbar, s): keep s = 10, change Xbar
[Figure: number of fish against length for Xbar = 40, Xbar = 50 and Xbar = 60.]
The shape of the curve remains the same. The mean (Xbar) describes the central location on the x-axis.

What line describes the data distribution best?
Assume the distribution is normal: find the values of Xbar and s which best describe the data.
[Figure: the observations together with the curves for Xbar = 50 and s = 5, 10 and 20.]
Answer: Xbar = 50, s = 10

In 2002, 7073 larvae were measured. The mean was 50 mm and the standard deviation 10 mm.
[Figure: number of fish against length (mm), with one standard deviation (s) marked on either side of the mean.]
Can we say anything about probabilities?
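As a rough illustration of the normal-curve equation above, the sketch below evaluates the expected number of fish per 1 mm length class for the fitted values (n = 7073, Xbar = 50 mm, s = 10 mm). The function name `expected_count` and its parameter names are assumptions made for this example, not notation from the slides.

```python
import math


def expected_count(length, n, bin_width, mean, sd):
    """Expected number in a length class: n * dL * pdf(length)."""
    z = (length - mean) / sd
    pdf = math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))
    return n * bin_width * pdf


n_fish, bin_width = 7073, 1  # number of larvae, 1 mm length classes

# Same mean, different spread: the peak drops as s grows.
for sd in (5, 10, 20):
    peak = expected_count(50, n_fish, bin_width, mean=50, sd=sd)
    print(f"s = {sd:>2}: expected number at 50 mm = {peak:.0f}")

# The fitted curve (Xbar = 50, s = 10) over the plotted range of lengths.
curve = [expected_count(L, n_fish, bin_width, 50, 10) for L in range(20, 81)]
print(f"expected total between 20 and 80 mm: {sum(curve):.0f}")
```

Multiplying the density by n and by the class width is what turns the unit-area curve into numbers of fish that can be compared with the observed histogram.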
Probabilities = likelihood ≈ relative frequency
In presentations of data analysis we often see statements like: "We expect that 95% of the population are within a certain specified range of the data distribution."
E.g. given the sample that I have, I expect that 95% of the fish population is between 30 and 70 mm. This is sometimes written as 50 ± 20 mm.
How can we say this?

Why do we say this?
Although there are many normal curves, they all share an important property that allows us to treat them in a uniform fashion.

The 68-95-99.7% Rule
All normal density curves satisfy the following property, which is often referred to as the Empirical Rule:
- 68% of the observations fall within 1 standard deviation of the mean, that is, between Xbar - s and Xbar + s.
- 95% of the observations fall within 2 standard deviations of the mean, that is, between Xbar - 2s and Xbar + 2s.
- 99.7% of the observations fall within 3 standard deviations of the mean, that is, between Xbar - 3s and Xbar + 3s.
Thus, for a normal distribution, almost all values lie within 3 standard deviations of the mean.

Note that these values are approximations. For example, according to the normal probability density function, 95% of the data fall within 1.96 standard deviations of the mean. Using 2 standard deviations is a convenient approximation.

What does 1.96 standard deviation mean?
In 2002, 7073 larvae were measured. The mean was 50 mm and the standard deviation 10 mm. 95% of all the measurements (6719 larvae) fall within 1.96 standard deviations of the mean (about 30-70 mm), given that the data follow a normal distribution.
[Figure: number of fish against length (mm), 20 to 80 mm, with ±2s marked around the mean.]

But what does 1 standard deviation mean?
In 2002, 7073 larvae were measured.
The mean was 50 mm and the standard deviation 10 mm. 68% of all the measurements (4810 larvae) fall within 1 standard deviation of the mean (40-60 mm), given that the data follow a normal distribution.
[Figure: number of fish against length (mm), 20 to 80 mm, with ±1s marked around the mean.]

The Z score
In statistics the Z score is defined as:

$$Z = \frac{X_i - \bar{X}}{s} = \frac{\text{value of measurement } i - \text{mean}}{\text{standard deviation}} = \frac{\text{deviation of measurement } i}{\text{standard deviation}}$$

Hmm ..., have we seen this formula before?

The meaning of the Z-score
The Z-score standardizes the deviation of a measurement from the mean relative to the standard deviation. The Z-score is a multiplier, indicating how many standard deviations a particular measurement is from the mean.

i    Value (X_i)    Z-score                 Z-score
                    (Xbar = 50, s = 10)     (Xbar = 50, s = 20)
1    20             -3.00                   -1.50
2    30             -2.00                   -1.00
3    40             -1.00                   -0.50
4    50              0.00                    0.00
5    60              1.00                    0.50
6    70              2.00                    1.00
7    80              3.00                    1.50

The Z scores of our data
[Figure: two panels of the larvae data: n against length (20 to 80 mm) with 1s and 2s marked, and n against Z score (-3 to 3).]

Cumulative relative distribution of Z scores
[Figure: proportion of fish less than a given length against Z score (-3 to 3), with the 16th and 84th percentiles marked and 68% between them.]
The graph shows that -1 s is the 16th percentile and +1 s the 84th percentile. Thus 84 - 16 = 68% of the data lie within ±1 s of the mean.

Cumulative relative distribution of Z scores
[Figure: proportion less than a given Z-score against Z score (-3 to 3).]
The shape of this graph and the values of Z and pdf are the same for any normally distributed data, irrespective of the number of measurements (n) and the values of the mean and standard deviation.
If we have a mean and a standard deviation from a sample, and we assume that the data are normally distributed, we can say what the probability is that the next measurement we take is less than a certain Z value.
E.g. Xbar = 100 mm, s = 20 mm.
How likely is it that the next measurement that is sampled is less than:
- 60 mm: Z score = (60 - 100)/20 = -2, about 2.5% probability (unlikely)
- 120 mm: Z score = (120 - 100)/20 = +1, about 84% probability

Standard error (standard deviation of the means)
The standard error (or standard deviation of the mean) estimates the standard deviation of the means:

$$SE = S_{\bar{X}} = \frac{S}{\sqrt{n}}$$

- We are effectively using the present sample to estimate what the likely distribution of the means would be if we were to take repeated samples from the population.
- The standard error is thus a value that can be used to estimate the confidence interval of the parametric (population) mean from the sample mean, given the distribution of the data.
- We assume that the means are normally distributed.
Note: the standard deviation is an estimate of the dispersion of the individual observations around the mean of a sample:

$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$$

Confidence intervals
The 95% confidence limits (CL) of a population mean, given the sample mean and standard error, can be calculated as follows:

$$n \ge 30: \quad \bar{X} - Z_{95\%}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + Z_{95\%}\frac{S}{\sqrt{n}}, \qquad Z_{95\%} = 1.96$$

$$n < 30: \quad \bar{X} - t_{95\%,\,n-1}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{95\%,\,n-1}\frac{S}{\sqrt{n}}$$

t_{n-1}: fractiles (here 95%) of the Student t-distribution with n - 1 degrees of freedom.
The t-distribution is similar to the normal distribution, but its shape varies with sample size when n is less than 30. When n > 30, t ≈ 1.96 (≈ 2) for the 95% confidence interval.

Example
From our measurements of 0-group larvae we have:
Xbar = 49.8 mm
s = 10.1 mm
n = 7073
To calculate the 95% confidence interval of the mean we use t = 1.96:

$$49.8 - 1.96\,\frac{10.1}{\sqrt{7073}} \le \mu \le 49.8 + 1.96\,\frac{10.1}{\sqrt{7073}}$$

$$49.56 \le \mu \le 50.04$$

There is thus a 95% probability that the interval contains the true population mean.

How many samples should be collected?
Suppose that we require that the estimated mean landings from samples should not deviate more than 7% (maximum relative error) from the true landings, and that we want to be 95% certain of this.
The maximum relative error of the mean can be calculated from:

$$\varepsilon_{\max} = t_{n-1,\,0.05}\,\frac{CV}{\sqrt{n}}, \quad \text{where } CV = 100\,\frac{s}{\bar{X}}$$

- Increasing the sample size (n) lowers the maximum relative error.
- A higher CV (the standard deviation relative to the mean) results in a higher relative error for a given sample size.
Graphically we have:
[Figure: Sample size and maximum relative error at the 95% level. Maximum relative error (0 to 20%) against sample size n (0 to 60) for CV = 10% and CV = 20%.]

Question: How many samples are needed in order to be 95% sure that the estimated mean from the samples does not deviate more than 7% from the true mean?
[Figure: Sample size and relative error at the 95% level. % deviation from the "true value" (0 to 20%) against sample size n (0 to 60) for CV = 10% and CV = 20%.]
Answer: It depends on your CV.
- If CV is 10%, about 10 samples are needed to achieve the required precision.
- If CV is 20%, about 35 samples are needed to achieve the required precision.
Note: Increasing the number of samples (for any given CV) does not proportionally increase the precision of the value; the cost gets disproportionately higher the closer one gets to the "true value".

Final remark
- The introduction to statistical analysis given here is only a very brief overview, covering frequency distributions and measures of dispersion, mostly for normally distributed data.
- A simple frequency plot is in essence a probability plot.
- Graphical analysis and display of data and models can increase the understanding of the concepts behind them.
Further suggested readings:
- Haddon 2001, Chapter 3
- Larson and Farber, Elementary Statistics
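The confidence-interval example and the sample-size question above can be checked with a short Python sketch. It is a simplification, not the calculation behind the graphs: by default it assumes the normal quantile (about 1.96) in place of the Student t fractile, because the Python standard library has no t-distribution. The function names and the `quantile` parameter are hypothetical.

```python
import math
from statistics import NormalDist

# Two-sided 95% quantile of the standard normal distribution (about 1.96).
# Assumption: used in place of t, which is only reasonable for large n.
Z95 = NormalDist().inv_cdf(0.975)


def confidence_interval(mean, sd, n, quantile=Z95):
    """Confidence limits of the mean: mean +/- quantile * sd / sqrt(n)."""
    half_width = quantile * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


def max_relative_error(cv_percent, n, quantile=Z95):
    """Maximum relative error (%) of the mean: quantile * CV / sqrt(n)."""
    return quantile * cv_percent / math.sqrt(n)


def required_n(cv_percent, target_percent, quantile=Z95):
    """Smallest n with max_relative_error <= target, for a fixed quantile."""
    return math.ceil((quantile * cv_percent / target_percent) ** 2)


# Larvae example: Xbar = 49.8 mm, s = 10.1 mm, n = 7073.
low, high = confidence_interval(49.8, 10.1, 7073)
print(f"{low:.2f} <= mu <= {high:.2f}")

# Sample sizes for a 7% maximum relative error under the normal approximation.
for cv in (10, 20):
    print(f"CV = {cv}%: n >= {required_n(cv, 7)}")
```

With the normal quantile the sketch returns somewhat smaller sample sizes than the values of about 10 and 35 read from the graph, because the t fractile is larger than 1.96 for small n. A t value can be passed explicitly instead: for example, with CV = 10%, n = 10 and t = 2.262 (9 degrees of freedom), `max_relative_error(10, 10, quantile=2.262)` gives roughly 7.2%.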