Dr. S. Nishan Silva (MBBS)

Statistics
The collection, evaluation, and interpretation of data

Statistics
• Descriptive Statistics – describe collected data
• Inferential Statistics – generalize and evaluate a population based on sample data

Graphic Data Representation
• Histogram – frequency distribution graph
• Frequency Polygons – frequency distribution graph
• Bar Chart – categorical data graph
• Pie Chart – categorical data graph (%)

Levels of Measurement
• Qualitative data
  – Nominal Measurement
    • Ex: give a number coding for the data. The number value is not considered.
  – Ordinal Measurement
    • Ex: number coding, but the number value matters.
• Quantitative data
  – Interval Measurement
    • No absolute zero. To what range does a value belong?
  – Ratio Measurement
    • Absolute zero, and continuing.

Discussion of Examples

Example Research
• The effect of food from the IIHS canteen on weight gain
• Population – IIHS (students and staff)
• Further divisions – Students – Nursing and Physiotherapy
• Data collection – Questionnaire
  • Food from home/outside vs. from canteen
  • Weight change over one month

Master Data Sheet
(one column per response sheet: Sheet 1 to Sheet 8)
Question   Options
Gender     M / F
Job        Nurse / Physio / Other
Food       From Canteen / Other
Weight     Gain / Lost or Same

Master Table
(each cell records the value and the %)
Gender   Job      Food      Gain   No Gain
Male     Nurses   Canteen
Male     Nurses   Other
Male     Physio   Canteen
Male     Physio   Other
Male     Other    Canteen
Male     Other    Other
Female   Nurses   Canteen
Female   Nurses   Other
Female   Physio   Canteen
Female   Physio   Other
Female   Other    Canteen
Female   Other    Other

Graphs - Draw
• Pie charts
  – Weight gain from canteen in males
  – Weight gain from home in females
• Bar charts / Graphs
  – Weight gain from canteen

Discussion of YOUR Examples

Describing the Data with Numbers
Measures of Central Tendency
• MEAN -- average
• MEDIAN -- middle value
• MODE -- most frequently observed value(s)

Measures of Central Tendency: Mean ($\bar{x}$)
• Arithmetic average
• Sum of all data values divided by the number of data values within the array
\[ \bar{x} = \frac{\sum x}{n} \]
• Most frequently used measure of central tendency
• Strongly influenced by outliers (very large or very small values)

Determine the mean value of 48, 63, 62, 49, 58, 2, 63, 5, 60, 59, 55
\[ \bar{x} = \frac{\sum x}{n} = \frac{48 + 63 + 62 + 49 + 58 + 2 + 63 + 5 + 60 + 59 + 55}{11} = \frac{524}{11} = 47.64 \]
Mean of a Group of Data: Page 78

Measures of Central Tendency: Median
• Data value that divides a data array into two equal groups
• Data values must be ordered from lowest to highest
• Useful in situations with skewed data and outliers (e.g., wealth management)

Determine the median value of 48, 63, 62, 49, 58, 2, 63, 5, 60, 59, 55
• Organize the data array from lowest to highest value: 2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63
• Select the data value that splits the data set evenly: Median = 58
• What if the data array had an even number of values? For 5, 48, 49, 55, 58, 59, 60, 62, 63, 63, average the two middle values: (58 + 59)/2 = 58.5

Measures of Central Tendency: Mode
• Most frequently occurring response within a data array
• Usually the highest point of the curve
• May not be typical
• May not exist at all
• Mode, bimodal, and multimodal

Determine the mode of 48, 63, 62, 49, 58, 2, 63, 5, 60, 59, 55
• Mode = 63
Determine the mode of 48, 63, 62, 59, 58, 2, 63, 5, 60, 59, 55
• Mode = 63 & 59 (bimodal)
Determine the mode of 48, 63, 62, 59, 48, 2, 63, 5, 60, 59, 55
• Mode = 63, 59, & 48 (multimodal)

Measures of Dispersion
• RANGE: highest to lowest values
• STANDARD DEVIATION: how closely do values cluster around the mean value
• SKEWNESS: refers to symmetry of curve

Range
• Calculate by subtracting the lowest value from the highest value: $R = h - l$
• Calculate the range for the data array 2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63
\[ R = h - l = 63 - 2 = 61 \]

Standard Deviation
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{N - 1}} \]
1. Calculate the mean $\bar{x}$.
2. Subtract the mean from each value.
3. Square each difference.
4. Sum all squared differences.
5. Divide the summation by the number of values in the array minus 1.
6. Calculate the square root of the quotient.
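The measures defined above can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module on the same data array; the variable names are illustrative only, and `statistics.stdev` is used because it divides by N - 1, as in the steps above.

```python
import statistics

# Data array used in the worked examples above.
data = [48, 63, 62, 49, 58, 2, 63, 5, 60, 59, 55]

mean = statistics.mean(data)          # sum of values / number of values
median = statistics.median(data)      # middle value of the sorted array
modes = statistics.multimode(data)    # list of most frequent value(s)
data_range = max(data) - min(data)    # R = h - l
s = statistics.stdev(data)            # sample standard deviation (N - 1)

# Even number of values: the median is the average of the two middle values.
even_median = statistics.median([5, 48, 49, 55, 58, 59, 60, 62, 63, 63])

print(f"mean = {mean:.2f}")
print(f"median = {median}, median (even case) = {even_median}")
print(f"mode(s) = {modes}")
print(f"range = {data_range}")
print(f"standard deviation = {s:.2f}")
```

`multimode` returns every value tied for the highest count, so the bimodal and multimodal arrays above can be checked the same way.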
Standard Deviation
Calculate the standard deviation for the data array 2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{N - 1}} \]

1. Calculate the mean:
\[ \bar{x} = \frac{\sum x}{n} = \frac{524}{11} = 47.64 \]

2. Subtract the mean from each value, $x - \bar{x}$:
 2 - 47.64 = -45.64
 5 - 47.64 = -42.64
48 - 47.64 = 0.36
49 - 47.64 = 1.36
55 - 47.64 = 7.36
58 - 47.64 = 10.36
59 - 47.64 = 11.36
60 - 47.64 = 12.36
62 - 47.64 = 14.36
63 - 47.64 = 15.36
63 - 47.64 = 15.36

3. Square each difference, $(x - \bar{x})^2$:
(-45.64)² = 2083.01
(-42.64)² = 1818.17
0.36² = 0.13
1.36² = 1.85
7.36² = 54.17
10.36² = 107.33
11.36² = 129.05
12.36² = 152.77
14.36² = 206.21
15.36² = 235.93
15.36² = 235.93

4. Sum all squared differences, $\sum (x - \bar{x})^2$:
2083.01 + 1818.17 + 0.13 + 1.85 + 54.17 + 107.33 + 129.05 + 152.77 + 206.21 + 235.93 + 235.93 = 5,024.55

5. $N - 1$: 11 - 1 = 10

6. Divide:
\[ \frac{\sum (x - \bar{x})^2}{N - 1} = \frac{5{,}024.55}{10} = 502.46 \]

7. Take the square root:
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{N - 1}} = \sqrt{502.46} = 22.42 \]

Variance
\[ s^2 = \frac{\sum (x - \bar{x})^2}{N - 1} \]
Average of the square of the deviations
1. Calculate the mean.
2. Subtract the mean from each value.
3. Square each difference.
4. Sum all squared differences.
5. Divide the summation by the number of values in the array minus 1.

Variance
Calculate the variance for the data array 2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63
\[ s^2 = \frac{5024.55}{10} = 502.46 \]

Standard Deviation
[Figure: Curve A and Curve B compared by standard deviation]

Skewness
[Figure: Curve A and Curve B with mean, median, and mode marked; negative skew labeled]

A Simple Method for Estimating Standard Error
• Standard error is the calculated standard deviation divided by the square root of the sample size (the number of observations in the sample)
\[ SE_{\bar{x}} = \frac{s}{\sqrt{n}} \]
• Standard error of the mean is used to test the reliability of the data
• Example: if there are 10 corn plants with a standard deviation of 0.2
\[ SE_{\bar{x}} = \frac{0.2}{\sqrt{10}} = \frac{0.2}{3.16} = 0.063 \]
• 0.063 represents one standard deviation of the sample means for samples of 10 plants
• If there were 100 plants the standard error would drop to 0.2/√100 = 0.02

Why?
Because when we take larger samples, our sample means get closer to the true mean value of the population. Thus, the distribution of the sample means would be less spread out and would have a lower standard deviation.

Coefficient of Variation
• Percentage CV is:
\[ CV = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100 \]

Discussion of Examples

Probability
• It is the numerical measure of the likelihood that a specific event will occur.
• (Page 92)
• Sum of the probabilities of all possible outcomes of one event = 1
• Probability is always between 0 and 1

Probability
• Probability of independent events
  – Chance of one single event happening (against not happening)
• Marginal and conditional probabilities
  – (Pages 92-94)

The Normal Distribution
• Mean = median = mode
• Skew is zero
• 68% of values fall within 1 SD of the mean
• 95% of values fall within 2 SDs of the mean

The Normal Curve and Standard Deviation
A normal curve:
• Each vertical line is a unit of standard deviation
• 68% of values fall within +1 or -1 units of the mean
• 95% of values fall within +2 & -2 units
• Nearly all members (>99%) fall within 3 std dev units

Example (Theory)
My weight (lbs), by day:
Days 1-10:  140, 140.1, 139.8, 140.6, 140, 139.8, 139.6, 140, 140.8, 139.7
Days 11-20: 140.2, 141.7, 141.9, 141.4, 142.3, 142.3, 141.9, 142.1, 142.5, 142.3
Days 21-30: 142.1, 142.5, 143.5, 143, 143.2, 143, 143.4, 143.5, 142.7, 143.7
Days 31-40: 143.9, 144, 142.5, 142.9, 142.8, 143.9, 144, 144.8, 143.9, 144.5
Days 41-50: 143.9, 144, 144.2, 143.8, 143.5, 143.8, 143.2, 143.5, 143.6, 143.4
Days 51-60: 143.9, 143.6, 144, 143.8, 143.6, 143.8, 144, 144.2, 144, 143.9
Days 61-70: 144, 144.2, 144.5, 144.2, 143.9, 144.2, 144.5, 144.3, 144.2, 144.9
Days 71-80: 144, 143.8, 144, 143.8, 144, 144.5, 143.7, 143.9, 144, 144.2
Days 81-84: 144, 144.4, 143.8, 144.1

Plot as a function of the time the data was acquired
• Comments: background is white (less ink); font size is larger than the Excel default (use 14 or 16)
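The weight table above can be summarized numerically with the measures already introduced. The following is a rough sketch, assuming the 84 readings are treated as a single sample; the variable names and the helper `share_within` are hypothetical and not part of the original slides. It computes the mean, sample standard deviation, standard error, and coefficient of variation, then counts how many readings fall within 1 and 2 standard deviations so the result can be compared with the 68% and 95% figures for a normal curve.

```python
import math
import statistics

# Daily weights (lbs) for days 1-84, copied from the table above.
weights = [
    140, 140.1, 139.8, 140.6, 140, 139.8, 139.6, 140, 140.8, 139.7,
    140.2, 141.7, 141.9, 141.4, 142.3, 142.3, 141.9, 142.1, 142.5, 142.3,
    142.1, 142.5, 143.5, 143, 143.2, 143, 143.4, 143.5, 142.7, 143.7,
    143.9, 144, 142.5, 142.9, 142.8, 143.9, 144, 144.8, 143.9, 144.5,
    143.9, 144, 144.2, 143.8, 143.5, 143.8, 143.2, 143.5, 143.6, 143.4,
    143.9, 143.6, 144, 143.8, 143.6, 143.8, 144, 144.2, 144, 143.9,
    144, 144.2, 144.5, 144.2, 143.9, 144.2, 144.5, 144.3, 144.2, 144.9,
    144, 143.8, 144, 143.8, 144, 144.5, 143.7, 143.9, 144, 144.2,
    144, 144.4, 143.8, 144.1,
]

mean_w = statistics.mean(weights)
s_w = statistics.stdev(weights)            # sample standard deviation (N - 1)
se_w = s_w / math.sqrt(len(weights))       # standard error of the mean
cv_w = s_w / mean_w * 100                  # coefficient of variation, in %


def share_within(k):
    """Fraction of readings within k standard deviations of the mean."""
    low, high = mean_w - k * s_w, mean_w + k * s_w
    return sum(low <= w <= high for w in weights) / len(weights)


print(f"n = {len(weights)}, mean = {mean_w:.2f} lbs, s = {s_w:.2f} lbs")
print(f"standard error = {se_w:.3f} lbs, CV = {cv_w:.2f} %")
print(f"within 1 s: {share_within(1):.1%}, within 2 s: {share_within(2):.1%}")
```

Because the readings drift upward over time, the observed shares need not match the normal-curve percentages closely; the comparison is only a quick check on how well the normal model describes the data.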
[Figure: weight (lbs) vs. Day for the data above]
• Do not use curved lines to connect data points – that assumes you know more about the relationship of the data than you really do

Assume my weight is a single, random set of similar data
• Make a frequency chart (histogram) of the data
[Figure: histogram of # of Observations vs. Weight (lbs), next to the weight (lbs) vs. Day plot]

Create a “model” of my weight and determine average weight and how consistent my weight is
• average = 143.11
• s = 1.4 lbs (read at the inflection point)
• s = standard deviation = measure of the consistency, or similarity, of weights
[Figure: histogram of # of Observations vs. Weight (lbs) with the model curve]

[Figure: normal curve, Amplitude vs. s]
• Width is measured at the inflection point = s
• W1/2 is marked on the curve
• Triangulated peak: base width is 2s < W < 4s

[Figure: normal curve, Amplitude vs. s]
• pp = peak to peak, or largest separation of measurements
• Area +/- 1s = 68.3%
• Area +/- 2s = 95.4%
• Area +/- 3s = 99.74%
• pp ~ 6s

Peak to peak is sometimes easier to “see” on the data vs. time plot
• pp ~ 6s (calculated s = 1.4)
[Figure: weight (lbs) vs. Day with peak to peak marked between 144.9 and 139.5, next to the histogram of # of Observations vs. Weight (lbs)]
• s ~ pp/6 = (144.9 - 139.5)/6 ~ 0.9

Read: Correlation between variables – Page 99 and beyond

Inferential Statistics
Used to determine the likelihood that a conclusion based on data from a sample is true

Terms
• p value: the probability that a difference at least as large as the one observed could have occurred by chance alone, assuming there is no real difference

Terms
• confidence interval: the range of values we can be reasonably certain includes the true value

The Use of the Null Hypothesis
• Is the difference between two sample populations due to chance or a real statistical difference?
• The null hypothesis assumes that there will be no “difference” or no “change” or no “effect” of the experimental treatment.
• If treatment A is no better than treatment B then the null hypothesis is supported.
• If there is a significant difference between A and B then the null hypothesis is rejected.

T-test or Chi Square? Testing the validity of the null hypothesis
• Use the T-test (also called Student’s T-test) if using continuous variables from normally distributed sample populations (ex. height)
• Use the Chi Square ($\chi^2$) if using discrete variables (if you are evaluating the differences between experimental data and expected or hypothetical data). Example: genetics experiments, expected distribution of organisms.
T-test
• T-test determines the probability that the null hypothesis concerning the means of two small samples is correct
• The probability that two samples are representative of a single population (supporting the null hypothesis) OR two different populations (rejecting the null hypothesis)

Use the t-test to determine whether sample populations A and B came from the same or different populations
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}} \]
• $\bar{x}_1$ = mean of A; $\bar{x}_2$ = mean of B
• $s_{\bar{x}_1}$ = std error of A; $s_{\bar{x}_2}$ = std error of B
• $s_{\bar{x}_1 - \bar{x}_2}$ = std error of the difference between the two means
• Example: Sample A mean = 8, Sample B mean = 12, std error of the difference = 1
\[ \frac{12 - 8}{1} = 4 \text{ standard error units} \]

The “z” test
• Used if your samples are larger than 30
• Also used for normally distributed populations with continuous variables
• Formula (note: “σ” (sigma) is used instead of the letter “s”):
\[ z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \]
• Also note that if you only have the standard deviation you can square that value and substitute it for the variance

Example z-test
• You are looking at two methods of learning geometry proofs. One teacher uses method 1, the other teacher uses method 2, and they use a test to compare success.
• Teacher 1: has 75 students; mean = 85; stdev = 3
• Teacher 2: has 60 students; mean = 83; stdev = 2
\[ z = \frac{85 - 83}{\sqrt{\dfrac{3^2}{75} + \dfrac{2^2}{60}}} = \frac{2}{0.4321} = 4.629 \]

Example continued
• z = 4.6291
• Ho = null hypothesis would be that method 1 is not better than method 2
• HA = alternative hypothesis would be that method 1 is better than method 2
• This is a one-tailed z test (since the null hypothesis doesn’t predict that there will be no difference)
• For a probability of 0.05 (5% significance or 95% confidence) the chart value is $Z_\alpha = 1.645$
• 4.629 is greater than 1.645. For the null hypothesis (method 1 is not better) to stand, the value had to be less than 1.645; it is not, therefore reject the null hypothesis: method 1 is indeed better.

Z table (sample table with 3 probabilities)
α      Zα (one tail)   Zα/2 (two tails)
0.1    1.28            1.64
0.05   1.645           1.96
0.01   2.33            2.576

Chi square
• Used with discrete values
• Phenotypes, choice chambers, etc.
• Not used with continuous variables (like height; use the t-test for samples less than 30 and the z-test for samples greater than 30)
• O = observed values
• E = expected values
\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]
http://course1.winona.edu/sberg/Equation/chi-squ2.gif

Interpreting a chi square
• Calculate degrees of freedom: # of events, trials, or phenotypes - 1
• Example: 2 phenotypes - 1 = 1
• Generally use the column labeled 0.05 (which means there is a 95% chance that any difference between what you expected and what you observed is within accepted random chance).
• Any calculated value that is larger than the chart value means you reject your null hypothesis: there is a difference between observed and expected values.

How to use a chi square chart
http://faculty.southwest.tn.edu/jiwilliams/probab2.gif
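The z-test example and the chi-square steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not part of the original slides: the function names are hypothetical, the phenotype counts are invented for the demonstration (100 offspring with an expected 3:1 ratio), and 3.841 is the standard 0.05 chart value for 1 degree of freedom.

```python
import math


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for the difference between two sample means."""
    return (mean1 - mean2) / math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


def chi_square(observed, expected):
    """Chi-square statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# z-test example from the slides: teacher 1 vs. teacher 2.
z = two_sample_z(85, 3, 75, 83, 2, 60)
Z_ALPHA_ONE_TAIL = 1.645               # one-tailed chart value for alpha = 0.05
reject_z_null = z > Z_ALPHA_ONE_TAIL   # True means: reject the null hypothesis

# Chi-square with HYPOTHETICAL counts (not from the slides):
# 2 phenotypes, 100 offspring, expected ratio 3:1.
observed = [70, 30]
expected = [75, 25]
chi2 = chi_square(observed, expected)
degrees_of_freedom = len(observed) - 1   # number of phenotypes - 1
CHI2_CHART_VALUE = 3.841                 # 0.05 column, 1 degree of freedom
reject_chi2_null = chi2 > CHI2_CHART_VALUE

print(f"z = {z:.3f}, reject null: {reject_z_null}")
print(f"chi-square = {chi2:.3f}, df = {degrees_of_freedom}, "
      f"reject null: {reject_chi2_null}")
```

With these made-up counts the chi-square value stays below the chart value, so the difference between observed and expected is within random chance and the null hypothesis is not rejected; the z-test example, by contrast, exceeds its chart value.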