Practical No : 07
PRACTICAL ON STATISTICS

Objectives :
At the end of the practical, the student should be able to,
1. List the methods of data representation.
2. Explain the tabular and diagrammatic methods of data representation.
3. Define mean, median and mode.
4. Explain the following measures of dispersion and their method of calculation.
   i. Range
   ii. Inter-quartile range
   iii. Variance
   iv. Standard deviation
5. Describe the normal distribution / Gaussian curve.
6. Explain hypotheses and hypothesis testing.
7. Explain and calculate the standard error of the mean.

The methods of data representation

Data Presentation (3 main methods)
- Tabular Method : Frequency table, Cross tabulation, Cumulative frequency table
- Diagrammatic Method : Histogram, Frequency polygon, Bar chart / Pie chart
- Numerical Method :
  Compute the average : Mean, Median, Mode
  Compute the measure of variability : Range, Inter-quartile range, Variance, Standard deviation

Tabular Method

The worked example is described by its population, sample, variable and number of observations.
Population : A batch of medical students at the Faculty of Medical Sciences, USJP.
Sample : A random sample (practical group A)
Variable : Height of students (cm)
Number : 16 observations (9 male, 7 female)

Male : 169 173 171 168 168 165 169 169 174
Female : 155 159 156 150 160 161 160

Frequency Table

Height in cm (x)    Frequency (f)    %
150 - 154           1                6.25
155 - 159           3                18.75
160 - 164           3                18.75
165 - 169           6                37.5
170 - 174           3                18.75
Total               16               100

Cross Tabulation

Height in cm    Male (f)    Male %    Female (f)    Female %
150 - 154       0           0         1             14.29
155 - 159       0           0         3             42.86
160 - 164       0           0         3             42.86
165 - 169       6           66.67     0             0
170 - 174       3           33.33     0             0
Total           9           100       7             100

Cumulative Frequency table

Height in cm    Frequency (f)    Cumulative frequency    %
150 - 154       1                1                       6.25
155 - 159       3                4                       18.75
160 - 164       3                7                       18.75
165 - 169       6                13                      37.5
170 - 174       3                16                      18.75
Total           16                                       100

Diagrammatic Method

[Figure: Frequency polygon. Frequency (f) against height (cm), classes 150 - 154 to 170 - 174.]
[Figure: Histogram. Frequency (f) against height (cm), classes 150 - 154 to 170 - 174.]
[Figure: Bar chart. Frequency (f) against height (cm), classes 150 - 154 to 170 - 174.]
[Figure: Pie chart. Male 56%, Female 44%.]

Numerical Method

Computing the Average :
Estimates of the average value are known as measures of central tendency. These include the mean, the median and the mode.

Mean : When calculating the mean, all the observed values are added up and divided by their number. The values may or may not be arranged in numerical order. The mean of a group of values, however, is governed by every individual value, i.e. any outstandingly high or low value has a big impact on the mean.

$\bar{X} = \frac{\sum X}{n}$   ("X bar")

Median : Represents the middle or central value of a group of observations. When obtaining the median, all the observations should first be arranged in ascending or descending order. With an odd number of observations, the middle observation is taken directly as the median of that sample. With an even number of observations, the two middle values are added and divided by 2 to obtain the median.
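The mean and median rules described above can be checked on the sample heights with a short Python sketch. The function names and the extra value of 250 cm are illustrative assumptions (the 250 is a hypothetical outlier, not part of the practical data); the heights themselves are the male and female values listed for the sample.

```python
# Heights (cm) of the sample: 9 male and 7 female students.
male = [169, 173, 171, 168, 168, 165, 169, 169, 174]
female = [155, 159, 156, 150, 160, 161, 160]
heights = male + female


def sample_mean(values):
    """Add up all observed values and divide by their number."""
    return sum(values) / len(values)


def sample_median(values):
    """Arrange the values in order and take the middle one(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: the middle observation.
        return ordered[mid]
    # Even number: add the two middle values and divide by 2.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(sample_mean(heights))    # 164.1875
print(sample_median(heights))  # 166.5

# Hypothetical outlier (assumption, not in the data): one very tall value.
with_outlier = heights + [250]
print(sample_mean(with_outlier))    # rises to about 169.2
print(sample_median(with_outlier))  # 168
```

Adding the single hypothetical value shifts the mean by about 5 cm, while the median moves by only 1.5 cm.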
The median is less sensitive to extreme values than the mean: its value is not affected by any outstandingly high or low observation within the sample, so it is often the more representative measure of central tendency when such values are present.

Mode : This is the most frequently occurring value in a set of observations.
Eg. In a set of observations as follows,
150, 155, 156, 159, 160, 160, 161, 165, 168, 168, 169, 169, 169, 170
the mode is 169, as it is the most frequently occurring value (it occurs three times). There may be more than one mode in a set of observations; a set with two modes is called bimodal.

Computing the measure of variability :

Range : Shows the minimum and maximum values in a set of observations. It thereby establishes the boundaries for that specific set of observations. The range runs from $X_{\min}$ to $X_{\max}$; expressed as a single number,

$\text{Range} = X_{\max} - X_{\min}$

Interquartile Range : Is similar to the range; however, the lowest and highest values of an interquartile range correspond to the 25th and 75th percentiles respectively, when the data are arranged in ascending order.
Eg. In the following set of 16 observations,
150, 155, 156, 159, 160, 160, 161, 165, 168, 168, 169, 169, 169, 170, 173, 174
the 4th observation (159) marks 25%, the 8th observation (165) marks 50% and the 12th observation (169) marks 75%. The interquartile range therefore runs from 159 to 169.

Variance : This is the mean square deviation in a set of observations, i.e. the sum of the squared deviations from the mean divided by n - 1.

$\text{Variance} = \frac{\sum (X - \bar{X})^2}{n - 1}$

Standard Deviation : This is the square root of the variance. Because it is expressed in the same units as the observations, it gives a more directly interpretable indication of the spread of observations around the mean.
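The mode and the measures of variability defined above can be sketched in Python using the 16 observations from the interquartile range example. The helper names are illustrative assumptions. In particular, `quartile_by_position` simply takes the observation at position n × fraction, as in the example's 4th and 12th observations; statistical packages use various interpolation rules and may give slightly different quartiles.

```python
import math
from statistics import multimode, stdev, variance

# The 16 observations from the interquartile range example (cm).
observations = [150, 155, 156, 159, 160, 160, 161, 165,
                168, 168, 169, 169, 169, 170, 173, 174]


def value_range(values):
    """Return the boundaries (minimum, maximum) of the observations."""
    return min(values), max(values)


def quartile_by_position(values, fraction):
    """Observation at 1-based position n * fraction (assumed simple rule)."""
    ordered = sorted(values)
    position = int(len(ordered) * fraction)
    return ordered[position - 1]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


modes = multimode(observations)                # most frequent value(s)
low, high = value_range(observations)
q1 = quartile_by_position(observations, 0.25)  # 4th observation
q3 = quartile_by_position(observations, 0.75)  # 12th observation
var = sample_variance(observations)
sd = math.sqrt(var)                            # square root of the variance

print("Mode(s):", modes)
print("Range:", low, "to", high, "width", high - low)
print("Interquartile range:", q1, "to", q3)
print("Variance:", round(var, 2), "SD:", round(sd, 2))
```

The standard library functions `statistics.variance` and `statistics.stdev` also divide by n - 1, so they can be used to cross-check the hand-written versions.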
$\text{Standard Deviation} = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$

The calculation of standard deviation is as follows (shown for five observations),

X        (X - X bar)            (X - X bar) squared
$X_1$    $X_1 - \bar{X}$        $(X_1 - \bar{X})^2$
$X_2$    $X_2 - \bar{X}$        $(X_2 - \bar{X})^2$
$X_3$    $X_3 - \bar{X}$        $(X_3 - \bar{X})^2$
$X_4$    $X_4 - \bar{X}$        $(X_4 - \bar{X})^2$
$X_5$    $X_5 - \bar{X}$        $(X_5 - \bar{X})^2$
Total                           $\sum (X - \bar{X})^2$

With five observations, n - 1 = 4, so

$\text{Variance } (V) = \frac{\sum (X - \bar{X})^2}{4}$

$\text{Standard Deviation} = \sqrt{\frac{\sum (X - \bar{X})^2}{4}}$

The normal distribution (Gaussian Curve)

[Figure: Normal distribution curve. Frequency against value, with bands marked at ± 1 SD, ± 2 SD and ± 3 SD from the mean.]

Mean ± 1 SD : 68% of the population
Mean ± 2 SD : 95% of the population
Mean ± 3 SD : 99.7% of the population

In a normal distribution,
- The curve is symmetrically bell shaped
- Mean = Mode = Median
- The positive and negative deviations of the observations from the mean add up to 0

Hypotheses and Hypothesis testing

The statistical significance of a difference
- When analyzing data that come from two different populations, it is important to know the degree to which they are related, as well as the degree to which an observation differs from or relates to the rest of the observations in its population.
- From this, we can make a statement about the probability of such an observation occurring within the specified population.
- When a set of observations has a normal distribution, multiples of the standard deviation mark certain limits on the scatter of observations.

Eg. 1.96 SD (or approx. 2 SD) above and below the mean mark the points within which 95% of the population lie, i.e. 5% of the population lie beyond these points. If a certain observation falls in this 5%, we can say that the probability of such an observation occurring is 5% or less. Probability is expressed as 'P', and as a fraction of 1 rather than of 100. In the above instance, P < 0.05.

Hypothesis testing
- A set of observations is plotted on a graph (normal distribution)
- Choose the observation to be tested
- Form a hypothesis (known as the 'null hypothesis')
- Calculation : (Value of observation - Mean value of all observations) / Standard Deviation

  $\frac{X - \bar{X}}{SD}$

- Answer : How many standard deviations away from the mean does the observation lie?
- Compare this value with the probability table

Number of standard deviations    Probability of an observation showing at least as large a deviation from the population mean
0.674                            0.50
1.0                              0.317
1.645                            0.10
1.96                             0.05
2.0                              0.046
2.576                            0.01
3.0                              0.0027
3.291                            0.001

- Find out the probability of the observation occurring within the population

If,
P > 0.05            No significant difference
0.05 > P > 0.01     Probably significant
0.01 > P > 0.001    Significant difference
P < 0.001           Highly significant difference

- Accept or reject the null hypothesis

Standard Error of Mean
- In the statistical analysis of a population, several samples may be drawn instead of one, and analyses performed on each separately.
- Even though all samples may be drawn at random, the means of these samples will not necessarily be the same, but will generally conform to a "normal distribution".
- Thus, there is a variation between samples, and this series of means will also have a standard deviation.
- This variation will depend on the variation in the population and the size of the sample.
- We do not know the variation in the population.
- But an estimate of it can be obtained from the variation within the sample, which is its standard deviation.

Therefore,

$\text{Standard Error of Mean} = \frac{\text{Standard Deviation}}{\sqrt{n}}$

- In this manner we can compare two samples and identify whether they are from the same or different populations.

Eg. Consider two samples 1 and 2. As we do not know the mean of the population they came from, we estimate how far a sample mean is likely to lie from it, using the standard deviation of one of the samples in the above formula. The mean of this sample has a 95% chance of falling within 1.96 standard errors above or below the population mean. If the second sample also comes from the same population, there is a 95% chance that its mean will also lie within +/- 1.96 standard errors of the population mean. In order to assess this, we calculate the standard error of the difference between the means.
$\text{SE of difference between means} = \sqrt{\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}}$

After obtaining the standard error, we need to find the difference between the two sample means, i.e.

$\bar{X}_1 - \bar{X}_2$

We now need to find out how many multiples of the standard error this difference represents, i.e.

$\frac{\bar{X}_1 - \bar{X}_2}{SE}$

A difference of more than 3.291 standard errors corresponds to a probability of less than 0.001 of such a difference arising when the two samples come from the same population. Therefore, if $(\bar{X}_1 - \bar{X}_2) / SE$ is more than 3.291, we can say that the probability of the two samples belonging to the same population is less than 0.001, or P < 0.001.
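The comparison of two samples can be sketched in Python as follows. Treating the male and female heights from the tabular method section as samples 1 and 2 is an illustrative assumption, as are the function names. The probability is taken from the normal distribution, as in the handout's probability table, using `math.erfc` to obtain the chance of a deviation at least this large in either direction.

```python
import math
from statistics import mean, stdev

# Illustrative choice of samples (assumption): male and female heights (cm).
male = [169, 173, 171, 168, 168, 165, 169, 169, 174]
female = [155, 159, 156, 150, 160, 161, 160]


def standard_error_of_mean(sample):
    """Standard deviation (n - 1 form) divided by the square root of n."""
    return stdev(sample) / math.sqrt(len(sample))


def se_of_difference(sample_1, sample_2):
    """Square root of SD1^2 / n1 + SD2^2 / n2."""
    return math.sqrt(stdev(sample_1) ** 2 / len(sample_1)
                     + stdev(sample_2) ** 2 / len(sample_2))


def two_sided_p(multiples):
    """Probability of a deviation at least this large, in either
    direction, under a normal distribution."""
    return math.erfc(abs(multiples) / math.sqrt(2))


def interpret(p):
    """Map P onto the handout's four significance categories."""
    if p > 0.05:
        return "No significant difference"
    if p > 0.01:
        return "Probably significant"
    if p > 0.001:
        return "Significant difference"
    return "Highly significant difference"


se_diff = se_of_difference(male, female)
multiples = (mean(male) - mean(female)) / se_diff
p = two_sided_p(multiples)

print("SE of mean (male):", round(standard_error_of_mean(male), 2))
print("SE of difference:", round(se_diff, 2))
print("Multiples of SE:", round(multiples, 2))
print(interpret(p))
```

As a check against the probability table, `two_sided_p(1.96)` is close to 0.05 and `two_sided_p(3.291)` is close to 0.001. For these two samples the difference between the means is more than 3.291 standard errors, so the sketch reports a highly significant difference.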