Statistics for Data Analysis

When measuring quantities, there is always a level of uncertainty in the resulting numbers. The recorded estimates are as likely to be larger than the "real" value as they are to be smaller. If we have a large number of such replicate measurements, the plot of how many times a measurement was obtained versus the value of the measurement would be a "bell curve," also known as a Gaussian or normal distribution. This follows from a theorem called the central limit theorem, which I will not go into here. The mean, or average, of the measurements lies at the peak of the curve. The curve below is a plot against (x - mean), so the peak is at zero; to plot the curve, I chose a standard deviation of 1.

The standard deviation is a measure of the width of the distribution. If the standard deviation is large, the scatter of the data is large and the curve will be broad. In such a case, we are less certain of the actual value. The more measurements we make, the more certain we are that the correct answer is close to our average value.

Suppose we had three measurements. The average would be our best estimate of the value, and the standard deviation would define how wide the curve is. In order to say that the "actual value" lies within a certain range of the average, we use the properties of the normal distribution. The fraction of all results within 1 standard deviation of the mean is given by the area under the curve between (mean - stdev) and (mean + stdev). For the normal distribution, this is 68%. Turning this around, we would say there is a 68% probability that the "correct" answer is within 1 standard deviation of the mean. Other useful values are given in the table below (the inner, middle, and outer bars in the plot mark these ranges):

Standard deviations    1      1.65    1.96    2.57
% of results           68     90      95      99

One way to reflect the increasing certainty of the mean value is to use the standard deviation of the mean. This should shrink with increasing numbers of measurements.
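The coverage fractions in the table can be reproduced from the normal distribution's error function. A minimal Python sketch (the function name `fraction_within` is my own, not from the notes):

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution lying within k standard
    deviations of the mean: erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))

# Reproduce the table's columns:
for k in (1, 1.65, 1.96, 2.57):
    print(f"{k:5.2f} sigma -> {100 * fraction_within(k):.1f}%")
```

Running this prints approximately 68.3%, 90.1%, 95.0%, and 99.0%, matching the table.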
It turns out that the square of the standard deviation, called the variance, adds. So the variance of the mean would be the variance divided by the number of measurements (just like the average!). Taking a square root gives the standard deviation of the mean:

    S(mean) = S / √n

The confidence interval tells what range about the mean will contain the "true" value to the stated degree of confidence. This uses the results in the table above. The 90% confidence interval would be the mean ± 1.65(standard deviation).

If we want to compare two methods of measuring a value, the two are significantly different if their means are separated by more than the confidence limit. If the means for the two methods differ by less than 1.65 standard deviations, there is a 90% chance they are in agreement, which is to say the difference is not statistically significant.

The t test gives a way of comparing means which also takes account of the number of measurements made. The T score calculated is essentially the number of standard deviations of difference between the two means; if it is small enough, the two values are statistically indistinguishable. The plot and the table below describe 5 normal curves with different means and standard deviations, labeled sigma in the table:

         A       B       C       D       E
mean     0       0.5     0.1     0.3     0.35
sigma    0.2     0.04    0.1     0.01    0.05

The results of t tests on the distributions above are listed below for different numbers of samples:

        N = 5       N = 10      N = 30
A-B     2.282177    3.227486    5.59017
A-C     0.408248    0.57735     1
D-E     0.456435    0.645497    1.118034

Compare the results with the graph: A and B clearly have different means, while the mean values for A and C are not obviously distinct. Similarly, the means for D and E are less distinct, since the curves overlap significantly. So it appears that larger values of T result when the two means are more significantly different. In effect, the T score is the number of pooled standard deviations equal to the difference between the means.
The table below gives scores for comparison to determine whether the means agree or not to a chosen level of confidence. A T score larger than the table value indicates a significant difference between the means. The degrees of freedom are the number of independent values in the two samples combined. This will be equal to the total number of samples minus 2 (n1 + n2 - 2), since the two mean values constrain the data: the last value in each set could be determined from the mean and all but one of the data.

T statistic for the Student t test:

Confidence             95%        97.5%      99%
Degrees of freedom     t(0.05)    t(0.025)   t(0.010)
 1                     6.314      12.706     31.821
 2                     2.920       4.303      6.965
 3                     2.353       3.182      4.541
 4                     2.132       2.776      3.747
 5                     2.015       2.571      3.365
 6                     1.943       2.447      3.143
 7                     1.895       2.365      2.998
 8                     1.860       2.306      2.896
 9                     1.833       2.262      2.821
10                     1.812       2.228      2.764
11                     1.796       2.201      2.718
12                     1.782       2.179      2.681
13                     1.771       2.160      2.650
14                     1.761       2.145      2.624
15                     1.753       2.131      2.602
16                     1.746       2.120      2.583
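The comparison procedure described above can be sketched directly from the 95% column of the table. A short Python example (the dictionary holds the t(0.05) column; the function name is my own):

```python
# Critical t values at the 95% level, t(0.05), from the table above,
# keyed by degrees of freedom.
T_CRIT_95 = {1: 6.314, 2: 2.920, 3: 2.353, 4: 2.132, 5: 2.015,
             6: 1.943, 7: 1.895, 8: 1.860, 9: 1.833, 10: 1.812,
             11: 1.796, 12: 1.782, 13: 1.771, 14: 1.761,
             15: 1.753, 16: 1.746}

def means_differ(t_score, n1, n2):
    """True if a computed T score indicates a significant difference
    between two means at the 95% level."""
    df = n1 + n2 - 2  # degrees of freedom, as defined in the text
    return t_score > T_CRIT_95[df]
```

For instance, the A-B score of 2.282 at N = 5 (so df = 5 + 5 - 2 = 8, critical value 1.860) is significant, while the A-C score of 0.408 at the same df is not.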