Adam Zaborski – handouts for Afghans

STATISTICS

Statistics is the science of the collection, organization and interpretation of data. Statistics also provides tools for prediction and forecasting using data and statistical models.

“Statistics”, when it refers to the discipline, is singular.
“Statistics”, when it refers to quantities calculated from a set of data, is plural.

There are three kinds of lies: lies, damned lies, and statistics.

By choosing a certain sample, results can be manipulated. Such manipulations need not be malicious or devious; they can arise from unintentional biases of the researcher. The graphs used to summarize data can also be misleading. A difference that is highly statistically significant can still be of no practical significance.

For instance, suppose we want to start a car engine. We observe that each time the engine fails to start, the lights are out as well. A tempting conclusion is that the breakdown of the lights causes the engine failure. In fact both phenomena are caused by a low battery level (and possibly a malfunctioning generator).

A chosen subset of the population is called a sample. For a sample to be used as a guide to an entire population, it is important that it is truly representative of that population. Most studies sample only part of a population, and the result is then used to interpret the null hypothesis in the context of the whole population.

Descriptive statistics summarize the data by describing what was observed in the sample numerically or graphically (for example, mean and standard deviation for continuous data, frequency and percentage for categorical data).

Inferential statistics draw inferences about the population: answering yes/no questions, estimating numerical characteristics, describing associations (correlation), modeling relationships (regression), extrapolation, interpolation and other modeling techniques.
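The descriptive summaries mentioned above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample data (heights and soil types) are hypothetical and do not come from the handout.

```python
from collections import Counter
from statistics import mean, stdev


def describe_continuous(values):
    """Return (mean, sample standard deviation) of numeric data."""
    return mean(values), stdev(values)


def describe_categorical(labels):
    """Return {category: (count, percentage)} for categorical data."""
    counts = Counter(labels)
    total = len(labels)
    return {cat: (n, 100.0 * n / total) for cat, n in counts.items()}


# Hypothetical sample data, for illustration only.
heights_cm = [172.0, 168.5, 181.0, 175.5, 169.0]
soil_types = ["clay", "sand", "clay", "gravel", "clay"]

m, s = describe_continuous(heights_cm)
table = describe_categorical(soil_types)

print(f"mean = {m:.2f} cm, SD = {s:.2f} cm")
for cat, (n, pct) in table.items():
    print(f"{cat}: {n} ({pct:.0f}%)")
```

Continuous data are reduced to two numbers, while categorical data are reduced to a frequency table; the two kinds of summary are not interchangeable.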
A common mistake is to take a statistical frequency (expressed in %) for a probability (also expressed in %). The same unit for both quantities does not mean that they are the same quantity. For example, in statistics 95% means that something happens 95 times in 100 repetitions. This is slightly different from a 95% probability of the event.

Accuracy and precision

Do we always need precise measurements of quantities? Of course not. Sometimes qualitative information or a rough assessment is sufficient. As an example, consider the settlement of a structure. Suppose it was caused by excavation works nearby. We want to know whether the settlement is continuing. The answer can be given by a simple experiment, which is not a measurement strictly speaking. For this we can use marble triangles, the most basic device for monitoring wall movement:

Marble triangles as an indicator of a continuously growing crack

If we want a more precise answer, we can use another simple device:

The device installed on a lighthouse.

If we need greater precision, we may use dial gauges or other devices.

The accuracy of a measurement system is the degree of closeness of a measured quantity to its actual (true) value. The precision, also called reproducibility or repeatability, is the degree to which repeated measurements under unchanged conditions show the same results. Although the two words can be synonymous in colloquial use, they are deliberately contrasted in the context of the scientific method. A measurement system can be accurate but not precise, precise but not accurate, neither, or both. For example, if an experiment contains a systematic error, the accuracy is low. A measurement system is called valid if it is both accurate and precise.
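The difference between accuracy and precision can be made concrete with a small sketch. It assumes that accuracy is judged by the offset of the mean reading from the true value and precision by the sample standard deviation of the readings; the readings and the reference length below are hypothetical.

```python
from statistics import mean, stdev


def accuracy_and_precision(readings, true_value):
    """Return (offset, spread).

    offset = mean reading minus true value (small offset: accurate)
    spread = sample standard deviation (small spread: precise)
    """
    offset = mean(readings) - true_value
    spread = stdev(readings)
    return offset, spread


TRUE_LENGTH = 10.00  # hypothetical reference value, mm

# Hypothetical readings: tightly clustered but shifted by a systematic error.
precise_not_accurate = [10.51, 10.49, 10.50, 10.50]
# Hypothetical readings: centered on the true value but widely scattered.
accurate_not_precise = [9.60, 10.40, 10.10, 9.90]

offset1, spread1 = accuracy_and_precision(precise_not_accurate, TRUE_LENGTH)
offset2, spread2 = accuracy_and_precision(accurate_not_precise, TRUE_LENGTH)

print(f"precise, not accurate: offset = {offset1:+.3f} mm, spread = {spread1:.3f} mm")
print(f"accurate, not precise: offset = {offset2:+.3f} mm, spread = {spread2:.3f} mm")
```

The first series has a large offset and a small spread; the second has the opposite pattern.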
Related terms are bias (non-random, systematic deviation) and error (random variability). In addition to accuracy and precision, measurements also have a resolution, which is the smallest change in the underlying physical quantity that produces a response in the measurement. Attention: the resolution of a device must not be taken as finer than prescribed; unless stated otherwise (by a caption such as “can be interpolated”), it is the smallest scale division.

To better explain accuracy and precision we can use the target analogy:

High accuracy, low precision and high precision, low accuracy

In this analogy, repeated measurements are compared to arrows that are shot at a target. Accuracy describes the closeness of the arrows to the bullseye at the target center. Arrows that strike closer to the bullseye are considered more accurate. Precision describes how tightly the arrows cluster together, wherever the cluster lies. The closer a system's measurements are to the accepted value, the more accurate the system is considered to be.

A common convention in science and engineering is to express accuracy and/or precision by means of significant digits. For example:
8 x 10^3 m stands for eight thousand meters (no margin stated)
8.0 x 10^3 m stands for a margin of 50 meters
8.00 x 10^3 m stands for a margin of 5 meters
8.000 x 10^3 m stands for a margin of 50 centimeters.

The accuracy can be said to be the “correctness” of a measurement, while precision could be identified as the ability to resolve smaller differences.

Normal distribution (Gaussian distribution) is a continuous, “bell”-shaped probability distribution with its peak at the mean:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

where:
$\mu$ is the mean; it specifies the position of the curve's central peak
$\sigma^2$ is the variance; it specifies the “width” of the curve. Some authors use its reciprocal, which is called precision. When the variance is equal to zero, the density function does not exist as an ordinary function: it is the Dirac delta function, equal to infinity at the mean and zero elsewhere.
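The bell-shaped density described above can be evaluated directly. The following is a minimal sketch, assuming the usual Gaussian density with mean `mu` and variance `sigma2`; the function names and the integration limits are illustrative choices.

```python
import math


def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Density of the normal distribution with mean mu and variance sigma2."""
    if sigma2 <= 0:
        # For zero variance the density degenerates to a Dirac delta.
        raise ValueError("variance must be positive")
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def area_under_pdf(mu, sigma2, half_width=10.0, steps=20000):
    """Trapezoidal estimate of the area between mu - half_width and mu + half_width."""
    a, b = mu - half_width, mu + half_width
    h = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma2) + normal_pdf(b, mu, sigma2))
    for i in range(1, steps):
        total += normal_pdf(a + i * h, mu, sigma2)
    return total * h


peak = normal_pdf(0.0)            # height of the standard normal curve at the mean
area = area_under_pdf(0.0, 1.0)   # should be very close to 1

print(f"peak = {peak:.6f}, area = {area:.6f}")
```

The peak sits at the mean, the curve is symmetric about it, and the numerical area is close to one, as required of a probability density.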
The distribution with mean 0 and variance 1 is called the standard normal distribution: the total area under the curve is equal to 1, and the ½ in the exponent makes the width of the curve (half the distance between the inflection points) also equal to one.

Normal distribution – probability density function

The normal distribution is often used to describe, at least approximately, variables that tend to cluster around the mean. The observational error in an experiment is usually assumed to follow a normal distribution, and the propagation of uncertainty is computed using this assumption. By the central limit theorem, under certain conditions the sum of a number of random variables with finite means and variances approaches a normal distribution as the number of variables increases. Of all probability distributions with a given mean and variance, the normal distribution is the one with the maximum entropy. The two estimators of the standard deviation, σ and s, differ in that s divides by (n-1) instead of n.

Central limit theorem

It states that under certain conditions the sum of a large number of random variables will have an approximately normal distribution. When the result is produced by a large number of small effects acting additively and independently, its distribution will be close to normal. This is why measurement errors in physical experiments are often assumed to be normally distributed. The importance of the central limit theorem cannot be overemphasized. Another practical consequence of the central limit theorem is that certain other distributions can be approximated by the normal distribution when the sample is large: binomial, Poisson, chi-squared and Student's t-distribution.
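The central limit theorem can be checked with a short simulation. This sketch sums independent uniform random numbers (an arbitrary choice of non-normal ingredient) and compares the result with what a normal distribution would give; the number of terms, the sample count and the seed are illustrative assumptions.

```python
import random
from statistics import mean, pstdev


def sums_of_uniforms(n_terms, n_samples, seed=0):
    """Return n_samples sums, each of n_terms independent uniform(0, 1) numbers."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.random() for _ in range(n_terms)) for _ in range(n_samples)]


sums = sums_of_uniforms(n_terms=30, n_samples=20000)
mu = mean(sums)
sd = pstdev(sums)

# For a normal distribution about 68% of the values lie within one SD of the mean.
within_one_sd = sum(abs(s - mu) <= sd for s in sums) / len(sums)

print(f"mean = {mu:.3f} (theory: 15), SD = {sd:.3f} (theory: {(30 / 12) ** 0.5:.3f})")
print(f"fraction within one SD = {within_one_sd:.3f}")
```

Each individual term is uniformly distributed, yet the sums cluster around their mean in the proportion expected for a bell curve.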
Mean

- arithmetic mean: the arithmetic average of the values; it is not necessarily the same as the middle value (median) or the most likely value (mode). For example, mean income is skewed upwards by a small number of people with very large incomes, so the majority have an income lower than the mean; the median income is the level at which half the population is below and half is above; the mode income is the most likely income and favors the larger number of people with lower incomes
- geometric mean: an average useful for sets of positive numbers that are interpreted according to their product and not their sum, e.g. rates of growth
- harmonic mean: an average useful for sets of numbers that are defined in relation to some unit, for example speed

For positive numbers, AM, GM and HM satisfy the inequalities AM >= GM >= HM.

The mean of a population and the sample mean:
- The sample mean is a random variable, not a constant.
- The sample mean may differ from the population mean, especially for small samples, but the law of large numbers dictates that the larger the sample, the more likely it is that the sample mean will be close to the population mean.

Variance

It describes how far values lie from the mean. A variance has units that are the square of the units of the variable itself. For this reason the standard deviation is more frequently used.

Standard deviation

It is the square root of the variance and is a widely used measure of variability or dispersion: it shows whether the data points tend to be very close to the mean or are spread out over a large range of values. Standard deviation is commonly used to measure confidence in statistical conclusions. SD is, unlike variance, expressed in the same units as the data. The population SD is valid for the complete population. The sample SD is valid for a random sample drawn from a larger population.

The bell curve.
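The three means and the two standard deviations discussed above are all available in Python's standard `statistics` module (Python 3.8 or later is assumed for `geometric_mean`). The data values below are hypothetical and chosen only to make the differences visible.

```python
from statistics import geometric_mean, harmonic_mean, mean, pstdev, stdev

# Hypothetical positive data, for illustration only.
values = [2.0, 4.0, 8.0]

am = mean(values)            # arithmetic mean
gm = geometric_mean(values)  # geometric mean, e.g. for growth rates
hm = harmonic_mean(values)   # harmonic mean, e.g. for speeds

population_sd = pstdev(values)  # divides by n
sample_sd = stdev(values)       # divides by n - 1

print(f"AM = {am:.4f}, GM = {gm:.4f}, HM = {hm:.4f}")
print(f"population SD = {population_sd:.4f}, sample SD = {sample_sd:.4f}")
```

For the same data the sample SD is always the larger of the two, because it divides the same sum of squares by n - 1 rather than n.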
In the bell-curve figure, each colored band has a width of one standard deviation; about 68%, 95% and 99.7% of the values lie within one, two and three standard deviations of the mean, respectively.

Confidence interval

Confidence intervals are used to indicate the reliability of an estimate. Increasing the desired confidence level will widen the confidence interval. A confidence interval is always qualified by a particular confidence level, usually expressed as a percentage; thus one speaks of a “95% confidence interval”. The endpoints of this interval are called confidence limits.

Formally, a 95% confidence interval means that, if the sampling and analysis were repeated under the same conditions, the interval would include the true value 95% of the time. This does not imply that the probability that the true value lies in one particular computed interval is 95%. (That interpretation holds for the so-called credible interval of Bayesian statistics.)

The calculation of a confidence interval requires assumptions about the nature of the estimation process, for instance that the population from which the sample came is normally distributed.

The desirable properties are:
- validity (the stated confidence level of the interval should hold)
- optimality (the rule for constructing the confidence interval should make as much use of the information in the data set as possible)
- invariance (independence of data-presentation coordinates)

An example: If one were to roll two dice and get a double six (which happens 1/36 of the time, or about 3%), few would claim this as proof that the dice were fixed, although statistically speaking one could have 97% confidence that they were. Similarly, the finding of a statistical link at 95% confidence is not proof, nor even very good evidence, that there is any real connection between the things linked.

Skewness

It is a measure of the asymmetry of the probability distribution of a variable.
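The text gives no formula for skewness, so the sketch below assumes one common definition, the moment coefficient of skewness (third central moment divided by the 1.5 power of the second central moment). The function name and the data are hypothetical.

```python
def sample_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (one common definition)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


# Hypothetical data: a symmetric set and a set with a long right tail.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 1.0, 2.0, 10.0]

skew_symmetric = sample_skewness(symmetric)
skew_right = sample_skewness(right_tailed)

print(f"symmetric data: {skew_symmetric:.3f}")
print(f"right-tailed data: {skew_right:.3f}")
```

A symmetric data set gives zero, while a long right tail (as with incomes) gives a positive value.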
Student's t-distribution

It is a continuous probability distribution that arises in the problem of estimating the mean of a normally distributed population when the sample is small. Student's distribution arises when (as in nearly all practical statistical work) the population standard deviation is unknown and has to be estimated from the data. Problems in which the t-distribution is not needed are generally of two kinds:
1. those in which the sample size is so large that one may treat a data-based estimate of the variance as if it were certain
2. those in which the problem of estimating the standard deviation is ignored

The probability density function of the t-distribution resembles the bell shape of a normally distributed variable with mean 0 and variance 1, except that it is a bit lower and wider. As the number of degrees of freedom grows, the t-distribution approaches the normal distribution with mean 0 and variance 1.

Probability density function

It is often necessary to fix the degrees of freedom at a fairly low value and estimate the other parameters taking this as given. Some authors report that values between 3 and 9 are often good choices.

Practical formulae

mean value:
$$x_m = \frac{1}{n}\sum_i x_i$$
standard deviation for a sample of the population:
$$S_x = \sqrt{\frac{\sum_i (x_i - x_m)^2}{n-1}}$$
confidence interval half-width:
$$\Delta x = t_\alpha(n-1)\,\frac{S_x}{\sqrt{n}},$$
where $t_\alpha(n-1)$ is the coefficient of Student's t-distribution.
The result, with the expected frequency $\alpha$ %, lies in the interval $x = x_m \pm \Delta x$.

The formulae and computation order:

Direct measurement
- mean value $x_m$
- standard deviation $S_x$
- confidence interval half-width $\Delta x$
- final result $x = x_m \pm \Delta x$

Indirect measurement of one variable, $y = f(x)$
- mean value $y_m = f(x_m)$
- standard deviation $S_y = |f'(x_m)|\,S_x$
- confidence interval half-width $\Delta y = |f'(x_m)|\,\Delta x$
- final result $y = y_m \pm \Delta y$

Indirect measurement of multiple variables.
For a function of many variables, $y = f(x_1, \ldots, x_k)$:
- mean values $x_{i,m}$ for $i = 1, \ldots, k$, and $y_m = f(x_{1,m}, \ldots, x_{k,m})$
- standard deviations $S_{x_i}$ for $i = 1, \ldots, k$
- confidence interval half-widths $\Delta x_i$ for $i = 1, \ldots, k$
- confidence interval half-width of the result:
$$\Delta y = \sqrt{\left(\frac{\partial f}{\partial x_1}\,\Delta x_1\right)^2 + \ldots + \left(\frac{\partial f}{\partial x_k}\,\Delta x_k\right)^2}$$
- final result $y = y_m \pm \Delta y$

TABLE OF STUDENT'S t-DISTRIBUTION (FOR THE CONFIDENCE LEVEL 95 %), r = n - 1

  r      t    |   r      t    |   r      t    |   r      t    |   r      t
  1   12.706  |   9    2.262  |  17    2.110  |  25    2.060  | 100    1.980
  2    4.303  |  10    2.228  |  18    2.101  |  26    2.056  |   ∞    1.960
  3    3.182  |  11    2.201  |  19    2.093  |  27    2.052  |
  4    2.776  |  12    2.179  |  20    2.086  |  28    2.048  |
  5    2.571  |  13    2.160  |  21    2.080  |  29    2.045  |
  6    2.447  |  14    2.145  |  22    2.074  |  30    2.042  |
  7    2.365  |  15    2.131  |  23    2.069  |  40    2.021  |
  8    2.306  |  16    2.120  |  24    2.064  |  60    2.000  |

Example

Figure: beam loaded by two forces P (labels in the drawing: a, l, b, h, M)

We have 4 measurements for loading and 4 measurements for unloading for both dial gauges, 16 measurements in total. We calculate the differences (the changes of the gauges' indications) and note the results:
1.06, 1.08, 1.03, 1.04, 1.06, 1.07, 1.08, 1.09, 1.05, 1.06, 1.05, 1.07, 1.07, 1.07, 1.05, 1.04, 1.06

We have the formulae: …

We calculate:
- mean deflection: … mm
- mean Young's modulus: … GPa
- standard deviation for the deflection measurement: $S_x = 0.01633$ mm
- standard deviation for the Young's modulus measurement: $S_E = 3.208$ GPa
- confidence interval: … GPa
- final result: … GPa
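The direct-measurement steps for the deflection readings can be sketched in Python. This assumes that the confidence interval half-width is t(n-1)·S_x/√n, which is how the practical formulae are read here, and it uses the readings exactly as they are listed above. That list contains 17 values although the text speaks of 16 measurements, so the computed standard deviation need not reproduce the quoted 0.01633 mm. The function name and the reduced lookup table are illustrative; the Young's modulus step is left out because its formula is not given.

```python
import math
from statistics import mean, stdev

# Two entries of the 95% Student's t table above, keyed by r = n - 1.
T_95 = {15: 2.131, 16: 2.120}


def direct_measurement(values, t_table):
    """Return (mean, sample SD, confidence interval half-width) for direct readings."""
    n = len(values)
    x_m = mean(values)
    s_x = stdev(values)               # sample SD, divides by n - 1
    t = t_table[n - 1]                # coefficient for r = n - 1 degrees of freedom
    delta_x = t * s_x / math.sqrt(n)  # assumed half-width formula
    return x_m, s_x, delta_x


# Changes of the gauge indications in mm, copied as listed in the text.
deflections_mm = [
    1.06, 1.08, 1.03, 1.04, 1.06, 1.07, 1.08, 1.09, 1.05,
    1.06, 1.05, 1.07, 1.07, 1.07, 1.05, 1.04, 1.06,
]

x_m, s_x, delta_x = direct_measurement(deflections_mm, T_95)
print(f"x = {x_m:.4f} +/- {delta_x:.4f} mm (S_x = {s_x:.5f} mm, n = {len(deflections_mm)})")
```

The coefficient is looked up from the number of readings actually supplied, so the same function works for a 16-value list as well.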