Introduction to Statistics

Statistic [stuh-tis-tik], noun: a numerical fact or datum, especially one computed from a sample.

How long does the ball take to fall?
Measured values: see board.
• How do we decide which of these measured values is correct?
• How do we discuss the variation in our measurements?

Mean
Also known as “Average”. Add all results, and divide by the number of measurements.
Equation form:
$$\mu = \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Propagation of Uncertainty

Accuracy
Sources of inaccuracy:
• Broken measurement device
• Parallax
• Random error?
Low bias, high variability

Precision
Sources of imprecision:
• Multiple measurement methods
• Systematic error?
High bias, low variability

Variance and Standard Deviation
Squared deviation: how much variation is there from the mean?
$$s^2 = \frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \dots + (x_n-\bar{x})^2}{n-1} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$$
Variance: measures the squared distance of observations from the mean. Its square root, the standard deviation, is in the same units as the data:
$$\sigma \cong s = \sqrt{s^2}$$

Error
• Error is the difference between the measured and expected value.
• Error is how we make sense of differences between two measurements that should be the same.
• Error is NOT mistakes! If you made a mistake, do it again.

Types of Error

Descriptions
For a true mean, µ, and standard deviation, σ, the sample mean has an uncertainty equal to the standard deviation divided by the square root of the number of samples. This gives a measure of the reliability of the mean.
$$\Delta\bar{x} = \frac{\sigma}{\sqrt{N}}$$
The sample standard error tells you how close your sample mean should be to the true mean.
$$\Delta\bar{x} \cong \frac{s}{\sqrt{N}}$$

Using the Standard Error
This is the simplest way of using data to confirm or refute a hypothesis.
[Figure: an expected value inside the interval $\bar{x} \pm \Delta\bar{x}$ and one outside it]
This is also what is used to create the error bars.
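As a minimal sketch of the standard-error check described above, the following Python computes the mean, the sample standard deviation, and Δx̄ = s/√N, then tests whether an expected value falls inside x̄ ± Δx̄. The fall times and the function names are made up for illustration; they are not the values from the board.

```python
import math
import statistics


def standard_error(values):
    """Sample standard deviation (n - 1 denominator) divided by sqrt(N)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def within_standard_error(values, expected):
    """Return True if `expected` lies inside mean ± standard error."""
    mean = statistics.mean(values)
    return abs(mean - expected) <= standard_error(values)


# Hypothetical fall times in seconds (illustrative only).
fall_times = [0.62, 0.65, 0.63, 0.66, 0.64]

print("mean:", statistics.mean(fall_times))
print("standard error:", standard_error(fall_times))
print("0.64 s inside interval?", within_standard_error(fall_times, 0.64))
print("0.70 s inside interval?", within_standard_error(fall_times, 0.70))
```

With these made-up numbers the standard error is roughly 0.007 s, so an expected value of 0.64 s would count as confirmed and 0.70 s would not.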
[Figure labels: confirmed (value inside the interval), not confirmed (value outside the interval)]

Density Curve
Low values of the standard deviation indicate a small spread (all values close to the mean); high values indicate a large spread (all values far from the mean).

Normal Distribution
• Particularly important class of density curve
• Symmetric, unimodal, bell-shaped
• Mean, μ, is at the center of the curve
• Probabilities are the area under the curve
• Total area = 1

The Empirical Rule
In a normal distribution with mean μ and standard deviation σ:
• 68% of observations fall within 1σ of the mean
• 95% of observations fall within 2σ of the mean
• 99.7% of observations fall within 3σ of the mean
[Figure: normal curve with regions labeled F, D, C, B, A]

Example with data
Set of values: 2, 4, 4, 4, 5, 5, 7, 9
Mean:
$$\frac{2+4+4+4+5+5+7+9}{8} = 5$$
Standard deviation:
$$\sqrt{\frac{\sum(\text{mean} - \text{measurement})^2}{\#\ \text{samples}}} = \sqrt{\frac{(2-5)^2+(4-5)^2+(4-5)^2+(4-5)^2+(5-5)^2+(5-5)^2+(7-5)^2+(9-5)^2}{8}}$$
$$= \sqrt{\frac{3^2+1^2+1^2+1^2+0^2+0^2+2^2+4^2}{8}} = \sqrt{\frac{32}{8}} = 2$$
Dividing by the number of samples (8) gives exactly 2; dividing by n − 1 = 7 would give about 2.14.

Data Distribution
[Figure: data distribution, axis marked at 5−6, 5−4, 5−2, 5, 5+2, 5+4, 5+6]

Confidence Interval
[Figure: confidence interval, axis marked at 5−6, 5−4, 5−2, 5, 5+2, 5+4, 5+6]

Central Limit Theorem
If X follows a normal distribution with mean μ and standard deviation σ, then x̄ is also normally distributed, with mean μ and standard deviation σ/√n.
What if X is not normally distributed?
When sampling from any population with mean μ and standard deviation σ, when n is large, the sampling distribution of x̄ is approximately normal:
$$\bar{x} \sim N\!\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
As the number of measurements increases, the distribution of their mean approaches a normal (Gaussian) distribution.
$$P(x) \propto e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad P(\bar{x}) \propto e^{-\frac{(\bar{x}-\mu)^2}{2\sigma_{\bar{x}}^2}} = e^{-\frac{N(\bar{x}-\mu)^2}{2\sigma^2}}, \qquad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$$
Visit this webpage to play with the numbers:
http://www.intuitor.com/statistics/CLAppClasses/CentLimApplet.htm

Applications
• Simulated examples: dice rolling, coin flipping, etc.
• Exit polling

Non-normal Distributions

Central Limit Theorem Summary
For a large number N of samples, the distribution of the sample means will be
$$P(x) \propto e^{-x^2}$$
which is a normal distribution.
The normal distribution given by the CLT is independent of the type of distribution of the data.
Where else would this become problematic? Where can it still be used, but issues should be considered?

Questions?
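The dice-rolling application of the Central Limit Theorem mentioned above can be sketched in a few lines of Python. This is an illustrative simulation under assumed settings (30 rolls per sample, 2000 samples, a fixed seed), not the applet linked from the slides; the function name is hypothetical.

```python
import math
import random
import statistics


def sample_means(n_rolls, n_samples, rng):
    """Mean of `n_rolls` fair six-sided dice, repeated `n_samples` times."""
    return [
        statistics.mean([rng.randint(1, 6) for _ in range(n_rolls)])
        for _ in range(n_samples)
    ]


rng = random.Random(0)          # fixed seed so the run is repeatable
die_sigma = math.sqrt(35 / 12)  # standard deviation of one fair die

means = sample_means(30, 2000, rng)
print("mean of sample means:", statistics.mean(means))      # near 3.5
print("spread of sample means:", statistics.pstdev(means))  # near sigma/sqrt(30)
print("sigma / sqrt(n):", die_sigma / math.sqrt(30))
```

A single die is uniformly distributed, not normal, yet the sample means cluster around 3.5 with a spread close to σ/√n, which is the behavior the theorem predicts.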
Effective Statistics
You might have strong association, but how do you prove causation (that x causes y)?
Good evidence for causation: a well-designed experiment where all other variables that cause changes in the response variable are controlled.

The Scientific/Statistical Process
1. Formulate a scientific question
2. Decide on the population you are interested in
3. Select a sample
4. Observational study or experiment?
5. Collect data
6. Analyze data
7. State your conclusion

Ways to collect information from a sample
• Anecdotal evidence
• Available data
• Observational study
• Experiment

Sampling and Inference
[Figure: sampling goes from the population (μ, σ) to the sample (x̄, s); inference goes from the sample back to the population]

Some Cautions
• Statistics cannot account for poor experimental design.
• There is no sharp border between “significant” and “non-significant” correlation, only increasing and decreasing evidence.
• Lack of significance may be due to a poorly designed experiment.

Fit Tests
t-test, z-test, and χ² test

z-Test
• All normal distributions are the same if we standardize our data:
  • Units of size σ
  • Mean μ as center
• If x is an observation from a normal distribution, the standardized value of x is called the z-score.
• Z-scores tell how many standard deviations away from the mean an observation is.

z-test procedure
• To use: find the mean, standard deviation, and standard error.
• Use these statistics along with the observed value to find the Z value.
• Consult the z-score table to find P(Z) for the determined z.
Equation for hypothesis testing:
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

Example
Jacob scores 16 on the ACT. Emily scores 670 on the SAT. Assuming that both tests measure scholastic aptitude, who has the higher score?
The SAT scores for 1.4 million students in a recent graduating class were roughly normal with a mean of 1026 and standard deviation of 209.
The ACT scores for more than 1 million students in the same class were roughly normal with a mean of 20.8 and standard deviation of 4.8.
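One way to work the Jacob and Emily comparison is to standardize each score as a z-score, which for a single observation is (x − μ)/σ. The sketch below uses Python's standard-library `statistics.NormalDist` in place of a printed z-table; the function name `z_score` is an illustrative choice, and the scores, means, and standard deviations are the ones given in the example.

```python
from statistics import NormalDist


def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


jacob_z = z_score(16, 20.8, 4.8)    # ACT
emily_z = z_score(670, 1026, 209)   # SAT

for name, z in (("Jacob", jacob_z), ("Emily", emily_z)):
    # NormalDist().cdf(z) is the fraction of a standard normal below z.
    below = NormalDist().cdf(z)
    print(f"{name}: z = {z:.2f}, fraction scoring lower = {below:.3f}")
```

Jacob's z-score is about −1.0 and Emily's is about −1.7, so on the standardized scale Jacob has the higher score even though both are below their test's mean.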
Example Continued
Jacob (ACT): Score: 16; Mean: 20.8; Standard Dev.: 4.8
Emily (SAT): Score: 670; Mean: 1026; Standard Dev.: 209

Interpreting Results

“Backwards” z-test
What if we are given a probability, P(Z), and we are interested in finding the observed value corresponding to that probability?
• Find the Z-score
• Set up the probability (could be 2-sided): P(−z0 < Z < z0) =
• Convert the score to x by
$$x = z \times \left(\frac{\sigma}{\sqrt{n}}\right) + \mu$$

t Tests
Necessary assumptions for the t-test:
1. Population is normally distributed.
2. Sample is randomly selected from the unknown population.
3. The standard deviation of the unknown population is the same as that of the known population, so we can take the sample standard deviation as an estimate of the population standard deviation.
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

Probability that the fish are the same in both lakes
[Figure: “T Test Accumulating Data (N) Progressively”. Vertical axis: probability that fish populations are the same average length in each lake (0 to 1). Horizontal axis: # of samples included in analysis from each lake (1 to 121).]
This is typical of the kind of data many of you may generate. Let’s take a quick look at how this t-test is calculated from the data, using Excel.

z versus t procedures
• Use z procedures if you know the population standard deviation.
• Use t procedures if you don’t know the population standard deviation.
• Usually we don’t know the population standard deviation, unless told otherwise.

Central Limit Theorem

χ²-test (kai)

χ²-test (Goodness-of-fit) User's Guide
• The χ²-test tells us whether distributions of categorical variables differ from one another.
• Can be used to determine whether your data conforms to a functional fit.
• Compares multiple means to multiple expected values.
• Can only be used when you have multiple data sets that cannot be combined into one mean.
• Use when comparing means to expected values.

χ²-test
Xi is each individual mean
µi is each expected value
ΔXi = uncertainty in Xi
d = # of mean values
• The χ²/d table gives the probability that the data matches the expected values.
• In χ²/d, d is the count of independent measurements.
$$\chi^2 = \sum_{i=1}^{d} \frac{(X_i - \mu_i)^2}{\Delta X_i^2}$$

χ² (Goodness-of-fit) Test Procedure
• Find averages and uncertainty for each average.
• Calculate χ² using averages, uncertainties, and expected values.
• Count number of independent variables.
• Use table to find probability of fit accuracy based on χ²/d and number of independent variables (d).

Example
• Launch a bottle rocket with several different volumes of water.
• Measure height of flight multiple times for each volume.
• You decide you have a fit of:
$$y = 0.204\,V\ (\text{m/ml}) - 10^{-4}\,V^2\ (\text{m/ml}^2)$$
• Plot of fit with data on left.

Example
7 degrees of freedom
Probability of fit ≈ 50%
50% of the time, chance alone could produce a larger χ² value. No reason to reject fit.
• This does not mean that other fits might not match the data better, so try other fits and see which one is closest.

Interpreting Results
Probability is how similar data is to expected value.
Large P means data is similar to expected value.
Small P means data is different than expected value.

Summary
• Propagation of uncertainty
  • Mean
  • Accuracy vs. Precision
  • Error
  • Standard deviation
• Central Limit Theorem
• Fit Tests
  • z-test
  • t-test
  • χ²-test
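The χ² goodness-of-fit procedure from the bottle-rocket example can be sketched as follows. The fit coefficients come from the example, but the volumes, average heights, and uncertainties are invented purely for illustration, and the function names are hypothetical. The sketch stops at χ²/d; the probability would still be read from a χ²/d table.

```python
def chi_squared(means, expected, uncertainties):
    """Sum of squared, uncertainty-weighted deviations from the expected values."""
    return sum(
        ((x - mu) / dx) ** 2
        for x, mu, dx in zip(means, expected, uncertainties)
    )


def fit_height(volume_ml):
    """Fit from the bottle-rocket example: y = 0.204 V - 1e-4 V^2 (metres)."""
    return 0.204 * volume_ml - 1e-4 * volume_ml ** 2


# Hypothetical averages and uncertainties (illustrative only, not course data).
volumes = [100, 200, 300]            # ml
mean_heights = [20.4, 35.8, 53.2]    # m, average height at each volume
height_errors = [1.0, 1.0, 2.0]      # m, uncertainty in each average

expected_heights = [fit_height(v) for v in volumes]
chi2 = chi_squared(mean_heights, expected_heights, height_errors)
d = len(mean_heights)                # number of mean values

print("chi^2:", chi2)
print("chi^2 / d:", chi2 / d)
```

For these made-up numbers χ² comes out near 2.25 and χ²/d near 0.75, which is the pair of values one would then look up in the table.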