Topic 1 – Basic Statistics (KKNR Chapters 1-3)

Topic Overview
- Course Syllabus & Schedule
- Review: Basic Statistics
- Terminology: Being Able to Communicate
- Distributions: Normal, T, F
- Hypothesis Testing & Confidence Intervals
- Significance Level & Power

Time Requirements
Often the biggest complaint in this class is that the coursework takes too much time. As a general rule, if you have been staring at a problem for 10 minutes and are not getting very far, you should seek help: email, office hours, or an appointment. Ask your group-mates how to get started (but remember, please don't share homework solutions).

Collaborative Learning
Key premise: if you help each other and work together, you will learn more. This requires cooperation and some coordination on your part. Note: if something isn't working, or if you have suggestions for how this aspect of the course can improve, please tell me!

SAS Software
We'll use SAS software for the course. I'll generally provide template files with each lecture to help you learn the appropriate code. Hopefully working in groups will reduce the strain/stress people often feel working in SAS. You can email me questions about SAS – please send the CODE (*.SAS file) and the LOG (*.LOG file).

Questions?

Terminology of Statistics & Experimental Design
One of the most important components of a good statistical analysis is the ability to communicate your results to others.

What is Statistics?
"The scientific process of learning from available data and making decisions in the face of variability."
Good statistics involves:
- Unbiased collection of information.
- Using appropriate statistics to describe the information.
- Using models to help us interpret the data, make generalizations, and draw conclusions.

Key Concept
What is variability? Why is it useful to us? How do we use it to our advantage in trying to assess relationships among variables?
Data Collection
Several questions should be answered before data collection whenever possible:
- What is the response variable (the variable of interest)?
- What magnitude of change in response is important? (Statistical vs. practical significance.)
- What are the potential predictor variables?

Data Collection (2)
- How do we measure each variable?
- Are the variables continuous? Categorical?
- Are the variables experimental? Observational? Mixed?
- Are there any sources of BIAS?
- How many observations/replicates do we take, and what are the resources/costs involved?

Data Collection (3)
Unfortunately, I can't count how many times students have come to me AFTER data collection and tried to resolve these same questions. You can see how a little groundwork goes a long way in real practical experience.

Tradeoffs – Observational vs. Experimental Study
Observational:
- Easy to run, less expensive.
- Difficult to draw definitive conclusions (cause/effect).
- No control over the predictor variables.
Experimental:
- More difficult to run, more expensive.
- Can plan the experiment to obtain definitive conclusions from the model.
- Have control over the predictor variables, so can show cause and effect.

Classification of Variables
The type of analysis usually depends on classification. Let me repeat that: the type of analysis DEPENDS on classification!
Variables may be classified according to:
- TYPE: categorical (qualitative) or continuous (quantitative).
- DESCRIPTIVE ORIENTATION: response / predictor / nuisance (or control).

Categorical (Qualitative) Variables
- Nominal – variables are classified into categories that have no logical ordering. Examples: hair color, blood type, sex.
- Ordinal – variables are classified into categories that have some logical ordering. Examples: letter grade, agree/disagree survey, social class, age group.

Continuous (Quantitative) Variables
Variables take on numerical values for which arithmetic operations make sense. Continuous examples: height, water level, age. Sometimes counts (e.g.
# of defects) are treated as continuous.
Note: ANY response variable discussed in this class for regression or analysis of variance will be treated as continuous.

Descriptive Orientation
- Response (or dependent) variable: the variable to be described in terms of other variables. The objectives of your research determine which variable will be the response.
- Predictor (or independent) variables: variables of experimental interest used to describe the response variable.
- Nuisance (or control) variables: variables that may be associated with the response but are not of interest experimentally. Sometimes called covariates, confounders, or lurking variables.

Association vs. Cause
Association: some values of the response variable tend to occur more often with certain values of the predictor variable(s). Association does NOT always mean causation – watch out for lurking variables!
Example: deaths from heat-related illness are highest when more ice cream is sold. Lurking variable: __________

Summary: Issues to Consider When Designing an Experiment
- Statement of the problem.
- What is the response variable of interest? What qualifies as an "important" change in response?
- What predictors are available? Which do we expect to be important? What are their classifications?
- What is the experimental unit?
- Which variables, if any, have associated uncertainty or variation?
- Are there nuisance variables? Can we block on control variables?
- How many observations are to be taken? What are the resources/costs?

Collaborative Learning Activity: CLG #1.1-1.4
Experiments have been described for you. Please try (in groups) to answer all of the relevant questions from the previous summary slide.

CLG #1.1
A baseball scout is interested in determining variables that can be used to predict future power hitters from their minor league numbers. He has minor league data available on 200 current major leaguers.
He wants to relate these data to the number of home runs hit by the players in their 4th full season of major league ball.

CLG #1.2
To study the effects of new methods of germination for strawberry plants, a farm manager randomly (and equally) allocated 200 seeds to four germination methods (three new methods and a control). After 14 days, the weight of each plant was measured.

CLG #1.3
We want to study the effects of two newly developed instructional methods. Each method will be assigned to five different 3rd grade classrooms, and five more classrooms will be used as a control. Students will be given a test at the beginning of the semester and an equivalent test at the end of the semester. It will be assumed that the difference in their test scores reflects the amount they have learned.

CLG #1.4
Design your own study, perhaps related to some research you actually are doing or plan to do at some point in the future. You need a response variable, one or more predictor variables, and reason to believe they have an association.

Descriptive Statistics & Plots
Once we collect data, we need appropriate methods to analyze it. We often begin by considering descriptive statistics and plots for individual variables.

Describing the Information
You should recall most of these graphical and numerical summaries from STAT 501.
- Descriptive statistics – SAS: PROC UNIVARIATE or PROC MEANS.
- Different plots: scatter plots, histograms, etc. – SAS: PROC GPLOT (and other associated statements like SYMBOL, AXIS, etc.).

Examining Data
Once we have collected data and understand our different variables from a conceptual standpoint, we need to summarize the information in the data and use statistical procedures to draw relevant conclusions. These allow us to examine the distributional aspects of our variables. They also help us identify potential problems (e.g. outliers, gaps, heavy skewness).
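The course produces these summaries with SAS (PROC UNIVARIATE / PROC MEANS). As an illustrative stand-in only, the same numerical summaries can be sketched with Python's standard library; the data here are made-up numbers, not from the course.

```python
import statistics

# Hypothetical sample of plant weights (made-up values for illustration)
weights = [4.1, 3.8, 5.0, 4.6, 4.4, 3.9, 5.2, 4.7]

n = len(weights)
mean = statistics.mean(weights)      # measure of center
median = statistics.median(weights)  # robust measure of center
s = statistics.stdev(weights)        # sample SD (n - 1 denominator)
rng = max(weights) - min(weights)    # range = max - min

print(n, round(mean, 4), median, round(s, 4), round(rng, 2))
```

A histogram or scatter plot of the same data (PROC GPLOT in the course) would complete the picture; the numbers alone can hide gaps and outliers.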
Key Concept
A descriptive statistic is any single number computed from a data set and designed to describe a feature of the population. The true feature is known as a population parameter. Statistics based on a known sample are used to estimate the population parameters.

Statistics as Random Variables
Statistics are random variables! The observed value of a statistic is based on a particular sample, and different samples may result in different values. By looking at different samples, and the variation in our statistic, we can get an idea of the accuracy of our estimate.

Review of Random Variables
Since we are talking a lot about random variables and distributions of statistics, it seems appropriate to pause for a review of some of the more important and useful distributions.

Random Variables
Random variables are present in any situation that is probabilistic – where we cannot predict outcomes with certainty (i.e. there is variation). By understanding the underlying distribution of a statistic that is random, we may develop estimates for the population parameter that encompass the variability (i.e. confidence intervals).

Discrete Random Variables
- Have a finite or countable number of outcomes.
- The distribution is often represented pictorially using a histogram.
- Examples: Binomial, Poisson.

[Figure: binomial histogram]

Continuous Random Variables
- Have uncountably many possible outcomes.
- The distribution is represented by a density curve: probability is area under the curve, and the probability of any specific value is zero!
- Example: the normal distribution.

The Normal Distribution
Commonly denoted N(μ, σ), where μ is the mean and σ is the standard deviation.
[Figure: density curves for N(-1, var = 2) in green, N(3, var = 0.5) in cyan, and N(0, 1) in magenta]

Normal Distribution Features
- Symmetric and bell-shaped, centered at the mean.
- 68-95-99.7 rule: 68% of observations fall within one SD of the mean, 95% within two SDs, and 99.7% within three SDs.
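The 68-95-99.7 rule can be checked numerically. A minimal sketch using only the standard library (the standard normal CDF written via the error function, a standard identity):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF, Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Probability of falling within k standard deviations of the mean
for k in (1, 2, 3):
    p = norm_cdf(k) - norm_cdf(-k)
    print(k, round(p, 4))  # 0.6827, 0.9545, 0.9973
```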
A standard normal distribution has mean 0 and standard deviation 1. Compute probabilities by first standardizing and then using Table A-1.

Standardization
If X ~ N(μ, σ), then you can transform X to Z ~ N(0, 1) by subtracting the mean and dividing by the standard deviation:
Z = (X − μ) / σ
This process is called standardization. Once standardized, probabilities may be looked up in a table (Table A-1).

Example
If the height of a person in this class is normally distributed with a mean of 70 inches and a standard deviation of 3 inches, what is the probability that a randomly chosen person is less than 66 inches tall?
X ~ N(70, 3)
P(X < 66) = P(Z < (66 − 70)/3) = P(Z < −1.33) = 0.0918

Example (2)
With the same distribution, what is the probability that a randomly chosen person falls within 1.5 standard deviations of the mean?
P(70 − 1.5(3) < X < 70 + 1.5(3)) = P(−1.5 < Z < 1.5) = 0.9332 − 0.0668 = 0.8664

Example (3)
Approximately what percentage of people would be taller than 75 inches?
P(X > 75) = P(Z > (75 − 70)/3) = P(Z > 1.67) = 1 − 0.952 = 0.048
So we would say that about 4.8% are taller than 75 inches.

Example (4)
What height represents the 80th percentile?
z(0.8) = 0.84 (inverse reading of the table)
x(0.8) = μ + σ·z(0.8) = 70 + 3(0.84) = 72.52

Common Sampling Distributions
- Normal distribution
- T distribution
- Chi-square distribution
- F distribution

T Distribution
Used to describe a standardized random variable with unknown variance. Similar to Z, but slightly heavier in the tails, depending on the "degrees of freedom". If X is normal, then
T = (X̄ − μ) / (S/√n) ~ t(n−1)
If certain assumptions are satisfied, then the T distribution may also be used for statistics of the form
T = (θ̂ − θ) / SE(θ̂)
Example: the two-sample t-test.

[Figure: t distribution]

Chi-Square Distribution
Asymmetric; used to describe some nonnegative RVs (e.g. variances). Also used widely in categorical data analysis. Probabilities (based on DF) are in Table A-3.
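The four height examples above can be reproduced without a table using the standard library's `statistics.NormalDist` (Python 3.8+). Exact values differ slightly from the slides' answers because the table rounds z to two decimals:

```python
from statistics import NormalDist

height = NormalDist(mu=70, sigma=3)  # heights in inches, per the example

p_lt_66 = height.cdf(66)                        # P(X < 66)
p_within = height.cdf(74.5) - height.cdf(65.5)  # P(within 1.5 SD of the mean)
p_gt_75 = 1 - height.cdf(75)                    # P(X > 75)
pct_80 = height.inv_cdf(0.80)                   # 80th percentile

print(round(p_lt_66, 4), round(p_within, 4), round(p_gt_75, 4), round(pct_80, 2))
```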
Example: if we have a random sample from some normal population, then
(n − 1)S²/σ² = Σᵢ (Xᵢ − X̄)²/σ² ~ χ²(n−1)

[Figure: chi-square distribution]

F Distribution
Used to describe the ratio of independent variances:
(S₁²/σ₁²) / (S₂²/σ₂²) ~ F(n₁−1, n₂−1)
Used extensively in analysis of variance (ANOVA). Critical values based on the numerator DF, denominator DF, and significance level are found in Table A-4.
Special property: t²(ν, 1−α/2) = F(1, ν, 1−α)

Types of Descriptive Statistics
- Measures of center or location: mean, median.
- Measures of variability: range = max − min; interquartile range = Q3 − Q1; variance or standard deviation.

Measures of Location
The sample mean is the arithmetic average, denoted X̄ = (1/n) Σᵢ Xᵢ, where n is the sample size. The sample mean is much more sensitive to extreme values than the sample median (the middle value). By the CLT, when n is large enough, the sample mean is approximately normally distributed (a nice property). If the sample is not large, but the underlying population is normal, the standardized sample mean has a T distribution.

Measures of Variability
The deviation of the i-th observation is Xᵢ − X̄. The average (or sum) of the deviations in a sample is always zero. The average of the squared deviations is the sample variance:
S² = (1/(n−1)) Σᵢ (Xᵢ − X̄)²
We commonly divide the sum by n − 1 degrees of freedom so that the estimate remains unbiased.

Measures of Variability (2)
The sample standard deviation is S = √S². It is more convenient than the variance because it is measured in the same units as the observations. Both the variance and the standard deviation are sensitive to extreme observations.

Measures of Variability (3)
If the underlying distribution is normal, then the (scaled) sample variance is distributed as chi-square. Additionally, a ratio of variances (e.g. in ANOVA) will generally have an F distribution.
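The claim that dividing by n − 1 keeps the variance estimate unbiased can be seen in a quick simulation sketch (simulated N(0, 2) data, so the true variance is 4; the sample size and repetition count are arbitrary choices):

```python
import random

random.seed(1)
true_var = 4.0  # variance of the simulated population N(0, 2)
n, reps = 5, 20000

biased_sum = unbiased_sum = 0.0
for _ in range(reps):
    x = [random.gauss(0, 2) for _ in range(n)]
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)  # sum of squared deviations
    biased_sum += ss / n          # divide by n: biased low
    unbiased_sum += ss / (n - 1)  # divide by n - 1: unbiased

# Averaged over many samples, the n-1 version centers on the true variance 4,
# while the n version centers on (n-1)/n * 4 = 3.2.
print(round(biased_sum / reps, 2), round(unbiased_sum / reps, 2))
```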
Inference
- Point estimates
- Confidence intervals
- Hypothesis testing

Statistical Inference
Two general categories:
- Estimation (point estimates, confidence intervals).
- Hypothesis testing (which can often be accomplished using CIs).
Both estimation and hypothesis testing are based on the premise of repeated experiments and are used heavily in regression/ANOVA.

General Goal of Inference
We want to distinguish chance variation from true differences. Example: a standardized test is given to 500 Purdue students and 500 Notre Dame students, with average scores of 92 and 88, respectively. Is the difference in scores random chance, or does it truly reflect intellect?

Point Estimates
Generally speaking, we want to quantify some population parameter (e.g. the population mean μ). A point estimate is a single numerical summary from a random sample that is used to estimate a population parameter (e.g. X̄). We generally like the point estimate to be unbiased, that is, its expectation is the parameter of interest.

Point Estimates (2)
One problem with point estimates is that, by themselves, they give no indication of precision. A quick solution is to attach a standard error (standard deviation) to the point estimate; for example, we might report X̄ (± s). Another nice property for a point estimate to have is minimum variance: of all unbiased point estimators (and there can be many), it is the one with the smallest SD.

Confidence Intervals
When distributional assumptions can be made, we combine the point estimate with its standard error to construct a confidence interval. A fairly general form for a CI is:
Point Estimate ± (appropriate critical value) × (estimated standard error)
where the second term is the margin of error. Example for the mean:
X̄ ± t(n−1, 1−α/2) · s/√n

Example (KKMN #3.13)
A random sample of n = 32 persons attending a diet clinic was found, after 3 weeks, to have lost an average of X̄ = 30 pounds. The sample standard deviation was s = 11. We would like to construct a 95% CI for the true mean.
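This interval can be computed directly from the summary numbers. A sketch using only the standard library, with the critical value 2.04 taken from the course's t table (~30 df, 97.5th percentile) rather than computed:

```python
from math import sqrt

n, xbar, s = 32, 30.0, 11.0
crit = 2.04  # t critical value (~30 df, 97.5th percentile) from the table

se = s / sqrt(n)     # standard error of the mean
margin = crit * se   # margin of error
lo, hi = xbar - margin, xbar + margin
print(round(lo, 2), round(hi, 2))  # (26.03, 33.97)
```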
Example (2)
The point estimate for the mean is given. Assuming the underlying population is normal, and since n is small/medium and the SD is estimated, the standardized sample mean will have a T distribution. The critical value 2.04 is the 97.5th percentile of a T distribution with ~30 degrees of freedom (see tables). The standard error of the mean is s/√n = 11/√32.

Example (3)
X̄ ± t(30, 0.975) · s/√n = 30 ± 2.04 · 11/√32 = (26.03, 33.97)

What is a 95% CI?
There is often some controversy about what "95% confidence" means. The key to understanding is to remember two things:
- The population parameter, though unknown, is a FIXED number. It is not random – it does not have "probability" associated with it.
- The endpoints of the confidence interval are RANDOM. If you take a new sample, you'll get different endpoints.

What is a 95% CI? (2)
The confidence level gives the probability that the interval will cover the true value of the parameter. So for a 95% CI, in the long run 95% of the intervals will capture the parameter (see next slide). We expect 95 of 100 intervals to capture the parameter – but this does not mean exactly 95 of the next 100 trials will result in such success. The actual number that do would be a Binomial random variable.

Be Careful...
Avoid statements like "the probability that the true mean is in the CI is 0.95". The true mean either IS, or IS NOT, in the interval – and we have no real way of knowing whether we might have been unlucky in taking our sample. We know only that we will be unlucky about 5% of the time.

Significance Level
The significance level relates to the confidence level. It represents the chance that we will "get unlucky" in our sampling and fail to capture the population parameter with our interval; it reflects the risk we are willing to take. We usually set the significance level at 5%, meaning that we accept the reality that about 1 of every 20 samples will lead us to an interval that doesn't capture the parameter.
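The long-run coverage interpretation can be demonstrated by simulation. A sketch with simulated data (σ treated as known for simplicity, so the z critical value 1.96 applies; population and sample sizes are arbitrary choices):

```python
import random
from math import sqrt

random.seed(7)
mu, sigma, n, reps = 50.0, 10.0, 25, 4000
z = 1.96  # 97.5th percentile of N(0, 1)

covered = 0
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    margin = z * sigma / sqrt(n)
    if xbar - margin <= mu <= xbar + margin:
        covered += 1  # this interval captured the true mean

# The long-run fraction of intervals that capture mu is close to 0.95;
# any single interval either does or does not contain it.
print(covered / reps)
```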
More about significance level later...

Properties of CIs
The CI, by default, is the shortest interval for which we have 100(1 − α)% confidence that the interval covers the parameter of interest. A higher confidence level means a wider interval. The CI gives a region of "likely" values for the parameter. So if the CI does not contain some value, then we may say that there is evidence against the statement that the true parameter has that value (a hypothesis test).

Hypothesis Testing
General procedure to test whether the parameter is in a specific region:
1. Look at the data and check assumptions.
2. State the null and alternative hypotheses. The alternative is what you want to show.
3. Specify the desired significance level.
4. Specify the test statistic and its sampling distribution under the null.
5. Form a decision rule.
6. Compute the statistic and draw a conclusion.

Example: Two-Sample T-Test
H₀: μ₁ = μ₂ vs. H_A: μ₁ ≠ μ₂
Collect data – n₁ and n₂ observations:
x₁₁, x₁₂, ..., x₁ₙ₁ and x₂₁, x₂₂, ..., x₂ₙ₂, with group means
x̄ᵢ = (xᵢ₁ + xᵢ₂ + ... + xᵢₙᵢ) / nᵢ
Is the observed difference x̄₁ − x̄₂ unusual if μ₁ = μ₂?

Two-Sample T-Test (2)
Underlying assumptions:
1. Independent observations.
2. Equal variances.
3. Normally distributed observations.
If the assumptions are true, then under the null hypothesis the statistic T has a t-distribution with n₁ + n₂ − 2 degrees of freedom.

Two-Sample T-Test (3)
Use the statistic
T = (x̄₁ − x̄₂) / (Sₚ √(1/n₁ + 1/n₂))
where the pooled variance is
Sₚ² = [(n₁ − 1)S₁² + (n₂ − 1)S₂²] / (n₁ + n₂ − 2)
Reject H₀ if |T| is bigger than the corresponding critical value of the t-distribution for the desired significance level. This is how we quantify "unusual".

P-Values
We can also quantify "unusual" by considering a p-value. The p-value is the probability that a randomly drawn t is more extreme than the observed T. We reject the null hypothesis if the p-value is small (less than our chosen significance level α). This can be visualized as the red-shaded area on the next slide.
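The pooled two-sample statistic above can be sketched in code. The two groups here are hypothetical made-up data, not the course example; the result is the observed T and its degrees of freedom, to be compared against a t table:

```python
from math import sqrt

# Hypothetical measurements for two groups (made-up numbers for illustration)
g1 = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3]
g2 = [4.2, 4.6, 4.1, 4.5, 4.0, 4.4]

def mean(x):
    return sum(x) / len(x)

def sample_var(x):
    m = mean(x)
    return sum((xi - m) ** 2 for xi in x) / (len(x) - 1)  # n - 1 denominator

n1, n2 = len(g1), len(g2)
# Pooled variance: weighted average of the two sample variances
sp2 = ((n1 - 1) * sample_var(g1) + (n2 - 1) * sample_var(g2)) / (n1 + n2 - 2)
t = (mean(g1) - mean(g2)) / sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

print(round(t, 3), df)  # compare |t| to the t critical value with df degrees of freedom
```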
[Figure: p-value as the red-shaded tail area under the t density]

Example
In a study of strawberry growth, 20 seedlings were randomly and equally assigned to be grown in either normal or nitrogen-rich soil. The production from these 20 plants during their 2nd year is summarized in the table below. Can we conclude that the extra nitrogen enhances production?

Soil      n    Mean (lbs)   SD
Normal    10   0.623        0.107
Extra N   10   0.832        0.140

Example – Assumptions?
- Independent observations? No reason to believe otherwise.
- Equal variances? Maybe – of the three assumptions this might be the most questionable. If we assume it is okay, the pooled variance is
  S²pooled = [9(0.107)² + 9(0.140)²] / 18 ≈ 0.0155
- Normally distributed? Again, no reason to believe otherwise.

Example – Hypotheses
Assuming that our assumptions are okay, we want to test:
H₀: μ_extra = μ_norm vs. H_a: μ_extra > μ_norm
Our test statistic is then
T = (x̄_E − x̄_N) / (Sₚ √(1/10 + 1/10)) = (0.832 − 0.623) / √(0.0155 (1/10 + 1/10)) ≈ 3.75

Example – Conclusion
Under the null hypothesis this statistic has a t-distribution with 18 degrees of freedom. Checking the table, we see that the p-value is between 0.001 and 0.002. Since the p-value is smaller than α, we reject the null and conclude that extra nitrogen does enhance the crop. Note that if the p-value were bigger than α, we would fail to reject the null and conclude that there is not sufficient evidence to state that nitrogen enhances the crop.

Alternative Hypothesis Testing
Hypothesis testing can be accomplished very easily if we have already computed a confidence interval. If the confidence interval contains the value specified under the null hypothesis, you must fail to reject. If the confidence interval does not contain the value specified under the null, you reject the null and conclude the alternative.

Potential Errors in Testing
- Significance level
- Power & sample size

Type I and Type II Errors

                        True State of Nature
                        H₀ true                          H_A true
Conclude H₀     Correct result                   False negative (Type II error)
Conclude H_A    False positive (Type I error)    Correct result

Type I Errors
A Type I error occurs when we incorrectly reject the null hypothesis. The probability of a Type I error is
α = P(reject H₀ | H₀ is true)
We generally set α to 0.05. Note that the value of α we choose defines an "unusual result" that would cause us to reject our null hypothesis.

Multiple Hypothesis Tests
Alpha is set for a single test – but more often than not we will want to perform more than one test on a dataset. When there are multiple tests, the effective alpha, or overall Type I error rate (the probability of at least one Type I error), will be much higher than the alpha for a single test. There are several ways to control the overall Type I error rate; for now, we will consider only the Bonferroni correction. When making this correction for k tests, use α/k as the significance level for each test. This should be used only for planned comparisons and is often extremely conservative if k is large.

Type II Errors
A Type II error occurs when we incorrectly fail to reject the null hypothesis. The probability of a Type II error is
β = P(do not reject H₀ | H₀ is false)
The probability that we do NOT make a Type II error is called the power of the test.

Power & Sample Size
Generally speaking, the goal of any test is to detect a difference of size at least Δ = μ₁ − μ₂ with high probability. The choice of Δ is fairly subjective (what has practical significance?).

Balancing Significance Level & Power
The goal for any test is to have a small significance level and high power. Unfortunately, significance level and power are directly related: increasing the significance level increases the power, and decreasing the significance level decreases the power. Furthermore, there is little direct control over the power other than increasing the sample size, which can often be cost-prohibitive.
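The inflation of the overall Type I error rate, and what the Bonferroni correction buys, can be seen with a short calculation. This sketch assumes the k tests are independent (in general independence need not hold; Bonferroni itself does not require it):

```python
# Family-wise Type I error rate for k tests at level alpha, assuming
# independent tests, with and without the Bonferroni correction.
alpha, k = 0.05, 10

fwer_uncorrected = 1 - (1 - alpha) ** k     # each test run at level alpha
fwer_bonferroni = 1 - (1 - alpha / k) ** k  # each test run at level alpha / k

print(round(fwer_uncorrected, 3), round(fwer_bonferroni, 3))  # 0.401 0.049
```

With 10 uncorrected tests, the chance of at least one false positive is about 40%; splitting α across the tests pulls it back to roughly the nominal 5%.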
Collaborative Learning Activity: CLG #1.5-1.8
We may or may not have time for this in class, but you should take time to make sure you are able to do these problems.

Questions?

Upcoming in Topic 2...
Review: Simple Linear Regression. Related reading: Chapters 4-7.