Chapter 1 Introduction to Statistics

Section 1.1 Fundamental Statistical Concepts

Objectives
• Explain the purpose of statistics.
• Decide what tasks to complete before you analyze your data.
• Distinguish between populations and samples.

What Is Statistics?
[Figure: a collection of individual HEIGHT values]

Descriptive Statistics
[Figure: HEIGHT values summarized by MIN, AVERAGE = 5'7.3'', and MAX]

Inferential Statistics
[Figure: HEIGHT values with MIN, AVERAGE = 5'7.3'', and MAX]

Defining the Problem
Before you begin any analysis, you should complete certain tasks.
1. Outline the purpose of the study.
2. Document the study questions.
3. Define the population of interest.
4. Determine the need for sampling.
5. Define the data collection protocol.

Cereal Example
[Figure: a box of Rise n Shine cereal, labeled 15 ounces]

Defining the Problem
The purpose of the study is to determine whether Rise n Shine cereal boxes contain 15 ounces of cereal.
The study question is whether the average amount of cereal in Rise n Shine boxes is equal to 15 ounces.

Population
[Figure: Rise n Shine boxes illustrating the population]

Sample
[Figure: Rise n Shine boxes illustrating a sample]

Simple Random Sampling
[Figure: Rise n Shine boxes illustrating simple random sampling]

Convenience Sampling
[Figure: Rise n Shine boxes illustrating convenience sampling]

Parameters and Statistics
Statistics are used to approximate population parameters.
                     Population Parameters   Sample Statistics
Mean                 $\mu$                   $\bar{x}$
Variance             $\sigma^2$              $s^2$
Standard Deviation   $\sigma$                $s$

Levels of Measurement
The two levels of measurement of data used in this course are
• continuous
• discrete.

Describing Your Data
The goals when you are describing data are to
• screen for unusual data values
• inspect the spread and shape of continuous variables
• characterize the central tendency
• draw preliminary conclusions about your data.

Process of Data Analysis
[Figure: flow from Population to Random Sample, to Describe, to Sample Statistics, to Make Inferences]

Section 1.2 Examining Distributions

Objectives
• Examine distributions of data.
• Explain and interpret measures of location, dispersion, and shape.
• Use the MEANS and UNIVARIATE procedures to produce summary statistics.
• Use the UNIVARIATE procedure to generate stem-and-leaf plots, box-and-whisker plots, normal probability plots, and histograms.

Cereal Data Set
[Figure: Rise n Shine data set with columns labeled WEIGHT and ID NUMBER]

Distributions
When you examine the distribution of values for the variable WEIGHT, you can find out
• the range of possible data values
• the frequency of data values
• whether the data values accumulate in the middle of the distribution or at one end.

Symmetric Distributions
[Figure: FREQUENCY versus WEIGHT for a symmetric distribution]

Skewed Distributions
[Figure: FREQUENCY versus WEIGHT for a skewed distribution]

Normal Distribution
[Figure: normal curve]

Examples of Normal Distributions
[Figure: normal curves with standard deviations of 1.5, 1.0, and 0.5]

Measures of Central Tendency
The mean is the balancing point of your data.
[Figure: the data values 15.02, 14.98, 15.01, 15.00, and 14.99, whose mean is 15.00]

Percentiles
[Figure: the 40th percentile on a FREQUENCY versus WEIGHT plot, with 40% of the values below it and 60% above it]

Measures of Dispersion
[Figure: two FREQUENCY versus WEIGHT distributions, each marked at 15.00]

Measures of Shape
[Figure: FREQUENCY versus WEIGHT for distributions that are skewed to the left, symmetric, and skewed to the right]

Measures of Shape
[Figure: light-tailed, normal, and heavy-tailed distributions]

The MEANS Procedure
    PROC MEANS DATA=SAS-data-set <options>;
       VAR variables;
    RUN;

The UNIVARIATE Procedure
    PROC UNIVARIATE DATA=SAS-data-set <options>;
       VAR variables;
       ID variable;
       HISTOGRAM variables / <options>;
       PROBPLOT variables / <options>;
    RUN;

Descriptive Statistics
This demonstration illustrates using the MEANS and UNIVARIATE procedures to calculate descriptive statistics for continuous variables.

Graphical Displays of Distributions
PROC UNIVARIATE produces three kinds of plots for examining the distribution of your data values:
• stem-and-leaf plots
• box-and-whisker plots
• normal probability plots.
PROC UNIVARIATE can also generate histograms and graphically enhanced normal probability plots.

Stem-and-Leaf Plots
    Stem Leaf
       9 01338
       8 0012347789
       7 0013455667799
       6 03568
       5 8
       4
       3 9
       2 0
       1 4
    Multiply Stem.Leaf by 10**1

Box-and-Whisker Plots
[Figure: box-and-whisker plot on a scale of 10 to 100, with the following elements labeled]
• upper whisker: maximum point within 1.5 IQ units of the box
• top of box: 75th percentile
• line inside box: 50th percentile (median)
• bottom of box: 25th percentile
• lower whisker: minimum point within 1.5 IQ units of the box
• 0: a point more than 1.5 IQ units from the box
• *: a point more than 3 IQ units from the box
The mean is denoted by +.

Normal Probability Plots
[Figure: five example normal probability plots, numbered 1 through 5]

Examining Distributions
This demonstration illustrates using PROC UNIVARIATE to generate stem-and-leaf plots, box-and-whisker plots, normal probability plots, and histograms.
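The two demonstrations in this section could be sketched along the following lines. This is only a minimal illustration, not the course's actual demonstration program: the data set name work.cereal_demo and the variable names id and weight are assumptions, and the five weights are the values shown on the Measures of Central Tendency slide rather than the full cereal data set.

```sas
/* Hypothetical demo data set: names are assumptions, and the five  */
/* weights are the values shown on the central tendency slide.      */
data work.cereal_demo;
   input id weight;
   datalines;
1 15.02
2 14.98
3 15.01
4 15.00
5 14.99
;
run;

/* Summary statistics for a continuous variable */
proc means data=work.cereal_demo n mean median std min max maxdec=3;
   var weight;
run;

/* Descriptive statistics plus distribution plots.                  */
/* PLOT requests the stem-and-leaf, box, and normal probability     */
/* plots; HISTOGRAM and PROBPLOT request the graphical versions.    */
proc univariate data=work.cereal_demo plot;
   var weight;
   id id;
   histogram weight / normal;
   probplot weight / normal(mu=est sigma=est);
run;
```

With only five observations the plots are not very informative; the sketch is meant to show how the statements from the syntax slides fit together, and a realistic run would point DATA= at the full data set.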
Section 1.3 Confidence Intervals for the Mean

Objectives
• Explain and interpret the confidence intervals for the mean.
• Explain the central limit theorem.
• Calculate confidence intervals using the MEANS procedure.

Point Estimates
[Figure: sample statistics shown as estimates of the corresponding population parameters]

Variability among Samples
[Figure: different samples yield different means, for example a mean of 15.02 and a mean of 15.03]

Standard Error of the Mean
A statistic that measures the variability of your estimate is the standard error of the mean. It differs from the sample standard deviation because
• the sample standard deviation deals with the variability of your data
• the standard error of the mean deals with the variability of your sample mean.

Confidence Intervals
[Figure: intervals labeled 95% Confidence and 5% Confidence]

Assumptions about Confidence Intervals
The types of confidence intervals in this course make the assumption that the sample means are normally distributed.

Distribution of Sample Means
[Figure: distribution of Weight compared with the distribution of Mean of Weight]

Normal Distribution
Useful Probabilities for Normal Distributions
[Figure: normal curve with areas of 68%, 95%, and 99% marked]

Confidence Intervals
[Figure: distribution of the sample means, with the central 95% marked around $\bar{x}$]

Central Limit Theorem
To satisfy the assumption of normality, you can either
• verify that the population distribution is approximately normal, or
• apply the central limit theorem.
The central limit theorem states that the distribution of sample means is approximately normal provided that the sample size is large enough.

Confidence Intervals
This demonstration illustrates calculating confidence intervals using PROC MEANS.

Section 1.4 Hypothesis Testing

Objectives
• Define some common terminology related to hypothesis testing.
• Perform hypothesis testing using the UNIVARIATE procedure.
• Compare the means of paired groups using the TTEST procedure.
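As a rough sketch of the second objective, a one-sample test of whether the mean weight equals 15 might look like the code below. The data set name work.cereal_demo and the variable names are assumptions, and the five weights are the values from the Measures of Central Tendency slide, not the course data. The PROC TTEST step is shown only as an alternative way to request a one-sided test; the course demonstration itself uses PROC UNIVARIATE.

```sas
/* Hypothetical demo data set (names are assumptions). */
data work.cereal_demo;
   input id weight;
   datalines;
1 15.02
2 14.98
3 15.01
4 15.00
5 14.99
;
run;

/* Two-sided test of H0: mean weight = 15.               */
/* MU0= sets the hypothesized value used in the          */
/* Tests for Location table.                             */
proc univariate data=work.cereal_demo mu0=15;
   var weight;
run;

/* Alternative: one-sided (lower) test of the same       */
/* hypothesis. H0= sets the hypothesized mean and        */
/* SIDES=L requests the lower-tailed alternative.        */
proc ttest data=work.cereal_demo h0=15 sides=l;
   var weight;
run;
```

Because these five values average 15.00, neither step should show evidence against the null hypothesis; the point is only where the hypothesized value and the direction of the test are specified.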
Judicial Analogy
• Hypothesis
• Significance Level
• Collect Evidence
• Decision Rule

Coin Example
[Figure: a sequence of coin tosses, H T T H H]

Coin Analogy
• Hypothesis
• Significance Level
• Collect Evidence
• Decision Rule

Types of Errors
You used a decision rule to make a decision, but was the decision correct?

                              ACTUAL
    DECISION         Fair Coin       Not Fair Coin
    Fair Coin        correct         Type II error
    Not Fair Coin    Type I error    correct

Modified Coin Experiment
Which coins are fair?

    Result                 p-value
    55 Heads, 45 Tails     p-value = .27
    40 Heads, 60 Tails     p-value = .04
    63 Heads, 37 Tails     p-value < .01
    15 Heads, 85 Tails     p-value < .01

Statistical Hypothesis Test
• Set Hypothesis: H0: equality; H1: difference
• Set Significance Level: $\alpha$
• Collect Data: Rise n Shine, 15 oz.
• Decision Rule: compare the p-value with $\alpha$

Comparing $\alpha$ and the p-Value
In general, you
• reject the null hypothesis if p < $\alpha$
• fail to reject the null hypothesis if p $\geq \alpha$.

Performing a Test of Hypothesis
To test the null hypothesis H0: $\mu = \mu_0$, SAS software calculates the t statistic
    $t = \frac{\bar{x} - \mu_0}{s_{\bar{x}}}$
where $s_{\bar{x}}$ is the standard error of the mean.

Two-Sided Test of Hypothesis
The test of hypothesis is two-sided if the null is rejected when the actual value of interest is either less than or greater than the hypothesized value.
    H0: $\mu = 15.00$
    H1: $\mu \neq 15.00$

Two-Sided Test of Hypothesis
[Figure: distribution of T over the range -3 to 3]

One-Sided Test of Hypothesis
In many situations, you are only interested in one direction. Perhaps you only want evidence that the mean is significantly lower than fifteen. For example, instead of testing
    H0: $\mu = 15$ versus H1: $\mu \neq 15$
you test
    H0: $\mu \geq 15$ versus H1: $\mu < 15$

One-Sided Test of Hypothesis
[Figure: distribution of T over the range -3 to 3]

Hypothesis Testing
This demonstration illustrates using PROC UNIVARIATE to perform hypothesis testing.

Paired Samples
[Figure: Sales BEFORE and Sales AFTER ADVERTISING]

The TTEST Procedure
    PROC TTEST DATA=SAS-data-set;
       CLASS variable;
       VAR variables;
       PAIRED variable*variable;
    RUN;

Paired t-Test
This demonstration illustrates using PROC TTEST to conduct a paired sample t-test.

Section 1.5 Two-Sample t-Tests

Objectives
• Recognize and validate the assumptions of a two-sample t-test.
• Analyze two populations with the TTEST procedure.

Cereal Example
[Figure: two cereal brands, Morning and Rise n Shine]

Comparing Two Populations
[Figure: population means $\mu_1$ (Morning) and $\mu_2$ (Rise n Shine)]

Assumptions
• independent observations
• normally distributed data for each group
• equal variances for each group.

F Test for Equality of Variances
    H0: $\sigma_1^2 = \sigma_2^2$
    H1: $\sigma_1^2 \neq \sigma_2^2$
    $F = \frac{\max(s_1^2, s_2^2)}{\min(s_1^2, s_2^2)}$

Test Statistics and p-Values
F Test for equal variances: H0: $\sigma_1^2 = \sigma_2^2$
    Variance Test:             F' = 1.51     DF = (3,3)   Prob > F' = 0.7446
t-Tests for equal means: H0: $\mu_1 = \mu_2$
    Unequal Variance t-test:   T = 7.4017    DF = 5.8     Prob > |T| = 0.0004
    Equal Variance t-test:     T = 7.4017    DF = 6.0     Prob > |T| = 0.0003

Test Statistics and p-Values
F Test for equal variances: H0: $\sigma_1^2 = \sigma_2^2$
    Variance Test:             F' = 15.28    DF = (9,4)   Prob > F' = 0.0185
t-Tests for equal means: H0: $\mu_1 = \mu_2$
    Unequal Variance t-test:   T = -2.4518   DF = 11.1    Prob > |T| = 0.0320
    Equal Variance t-test:     T = -1.7835   DF = 13.0    Prob > |T| = 0.0979

Testing for Equality of Means
This demonstration illustrates using PROC TTEST to test for the equality of means for two groups.

Section 1.6 Output Delivery System

Objectives
• Introduce the Output Delivery System (ODS).
• Examine some simple statements in ODS.
• Use ODS to capture some specific UNIVARIATE procedure output.
• Use ODS to generate a report in the HTML format.
• Use ODS to generate data sets with specific PROC UNIVARIATE output.

Output Delivery System
[Figure: a SAS procedure computes results; an output object is created in ODS; ODS converts the data component into a SAS data set]

ODS Statements
• TRACE provides information about the output object, such as the name and path.
• LISTING opens, manages, or closes the Listing destination.
• OUTPUT creates a SAS data set from an output object.

Output Delivery System
This demonstration illustrates the Output Delivery System by introducing some simple concepts and building on that knowledge.

Section 1.7 Exercises

Section 1.8 Chapter Summary

Section 1.9 Solutions to Exercises