Handout 02
Descriptive statistics: describing a sample of data

There are several ways in which we can summarize, or describe, a set of data. Throughout the discussion in this handout, we will assume that we are describing a random sample of data from some larger population. Recall that numbers that describe a sample of data are called STATISTICS. The statistics serve as estimates of their corresponding population parameters.

The statistics that we use to describe a set of data depend on the type of data with which we are dealing. We can summarize categorical (or binary) data with the proportion, while we can summarize measurement data (discrete or continuous) with the mean, median, range, interquartile range, variance and standard deviation.

In summarizing a sample of data, we might be interested in describing the “center” of the data, or we might be interested in describing how the data vary. Statistics used to describe the center of the data are called MEASURES OF LOCATION, while statistics used to describe how the data vary are called MEASURES OF VARIABILITY. The following list enumerates the most commonly used statistics:

1. The MEAN (or AVERAGE) of a sample of measurements (or OBSERVATIONS) is obtained by simply “adding up the measurements, and dividing by the number of measurements in the dataset.” Notationally, the SAMPLE MEAN, denoted $\bar{x}$, is calculated using the following formula:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$

where $x_1, x_2, \ldots, x_n$ denote the measurements, $n$ is the number of measurements (called the SAMPLE SIZE), and $\sum$ means “add up.” The sample mean $\bar{x}$ is a measure of location that estimates the actual population mean $\mu$. The sample mean can be used to summarize discrete measurement data or continuous measurement data.

Examples:

Variable             Data                            Sample mean
Number of brothers   0, 2, 4, 1, 5                   (0 + 2 + 4 + 1 + 5)/5 = 12/5 = 2.4
Weight of females    135, 105, 112, 135, 128, 132    (135 + 105 + 112 + 135 + 128 + 132)/6 = 747/6 = 124.5

2.
Roughly speaking, the SAMPLE MEDIAN is the value that divides a sample of data into two equal halves. That is, 50% of the data lie below the median and 50% of the data lie above it. To calculate the sample median, we must first order the data. Then, if the number of observations $n$ is odd, the sample median is the middle observation; and if the number of observations $n$ is even, the sample median is the average of the two middle observations. The sample median is a measure of location that estimates the actual population median. The sample median can be used to describe discrete measurement data or continuous measurement data.

Examples:

Variable             Ordered data                    Sample median
Number of brothers   0, 1, 2, 4, 5                   2
Weight of females    105, 112, 128, 132, 135, 135    (128 + 132)/2 = 130

Similar to the sample median are the first quartile and the third quartile. The FIRST QUARTILE, denoted $Q_1$, is the value such that 25% of the data lie below the first quartile and 75% of the data lie above it. The THIRD QUARTILE, denoted $Q_3$, is the value such that 75% of the data lie below the third quartile and 25% of the data lie above it. So, the first quartile, the median, and the third quartile effectively divide a sample of data into quarters.

NOTE: The sample mean is affected by extreme observations, or OUTLIERS, while the sample median is largely unaffected by them. Therefore, in the presence of outliers, the median is the more appropriate measure of location.

3. The SAMPLE PROPORTION, denoted $\hat{p}$, is the “percentage” (expressed as a fraction) of observations in the sample having a certain trait. It is calculated by counting the number of observations in the sample having the trait and dividing by $n$, the total number of observations in the sample. The sample proportion, which estimates the actual population proportion $p$, is used to describe categorical data (including binary data).

Examples:

Variable                         Data                                           Sample proportion
Ever smoke? (1 = yes, 0 = no)    0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0    p̂(smokers) = 6/15 = 0.40
Class?                           F, So, So, J, F, Se, Se, J, F, Se, So, Se      p̂(F) = 3/12 = 0.25

4. The SAMPLE RANGE is the difference between the largest and smallest numbers in the sample. The sample range, which is a measure of variability, can be used to describe discrete measurement data or continuous measurement data.

NOTE: You should get into the habit of using the minimum and maximum to see if your data set contains any outliers. If you find an outlier, you should determine whether it is a data transcription or data entry error before continuing to analyze your data.

Examples:

Variable             Ordered data                    Sample range
Number of brothers   0, 1, 2, 4, 5                   5 - 0 = 5
Weight of females    105, 112, 128, 132, 135, 135    135 - 105 = 30

5. The SAMPLE INTERQUARTILE RANGE, denoted IQR, is the difference between the third quartile and the first quartile, i.e. $\mathrm{IQR} = Q_3 - Q_1$. Thus, the sample range measures the range of all of the data, while the sample interquartile range measures the range of the middle half of the data. The interquartile range, which is a measure of variability, can be used to describe discrete measurement data or continuous measurement data.

6. Roughly speaking, the SAMPLE VARIANCE, denoted $s^2$, measures the average squared amount by which the data points in the sample deviate from the sample mean. Therefore, the larger $s^2$, the more variable the data. Notationally, the sample variance is calculated using the following formula:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

where $x_1, x_2, \ldots, x_n$ denote the measurements, $\bar{x}$ is the sample mean, and $n$ is the sample size. The sample variance is a measure of variability that estimates the actual population variance $\sigma^2$. The sample variance can be used to summarize the variability of discrete measurement data or continuous measurement data.

NOTE: Because the deviations are squared in calculating the sample variance, $s^2$ is defined in terms of squared units.
That is, if your data are measured in pounds, then $s^2$ is expressed in pounds-squared. The SAMPLE STANDARD DEVIATION, denoted $s$, is simply the positive square root of the sample variance; its advantage is that it expresses variability in terms of the original units.

Example:

Variable             Data
Number of brothers   0, 2, 4, 1, 5

Sample variance:

$$s^2 = \frac{(0 - 2.4)^2 + (2 - 2.4)^2 + (4 - 2.4)^2 + (1 - 2.4)^2 + (5 - 2.4)^2}{5 - 1} = \frac{5.76 + 0.16 + 2.56 + 1.96 + 6.76}{4} = \frac{17.2}{4} = 4.3$$

Therefore, $s^2$ is 4.3 brothers-squared, and $s$ is the square root of 4.3, or 2.07 brothers.

NOTE: The sample variance is affected by outliers. If you change the 1 to 10 in the above example, the sample mean becomes 4.2 and the sample variance changes from 4.3 to 14.2!

RECALL that the goal is to use a statistic to ESTIMATE its corresponding parameter, or to use a statistic to TEST A HYPOTHESIS about the corresponding parameter. For example, a pharmaceutical company claims that its new pain reliever eliminates headaches in 90% of the people who use the drug. A medical doctor blindly tests the pharmaceutical company’s claim on a random sample of 100 of her patients. The headaches of only 52 of the patients, or 52%, were eliminated. Is it likely that a random sample would produce a sample proportion $\hat{p} = 0.52$ if the actual population proportion $p$ were 0.90? That is, does the random sample provide sufficient evidence to reject the pharmaceutical company’s claim?
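The worked examples in this handout can be checked numerically. The sketch below is one possible way to do so in Python, using only the standard library. The helper names `sample_range`, `sample_proportion` and `binomial_lower_tail` are hypothetical names introduced for illustration. For the closing question, the sketch assumes that, if the company's claim were true, the number of patients whose headaches are eliminated would follow a binomial distribution with n = 100 and p = 0.90; this is one common way to model such a question, not a method prescribed by the handout.

```python
from math import comb
from statistics import mean, median, variance, stdev

# Data copied from the handout's examples.
brothers = [0, 2, 4, 1, 5]
weights = [135, 105, 112, 135, 128, 132]
ever_smoke = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0]
class_year = ["F", "So", "So", "J", "F", "Se", "Se", "J", "F", "Se", "So", "Se"]


def sample_range(data):
    """Largest observation minus smallest observation."""
    return max(data) - min(data)


def sample_proportion(data, trait):
    """Fraction of observations equal to `trait`."""
    return sum(1 for x in data if x == trait) / len(data)


def binomial_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term (assumed model)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# statistics.variance and statistics.stdev divide by n - 1,
# matching the handout's formula for the sample variance.
results = {
    "mean, brothers": mean(brothers),
    "mean, weights": mean(weights),
    "median, brothers": median(brothers),
    "median, weights": median(weights),
    "proportion, smokers": sample_proportion(ever_smoke, 1),
    "proportion, class F": sample_proportion(class_year, "F"),
    "range, brothers": sample_range(brothers),
    "range, weights": sample_range(weights),
    "variance, brothers": variance(brothers),
    "std. deviation, brothers": stdev(brothers),
    # Same data with the 1 replaced by 10, to show the effect of an outlier.
    "variance, brothers with outlier": variance([0, 2, 4, 10, 5]),
    # Chance of 52 or fewer successes in 100 trials if p really were 0.90.
    "P(X <= 52 | n=100, p=0.90)": binomial_lower_tail(52, 100, 0.90),
}

for name, value in results.items():
    print(f"{name}: {value}")
```

Under the assumed binomial model, the last entry is the probability of seeing a result at least as unfavorable to the company as 52 out of 100 when the claim is true. A very small value would count as evidence against the claim.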