
Descriptive Statistics for One Variable

Variables and Measurements
• A variable is a characteristic of an individual or object in which the researcher is interested. For example, the SAT score of a college student.
• For a particular individual or object, the variable takes a value called a measurement. For example, John’s SAT score is 720.

Different Types of Variables
• Some variables are quantitative, like the time a person takes to finish a task or the person’s age.
• Other variables are qualitative, like the person’s nationality or the person’s preferred sport.
• In these notes we will work with quantitative variables.
• All the measurements collected from individuals on a particular variable are referred to as “data”.
• Our data will contain the measurements for only one variable.

Statistics
Statistics has two major branches:
• Descriptive Statistics: provides numerical and graphic procedures to summarize the information in the data in a clear and understandable way.
• Inferential Statistics: provides procedures to draw inferences about a population from a sample.

Population and Samples
The population under study is the set of all individuals of interest for the research. In practice, the variable is measured only for a part of the population. The part of the population for which we collect measurements is called a sample. The number of individuals in a sample is denoted by n. In these notes and examples we assume that our data correspond to a sample of the population under study.

Descriptive Measures
• Central tendency measures. They give a “center” around which the measurements in the data are distributed.
• Variation or variability measures. They describe the “data spread”, that is, how far the measurements are from the center.
• Relative standing measures. They describe the relative position of a specific measurement in the data.
Measures of Central Tendency
• Mean: the sum of all measurements in the data divided by the number of measurements.
• Median: a number such that at most half of the measurements are below it and at most half of the measurements are above it.
• Mode: the most frequent measurement in the data.

Example of Mean

Measurement x    Deviation (x - mean)
3                -1
5                 1
5                 1
1                -3
7                 3
2                -2
6                 2
7                 3
0                -4
4                 0
Sum: 40           0

• Mean = 40/10 = 4
• Notice that the sum of the “deviations” is 0.
• Notice that every single observation enters into the computation of the mean.

Example of Median

Measurement x    Ranked x
3                0
5                1
5                2
1                3
7                4
2                5
6                5
7                6
0                7
4                7
Sum: 40          40

• Median = (4 + 5)/2 = 4.5
• Notice that only the two central values of the ranked data are used in the computation.
• The median is not sensitive to extreme values.

Example of Mode
Measurements x: 3, 5, 5, 1, 7, 2, 6, 7, 0, 4
• In this case the data have two modes: 5 and 7.
• Each of these measurements occurs twice.

Example of Mode
Measurements x: 3, 5, 1, 1, 4, 7, 3, 8, 3
• Mode: 3
• Notice that it is possible for a data set not to have any mode.

Measures of Variability
• Range
• Variance
• Standard Deviation

The Range
• Definition: the range of a data set is the difference between the largest and the smallest measurements in the data.
• To find the range, first order the data from least to greatest. Then subtract the smallest value from the largest value in the set.
• Example: A marathon race was completed by 7 participants. What is the range of the times, given in hours below?
  2.3 hr, 8.7 hr, 3.5 hr, 5.1 hr, 4.9 hr, 7.1 hr, 4.2 hr
  Ordering the data from least to greatest, we get: 2.3, 3.5, 4.2, 4.9, 5.1, 7.1, 8.7.
  So highest - lowest = 8.7 hr - 2.3 hr = 6.4 hr.
  Answer: The range of the race times is 6.4 hr.
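The examples above can be checked with a short Python sketch. It is only an illustration, not part of the original notes: it assumes Python 3.8 or later (for `statistics.multimode`), and the variable names are chosen here for readability.

```python
from statistics import mean, median, multimode

# Measurements from the mean, median, and first mode examples.
data = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

data_mean = mean(data)                       # 40 / 10 = 4
deviations = [x - data_mean for x in data]   # these always sum to 0
data_median = median(data)                   # average of the two central ranked values
data_modes = multimode(data)                 # every value tied for most frequent

# Measurements from the second mode example.
second_sample = [3, 5, 1, 1, 4, 7, 3, 8, 3]
second_modes = multimode(second_sample)

# Race times (in hours) from the range example.
times = [2.3, 8.7, 3.5, 5.1, 4.9, 7.1, 4.2]
time_range = max(times) - min(times)         # largest minus smallest

print("mean:", data_mean)
print("sum of deviations:", sum(deviations))
print("median:", data_median)
print("modes:", data_modes)
print("modes of second sample:", second_modes)
print("range (hr):", round(time_range, 1))
```

Note that `max` and `min` make the explicit sorting step unnecessary in code; ordering the data is mainly a convenience when working by hand.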
The Range is not Enough
Consider the following three data sets:
  1, 1, 1, 1, 8
  1, 2, 4, 6, 8
  1, 8, 1, 8, 1
In the three cases the range is the same: Range = 7. However, the three series exhibit completely different distributions of values along that range.

The Sample Variance
The variance takes into account the deviations of the data around the mean. The formula for the sample variance is:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

The standard deviation is the square root of the variance:

$$s = \sqrt{\text{Variance}} = \sqrt{s^2}$$

Notice that the mean and the standard deviation have the same unit as the measurements.

Variance (for a sample)
• Steps:
  – Compute each deviation
  – Square each deviation
  – Sum all the squares
  – Divide by the data size (sample size) minus one: n - 1

Example of Variance

Measurement x    Deviation (x - mean)    Square of deviation
3                -1                       1
5                 1                       1
5                 1                       1
1                -3                       9
7                 3                       9
2                -2                       4
6                 2                       4
7                 3                       9
0                -4                      16
4                 0                       0
Sum: 40           0                      54

• Variance = 54/9 = 6
• It is a measure of “spread”.
• Notice that the larger the deviations (positive or negative), the larger the variance.

The Standard Deviation
• It is defined as the square root of the variance.
• In the previous example:
  – Variance = 6
  – Standard deviation = square root of the variance = square root of 6 ≈ 2.45
• The standard deviation summarizes the deviations in one number.

Percentiles
• The p-th percentile is a number such that at most p% of the measurements are below it and at most (100 - p)% of the measurements are above it.
• Example: if the 85th percentile of a certain data set is 340, then about 15% of the measurements in the data are above 340 and about 85% of the measurements are below 340.
• Notice that the median is the 50th percentile.

Tchebichev’s Rule
The standard deviation can be used to construct an interval enclosing a large percentage of the data. In fact, this rule says that for any data set:
• At least 75% of the measurements differ from the mean by less than twice the standard deviation.
• At least 89% of the measurements differ from the mean by less than three times the standard deviation.
Note: This is a general property called Tchebichev’s Rule: for any k > 1, at least $1 - 1/k^2$ of the observations fall within k standard deviations of the mean. It is true for every data set.

Example of Tchebichev’s Rule
Suppose that for a certain data set:
• Mean = 20
• Standard deviation = 3
Then:
• At least 75% of the measurements are between 14 and 26
• At least 89% of the measurements are between 11 and 29

Further Notes
• When the mean is greater than the median, the data distribution is skewed to the right.
• When the median is greater than the mean, the data distribution is skewed to the left.
• When the mean and the median are very close to each other, the data distribution is approximately symmetric.

Empirical Rule (68-95-99.7 Rule)
For “normal distributions” (data sets whose histograms are bell-shaped or mound-shaped):
• Approx. 68% of values are within 1 standard deviation of the mean
• Approx. 95% of values are within 2 standard deviations of the mean
• Approx. 99.7% of values are within 3 standard deviations of the mean

Example of Empirical Rule
Suppose that the hourly wages of a certain type of worker have a “normal distribution” (bell-shaped histogram). Assume also that the mean is $14 with a standard deviation of $1.5. Then we have:
  1 standard deviation = $1.5
  2 standard deviations = $3.0
  3 standard deviations = $4.5
What does the empirical rule allow us to say?

Solution
The empirical rule allows us to say that:
• Approx. 68% of workers in this occupation earn wages that are within 1 standard deviation of the mean:
  – Between 14 - 1.5 and 14 + 1.5
  – Between $12.5 and $15.5
• Approx. 95% of workers in this occupation earn wages that are within 2 standard deviations of the mean:
  – Between 14 - 3 and 14 + 3
  – Between $11.0 and $17.0
• Approx. 99.7% of workers in this occupation earn wages that are within 3 standard deviations of the mean:
  – Between 14 - 4.5 and 14 + 4.5
  – Between $9.5 and $18.5
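The variance steps and the interval rules can be reproduced with a small Python sketch. It is an illustration under stated assumptions, not the notes' own method: the helper names `chebyshev_bound` and `k_sigma_interval` are hypothetical, and the wage intervals are centered at 14, the value used in the solution's arithmetic.

```python
from math import sqrt
from statistics import mean, variance

# Data from the variance example.
data = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

# Follow the four steps: deviations, squares, sum, divide by n - 1.
m = mean(data)
squared_deviations = [(x - m) ** 2 for x in data]
s2 = sum(squared_deviations) / (len(data) - 1)   # 54 / 9
s = sqrt(s2)                                     # standard deviation

# The standard library computes the same sample variance directly.
assert variance(data) == s2


def chebyshev_bound(k):
    """Minimum fraction of any data set within k standard deviations (k > 1)."""
    return 1 - 1 / k ** 2


def k_sigma_interval(center, sd, k):
    """Interval from center - k*sd to center + k*sd."""
    return (center - k * sd, center + k * sd)


print("variance:", s2)
print("standard deviation:", round(s, 2))

# Tchebichev example: mean 20, standard deviation 3.
for k in (2, 3):
    low, high = k_sigma_interval(20, 3, k)
    print(f"at least {chebyshev_bound(k):.0%} of measurements between {low} and {high}")

# Empirical rule example: wages centered at 14 with standard deviation 1.5.
for k, share in ((1, "68%"), (2, "95%"), (3, "99.7%")):
    low, high = k_sigma_interval(14, 1.5, k)
    print(f"approx. {share} of wages between {low} and {high}")
```

The Tchebichev bound holds for every data set, while the empirical-rule percentages are only approximations that rely on the histogram being roughly bell-shaped.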
