IQL Chapter 4 – Describing Data
Statistical Reasoning for Everyday Life, Bennett, Briggs, Triola, 3rd Edition
4.1 – What is Average?
Mean, Median and Mode
Definitions—Measures of Center in a Distribution
The mean is what we most commonly call the average value. It is found as follows:
mean = (sum of all values) / (total number of values)
The median is the middle value in the sorted data set (or halfway between the two middle values if
the number of values is even).
The mode is the most common value (or group of values) in a data set.
Rounding Rule for Statistical Calculations
State your answers with one more decimal place of precision than is found in the raw data.
Example: The mean of 2, 3, and 5 is 3.3333 . . . , which we round to 3.3. Because the raw data are whole
numbers, we round to the nearest tenth. As always, round only the final answer and not any
intermediate values used in your calculations.
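As a rough illustration of the three measures of center and the rounding rule, the sketch below uses Python's standard statistics module on the values 2, 3, and 5 from the example. The second list (mode_data) is a made-up data set, included only because every value in the first list occurs once.

```python
import statistics

data = [2, 3, 5]             # raw data from the rounding example (whole numbers)
mode_data = [1, 2, 2, 3, 3]  # hypothetical data with two values tied for most common

mean_value = statistics.mean(data)      # sum of all values / number of values
median_value = statistics.median(data)  # middle value of the sorted data
modes = statistics.multimode(mode_data) # every value tied for most common

# Rounding rule: the raw data are whole numbers, so report one decimal place,
# and round only the final answer.
print(round(mean_value, 1))  # 3.3
print(median_value)          # 3
print(modes)                 # [2, 3]
```

Here the mode is a group of values (2 and 3), which matches the "or group of values" wording in the definition.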
Effects of Outliers
Definition
An outlier in a data set is a value that is much higher or much lower than almost all other values.
In general, the value of an outlier has no effect on the median, because outliers don’t lie in the middle of
a data set. Outliers do not affect the mode either. (However, the median may change if we delete an
outlier, because we are changing the number of values in the data set.)
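The following sketch illustrates these statements with a small made-up data set (the numbers are assumptions chosen for illustration, not data from the text). It compares the mean, median, and mode when an outlier is made more extreme and when it is deleted.

```python
import statistics

# Hypothetical data: five ordinary values plus one outlier.
with_outlier = [4, 5, 5, 6, 7, 50]
more_extreme = [4, 5, 5, 6, 7, 500]  # same data, outlier made ten times larger
outlier_deleted = [4, 5, 5, 6, 7]    # outlier removed, so one fewer value

mean_a = statistics.mean(with_outlier)
mean_b = statistics.mean(more_extreme)
median_a = statistics.median(with_outlier)
median_b = statistics.median(more_extreme)
median_c = statistics.median(outlier_deleted)
mode_a = statistics.mode(with_outlier)
mode_b = statistics.mode(more_extreme)

print(mean_a, mean_b)      # the mean changes a great deal
print(median_a, median_b)  # 5.5 5.5 (value of the outlier has no effect)
print(median_c)            # 5 (deleting the outlier changes the count)
print(mode_a, mode_b)      # 5 5 (mode unaffected)
```

The mean is the only one of the three measures that moves when the outlier's value changes, which is why a single extreme value can make a reported "average" misleading.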
Confusion About “Average”
The word "average" can be confusing because it is often unclear whether it refers to the mean or the
median. In addition, reports often give too little information about how the average was computed, such
as whether the data include outliers. Consider what the "average" hourly wage at a company would be if
the company president's annual earnings were included.
Weighted Mean
Definition
A weighted mean accounts for variations in the relative importance of data values. Each data
value is assigned a weight and the weighted mean is
weighted mean = (sum of (each data value × its weight)) / (sum of all weights)
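A weighted mean is the sum of each data value times its weight, divided by the sum of the weights. The sketch below is one possible way to compute it; the function name weighted_mean and the course-grade numbers are hypothetical, chosen only for illustration.

```python
def weighted_mean(values, weights):
    """Sum of (value x weight) divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical course grade: homework counts 20%, midterm 30%, final exam 50%.
scores = [90, 80, 70]
weights = [20, 30, 50]
course_grade = weighted_mean(scores, weights)
print(course_grade)  # 77.0
```

With equal weights the result reduces to the ordinary mean, so the weighted mean can be seen as a generalization of it.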
Means with Summation Notation (Optional)
The symbol Σ (the Greek capital letter sigma) is called the summation sign and indicates that a set of
numbers should be added. We use the symbol x to represent each value in a data set, so we write the
sum of all the data values as
Sum of all values = Σx
If n is the total number of data values, the mean (written x̄) is
x̄ = Σx / n
Means and Medians with Binned Data (Optional)
The ideas of this section can be extended to binned data simply by assuming that the middle value in the
bin represents all the data values in the bin. For example, consider the following
table of 50 binned data values:
Bin      Frequency
0-6      10
7-13     10
14-20    10
21-27    20
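The midpoint assumption can be sketched in code as follows. This is a minimal illustration using the four bins and frequencies in the table; the variable names are assumptions, and the results are only approximations of the true mean and median because the individual values inside each bin are unknown.

```python
# Bins and frequencies from the table: (low, high, frequency).
bins = [(0, 6, 10), (7, 13, 10), (14, 20, 10), (21, 27, 20)]

midpoints = [(low + high) / 2 for low, high, _ in bins]
frequencies = [freq for _, _, freq in bins]
n = sum(frequencies)

# Mean: treat every value in a bin as if it equalled the bin's midpoint.
binned_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n

# Median: expand to n midpoint values (already in sorted order) and take the middle.
expanded = [m for m, f in zip(midpoints, frequencies) for _ in range(f)]
middle = n // 2
if n % 2 == 0:
    binned_median = (expanded[middle - 1] + expanded[middle]) / 2
else:
    binned_median = expanded[middle]

print(n, binned_mean, binned_median)  # 50 15.6 17.0
```

The 25th and 26th of the 50 values both fall in the 14-20 bin, so the approximate median is that bin's midpoint, 17.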
4.2 – Shapes of Distributions
Number of Modes
When describing the shape of a distribution, the number of modes refers to the number of peaks in the
visual display. This is similar to the quantitative mode in that the peaks usually correspond to the
highest counts in the data set, but here the term is used qualitatively.
Uniform distribution: no mode (no peaks)
Single-peaked (unimodal): 1 peak
Bimodal: 2 peaks
Trimodal: 3 peaks
SYMMETRY OR SKEWNESS
Definitions
A distribution is symmetric if its left half is a mirror image of its right half.
A distribution is left-skewed if its values are more spread out on the left side.
A distribution is right-skewed if its values are more spread out on the right side.
VARIATION
Definition
Variation describes how widely data are spread out about the center of a data set.
4.3 – Measures of Variation
WHY VARIATION MATTERS
In this section we look at quantitative measures of variation.
Waiting times at two banks:
        Big Bank (three lines)    Best Bank (one line)
        4.1                       6.6
        4.5                       6.7
        5.6                       6.7
        6.2                       6.9
        6.7                       7.1
        7.2                       7.2
        7.7                       7.3
        7.7                       7.4
        8.5                       7.7
        9.3                       7.8
        11                        7.8
AVG     7.14                      7.2
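The average wait is nearly the same at the two banks (7.14 at Big Bank versus 7.2 at Best Bank), so the
difference in customer satisfaction comes from the variation: waits at Big Bank run from 4.1 to 11, while
waits at Best Bank run only from 6.6 to 7.8.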
RANGE
Definition
The range of a set of data values is the difference between its highest and lowest data values:
range = highest value (max) - lowest value (min)
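To see the definition at work, the sketch below applies it to the waiting times from the two-bank example earlier in this section. The helper name value_range is an assumption made for this illustration.

```python
import statistics

# Waiting times from the two-bank example.
big_bank = [4.1, 4.5, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
best_bank = [6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]

def value_range(data):
    """Range = highest value (max) - lowest value (min)."""
    return max(data) - min(data)

big_mean = statistics.mean(big_bank)
best_mean = statistics.mean(best_bank)
big_range = value_range(big_bank)
best_range = value_range(best_bank)

# Round only for display, after all calculations are done.
print("Big Bank ", round(big_mean, 2), round(big_range, 1))
print("Best Bank", round(best_mean, 2), round(best_range, 1))
```

The two means come out at about 7.14 and 7.2, while the ranges are about 6.9 and 1.2, so the range captures the difference between the banks that the means hide.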
Quartiles and the Five-Number Summary
Quartiles are values that divide the data distribution into quarters.
Definitions
The lower quartile (or first quartile or Q1) divides the lowest fourth of a data set from the upper
three-fourths. It is the median of the data values in the lower half of a data set. (Exclude the middle
value in the data set if the number of data points is odd.)
The middle quartile (or second quartile or Q2) is the overall median.
The upper quartile (or third quartile or Q3) divides the lowest three-fourths of a data set from the upper
fourth. It is the median of the data values in the upper half of a data set. (Exclude the middle value in the data
set if the number of data points is odd.)
The Five-Number Summary
The five-number summary for a data distribution consists of the following five numbers:
low value
lower quartile
median
upper quartile
high value
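The quartile definitions above can be sketched as a small function. This is an illustration under the half-splitting rule stated in the definitions (the middle value is excluded from both halves when the number of data points is odd); the function name is hypothetical, and the data are the Big Bank waiting times from the start of this section.

```python
import statistics

def five_number_summary(data):
    """Return (low, Q1, median, Q3, high); assumes at least two data values."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower_half = values[:half]            # excludes the middle value when n is odd
    upper_half = values[half + (n % 2):]  # likewise skips the middle value
    return (
        values[0],
        statistics.median(lower_half),
        statistics.median(values),
        statistics.median(upper_half),
        values[-1],
    )

# Big Bank waiting times (11 values).
big_bank = [4.1, 4.5, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
summary = five_number_summary(big_bank)
print(summary)  # (4.1, 5.6, 7.2, 8.5, 11.0)
```

Other quartile conventions exist (for example, the standard library's statistics.quantiles interpolates between values by default), so software may report slightly different quartiles for the same data.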
Five-number summaries are typically displayed using a boxplot. The steps for drawing a boxplot are
given below.
Drawing a Boxplot
Step 1. Draw a number line that spans all the values in the data set.
Step 2. Enclose the values from the lower to the upper quartile in a box. (The thickness of the box has no
meaning.)
Step 3. Draw a line through the box at the median.
Step 4. Add “whiskers” extending to the low and high values.
PERCENTILES
Quartiles divide a data set into 4 segments. There are times when it is more useful to divide data sets
into more segments. Quintiles divide a data set into 5 segments, and deciles divide a data set into 10
segments. When the data set is large and you wish to divide it into 100 segments, you use
percentiles.
Definition
The nth percentile of a data set divides the bottom n% of data values from the top (100 - n)%. A data value
that lies between two percentiles is often said to lie in the lower
percentile. You can approximate the percentile of any data value with the following formula:
percentile of data value = (number of values less than this data value / total number of values in data set) × 100
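One common way to approximate the percentile of a data value is to take the number of values less than it, divide by the total number of values, and multiply by 100. The sketch below assumes that form of the approximation; the function name and the exam scores are hypothetical.

```python
def approximate_percentile(data, value):
    """Percent of data values that are strictly less than `value`."""
    if not data:
        raise ValueError("data must not be empty")
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)

# Hypothetical exam scores, used only to illustrate the calculation.
scores = [55, 60, 62, 68, 70, 75, 78, 81, 88, 95]
percentile_of_81 = approximate_percentile(scores, 81)
print(percentile_of_81)  # 70.0
```

Seven of the ten scores are below 81, so a score of 81 is at about the 70th percentile of this data set.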
STANDARD DEVIATION
Excel function: STDEV estimates the standard deviation, assuming that the arguments represent only a
sample of the total population. It takes the form =STDEV(number1,number2,…). Older versions of Excel
accept up to 30 arguments; Excel 2007 and later accept up to 255.
The standard deviation is a measure of how widely data values are spread around the mean of a
data set. The calculation is set forth below.
Calculating the Standard Deviation
To calculate the standard deviation for any data set:
Step 1. Compute the mean of the data set. Then find the deviation from the mean for every data value by
subtracting the mean from the data value. That is, for every data value,
deviation from mean = data value – mean
Step 2. Find the squares (second power) of all the deviations from the mean.
Step 3. Add all the squares of the deviations from the mean.
Step 4. Divide this sum by the total number of data values minus 1.
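Step 5. The standard deviation is the square root of this quotient.
Overall, these steps produce the standard deviation formula:
standard deviation = √( sum of (deviations from the mean)² / (total number of data values − 1) )
(This formula is shown in summation notation in the optional section at the end of 4.3.)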
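Steps 1 through 5 translate almost line for line into code. The sketch below is one way to write them out (the function name is an assumption), using the Best Bank waiting times from earlier in this section and cross-checking the result against the sample standard deviation in Python's statistics module.

```python
import math
import statistics

def sample_standard_deviation(data):
    """Follow Steps 1-5; requires at least two data values."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data values")
    mean = sum(data) / n                   # Step 1: mean
    deviations = [x - mean for x in data]  # Step 1: deviation = data value - mean
    squares = [d ** 2 for d in deviations] # Step 2: square each deviation
    sum_of_squares = sum(squares)          # Step 3: add the squares
    quotient = sum_of_squares / (n - 1)    # Step 4: divide by (number of values - 1)
    return math.sqrt(quotient)             # Step 5: square root

best_bank = [6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]
s = sample_standard_deviation(best_bank)
matches_library = math.isclose(s, statistics.stdev(best_bank))

print(round(s, 2))      # 0.44
print(matches_library)  # True
```

For these eleven values the squared deviations add up to 1.98, so the quotient is 0.198 and the standard deviation is about 0.44.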
INTERPRETING THE STANDARD DEVIATION
The Range Rule of Thumb is an approximation that allows for interpretation of the Standard Deviation.
The Range Rule of Thumb
The standard deviation is approximately related to the range of a distribution by the range rule of
thumb:
standard deviation ≈ range / 4
If we know the range of a distribution (range = high – low), we can use this rule to estimate the
standard deviation.
Alternatively, if we know the standard deviation, we can use this rule to estimate the low and high
values as follows:
low value ≈ mean – (2 × standard deviation)
high value ≈ mean + (2 × standard deviation)
The range rule of thumb does not work well when the high or low values are outliers.
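The rule can be used in both directions, as the sketch below tries to show. It takes the rule in its usual form (standard deviation ≈ range / 4); the function names are hypothetical, and the inputs are the low value (6.6), high value (7.8), and mean (7.2) of the Best Bank waiting times from earlier in this section.

```python
def sd_from_range(low, high):
    """Range rule of thumb: standard deviation is roughly range / 4."""
    return (high - low) / 4

def estimated_low_high(mean, sd):
    """Estimate the low and high values as mean -/+ 2 x standard deviation."""
    return mean - 2 * sd, mean + 2 * sd

estimated_sd = sd_from_range(6.6, 7.8)
low, high = estimated_low_high(7.2, estimated_sd)

print(round(estimated_sd, 2), round(low, 1), round(high, 1))  # 0.3 6.6 7.8
```

This is only a rough estimate: the five-step calculation on the same eleven values gives a standard deviation of about 0.44 rather than 0.3.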
STANDARD DEVIATION WITH SUMMATION NOTATION (OPTIONAL SECTION)
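The summation notation introduced earlier makes it easy to write the standard deviation formula in a
compact form:
s = √( Σ(x − x̄)² / (n − 1) )
Here x̄ is the mean of the data values and n is the total number of data values.
The symbol s is the conventional symbol for the standard deviation of a sample.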
For the standard deviation of a population, statisticians use the Greek letter σ (sigma), and the term n - 1
in the formula is replaced by n. Consequently, you will get slightly different results for the standard
deviation depending on whether you assume the data represent a sample or a population.
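The difference between the two versions can be checked numerically. The sketch below uses Python's statistics module, where stdev divides by n - 1 (sample) and pstdev divides by n (population); the eight data values are made up for illustration.

```python
import statistics

# Hypothetical data values, chosen so the population result is a round number.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sample_sd = statistics.stdev(data)       # divides by n - 1 (symbol s)
population_sd = statistics.pstdev(data)  # divides by n (symbol sigma)

print(round(sample_sd, 2), population_sd)  # 2.14 2.0
```

The squared deviations from the mean of 5 add up to 32, so the sample version is the square root of 32/7 and the population version is the square root of 32/8. The gap between the two shrinks as the number of data values grows.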
4.4 – STATISTICAL PARADOXES