Chapter 3 - Numerically Summarizing Data
When you look at a distribution of data, consider three characteristics of the distribution:
shape, center and spread.
Shape – symmetric (bell-shaped or uniform), skewed right, or skewed left
Center – commonly called the average
Spread – the dispersion of the data
The center and spread are numerical summaries of the data. The most appropriate measure
of center and spread depends on the shape of the distribution.
Section 3.1 – Measures of Central Tendency
A measure of central tendency numerically describes the typical data value.
A measure of the center of data is a value at the center (or middle) of a data set.
Notation:
x = variable representing data values
n = total number of values in the sample
N = total number of values in the population (difficult to know)
∑ = sum or “add up”
Arithmetic Mean (also called mean) – A numerical average. Add the data values and
divide by the total number of values. The mean should be rounded to one more decimal
place than that in the raw data.
$\mu$ = population mean = $\frac{\sum x}{N}$ (parameter)
$\bar{x}$ = sample mean = $\frac{\sum x}{n}$ = sum of the data values divided by n (statistic)
Note: Greek letters are used for parameters
Roman letters are used for statistics
Median – the value that lies exactly in the middle of the data when arranged in ascending
order (from smallest to largest).
If there is an ODD number of values, the median is the middle value, at position $\frac{n+1}{2}$.
If there is an EVEN number of values, the median is the mean (average) of the two middle
values, at positions $\frac{n}{2}$ and $\frac{n}{2} + 1$.
Mode – The value that occurs most often in the data set. A set of data can have no mode,
one mode, or more than one mode (bimodal or multimodal). The mode can be found for
either quantitative or qualitative data.
Find the mean, median and mode for the data sets:
A) 11 8 13 12 11 9 3 14 13 15 13 16 12 12 12
B) 6 10 3 12 13 6 14 13 7 6 12 13 7 10 13 13 12 18
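The three measures for data sets A and B can be checked with a short Python sketch. It relies on Python's standard `statistics` module. The helper name `summarize` is made up for illustration and is not part of the course materials. The mean is rounded to one decimal place, following the rounding rule stated above.

```python
from statistics import mean, median, multimode

data_a = [11, 8, 13, 12, 11, 9, 3, 14, 13, 15, 13, 16, 12, 12, 12]
data_b = [6, 10, 3, 12, 13, 6, 14, 13, 7, 6, 12, 13, 7, 10, 13, 13, 12, 18]

def summarize(values):
    """Return (mean, median, list of modes) for a list of numbers."""
    # Raw data are whole numbers, so the mean is rounded to one decimal place.
    # multimode returns every value tied for the highest frequency.
    return round(mean(values), 1), median(values), multimode(values)

summary_a = summarize(data_a)
summary_b = summarize(data_b)
print("A:", summary_a)
print("B:", summary_b)
```

`multimode` returns a list so that bimodal or multimodal data show every mode. When no value repeats, it returns every value, which corresponds to the "no mode" case.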
Finding Mean and Median on the calculator:
1. Enter the data into a list (L1)
2. Select STATS
3. Arrow over to the menu labeled “CALC”
4. Highlight “1-VAR STATS” and push enter.
5. Type the name of the list (L1 or L2 , etc.)
6. Push enter and scroll down.
Note: The default list for 1-VarStats is L1
When a data set is skewed left or right, there are extreme values in the tail which pull the
mean in the direction of the tail. The median is resistant to the extreme values since the
extremes in the data do not substantially affect its value.
Relation Between the Mean, Median and Distribution Shape:
Distribution Shape     Mean vs. Median
Skewed Left            Mean smaller than Median
Symmetric              Mean equal to Median
Skewed Right           Mean greater than Median
#28 page 140 Identify the shape of the distribution. Which is the best measure of center?
Summary chart page 137.
Section 3.2 – Measures of Dispersion
Dispersion means the degree to which the data is spread out.
Range = largest data value – smallest data value (quantitative data; not resistant)
Variance is based on the deviation about the mean.
Variance of a sample = $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ or $s^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1}$
Notation:
n = the number of data values in the sample
$x_i$ = the individual data values
$\sum x_i^2$ means “square each data point, then add them all up”
$\left(\sum x_i\right)^2$ means “add up all of the data points, then square the result”
Standard Deviation is the average amount by which data values differ from their mean. It
is the square root of the variance.
Standard Deviation of a sample = $s = \sqrt{s^2}$
Note: The primary measure of variation, or spread, of data is the
standard deviation!
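As a rough check that the two forms of the sample variance give the same number, the sketch below codes each one directly and applies both to data set A from Section 3.1. The function names `variance_definition` and `variance_shortcut` are illustrative assumptions, not names from the text.

```python
from math import sqrt, isclose

def variance_definition(xs):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def variance_shortcut(xs):
    """Sample variance from the sum of squares and the square of the sum."""
    n = len(xs)
    sum_of_squares = sum(x ** 2 for x in xs)   # square each value, then add
    square_of_sum = sum(xs) ** 2               # add the values, then square
    return (sum_of_squares - square_of_sum / n) / (n - 1)

data = [11, 8, 13, 12, 11, 9, 3, 14, 13, 15, 13, 16, 12, 12, 12]
s2_def = variance_definition(data)
s2_short = variance_shortcut(data)
s = sqrt(s2_def)   # the standard deviation is the square root of the variance
print(round(s2_def, 2), round(s2_short, 2), round(s, 2))
```

The two results agree up to floating-point rounding. The second form is convenient for hand calculation because it does not require computing each deviation.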
Properties and Interpretations of Standard Deviation:
1. The standard deviation is a measure of variation of all values from the mean. We say the
standard deviation is the “typical deviation from the mean”. The larger the standard
deviation, the more dispersion the distribution has.
2. The value of the standard deviation is either zero or positive.
(Zero means that all of the data points are the same number)
3. Outliers cause the standard deviation to increase dramatically.
4. The units of the standard deviation are the same as the units of the original data points.
Sample Mean: $\bar{x}$                    Population Mean: $\mu$
Sample Variance: $s^2$                    Population Variance: $\sigma^2$
Sample Standard Deviation: $s$            Population Standard Deviation: $\sigma$
Find the range, variance, and standard deviation of the following data set:
(The energy usage in kwh for the last 9 months.)
2702  1754  1405  665  872  795  863  1878  3097
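The calculator results for the energy-usage data can be cross-checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n – 1 (sample) divisor. This is only a sketch for checking work, and the variable names are assumptions.

```python
from statistics import variance, stdev

# Energy usage in kWh for the last 9 months
kwh = [2702, 1754, 1405, 665, 872, 795, 863, 1878, 3097]

data_range = max(kwh) - min(kwh)   # largest value minus smallest value
s2 = variance(kwh)                 # sample variance (divides by n - 1)
s = stdev(kwh)                     # sample standard deviation = sqrt(s2)

print("range:", data_range)
print("variance:", s2)
print("standard deviation:", round(s, 1))
```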
Find the mean and standard deviation of the data set.
7.3 5.3 4.6 7.1 9.1 9.5 5.5 5.8 5.8 4.8 7.8 6.8 6.0 8.0 1.0 6.2
$\bar{x}$ =
s =
Change the first value of 7.3 to 97.3. Note the effect on the standard deviation.
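The effect of the changed value can be seen by computing both summaries side by side. The sketch below assumes the 16 values listed above and uses the standard `statistics` module. The list names are illustrative.

```python
from statistics import mean, stdev

original = [7.3, 5.3, 4.6, 7.1, 9.1, 9.5, 5.5, 5.8,
            5.8, 4.8, 7.8, 6.8, 6.0, 8.0, 1.0, 6.2]

# Replace only the first value (7.3 becomes 97.3); all other values stay the same
modified = [97.3] + original[1:]

for label, data in (("original", original), ("modified", modified)):
    print(label, "mean =", round(mean(data), 2), "s =", round(stdev(data), 2))
```

A single extreme value pulls the mean toward it and inflates the standard deviation, which is why neither statistic is resistant.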
#24 page 154
Find the standard deviation for each group.
Which color has more variation in the responses?
Participant   Reaction Time (BLUE)   Reaction Time (RED)
1             0.582                  0.408
2             0.481                  0.407
3             0.841                  0.542
4             0.267                  0.402
5             0.685                  0.456
6             0.450                  0.533
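One way to compare the two groups is to compute the sample standard deviation of each column. A minimal sketch, with the variable names chosen here for illustration:

```python
from statistics import stdev

# Reaction times for the six participants, by color
blue = [0.582, 0.481, 0.841, 0.267, 0.685, 0.450]
red = [0.408, 0.407, 0.542, 0.402, 0.456, 0.533]

s_blue = stdev(blue)
s_red = stdev(red)

# The group with the larger standard deviation has more variation
more_variable = "BLUE" if s_blue > s_red else "RED"
print("s (blue) =", round(s_blue, 3))
print("s (red)  =", round(s_red, 3))
print("more variation:", more_variable)
```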
Empirical Rule for a Bell-shaped distribution:
About 68% of all data values are within 1 standard deviation of the mean
($\mu - \sigma$ and $\mu + \sigma$)
About 95% of all data values are within 2 standard deviations of the mean
($\mu - 2\sigma$ and $\mu + 2\sigma$)
About 99.7% of all data values are within 3 standard deviations of the mean
($\mu - 3\sigma$ and $\mu + 3\sigma$)
Draw the picture on page 151
IQ scores of normal adults have a bell-shaped distribution with a mean of 100 and a
standard deviation of 15.
a) What percentage of adults have IQ scores between 55 and 145?
b) What percentage of adults have IQ scores greater than 145?
A sample of 100 values yields $\bar{x}$ = 50 and s = 5 and has a normal distribution (bell-shaped).
a) What percentage of these values would be expected to fall between 40 and 60?
b) What percentage of these values would be expected to fall between 45 and 55?
c) About 99.7% of the values will fall between what two numbers?
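The Empirical Rule questions above reduce to finding the interval "mean plus or minus k standard deviations" and reading off the matching percentage. The sketch below does that for the IQ example and for the sample with mean 50 and standard deviation 5. It assumes the distribution really is bell-shaped, and the names `EMPIRICAL_PERCENT` and `empirical_interval` are hypothetical helpers.

```python
# Approximate percentages from the Empirical Rule, keyed by the number of
# standard deviations k (only meaningful for bell-shaped distributions).
EMPIRICAL_PERCENT = {1: 68.0, 2: 95.0, 3: 99.7}

def empirical_interval(mean, sd, k):
    """Return (low, high, approximate percent of data inside) for k = 1, 2, or 3."""
    return mean - k * sd, mean + k * sd, EMPIRICAL_PERCENT[k]

# IQ example: mean 100, standard deviation 15
iq_low, iq_high, iq_pct = empirical_interval(100, 15, 3)
# By symmetry, half of the data outside the interval lies above the upper end
iq_above_high = round((100 - iq_pct) / 2, 2)

# Sample example: mean 50, standard deviation 5
sample_intervals = {k: empirical_interval(50, 5, k) for k in (1, 2, 3)}

print(iq_low, iq_high, iq_pct, iq_above_high)
print(sample_intervals)
```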
Chebyshev’s Inequality: (works for any data set, not just bell-shaped ones)
At least $\left(1 - \frac{1}{k^2}\right) \cdot 100\%$ of the data values will lie within k standard deviations of the mean.
(k > 1)
According to the US Census Bureau, the mean commute time to work for a resident of
Boston, MA is 27.3 minutes. Assume that the standard deviation of commute time is 8.1
minutes.
a) At least what percentage of commuters in Boston have a commute time within 2 standard
deviations of the mean?
b) At least what percentage of commuters have commute times between 3 and 51.6 minutes?
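Chebyshev's Inequality gives only a lower bound, so the commute-time questions are answered with "at least" percentages. The following sketch evaluates the bound for the Boston example. The function name `chebyshev_min_percent` is an assumption made for illustration.

```python
def chebyshev_min_percent(k):
    """Minimum percent of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's Inequality requires k > 1")
    return (1 - 1 / k ** 2) * 100

mean_commute, sd_commute = 27.3, 8.1

# a) within 2 standard deviations of the mean
pct_a = chebyshev_min_percent(2)

# b) between 3 and 51.6 minutes: first find how many standard deviations
#    each endpoint lies from the mean
k_low = (mean_commute - 3) / sd_commute
k_high = (51.6 - mean_commute) / sd_commute
k_b = round(min(k_low, k_high), 6)   # rounding absorbs floating-point noise
pct_b = chebyshev_min_percent(k_b)

print("a) at least", pct_a, "%")
print("b) k =", k_b, "so at least", round(pct_b, 1), "%")
```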
Section 3.3 – Measures of Central Tendency and Dispersion from Grouped Data
Weighted Mean = $\frac{\sum (x_i \cdot w_i)}{\sum w_i}$ = (sum of each value × its weight) ÷ (total weight)
On the calculator, enter values in L1, weights (as decimals) in L2, then type 1-VarStats(L1,L2)
#12 page 165
In Marissa’s biology class, 5% of the grade is attendance, 10% of the grade is quizzes, 60%
of the grade is tests, and 25% of the grade is the final exam. Marissa had a 100% average
on attendance, 93% for quizzes, 86% for exams, and 85% on the final exam. Determine
Marissa’s course average.
#14 page 165
Michael and Kevin want to buy nuts. They can’t agree on which kind to buy, so they create a
mix. They bought 2.5 pounds of peanuts for $1.30 per pound, 4 pounds of cashews for
$4.50 per pound, and 2 pounds of almonds at $3.75 per pound. Determine the price per
pound of the mix.
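Both exercises above are weighted means: in the first the weights are the grade percentages, and in the second the weights are the pounds of each kind of nut. A minimal sketch follows, in which the function name `weighted_mean` is an illustrative assumption.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the total weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

# Marissa's course average: scores weighted by each category's share of the grade
course_average = weighted_mean([100, 93, 86, 85], [0.05, 0.10, 0.60, 0.25])

# Nut mix: price per pound weighted by the number of pounds bought
price_per_pound = weighted_mean([1.30, 4.50, 3.75], [2.5, 4, 2])

print("course average:", round(course_average, 2))
print("price per pound:", round(price_per_pound, 2))
```

Dividing by the total weight means the weights do not have to add up to 1, which is why the same function handles both percentages and pounds.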
Section 3.4 – Measures of Position and Outliers
Z-score – the distance that a data value is from the mean in terms of the number of
standard deviations.
Negative z-scores indicate that the data point is below the mean.
Positive z-scores indicate that the data point is above the mean.
Sample: $z = \frac{x - \bar{x}}{s}$
Population: $z = \frac{x - \mu}{\sigma}$
ALWAYS ROUND Z-SCORES TO 2 DECIMAL PLACES!!
One method of determining whether a child is considered malnourished is by examining the
z-score represented by that child’s weight. A z-score < -2 means the child is malnourished,
and a z-score < -3 indicates severely malnourished.
If a population of girls has a mean of 44.5 pounds and a standard deviation of 7.76 pounds,
determine whether the weights below indicate a malnourished child.
a) weight = 34.5
z-score __________
Status _________________________
b) weight = 27.4
z-score __________
Status _________________________
c) weight = 19.5
z-score __________
Status _________________________
The average 20- to 29-year-old man is 69.6 inches tall, with a standard deviation of 3.0 inches,
while the average 20- to 29-year-old woman is 64.1 inches tall, with a standard deviation of
3.8 inches. Who is relatively taller, a 67.5-inch man or a 62-inch woman?
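The two z-score exercises above can be worked with one small function. The sketch below rounds every z-score to 2 decimal places, as required above. The function names and the label "not malnourished" (for a z-score of -2 or higher) are assumptions introduced here for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean, rounded to 2 decimals."""
    return round((x - mean) / sd, 2)

def nutrition_status(z):
    """Classify a weight z-score using the cutoffs of -2 and -3."""
    if z < -3:
        return "severely malnourished"
    if z < -2:
        return "malnourished"
    return "not malnourished"

# Girls' weights: population mean 44.5 pounds, standard deviation 7.76 pounds
results = {w: (z_score(w, 44.5, 7.76), nutrition_status(z_score(w, 44.5, 7.76)))
           for w in (34.5, 27.4, 19.5)}

# Relative height: the larger z-score marks the relatively taller person
z_man = z_score(67.5, 69.6, 3.0)
z_woman = z_score(62, 64.1, 3.8)
relatively_taller = "woman" if z_woman > z_man else "man"

print(results)
print(z_man, z_woman, relatively_taller)
```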
Percentiles – separate the sorted data into 100 equal parts, with 1% of the data values in
each group.
P13 separates the lower 13% from the upper 87%
P55 separates the lower 55% from the upper 45%
etc…
The kth percentile, Pk, of a set of data is a value such that k percent of the data points are
less than or equal to the value.
Recall that the median separates the sorted data into 2 equal parts, the lower 50% and the
upper 50%. The median is the 50th percentile.
Similarly, the value that is the 35th percentile separates the bottom 35% of a data set from
the upper 65%.
If a child is in the 98th percentile in height, it means that the child is taller than ______% of
children the same age.
The 80th percentile (P80) separates the upper ________% of the values from the lower
_______% of the values.
The ________ percentile is the cutoff point for the upper 90% of the values.
Quartiles – divide the sorted data into 4 equal parts, with 25% of the data values in each
group.
Q1 separates the lower 25% from the upper 75%
Q2 separates the lower 50% from the upper 50%
(also called the median)
Q3 separates the lower 75% from the upper 25%
Finding Quartiles:
1. Arrange the data in ascending order.
2. Find the median which is the second Quartile, Q2.
3. Find the median of data between the minimum value and the median. This is Q1.
4. Find the median of data between the median and the maximum value. This is Q3.
BEFORE CALCULATING PERCENTILES, THE DATA MUST BE SORTED LOW TO HIGH!!!
The following 25 data points are the sorted final grades of students who took a math course.
Calculate Q1, Q2, and Q3.
60 69 70 73 75 77 77 80 80 80 83 85
87 87 88 88 91 93 94 96 97 99 99 100 100
Quartiles can be found on your calculator using 1-VarStats (scroll down)
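The quartile steps above can be sketched in Python as a "median of each half" procedure. This sketch assumes that when the number of values is odd, the median itself is left out of both halves. Other software may use a different quartile convention and return slightly different values. The function name `quartiles_by_halves` is hypothetical.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    data = sorted(values)              # the data must be sorted low to high
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For odd n, skip the middle value so it belongs to neither half
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

grades = [60, 69, 70, 73, 75, 77, 77, 80, 80, 80, 83, 85,
          87, 87, 88, 88, 91, 93, 94, 96, 97, 99, 99, 100, 100]

q1, q2, q3 = quartiles_by_halves(grades)
print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3)
```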
The Interquartile Range (IQR) – the range of the middle 50% of the values in a data set.
IQR = Q3 – Q1
Interpretation: The more spread out a data set is, the higher the IQR will be.
The IQR is used to determine outliers (extreme values).
Outlier – An extreme observation (high or low)
Checking for Outliers by using Quartiles:
1. Calculate Q1 and Q3.
2. Calculate the IQR.
3. Determine the “fences”.
Fences are the cutoff point for determining outliers:
Lower fence = Q1 – 1.5(IQR)
Upper fence = Q3 + 1.5(IQR)
4. If a data value is less than the lower fence or greater than the upper fence, it
is an OUTLIER.
When a distribution of data is highly skewed or contains extreme observations, it is best to
use the interquartile range as the measure of dispersion because it is resistant.
Calculate the IQR for the last example (course grades).
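The outlier check can be sketched for the course grades as follows. The quartiles are taken as the medians of the lower 12 and upper 12 of the 25 sorted grades, which is an assumption about the quartile convention. The function names `fences` and `find_outliers` are illustrative.

```python
from statistics import median

def fences(q1, q3):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Values below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return [x for x in values if x < lower or x > upper]

grades = sorted([60, 69, 70, 73, 75, 77, 77, 80, 80, 80, 83, 85,
                 87, 87, 88, 88, 91, 93, 94, 96, 97, 99, 99, 100, 100])

# 25 values: the median is the 13th value, leaving 12 values in each half
q1 = median(grades[:12])
q3 = median(grades[13:])
iqr = q3 - q1
lower_fence, upper_fence = fences(q1, q3)
outliers = find_outliers(grades, q1, q3)

print("IQR =", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```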
Which statistics should I use for my data set?
If the distribution is SYMMETRIC, the best measure of central tendency is the
_________________,
and the best measure of dispersion is the ______________________.
If the distribution is SKEWED, the best measure of central tendency is the
_________________,
and the best measure of dispersion is the ______________________.
#22 page 174
The following data represent the hemoglobin for 20 randomly selected cats.
5.7   7.7   7.8   8.7   8.9   9.4   9.5   9.6   9.6   9.9
10.0  10.3  10.6  10.7  11.0  11.2  11.7  12.9  13.0  13.4
a) Compute the z-score corresponding to the hemoglobin of Blackie, 7.8 g/dL. Interpret this
result.
b) Determine the quartiles.
c) Compute the IQR.
d) Determine the lower and upper fences.
e) Are there any outliers, according to this criterion?
Additional examples: #20 page 174
#26 page 174
Section 3.5 – The Five-Number Summary and Boxplots
The 5-Number Summary – five statistics that describe a set of data.
1. Minimum value
2. Q1
3. Median (Q2)
4. Q3
5. Maximum value
Boxplot (also called Box-and-Whiskers plot) – a graphical representation of the 5-Number
Summary, incorporating the lower and upper fences.
Drawing a Boxplot:
1. Draw a box with vertical lines at Q1, Q2, and Q3.
2. Calculate the IQR.
3. Determine the lower and upper fences.
4. Draw brackets at the lower and upper fences.
5. Draw a horizontal line from Q1 to the smallest data value inside the lower fence.
6. Draw a horizontal line from Q3 to the largest data value inside the upper fence.
7. Label any outliers with asterisks (*). Outliers are the data values that are less than
the lower fence or greater than the upper fence.
Draw a Boxplot for the course grades example.
#8 page 181
Jessica enrolled in a course that promised to increase her reading speed. To help judge the
effectiveness of the course, Jessica measured the number of words per minute she could
read prior to enrolling in the course. She obtained the following five-number summary:
110 140 157 173 205. Use this information to draw a boxplot of the reading speed.
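Before drawing, the numbers needed for the boxplot can be computed from the five-number summary alone. The sketch below does this for Jessica's reading speeds. The function name `boxplot_from_summary` and the dictionary keys are assumptions made for illustration. Because only the summary is known, it can tell whether the minimum or maximum is an outlier but not where the whisker would end if one were.

```python
def boxplot_from_summary(minimum, q1, med, q3, maximum):
    """Compute the box, IQR, and fences from a five-number summary."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        "box": (q1, med, q3),                 # vertical lines of the box
        "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "min_is_outlier": minimum < lower_fence,
        "max_is_outlier": maximum > upper_fence,
    }

# Five-number summary of words per minute: min, Q1, median, Q3, max
parts = boxplot_from_summary(110, 140, 157, 173, 205)
print(parts)
```

If neither extreme is an outlier, the whiskers run from Q1 down to the minimum and from Q3 up to the maximum.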
Look at the graphs on page 179 to see the connection between a boxplot and the distribution
(shape) of the data set.
#4 page 181
#6 page 181