 For categorical variables it is easy to draw the
distribution because each category is a natural “pile”.
We can use a bar chart or pie chart.
 For quantitative variables, data is often displayed using
either a histogram, stem and leaf plot or dot plot.
 Dot Plot – Places a dot along an axis for each case in
the data
 Good for small data sets
 Shows basic facts about the distribution
 Can be horizontal or vertical
 Here is a dot plot of the
winning times of the Kentucky
Derby in each race from 1875 to
the 2011 Derby.
 Basic facts about the
distribution:
 Easily see the fastest and
slowest times
 There are two clusters of points
 In 1896 the distance of the Derby race was changed from 1.5
miles to the current 1.25 miles
 How good was the 2012 U.S. women’s soccer team? With
players like Abby Wambach, Megan Rapinoe and Hope
Solo, the team put on an impressive showing en route to
winning the gold medal at the 2012 Olympics in London.
Here are the data on the number of goals scored by the
team in the 12 months prior to the 2012 Olympics.
1 3 1 14 13 4 3 4 2 5 2 0 4 1 3
4 3 4 2 4 3 1 2 4 2
Create a dot plot to represent the data.
 Draw a horizontal or vertical axis (a number line) and
label it with the variable name
 Scale the axis. Start by looking at the minimum and
the maximum values of the variable. Mark your axis
with tick marks
 Mark a dot above the location on the axis
corresponding to each data value.
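The steps above can be sketched in Python as a text dot plot. This is only an illustration and not part of the original slides: the function name `dot_plot` is made up for this example, and the data are the goal counts listed earlier.

```python
from collections import Counter

# Goals scored by the 2012 U.S. women's soccer team (values listed above).
goals = [1, 3, 1, 14, 13, 4, 3, 4, 2, 5, 2, 0, 4, 1, 3,
         4, 3, 4, 2, 4, 3, 1, 2, 4, 2]

def dot_plot(values):
    """Return one text row per axis position, with one dot per case."""
    counts = Counter(values)
    rows = []
    # Scale the axis from the minimum to the maximum and keep empty
    # positions, so that gaps in the data stay visible.
    for x in range(min(values), max(values) + 1):
        rows.append(f"{x:>2} | " + "." * counts[x])
    return rows

print("Goals scored")          # label the axis with the variable name
for row in dot_plot(goals):
    print(row)
```

The axis here runs vertically. The empty rows from 6 to 12 make the gap between the typical games and the two high-scoring games easy to see.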
 A histogram is useful when working with large sets of
data; a stemplot is better for smaller data sets
 Easier to make by hand
 Give a quick picture of the shape of a distribution
while including the actual numerical values in the
graph
 8 | 8 means a pulse of 88 bpm
 Take the tens place of the number and make that the
stem; the ones place of the number is the leaf
 8 | 000044 means four pulse rates of 80 and two of 84
 Conclusions:
 All of the numbers are even and all are
multiples of 4.
 Something you could have never seen in a
histogram
 Gives insight into how the data was collected
 Do you think the nurse counted the pulse for a full
minute, or counted for 15 seconds and multiplied by
4?
 How many pairs of shoes does a typical teenager have?
To find out, a group of AP Statistics students
conducted a survey. They selected a random sample of
20 female students from their school. Then they
recorded the number of pairs of shoes that each
respondent reported having. Here are the data:
50 38 15 26 13 51 26 50 31 13
57 34 19 23 24 30 22 49 23 13
 Steps:
 Separate each observation into a stem, and a leaf. Write
the stems in a vertical column with the smallest number
at the top, and draw a vertical line at the right of this
column. Do not skip any stems even if there is no data
value for a particular stem.
 Write each leaf in the row to the right of its stem.
 Arrange the leaves in increasing order out from the stem
 Provide a key that explains in context what the stems
and leaves represent.
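The steps above can be sketched in Python for the female shoe data. This is a minimal sketch and not part of the original slides; the function name `stemplot` is made up for this example, and it assumes non-negative whole numbers with the tens digit as the stem.

```python
from collections import defaultdict

# Pairs of shoes reported by the 20 female students (values listed above).
female_shoes = [50, 38, 15, 26, 13, 51, 26, 50, 31, 13,
                57, 34, 19, 23, 24, 30, 22, 49, 23, 13]

def stemplot(values):
    """Build stemplot rows: tens digit as the stem, ones digit as the leaf."""
    leaves = defaultdict(list)
    for v in sorted(values):          # sorting puts the leaves in increasing order
        leaves[v // 10].append(v % 10)
    # Walk every stem from smallest to largest so that no stem is skipped.
    stems = range(min(leaves), max(leaves) + 1)
    return [f"{s} | " + "".join(str(d) for d in leaves[s]) for s in stems]

for row in stemplot(female_shoes):
    print(row)
print("Key: 2 | 6 means 26 pairs of shoes")
```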
 The AP Statistics students also collected data from a
random sample of 20 male students at their school. Here
are the numbers of pairs of shoes reported by each male in
the sample:
14 7 6 5 12 38 8 7 10 10 10 11 4 5 22 7
5 10 35 7
 What would happen if we tried the same approach as
before with creating a stem plot?
 There are other ways to create your graph so you can see a
clearer picture of the shape and distribution.
 To get more stems and a better picture we can split the
stems, so 0 to 4 are placed on one stem and 5 to 9 are placed
on another stem.
 We can also use a back to back stem plot to compare the
number of shoes that males and females have.
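Splitting the stems can be sketched in Python for the male shoe data. This is an illustration only; the function name `split_stemplot` is made up for this example, and it assumes non-negative whole numbers.

```python
# Pairs of shoes reported by the 20 male students (values listed above).
male_shoes = [14, 7, 6, 5, 12, 38, 8, 7, 10, 10, 10, 11, 4, 5, 22, 7,
              5, 10, 35, 7]

def split_stemplot(values):
    """Stemplot with each stem split in two: leaves 0-4, then leaves 5-9."""
    rows = {}
    # Create every (stem, half) row up front so empty rows are not skipped.
    for stem in range(min(values) // 10, max(values) // 10 + 1):
        for half in (0, 1):
            rows[(stem, half)] = ""
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[(stem, leaf // 5)] += str(leaf)   # half 0 holds 0-4, half 1 holds 5-9
    return [f"{stem} | {leaves}" for (stem, _), leaves in rows.items()]

for row in split_stemplot(male_shoes):
    print(row)
print("Key: 2 | 2 means 22 pairs of shoes")
```

With only four unsplit stems most of the values would pile up on stems 0 and 1. The eight split rows show the right skew and the three large values more clearly.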
 When you describe a distribution you should always tell
about four key things:
 Shape
 Center
 Spread
 Outliers
 Concentrate on the main features. Look for major
peaks, not for minor ups and downs. Look for clusters,
obvious gaps and potential outliers.
 Look for symmetry and skewness:
 A distribution is roughly symmetric if the right and left
sides of the graph are approximately mirror images of
each other
 A distribution is skewed to the right if the right side of
the graph (the larger values) is much longer than the left
side. It is skewed to the left if the left side of the graph is
much longer than the right side.
 **The direction of skewness is the direction of the long
tail, not where most observations are clustered!
Skewed left
Skewed right
Unimodal
Bimodal
Uniform
Multimodal
 Do any unusual features stick out?
 Always mention any stragglers, called outliers, that
stand off away from the body of the distribution
 Example: If you're collecting data on nose lengths and
Pinocchio is in the group you would definitely want to
mention that.
 An outlier can be the most informative part of your
data, or it might just be an error.
 Here is a stemplot of the percents of
residents aged 65 and older in the 50
states and the District of Columbia. The
stems are whole percents and the leaves
are tenths of a percent.
 The low outlier is Alaska. What percent
of Alaska residents are 65 or older?
 Ignoring the outlier, the shape of the
distribution is…?
 The center of the distribution is close
to…?
 Slice up the possible values into equal-width
intervals called bins
 Histogram displays the bin counts as the
height of the bars (like a bar chart)
 Unlike a bar chart, the bars in a histogram
touch one another. An empty space
between bars represents a gap in data values
 If a value falls on the border between two
bars, it is placed in the bin on the right.
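The binning rule can be sketched in Python. This is a minimal sketch under stated assumptions: the function name `bin_counts` and the sample values are hypothetical and chosen only to show the border rule; they are not the earthquake data.

```python
import math

def bin_counts(values, start, width, n_bins):
    """Count values in equal-width bins [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        # floor() sends a value on a shared border to the bin on the right.
        i = math.floor((v - start) / width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

# Hypothetical values; 2, 4 and 6 each sit exactly on a border between bins.
sample = [1, 2, 2, 3, 4, 4, 4, 6, 9]
print(bin_counts(sample, start=0, width=2, n_bins=5))   # [1, 3, 3, 1, 1]
```

Whole numbers are used here on purpose. With decimal bin widths such as 0.2, floating-point rounding can push a border value into the neighboring bin, so software results near the borders should be checked.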
 Magnitudes (on the Richter scale) of the 1,318
earthquakes in the NGDC data:
 Conclusions:
 What is the interval of each bar?
 Count the number of bars in between each number and you will
see the interval is 0.2
 The tallest bar says there were about 200 earthquakes
between 7.0 and 7.2
 Earthquakes typically have magnitudes around 7, most are
between 5.5 and 8.5 some are as small as 4 or as large as 9
 Ones in Japan and Sumatra were some of the biggest ever
recorded.
 Replace the counts on the vertical axis with percentage
of total number of cases.
 Don’t confuse histograms with bar graphs.
 Histograms display quantitative variables. The horizontal
axis is marked in the units of measurement of the variable
 A bar graph is used to display categorical variables. The
horizontal axis identifies the categories being compared
 Outlier – there are three cities in the leftmost bar
 Are there any gaps in the distribution? Gaps help us
see multiple modes and encourage us to notice when
the data may come from different sources or contain
more than one group.
A credit card company wants to see how much
customers in a particular segment of their market use
their credit card. They have provided you with data on
the amount spent by 500 selected customers during a
3-month period and have asked you to summarize the
expenditures. Of course, you begin by making a
histogram.
Question: Describe the
shape of this distribution.
Answer:
 Shape: The distribution is unimodal and skewed to the
right, toward the high end of spending
 Unusual features: There is an extraordinarily large value
at about $7,000, and some values are negative.
 Let’s return to the tsunami earthquakes. Let’s look at
just 25 years of data.
 207 earthquakes
 Occurred from 1987 to 2011
 The data is symmetric
so it is easy for us to see
the center
 Median: the middle value that divides the histogram
into two equal areas
 The median has the same units as the data, be sure to
include the units whenever you talk about median.
 For this example, there are 207 earthquakes, so the
median is found at the (207 + 1)/2 = 104th place in
the sorted data.
 The median earthquake
magnitude is 7.2
 Finding the median of a batch of n numbers:
 Order the values
 If n is odd, the median is the middle value
 Mathematically we find this position using (n + 1)/2
 If n is even, there are two middle values
 The median is the average of the two values
 The positions are found by n/2 and n/2 + 1
 Suppose the data has these values:
14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8
 First, order the values:
-17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8
 Since n is 7 (odd), the median is the (7+1)/2 = 4th value,
which counting from either end is 13.9
 Notice there are 3 values lower and 3 values higher.
 Suppose the data has the same values, but add 35.7:
14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8, 35.7
 First, order the values:
-17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 35.7, 45.8
 Now n = 8 (even), so the median is the average of the
values in the 8/2 = 4th place and the 8/2 + 1 = 5th place.
So the median is (13.9 + 14.1)/2 = 14.0
 Four data values are lower and four are higher
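The position rules for the median can be sketched in Python using the two data sets above. The function name `median` is chosen for this example; this is an illustration rather than anything from the original slides.

```python
def median(values):
    """Median by position: (n + 1)/2 for odd n, average of n/2 and n/2 + 1 for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Positions count from 1 but list indexes count from 0.
        return ordered[(n + 1) // 2 - 1]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd = [14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8]
even = odd + [35.7]

print(median(odd))             # 13.9, the 4th of 7 sorted values
print(round(median(even), 1))  # 14.0, the average of the 4th and 5th values
```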
 The median is one way to find the center of the data, but
there are many other ways.
 Knowing the median, we can say that a typical
tsunami-causing earthquake was about 7.2 on the Richter scale.
 How well does the median describe the data?
 Whenever we find the center of the data, the next step is
always to ask how well it actually summarizes the data.
 Medians are a good measure of the center when the
data is skewed or there are outliers
 Because the median considers only the order of the
values it is resistant to values that are extraordinarily
large or small as it ignores their distance from the center
 When we have symmetric data an alternative (better)
measure is to use the mean
 When the data is symmetric the mean and median will
be close
 Why not just always use the median?
 The median can sometimes be too resistant and can be
unaffected by changes in many data values, but the
mean considers every data point and gives them each
equal weight.
 When the data is unimodal and roughly symmetric you
should use the mean
 If you are unsure which is the better option, state them
both and explain why they differ.
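 The mean is the sum of the values divided by the
number of values:
\[ \bar{y} = \frac{\text{Total}}{n} = \frac{\sum y}{n} \]
 Σ means “sum” and is pronounced sigma
 n is the number of values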
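The mean can be sketched in Python, together with a quick check of how it reacts to one extreme value compared with the median. The seven values are the ones from the median example; the second list is a hypothetical variation (45.8 mistyped as 458) made up only for this illustration.

```python
def mean(values):
    """Mean = sum of the values divided by the number of values."""
    return sum(values) / len(values)

data = [14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8]
# Hypothetical data-entry error: 45.8 typed as 458.
with_outlier = [14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 458.0]

print(round(mean(data), 2))          # 12.51
print(round(mean(with_outlier), 2))  # 71.4
# With n = 7 the median is the 4th sorted value (index 3); it does not move.
print(sorted(data)[3], sorted(with_outlier)[3])   # 13.9 13.9
```

One changed value moves the mean from about 12.5 to about 71.4, while the median stays at 13.9. This is what it means for the median to be resistant.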
 You want to summarize the
expenditures of 500 credit
card company customers.
 Question: You have found the
mean expenditure to be
$478.19 and the median to be
$216.28. Which is the more
appropriate measure of center
and why?
 Answer: Because the distribution is skewed, the
median is the more appropriate measure of center.
Unlike the mean, it’s not affected by the large
outlying value or by the skewness.
 Even without making a histogram, we can expect some
variables to be skewed. When values of a quantitative
variable are bounded on one side but not the other, the
distribution may be skewed.
 Examples: incomes and waiting times can’t be less
than zero, so they are often skewed to the right.
Amounts of things (dollars, employees) are also often
skewed to the right for the same reason.
 Combinations of things are also often skewed
 For example, a histogram showing the number of
cancelled flights in a month. More flights are likely to
be cancelled in January (due to snowstorms) and August
(thunderstorms). Combining values across months
leads to a skewed distribution.
 The more the data vary, the less the median alone can
tell us.
 We need to measure how the data values vary around
the center, how spread out they are.
 The range of the data is the difference between the
maximum and minimum values: Range = max – min
 Range is a single number, not an interval of values
 The range has the disadvantage that a single extreme
value can make it very large.
 What is the range of the earthquake data?
 The maximum magnitude of these earthquakes is 9.1
and the minimum is 3.7 so the range is 9.1 – 3.7 = 5.4
 A better way to describe the spread of a variable might
be to ignore the extremes and concentrate on the
middle of the data.
 We can split the data into four quarters; the values that
separate them are the quartiles.
 The lower and upper quartiles are also known as the
25th and 75th percentiles of the data and the median is
the 50th percentile.
 To find the quartiles:
 Divide the data in half at the median
 Divide both halves in half again, which will cut the data
into four quarters
 One quarter of the data lies below the lower quartile
 One quarter of the data lies above the upper quartile
 Interquartile Range: The difference between the
quartiles that tells us how much territory the middle
half of the data covers: IQR = Upper – Lower Quartile
 When n is odd, you can either include the median in
both halves or omit it from both halves
Example data from the median example (n is odd):
-17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8
Example data from the median example (n is even):
-17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 35.7, 45.8
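The quartile steps can be sketched in Python for the seven-value data set, showing both choices for the median when n is odd. This is a minimal sketch: the function names and the `include_median` flag are made up for this example, and statistical software may use other quartile rules that give slightly different answers.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values, include_median=False):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 and include_median:
        # Odd n: the median belongs to both halves.
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # Even n, or odd n with the median left out of both halves.
        lower, upper = ordered[:half], ordered[n - half:]
    return median(lower), median(upper)

odd = [-17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8]

q1, q3 = quartiles(odd)                        # median omitted from both halves
print(q1, q3, round(q3 - q1, 1))               # 2.8 25.3 22.5
q1, q3 = quartiles(odd, include_median=True)   # median included in both halves
print(round(q1, 1), round(q3, 1), round(q3 - q1, 1))
```

The two conventions give different quartiles for small data sets, so it is worth stating which one was used.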
The IQR is almost always a reasonable summary of the
spread of a distribution. Even if the distribution itself is
skewed or has some outliers, the IQR should provide
useful information.
The one exception to this is when the data is strongly
bimodal, such as the Kentucky Derby data.
 The five number summary of a distribution reports its
median, quartiles, and extremes (maximum and
minimum)
 The 5-number summary for the earthquake data
would look like this:
Max     9.1
Q3      7.6
Median  7.2
Q1      6.7
Min     3.7
 It is good practice to report the number of data values
and the identity of the cases. Here there are 207
earthquakes.
 Conclusions from the 5-number summary:
 Provides a good overview of the distribution of
magnitudes of these tsunami-causing earthquakes.
 The median magnitude is 7.2.
 The IQR is 7.6 – 6.7 = 0.9
 Because this is small we see that many quakes are close to
the median magnitude
 25% of the earthquakes had a magnitude above 7.6,
although one tsunami was caused by a quake measuring
only 3.7 on the Richter scale.
 Once we have a 5-number summary of a quantitative
variable, we can display that information in a boxplot
 Step 1: Draw a single axis spanning the extent of the
data
 Step 2: Draw short lines to mark the lower, median,
and upper quartiles. Connect the lines to form a box
 Step 3: Construct “fences” around the main part of the
data. The upper fence is 1.5 IQR above the upper
quartile and the lower fence is 1.5 IQR below the lower
quartile
 Upper fence = Q3 + 1.5 IQR
 Lower fence = Q1 – 1.5 IQR
 The fences are just for construction; they
are not part of the display.
 We use the fences to grow the “whiskers”
 Step 4: Draw lines from the ends of the
box up and down to the most extreme
data values found within the fences. If a
data value falls outside one of the fences,
we do not connect it with a whisker.
 Step 5: Place a dot to display any outliers
beyond the fences
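The fence calculation in Step 3 can be sketched in Python using the 5-number summary of the earthquake data given earlier. The function name `fences` is made up for this example. Only the fences are computed here; finding where the whiskers end needs the full data set, which is not listed in the slides.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) = (Q1 - 1.5 IQR, Q3 + 1.5 IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# 5-number summary of the 207 tsunami-causing earthquakes (from the slides).
minimum, q1, med, q3, maximum = 3.7, 6.7, 7.2, 7.6, 9.1

lower, upper = fences(q1, q3)
print(round(lower, 2), round(upper, 2))   # 5.35 8.95
print(minimum < lower)   # True: the smallest quake lies beyond the lower fence
print(maximum > upper)   # True: the largest quake lies beyond the upper fence
```

Because both extremes fall outside the fences, they would be drawn as separate dots rather than connected by whiskers.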
 A box plot highlights several features of the
distribution.
 The central box shows the middle 50% of the data,
between the quartiles.
 The height of the box is equal to the IQR.
 If the median is roughly centered between the
quartiles, then the middle half of the data is roughly
symmetric
 If the median is not centered then the distribution is
skewed.
 The whiskers show skewness as well if they are not
roughly the same length.
 For the recent tsunami
earthquake data, the central
box contains all earthquakes
whose magnitudes are
between 6.7 and 7.6 on the
Richter scale.
 From the shape of the box, it looks like the central part
of the distribution of earthquakes is roughly symmetric
 At the low end the longer whisker and outliers indicate
that the distribution stretches out slightly to the left
 We also see the two large quakes that we have discussed
 State the 5-number summary and create a box plot for
the following data set:
1, 1, 2, 3, 4, 5, 6, 6, 15
 In Algebra you used letters to represent values in a
problem and it did not matter what letter you chose.
You could call the width of a rectangle x, or you could
use w or any other letter you wanted.
 In Statistics, the notation is part of the vocabulary
 For example, n is always the number of data values.
ALWAYS.
 Here is a new one: Whenever there is a bar over the
symbol, it means “find the mean (average)”
 The IQR is almost always a reasonable summary of spread,
but because it only uses two quartiles it ignores much of
the information about how individual values vary.
 A more powerful approach uses the standard
deviation, which takes into account how far each
value is from the mean.
 Like the mean, standard deviation is most appropriate
for symmetric data.
 One way to think about spread is to determine how far
each data value is from the mean.
 The difference between a value and the mean is called
a deviation
 We could average these deviations, but the positives
would cancel with the negatives and we would always
get zero
 To keep them from cancelling out we square each
deviation, resulting in all positive values.
 When we average the squared deviations the result is
called the variance
 Variance: the sum of the squared differences between each
value and the mean, divided by n – 1 instead of just n
\[ s^2 = \frac{\sum (y - \bar{y})^2}{n - 1} \]
 The problem with using the variance as a measure of
spread is the units. We want the units to match the
data, but the units of the variance are squared.
 For example, squared dollars or mpg²
 To get back to the original units we take the square
root of the variance. This is called the standard
deviation
 Standard deviation is a very important concept to
understand and will be used throughout the course.
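 Standard deviation: the square root of the sum of the
squared differences between each value and the mean,
divided by n – 1 instead of just n
\[ s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}} \]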
 To find the standard deviation:
 Step 1 – Find the mean
 Step 2 – Subtract the mean from each data value
 Step 3 – Square all the differences
 Step 4 – Add all squares from step 3
 Step 5 – Divide the sum by n – 1
 This gives the variance
 Step 6 – Take the square root
 Suppose the batch of values is 14, 13, 20, 22, 18, 19
and 13. Find the standard deviation.
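The six steps can be sketched in Python and applied to this batch of values. This is a minimal sketch; the function name `standard_deviation` is chosen for this example, and it divides by n – 1 as in the formula above.

```python
import math

def standard_deviation(values):
    """Sample standard deviation, following the six steps one at a time."""
    n = len(values)
    mean = sum(values) / n                      # Step 1: find the mean
    deviations = [y - mean for y in values]     # Step 2: value minus mean
    squares = [d ** 2 for d in deviations]      # Step 3: square the differences
    total = sum(squares)                        # Step 4: add the squares
    variance = total / (n - 1)                  # Step 5: divide by n - 1
    return math.sqrt(variance)                  # Step 6: take the square root

batch = [14, 13, 20, 22, 18, 19, 13]
print(round(standard_deviation(batch), 2))   # 3.65
```

For this batch the mean is 17, the squared deviations add up to 80, and the variance is 80/6, which is about 13.33.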
 Consider the following histogram representing resting
pulse rates of adults.
 The distribution is roughly symmetric
so we will use the mean and standard
deviation
 The mean pulse rate is 72.7 bpm. We can
see that some heart rates are higher and some are lower,
but how much?
 The standard deviation of 6.5 bpm indicates that on
average we can expect people’s heart rate to differ from
the mean by about 6.5 bpm
 Measures of spread tell how well other summaries describe
the data. ALWAYS report a spread along with any summary
of the center.
 If the data is skewed or has outliers: Median and IQR are
your best choices for summaries
 If the data is roughly symmetric: Mean and standard
deviation are your best options