Download Slide 4

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Bootstrapping (statistics) wikipedia , lookup

History of statistics wikipedia , lookup

Categorical variable wikipedia , lookup

Time series wikipedia , lookup

Transcript
DID YOU SIGN UP FOR MY STAT LAB?
1.
2.
Yes
No
0%
1
0%
2
Slide
3- 1
UPCOMING IN CLASS

Homework #1 due tonight at 11:59pm

Quiz #1 in class August 29th (open book)

Homework #2 due Sunday (9/2) at 11:59pm

Part 1 of the Data Project due (9/5)
Slide
4- 2
DATA PROJECT

Objective: Ask a question and try to answer
it using statistics.
Step 1: DATA COLLECTION - Due Wed.
Sept. 5 in class.
 Step 2: DESCRIPTION OF DATA – Due
Mon. Sept. 17th in class
 Step 3: QUESTIONS – Due Mon. Oct. 29th in
class
 Step 4: FINAL DATA PROJECT – Due by
Friday Dec. 7th 5PM

EXAMPLE QUESTION



Is there a difference in carbon emission for the
Midwest and the Northwest U.S.?
Is there a difference in carbon emissions for years
when a Republican president is in office vs. a
Democrat?
Are carbon emissions in the Midwest at ‘safe’
levels?
COLLECT DATA





Bureau of Labor Statistics (BLS):
http://bls.gov/
Energy Information Administration (EIA):
http://www.eia.gov/
Bureau of Economic Analysis (BEA):
http://www.bea.gov/
Environmental Protection Agency (EPA):
http://epa.gov/
U.S. Census Bureau: http://www.census.gov/
REVIEW FROM LAST CLASS


A categorical (or qualitative) variable names categories
and answers questions about how cases fall into those
categories.
A quantitative variable is a measured variable (with
units) that answers questions about the quantity of
what is being measured.

Quantitative examples: income ($), height (inches), weight
(pounds)
Slide
3- 6
Slide
3- 7
Slide
3- 8
Slide
3- 9
Slide
3- 10
Slide
3- 11
Slide
3- 12
Slide
3- 13
WHICH OF THE COMPARISONS DO YOU
CONSIDER MOST VALID?
1.
2.
3.
Overall average, b/c it does not differentiate
93%
between the four programs.
Individual program comparisons, b/c they take
into account the different number of applicants
and admission rates for each of the four
programs.
Overall average, b/c it takes into account the
differences in number of applicants and
admission rates for each of the four programs.
7%
1
0%
2
3
CHAPTER 4
Displaying and Summarizing Quantitative
Data
DEALING WITH A LOT OF NUMBERS…




Summarizing the data will help us when we look
at large sets of quantitative data.
Without summaries of the data, it’s hard to grasp
what the data tell us.
The best thing to do is to make a picture…
We can’t use bar charts or pie charts for
quantitative data, since those displays are for
categorical variables.

Histograms, dot plots, stem & leaf display, box plot.
HISTOGRAMS: EARTHQUAKE
MAGNITUDES
The chapter example discusses earthquake
magnitudes
 First, slice up the entire span of values covered
by the quantitative variable into equal-width
piles called bins.
 The bins and the counts in each bin give the
distribution of the quantitative variable.

Slide
4- 17
HISTOGRAMS: EARTHQUAKE MAGNITUDES
(CONT.)


A histogram plots the bin
counts as the heights of
bars (like a bar chart).
Here is a histogram of
earthquake magnitudes
Slide
4- 18
HISTOGRAMS: EARTHQUAKE MAGNITUDES
(CONT.)
A relative frequency histogram displays the
percentage of cases in each bin instead of the count.
 In this way, relative
frequency histograms
are faithful to the
area principle.
 Here is a relative
frequency histogram of
earthquake magnitudes:

Slide
4- 19
DOTPLOTS
A dotplot is a simple
display. It just places a
dot along an axis for
each case in the data.
 The dotplot to the right
shows Kentucky Derby
winning times, plotting
each race as its own
dot.
 You might see a dotplot
displayed horizontally
or vertically.

Slide
4- 20
CONSTRUCTING A STEM-AND-LEAF
DISPLAY
First, cut each data value into leading digits
(“stems”) and trailing digits (“leaves”).
 Use the stems to label the bins.
 Use only one digit for each leaf—either round or
truncate the data values to one decimal place
after the stem.

Slide
4- 21
STEM-AND-LEAF DISPLAYS
Stem-and-leaf displays show the distribution of a
quantitative variable, like histograms do, while
preserving the individual values.
 Stem-and-leaf displays contain all the
information found in a histogram and, when
carefully drawn, satisfy the area principle and
show the distribution.

Slide
4- 22
STEM-AND-LEAF EXAMPLE

Compare the histogram and stem-and-leaf display for
the pulse rates of 24 women at a health clinic. Which
graphical display do you prefer?
Slide
4- 23
SHAPE, CENTER, AND SPREAD

When describing a distribution, make sure to
always tell about three things: shape, center, and
spread…
Slide
4- 24
WHAT IS THE SHAPE OF THE
DISTRIBUTION?
1.
2.
3.
Does the histogram have a single, central hump or
several separated humps?
Is the histogram symmetric?
Do any unusual features stick out?
Slide
4- 25
HUMPS
1.
Does the histogram have a single, central
hump or several separated bumps?


Humps in a histogram are called modes.
A histogram with one main peak is dubbed
unimodal; histograms with two peaks are bimodal;
histograms with three or more peaks are called
multimodal.
Slide
4- 26
ROLLING DICE DISTRIBUTION
Die
 http://academic.evergreen.edu/curricular/doingscience
/flash/dice.html

Dice
 http://www.shodor.org/interactivate/activities/ExpPro
bability/

Slide
4- 27
SYMMETRY
2.
Is the histogram symmetric?
 If you can fold the histogram along a
vertical line through the middle and
have the edges match pretty closely, the
histogram is symmetric.
Slide
4- 28
SYMMETRY (CONT.)


The (usually) thinner ends of a distribution are
called the tails. If one tail stretches out farther than
the other, the histogram is said to be skewed to the
side of the longer tail.
In the figure below, the histogram on the left is said
to be skewed left, while the histogram on the right is
said to be skewed right.
Slide
4- 29
ANYTHING UNUSUAL?
3.
Do any unusual features stick out?



Sometimes it’s the unusual features that tell us
something interesting or exciting about the data.
You should always mention any stragglers, or
outliers, that stand off away from the body of the
distribution.
Are there any gaps in the distribution? If so, we
might have data from more than one group.
Slide
4- 30
Slide
4- 31
WHAT CAN GO WRONG?
 Don’t
make a histogram of a categorical
variable—bar charts or pie charts should be
used for categorical data.
 Don’t
look for shape,
center, and spread
of a bar chart.
Slide
4- 32
WHAT CAN GO WRONG? (CONT.)

Choose a bin width appropriate to the data.

Changing the bin width changes the appearance of
the histogram:
Slide
4- 33
WHERE IS THE CENTER?
Mean
 Median
 Mode

Slide
4- 34
CENTER OF A DISTRIBUTION - MEAN

When a distribution is unimodal and symmetric,
most people will point to the center of a distribution.

If we want to calculate a number, we can average
the data.

We use the Greek letter sigma to mean “sum” and
write:
Total  y
y

n
n
The formula says that to find the mean, we add
up the observations and divide by n (the number
of observations).
CENTER OF A DISTRIBUTION – MEAN (CONT)

The mean feels like the center because it is the point
where the histogram balances:
Slide
4- 36
IF YOU HAVE A SKEWED DISTRIBUTION

The mean or balancing point will be weighted more by
outliers
CENTER OF A DISTRIBUTION – MEDIAN

The median is the value with exactly half the data
values below it and half above it.
 It is the middle data
value (once the data
values have been
ordered) that divides
the histogram into
two equal areas.
 It has the same
units as the data.
Slide
4- 38
MEAN OR MEDIAN? (CONT.)
In symmetric distributions, the mean and median
are approximately the same in value, so either
measure of center may be used.
 For skewed data, though, it’s better to report the
median than the mean as a measure of center.

Slide
4- 39
SPREAD: HOME ON THE RANGE
Always report a measure of spread along with a
measure of center when describing a distribution
numerically.
 The range of the data is the difference between
the maximum and minimum values:
Range = max – min
 A disadvantage of the range is that a single
extreme value can make it very large and, thus,
not representative of the data overall.

Slide
4- 40
SPREAD: THE INTERQUARTILE RANGE
The interquartile range (IQR) lets us ignore
extreme data values and concentrate on the
middle of the data.
 To find the IQR, we first need to know what
quartiles are…

Slide
4- 41
SPREAD: THE INTERQUARTILE RANGE
(CONT.)

Quartiles divide the data into four equal sections.



One quarter of the data lies below the lower quartile,
Q1
One quarter of the data lies above the upper quartile,
Q3.
The difference between the quartiles is the IQR,
so
IQR = Q3 - Q1
Slide
4- 42
SPREAD: THE INTERQUARTILE RANGE (CONT.)


The lower and upper quartiles are the 25th and 75th percentiles of
the data, so…
The IQR contains the middle 50% of the values of the distribution,
as shown in Figure 4.13 from the text:
Slide
4- 43
WHAT ABOUT SPREAD? THE STANDARD
DEVIATION
A more powerful measure of spread than the IQR
is the standard deviation, which takes into
account how far each data value is from the
mean.
 A deviation is the distance that a data value is
from the mean.


Since adding all deviations together would total zero,
we square each deviation and find an average of sorts
for the deviations.
WHAT ABOUT SPREAD? THE STANDARD
DEVIATION (CONT.)

The variance, notated by s2, is found by summing the
squared deviations and (almost) averaging them:
y  y



2
s

2
n 1
The variance will play a role later in our study, but it is
problematic as a measure of spread—it is measured in
squared units!
WHAT ABOUT SPREAD? THE STANDARD
DEVIATION (CONT.)

The standard deviation, s, is just the square root of the
variance and is measured in the same units as the
original data.
 y  y 
2
s
n 1
Slide
4- 46
THINKING ABOUT VARIATION




Since Statistics is about variation, spread is an
important fundamental concept of Statistics.
Measures of spread help us talk about what we
don’t know.
When the data values are tightly clustered
around the center of the distribution, the IQR
and standard deviation will be small.
When the data values are scattered far from the
center, the IQR and standard deviation will be
large.
Slide
4- 47
WHAT MEASURES OF CENTER AND SPREAD ARE MOST
APPROPRIATE?
1.
2.
3.
4.
Median
and St.
25%
25%
25%Dev.25%
Median and IQR
Mean and St. Dev.
Mean and IQR
Slide
4- 48
1
2
3
4
WHICH IS LARGER MEAN OR MEDIAN
1.
2.
3.
Mean> median b/c
skewed to right
33%
33%
33%
Median > mean b/c
skewed to right
Mean and median are
equal b/c symmetric
Slide
4- 49
1
2
3
Slide
4- 50
Slide
4- 51
FOR NEXT TIME…

Chapter 5 – comparing distributions