AP Statistics
Take home package
Complete these notes by reading Chapter P and Chapter 1 of the
text. The notes below do not necessarily follow the reading
sequentially but are arranged to group information together. Then
complete the homework problems listed at the end. You can
expect a quiz/test over this material within the first few days of
the school year.
Come prepared with questions.
AP Statistics: Chapter P
Statistics is _____________________________________________________________________
Data is ____________________________________________
Data consists of information about some group of individuals (may be people, animals or even
inanimate objects), and the characteristics we measure on each individual are called variables.
Example 1: Give an example of each of the following “types” of individuals along with a
corresponding variable for the individual.
A person:
An animal:
An inanimate object:
Variables fall into two main categories:
1. A categorical, or qualitative, variable _____________________________________________
_____________________________________________________________________________
2. A quantitative variable _________________________________________________________
_____________________________________________________________________________
Example 2: Consider the three “types” of individuals listed in example 1 and give a possible
categorical and a possible quantitative variable for each.
A person:
An animal:
An inanimate object:
Ideally, any set of data is accompanied by background information that helps us understand it. When
you meet a new set of data, ask yourself the following key questions (write notes for each).
WHO
WHAT
WHY
WHEN, WHERE, HOW and BY WHOM
The distribution of a variable tells us ___________________________________________________
__________________________________________________________________________________
Example 3: Take a standard deck of 52 playing cards, randomly select a card and record if it is an
ace, two, three, etc. Return the card to the deck and randomly select a second card and record if it
is an ace, two etc. Repeat 20 times.
Example 4: Take a regular 6-sided die and roll it 20 times. Each time, record the number that
appears.
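If you do not have a die handy, Example 4 can be simulated. The sketch below is only an illustration: the function name `roll_distribution` and the `seed` parameter are made up for this example, and a computer's pseudo-random rolls stand in for a real die.

```python
import random
from collections import Counter

def roll_distribution(n_rolls=20, seed=None):
    """Roll a fair six-sided die n_rolls times and tally how often each face appears."""
    rng = random.Random(seed)
    rolls = [rng.randint(1, 6) for _ in range(n_rolls)]
    counts = Counter(rolls)
    # Report every face, including faces that never came up.
    return {face: counts.get(face, 0) for face in range(1, 7)}

# Print a rough dotplot: one "x" per time the face was rolled.
for face, count in roll_distribution(n_rolls=20).items():
    print(face, "x" * count)
```

Each run gives a different tally, which is the point: the distribution records which values occurred and how often.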
Statistical inference involves drawing conclusions about a large group, called the ________________,
by gathering information from a smaller subgroup, called the ________________. You may wonder
why we do not simply gather information about everyone in the population (this is called a
_________________) rather than bother with a sample. The reason is simple: it takes too much time
and too much money!
The main statistical designs for producing data are _______________, __________________, and
_________________________________________.
Example P3 on p.9 of the text illustrates one concern when using a survey to gather data. What is this
concern?
In an observational study, __________________________________________________________
________________________________________________________________________________
In an experiment, we ______________________________________________________________
____________________________________________________________
What is the key difference between an observational study and an experiment?
Data analysis is _____________________________________________________________________
What is a side-by-side bar graph best used for?
What type of data is a dotplot used for?
HW: p11 / P.1 – P.5
p19 / P.7 – P.12
p25 / P.13 – P.15, P.18
p30 / P.19, P.21 – P.24, P.28
AP Statistics: Chapter 1
What two types of graphs are typically used for categorical variables?
If a particular section of a pie chart is to represent 17% of the data, what should be the central angle
measure that determines that particular wedge?
What two types of graphs are typically used for quantitative variables?
Steps for constructing a stem-and-leaf plot:
1. ________________________________________________________________________________
__________________________________________________________________________________
2. ________________________________________________________________________________
__________________________________________________________________________________
3. ________________________________________________________________________________
__________________________________________________________________________________
On the bottom of page 45 there is a Minitab (a statistical computer package) version of a stem-and-leaf
plot. The far left column simply shows a cumulative total of the number of leaves for that stem and
all the stems before it. It also shows a leaf unit, so that you know what digit in the data value the leaf
represents. For example, the stem and leaf 0 9 represents 9,000 while the stem and leaf 2 5
represents 25,000.
Compare and contrast stem-and-leaf plots and histograms as to their advantages and when each should
be used.
Two techniques that are helpful when using a stem-and-leaf plot for a moderately large set of data are
_______________________ and _______________________. Describe each technique.
A __________________________________ is very useful when you wish to compare two related
distributions.
A histogram _______________________________________________________________________
__________________________________________________________________________________
You can choose any convenient number of classes but ______________________________________
___________________________________________. If you choose too few classes, you get a
____________________ graph while too many classes will yield a _________________ graph.
Count the number of data values that fall into each class. These counts are called _________________
and a table that lists the class and the frequency for each class is called a _______________________.
A relative frequency histogram gives percents instead of frequencies and is very useful when
comparing two sets of data where one set has many more values than the other.
In a cumulative frequency histogram each class’s frequency is the sum of the frequencies for that
class and all the classes before it as well.
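The counting described above can be sketched in a few lines. This is an illustration only: the function name `frequency_table`, the starting value 15 and the class width 5 are choices made for this example, and the data are the 13 minicompact highway mileages that appear later in these notes (Example 10). It assumes whole-number data with no value below the starting value.

```python
def frequency_table(values, start, width):
    """Return one row per class: (label, frequency, relative frequency, cumulative frequency)."""
    n_classes = (max(values) - start) // width + 1
    counts = [0] * n_classes
    for v in values:
        counts[(v - start) // width] += 1  # which class does v fall into?
    rows, running = [], 0
    for i, count in enumerate(counts):
        running += count  # this class plus all the classes before it
        low = start + i * width
        rows.append((f"{low}-{low + width - 1}", count, count / len(values), running))
    return rows

minicompact_highway = [19, 22, 23, 23, 23, 26, 26, 27, 28, 29, 29, 31, 32]
for label, freq, rel, cum in frequency_table(minicompact_highway, start=15, width=5):
    print(f"{label:>6} {freq:3d} {rel:7.1%} {cum:3d}")
```

Try a different width to see how too few or too many classes change the picture.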
Constructing a graph to represent our data is only the first step. The next step is to interpret what we
see. When you describe the distribution pay special attention to the …
shape ___________________________________________________________________________
The length of the “tails” will tell us whether a graph (i.e. distribution) is left-skewed (the left tail is the
longest) or right-skewed (the right tail is the longest).
modes __________________________________________________________________________
__________________ - one major peak, __________________ - two major peaks
center ___________________________________________________________________________
The two most common measures of center are the mean and the median. These will be discussed in
greater detail later in these notes.
spread ___________________________________________________________________________
The IQR and standard deviation are probably the two most common measures of spread. Both will
be discussed in greater detail later in these notes.
outliers __________________________________________________________________________
Outliers will be discussed in greater detail later in these notes.
When you have to describe the shape of a distribution, don’t get mad,
C U S S
Center: refer to the mean, median or, perhaps, mode.
Unusual: refers to outliers or gaps in the data.
Spread: refers to the IQR, standard deviation or range.
Shape: refers to symmetrical or skewed as well as any peaks.
To look at the relative standing of an individual observation, we use a relative cumulative frequency
graph, which is called an __________________ (pronounced o-jive).
Here’s how to make an ogive.
1. Decide on class intervals and make a frequency table. Add 3 columns to your frequency table
labeled: relative frequency, cumulative frequency and relative cumulative frequency (divide the
cumulative frequency by the total number of observations).
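2. Complete the frequency table below, which shows the ages of U.S. Presidents at their inauguration.
Class      Frequency   Relative Frequency   Cumulative Frequency   Relative Cumulative Frequency
40 – 44        2       __________           __________             __________
45 – 49        6       __________           __________             __________
50 – 54       13       __________           __________             __________
55 – 59       12       __________           __________             __________
60 – 64        7       __________           __________             __________
65 – 69        3       __________           __________             __________
Total:        43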
3. Label and scale your axes and title your graph. Label the horizontal axis “Age at Inauguration” and
the vertical axis “Relative Cumulative Frequency”. Scale the horizontal axis according to your
choice of class intervals and the vertical axis from 0% to 100%.
4. Plot a point corresponding to the relative cumulative frequency in each class interval at the left
endpoint of the next class interval. Connect consecutive points with a line segment to form
the ogive. The last point you plotted should be at a height of ____________.
[Blank grid for the ogive. Title: Ages of U.S. Presidents at the Time of Their Inauguration. Horizontal axis: Age at Inauguration.]
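Steps 1 through 4 can be sketched as code. This is only an illustration: the names `ogive_points` and `read_ogive` are made up here, the starting point at height 0 and the straight-line reading between plotted points are assumptions that mirror connecting the points by line segments, and the frequencies are the ones given in the table for step 2.

```python
# Class lower bounds and frequencies from the inauguration-age table.
lower_bounds = [40, 45, 50, 55, 60, 65]
frequencies = [2, 6, 13, 12, 7, 3]
width = 5

def ogive_points(lower_bounds, frequencies, width):
    """Return the (age, relative cumulative frequency) points to plot."""
    total = sum(frequencies)
    points = [(lower_bounds[0], 0.0)]  # assumed starting point at height 0
    running = 0
    for low, freq in zip(lower_bounds, frequencies):
        running += freq
        # Plot each class's cumulative share at the left endpoint of the next class.
        points.append((low + width, running / total))
    return points

def read_ogive(points, x):
    """Read the ogive's height at x along the line segment that contains x."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x is outside the range of the ogive")

points = ogive_points(lower_bounds, frequencies, width)
for age, height in points:
    print(f"{age:3d}  {height:6.1%}")
```

Calling `read_ogive(points, age)` gives the approximate fraction of presidents who were that age or younger at inauguration, which is how an ogive locates an individual.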
Ogives can be used to locate an individual within the distribution.
Example 1: Determine Bill Clinton’s relative standing when he took office at the age of 46.
Ogives can also be used to locate a value corresponding to a percentile.
Example 2: What is the center of the distribution? __________
A time plot of a variable plots each observation against time. Always put ____________ on the
horizontal scale and the variable you are measuring on the vertical scale. Connecting the data points by
line segments helps emphasize any change over time. A good use of a time plot would be to graph
stock market prices over time.
The table below (from page 70 in your text) gives the EPA city and highway mileage for cars in the
“two-seater” and “minicompact” categories.
Fuel economy (mpg) for 2004 model motor vehicles
Two-seater Cars                             Minicompact Cars
Model                    City  Highway      Model                    City  Highway
Acura NSX                 17     24         Aston Martin Vanquish     12     19
Audi TT Roadster          20     28         Audi TT Coupe             21     29
BMW Z4 Roadster           20     28         BMW 325CI                 19     27
Cadillac XLR              17     25         BMW 330CI                 19     28
Chevrolet Corvette        18     25         BMW M3                    16     23
Dodge Viper               12     20         Jaguar XK8                18     26
Ferrari 360 Modena        11     16         Jaguar XKR                16     23
Ferrari Maranello         10     16         Lexus SC 430              18     23
Ford Thunderbird          17     23         Mini Cooper               25     32
Honda Insight             60     66         Mitsubishi Eclipse        23     31
Lamborghini Gallardo       9     15         Mitsubishi Spyder         20     29
Lamborghini Murcielago     9     13         Porsche Cabriolet         18     26
Lotus Esprit              15     22         Porsche Turbo 911         14     22
Maserati Spyder           12     17
Mazda Miata               22     28
Mercedes-Benz SL 500      16     23
Mercedes-Benz SL600       13     19
Nissan 350Z               20     26
Porsche Boxster           20     29
Porsche Carrera 911       15     23
Toyota MR2                26     32
Measuring Center: The Mean & Median
To calculate the mean, add the values of the observations and divide by the number of observations.
• The mean of a sample is denoted $\bar{x}$, pronounced x-bar.
• The mean of a population is denoted $\mu$, the Greek letter mu.
Example 3: Determine the mean highway mileage for two-seaters.
What outlier do you see in the data? ____________________
Example 4: Determine the mean highway mileage for two-seaters without the outlier.
Examples 3 & 4 illustrate an important weakness of the mean as a measure of center: the mean is
sensitive to the influence of a few extreme observations. These may be outliers, but a skewed
distribution that has no outliers will also pull the mean toward its long tail.
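You can check your hand calculations for Examples 3 and 4 with a short script. This is a sketch, not part of the text: the variable names are made up, the list holds the two-seater highway mileages from the table, and the 66 mpg value is treated as the outlier being dropped.

```python
from statistics import mean

# Highway mileage (mpg) for the 21 two-seaters, in table order.
two_seater_highway = [24, 28, 28, 25, 25, 20, 16, 16, 23, 66, 15,
                      13, 22, 17, 28, 23, 19, 26, 29, 23, 32]

with_outlier = mean(two_seater_highway)
# Drop the single extreme value and recompute.
without_outlier = mean([x for x in two_seater_highway if x != 66])

print("mean with all cars:       ", round(with_outlier, 2))
print("mean without the extreme: ", round(without_outlier, 2))
```

Comparing the two printed values shows how far one extreme observation can pull the mean.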
The median (denoted by _____) is the __________________ of a distribution:
To calculate the median:
1. Order the observations from smallest to largest.
2. If the number of observations is odd, the median is simply the middle value in the list. You
can find the location by counting __________ observations from the bottom (or top).
3. If the number of observations is even, you should average the two middle numbers. The
location of the median is again __________ from the bottom or top of the list.
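The three steps above translate directly into code. This is a minimal sketch (Python's built-in `statistics.median` does the same job); the function name `median` and the sample lists are chosen for illustration.

```python
def median(values):
    """Middle value of the ordered list, or the average of the two middle values."""
    ordered = sorted(values)          # step 1: order from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # step 2: odd count, take the middle value
        return ordered[mid]
    # step 3: even count, average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 5]))      # odd number of observations
print(median([7, 1, 5, 3]))   # even number of observations
```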
Example 5: Find the median highway mileage for 2004 model two-seater cars.
Example 6: Drop the Honda Insight (the outlier) and find the median.
Is the median sensitive to the influence of an extreme observation? _____
We say that the median is an _____________________________________________ of center.
Mean versus Median
The mean and median of a roughly symmetrical distribution will be ___________________________.
If the distribution is exactly symmetric, the mean and median are _______________. In a skewed
distribution, the mean is __________________________ in the long tail than the median.

In a skewed distribution, the ____________ is the more accurate measure of center.
In descriptions of data, the “average” value of a variable is usually referred to as the __________
whereas the “typical” value is usually referred to as the __________________.
Measuring Spread: The Quartiles
A measure of center alone can be misleading.
Example 7: Find the mean and median of: 6000 7000 8000 9000 15000
$\bar{x}$ = __________     M = __________     range (see below) = __________
Example 8: Find the mean and median of: 1000 1000 8000 8000 27000
$\bar{x}$ = __________     M = __________     range = __________
One way to measure spread, or variability, is to calculate the range, which is ____________________
_________________________________________________________________
Another way to describe the spread of a distribution is by considering different percentiles. The pth
percentile of a distribution is the value that has ____________________________________________
________________________________. The median is the ________ percentile. The 25th percentile is
called the ______________________________ while the 75th percentile is called the ______________
___________________.
Example 9: Find the median and quartiles of the 21 gasoline-powered two-seater cars below.
13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32 66
Example 10: Find the median and quartiles of the 13 minicompact cars below.
19 22 23 23 23 26 26 27 28 29 29 31 32
The Five-Number Summary and Boxplots
The five-number summary of a set of observations consists of the _____________________, the
____________________________, the _____________, the __________________________ and the
_____________________. These five numbers give a fairly complete description of center and spread.
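A five-number summary can be computed with a short helper. This is a sketch under one assumption you should check against the text: each quartile is taken as the median of the observations on its side of the overall median's position, leaving the median itself out when the count is odd. Statistical software often uses slightly different quartile rules, so its answers may differ a little. The names and the sample data are made up for illustration.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the halves convention."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # observations below the median's position
    upper = ordered[-half:]   # observations above the median's position
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

# Hypothetical data, chosen only for illustration.
print(five_number_summary([2, 4, 4, 5, 6, 7, 8, 9, 30]))
```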
Example 11: Find the five-number summary for the highway gas mileage for two-seaters and
minicompacts in examples 9 & 10.
two-seaters:    _____   _____   _____   _____   _____
minicompacts:   _____   _____   _____   _____   _____
Remember that the median describes _________________________________________________, the
quartiles show ___________________________________________________________________ and
the minimum and maximum values show_________________________________________________
The five-number summary can be presented visually by a boxplot. These are the steps for
constructing a boxplot.
1. ________________________________________________________________________________
2. ________________________________________________________________________________
3. ________________________________________________________________________________

You should also place the exact values of the five-number summary above the appropriate line.
Example 12: On the graph below, construct a boxplot for the highway gas mileage for two-seaters
from example 9
0        10        20        30        40        50        60        70
The 1.5 × IQR Rule for Outliers
The distance between the 1st and 3rd quartiles is called the ___________________________________,
which is abbreviated IQR for obvious reasons. The quartiles and IQR are resistant to changes in either
tail of a distribution. Note, however, that no single numerical measure of spread, such as IQR, is very
useful for describing skewed distributions.

We will call a data value a “suspected” outlier if _____________________________________
____________________________________________________________________________
* The IQR rule for outliers is the only one given in this text. A commonly used rule that uses
the mean instead of the median is “a data value is an outlier if it lies more than 2 standard
deviations above or below the mean”.
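The rule named in this section's title can be sketched as follows: a value is flagged when it lies more than 1.5 × IQR below the first quartile or above the third quartile. The function names, the data and the quartiles below are hypothetical, chosen only for illustration; find the quartiles by the method from your notes and pass them in.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs of the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def suspected_outliers(values, q1, q3):
    """List the values that fall outside the cutoffs."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]

# Hypothetical data and quartiles, not taken from the text.
values = [2, 4, 4, 5, 6, 7, 8, 9, 30]
q1, q3 = 4, 8.5
print(outlier_fences(q1, q3))
print(suspected_outliers(values, q1, q3))
```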
In a modified boxplot, _______________________________________________________________
______________________________________ and asterisks are used to denote any outliers.
Example 13: Consider the highway gas mileage for two-seaters from example 9.
(a) Show that the Honda Insight is a suspected outlier.
(b) Find the lower bound in order for an observation to be an outlier.
(c) Draw a modified boxplot.
HW Assignments:
P46 / 1.1ab, 1.3, 1.4, 1.5 (dot plot and stem-and-leaf)
P55 / 1.7 - 1.9, 1.11, 1.12
P64 / 1.13 – 1.15, 1.18
P74 / 1.27 – 1.32
P82 / 1.33, 1.34a-d, 1.35, 1.36a, 1.37
Measuring Spread: The Standard Deviation
While the five-number summary certainly gives a great deal of information about the distribution of a
set of numbers, the most common numerical description of a distribution is the combination of the
mean to measure ________________ and the standard deviation to measure ________________ .
The standard deviation measures spread by ______________________________________________
_______________________________________________


The standard deviation of a sample is denoted by s.
The standard deviation of a population is denoted $\sigma$, the Greek letter sigma.
The following formula is used to compute the standard deviation of a sample.
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
What does all this mean????
The deviations $x_i - \bar{x}$ measure _______________________________________________________.
Some of these deviations will be positive and some negative. Why?
The sum of the deviations (the Greek letter sigma, $\Sigma$, means find the sum) of the observations from
their mean will always be ______. Squaring the deviations makes them all positive. After adding
the squared deviations, we find their average by dividing by n – 1. Why n – 1?
This number n – 1 is called the _______________________________ (see example 14 below)
Finally, taking the square root undoes the squaring of the deviations that we did initially.
Example 14: I have 6 numbers whose sum is 0. Five of the numbers are 2, 5, –3, 6 and –4. What is the
sixth number? ______ Notice that if you only have 5 of the numbers, you can determine the sixth.
The variance of a set of observations, $s^2$ or $\sigma^2$, is simply the square of the standard deviation.
Example 15: Find the standard deviation of the following metabolic rates (in calories per 24 hours) of
7 men.
1795 1666 1362 1614 1460 1867 1439
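step 1: Find $\bar{x}$.
step 2: Determine $x_i - \bar{x}$ for each observation.
step 3: Square each number in step 2, add the squares together and divide by n – 1 to find the variance.
step 4: Take the square root of the answer to step 3 to find the standard deviation.
$\bar{x}$ = __________
$x_i$        $x_i - \bar{x}$        $(x_i - \bar{x})^2$
1795         __________             __________
1666         __________             __________
1362         __________             __________
1614         __________             __________
1460         __________             __________
1867         __________             __________
1439         __________             __________
$\sum (x_i - \bar{x})^2$ = __________
$\div\,(n - 1)$ = __________ = $s^2$ (variance)
$\sqrt{s^2}$ = __________ = $s$ (standard deviation)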
There is a shortcut formula for computing the standard deviation of a sample. Use it to find the
standard deviation of the numbers in example 15.
$$s = \sqrt{\frac{n\sum x_i^2 - \left(\sum x_i\right)^2}{n(n - 1)}}$$
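Both routes to the standard deviation can be checked against each other. The sketch below follows steps 1 to 4 of Example 15 in one function and the shortcut formula in another; the function names are made up for illustration, and the data are the seven metabolic rates from Example 15.

```python
from math import sqrt

def sample_sd(values):
    """Steps 1-4: mean, deviations, squared deviations / (n - 1), square root."""
    n = len(values)
    x_bar = sum(values) / n
    squared_devs = [(x - x_bar) ** 2 for x in values]
    variance = sum(squared_devs) / (n - 1)
    return sqrt(variance)

def sample_sd_shortcut(values):
    """Shortcut: sqrt of (n * sum of x^2 - (sum of x)^2) / (n * (n - 1))."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

rates = [1795, 1666, 1362, 1614, 1460, 1867, 1439]
print(round(sample_sd(rates), 2), round(sample_sd_shortcut(rates), 2))
```

The two printed values should agree; the shortcut simply avoids computing each deviation separately.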
Properties of the Standard Deviation
1. s measures spread about the _______________ and should be used only when the mean is used as
the measure of center
2. s = 0 only when there is ___________________________________ (i.e. _____________________
___________________________________). Otherwise, __________. As the observations become
more spread out about their mean, s gets ________________.
3. s, like the mean $\bar{x}$, is not resistant to outliers. A few outliers can make s very large. Distributions
with outliers and strongly skewed distributions have very large standard deviations. As such, the
number s does not give much helpful information about such distributions.
Choosing Measures of Center and Spread
The five number summary, in particular the median and the IQR, is usually better than the mean and
standard deviation for describing _______________________________________________________
_______________________________________ Use $\bar{x}$ and s only for reasonably _________________
distributions that are free of outliers.
In the United States we commonly use feet and inches to measure height while much of the rest of the
world will use the metric system. How does converting data values from one unit of measure to
another affect the various measures of center and spread that we have discussed?
Let’s consider an example: The following numbers are test scores out of 50 for 8 statistics students.
40, 42, 47, 32, 39, 29, 41, 45
$\bar{x}$ = _________     M = __________     s = ___________
What if the teacher added 3 points to each test grade as a curve?
$\bar{x}$ = _________     M = __________     s = ___________
What if the teacher decided to make the test worth 100 points instead? The new scores would be
80, 84, 94, 64, 78, 58, 82, 90
$\bar{x}$ = _________     M = __________     s = ___________
A linear transformation changes the original variable x into a new variable $x_{new}$ by an equation of the
form ________________________ where the constant a ____________________________________
__________________________________________ while the constant b _______________________
______________________________________________.
Note: Adding the same number, a, to each observation ______________________________________
_____________________________________________________________________________
Multiplying each observation by the same number, b, __________________________________
_____________________________________________________________________________
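To explore the two notes above, you can apply a transformation to the test scores and recompute the summaries. This sketch assumes the transformation has the form new value = a + b × old value, matching the "add a" and "multiply by b" wording of the notes; the function names are made up for illustration.

```python
from statistics import mean, median, stdev

scores = [40, 42, 47, 32, 39, 29, 41, 45]

def summarize(values):
    """Return (mean, median, sample standard deviation)."""
    return mean(values), median(values), stdev(values)

def linear_transform(values, a, b):
    """Apply new = a + b * old to every observation (assumed form)."""
    return [a + b * x for x in values]

print("original:    ", summarize(scores))
print("add 3 points:", summarize(linear_transform(scores, a=3, b=1)))
print("out of 100:  ", summarize(linear_transform(scores, a=0, b=2)))
```

Compare the three printed lines to see which summaries move when you add a constant and which change when you multiply.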
HW Assignments:
P89 / 1.39, 1.40, 1.42, 1.43
P97 / 1.45, 1.46, 1.50, 1.54
P100 / 1.51, 1.54, 1.55, 1.57, 1.58