Chapter 1: Looking at Data--Distributions
Section 1.1: Introduction, Displaying Distributions with Graphs
Section 1.2: Describing Distributions with Numbers
Learning goals for this chapter:
Identify categorical and quantitative variables.
Interpret, create (by hand and with SPSS), and know when to use: bar graphs, pie
charts, stemplots (standard, back-to-back, split), histograms, and boxplots
(regular, modified, side-by-side).
Describe the shape, center, and spread of data distributions.
Define, calculate (by hand and with SPSS), and know when to use measures of
center (mean vs. median) and spread (range, 5-number summary, IQR, variance,
standard deviation).
Understand what a resistant measure of center and spread is and when this is
important.
Use the 1.5 × IQR rule to look for outliers.
Draw a Normal curve in correct proportions and identify the mean/median,
standard deviation, middle 68%, middle 95%, and middle 99.7%.
Perform calculations with the empirical rule, both backwards and forwards.
Understand the need for standardization.
Big picture: what do we learn in this chapter?
Individuals vs. Variables
Categorical vs. Quantitative Variables
Graphs:
Bar graphs and pie charts (categorical variables)
Histograms and stemplots (quantitative variables; good for checking for
symmetry and skewness)
Boxplots (quantitative variables; graphical display of the 5-number summary;
modified boxplots show outliers)
Describing distributions
Shape (symmetric/skewed, unimodal/bimodal/multimodal)
Center (mean or median)
Spread (usually standard deviation/variance or IQR from the 5-number summary)
Outliers
If you have a symmetric distribution with no outliers, use the mean and standard
deviation.
If you have a skewed distribution and/or you have outliers, use the 5-number summary
instead.
2 components in describing data or information:
Individuals: objects being described by a set of data (people, households, cars,
animals, corn, etc.)
Variables: characteristics of individuals (height, yield, length, age, eye color,
etc.)
Categorical: places an individual into one of several groups (gender, eye
color, college major, hometown, etc.)
Quantitative: takes numerical values for which adding or averaging the
values makes sense (height, weight, age, income, yield, etc.)
Distribution of a variable: describes what values a variable takes and how often it
takes those values
If you have more than one variable in your problem, you should look at each variable by
itself before you look at relationships between the variables.
Example: Identify whether the following questions would give you categorical or
quantitative data.
a) What letter grade did you get in your Calculus class last semester?
b) What was your score on the last exam?
c) Who will you vote for in the next election?
d) How many votes did George W. Bush get?
e) How many red M&Ms are in this bag?
f) Which type of M&Ms has more red ones: peanut or plain?
It’s always a good idea to start by displaying variables graphically before you do any
other statistical analysis. What kind of graph should you use? That depends on
whether you have a categorical or quantitative variable.
Categorical Variables:
Bar graphs or pie charts
Messy room example: In a poll of 200 parents of children ages 6 to 12,
respondents were asked to name the most disgusting things ever found in their
children’s rooms. The results are below (J&C 2005)
Most disgusting thing                             # of parents   % of parents
Food-related                                      106            53%
Animal and insect-related nuisances               22             11%
Clothing (dirty socks and underwear especially)   22             11%
Other                                             50             25%
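As a quick check on the table, the percentages can be recomputed from the counts. This is only a sketch; the dictionary keys are shorthand labels chosen here, not names from the poll.

```python
# Counts from the messy-room poll (200 parents in total).
counts = {"food": 106, "animal": 22, "clothing": 22, "other": 50}

total = sum(counts.values())

# Percent of parents = count / total * 100
percents = {mess: 100 * n / total for mess, n in counts.items()}

for mess, pct in percents.items():
    print(f"{mess:<10}{counts[mess]:>5}{pct:>7.1f}%")
```

A bar graph can be drawn from either `counts` or `percents`; a pie chart needs `percents`, and the percentages must add to 100.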
Bar graph (can use either # of parents like below or % of parents):
[Bar graph: count of parents (axis from 0 to 120) for each type of disgusting mess (animal, clothing, food, other); cases weighted by # of parents]
Pie chart (needs % of parents):
[Pie chart of type of disgusting mess: food 53.0%, other 25.0%, animal 11.0%, clothing 11.0%; cases weighted by # of parents]
Quantitative Variables:
Stemplots, histograms, and boxplots (discussed a little later)
Example: You investigate the amount of time students spend online (in minutes).
You study 28 students, and their times are listed below. Show the distribution of
times with a stemplot.
7    20   24   25   25   28   28   30   32   35
42   43   44   45   46   47   48   48   50   51
72   75   77   78   79   83   87   88
To create a stemplot by hand:
1. Put the data in order from smallest to largest.
2. The "stem" will be all digits for a data point except for the last one. Write the
stems in a vertical line. (Think of "7" as being "07" so that all the numbers
have a digit in the tens place.)
3. The "leaf" will be the next digit (in this case, the ones place) from each data
point. Write the leaves after the appropriate stem, in increasing order.
4. It is possible to "trim" any digits that you feel may be unnecessary. For
example, if our second data point had been 20.3, we would probably choose to
ignore the ".3" for the purposes of the stemplot so that we could create a more
reasonable stemplot. If we did not ignore this ".3", then our stems would have
been 07, 08, 09, 10, 11, 12, 13,…, 88 with decimal numbers as our leaves. This
would show a very uniform stemplot with only one leaf for each stem (all
leaves would be 0 except for the 3). This would not be helpful to us at all. It
makes much more sense to use the tens place for the stem and the ones place as
the leaves in this example.
Stemplot
0 |7
1 |
2 |045588
3 |025
4 |23456788
5 |01
6 |
7 |25789
8 |378

A split stemplot just has more stems. There are several ways to split the stems. Here they are
split by fives.
0 |7
1 |
1 |
2 |04
2 |5588
3 |02
3 |5
4 |234
4 |56788
5 |01
5 |
6 |
6 |
7 |2
7 |5789
8 |3
8 |78
Why do we need split stemplots? Sometimes it is easier to see the shape of the data with
more stems. Sometimes a regular stemplot is better. If you’re not sure, try it both ways
and see if a pattern appears.
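The by-hand steps can also be sketched in code, which makes it easy to try both versions. This is a minimal illustration, not the SPSS method: the function name `stemplot` and its `split` flag are made up here, and it assumes non-negative whole numbers with the tens place as the stem.

```python
from collections import defaultdict

# Online times (minutes) for the 28 students.
times = [7, 20, 24, 25, 25, 28, 28, 30, 32, 35,
         42, 43, 44, 45, 46, 47, 48, 48, 50, 51,
         72, 75, 77, 78, 79, 83, 87, 88]

def stemplot(data, split=False):
    """Return the rows of a stemplot (stem = tens, leaf = ones).

    With split=True each stem is split by fives: leaves 0-4, then 5-9.
    """
    rows = defaultdict(list)
    for x in sorted(data):              # step 1: order the data
        stem, leaf = divmod(x, 10)      # steps 2 and 3: stem and leaf
        half = 1 if (split and leaf >= 5) else 0
        rows[(stem, half)].append(str(leaf))

    halves = (0, 1) if split else (0,)
    lines = []
    for stem in range(min(data) // 10, max(data) // 10 + 1):
        for half in halves:             # keep empty stems so gaps show
            lines.append(f"{stem} |{''.join(rows[(stem, half)])}")

    # Drop empty rows before the first leaf and after the last one.
    while lines and lines[0].endswith("|"):
        lines.pop(0)
    while lines and lines[-1].endswith("|"):
        lines.pop()
    return lines

print("\n".join(stemplot(times)))
print()
print("\n".join(stemplot(times, split=True)))
```

Empty stems in the middle are kept on purpose: leaving them out would hide the gap between the 50s and the 70s.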
Try a stemplot and a split stemplot with this data (use the hundreds place for stems):
3, 4, 17, 18, 39, 93, 102, 110, 143, 178, 250, 278, 299, 300.1
Histograms
Sorting the quantitative data into bins. How many bins?
Not too many bins with either 0 or 1 counts.
Not overly summarized so that you lose all the information
Not so detailed that it is no longer a summary
[Three example histograms of the same data: too few bins, OK, too many bins]
Histograms vs. bar graphs
Histograms:
  The bars for each interval touch each other.
  Histograms have a continuous, quantitative x-axis, with the x-values in order.
  Quantitative variables
Bar graphs:
  The bars for each category do not touch each other. There are spaces between the bars.
  Bar graphs can have the categories on the x-axis listed in any order (alphabetical,
  biggest-to-smallest, etc.)
  Categorical variables

Histograms vs. stemplots
Histograms:
  Quantitative variables
  Good for big data sets, especially if technology is available.
  Uses a box to represent each data point.
Stemplots:
  Quantitative variables
  Good for small data sets, convenient for back-of-the-envelope calculations. Rarely found in
  scientific or lay publications.
  Uses a digit to represent each data point.
You’ve drawn your graph (histogram or stemplot). Now what?
Look for overall pattern and any outliers.
The pattern is described by shape, center, and spread.
1. Shape:
o # of peaks (unimodal = 1, bimodal = 2, multimodal > 2)
o Where the long tail is:
Symmetric: Median = Mean
Right skewed (long tail on the right): Median < Mean
Left skewed (long tail on the left): Median > Mean
To describe the shape, use a histogram with a smoothed curve highlighting the overall
pattern of the distribution (don’t get overly detailed).
2. Center: (If the distribution is symmetric, the mean will equal the median, but
otherwise these numbers are not the same.)
a) Mean: arithmetic average, $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$,
where n = the total # of observations
and $x_i$ = an individual observation
b) Mode: the most common number, biggest peak
c) Median: M, midpoint of the distribution such that ½ the observations are
smaller and ½ the observations are larger. The median is not as affected
by outliers as the mean is; the median is resistant to outliers.
To find the median:
i. Order the data from smallest to largest.
ii. Count the # of observations (n).
iii. Calculate $\frac{n+1}{2}$ to find the center of the data set.
iv. If n is odd, M is the data point at the center of the data set.
v. If n is even, $\frac{n+1}{2}$ falls between 2 data points, called the
"middle pair." M = the average of the middle pair.
Examples of center:
Find the mean and median of the following 7 numbers in Dataset A:
23   25   32.5   33   67   1   -20
Find the mean and median of the following 8 numbers in Dataset B:
1   2   4   6   8   9   12   13
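To check by-hand answers for the two datasets, the mean formula and the median steps can be sketched as below. The function and variable names are illustrative only; the median function follows the (n + 1)/2 recipe given above.

```python
def mean(data):
    """Arithmetic average: add the observations and divide by n."""
    return sum(data) / len(data)

def median(data):
    """Median found by ordering the data and locating position (n + 1) / 2."""
    ordered = sorted(data)              # i. order smallest to largest
    n = len(ordered)                    # ii. count the observations
    center = (n + 1) / 2                # iii. position of the center (1-based)
    if n % 2 == 1:                      # iv. n odd: the center data point
        return ordered[int(center) - 1]
    lower = ordered[n // 2 - 1]         # v. n even: average the middle pair
    upper = ordered[n // 2]
    return (lower + upper) / 2

dataset_a = [23, 25, 32.5, 33, 67, 1, -20]
dataset_b = [1, 2, 4, 6, 8, 9, 12, 13]

print(mean(dataset_a), median(dataset_a))
print(mean(dataset_b), median(dataset_b))
```

For Dataset B, n = 8 gives a center position of 4.5, so the middle pair is the 4th and 5th ordered values (6 and 8).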
3. Spread:
a) Range = max – min (simplest, not always the most helpful)
b) Variance: $s^2$, average of the squared deviations of observations from the
mean (dividing by n − 1):
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
c) Standard Deviation: s, square root of the variance, common way for
measuring how far observations are from the mean
Example of finding the standard deviation by hand: 0, 2, 4
1. Calculate the mean.
2. Calculate the variance.
3. Take the square root of the variance.
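The three steps for 0, 2, 4 can be followed line by line in a short sketch (variable names are chosen here for illustration):

```python
import math

data = [0, 2, 4]
n = len(data)

# 1. Calculate the mean.
x_bar = sum(data) / n

# 2. Calculate the variance: sum of squared deviations divided by n - 1.
squared_deviations = [(x - x_bar) ** 2 for x in data]
variance = sum(squared_deviations) / (n - 1)

# 3. Take the square root of the variance.
s = math.sqrt(variance)

print(x_bar, squared_deviations, variance, s)
```

Here the deviations from the mean are -2, 0, and 2, so the squared deviations are 4, 0, and 4, and their sum is divided by n − 1 = 2.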
d) pth percentile: value such that p% of the observations fall at or below it
Median = M = 50th percentile
First Quartile = Q1 = 25th percentile
Third Quartile = Q3 = 75th percentile
How do you find quartiles? Think of them as "mini-medians." Leave the
median out, and then find the median of what is left over on the left side (Q1)
and what is left over on the right side (Q3).
Find the 1st and 3rd quartiles of the following 7 numbers in Dataset A:
-20 (Min)   1   23   25 (M)   32.5   33   67 (Max)
Find the 1st and 3rd quartiles of the following 8 numbers in Dataset B:
1 (Min)   2   4   6   8   9   12   13 (Max)        M = 7
e) 5-Number Summary: Min, Q1, M, Q3, Max
f) Interquartile Range (IQR) = Q3 – Q1
Call an observation a suspected outlier if it is:
> Q3 + 1.5 IQR
OR
< Q1 – 1.5 IQR
g) Boxplots: Graph of the 5-number summary
Modified boxplots have lines extend from the box out to the smallest and
largest observations which are NOT outliers. Dots mark any outliers.
(We will always ask for the modified boxplot, but if there are no outliers,
the modified and regular boxplots look exactly the same.)
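The "mini-median" method for quartiles, the 5-number summary, and the 1.5 × IQR rule can be sketched together as follows. The function names are made up for this illustration. Statistical software may use a different quartile definition, so its answers will not always match this by-hand method.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(data):
    """Return (Min, Q1, M, Q3, Max), leaving the median out of each half."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                       # values left of the median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])

def suspected_outliers(data):
    """Observations above Q3 + 1.5 IQR or below Q1 - 1.5 IQR."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

dataset_a = [23, 25, 32.5, 33, 67, 1, -20]
dataset_b = [1, 2, 4, 6, 8, 9, 12, 13]

print(five_number_summary(dataset_a), suspected_outliers(dataset_a))
print(five_number_summary(dataset_b), suspected_outliers(dataset_b))
```

When n is odd, the median itself is dropped before the halves are formed; when n is even, the data simply split into two equal halves.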
Boxplot for Dataset A with 5-number summary:
-20, 1, 25, 33, 67
Since there were no outliers in
this dataset, a regular boxplot and
a modified boxplot look exactly
the same for this data.
For the online time example (with 2 additional data points added in), list the 5-number
summary, find any outliers present, and show a boxplot and modified boxplot.
7    20   24   25   25   28   28   30   32   35
42   43   44   45   46   47   48   48   50   51
72   75   77   78   79   83   87   88   135  151
How do you know which method is best for determining center and spread?
5-Number Summary: better for skewed distributions or distributions with outliers
Mean and Standard Deviation: good for reasonably symmetric distributions free of
outliers.
Always start with a graph!
In the internet time example, here is how the mean/standard deviation and 5-number
summary are affected by the outlier:
                                    Mean    Standard Deviation   5-number summary
With outlier (151)                  54.77   32.647               7, 30, 46.5, 77, 151
With outlier removed from dataset   51.45   27.600               7, 29, 46, 76, 135
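The mean and standard deviation in this comparison can be reproduced with Python's standard `statistics` module, whose `stdev` divides by n − 1 as in the variance formula above. This is a sketch for checking the table, not the SPSS procedure.

```python
import statistics

# The 30 online times, including the two added points (135 and 151).
times = [7, 20, 24, 25, 25, 28, 28, 30, 32, 35,
         42, 43, 44, 45, 46, 47, 48, 48, 50, 51,
         72, 75, 77, 78, 79, 83, 87, 88, 135, 151]

without_outlier = [t for t in times if t != 151]

for label, data in [("with outlier", times),
                    ("outlier removed", without_outlier)]:
    mean = statistics.mean(data)
    sd = statistics.stdev(data)      # sample standard deviation (n - 1)
    med = statistics.median(data)
    print(f"{label:<16} mean={mean:.2f}  sd={sd:.3f}  median={med}")
```

Removing one observation shifts the mean by more than 3 minutes and the standard deviation by about 5, while the median moves only from 46.5 to 46. That is what it means for the median to be resistant.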
"The Median vs. the Mean in the Age of Average" by Mike Pesca on NPR’s Day-to-Day
7/19/06: http://www.npr.org/templates/story/story.php?storyId=5567890
Do you always have to do all of this by hand? NO!
Statistical software packages like SPSS can make life much easier for you, but it’s a good
idea to know how to do these by hand so you can make sense of your output. Also, on
the exam, you won’t have access to a computer.
Read over your SPSS manual and get comfortable with using SPSS. You will have a
chance to practice on the HW for this week, and you will work on it in lab on Friday.
Enter your data, then Analyze--> Descriptive Statistics--> Explore. Follow the
instructions on p. 48 of the SPSS manual.
The output from SPSS for the internet time problem looks like:
Descriptives: Time spent on the web
                                                 Statistic   Std. Error
Mean                                             54.77       5.961
95% Confidence Interval for Mean   Lower Bound   42.58
                                   Upper Bound   66.96
5% Trimmed Mean                                  52.13
Median                                           46.50
Variance                                         1065.840
Std. Deviation                                   32.647
Minimum                                          7
Maximum                                          151
Range                                            144
Interquartile Range                              48
Skewness                                         1.314       .427
Kurtosis                                         1.977       .833

Stem-and-Leaf Plot
Frequency    Stem &  Leaf
  1.00        0 .  0
  9.00        0 .  222222333
 10.00        0 .  4444444455
  5.00        0 .  77777
  3.00        0 .  888
   .00        1 .
  1.00        1 .  3
  1.00 Extremes    (>=151)
Stem width:  100
Each leaf:   1 case(s)

[Histogram of time spent on the web (x-axis from 0 to 150, y-axis: Frequency from 0 to 10); Mean = 54.77, Std. Dev. = 32.647, N = 30]
Notice on the boxplot, it is easy to identify the potential outlier. This would be your
indication that the 5-number summary would be the best way to describe your data. (You
could also try calculating the mean and standard deviation without the outlier for
comparison.)
SPSS can also give you the Quartiles (listed under "Percentiles"), but these are not
necessarily the same answers as what you would get by hand. The "weighted average"
and "Tukey’s Hinges" are not the same method we use. For this class, whenever we
ask you to calculate the Quartiles, we want you to do them by hand.
What if you want to compare the results from two or more different groups? Use
side-by-side boxplots or back-to-back stemplots for your graphs.
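Female | Stem | Male
     9 |  2   |
    81 |  3   |
     6 |  4   |
       |  5   |
   330 |  6   |
  8110 |  7   |
   652 |  8   |
   999 |  9   |
[Male leaves, one row per stem, stem alignment not recoverable: 0, 88, 08, 3459, 22456]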
Preview of Section 1.3 (from Section 1.3)
A z-score tells us how many standard deviations away from the mean an observation is:
$z = \frac{x - \mu}{\sigma}$
This is also called getting a standardized value.
Why is standardization useful? For comparing apples to oranges.
Example: (p. 88, Problem 1.99) Jacob scores 16 on the ACT. Emily scores 670 on the
SAT. Assuming that both tests measure scholastic aptitude, who has the higher score?
The SAT scores for 1.4 million students in a recent graduating class were roughly
normal with a mean of 1026 and standard deviation of 209. The ACT scores for more
than 1 million students in the same class were roughly normal with mean of 20.8 and
standard deviation of 4.8.
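One way to set up this comparison is to standardize each score with its own test's mean and standard deviation. The helper name `z_score` below is chosen just for this sketch.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# ACT: mean 20.8, standard deviation 4.8
jacob_z = z_score(16, 20.8, 4.8)

# SAT: mean 1026, standard deviation 209
emily_z = z_score(670, 1026, 209)

print(f"Jacob (ACT 16):  z = {jacob_z:.2f}")
print(f"Emily (SAT 670): z = {emily_z:.2f}")
print("Higher standardized score:",
      "Jacob" if jacob_z > emily_z else "Emily")
```

Both scores are below their test's mean, so both z-scores are negative; the one closer to 0 is the higher standardized score.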
How else can we use standardization? If the distribution of observations has a bell shape, then these standardized values have some special properties. One of these is the
68-95-99.7% Empirical Rule.
Approximately 68% of the observations fall within $1\sigma$ of the mean
(between $\mu - 1\sigma$ and $\mu + 1\sigma$).
Approximately 95% of the observations fall within $2\sigma$ of the mean
(between $\mu - 2\sigma$ and $\mu + 2\sigma$).
Approximately 99.7% of the observations fall within $3\sigma$ of the mean
(between $\mu - 3\sigma$ and $\mu + 3\sigma$).
$P(\mu - 1\sigma < X < \mu + 1\sigma) = 0.68$
$P(\mu - 2\sigma < X < \mu + 2\sigma) = 0.95$
$P(\mu - 3\sigma < X < \mu + 3\sigma) = 0.997$
The horizontal axis of the curve is marked in standard deviations away from the mean
(z-score), so a z-score of -2 could also be written as $\mu - 2\sigma$, for example.
The mean and the median of a bell-shaped curve are in the middle. This is shown with a
0 because the mean is 0 standard deviations away from itself.
The most famous bell-shaped distribution is the Normal distribution. We will spend
several lectures talking about it for Section 1.3, and it will be important to everything we
do for the rest of the semester.
Example: Checking account balances are approximately Normally distributed with a
mean of $1325 and a standard deviation of $25.
a) Between what numbers do 68% of the balances fall?
b) Above what number do 2.5% of the balances lie?
c) Approximately what percent of balances are between 1250 and 1400?
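One possible way to organize the three parts with the 68-95-99.7 rule is sketched below. The variable names and the small lookup table are assumptions made for this illustration; the rule itself only covers whole numbers of standard deviations (1, 2, or 3).

```python
mean, sd = 1325, 25

# Percent of observations within k standard deviations of the mean.
EMPIRICAL_RULE = {1: 68, 2: 95, 3: 99.7}

# a) The middle 68% lies within 1 standard deviation of the mean.
low_68, high_68 = mean - 1 * sd, mean + 1 * sd

# b) 95% lies within 2 standard deviations, leaving 5% in the two tails,
#    so 2.5% lies above mean + 2 sd (and 2.5% below mean - 2 sd).
upper_cutoff = mean + 2 * sd

# c) Standardize both endpoints, then look up the matching percent.
z_low = (1250 - mean) / sd
z_high = (1400 - mean) / sd
percent_between = EMPIRICAL_RULE[int(z_high)] if z_low == -z_high else None

print(f"a) 68% of balances fall between {low_68} and {high_68}")
print(f"b) 2.5% of balances lie above {upper_cutoff}")
print(f"c) z from {z_low} to {z_high}: about {percent_between}%")
```

Part b) works backwards (from a percent to a balance), while part c) works forwards (from balances to a percent).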