Exploratory Data Analysis: One Variable
FPP 3-6
Plan of attack
 Distinguish different types of variables
 Summarize data numerically
 Summarize data graphically
 Use theoretical distributions to potentially learn more about
a variable.
The five steps of statistical analyses
1. Form the question
2. Collect data
3. Model the observed data
 We start with exploratory techniques.
4. Check the model for reasonableness
5. Make and present conclusions
Just to make sure we are on the same page
 More (or repeated) vocabulary
 Individuals are the objects described by a set of data
 examples: employees, lab mice, states…
 A variable is any characteristic of an individual that is of interest
to the researcher. It takes on different values for different
individuals
 examples: age, salary, weight, location…
 How is this different from a mathematical variable?
Just to make sure we are on the same page
#2
 Measurement: the value of a variable obtained and
recorded on an individual
 Example: 145 recorded as a person’s weight, 65 recorded as
the height of a tree, etc.
 Data is a set of measurements made on a group of
individuals
 The distribution of a variable tells us what values it takes
and how often it takes these values
Chest Sizes of 5,738 Militiamen
Possible values ->        Chest Size  33-34  35-36  37-38  39-40  41-42  43-44  45-46  47-48
How often each occurs ->  Count          21    266   1169   2152   1592    462     71      5
Two Types of Variables
 a categorical/qualitative variable places an individual into one of several
groups or categories
 examples:
 Gender, Race, Job Type, Geographic location…
 JMP calls these variables nominal
 a quantitative variable takes numerical values for which arithmetic
operations such as adding and averaging make sense
 examples:
 Height, Age, Salary, Price, Cost…
 Can be further divided into ordinal and continuous
 Why two types?
 Both require their own summaries (graphically and numerically) and analysis.
 I can’t emphasize enough the importance of identifying the type of variable being
considered before proceeding with any type of statistical analysis
Example
Name                Age  Gender  Race   Salary  Job Type
Fleetwood, Delores  39   Female  White  62,100  Management
Perez, Juan         27   Male    White  47,350  Technical
Wang, Lin           20   Female  Asian  18,250  Clerical
Johnson, LaVerne    48   Male    Black  77,600  Management
 Age: quantitative
 Gender: categorical
 Race: categorical
 Salary: quantitative
 Job type: categorical
Variable types in JMP
 Qualitative/categorical
 JMP uses Nominal
 Quantitative
 Discrete
 JMP uses Ordinal
 Continuous
 JMP uses Continuous
Exploratory data analysis
 Statistical tools that help examine data in order to describe
their main features
 Basic strategy
 Examine variables one by one, then look at the relationships
among the different variables
 Start with graphs, then add numerical summaries of specific
aspects of the data
Exploratory data analysis:
One variable
 Graphical displays
 Qualitative/categorical data: bar chart, pie chart, etc.
 Quantitative data: histogram, stem-leaf, boxplot, timeplot etc.
 Summary statistics
 Qualitative/categorical: contingency tables
 Quantitative: mean, median, standard deviation, range etc.
 Probability models
 Qualitative: Binomial distribution (others we won’t cover in this class)
 Quantitative: Normal curve (others we won’t cover in this class)
Example categorical/qualitative
data
Summary table
 We summarize categorical data using a table. Note
that percentages are often called Relative
Frequencies.
Class                      Frequency         Relative Frequency
(Highest Degree Obtained)  (Number of CEOs)  (Proportion)
None                        1                0.04
Bachelors                   7                0.28
Masters                    11                0.44
Doctorate / Law             6                0.24
Totals                     25                1.00
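A short Python sketch of how the relative frequency column follows from the counts. The counts are the ones in the table above; the variable names are only illustrative.

```python
# Counts from the CEO degree table above.
counts = {"None": 1, "Bachelors": 7, "Masters": 11, "Doctorate / Law": 6}

total = sum(counts.values())  # total number of CEOs

# Relative frequency = count in the class / total count.
relative_freq = {degree: n / total for degree, n in counts.items()}

for degree, n in counts.items():
    print(f"{degree:<16} {n:>3}  {relative_freq[degree]:.2f}")
print(f"{'Totals':<16} {total:>3}  {sum(relative_freq.values()):.2f}")
```

The relative frequencies always sum to 1, which is a quick check that no category was left out.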
Bar graph
 The bar graph quickly
compares the degrees of the
four groups
 The heights of the four bars
show the counts for the four
degree categories
Pie chart
 A pie chart helps us see what
part of the whole each group
forms
 To make a pie chart, you
must include all the
categories that make up a
whole
Summary of categorical variables
 Graphically
 Bar graphs, pie charts
 A bar graph is nearly always preferable to a pie chart. It is easier to compare
bar heights than slices of a pie
 Numerically: tables with total counts or percents
Quantitative variables
 Graphical summary
 Histogram
 Stemplots
 Time plots
 more
 Numerical summary
 Mean
 Median
 Quartiles
 Range
 Standard deviation
 more
Histograms
The bins are:
3.0 ≤ rate < 4.0
4.0 ≤ rate < 5.0
5.0 ≤ rate < 6.0
6.0 ≤ rate < 7.0
7.0 ≤ rate < 8.0
8.0 ≤ rate < 9.0
9.0 ≤ rate < 10.0
10.0 ≤ rate < 11.0
11.0 ≤ rate < 12.0
12.0 ≤ rate < 13.0
13.0 ≤ rate < 14.0
14.0 ≤ rate < 15.0
Histograms
The bins are:
2.0 ≤ rate < 4.0
4.0 ≤ rate < 6.0
6.0 ≤ rate < 8.0
8.0 ≤ rate < 10.0
10.0 ≤ rate < 12.0
12.0 ≤ rate < 14.0
14.0 ≤ rate < 16.0
16.0 ≤ rate < 18.0
Histograms
 Where did the bins come from?
 They were chosen rather arbitrarily
 Does choosing other bins change the picture?
 Yes!! And sometimes dramatically
 What do we do about this?
 Some pretty smart people have come up with some “optimal”
bin widths and we will rely on their suggestions
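A small Python sketch of how the choice of bins changes the counts a histogram displays. The `rates` list is made up for illustration (it is not the data behind the slides), and `bin_counts` is a hypothetical helper; the two bin layouts mirror the width-1 and width-2 bins listed earlier.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each bin [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)  # index of the bin that contains v
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Hypothetical rates, invented for this example only.
rates = [3.2, 4.1, 4.8, 5.5, 5.9, 6.0, 6.3, 7.7, 8.1, 9.9, 10.4, 13.6]

narrow = bin_counts(rates, start=3.0, width=1.0, n_bins=12)  # bins 3.0 to 15.0
wide = bin_counts(rates, start=2.0, width=2.0, n_bins=8)     # bins 2.0 to 18.0

print("width 1:", narrow)
print("width 2:", wide)
```

With the narrow bins these values look fairly flat with a gap near 11 to 13; with the wide bins the same values show a single peak between 4 and 6.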
Histogram
 The purpose of a graph is to help us understand the data
 After you make a graph, always ask, “What do I see?”
 Once you have displayed a distribution you can see the
important features
Histograms
 We will describe the features of the distribution that the
histogram is displaying with three characteristics
1. Shape
 Symmetric, skewed right, skewed left, uni-modal, multi-modal,
bell shaped
2. Center
 Mean, median
3. Spread (outliers or not)
 Standard deviation, Inter-quartile range
Body temperatures of 30 people
[Figure: distribution of body temperature, shown as a histogram with a quantiles table and a moments table]
Incomes from 500 households in
2000 current population survey
[Figure: distribution of household income, shown as a histogram with a quantiles table and a moments table]
Histogram vs. Bar graph
 Spaces mean something in histograms but not in bar graphs
 Shape means nothing with bar graphs
 The biggest difference is that they are displaying
fundamentally different types of variables
Time Plots
 Many variables are measured at intervals over time
 Examples
 Closing stock prices
 Number of hurricanes
 Unemployment rates
 If the interest in a variable is to see change over time, use a time
plot
Time Plots
 Patterns to look for
 Patterns that repeat themselves at known regular intervals of
time are called seasonal variation
 A trend is a persistent, long-term rise or fall
Time plots
[Figure: time plot of the number of hurricanes each year from 1970 - 1990; horizontal axis: Year, vertical axis: Hurricanes]
Numerical summaries of
quantitative variables
 Want a numerical summary for center and spread
 Center
 Mean
 Median
 Mode
 Spread
 Range
 Inter-quartile range
 Standard deviation
 5 number summary is a popular collection of the following
 min, 1st quartile, median, 3rd quartile, max
Mean
 To find the mean of a set of observations, add their values
and divide by the number of observations
 equation 1:
$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N}$
 equation 2:
$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Mean example
 The average age of 20 people in a room is 25. A 28 year old
leaves while a 30 year old enters the room.
 Does the average age change?
 If so, what is the new average age?
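One way to reason about this example, sketched in Python: the mean times the number of people gives the total age, and swapping one person changes only that total. The variable names are illustrative.

```python
n = 20          # number of people in the room (unchanged by the swap)
old_mean = 25   # average age before the swap

total_age = n * old_mean          # sum of all ages: 500
new_total = total_age - 28 + 30   # the 28 year old leaves, the 30 year old enters
new_mean = new_total / n          # 502 / 20 = 25.1

print(new_mean)
```

The mean rises by (30 - 28) / 20 = 0.1, without needing to know any individual age.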
Median
 The median is the midpoint of a distribution
 The number such that half the observations are smaller and the
other half are larger
 Also called the 50th percentile or 2nd quartile
 To compute a median
 Order observations
 If number of observations is odd the median is the center
observation
 If number of observations is even the median is the average of
the two center observations
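The steps above can be sketched in Python as follows. The function name `median` and the sample lists are illustrative only.

```python
def median(values):
    """Median by the rule above: sort, then take the middle (or average the two middles)."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)   # step 1: order the observations
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:             # odd count: the center observation
        return ordered[mid]
    # even count: average of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([31, 25, 28]))       # odd number of observations
print(median([31, 25, 28, 40]))   # even number of observations
```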
Median example
 The median age of 20 people in a room is 25. A 28 year old
leaves while a 30 year old enters the room.
 Does the median age change?
 If so, what is the new median age?
 The median age of 21 people in a room is 25. A 28 year old
leaves while a 30 year old enters the room.
 Does the median age change?
 If so, what is the new median age?
Mean vs Median
 When histogram is symmetric mean and median are similar
 Mean and median are different when histogram is skewed
 Skewed to the right: mean is larger than median
 Skewed to the left: mean is smaller than median
 The business magazine Forbes estimates that the “average”
household wealth of its readers is either about $800,000 or
about $2.2 million, depending on which “average” it reports.
Which of these numbers is the mean wealth and which is the
median wealth? Why?
Mean vs Median
 Symmetric distribution
Mean vs Median
 Right skewed distribution
Mean vs Median
 Left skewed distribution
Extreme example
 Income in small town of 6 people
$25,000 $27,000 $29,000
$35,000 $37,000 $38,000
 Mean is about $31,833 and median is $32,000
 Bill Gates moves to town
$25,000 $27,000 $29,000
$35,000 $37,000 $38,000 $40,000,000
 Mean is $5,741,571 median is $35,000
 Mean is pulled by the outlier while the median is not. The
median is a better measure of center for these data
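The numbers in this example can be checked with Python's standard `statistics` module. The incomes are the ones listed above; the variable names are illustrative.

```python
from statistics import mean, median

# Incomes of the six original residents.
incomes = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]

# The same town after one extreme income is added.
with_outlier = incomes + [40_000_000]

print("before:", round(mean(incomes)), median(incomes))
print("after: ", round(mean(with_outlier)), median(with_outlier))
```

One added value moves the mean by millions of dollars, while the median moves only from $32,000 to $35,000.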
Is a central measure enough?
 A warm, stable climate greatly affects some individuals’ health.
Atlanta and San Diego have about equal average temperatures (62°
vs. 64°). If a person’s health requires a stable climate, in which
city would you recommend they live?
Measures of spread
 Range:
 subtract the smallest value from the largest
 Inter-quartile range:
 subtract the 1st quartile from the 3rd quartile
 Standard Deviation (SD):
 “average” distance from the mean
 Which one should we use?
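A minimal Python sketch of the range and the inter-quartile range. The data list is hypothetical. Textbooks and software use slightly different quartile conventions; this sketch assumes the "inclusive" method of `statistics.quantiles`, so other tools (JMP included) may report slightly different quartiles for the same data.

```python
from statistics import quantiles

# Hypothetical observations, chosen only to keep the arithmetic easy to follow.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# Quartiles under the "inclusive" convention (an assumption, see above).
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# Inter-quartile range: 3rd quartile minus 1st quartile.
iqr = q3 - q1

print("range:", data_range)
print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr)
```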
Standard Deviation
 The standard deviation looks at how far observations are from their mean
 It is the square root of the average squared deviations from the mean
 Compute distance of each value from mean
 Square each of these distances
 Take the average of these squares and square root
$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
 Often we will use SD to denote standard deviation
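The three steps can be sketched directly in Python. The function name `sd` and the sample list are illustrative; note that this version divides by N, as in the formula on this slide, not by N - 1.

```python
from math import sqrt

def sd(values):
    """Standard deviation as defined above: root of the average squared deviation."""
    n = len(values)
    mu = sum(values) / n                         # the mean
    deviations = [x - mu for x in values]        # distance of each value from the mean
    squared = [d ** 2 for d in deviations]       # square each distance
    return sqrt(sum(squared) / n)                # average the squares, then square root

# Hypothetical list: mean is 5, squared deviations sum to 32, so SD = sqrt(32 / 8) = 2.
numbers = [2, 4, 4, 4, 5, 5, 7, 9]
print(sd(numbers))
```

For comparison, `statistics.pstdev` in the Python standard library uses the same divide-by-N definition.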
Example
Standard deviation
 Order these histograms by the SD of the numbers they portray. Go from smallest to largest
 What is a reasonable guess of the SD for each?
[Figure: three histograms, with horizontal axes running from -15 to 20, from -30 to 30, and from -1 to 2.5]
Histograms on same scale
[Figure: the same three histograms drawn on a common horizontal axis from -30 to 30]
Problem from text (p. 74, #2)
 Which of the following sets of numbers has the smaller SD?
a) 50, 40, 60, 30, 70, 25, 75
b) 50, 40, 60, 30, 70, 25, 75, 50, 50, 50
 Repeat for these two sets
c) 50, 40, 60, 30, 70, 25, 75
d) 50, 40, 60, 30, 70, 25, 75, 99, 1
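After reasoning about the answer, the SDs can be checked numerically. This sketch uses `statistics.pstdev`, which divides by N as in the SD formula used in these slides; the list names follow the labels above (set c is the same list as set a).

```python
from statistics import pstdev

a = [50, 40, 60, 30, 70, 25, 75]
b = a + [50, 50, 50]   # set a plus three values equal to the mean
d = a + [99, 1]        # set a plus two values far from the mean

sd_a, sd_b, sd_d = pstdev(a), pstdev(b), pstdev(d)

print(f"SD of a (and c): {sd_a:.2f}")
print(f"SD of b:         {sd_b:.2f}")
print(f"SD of d:         {sd_d:.2f}")
```

Adding values at the mean pulls the SD down, while adding values far from the mean pushes it up.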
More intuition behind the SD
 This is a variance contest. You must give a list of six numbers
chosen from the whole numbers 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
with repeats allowed.
 Give a list of six numbers with the largest standard deviation
such a list described above can possibly have.
 Give a list of six numbers with the smallest standard
deviation such a list can possibly have.
Properties of SD
 SD ≥ 0. (When is SD = 0?)
 Has the same unit of measurement as the original
observations
 Inflated by outliers
Mean and SD
 What happens to the mean if you add 5 to every number in a
list?
 What happens to the SD?
$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
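A quick numerical check of both questions, using a hypothetical list and the standard `statistics` module (`pstdev` divides by N, matching the SD formula in these slides).

```python
from statistics import fmean, pstdev

# Hypothetical list, used only to illustrate the effect of a shift.
numbers = [2, 4, 4, 4, 5, 5, 7, 9]
shifted = [x + 5 for x in numbers]   # add 5 to every number

print("mean:", fmean(numbers), "->", fmean(shifted))
print("SD:  ", pstdev(numbers), "->", pstdev(shifted))
```

Every deviation $(x_i + 5) - (\mu + 5) = x_i - \mu$ is unchanged by the shift, so the mean moves up by 5 while the SD stays the same.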
Standard deviation
 SDs are like measurement units on a ruler
 Any quantitative variable can be converted into
“standardized” units
 These are often called z-scores and are denoted by the letter z
 Important formula
$z = \frac{\text{value} - \text{mean}}{\text{SD}} = \frac{\text{value} - \mu}{\sigma}$
 Example
 ACT versus SAT scores
 Which is more impressive? A 1340 on the SAT, or a 32 on the ACT?
The normal curve
 When histogram looks like a bell-shaped curve, z-scores are associated
with percentages
 The percentage of the data in between two different z-score values
equals the area under the normal curve in between the two z-score
values
 A bit of notation here.
 N(μ, σ) is shorthand for writing normal curve with mean μ and standard
deviation σ (get used to this notation as it will be used fairly regularly
throughout the course)
Normal curves
Properties of normal curve
 In the Normal distribution with mean μ and standard deviation σ:
 68% of the observations fall within 1σ of μ
 95% of the observations fall within 2σ of μ
 99.7% of the observations fall within 3σ of μ
 By remembering these numbers, you can think about Normal
curves without constantly making detailed calculations
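These three percentages can be recovered from the standard normal curve with Python's `statistics.NormalDist`. The helper name `within` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)   # the N(0, 1) curve

def within(k):
    """Area under the normal curve between z = -k and z = +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {100 * within(k):.1f}%")
```

The exact areas are close to, but not exactly, 68%, 95% and 99.7%; the rule rounds them to numbers that are easy to remember.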
Properties of normal curves
 For a N(0,1) the following holds
IQ
 A person is considered to have mental retardation when
1. IQ is below 70
2. Significant limitations exist in two or more adaptive skill areas
3. Condition is present from childhood
 What percentage of people have an IQ that meets the first
criterion of mental retardation?
IQ
 A histogram of all people’s IQ scores has a μ=100 and a
σ=16
 How to get % of people with IQ < 70
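One way to carry out this calculation, assuming IQ scores follow the normal curve N(100, 16): convert 70 to a z-score, then find the area under the curve to its left. The sketch uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

mu, sigma = 100, 16                 # mean and SD of IQ scores from the slide
iq = NormalDist(mu=mu, sigma=sigma)

z = (70 - mu) / sigma               # z-score of an IQ of 70: -1.875
p_below_70 = iq.cdf(70)             # area under the curve to the left of 70

print(f"z = {z}")
print(f"percentage with IQ < 70: {100 * p_below_70:.1f}%")
```

Under this normal model roughly 3% of people have an IQ below 70.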
More IQ
 Reggie Jackson, one of the greatest baseball players ever, has an IQ of 140. What
percentage of people have bigger IQs than Reggie?
 Marilyn vos Savant, self-proclaimed smartest person in the world, has a reported IQ of
205. What percentage of people have IQ scores smaller than Marilyn’s score?
 Mensa is a society for “intelligent people.” To qualify for Mensa, one needs to be in at
least the upper 2% of the population in IQ score. What is the score needed to qualify for
Mensa?
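A sketch of the three calculations, again assuming IQ scores follow N(100, 16). The first two use the area under the curve (`cdf`); the Mensa question runs the other way, from a percentage to a score, which is what `inv_cdf` does.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16)

# Share of people with an IQ above 140 (z = 2.5).
p_above_140 = 1 - iq.cdf(140)

# Share of people with an IQ below 205 (z is about 6.56).
p_below_205 = iq.cdf(205)

# Score that cuts off the upper 2%, i.e. the 98th percentile.
mensa_cutoff = iq.inv_cdf(0.98)

print(f"above 140: {100 * p_above_140:.2f}%")
print(f"below 205: {100 * p_below_205:.6f}%")
print(f"Mensa cutoff: {mensa_cutoff:.1f}")
```

A score more than six SDs above the mean is so far out in the tail that the normal model puts essentially everyone below it.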
Checking if data follow normal
curve
 Look for symmetric
histogram
 A different method is a
normal probability plot.
When normal curve is a
good fit, points fall on a
nearly straight line
Measurement error
 Measurement error model
 Measurement = truth + chance error
 Outliers
 Bias affects all measurements in the same way
 Measurement = truth + bias + chance error
 Often we assume that the chance error follows a normal
curve that is centered at 0