Module 2: Descriptive Statistics
(and a bit about R)
Statistics (OA3102)
Professor Ron Fricker
Naval Postgraduate School
Monterey, California
Reading assignment:
WM&S chapter 1
Revision: 1-12
Why Care About
Descriptive Statistics?
• Data sets continue to grow ever bigger
– The human mind cannot assimilate and make
sense of volumes of raw data
• Descriptive statistics are a useful form of data reduction
– Numeric summaries
– Graphical plots
• Good descriptive statistics help analysts and
decision makers understand what the raw
data means
Goals for this Module
• Define types of data and types of variables
• Learn how to appropriately summarize data
using descriptive statistics
– Numerical descriptive statistics
• Measures of location: mean, median, mode
• Measures of spread: variance, standard
deviation, range, inter-quartile range, etc.
– Graphical descriptive statistics
• Continuous variables: histogram, boxplot
• Categorical variables: barplots, pie charts
• R paradigms and summarizing data with R
Variables
• A characteristic that is being studied in a
statistical problem is called a variable
• Types of variables:
– Continuous: Can divide by any number and result
still makes sense
• Examples: flight time, failure rate, detection
distance
– Categorical:
• Ordinal: ordered categories
– Examples: rank, magazine capacity, shirt size
• Nominal: unordered categories
– Examples: gender, service branch, ship type
Types of Data
• Data
– Qualitative (nominal)
– Quantitative
• Discrete (ordinal)
• Continuous (continuous)
Some Descriptive Statistics
• Numerical:
– Location: Mean, median, mode
– Spread: Standard deviation,
variance, range, quantiles, IQR
– Correlation
• Graphical:
– Histograms, bar charts,
dot charts, boxplots,
scatter plots, etc.
• Good descriptive statistics lead to good
decision making
Sample Mean ($\bar{x}$)
• Sample average or sample mean
– Sample consists of n observations, x1,…,xn
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
– Often denoted by $\bar{x}$ (spoken “x-bar”)
• To calculate
– R: use mean() function
– Excel: =AVERAGE(cell reference)
Sample Median ($\tilde{x}$)
• The median is the halfway point in the
ordered data
• Steps to calculate the median:
– Order the data from smallest to largest
– If the number of observations is odd, the middle
observation is the median. E.g., the median of
1 3 5 6 12 12 99 is 6
– If the number is even, then the average of the two
middle observations is the median. E.g., the median of
1 3 5 6 12 12 is (5 + 6)/2 = 5.5
Using More Formal Notation…
• Let $x_{(i)}$ denote the ith order statistic from a
sample $x_1, x_2, \ldots, x_n$
– E.g., for $x_1 = 5, x_2 = 12, x_3 = 2$, we have
$x_{(1)} = 2, x_{(2)} = 5, x_{(3)} = 12$
• Then the sample median can be defined as
n odd: $\tilde{x} = x_{((n+1)/2)}$
n even: $\tilde{x} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$
– Equations apply to samples and populations
• To calculate
– R: use median() function
– Excel: =MEDIAN(cell reference)
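The order-statistic definition of the median can be sketched directly in R. The helper name median_by_order_stats is made up for this illustration; in practice the built-in median() is what you would call.

```r
# Hypothetical helper: sample median computed from the order statistics
median_by_order_stats <- function(x) {
  xs <- sort(x)   # xs[i] is the ith order statistic x(i)
  n <- length(xs)
  if (n %% 2 == 1) {
    xs[(n + 1) / 2]                  # n odd: the middle observation
  } else {
    (xs[n / 2] + xs[n / 2 + 1]) / 2  # n even: average of the two middle observations
  }
}

odd_sample  <- c(1, 3, 5, 6, 12, 12, 99)
even_sample <- c(1, 3, 5, 6, 12, 12)

median_by_order_stats(odd_sample)   # 6
median_by_order_stats(even_sample)  # 5.5
median(even_sample)                 # the built-in function gives the same value
```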
Mean vs. Median
• Both are measures of location or “central
tendency”
– But, median less affected by outliers
• Example:
– Imagine a sample of data: 0, 0, 0, 1, 1, 1, 2, 2, 2
• Median=mean=1
– Another sample of data: 0, 0, 0, 1, 1, 1, 2, 2, 83
• Median still equals 1, but mean=10!
• Which to use? Depends on whether you are:
– characterizing a “typical” observation (the median)
– or describing the average value (the mean)
Exercise
• Calculate “by hand” the mean and median for the
data: {6,1,3,7,3,6,7,4,8}
Exercise (continued)
• Now do the same for {6,1,3,7,3,6,7,4,8,100}
Now, in R:
• For {6,1,3,7,3,6,7,4,8}:
• For {6,1,3,7,3,6,7,4,8,100}:
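One way to run the two exercises in R is sketched below. The variable names x1 and x2 are chosen here for illustration.

```r
x1 <- c(6, 1, 3, 7, 3, 6, 7, 4, 8)
x2 <- c(x1, 100)   # same data with one extreme observation added

mean(x1)    # 5
median(x1)  # 6
mean(x2)    # 14.5
median(x2)  # 6
```

The single extreme value moves the mean from 5 to 14.5 while the median stays at 6.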
Common Measures of “Spread”
• Measures of location tell you where the “center” of
the data is
• Measures of spread tell you how variable the data is
around the center
• Typical measures of spread:
– Sample variance: essentially, the average squared deviation
around the mean,
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
– Standard deviation: the square root of the variance, $s = \sqrt{s^2}$
• The standard deviation is in the same units as the mean
Exercise
• Calculate “by hand” the sample variance and
standard deviation for the data: {1,2,3,4,5}
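A by-hand answer can be checked in R. This is a minimal sketch that applies the definition step by step and then compares it with the built-in var() and sd() functions.

```r
x <- c(1, 2, 3, 4, 5)
n <- length(x)

s2 <- sum((x - mean(x))^2) / (n - 1)  # sample variance from the definition
s  <- sqrt(s2)                        # standard deviation

s2      # 2.5
s       # about 1.58
var(x)  # the built-in functions agree
sd(x)
```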
Pictorially
Ignore Variability at Your Peril
• Often analyses only focus on the average
• But it’s possible to be right on average and be
way off in every case
– The average high temperature
in Washington DC in June is
83 degrees
• “Oh, how balmy!”
• No...it’s either 75°
or it’s 90+ degrees!
From Flaws and Fallacies in Statistical Thinking
by Stephen K. Campbell.
The Range (R)
• Range is another measure of spread
• In words, it is the largest observation in the
sample minus the smallest observation
– Example: A sample of students’ ages in the class
• Data: 21, 23, 23, 25, 25, 26, 27, 31, 33, 33, 35, 40
• Note that they are already ordered!
• R = 40 - 21 = 19
– Using previous notation: R = x(n) − x(1)
• In R: use the code diff(range())
– range() function gives x(1) and x(n)
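For the class-age example, the calculation might look like this in R (the vector name ages is chosen here):

```r
ages <- c(21, 23, 23, 25, 25, 26, 27, 31, 33, 33, 35, 40)

range(ages)            # smallest and largest observation: 21 40
diff(range(ages))      # the range as a single number: 19
max(ages) - min(ages)  # same result, written out
```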
Other Measures of Spread:
Quantiles and Percentiles
• Percentiles
– For data, the pth percentile, 0 ≤ p ≤ 100, is the
value of x such that p% of the data is less than
or equal to x
• Quantiles are the same as percentiles except for
scale
– Percentiles are on a 0 to 100 scale
– Quantiles are on a 0 to 1 scale
– The pth quantile equals the (p×100)th percentile
Special Percentiles and Quantiles
• Special percentiles:
– Minimum: 0th percentile (or 0 quantile)
– Median: 50th percentile (or 0.5 quantile)
– Maximum: 100th percentile (or 1.0 quantile)
• Quartiles: 25th and 75th percentiles
– Devore: “lower fourth” and “upper fourth”
• Interquartile Range (IQR):
IQR = 75th percentile - 25th percentile
– Devore calls the IQR the “fourth spread”
– In R: IQR()
Calculating Quantiles
• R function: quantile(data, probs)
– data is a numeric vector of data
– probs is a numeric vector of probabilities
• Default: 0, 0.25, 0.5, 0.75 and 1.0 quantiles
• In R, pth quantile is x(p×(n−1)+1)
– If p×(n−1)+1 is not an integer, interpolate between
two closest values
– E.g.,
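A small sketch of the interpolation rule, using the class-age data from the range example. The helper quantile_by_hand is a made-up name for illustration; it mirrors the position rule p×(n−1)+1, which is what quantile() uses by default.

```r
ages <- c(21, 23, 23, 25, 25, 26, 27, 31, 33, 33, 35, 40)

# Hypothetical helper: pth quantile by linear interpolation at position p*(n-1)+1
quantile_by_hand <- function(x, p) {
  xs <- sort(x)
  h  <- p * (length(xs) - 1) + 1  # possibly fractional position
  lo <- floor(h)
  hi <- ceiling(h)
  xs[lo] + (h - lo) * (xs[hi] - xs[lo])
}

quantile_by_hand(ages, 0.25)  # position 3.75, between 23 and 25: 24.5
quantile(ages, 0.25)          # R's default method gives the same value
quantile(ages)                # default probs: 0, 0.25, 0.5, 0.75, 1
IQR(ages)                     # 75th percentile minus 25th percentile: 8.5
```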
Hinges
• Hinges are an alternative to quartiles
– They’re the x(j) and x(n-j+1) order statistics, for
$$j = \frac{\lfloor (n+1)/2 \rfloor + 1}{2}$$
where if j is not an integer, interpolate
• Easier way to compute:
– If n is even, they’re the median values of the upper
and lower halves of the sorted data
– If n is odd, they’re the median values of the upper
and lower halves of the sorted data, where each
half includes the median data point
Exercise
• “By hand,” calculate the five number summary for
{12,2,7,5,15,4,9,18,6}
– The five number summary is the minimum, lower hinge,
median, upper hinge, maximum
Exercise (continued)
• “By hand,” calculate the five number summary for
{12,2,7,5,15,4,9,18,6,10}
Results in R
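The original output for this slide is not preserved in the transcript. As a sketch, the two exercises can be checked with fivenum(), which returns the minimum, lower hinge, median, upper hinge, and maximum. The names x9 and x10 are chosen here.

```r
x9  <- c(12, 2, 7, 5, 15, 4, 9, 18, 6)
x10 <- c(x9, 10)

fivenum(x9)   # 2 5 7 12 18
fivenum(x10)  # 2 5 8 12 18

# quantile() reports quartiles rather than hinges, so its middle values can differ
quantile(x10)
```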
The Empirical Rule
• If the distribution of measurements is
approximately normal, then:
• 68% of the data is within μ ± 1σ
• 95% within μ ± 2σ
• 99.7% (“almost all”) within μ ± 3σ
[Figure: standard normal density over Z from -4 to 4, with the central 68%, 95%, and 99.7% regions marked]
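The empirical rule percentages come from the normal distribution and can be reproduced with pnorm(). The helper name within_k_sd is made up for this sketch.

```r
# Probability that a normal observation falls within k standard deviations of the mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

round(within_k_sd(1:3), 4)  # 0.6827 0.9545 0.9973
```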
Remember Notation Conventions
• Summation:
– Σ notation and subscripts
• Size:
– n denotes size of sample
– N denotes size of population
• Knowns vs. unknowns:
– Small letters (e.g., “x”) mean quantity is known
– Capital letters (e.g., “X”) mean quantity is unknown
(i.e., it’s a random variable)
Graphically Depicting Data
• Many different types of plots and charts
• Whatever you do, don’t fall into the trap of just
using Excel plots because they’re easy
– R is much more powerful and flexible
– Excel does not do some important/useful plot types
[Figure: example histogram; horizontal axis from 80 to 125, count axis in thousands]
A Classic Good Graphic
Some Types of Graphical and
Tabular Summaries of Data
• Univariate discrete data: tables, barplots, dot
charts, pie charts
• Univariate continuous data: stem-and-leaf
plots, strip charts, histograms, boxplots
• Bivariate discrete data: two-way contingency
tables
• Bivariate continuous data: scatterplots, QQ
plots
Tabular Summaries of Data
• Categorical data: counts and/or percentages
by category
• Continuous data: counts and/or percentages
within “bins”
– Bins: sequential intervals over the range of data
• Generally intervals are of equal width
• Must decide how to count data point that falls
on the boundary between two bins
– Either count them all in the left bins, or in the right
bins
– Doesn’t matter which, just be consistent
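In R, one way to bin continuous data is cut() followed by table(); the right argument of cut() controls which bin a boundary point falls in. The data values below are made up for this sketch.

```r
x <- c(5, 10, 10, 12, 20, 20, 27)  # made-up values; 10 and 20 sit on bin boundaries
breaks <- c(0, 10, 20, 30)

# Boundary points counted in the left bin: intervals (0,10], (10,20], (20,30]
left_counts <- table(cut(x, breaks = breaks, right = TRUE))
left_counts

# Boundary points counted in the right bin: intervals [0,10), [10,20), [20,30)
right_counts <- table(cut(x, breaks = breaks, right = FALSE))
right_counts
```

The totals are the same either way; only the boundary observations move between adjacent bins.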
Example: Tabular Summary
of Univariate Categorical Data
Manufacturer       Frequency   Relative Frequency (fraction)
Honda              41          0.34
Yamaha             27          0.23
Kawasaki           20          0.17
Harley-Davidson    18          0.15
BMW                3           0.03
Other              11          0.09
Total              120         1.00
• In R, use the table() function
• For the example:
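The original R output is not preserved in the transcript. A minimal sketch, assuming the raw data are a character vector named manufac with one entry per motorcycle (rebuilt here from the frequency table):

```r
# Rebuild the raw data from the frequency table (illustration only)
counts <- c(Honda = 41, Yamaha = 27, Kawasaki = 20,
            "Harley-Davidson" = 18, BMW = 3, Other = 11)
manufac <- rep(names(counts), times = counts)

tab <- table(manufac)       # counts by category, listed alphabetically
tab
sum(tab)                    # 120
round(prop.table(tab), 2)   # relative frequencies
```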
Barplots
• Barplots also known as bar charts and bar
graphs
• Plot one bar for each category
– Bars show counts or percentage of observations in
each category
• Can plot bars vertically or horizontally
• In R: barplot()
– Option horiz=TRUE plots bars horizontally
(default is FALSE)
In R
barplot(table(manufac),xlab="Manufacturer",ylab="Count")
barplot(table(manufac),ylab="Manufacturer",xlab="Count",horiz=TRUE)
Plotting Fractions
barplot(table(manufac)/length(manufac),
xlab="Manufacturer",ylab="Fraction")
barplot(table(manufac)/length(manufac),
ylab="Manufacturer",xlab="Fraction",horiz=TRUE)
Histograms
• A histogram is a graph of the observed
frequencies in a sample or population
• Histograms show the distribution of the data
• Reading a histogram:
– There are 10 observations greater than 215 but less
than or equal to 225
[Figure: histogram with horizontal axis from 170 to 260 and counts from 0 to 12]
Histograms Depict
the Empirical Distribution
• Histograms help answer:
– Where is the mean of the data (roughly) located?
– How variable is the data?
– What is the overall shape of the data?
• Is the distribution symmetric? Is it skewed? If so, in
what direction?
– Are there any unusual observations?
• In R: hist() function
– Options:
• breaks option allows user to vary number of bars
• freq=TRUE (default) gives counts
• freq=FALSE gives density histogram (area sums to one)
Frequency Histogram
of Challenger Data
> challenger<-c(84,49,61,40,
83,67,45,66,70,69,80,
58,68,60,67,72,73,70,
57,63,70,78,52,67,53,
67,75,61,70,81,76,79,
75,76,58,31)
> hist(challenger)
Density Histogram
of Challenger Data
hist(challenger,freq=FALSE)
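The difference between the frequency and density versions can be checked numerically. This sketch uses plot = FALSE so that hist() returns the bin information without drawing anything.

```r
challenger <- c(84,49,61,40,83,67,45,66,70,69,80,58,68,60,67,72,73,70,
                57,63,70,78,52,67,53,67,75,61,70,81,76,79,75,76,58,31)

h <- hist(challenger, plot = FALSE)  # compute the bins without drawing

h$breaks                         # bin boundaries chosen by R
sum(h$counts)                    # bar heights when freq=TRUE; they total n = 36
sum(h$density * diff(h$breaks))  # bar areas when freq=FALSE; they total 1
```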
Dos and Don’ts for Histograms
• Do try alternate numbers of bars
– Find best depiction of the shape (distribution) of data
– Start with number of classes = √n (i.e., breaks = √n − 1)
hist(challenger,breaks=2)
hist(challenger,breaks=5)
hist(challenger,breaks=9)
hist(challenger,breaks=25)
• Don’t use unequal bin widths – keep the bar widths all
the same
• Don’t plot histograms by hand – use software
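A sketch of choosing a starting number of bars for the Challenger data, taking the starting rule to be about √n classes. That reading is an assumption; it matches the breaks=5 call among the examples, since n = 36.

```r
challenger <- c(84,49,61,40,83,67,45,66,70,69,80,58,68,60,67,72,73,70,
                57,63,70,78,52,67,53,67,75,61,70,81,76,79,75,76,58,31)

n <- length(challenger)  # 36
k <- sqrt(n)             # 6 classes as a starting point

hist(challenger, breaks = k - 1)

# R treats a single number for breaks as a suggestion and picks "pretty"
# boundaries, so check how many bars were actually produced
h <- hist(challenger, breaks = k - 1, plot = FALSE)
length(h$counts)
```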
Extremes in Histograms
• One extreme: A single bar for all the data (labeled 30-89) – but that just
shows the total, no information about the shape of the data
• √n classes seems to be about right to show distribution of the data
• Another extreme: One bar for each temperature – but that’s just a bar chart.
It’s hard to see the shape
[Figure: histograms illustrating each case; axes are Temperature (F) and Frequency (count)]
Differences Between
Barplots and Histograms
• Barplots:
– For categorical data
– Often most easily read with bars plotted horizontally
– Adjacent bars are separated from each other
• Histograms:
– For continuous data
– Convention to plot bars vertically (to look like a pdf)
– Adjacent (nonzero) bars touch (since base of each
bar denotes the “bin” for that bar)
Boxplots
• Boxplots show distribution in one dimension
– Only useful for continuous variables
– Good for comparing distributions of a continuous
variable between categorical groups
– Will not show multiple modes
• Illustration (of one variant):
[Figure: boxplot with the hinges, median, whiskers, and outliers labeled]
Exercise
• Given the following summary statistics for the Challenger data,
(roughly) draw the boxplot over the “strip chart”
Exercise: Result from R
• Boxplot
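The original plot is not preserved in the transcript. One way to reproduce the numbers behind it and draw a comparable picture is sketched below; the plotting options are choices made here, not necessarily those used on the slide.

```r
challenger <- c(84,49,61,40,83,67,45,66,70,69,80,58,68,60,67,72,73,70,
                57,63,70,78,52,67,53,67,75,61,70,81,76,79,75,76,58,31)

fivenum(challenger)  # min, lower hinge, median, upper hinge, max: 31 59 67.5 75 84

b <- boxplot.stats(challenger)
b$stats              # whisker ends, hinges, and median: 40 59 67.5 75 84
b$out                # points plotted individually as outliers: 31

boxplot(challenger, horizontal = TRUE)                         # the boxplot
stripchart(challenger, method = "jitter", add = TRUE, pch = 1) # raw data on top
```

The lower whisker stops at 40 because 31 lies more than 1.5 times the hinge spread (75 − 59 = 16) below the lower hinge.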
Histograms vs. Boxplots
• Histogram shows distribution of the data in two
dimensions – the boxplot is in one dimension
– Histogram shows frequency of observations within ranges
– Boxplot only shows summary statistics
We’ll Use Software To Do Most
Calculations and Plots…
• …generally R
• Benefits of R include:
– It’s free
– More importantly, it’s powerful, flexible, extensible,
and cutting-edge
– In terms of extensibility, there are now thousands of
libraries (aka packages) available to do custom
calculations, plots, etc.
Some R Paradigms
• Command line interface
• Object-oriented programming
• Types of objects, particularly data frames
• Vector-based calculations
Command Line Interface
• Command line allows scripting/programming,
which gives flexibility and extensibility
– Point and click paradigm limits user to what has
been programmed into the interface
– Trade-off is “user friendliness,” meaning command
line users must learn the underlying language and
syntax
• Good news: Once you gain a working
familiarity, you have access to a very powerful
computing tool
All the Std Graphics Plus…
Example #1: Flexible Graphics
Example #2: Flexible Graphics
Example #3: Flexible Graphics
Example #4: Flexible Graphics
Example #5: Flexible Graphics
Object-oriented Programming
• R is an object-oriented programming
language
– Wikipedia: “Object-oriented programming (OOP) is a
programming paradigm that uses "objects" … to design
applications and computer programs. ”
• Everything in R is an object of some type
– Each type of object has particular properties
– Properties control what objects can and cannot
do, as well as how other objects interact with them
Types of Objects
• Important types of objects in R:
– Vector: a one-dimensional list of numbers
– Matrix: a two-dimensional list of numbers
– Array: a multi-dimensional list of numbers
– Data.frame: a two-dimensional list that can contain
any type of data (numeric, string, logical, etc.)
– Function: small programs that usually take input
as arguments and after running produce output
• The function class(obj) will tell you what
type of object “obj” is
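A short sketch of class() applied to each object type in the list. The object names and values are made up for illustration.

```r
v <- c(1.5, 2.5, 4)                 # vector
m <- matrix(1:6, nrow = 2)          # matrix
a <- array(1:24, dim = c(2, 3, 4))  # array
d <- data.frame(id = 1:2, name = c("a", "b"), ok = c(TRUE, FALSE))  # data frame

class(v)     # "numeric"
class(m)     # includes "matrix"
class(a)     # "array"
class(d)     # "data.frame"
class(mean)  # "function"
```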
More on Data Frames
• Think of them like tables
– Columns correspond to variables (and data in
columns must all be of the same type)
– Rows correspond to observations
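As a small illustration, here is a made-up data frame in which each row is one observation and each column is one variable with a single type. All names and values are hypothetical.

```r
# Made-up data: each row is one observation, each column one variable
flights <- data.frame(
  pilot       = c("A", "B", "C"),
  flight_time = c(1.5, 2.25, 0.75),    # numeric column
  night       = c(FALSE, TRUE, FALSE)  # logical column
)

nrow(flights)           # number of observations: 3
ncol(flights)           # number of variables: 3
sapply(flights, class)  # one type per column
flights$flight_time     # a single column is a vector
```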
More on Functions
• Function calls always end with parentheses
– If there are arguments, they go here
– Some functions don’t have or need arguments
• Example: ls()
– The function’s code is printed when the parentheses are left off
• Can run functions of functions
– Example: mean(seq(1, 9))
• Lots of built-in functions and you can write
your own
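These points can be tried at the command line. In the sketch below, the function name spread is made up; it simply recomputes the range statistic from earlier in the module.

```r
objs <- ls()     # a function called with no arguments
ls               # without parentheses, R prints the function's code instead
mean(seq(1, 9))  # a function of a function: 5

# Writing your own function (the name spread is made up for this sketch)
spread <- function(x) max(x) - min(x)
spread(c(21, 40, 33))  # 19
```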
Vector-based Calculations
• R very efficient (i.e., fast) working with
vectors, much less so with loops
• Key idea: In data frames, instead of writing
code that operates on the rows of a data
frame (i.e., observation by observation) you
write code that operates on the variables
(i.e., the columns, which are the variables!)
• Takes a while to get used to thinking in terms
of vectors rather than individual observations
Simple Example
• Data frame with data
on various types of
travel for a set of
individuals:
• Easy way to calculate total days deployed in R:
Simple Example, continued
• Even fancier:
• The hard way:
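The code from these slides is not preserved in the transcript. The following is only a sketch of the contrast, using a made-up data frame; the column names and values are assumptions, not the original data.

```r
# Made-up data frame; the column names are assumptions for this sketch
travel <- data.frame(
  person   = c("A", "B", "C", "D"),
  deployed = c(120, 0, 45, 200),  # days deployed
  tdy      = c(10, 25, 5, 0),     # days on temporary duty
  leave    = c(14, 30, 21, 7)     # days on leave
)

# Vectorized: operate on the whole column at once
total_vec <- sum(travel$deployed)

# Several columns in one call
colSums(travel[, c("deployed", "tdy", "leave")])

# The hard way: loop over the rows one observation at a time
total_loop <- 0
for (i in seq_len(nrow(travel))) {
  total_loop <- total_loop + travel$deployed[i]
}

total_vec == total_loop  # TRUE; both are 365
```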
What We Covered in this Module
• Defined types of data and types of variables
• Learned how to appropriately summarize
data using descriptive statistics
– Numerical descriptive statistics
• Measures of location: mean, median, mode
• Measures of spread: variance, standard
deviation, range, inter-quartile range, etc.
– Graphical descriptive statistics
• Continuous variables: histogram, boxplot
• Categorical variables: barplots, pie charts
• R paradigms and summarizing data with R
Homework
• WM&S chapter 1
– Required exercises 2, 9, 13, 17, 22, 25
– Extra credit: 11
• Hints and instructions:
– Do exercises 2, 13, and 25 in R as much as possible
• The data sets are in Sakai in CSV format; read them in using
the instructions from Lab #1
• Exercise 2: Just construct a frequency histogram in R with the
Mt. Washington observation left out
• Exercises 13 and 25: The sort() function in R could be useful
for counting the number that fall in each interval
– Exercise 9: Use either Table 4 in WM&S or R to calculate. If
you use R, the pnorm() function will be helpful
– Exercise 17: Only do the approximation for Exercise 1.2