MATH 105: [Probability and] Statistics
Joe Whittaker
B25 Fylde College
Department of Mathematics and Statistics
Lancaster University
April 2010
LUVLE: https://domino.lancs.ac.uk/09-10/MATH/MATH105.nsf
Organization
The module runs for five weeks, weeks 21-25, with four lectures a week, a weekly
workshop and a weekly Lab100 help session. Handouts:
• Course notes
• Exercises: Workshop, Quiz, Course Work.
Please bring both to the lectures and workshops. The notes have gaps which are to be
filled in during the lectures.
Your participation in the course, by taking part in experiments, contributing in lectures
and workshops and responding to the questionnaire is much appreciated.
Timetable
[Timetable grid: days Sunday to Friday against hourly slots from 9-10 to 5-6, plus 11pm. Entries: GFoxLT1, Faraday, Office B25, WkShop1 to WkShop4, 100-QZ due, 105-QZCW due.]
Lectures: are held at 10am Tuesday, 11am Wednesday, 9am Thursday and 12 noon
Friday.
Workshops: will be held in Management School Lecture Theatre 7. Lists of groups are
posted outside the Maths and Stats Department Office in Fylde College.
Workshops start in the first week.
Labs: continue in the lab100 stream:
Monday at 10; 12; 4; 5; Tuesday at 9; 11; 12; 5. Labs start in the first week, and there is a test
in week 24.
Any problems, please see Julia in B4c Fylde.
Assessment
• 20% Course Work (10% quiz + 10% written),
• 30% end-of-term test (Friday week 25),
• 50% final exam.
Deadlines
Online quiz questions (labelled QZ) should be completed by 2pm on the following
Wednesday.
Homework questions (labelled CW) should be handed in by 2pm on the following
Wednesday in your tutor’s pigeonhole.
Solutions are posted on the course webpage.
Labs
The Lab100 course is running in parallel with math105. Weekly help sessions are
available. You are expected to have downloaded R on to your computer. The first
lectures are on R, and are examinable in math105.
Preliminaries
The math105 course continues on from math104, very directly. Firstly you have met
R in the lab100 work associated with math104. Both math104 and math105 require
R and the first Chapter here goes over a tutorial introduction to some of the basic
concepts of the language.
Secondly, the extension of probability from discrete random variables, discussed in
math104, to continuous random variables is discussed here. Both the discrete and
the continuous cases are needed for statistics. The mathematical prerequisite for the
analysis of continuous random variables is the integral calculus of math101.
The third part of the course introduces the statistical methods which are required
for tackling a range of applied problems. The focus is on strategies for data modelling
rather than mathematical theory. However, there is some theory, and we aim to
introduce the basic concepts, as the theory will be taught fully in later statistics courses.
Data examples are used throughout the course, to illustrate the techniques that the
course aims to teach you. The course data sets are on the LUVLE course web.
At the end of this course, you should be able to:
• understand the basic concepts and objects of the R language, including some
elements of programming;
• define the basic concepts of continuous random variables, the probability density
function and the cumulative distribution function;
• have familiarity with some standard continuous random variables, such as the
Uniform, Exponential and Normal; be aware of their parameters and how these
relate to expectations.
• use R to make computations and plots of the cdf, and of quantiles derived from
it;
• use R to simulate from standard distributions;
• use graphical tools such as histograms, scatterplots, empirical distribution function and the boxplot;
• calculate and understand numerical summary statistics such as mean, median,
variance, quantiles and the correlation coefficient;
• discuss a range of modelling assumptions that can play a part in statistical analysis.
Background reading
Although the lectures and these accompanying notes are self-contained, further details
can be found in the following recommended texts:
Clarke, G.M. and Cooke, D. (1998). A Basic Course in Statistics. 4th ed, Arnold.
Daly, F., Hand, D., Jones, M., Lunn, A. and McConway, K. (1995). Elements of
Statistics. Addison Wesley.
Lindsey, J. (1995). Introductory Statistics: A Modelling Approach. Oxford Science Publications.
Chapter 1
Introduction to R
R is a software package, and a language, that provides a statistical computing environment. R is open source and can be downloaded from http://www.r-project.org.
More information on obtaining R, and this tutorial, can be found on our /department/info/intranet/com
pages.
1.1 The tutorial
Objects
Type x = 3 or x <- 3 to create a new object called x which has the value 3. The
operator = or <- is not the mathematical = but an assignment operator.
Predict the value of y to understand what is happening:
x <- 6
x
x^2
y <- x*(4+x/2)
y # the answer
# is a comment, and all to its right is ignored. The arithmetic operators + - * / ( )
work as expected. The hat is used for exponentiation, so 3^2 is 9.
Exercise 1.1 Create a new variable called z with the value "five cubed divided by
seven plus two".
Sol:
z = 5^3/(7+2)   # 13.88889
z = 5^3/7+2     # 19.85714
Use precedence to resolve ambiguity.
There are some tricks you can use to save on typing: whenever possible paste R code
from the pdf file into a text editor; use the up and down arrow keys to recall and edit
previous commands.
Functions
Most statements in R involve functions, and usually involve the use of round brackets
(). Functions are ways of running commands in R on given inputs; there may or may
not be an output.
y <- sin(pi/4)   # gives y the sine of pi/4
round(y)
ls()             # lists your objects
rm(y)            # removes y
q()              # quit R.
At exit you are asked if you would like to save the objects you created. If you answer
"yes" all the objects will still be there the next time you start R.
Type ls without the (). Notice that this shows the code of the ls function, but does
not run the function.
Vectors
Vectors are used to store more than one number in an object.
x <- c(pi, 1, 8.6, -1, 0)    # "c" function creates vector
y <- 1:5                     # y gets (1, 2, 3, 4, 5)
y <- seq(from=0,to=6,by=2)   # a sequence, takes 3 args
length(y)
z <- c("a","b","c")          # z gets 3 characters
Vectors are indexed using the square brackets [].
x[3]        # element 3 of x
x[c(4,1)]   # elements 4 and 1 of x
x[-3]       # x without element 3
x+2         # element by element arithmetic
round(x)    # rounds each element of x
Notice how functions work on vectors: they apply to each element of the vector.
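As a small illustration of this element-by-element behaviour, the sketch below compares a vectorised expression with the same calculation written out one element at a time. The vector and the names by_vector and by_element are made up for this example.

```r
# A short vector (values chosen only for illustration)
x <- c(2, 3, 4)

# Vectorised: the expression is evaluated for every element of x at once
by_vector <- x^2 + 1

# The same calculation written out one element at a time
by_element <- c(x[1]^2 + 1, x[2]^2 + 1, x[3]^2 + 1)

by_vector                        # 5 10 17
identical(by_vector, by_element) # TRUE
```

The vectorised form is shorter and is the usual way of writing such calculations in R.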
Exercise 1.2 For each of the numbers 2, 3, 4, 5, 6 and 12 find the square of the number
divided by 2.
Sol:
x = c(2, 3, 4, 5, 6, 12)
x^2/2     # or is it?
(x/2)^2
Graphics: simple plots
The function plot() starts a new plot. Usually it requires a vector of x-coordinates
and a vector of y-coordinates as input:
x <- 1:20
y <- x^3
plot(x,y)   # function with 2 (or more) arguments
A new graphics window pops up for the plot. Subsequent plots overwrite the current
plot in this window.
To get a line plot instead of a point plot, use the optional argument type="l" with
plot():
plot(x,y,type="l")
You can add points or lines to an existing plot, using the points or lines functions:
plot(x,y)
points(rev(x),y)   # rev reverses the order
lines(x,8000-y)
Different plotting characters for points are chosen with the pch= argument. Numbers give various symbols,
and characters use that letter as a marker:
plot(x,y)
points(rev(x),y,pch=3)     # add crosses
points(x,8000-y,pch="x")   # a character
Change line widths with lwd=, or line styles with lty=. Colours are set with col=.
plot(x,y, col="red")
lines(x,y,lwd=4)        # thick line
lines(rev(x),y,lty=2)   # dashed line
You can label your axes with the xlab= and ylab= arguments, and you can give your plot
a title with the main= argument.
plot(x,y,xlab="X Is Across",
ylab="Y Is Up",
main="Main Title")
Exercise 1.3 Draw a blue circle, put a nice title on your graph but no axis labels.
Hint: think radial.
theta <- seq(0, 2*pi, length=100)
x = cos(theta)
y = sin(theta)
plot(x,y,type='n',   # sets out axes, no points
     xlab="", ylab="", main="circle")
lines(x,y,col="blue")
Getting Help in R
You can start R's help system by typing help.start(), or by using the menus.
Either one will start a web browser window showing the R help web page. If this doesn't
work, go to http://stat.ethz.ch/R-manual/R-patched/doc/html/index.html or
http://tinyurl.com/cny9k.
For example, to find out more about the seq function either enter ?seq, or use the menus.
Exercise 1.4 Use the text function to put the name of your favourite philosopher in
the centre of your blue circle.
Sol:
?text
text(0,0,'Plato')
Reading data into R
The read.table() function is used to read a data file into R. Save the file class96.dat
in your home directory. Look at the file in your favourite text editor.
Load it into R with a command such as
class96 <- read.table("h:/class96.dat")   # windows
class96 <- read.table("class96.dat")      # linux local dir
class96 <- read.table("~/class96.dat")    # linux top dir
Notice that the forward slash, not the backslash, is used to delimit folders, even
in Windows.
Matrices
The class96 dataset contains the heights and weights of the students enrolled in GSSE401
in 96. The four columns in the matrix are: number in list, height (cm), weight (kg),
and sex, where 1=female, 2=male.
class(class96)     # data.frame, more than a matrix
class96            # displays the values
dim(class96)       # dimensions of the matrix
class96[3,4]       # element in row 3 column 4
class96[1:3,]      # first three rows and all the cols
class96[,c(1,3)]   # 1st and 3rd column
hist(class96[,2],  # histogram of student heights
     main="Student Heights", xlab="cm")
Headers and column names
names(class96)        # default names
names(class96) = c("number", "height",
                   "weight", "gender")
class96
class96[1:3, c("height", "weight")]   # same as class96[1:3,c(2,3)]
class96$height[1:3]   # list access
There are more options for read.table, e.g. to read in files separated by commas, use
sep=",". The na.strings argument tells R how missing values are coded. R expects
missing values to be written as NA. If the missing values are coded differently, a dot
say, use na.strings=".".
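The sketch below shows these two options together. It writes a tiny comma-separated file to a temporary location and reads it back; the file contents and the object name demo are invented for illustration and are not part of the course data.

```r
# Write a small comma-separated file in which a dot marks a missing value
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,170,65,1",
             "2,.,80,2",
             "3,165,.,1"), tmp)

# sep="," splits the fields on commas; na.strings="." turns each dot into NA
demo <- read.table(tmp, sep = ",", na.strings = ".")
demo
is.na(demo)      # TRUE exactly where the file had a dot

unlink(tmp)      # remove the temporary file
```

Without na.strings=".", the columns containing a dot would be read as character strings rather than numbers.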
Writing functions
Functions are of the generic form
name <- function(input args){ statements }
For example
myfun <- function(x){ plot(x, x^2-x) }
myfun takes one argument, x, and makes a graph with it. It expects x to be a vector
with several values.
x = seq(-2, 3, len=100)
myfun(x)
myfun   # lists the code
Storing commands in files
Usually we write our R functions in separate files and then load them into R. Tradition
has it that we give these files .R extensions. You can use any editor to write functions,
though Emacs recognises R code and has some special R features. Create a new file
called joe.R containing
theplot <- function() {
x <- seq(-2, 2, len=1000)
y <- log(abs(x))
plot(x, y, main="the plot", lwd=2, col="green")
}
Notice that the statements of the function are enclosed in curly braces {}. There are
three ways of loading this file into R: (1) type source("joe.R"); (2) if you are running
R in Emacs, pull up the file joe.R and press Ctrl-C Ctrl-L, or use the menus ESS ->
Load File; (3) if your R interface has menus (such as the Gnome or Windows GUI),
select File -> source. Having sourced the file joe.R, run the function theplot().
Arguments
Change your function theplot to the following
theplot <- function(minx, maxx) {
x <- seq(minx, maxx, len=1000)
y <- log(abs(x))
plot(x, y, main="the plot", lwd=2, col="green")
}
As you can probably guess, if you type theplot(-1, 3) the graph that results will
have an x axis going from −1 to 3. You could get the same result by running
theplot(maxx=3, minx=-1)
theplot(min=-1, ma=3)   # or
If the first line of the file is changed to
theplot <- function(minx=-2, maxx=2) {
the default values of minx and maxx will be −2 and 2. Try running theplot(max=5).
Types of R objects
There is a difference between numbers and characters. 1 is a number. "one" and "1"
are character strings. Try the following:
x <- c("1", "2")
x
x+1
as.numeric(x) + 1
as.character(1:4)
class(x)   # gives class of an object
Logicals
Logical or binary data consist of TRUE (or T or 1) or FALSE (or F or 0). Predict the
output of each line before running it in R.
x <- 1:4
x == 2         # notice the double ==
x > 2
x != 2
x[x!=2]        # clever
x[x!=2] <- 7
Exercise 1.5
Write a function that changes every negative element in a vector to NA.
Sol:
neg2na = function(x){ x[x<0] = NA; return(x) }
y = c(1,2,-3,4)
neg2na(y)
Lists
Lists are collections of objects; an element of a list can be anything: a matrix, a vector,
a function, or even another list.
x <- list(lenin = sum,
marx = c("bourgeois", "class struggle"),
engels = matrix(1:4, nrow=2))
x
x$marx
x[[2]]
x[[1]](1:4)   # is 1+2+3+4
x[["engels"]]
length(x) ; names(x); summary(x)
Most statistical functions in R return lists.
Nothing
NULL means nothing. NA is a missing value. NaN means not a number. Try the following
to see if there is a difference between NULL and NA.
c()
c(1:3, NA)
c(1:3, NULL)
x <- c(NA, NA, 3, pi)
x == 3
is.na(x)
is.na(NULL)
is.null(x)
is.null(NULL)
Typing x==NA fails; you need is.na(x). Other functions to test the class of an object
are is.list, is.matrix, is.logical, is.numeric.
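A minimal sketch of why the comparison with NA does not work as a test, and of how missing values are usually handled in practice. The vector is the one used above; the na.rm argument of mean is not introduced in the notes and is added here only as an illustration.

```r
x <- c(NA, NA, 3, pi)

x == NA                 # NA NA NA NA: comparing with a missing value is itself missing
is.na(x)                # TRUE TRUE FALSE FALSE
n_missing <- sum(is.na(x))   # TRUE counts as 1, so this counts the missing values

mean(x)                 # NA, because some values are missing
mean(x, na.rm = TRUE)   # mean of the non-missing values only

length(NULL)            # 0: NULL contains nothing
length(NA)              # 1: NA is a value, although a missing one
```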
Programming: Loops
The for loop is an important programming tool. A simple loop is for( x in c(4,2,6) ) { print(x) }
There are other kinds of loops which are more difficult to use but are faster than for
loops. The by function is quite clever. To find the mean height of the class96 students
according to sex,
by(class96$height, class96$gender, mean)   # the sex column was named "gender" above
Apply
To apply a function to every column of a matrix use apply, and for lists use lapply.
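The sketch below illustrates both functions on small made-up objects (the matrix m and the list parts are not course data). The second argument of apply selects the margin: 1 for rows, 2 for columns.

```r
m <- matrix(1:6, nrow = 2)       # 2 rows and 3 columns, filled column by column
m

col_sums <- apply(m, 2, sum)     # margin 2: one sum per column
row_sums <- apply(m, 1, sum)     # margin 1: one sum per row
col_sums                         # 3 7 11
row_sums                         # 9 12

parts <- list(a = 1:3, b = c(10, 20))
part_means <- lapply(parts, mean)   # returns a list with one mean per element
part_means$a                     # 2
part_means$b                     # 15
```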
Exercise 1.6
Use apply to find the sum of each column of class96.
Sol:
sum(class96)
apply(class96,2,sum)
mean(class96)   # mean is special [class96 has names]; recent versions of R
                # return NA with a warning here, so use colMeans(class96)
apply(class96,2,mean)
If statements
Predict the output from the following code
for (x in 1:4) {
if (x>3) {
print("big")
} else {
print("small")
} }
Exercise 1.7 Write the code to draw four circles with decreasing radii. Put an if
statement in the for loop to make the 2 smallest circles red.
theta = seq(0,2*pi,length=100)
x0 = cos(theta)
y0 = sin(theta)
plot(x0,y0,type='n')
for(i in 1:4) {
  r = (.8)^i
  x = r*x0
  y = r*y0
  if(i>2) { lines(x,y,col='red') }
  else { lines(x,y,col='blue') }
} # end for
Objects within functions
Any objects you create within a function die when the function ends.
dumbfun <- function() { x <- 1 }
x <- 2
dumbfun()
x
The x inside dumbfun is a separate, local object that has the value 1 only while the function is
running. The x in your workspace is never changed, so its value is still 2 after the call.
Saving objects
save(x, y, file="xandy.RData")
load("xandy.RData")   # load the file, with x and y
save.image()          # save all objects to .RData
When you quit with the q() function and choose to save, R runs save.image(). Whenever you start R,
R runs load(".RData"). Emacs asks you what directory to start R in, because it will
load the .RData file from that directory.
Multiple plots
The graphics device can display several small plots at a time instead of one big one.
Use the par() (parameter) function:
par(mfrow=c(2,3)) # 2x3 array
for(plato in 1:6) { plot(1:5, pch=plato) }
mtext("The Republic", outer=T,
line=-3, cex=2, col="red")
The screen clears when you try and plot the seventh graph. To reset the plotting
window back to normal, do par(mfrow=c(1,1)).
Printing plots
Plots can be printed directly but it is best to save them to a file. This is particularly
useful for essays and reports since these files can be easily read into standard word
processors (LaTeX, Word, etc.). Try the following:
par(mfrow=c(1,1))
plot(sin(1:1000))
pdf(file="sines.pdf", height=4, width=6)   # give file name size
plot(sin(1:1000))                          # plot to file
dev.off()                                  # finished writing
Check this works: xpdf sines.pdf.
Functions for probability distributions
There are four functions related to standard distributions in R, prefixed by one of dpqr.
For the Poisson distribution, the pmf is $p(x) = \exp(-\lambda)\lambda^x / x!$ for x = 0, 1, . . ..
dpois(2, lambda=1 )    # d gives the pmf, parameter lambda
exp(-1)*1^2/factorial(2)
ppois(2, lambda=1 )    # p gives the cdf
exp(-1)*(1^0+1^1/factorial(1)+1^2/factorial(2))
qpois(.7, lambda=1 )   # q gives the quantile
rpois(10, lambda=1 )   # r gives random numbers
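The four prefixes are linked to one another, and the following sketch checks those links numerically for the Poisson distribution with lambda = 1. The seed and the number of simulated values are arbitrary choices made for this illustration.

```r
lambda <- 1

# p is the running total of d: F(2) = p(0) + p(1) + p(2)
pmf <- dpois(0:2, lambda)
all.equal(sum(pmf), ppois(2, lambda))    # TRUE

# q inverts p: qpois(0.7) is the smallest x with F(x) >= 0.7
q <- qpois(0.7, lambda)
ppois(q - 1, lambda) < 0.7               # TRUE
ppois(q, lambda) >= 0.7                  # TRUE

# r simulates: the sample mean should be close to lambda
set.seed(1)                              # arbitrary seed, for reproducibility
draws <- rpois(10000, lambda)
mean(draws)
```

The same pattern holds for the other distribution families in R, for example dexp, pexp, qexp and rexp.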
Plotting mass functions using barplot
There are some examples of this in math104/lab100 exercises.
Exercise 1.8
Make a free hand plot from this code.
par(mfrow=c(1,2))   # sets up subplots
barplot( dpois(0:10, lambda=1 ),
names.arg=0:10, ylim=c(0,.4))
barplot( dpois(0:10, lambda=3 ),
names.arg=0:10, ylim=c(0,.4))
Plotting the pdf
The probability density function (pdf), introduced in the next section, is the analogue
of the pmf for a continuous rv.
Exercise 1.9 The exponential distribution is a good example, and has pdf f(x) = θ exp(−θx)
for x > 0.
dexp(2, rate=1 )    # d gives the pdf, parameter rate=theta
exp(-2)
pexp(2, rate=1 )    # p gives the cdf
1-exp(-2)
qexp(.7, rate=1 )   # q gives the quantile
rexp(10, rate=1 )   # r gives random numbers
xval = seq(0,4,len=100)
f = dexp(xval, rate=1 )
F = pexp(xval, rate=1 )
plot( xval, f, type='n') ; grid()
lines(xval, f, col='red')
lines(xval, F, col='blue')
Accessing course datasets
Throughout the session, we will use some data examples, which you can download from
the course webpage.
Save the file in your working directory, using the filename m105.Rdata. Then in R,
type load("m105.Rdata"); ls() or load("YOURPATH/m105.Rdata"). This is needed
for each new R session.
1.2 Chapter summary
The basic constructs of the R language are objects and functions, and are introduced by
way of example. Examples of objects are vectors, matrices and dataframes. These can
contain numbers, characters or mixtures of such. Examples of functions are methods
of manipulating these objects including extracting arithmetic summaries, plots and
transformations.
Some instances of writing functions are given, together with a brief summary of some
programming constructs, such as the for loop and the if statement.
Methods specific to plotting pdfs are given, and to reading data from file.
Chapter 2
Continuous random variables
2.1 Review of probability
Math104 introduced the concept of probability and of a discrete random variable. Here
we review some of the basics and introduce continuous random variables.
Probability
Probability considers an experiment before it is performed. Probability, P , is a measure
of the chance that an event may occur in the experiment. Tossing a coin or conducting
an election survey is an example of an experiment. An event, A, is a subset of the
sample space, Ω, the set of all possible outcomes.
Observing a tail in a coin throw or hearing a yes response to a survey question are both
events. Legitimate questions are then: What is the probability of seeing the tail twice
in the experiment of tossing two coins? What is the probability of getting no positive
responses in the survey?
The Axioms of Probability
Mathematically, probability is a function P which assigns to each event A in the sample
space Ω a number P (A) in [0, 1] such that
• Axiom 1: P (A) ≥ 0 for all A ⊆ Ω;
• Axiom 2: P (Ω) = 1;
• Axiom 3: P (A ∪ B) = P (A) + P (B) if A ∩ B = ∅ for any A, B ⊆ Ω.
For mathematicians probability is a function. In everyday English probability is closely
associated with words such as chance, uncertainty, randomness, likelihood.
If probability considers the experiment before it is performed, statistics considers the
experiment after it is performed.
Examples of discrete random variables
The sample space Ω for a discrete rv X is countable, and we usually take it to be a
subset of the integers or the non-negative integers.
Exercise 2.1
Give examples related to University, family and sport.
Sol:
college membership Ω = {Bow, Car, . . .} ,
exam grades Ω = {A, B, C, D, E},
number of goals in a match, Ω = {0, 1, 2, . . .}
number of children in a family, same.
Probability mass function
Definition: The probability mass function (pmf) of a discrete random variable X is
p(x), where
p(x) = P (X = x) for x = 0, 1, 2, . . .
Result: (Properties of the pmf). The probability mass function p(x) satisfies
• 0 ≤ p(x) ≤ 1 for all x;
• $\sum_{x=0}^{\infty} p(x) = 1$;
• For any event A, $P(X \in A) = \sum_{x \in A} p(x)$.
For example,
$$P(a < X \le b) = P(X = a+1) + P(X = a+2) + \cdots + P(X = b) = p(a+1) + p(a+2) + \cdots + p(b).$$
Definition: The cumulative distribution function (cdf) is defined as
$$F(x) = P(X \le x) \quad \text{for } -\infty < x < \infty.$$
Result: The cdf simplifies to
$$F(x) = P(X \le x) = P(X \le \mathrm{int}(x)) = \sum_{k=0}^{\mathrm{int}(x)} P(X = k),$$
where int(x) denotes the largest integer smaller than or equal to x, e.g. int(5.2) = 5,
int(3) = 3, int(−2.1) = −3. For x < 0 the sum is empty and F(x) = 0.
This is a step function, and is not continuous.
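The following sketch computes a discrete cdf directly from this sum, using the Poisson(1) pmf as an example. The function name step_cdf is made up for this illustration, and R's floor function plays the role of int.

```r
# cdf of a Poisson(lambda) rv at any real x, built from the pmf
step_cdf <- function(x, lambda = 1) {
  if (x < 0) return(0)                 # no mass below zero
  sum(dpois(0:floor(x), lambda))       # add up p(0), ..., p(int(x))
}

step_cdf(5.2)                          # same value as at x = 5
step_cdf(5)
step_cdf(-0.5)                         # 0

# Agrees with R's built-in cdf
all.equal(step_cdf(5.2), ppois(5.2, lambda = 1))
```

Because the value only changes when x passes an integer, plotting step_cdf over a fine grid gives the staircase shape described above.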
Exercise 2.2 For a random variable X that takes values {0, 1} with probabilities
θ, 1 − θ, obtain P (X ≤ x) for all x ≥ 0.
Sol:
Add graph.
$$P(X \le x) = \begin{cases} 0 & \text{if } x < 0 \\ \theta & \text{if } 0 \le x < 1 \\ 1 & \text{if } 1 \le x \end{cases}$$
Exercise 2.3 Use this R code to plot the cdf at the points (−1, 0, 1, 2) when θ = .4.
Draw the graph in your notes.
theta=.4
p0 = theta ; p1 = 1-theta
xval = c(-1, 0, 1, 2)
F = c( 0, p0, p0+p1, 1)
plot( xval ,F)
plot( xval ,F, type = 's')   # adds in step function
points(xval ,F)
2.2 Continuous and discrete rvs
A mathematical way of describing a probability experiment and its events is to define
a random variable associated with it.
Definition: A random variable X is a function from the sample space Ω to the real numbers
R (continuous) or to the integers Z (discrete).
Exercise 2.4 Experiment 1: In a presidential election with two candidates B and C,
the possible outcomes are Ω = {B, C}. Define a random variable X that maps from Ω
to {0, 1}:
X(B) = 0, X(C) = 1.
Then the probability of the event {C} is equivalent to P (X = 1).
Exercise 2.5 Experiment 2: A national air quality monitoring system automatically
collects measurements of ozone level at designated sites. The possible outcomes are
Ω = {x : x ≥ 0}. Define a random variable X to be the value of the measurement,
X(x) = x,
the identity map. Then the probability that ozone level falls below a certain level c is
given by P (X ≤ c).
Remarks on rvs
• A random variable (rv) X is a function that associates a unique number with each
possible outcome of an experiment.
• Associated with each discrete random variable X is a probability mass function
(pmf) p(x) from which probabilities of all possible events involving X may be
computed.
• Associated with each continuous random variable X is a probability density
function (pdf) f (x) from which probabilities of all possible events involving X
may be computed.
• Associated with the pmf and with the pdf is the cumulative distribution function
(cdf) F (x) that gives particular probabilities.
• A continuous rv is by definition one that has a continuous cdf. The cdf of a
discrete rv is a step function.
• Associated with the pmf and the pdf are numerical summaries such as E(X),
var(X) and, for continuous rvs, quantiles of F (x).
• Often in scientific investigation X represents the variable of main interest that
can be measured or observed.
The cumulative distribution function (cdf)
In order to describe all possible outcomes of an experiment, we focus on an event of
the basic form
{X ≤ x}
for fixed x, where x can take any value.
Exercise 2.6 Express a general event {a < X ≤ b} using the basic form with set
operations.
Sol:
{a < X ≤ b} = {X ≤ b} ∩ {X ≤ a}^c.
If we have a rule for assigning probability to events of the basic form, then the probability
of any event can be determined.
Definition: For any discrete or continuous univariate random variable X, the cumulative
distribution function, cdf, F : R→[0, 1], is defined by
F (x) = P (X ≤ x).
In terms of the original sample space the event {X ≤ x} is interpreted as {ω : X(ω) ≤ x}.
F is defined for −∞ < x < ∞, and we require F (−∞) = 0 and F (∞) = 1 to avoid
having to deal with degenerate rvs.
xvals = seq(-6,6,length=100)
F = pnorm(xvals)
plot(xvals,F,type='n'); grid()
lines(xvals,F,col='red')
It is a result that F is a non-decreasing function.
Exercise 2.7 Prove that for a ≤ b,
P (a < X ≤ b) = F (b) − F (a).
Sol:
From above, {X ≤ b} = {a < X ≤ b} ∪ {X ≤ a},
a union of disjoint events, so
P ({X ≤ b}) = P ({a < X ≤ b}) + P ({X ≤ a}), or
F (b) = P (a < X ≤ b) + F (a).
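This identity can be checked numerically with R's cdf functions. The sketch below does so for one continuous and one discrete example; the distributions and the end points are chosen only for illustration.

```r
# Continuous case: standard Normal, P(-1 < X <= 1) = F(1) - F(-1)
p_norm <- pnorm(1) - pnorm(-1)
p_norm                                   # about 0.683

# Discrete case: Poisson(3), P(2 < X <= 5) computed in two ways
a <- 2; b <- 5
p_cdf <- ppois(b, lambda = 3) - ppois(a, lambda = 3)   # F(b) - F(a)
p_sum <- sum(dpois((a + 1):b, lambda = 3))             # p(3) + p(4) + p(5)
all.equal(p_cdf, p_sum)                  # TRUE
```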
Probability for continuous rvs
When the cdf F (x) = P (X ≤ x) is continuous the outcomes of the experiment have to
be measurements on a continuous scale, and the rv is said to be continuous. Examples
include
ozone level, weight, direction, waiting times, stock price,. . .
Result: (Zero probability). If X is a continuous rv, then P(X = x) = 0 for all x.
Proof: From above, for h > 0,
$$P(x - h < X \le x + h) = F(x + h) - F(x - h),$$
so if F is continuous
$$P(X = x) = \lim_{h \to 0} P(x - h < X \le x + h) = \lim_{h \to 0} [F(x + h) - F(x - h)] = 0.$$
Therefore, unlike the discrete case, the probability of an event cannot be
obtained by summing the probabilities of single outcomes. To describe the probability of an event for a continuous
random variable, we need new mathematical tools!
Probability density function
Assume the cdf is differentiable as well as continuous.
Definition: The probability density function (pdf) f(x) of a continuous random variable X is
defined by
$$f(x) = \frac{d}{dx} F(x).$$
Result: (The cdf as a definite integral). The cdf satisfies
$$F(x) = \int_{-\infty}^{x} f(u)\, du.$$
Proof: Standard rules of integral calculus.
The cdf is a definite integral of the pdf. (If discrete the cdf is the definite sum of the
pmf.)
Result: The probability density function f(x) satisfies
• f(x) ≥ 0 for all x;
• $\int_{-\infty}^{\infty} f(x)\, dx = 1$;
• For any event A, $P(X \in A) = \int_{x \in A} f(x)\, dx$.
However it may be that f (x) ≥ 1 for some x.
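A short numerical sketch of this point, using the exponential density with rate 3 as an example (the rate and the evaluation point are arbitrary choices). The density exceeds 1 near zero, yet the total area under it is still 1.

```r
# Exponential pdf with rate 3, evaluated at x = 0.1
dens <- dexp(0.1, rate = 3)
dens                                   # larger than 1

# The area under the whole curve, by numerical integration
area <- integrate(dexp, lower = 0, upper = Inf, rate = 3)$value
area                                   # 1, up to numerical error
```

A density is a rate of probability per unit of x, so only its integral over an interval, not its height, is constrained to lie between 0 and 1.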
Interpretation of the pdf
Result: (Area under the curve.) Using calculus,
$$P(a < X \le b) = F(b) - F(a) = \int_{-\infty}^{b} f(x)\, dx - \int_{-\infty}^{a} f(x)\, dx = \int_{a}^{b} f(x)\, dx,$$
but this is the area under the curve (x, f(x)) over the interval (a, b]. Hence this area represents
the probability that the rv X lies in this interval.
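As a numerical check of the area interpretation, the sketch below computes the same probability three ways for an Exponential rv with rate 1: as an area by numerical integration, as a difference of cdf values, and from the closed form. The end points 0.5 and 2 are arbitrary.

```r
a <- 0.5; b <- 2

# Area under the pdf between a and b
area <- integrate(dexp, lower = a, upper = b, rate = 1)$value

# Difference of the cdf at the two end points
diff_cdf <- pexp(b, rate = 1) - pexp(a, rate = 1)

# Closed form, since F(x) = 1 - exp(-x) for this distribution
exact <- exp(-a) - exp(-b)

c(area = area, diff_cdf = diff_cdf, exact = exact)   # all three agree
```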
Figure: Example of a pdf, plotting f(x) against x. P (a < X ≤ b) is the area under the curve between a and b.
Note that the density function f (x) itself does NOT represent the probability of any
event.
Exercise 2.8 For a random variable X with cumulative distribution function
$$F(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } 0 \le x \le 1 \\ 1 & \text{if } x > 1 \end{cases}$$
(a) Find P (0.3 < X ≤ 0.5).
(b) Find the pdf of X.
(c) Sketch the pdf and shade the area under the curve between 0.3 and 0.5.
Sol:
(a) P (0.3 < X ≤ 0.5) = F (0.5) − F (0.3) = 0.5 − 0.3 = 0.2.
(b) $f(x) = \frac{d}{dx} F(x) = \begin{cases} 1 & \text{if } 0 \le x \le 1 \\ 0 & \text{if } x < 0 \text{ or } x > 1. \end{cases}$
(c) Sketch.
Simulating rvs
It is desirable to do experiments with simulated data, where we know the true underlying distribution, which is never the case with real life data!
If a random variable X has the Uniform distribution on the interval (0, 1) then the pdf
is
f(x) = 1 for 0 < x < 1, and 0 otherwise.
We write X ∼ Uniform(0, 1).
The area under the curve is the probability
$$P(a < X < b) = \int_{a}^{b} f(x)\, dx = b - a, \quad \text{for } 0 < a < b < 1.$$
Figure: Uniform(0,1) density. The shaded area represents P (0.2 < X < 0.5).
Exercise 2.9 Uniform. Simulate 1000 realisations of the rv X ∼ Uniform(0, 1) using
runif. Draw the histogram.
Plot the pdf on the range (−.5, 1.5) using the function dunif to give 100 points and
find the probability P (0.2 < X < 0.5) using the function punif.
x = runif(1000)                   # r=rv unif=Uniform
hist(x, prob=T, breaks=20, col='yellow',xlim=c(-.5,1.5))
range = seq(-.5,1.5,length=100)   # plotting points
f = dunif(range)                  # d=pdf
plot(range, f, type='n')
lines(range, f)
punif(0.5) - punif(0.2)
y = (0.2<x) & (x<0.5)
sum(y)                            # the frequency of 1's
Sol:
Theoretically P (0.2 < X < 0.5) = 0.3. In the simulation reported here, the relative frequency of points in (0.2, 0.5) was
282/1000.
2.3 Expected values
Expectation
Definition: If X is a discrete rv with pmf p(x) on {0, 1, . . .}, then the expected value of
X is
$$\mu = E[X] = \sum_{x=0}^{\infty} x\, p(x).$$
If X is a continuous random variable with pdf f(x) on (−∞, ∞), then the expected
value of X is
$$\mu = E[X] = \int_{-\infty}^{\infty} x f(x)\, dx.$$
We can think of this as an average of the different values that X may take, weighted
according to their chance of occurrence.
Expectations of functions of rvs
Consider g(X) where g is a fixed function.
Definition: If X is a discrete rv with probability mass function p(x) on {0, 1, . . .}, then
the expected value of g(X) is
$$E[g(X)] = \sum_{x=0}^{\infty} g(x)\, p(x).$$
If X is a continuous rv with probability density function f(x) on (−∞, ∞), then the
expected value of g(X) is
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\, dx.$$
Exercise 2.10 Show that E[3] = 3.
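Sol:
Proof: We regard 3 as a constant function of X,
$$\begin{aligned}
E[3] &= \int_{-\infty}^{\infty} 3 f(x)\, dx && \text{def of } E \\
&= 3 \int_{-\infty}^{\infty} f(x)\, dx && \text{calculus} \\
&= 3[F(\infty) - F(-\infty)] && \text{result above} \\
&= 3[1 - 0] && \text{non-degenerate} \\
&= 3.
\end{aligned}$$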
Exercise 2.11 Let X have the pdf f(x) = exp(−x) for all x ≥ 0. The expectation
E[X] (with value = µ) is
$$\begin{aligned}
\mu = E[X] &= \int_0^\infty x \exp(-x)\, dx \\
&= [-x \exp(-x)]_0^\infty + \int_0^\infty \exp(-x)\, dx && \text{integration by parts} \\
&= 0 + [-\exp(-x)]_0^\infty \\
&= 0 - (-1) = 1.
\end{aligned}$$
Find $E[X^2]$ and $E[(X - \mu)^2]$.
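Sol:
$$\begin{aligned}
E[X^2] &= \int_0^\infty x^2 \exp(-x)\, dx \\
&= [-x^2 \exp(-x)]_0^\infty + \int_0^\infty 2x \exp(-x)\, dx \\
&= 0 + 2 \times 1 = 2, \\
E[(X - \mu)^2] &= \int_0^\infty (x - 1)^2 \exp(-x)\, dx \\
&= \int_0^\infty (x^2 - 2x + 1) \exp(-x)\, dx \\
&= 2 - 2 \times 1 + 1 = 1.
\end{aligned}$$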
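These three integrals can also be checked numerically in R with the integrate function. This is only a sketch to confirm the hand calculation; the names f, m1, m2 and v are chosen for this illustration.

```r
f <- function(x) exp(-x)     # the pdf, defined for x >= 0

# E[X], E[X^2] and E[(X - mu)^2] by numerical integration over (0, Inf)
m1 <- integrate(function(x) x * f(x), lower = 0, upper = Inf)$value
m2 <- integrate(function(x) x^2 * f(x), lower = 0, upper = Inf)$value
v  <- integrate(function(x) (x - m1)^2 * f(x), lower = 0, upper = Inf)$value

round(c(mean = m1, second_moment = m2, variance = v), 4)   # 1, 2 and 1
```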
Properties of expectation
Result: (Linearity of expectation). If X has expectation E[X] and Y is a linear function
of X as Y = aX + b then Y has expectation
E[Y ] = a E[X] + b .
Result: More generally,
$$E[g(X) + h(X)] = E[g(X)] + E[h(X)] \tag{2.1}$$
$$E[c\,g(X)] = c\,E[g(X)] \tag{2.2}$$
$$E[aX + b] = a\,E[X] + b \tag{2.3}$$
Note that we proved them in MATH 104 for discrete random variables.
Using linear properties of expectation, we may compute $E[(X - a)^2]$ by
$$\begin{aligned}
E[(X - a)^2] &= E[X^2 - 2aX + a^2] && \text{algebra} \\
&= E[X^2] - E[2aX] + E[a^2] && \text{by (2.1) twice} \\
&= E[X^2] - 2a E[X] + a^2 && \text{by (2.3).}
\end{aligned}$$
Variance and standard deviation
Definition: If X is a random variable with expected value µ = E[X], the variance of X
is
$$\sigma^2 = \mathrm{var}[X] = E[(X - \mu)^2] = \begin{cases} \sum_{x=0}^{\infty} (x - \mu)^2 p(x) & \text{for a discrete rv on } \{0, 1, \ldots\} \\ \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx & \text{for a continuous rv on } (-\infty, \infty). \end{cases}$$
Result: The variance of X can be calculated as
$$\sigma^2 = E[X^2] - \mu^2.$$
Proof: Use the above result.
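Written out, the step is to take a = µ in the expansion of E[(X − a)²] given earlier and then use E[X] = µ:

```latex
\begin{aligned}
\sigma^2 = E[(X - \mu)^2] &= E[X^2] - 2\mu E[X] + \mu^2 \\
&= E[X^2] - 2\mu^2 + \mu^2 \\
&= E[X^2] - \mu^2 .
\end{aligned}
```

For the pdf f(x) = exp(−x) used earlier, this gives 2 − 1² = 1, in agreement with the direct calculation of E[(X − µ)²].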
Definition: The standard deviation of X is
$$\sigma = \sqrt{\mathrm{var}[X]}.$$
The variance, or better the standard deviation, is a measure of the spread of a random
variable about its expectation.
Exercise 2.12 For f(x) = exp(−x) for all x ≥ 0, find the standard deviation of X.
Sol:
From above the variance is $\sigma^2 = E[(X - \mu)^2] = 1$. Consequently the standard deviation is $\sigma = \sqrt{1} = 1$.
Properties of the variance
Result: If var[X] exists and Y = a + bX, then $\mathrm{var}[Y] = b^2\,\mathrm{var}[X]$. Hence, the standard
deviation of Y is $\sigma_Y = |b|\sigma$.
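The same scaling rule holds for the sample variance and sample standard deviation computed by R's var and sd functions, which gives a quick way to see it in action. The simulated data, the seed and the constants a and b below are arbitrary choices for this sketch.

```r
set.seed(105)                    # arbitrary seed, for reproducibility
x <- rexp(1000, rate = 1)        # any sample would do
a <- 4; b <- -3
y <- a + b * x

all.equal(var(y), b^2 * var(x))      # TRUE: adding a has no effect, b is squared
all.equal(sd(y), abs(b) * sd(x))     # TRUE: the absolute value of b is needed
sd(y) == b * sd(x)                   # FALSE: a standard deviation cannot be negative
```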
Exercise 2.13 Why is the absolute value needed in the above expression?
Sol:
Use a counterexample: var(−3X) = 9 var(X); taking the square root gives $3\sqrt{\mathrm{var}(X)}$, which is
not the same as $-3\sqrt{\mathrm{var}(X)}$.
Figure: Means and standard deviations for discrete and continuous rvs. The panels show probability mass functions and densities plotted against x, annotated with µ = 0.83, σ = 0.83 and µ = 2.5, σ = 1.1.
2.4 Standard continuous distributions
We specify several standard distributions in terms of given pdfs: the uniform, the
exponential and the normal.
Uniform
This distribution is used to model variables that can take any value on a fixed interval,
when the probability of occurrence does not vary over the interval.
Definition: The pdf of a Uniform rv X, distributed on the interval (a, b) is given by:
f (x; a, b) =
1
b−a
0
if a < x < b;
otherwise,
where the parameters are (a, b) and −∞ < a < b < ∞. This is written as X ∼ Uniform(a, b).
We often write f (x) = 1/(b − a) for a < x < b, and suppress the fact that (i) there are
other arguments, f (x; a, b) and (ii) f (x) = 0 when x < a or x > b.
[Figure: pdf for Uniform(a, b) random variable. Shaded area represents P(a < X ≤ x0).]
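Result: The expected value and variance of X ∼ Uniform(a, b) are
$$E[X] = \frac{a + b}{2}, \qquad \mathrm{var}[X] = \frac{(b - a)^2}{12} .$$
Proof:
$$
\begin{aligned}
E[X] &= \int_{-\infty}^{\infty} x f(x) \, dx \\
&= \int_a^b \frac{x}{b - a} \, dx \\
&= \frac{1}{b - a} \left[ \frac{x^2}{2} \right]_a^b = \frac{b + a}{2} .
\end{aligned}
$$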
Similar calculations work for the variance.
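As a numerical companion to the proof, the mean and variance formulas can be compared with direct integration. This is a sketch only; the endpoints a = −2 and b = 2 are an assumed example.

```r
a <- -2; b <- 2                      # illustrative endpoints (assumed)
f <- function(x) dunif(x, min = a, max = b)

mu  <- integrate(function(x) x * f(x), lower = a, upper = b)$value
ex2 <- integrate(function(x) x^2 * f(x), lower = a, upper = b)$value

c(mean = mu, formula = (a + b) / 2)
c(variance = ex2 - mu^2, formula = (b - a)^2 / 12)
```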
Exercise 2.14 Evaluate the pdf and the cdf of a Uniform rv with parameters a = −2, b = 2,
at x = .5 and then plot on an interval. There is one ambiguity in the plot: identify.
dunif(0.5, min=-2, max=2)    # pdf of Unif(-2,2) at x=0.5, f(0.5)=0.25
punif(0.5, min=-2, max=2)    # cdf of Unif(-2,2), F(0.5)=0.625
xval = seq(-2.5, 2.5, length=101)
f = dunif(xval, -2, 2)       # pdf values on the grid
F = punif(xval, -2, 2)       # cdf values on the grid
plot(xval, F, type='n')
lines(xval, f, col='blue')
lines(xval, F, col='red')
Sol:
The vertical lines on the pdf should not be there.
Exponential
This distribution is often used to model variables that are the times until specific events
happen when the events occur at random at a given rate over time.
Definition: The pdf of an Exponential rv X is
$$
f(x; \theta) =
\begin{cases}
\theta \exp(-\theta x) & \text{for } x > 0, \\
0 & \text{otherwise,}
\end{cases}
$$
where the rate parameter θ > 0. This is written as X ∼ Exponential(θ) and θ ∈ (0, ∞).
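Result: The cdf of X ∼ Exponential(θ) is
$$F(x) = 1 - \exp(-\theta x) \quad \text{for } x > 0, \quad \text{and } 0 \text{ otherwise.}$$
Proof:
$$
\begin{aligned}
F(x) &= \int_{-\infty}^{x} f(u) \, du \\
&= \int_0^x f(u) \, du && \text{for } x > 0 \\
&= \int_0^x \theta \exp(-\theta u) \, du && \text{for } x > 0 \\
&= \left[ -\exp(-\theta u) \right]_0^x && \text{for } x > 0 \\
&= 1 - \exp(-\theta x) && \text{for } x > 0 .
\end{aligned}
$$
Result:
$$E[X] = \frac{1}{\theta}, \qquad \mathrm{var}(X) = \frac{1}{\theta^2} .$$
Proof: Seen above.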
The parameter θ is known as the rate parameter because if X is the time until the next event occurs, then θ = 1/E[X] is the rate of occurrence.
Exercise 2.15 The value of θ influences the probability of different outcomes. How
is the shape of the function related to the parameter θ? Which pdf in the figure has
lowest tail probability P (X > 10)?
xvals = seq(-.2,6,length=100)
f1 = dexp(xvals, rate=1)
f2 = dexp(xvals, rate=2)
f3 = dexp(xvals, rate=1/2)
plot(xvals,f2,type='n') ; grid()
lines(xvals,f1)
lines(xvals,f2,col='red')
lines(xvals,f3,col='blue') ; grid()
Sol:
As f(0) = θ, the highest curve at 0 is θ = 2 (the pdf exceeds 1) and the lowest curve at 0 is θ = 0.5. The exponential decay of the function is quicker for larger θ, so the smallest tail probability occurs when θ = 2.
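The tail probabilities themselves can be computed rather than read off the plot. A minimal sketch, using the three rates from the exercise code; `lower.tail = FALSE` asks `pexp` for P(X > x) directly.

```r
rates <- c(1, 2, 1/2)                                   # the three rates plotted above
tail10 <- pexp(10, rate = rates, lower.tail = FALSE)    # P(X > 10) = exp(-10 * rate)
names(tail10) <- paste0("rate=", rates)
print(tail10)
rates[which.min(tail10)]                                # rate with the smallest tail probability
```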
Exercise 2.16 Evaluate the pdf and the cdf of an Exponential distribution and plot;
give an eyeball estimate of P (X < 1).
xval = seq(-0.2, 4, length=100)
f = dexp(xval, rate=2)    # pdf
F = pexp(xval, rate=2)    # cdf
plot(xval, f, type='n', ylab='') ; grid()
lines(xval, f, col='red')
lines(xval, F, col='blue')
Sol:
The pdf starts at (0, 2).
A pdf is not a probability.
From the cdf, P(X < 1) is about 0.9:
pexp(1, rate=2)    # 0.86
Exercise 2.17 Suppose that the time until the first goal is scored can be modelled by an
Exponential distribution with rate parameter θ = 2/3 per hour. Write down the cdf. Find
the probability that the time until the goal occurs is (i) more than 30 minutes away, (ii)
between 30 and 50 minutes.
Sol:
Let X be the random variable of the waiting time, in hours. Then X ∼ Exponential(2/3) and F(x) = 1 − exp(−(2/3)x) for x > 0.
(i) $P(X > 1/2) = 1 - F(1/2) = \exp(-\tfrac{2}{3} \cdot \tfrac{1}{2}) = 0.7165$.
(ii)
$$
\begin{aligned}
P(1/2 < X < 5/6) &= F(5/6) - F(1/2) \\
&= \exp(-\tfrac{2}{3} \cdot \tfrac{1}{2}) - \exp(-\tfrac{2}{3} \cdot \tfrac{5}{6}) \\
&= 0.1428,
\end{aligned}
$$
assuming no half time.
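The two probabilities in Exercise 2.17 can be reproduced with `pexp`, converting minutes to hours first. A short sketch; the variable names are illustrative.

```r
theta <- 2/3                                   # rate per hour, as in Exercise 2.17

p_i  <- pexp(30/60, rate = theta, lower.tail = FALSE)            # P(X > 1/2)
p_ii <- pexp(50/60, rate = theta) - pexp(30/60, rate = theta)    # P(1/2 < X < 5/6)

round(c(p_i, p_ii), 4)
```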
Normal distribution: background
quoted from gqview weblib/Gauss.html
The normal distribution was introduced by the French mathematician Abraham
De Moivre in 1733. De Moivre used this distribution to approximate probabilities of winning in various games of chance involving coin tossing. It was
later used by the German mathematician Karl Gauss to predict the location
of astronomical bodies and became known as the Gaussian distribution. In
the late nineteenth century statisticians started to believe that most data
sets would have histograms with the Gaussian bell-shaped form and that all
normal data sets would follow this form and so the curve came to be known
as the normal curve.
This distribution is also known as the Gaussian distribution, after the German mathematician Carl Friedrich Gauss. The density was pictured on the German 10 mark
note bearing Gauss’s image!
Normal distribution
Definition: The pdf of a Normal random variable X is
$$f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right),$$
where −∞ < x < ∞, and the parameters −∞ < µ < ∞ and 0 < σ. This is written as
X ∼ N(µ, σ²) and θ ∈ Θ = (−∞, ∞) × (0, ∞).
Result:
$$E[X] = \mu, \qquad \mathrm{var}(X) = \sigma^2 .$$
Proof: Too hard for math105.
The Normal distribution plays an important role in
a result that is key to statistics, known as the central limit theorem. This theorem,
discussed in Math230 and Math313, gives a theoretical basis to the empirical observation
that many random phenomena seem to follow a Normal distribution. Usually, the mean
parameter µ and the scale parameter σ are unknown, although sometimes it is assumed
that σ is known as this simplifies things considerably. These parameters are crucial in
determining probabilities.
Exercise 2.18 Consider the figure.
[Figure: Pdfs for Normal(µ, σ²) random variables where µ = 0 and σ = 0.5, 1, 1.5.]
Which one has the highest probability P(|X| > 3)?
xvals = seq(-4,4,length=100)
f1 = dnorm(xvals, sd=1)
f2 = dnorm(xvals, sd=2)
f3 = dnorm(xvals, sd=1/2)
plot(xvals,f3,type='n') ; grid()
lines(xvals,f1)
lines(xvals,f2,col='red')
lines(xvals,f3,col='blue')
Sol:
The larger σ, the more spread. So σ = 1.5 has the largest probability P(|X| > 3)
and σ = 0.5 has the smallest.
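The ordering can be confirmed numerically for the three standard deviations shown in the figure. A minimal sketch, using the symmetry of the Normal pdf about µ = 0:

```r
sigmas <- c(0.5, 1, 1.5)               # the sd values shown in the figure
# By symmetry about 0, P(|X| > 3) = 2 * P(X < -3)
p_out <- 2 * pnorm(-3, mean = 0, sd = sigmas)
names(p_out) <- paste0("sigma=", sigmas)
print(p_out)
```

The probabilities increase with σ, as the solution states.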
Exercise 2.19 Complete the code to establish that the dnorm function gives the same
result as direct calculation of the pdf when X ∼ N(2, 4).
xvals = seq(-4,8, length=11)
pdf = dnorm(xvals,mean=2,sd=2)
f = 1/(sqrt(2*pi)*2)* ____
sum(f!=pdf)    # 0 bingo
Sol:
f = 1/(sqrt(2*pi)*2)*exp(-0.5*((xvals-2)/2)^2)
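Exact equality of floating point numbers, as tested by `f != pdf`, can be fragile in general. The sketch below repeats the comparison with `all.equal`, which allows a small numerical tolerance; this is offered as a cautious alternative, not as the intended answer to the exercise.

```r
xvals <- seq(-4, 8, length = 11)
pdf <- dnorm(xvals, mean = 2, sd = 2)
f <- 1 / (sqrt(2 * pi) * 2) * exp(-0.5 * ((xvals - 2) / 2)^2)

max(abs(f - pdf))            # largest absolute discrepancy
isTRUE(all.equal(f, pdf))    # TRUE when equal up to numerical tolerance
```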
Normal cdf and quantiles
The normal cdf is
$$F(x) = \int_{-\infty}^{x} f(u) \, du = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(u - \mu)^2}{2\sigma^2} \right\} du .$$
This does not have a closed form expression so numerical evaluation is required, if we
want to obtain probabilities of the form P (X ≤ x) or quantiles. Note that R functions
for the Normal use the standard deviation σ, not the variance σ 2 .
Exercise 2.20
Write down the numerical values of P (X ≤ x) corresponding to
pnorm(0,mean=2,sd=sqrt(5))   # X~N(2,5), P(      )=0.1855467
pnorm(0,mean=2,sd=sqrt(3))   # X~N(2, ), P(      )=0.1241065
1-pnorm(-2,mean=0, sd=2)     # X~N(0, ), P(      )=0.8413447
Exercise 2.21 A normal distribution is proposed to model the variation in height of
women with parameters µ = 160 and σ 2 = 25 measured in cm. Find the proportion of
tall women, defined as over 175cm tall, in terms of an integral.
Sol:
Let H be the random variable of a woman’s height; then H ∼ N(160, 25). So
$$P(H > 175) = \int_{175}^{\infty} \frac{1}{\sqrt{2\pi \cdot 25}} \exp\left\{ -\frac{(x - 160)^2}{2 \cdot 25} \right\} dx .$$
In the above example we have expressed the proportion in terms of an integral and
as the number of deviations from the mean. The integral is impossible to calculate
analytically so numerical evaluation is required to obtain probabilities or quantiles.
Standardization of the random variable
It is useful to express such probabilities in terms of a standardized random variable,
with µ = 0 and σ = 1.
Result: If X ∼ N(µ, σ²) then
$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1),$$
and conversely if Z ∼ N(0, 1), then
$$X = \mu + \sigma Z \sim N(\mu, \sigma^2) .$$
Proof: The formal proof will be given in math230 and here it is sufficient to note that
$$E[Z] = 0, \qquad \mathrm{var}[Z] = 1 .$$
Definition: A random variable Z is said to have a standard normal distribution with mean
0 and standard deviation 1 if its pdf is given by
$$f(z) = \frac{1}{\sqrt{2\pi}} \exp(-z^2 / 2),$$
where −∞ < z < ∞, and this is denoted by Z ∼ N(0, 1).
The cdf, the area under the curve, of the standard normal variable Z is given by
$$\Phi(z) = P(Z \le z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp(-x^2 / 2) \, dx .$$
Values of Φ(z) are obtained from a table of standard normal probabilities or from R:
for (z in seq(-3, 3, length=10)){    # z = -3.00, -2.33, ..., 3.00 (to 2 d.p.)
print( pnorm(z) )
}
z        Φ(z)
-3.00    0.0013
-2.33    0.0098
-1.67    0.0478
-1.00    0.1587
-0.33    0.3694
 0.33    0.6306
 1.00    0.8413
 1.67    0.9522
 2.33    0.9902
 3.00    0.9987
(The tabulated probabilities correspond to equally spaced z values, shown rounded to two decimal places.)
Exercise 2.22 Repeat the previous example to illustrate the standardization procedure:
$$
\begin{aligned}
P(H > 175) &= P\left( \frac{H - 160}{5} > \frac{175 - 160}{5} \right) && \text{cunning} \\
&= P(Z > 3) = 1 - P(Z \le 3) \\
&= 1 - \Phi(3) \\
&= 1 - 0.9987 = 0.0013 && \text{from pnorm(3).}
\end{aligned}
$$
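The height example can be checked in three equivalent ways: directly from the N(160, 25) cdf, via the standardized variable, and by numerical integration of the density. A sketch under the assumption that `integrate` copes with the infinite upper limit here; the variable names are illustrative.

```r
mu <- 160; sigma <- 5                      # H ~ N(160, 25), so sd = 5

p_direct <- pnorm(175, mean = mu, sd = sigma, lower.tail = FALSE)  # P(H > 175)
p_std    <- 1 - pnorm((175 - mu) / sigma)                          # 1 - Phi(3)
p_int    <- integrate(dnorm, lower = 175, upper = Inf,
                      mean = mu, sd = sigma)$value                 # the integral form

round(c(direct = p_direct, standardized = p_std, integral = p_int), 4)
```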
The figure illustrates coverage properties of a Normal distribution.
[Figure: Normal pdf with the horizontal axis marked at µ − 3σ, µ − 2σ, µ − σ, µ, µ + σ, µ + 2σ and µ + 3σ.]
P (µ − σ < X < µ + σ) = 0.683
P (µ − 2σ < X < µ + 2σ) = 0.954
P (µ − 3σ < X < µ + 3σ) = 0.997
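These coverage probabilities do not depend on µ or σ, so they can be computed from the standard normal cdf. A minimal sketch:

```r
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)   # P(mu - k*sigma < X < mu + k*sigma)
names(coverage) <- paste0("k=", k)
round(coverage, 3)
```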
2.5 Quantiles and the cdf
Often interest is in the values of a continuous random variable which are not exceeded
with a given probability, e.g. the income below which the lowest 10% of income tax payers fall, or the score that only the
top 5% of students exceed.
Quantiles
Let X be a random variable and p any value such that 0 ≤ p ≤ 1.
Definition: The pth quantile of the distribution of X is the value $x_p$ that satisfies
$$P(X \le x_p) = p, \quad \text{or equivalently} \quad x_p = F^{-1}(p),$$
where $F^{-1}$ is the inverse function of F.
When p = 0.5, the quantile $x_{0.5}$ is called the median. When the cdf F is continuous and strictly increasing the
inverse function is uniquely defined. [Life is more problematic with step functions.]
p = 0.6
xp = qnorm(p, mean=2, sd=1)    # 2.2533
xvals = seq(-2,5,length=100)
F = pnorm(xvals, mean=2, sd=1)
plot(xvals,F,type='n') ; grid()
lines(xvals,F)
abline(v=0,lty=3)
lines(c(0,xp),c(p,p),col='red')
lines(c(xp,xp),c(0,p),col='red')
[Figure: Cumulative distribution function F(x), with the probability p marked on the vertical axis and the quantile xp on the horizontal axis.]
Quartiles
The quartiles of a distribution are the quantiles, those values at which we can cut the
distribution into four equally probable slices: $(x_{0.25}, x_{0.5}, x_{0.75})$.
[Figure: Quartiles (x0.25, x0.5, x0.75) shown on cdf and pdf respectively.]
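To make the "four equally probable slices" concrete, the quartiles can be computed with `qnorm` and fed back into `pnorm`. A small sketch; N(2, 1) is an assumed example distribution.

```r
p <- c(0.25, 0.5, 0.75)
q <- qnorm(p, mean = 2, sd = 1)              # quartiles of N(2, 1), an illustrative choice
print(q)
pnorm(q, mean = 2, sd = 1)                   # recovers 0.25, 0.50, 0.75
diff(c(0, pnorm(q, mean = 2, sd = 1), 1))    # four slices, each of probability 0.25
```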
Exercise 2.23 Suppose X ∼ Uniform(a, b). Find the cdf, sketch its graph, and give a
formula for the p-th quantile xp .
a = -2 ; b = 4
xvals = seq(a-.5,b+.5,length=100)
F = punif(xvals,min=a,max=b)
plot(xvals, F, type='n') ; grid()
lines(xvals, F)
xmedian = qunif(0.5,min=a,max=b)
Sol:
$$F(x) = \int_a^x \frac{1}{b - a} \, du = \frac{x - a}{b - a} \quad \text{for } a \le x \le b .$$
So
$$x_p = F^{-1}(p) \ \ \text{(by def)} \ = a + p(b - a) .$$
Exercise 2.24 Find the mean and the median of X ∼ Uniform(a, b) and compare.
Sol:
$$E(X) = \int_a^b x \, \frac{1}{b - a} \, dx = \frac{a + b}{2}, \qquad x_{0.5} = a + 0.5(b - a) = (a + b)/2,$$
same as the mean.
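The quantile formula for the Uniform can be compared with R's `qunif`. A sketch using the endpoints from Exercise 2.23; the probabilities in `p` are arbitrary choices for the check.

```r
a <- -2; b <- 4                         # endpoints used in Exercise 2.23
p <- c(0.1, 0.25, 0.5, 0.9)             # arbitrary probabilities for the check

cbind(p, qunif = qunif(p, min = a, max = b), formula = a + p * (b - a))
c(median = qunif(0.5, min = a, max = b), mean = (a + b) / 2)
```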
Exercise 2.25 Suppose X ∼ Exponential(θ), derive the cdf from the pdf, and find the
median. Verify that the mean of X is 1/θ using this calculation, and compare to the
median.
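$$
\begin{aligned}
E(X) &= \int_0^\infty u f(u) \, du && \text{def of expectation} \\
&= \left[ -u \exp(-\theta u) \right]_0^\infty + \int_0^\infty \exp(-\theta u) \, du && \text{integ by parts} \\
&= 0 - 0 + \frac{1}{\theta} \left[ -\exp(-\theta u) \right]_0^\infty \\
&= \frac{1}{\theta} .
\end{aligned}
$$
Sol:
Evaluate cdf:
$$
\begin{aligned}
F(x) &= \int_0^x f(u) \, du && \text{property of } F \\
&= \int_0^x \theta \exp(-\theta u) \, du \\
&= \left[ -\exp(-\theta u) \right]_0^x \\
&= 1 - \exp(-\theta x) && \text{for } x > 0 .
\end{aligned}
$$
For x < 0, F(x) = 0.
Quantiles: solving $F(x_p) = p$ gives $x_p = \theta^{-1} \log (1 - p)^{-1}$.
For the median, p = 0.5 so the median is $x_{0.5} = \frac{1}{\theta} \log 2$.
Comparison: $\mu = 1/\theta > (1/\theta) \log 2 = x_{0.5}$.
The distribution is not symmetric so the mean and the median are not the same.
The median is smaller here because most of the probability is concentrated at small values, while the
long right tail pulls the mean upwards.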
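The median and mean of the Exponential can be compared numerically. A sketch only; θ = 2 is an assumed illustrative rate, and the mean is obtained by numerical integration rather than from the formula.

```r
theta <- 2                                   # illustrative rate (assumed)

med_formula <- log(2) / theta                # x_0.5 = (1/theta) log 2
med_R       <- qexp(0.5, rate = theta)       # median from R's quantile function
mean_num    <- integrate(function(u) u * dexp(u, rate = theta),
                         lower = 0, upper = Inf)$value

c(median_formula = med_formula, median_R = med_R,
  mean_numeric = mean_num, mean_formula = 1 / theta)
```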
Exercise 2.26 Sample 200 realisations of X ∼ N(2, 4) and plot a scaled histogram.
Overlay the theoretical pdf on this diagram. Overplot the empirical and theoretical
cdfs. Calculate the 0.25, 0.5, and 0.75 sample quantiles and compare to the theoretical
values. Make a brief record of these results in your notes. The empirical cdf and sample
quantiles are discussed in the next chapter.
par(mfrow=c(1,2))
x = rnorm(200,mean=2,sd=2)                 # note sd=2, not the variance 4
hist(x,prob=TRUE,breaks=20,col='yellow')   # bell shaped or what
# overlay the true pdf to make comparison:
a=-5 ; b=8                                 # trial and error
xvals = seq(a, b, length=101)
pdf = dnorm(xvals,mean=2,sd=2)
lines(xvals,pdf,col='red')                 # not bad
# now overlay the true cdf on the empirical cdf
plot(ecdf(x),pch='.')
F = pnorm(xvals,mean=2,sd=2)
lines(xvals,F,col='blue')                  # again good
quantile(x)                                # sample quantiles
qnorm(c(0.25,0.5,0.75),mean=2,sd=2)        # close
min(x)
Exercise 2.27
Complete the missing parts of the code.
runif(50, min=0,max=1)        # 50 obs Uniform(0, ____)
rnorm(20, ____=0,sd=5)        # 20 obs Normal(0, ____)
rexp(100, rate=0.5)           # 100 obs Exponential(0.5)
rpois(200, ____=3)            # 200 obs Poisson(3)
rbinom(35,size=6,prob=0.2)    # 35 obs Binomial(____,0.2)
rgeom(150,prob=1-0.2)         # 150 obs Geometric(0.2)
The reason for the 1 − 0.2 in the Geometric case is that unfortunately, in R the probability specified is the success probability, whereas the parameter θ in the pmf of a
Geometric random variable is the failure probability.
Transformations of rvs
In certain examples it is easy to obtain the cdf of a transformed rv Y = g(X), by a
change of variable.
Exercise 2.28
Show that if X ∼ Uniform(0, 1) and Y = − log (X) that Y ∼ Exp(1).
Sol:
Proof: We need to find and identify the cdf of Y. For y > 0,
$$
\begin{aligned}
P(Y < y) &= P(-\log(X) < y) && \text{the key to this e.g.} \\
&= P(\log(X) > -y) \\
&= P(X > \exp(-y)) && \text{monotonicity} \\
&= 1 - P(X \le \exp(-y)) \\
&= 1 - \exp(-y) && \text{as } X \sim \text{Uniform}(0, 1).
\end{aligned}
$$
But this is the cdf of Y ∼ Exp(1).
Exercise 2.29 Run this code to verify empirically that if X ∼ Uniform(0, 1) and Y = − log (X)
then Y ∼ Exp(1).
x = runif(10000)
y = -log(x)
hist(y, prob=TRUE, breaks=40, col='yellow')
yvals = seq(-.1,4,length=200)
f = dexp(yvals)
lines(yvals,f,col='red')
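Besides the histogram, the transformed sample can be compared with Exp(1) through its quantiles and mean. A sketch, with an arbitrary seed and an arbitrary choice of probabilities; agreement is only approximate because of sampling variation.

```r
set.seed(105)                      # arbitrary seed, for reproducibility only
x <- runif(10000)
y <- -log(x)                       # should behave like a sample from Exp(1)

p <- c(0.25, 0.5, 0.75, 0.9)
rbind(sample = quantile(y, p), theory = qexp(p, rate = 1))
c(sample_mean = mean(y), theory_mean = 1)
```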
2.6 Chapter summary
The Chapter starts with a review of probability and its axioms, and then reviews discrete random variables, the pmf, expectation and application to standard distributions,
all material included in math104.
The math105 course continues probability theory to cover the extension to continuous
rvs. Their properties are determined by the cumulative distribution function (cdf),
which in turn leads to the definition of the probability density function (pdf). Pmfs
and pdfs are compared and contrasted.
Expectation, and its notion of a weighted average, is generalised to cover the continuous
case and its properties are discussed. Important definitions for the mean, variance and
standard deviation are given in terms of expectation.
Standard continuous distributions, including the Uniform, Exponential and Normal
distributions, are described. Quantiles are those values of the rv that cover a given
probability, and are relatively easy to define for a continuous rv.
All these probabilistic concepts are illustrated throughout in the R language with
special emphasis on plotting and simulation.
Chapter 3
Statistics and exploratory data analysis
In our everyday lives, we are surrounded by uncertainty due to random variation.
We often make decisions based on incomplete information.
Mostly, we can cope with this level of uncertainty, but in situations where the decision
is of particular importance, it can be informative to understand this uncertainty in
greater detail, to aid the decision making.
Statistics is unique in that it allows us to make formal statements quantifying uncertainty, and this provides a framework for decision making when faced with uncertainty.
3.1 Uncertainty
Sterling’s slide has continued, with the pound falling close to $1.37...The pound
also weakened against the euro, with the single currency now worth 94 pence.
If I am planning a trip abroad in the summer, is it better to change currency
now rather than later?
Is there evidence of global warming or is it simply random fluctuation?
Would the answer affect your way of living?
Decision making
We follow many different routes, rational or irrational, to find an answer and to cope
with such situations.
Often it is useful to obtain some evidence in order to decide what the answer should
be.
What sort of evidence would be useful in answering such questions?
For the UK economy, we may look at exchange rates over the past few months to figure
out a trend, if any, we may want to include other factors that may explain the trend,
or study similar periods in the past. To determine such factors or variables we may
want to speak to economists.
For global warming, we may want to study the pattern in temperature over the past
years in England, Europe or around the world. There may be other variables of interest,
for example, an increasing number of floods or storms. Discussion with a climatologist or
hydrologist would be helpful in deciding which variables should be considered.
What is data?
In statistical studies data refers to the information that is collected from experiments,
surveys or observational studies.
For example, by themselves 4, 3.5, 3.2 are not data but only a sequence of numbers. However,
if we know these numbers are measurements of new-born babies’ weights, then these
numbers become data.
Numbers require metadata to become data.
Probability and statistics
In Probability, we consider an experiment before it is performed. The measurements
to be observed are modelled as random variables. We may deduce the probability of
various outcomes of the experiment in terms of certain basic parameters.
In Statistics, we have to infer things about the values of the parameters from the
observed outcomes, the realisations, of an experiment after it has been performed.
Is Friday 13th bad for your health?
Consider the following claim:
I’ve heard that Friday 13th is unlucky, am I more likely to be involved in a
car accident if I go out on Friday 13th than any other day?
What kind of evidence would be helpful? Perhaps hospital admissions.
Suppose that data are available on emergency admissions to hospitals in the Southwest
Thames region due to transport accidents, on six Friday 13ths, and corresponding
emergency admissions due to transport accidents for the Friday 6th immediately before
each Friday 13th:
Number    Accidents on 6th    Accidents on 13th
1         9                   13
2         6                   12
3         11                  14
4         11                  10
5         3                   4
6         5                   12
Does the data support the claim?
Compare the number of accidents by finding the average (the unweighted mean) number
of accidents on both days:
Average number of accidents = Total number of accidents / Total number of days, so
that
$$\bar{x}_{6th} = \frac{9 + 6 + 11 + 11 + 3 + 5}{6} = 7.5$$
and
$$\bar{x}_{13th} = \frac{13 + 12 + 14 + 10 + 4 + 12}{6} = 10.83 .$$
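The two averages can be reproduced in R from the table, and looking at the differences pair by pair gives a first feel for how consistent the pattern is. A minimal sketch; the vector names are illustrative.

```r
sixth      <- c(9, 6, 11, 11, 3, 5)        # accidents on Friday 6th
thirteenth <- c(13, 12, 14, 10, 4, 12)     # accidents on the following Friday 13th

c(mean_6th = mean(sixth), mean_13th = mean(thirteenth))
thirteenth - sixth                          # paired differences, pair by pair
sum(thirteenth > sixth)                     # pairs with more accidents on the 13th
```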
Exercise 3.1
Referring to the Friday 13th example,
• Why compare instead of focusing only on accidents on Friday 13ths? Need a baseline.
• Why have we chosen Friday 6th as the comparison day? Compare like with like.
• There are more accidents on Friday 13th than on Friday 6th, therefore I am more
likely to be involved in a car accident if I go out on Friday 13th. Tentatively: yes.
What is this course about
• To illustrate scientific contexts where statistical issues may arise;
• to demonstrate where statistics can be useful, by showing the sort of questions
it can answer, and the situations in which it is used;
• to understand sampling variation and quantify uncertainty;
• introduce various exploratory tools and summary statistics for data analysis;
• introduce specific techniques from statistical modelling and inference; and
• apply all this to real data. Wow, and this as well!!
Sources of variation
Exercise 3.2 Toss a coin 10 times. How many heads are expected? Record the
outcomes:
H, H, H, T, T, H, H, H, H, T
• Are you surprised that you didn’t have exactly 5, the half of the number of trials?
Has the result changed your opinion about the coin?
• Are you surprised that your neighbors didn’t have exactly the same number of
heads as you did?
• Repeat experiment another two times, on two further coins and record the number
of heads. Did you get the same number of heads each time?
• What would happen if you toss 20 times?
You have witnessed sampling variation.
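The coin experiment can be imitated on the computer. The sketch below assumes a fair coin and a class of 8 students (both assumptions made for illustration), and also computes how likely exactly 5 heads, or the 7 heads recorded above, would be.

```r
set.seed(2010)                                  # arbitrary seed
heads <- rbinom(8, size = 10, prob = 0.5)       # heads in 10 tosses, for 8 students
print(heads)

dbinom(5, size = 10, prob = 0.5)                # P(exactly 5 heads)
dbinom(7, size = 10, prob = 0.5)                # P(exactly 7 heads), as recorded above
```

Even with a fair coin, exactly 5 heads occurs in only about a quarter of such experiments.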
Exercise 3.3 Think back to the Friday 13th example. Is the higher chance of being
in a car accident on Friday 13th due to sampling variation?
Sol:
Possibly: but nearly all Friday 13ths had elevated accidents.
The variation within Friday 13ths is not as great as between Friday 6ths and Friday
13ths.
Ultimate test: collect new data on Friday 13th dates.
Later we introduce a statistical framework to evaluate how much evidence there is for
a true difference.
Population and sample
In the Friday 13th example, our interest is not limited to those available dates. Ideally
we consider all the possible accidents occurring on all Friday 13ths. We call the
complete group of units, or people, under study the population.
• Population: the set of all individuals or units of interest, exactly defined.
• Sample: a subset of the population, chosen to be representative of the population.
Statistical inference is learning about the population through the behaviour of a sample.
Where is statistics used?
Statistics is used in a surprisingly diverse range of areas. Here is a small selection of
the fields to which statistics contributes.
Environmental monitoring: for the setting of regulatory standards and in deciding whether
these are being met;
Engineering: to gauge the quality of products used in manufacturing and building;
Agriculture: to understand field trials of new varieties and choose the crops that will
grow best in particular conditions;
Economics: to describe unemployment and inflation, which are used by the government
and by business to decide economic policies and form financial strategies;
Finance: risk management, and prediction of the future behaviour of the markets;
Pharmaceutical industry: to judge the clinical effectiveness and safety of new drugs before they can be licensed;
Insurance: in setting premium sizes, to reflect the underlying risk of the events that are
being insured against;
Medicine: to assess the reliability of clinical trials reported in journals, and choose the
most effective treatment for patients;
Ecology: to monitor population sizes and to model interactions between different species;
Business: market research is used to plan sales strategies.
The Sally Clark Case
Statistics has played a key role in many topical news issues, including the controversial
court case of Sally Clark. The case is a famous example of the misuse (or misunderstanding) of statistics contributing to a miscarriage of justice. The Royal Statistical
Society were so concerned that they issued a press release, highlighting the statistical
mistakes made.
Sally Clark was a mother convicted of murder, when two of her babies died
of ‘Cot Death’ - the name given to the unexplained death of a young infant
(SIDS).
The paediatrician Sir Roy Meadow, acting as an expert witness for the
prosecution in the case, famously claimed that the odds of two unexplained
deaths in the same family was 1 in 73 million.
Where does this figure come from?
Exercise 3.4 The odds of a single unexplained death in an affluent, non-smoking
family are estimated as 1 in 8500. The figure 73 million comes from multiplying these
odds by themselves: 8500 × 8500 ≈ 73 million. Is this a reasonable calculation?
Sol:
It is only appropriate to multiply these odds together if the second death is independent of the first.
This is not reasonable since siblings share genetic and environmental factors.
A second problem
A second problem is known as the ‘prosecutor’s fallacy’, which goes as follows:
The chance of two unexplained deaths in the same family occurring by
chance is 1 in 73 million. Therefore, the chance of Sally Clark being innocent
is 1 in 73 million also.
What is wrong with this argument? The following analogy will help.
Exercise 3.5 The idea behind the British National Lottery is that 49 balls are
placed in a machine, and 6 of them are drawn. Before the draw takes place, a punter
pays 1 pound to place a guess on which six balls will be drawn. There is a prize of one
million pounds available, to a correct guess, but the chance of getting it right is 1 : 14
million. You decide to play, and, amazingly, all six of your numbers come up! You
travel to the headquarters of the national lottery to claim your winnings, but instead
...
Sol:
. . . you are arrested – accused of cheating! and the prosecuting lawyer argues “The
chance of getting all six balls correct by chance is 1 : 14 million. Therefore, the chance
of the defendant being innocent is 1 : 14 million also”.
Exercise 3.6 Formulate the Bayes calculation of the probability of innocence. Here is
the code.
pb.a = 1/(14*10^6)      # P(B|A)
pb.acomp = 0.99         # P(B|A^c)
pa = 1-1/(10^6)         # P(A)
pa.b = pb.a*pa/( pb.a*pa + pb.acomp*(1-pa) )    # 0.0673
Sol:
A = “innocence”, B = “six balls correct”. Want P(A|B).
$$
\begin{aligned}
P(A|B) &= P(B|A) P(A) / P(B) && \text{by Bayes,} \\
&= \frac{P(B|A) P(A)}{P(B|A) P(A) + P(B|A^c) P(A^c)} && \text{by TPT.}
\end{aligned}
$$
For calculations guesstimate:
P(B|A) = 1/(14 × 10^6)
P(B|A^c) = 0.99
P(A) = 1 − 1/10^6, prior prob of innocence.
Posterior prob P(A|B) ≈ 0.06729469, i.e. about 1 in 15.
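The posterior probability depends strongly on the guessed prior. The sketch below wraps the same Bayes calculation in a function and varies the prior probability of cheating; `posterior_innocence` is a hypothetical helper name and the grid of priors is an assumption made purely for illustration.

```r
# Posterior probability of innocence P(A|B) by Bayes' theorem.
# 'posterior_innocence' is a hypothetical helper name, not from the notes.
posterior_innocence <- function(pa, pb.a = 1/(14*10^6), pb.acomp = 0.99) {
  pb.a * pa / (pb.a * pa + pb.acomp * (1 - pa))
}

prior_guilt <- 10^(-(4:8))                      # assumed prior probabilities of cheating
post <- posterior_innocence(1 - prior_guilt)
data.frame(prior_guilt = prior_guilt, posterior_innocence = post)
```

The rarer cheating is assumed to be beforehand, the more plausible innocence becomes after a win.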
Data
In experiments and surveys certain specific attributes are measured on the units. These
are called variables. For example, in the Friday 13th data, the unit is a Friday 13, and
the variable we measure is the number of accidents.
The variable is a random variable if it is determined at random or by some random process.
To apply probability theory we convert the measurements to numerical scales.
Types of data
Most random variables fall into the following two categories, depending on the characteristic and how it is measured:
Discrete: Variables taking values in countable sets:
e.g. gender, eye color, college membership, exam grades (A, B, C, D, E), number
of goals in a match, children in a family, . . . .
Continuous: Variables taking values on some interval of the real line:
e.g. height, weight, direction, time. . . .
Sample survey data
We see that some data are useful in carrying out our investigation. But how do we
choose data? What are the important considerations? Is there any limit to the amount
of evidence that can be obtained from some given data? Think back to the data on
Friday 13th – could we use it to decide whether car accidents were especially common
on Fridays?
So if the evidence available is limited by the data we have, it makes sense that we
should think very carefully about how we collect the data.
If you are not collecting the data yourself, it is always important to understand how
the data is collected, so that you are aware of any limitations that this may place on your
analysis.
To illustrate the idea, we begin with an extreme example.
Exercise 3.7 Student study: There is interest in estimating how many hours students
spend studying every week. So you design a survey and find participants.
Thinking to yourself where a good place would be to find students to fill
in your survey, you have a brilliant idea. . . the Library! You sit outside
and stop students as they leave to fill in your questionnaire. After some
time you have enough results for analysis. You find that students spend,
on average, 30 hours a week studying.
What is wrong with the way in which the study has been carried out?
• What is the population of interest for the survey? All UG students at UoL 2010.
• What property should the sample have? Be representative.
If you had stopped students outside the University Bar instead of the library, would
you have got similar results? No.
Can you think of a better way to collect data for your survey? Yes.
For a sample to be representative of the population requires a rigorous definition of the
population. Other populations for this survey could be
full time students, maths students, female students,. . . ,students in 1964,. . .
For what population is sampling by stopping people outside the library appropriate?
Library users!
A representative sample reflects the characteristics and nature of the population. If the
sample is not representative, we usually introduce a systematic error called bias into
the calculation.
Exercise 3.8 Beach comber: One measure of the pollution of British beaches is the
volume of residual plastic found on the beach. A survey is proposed to estimate this.
Write down the issues that need to be addressed.
Sol:
Issues:
How large is a large sample
The term n usually denotes the number of units or subjects in the sample. There are
practical as well as statistical considerations to choosing the size of the sample. On
the practical side, financial constraints may mean a sample has to be smaller than
n = 1000. Some statistical considerations will be discussed later.
Random sample
The widely accepted method to obtain a representative sample of the population is
by selecting a random sample. Statisticians like these.
A simple random sample of size n from a population is one in which each possible sample
of that size has the same chance of being selected.
One method to ensure random sampling is to write the name of every member of the
population on a slip of paper, place these slips into a hat, then draw out the required
amount for the sample.
A more practical method has been developed using the computer, called a random
number generator. For an example of a pre-election poll, we may need n = 1000 random
numbers between 1 and 40 million, for a sample size of n = 1000 out of the 40 million
eligible voters in the UK. If we have all the voters written in a list, we can pick out the
selected subjects for our sample.
sample(1:10,4)    # 3 7 6 2
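For the pre-election poll described above, the same idea scales up to a list of 40 million voters. The sketch below uses `sample.int`, which draws positions on the list without replacement by default; the seed is arbitrary.

```r
set.seed(40)                               # arbitrary seed, for reproducibility only
N <- 40e6                                  # eligible voters, as in the text
n <- 1000                                  # sample size

ids <- sample.int(N, size = n)             # positions on the list, without replacement
head(sort(ids))
anyDuplicated(ids)                         # 0: no voter is selected twice
```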
Other kinds of sampling
It is not always feasible to carry out sampling in a truly random fashion. It can be
very expensive to contact 1000 randomly chosen people in a pre-election poll: they may be
geographically dispersed and difficult to reach, leading to long delays. We may have to resort
to a sampling method that is not random for practical reasons. Provided we are careful,
we can minimize the bias that is caused.
Exercise 3.9 Suppose we go to the city centre, stop passers-by in the street and ask who
they are going to vote for in the next election. This is sometimes known as convenience
sampling. An improved version is known as quota sampling. What kinds of bias may
be introduced? Shoppers are not representative of voters.
Does increasing the size of a sample decrease the bias?
Exercise 3.10 For the student study hours example, one survey collects 1000 responses,
with convenience sampling, with interviews made outside the library, stopping random
students.
A second survey collects only 50 responses, with random sampling from a list of the
entire student population of the University.
Which study should we believe more? It depends on the population of interest.
It is almost always better to have a small, representative sample, than a large biased
sample.
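A small simulation can illustrate why sample size does not cure bias. Everything in this sketch is hypothetical: a made-up population of 10000 students in which 20% are regular library users who study about 30 hours a week, while the rest study about 15 hours.

```r
set.seed(310)                                    # arbitrary seed
# Hypothetical population of weekly study hours: 20% are regular library users
N <- 10000
library_user <- rbinom(N, size = 1, prob = 0.2) == 1
hours <- ifelse(library_user, rnorm(N, mean = 30, sd = 5), rnorm(N, mean = 15, sd = 5))

big_biased   <- sample(hours[library_user], 1000)   # convenience sample outside the library
small_random <- sample(hours, 50)                   # simple random sample of all students

c(population = mean(hours), big_biased = mean(big_biased), small_random = mean(small_random))
```

Under these assumptions the sample of 50 lands much closer to the population mean than the biased sample of 1000.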
From here on we assume that the sample is random and study properties of simple
random samples. This greatly simplifies our mathematical treatment of the problem
and provides insights into important statistical ideas used.
3.2 Exploratory data analysis
We introduced some examples of discrete and continuous random variables and studied
their properties. If we know the exact analytical form of the underlying distribution
of interest (i.e. the population), there is no need to collect data or carry out statistical
analysis. In reality this is rarely the case, especially at the beginning of an investigation,
and even if there is a conjectured model for the data, we always need to check that it is
consistent with the data.
Data and variability
Data is measured information and is fixed. But in representing the population it also
carries uncertainty. This may be due to inherent random variability in the characteristic of interest, e.g. a coin throw; to measurement variability from one day to another,
e.g. weight; or to sampling variability, e.g. one individual is selected into the sample,
another is not.
In mathematical terms, in all three of these cases, the characteristic being measured is
represented by a random variable: e.g. X = today’s weight of an individual, e.g. X =
number of plastic bottles on the beach selected, e.g. X records 1 if the throw is a head.
Random variables and realisations
There is an important difference between a random variable and its realisation (observation). A random variable is always written in upper case and is a function with an
associated probability distribution (pmf/pdf); e.g. X = Ozone level.
An observation on a random variable is written in lower case and is just a number; e.g.
x = observed value of Ozone.
A data set of size n may be considered in two ways:
X1 , . . . , Xn    random variables,
x1 , . . . , xn    given realisations.
The first is needed for probability and statistical modelling. The second is needed for
exploratory data analysis.
Data analysis
The first stage in any analysis is to get to know the problem and the data. The first
stages of data analysis usually involve a variety of graphical procedures to visualise the
data, and the calculation of a few simple summary numbers, or summary statistics
that capture key features of the data.
The variability in the data is a reflection of and an approximation to the true underlying
distribution and its features. We need to care how good the approximation is.
Role of exploratory data analysis
There are three essential roles:
Finding errors and anomalies: missing data, outliers, changes of scale,. . . .
However carefully data have been collected, it is always possible that they contain
errors. Early detection of these errors can save time and confusion later on. These may
be due to recording or transcription error or broken equipment among other causes.
Suggesting subsequent analyses: plots of data and summary statistics give information
on location, scale and shape of the distribution and relationships between variables.
This builds up a feeling for the structure of the data, which gives insight into subsequent
statistical modelling.
Augmenting understanding of applied problem: exploratory tools sharpen the scientific
questions addressed. Context and scientific rationale for analysis is paramount.
3.3
Examples with associated data sets
Each of these real life problems has an associated data set which we explore, to show
the whole process involved in detailed statistical analysis from conception through to
conclusion.
Marine science           Excess waves
Ecological               Diseased trees
Atmospheric Chemistry    Ozone and air pollution
Health                   Comparing hospitals
Offshore waves at Newlyn
[Figure: map locating Newlyn; axes: Eastings and Northings.]
Coastal engineers at the port of Newlyn, in the south west of England, require detailed
understanding of oceanographic processes in order to estimate overtopping rates of
the sea wall protecting the town. They can then assess whether the existing sea wall is
adequate, or whether further protection should be built. Offshore waves are induced
by meteorological conditions, and though complex, they can be summarised by their
height and their period.
Here we will concentrate on the excess heights of these waves over a threshold.
The specific problem for the engineers is:
Given a small probability of exceedance, what is the wave height that is exceeded with
that probability?
How accurate is this estimate? statistics.
Diseased trees
In an ecological study of diseased trees, trees along transects through a plantation were
examined and assessed as diseased or healthy. Data collection goes as follows. First a
diseased tree is found. Then the number of neighbouring trees in an unbroken run of
diseased trees along the transect is recorded. Ecologists are interested in the following:
How does the disease spread between trees, and what is the probability that trees are
infected by the disease?
The observations made on a total of 109 runs of diseased trees are recorded in the Table
below. We use this data set to show the benefits of collecting more data. To do
this we have broken down the data in the Table into data collected from the first 50
observations and from the whole data set; we refer to these as the partial and full data
sets respectively.
Run length                                  0    1    2    3    4    5
Number of runs in first 50 observations    31   16    2    0    1    0
Number of runs in all 109 observations     71   28    5    2    2    1
Urban and rural ozone
In the UK the Department for Environment, Food & Rural Affairs operates a national
air quality monitoring system, with a network of sites at which air quality measurements are taken automatically. These measurements are used to summarise current
air pollution levels, for forecasting of future levels and to provide data for scientific
research into the atmospheric processes behind the pollution. We look at ground-level
ozone (O3 ).
Ozone: the background
This pollutant is not emitted directly into the atmosphere, but is produced by chemical
reactions between nitrogen dioxide (NO2 ), hydrocarbons and sunlight. When present at high
levels, ozone can irritate the eyes and air passages causing breathing difficulties and may
increase susceptibility to infection. Ozone is toxic to some crops, vegetation and trees and is
a highly reactive chemical, capable of attacking surfaces, fabrics and rubber materials.
Whereas nitrogen dioxide participates in the formation of ozone, nitrogen oxide (NO) destroys
ozone to form oxygen and nitrogen dioxide. For this reason, ozone levels are not as high in
urban areas (where high levels of NO are emitted from vehicles) as in rural areas. As the
nitrogen oxides and hydrocarbons are transported out of urban areas, the ozone-destroying
NO is oxidised to NO2 , which participates in ozone formation.
As sunlight provides the energy to initiate ozone formation, high levels of ozone are generally
observed during hot, still, sunny, summertime weather in locations where the airmass has
previously collected emissions of hydrocarbons and nitrogen oxides (e.g. urban areas with
traffic). The resulting ozone pollution or summertime smog may persist for several days and
be transported over long distances.
Ozone: the data
We focus on data from two monitoring sites: an urban site in Leeds city centre and
a rural site at Ladybower Reservoir, just west of Sheffield.
The data at each site are daily measurements of the maximum hourly mean concentration
of O3 and NO2 , recorded in parts per billion (ppb), from 1994 – 1998 inclusive. To focus
on the question of whether there is any effect of season on ozone levels, we compare
data from winter (November – February inclusive) and early summer (April – July
inclusive).
We address the following questions:
How, if at all, does the distribution of ozone measurements vary between the urban
and rural sites?
How, if at all, is the distribution of ozone measurements affected by season?
How, if at all, does the presence of other pollutants affect the levels of measured ozone?
The purpose of the statistical analysis is to provide an objective analysis of the data,
by extracting the information in the data relevant to each of the scientific questions.
Comparing hospitals
League tables for many public institutions such as schools, hospitals and even universities try to compare the relative performances of the institutions. This very small
example uses the outcomes of a difficult operation at two hospitals. Ten patients at each
hospital underwent the operation. The patients were selected to make sure that they
had similar severity of illness and other characteristics which are believed to influence
the outcome of the operation. There is no connection between the two hospitals.
Each operation was classified as successful or unsuccessful. The first hospital had nine
out of ten successful operations and the second hospital had five out of ten successful.
What can we conclude about the relative performances of the two hospitals?
R code for the data
The data sets are saved from R in the file m105.Rdata.
load("./m105.Rdata")    # linux directory
ls()
# "barley" "ozone.summer" "ozone.winter" "waveExcesses" "waves"
# Ozone
names(ozone.summer)
# "Leeds.O3" "Leeds.NO2" "Ladybower.O3" "Ladybower.NO2"
attach(ozone.summer)
hist(Leeds.O3)
Population and sample: examples
In the Ozone problem, there is data from a number of days during 1994-1998. However,
interest is not solely in the levels of ozone on the days on which measurements were
taken. The objective of a statistical analysis is to learn about the relationships between
variables, and extrapolate perhaps to future dates.
Exercise 3.11 For each of the problem data sets state the populations that we are
trying to learn about:
Newlyn waves: All waves encountered offshore at Newlyn.
Ozone: Levels of ozone at the two locations given the time of year.
Diseased trees: All trees in similar forests.
Hospitals: Other operations at the two hospitals.
Exercise 3.12 For the diseased tree data set, define the variable of interest as X and its
possible range of values. X= length of unbroken run of diseased trees. Discrete: X ∈ {0, 1, . .
Exercise 3.13 For the hospital data set, define the variables of interest and possible
range of values: X is the number of successful operations in the first hospital,
Y the number in the second. Discrete: X, Y ∈ {0, 1, . . . , 10}.
3.4
Graphical methods
Graphical methods are needed for visualising multivariate and univariate data. If
the data is high dimensional, then it can be difficult to visualise since plots are two
dimensional! Finding ways of overcoming this is an active area of computer science.
Here the focus is on methods for examining the distribution of a single variable and
relationships between pairs of variables.
Historical note – Florence Nightingale
Good graphical display is the important first step in any data analysis. Choosing how to do it is part science, part art, and sometimes part politics! Florence
Nightingale was the first female Fellow of the Royal Statistical Society. She pioneered the use
of statistics as an organised way of learning, leading to improvements in medical and surgical
practices. She developed the polar-area diagram, to dramatise the needless deaths caused by
unsanitary conditions. Florence Nightingale revolutionised the idea that social phenomena
could be objectively measured and subjected to mathematical analysis, innovating in the
collection, interpretation, and graphical display of descriptive statistics.
Histograms
The standard histogram of observations on a variable displays the frequency, the
number of observations, in each bin, where the bins divide up the range of the variable,
and are usually of equal width.
A technical definition is hard to write down, and requires a definition of the empirical
cdf.
A histogram displays the variability and the distribution of the variable. It may suggest
one pdf rather than another as a possible statistical model for the variable. In a sense
the histogram is an empirical pdf.
Exercise 3.14 Diseased trees. Plot histograms for the partial and full data sets and
summarise the shape of the distributions displayed.
partial = c(31, 16, 2, 0, 1, 0)
full    = c(71, 28, 5, 2, 2, 1)
# barplot(full) # is another way
# unbundle the data: one entry per observed run
Partial = rep(0:5,partial)
Full    = rep(0:5,full)
par(mfrow=c(1,2))
hist(Partial, xlab="Run length",ylab="Count",main="Partial",
     ylim=c(0,75),breaks=seq(-0.5,5.5,by=1), col='red')
hist(Full, xlab="Run length",ylab="Count",main="Full",
     ylim=c(0,75),breaks=seq(-0.5,5.5,by=1), col='blue')
Sol:
Both indicate a geometric decay in the distribution of run lengths.
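The claim of geometric decay can be checked roughly in R. This is only a sketch: it assumes a geometric distribution on {0, 1, 2, . . .} and picks its parameter by matching the sample mean, which is one simple choice made for illustration and not a method set out in these notes. The name p_hat is introduced here.

```r
# Counts of run lengths 0..5 in the full data set (from the table above)
full <- c(71, 28, 5, 2, 2, 1)
k <- 0:5

rel_freq <- full / sum(full)            # observed relative frequencies
xbar <- sum(k * full) / sum(full)       # mean run length

# Assumption: a geometric pmf p(1-p)^k has mean (1-p)/p, so match it to xbar
p_hat <- 1 / (1 + xbar)

# Observed relative frequencies next to the matched geometric pmf
print(round(rbind(rel_freq, geometric = dgeom(k, prob = p_hat)), 3))
```

If the two rows are close, the geometric shape is a plausible description of the run lengths.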
Scaled histogram
The histogram estimates the underlying pmf of a discrete variable or the pdf of a
continuous variable. Recall that all pmfs sum to 1, and that all pdfs integrate to 1. It
thus makes sense to plot histograms with relative frequency rather than raw frequency
and so respect this summation.
Exercise 3.15 Diseased trees.
?hist    # see the freq and probability arguments
hist(Partial,prob=TRUE,
     xlab="Run length",ylab="Rel freq",main="Partial",
     breaks=seq(-0.5,5.5,by=1), col='red')
hist(Full,prob=TRUE,
     xlab="Run length",ylab="Rel freq",main="Full",
     breaks=seq(-0.5,5.5,by=1), col='blue')
This histogram has area 1. The shape of the histogram does not change. The vertical
axis now represents the relative frequency rather than raw frequency.
The benefit of rescaling is to better compare distributions.
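The area-1 property can be verified numerically. The sketch below rebuilds the full diseased-trees data from the table counts and uses hist() with plot = FALSE to obtain the bar heights without drawing; the names h, area and rel_freq are introduced here for illustration.

```r
# Counts of run lengths 0..5 in the full data set
full <- c(71, 28, 5, 2, 2, 1)
Full <- rep(0:5, full)                  # one entry per observed run

# Compute the histogram without drawing it
h <- hist(Full, breaks = seq(-0.5, 5.5, by = 1), plot = FALSE)

# Area = sum over bins of (bar height) x (bin width)
area <- sum(h$density * diff(h$breaks))
print(area)

# With unit-width bins the bar heights are the relative frequencies
rel_freq <- full / sum(full)
print(round(rbind(density = h$density, rel_freq), 3))
```

Because every bin here has width 1, the heights of the scaled histogram coincide with the relative frequencies; with other bin widths only the total area, not the individual heights, would match.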
Exercise 3.16 Ozone: Comparing histograms. The histograms of the summer ozone
data for both sites are given in this code. The scales must be the same for a fair comparison;
otherwise the conclusions can differ, e.g. about peakedness.
load("./m105.Rdata")
attach(ozone.summer) ; names(ozone.summer)
par(mfrow=c(1,2))
hist(Leeds.O3); hist(Ladybower.O3)
hist(Leeds.O3,prob=T); hist(Ladybower.O3,prob=T)
hist(Leeds.O3,prob=T,ylim=c(0,.05));
hist(Ladybower.O3,prob=T,ylim=c(0,.05))
hist(Leeds.O3,prob=T,ylim=c(0,.05),breaks=20);
hist(Ladybower.O3,prob=T,ylim=c(0,.05),breaks=20)
hist(Leeds.O3,prob=T,ylim=c(0,.06),breaks=20);
hist(Ladybower.O3,prob=T,ylim=c(0,.06),breaks=20)
These are clearly different, but the spread and shape of these histograms is sufficiently
close to make it difficult to identify any obvious difference by eye.
To really look at differences: consider differencing.
Exercise 3.17 Ozone: the differences. We have observations on the ozone level at
each site, (xi , yi) for every day i = 1, . . . , n. Looking directly at the daily differences,
di = xi − yi , in ozone removes variability common to the two locations (e.g. atmospheric
conditions).
par(mfrow=c(1,1))
length(Leeds.O3)    # 469
d = Leeds.O3 - Ladybower.O3
hist(d,freq=FALSE,col='yellow', # freq=FALSE is another way to ask for prob=TRUE
     xlab="difference",ylab="Rel freq",main="O3 differences")
grid()
Exercise 3.18 Conclusions drawn: The variability of these differenced data is less
than the variability of the measurements made at the separate sites. So common
factors that affect both sites, and influence ozone values, are removed from the
differenced data. Differencing is only possible if measurements are collected on the
same unit=day. Most differences are negative : measurements at Ladybower are
larger than at Leeds. This supports scientific expectations that rural ozone levels are
generally higher than urban levels.
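The variance reduction from differencing can be illustrated on simulated data. The sketch below does not use the ozone measurements: it assumes a hypothetical shared daily effect (common) added to independent site-level noise, with means and standard deviations chosen only to resemble the scale of the problem.

```r
set.seed(105)                           # arbitrary seed, for reproducibility
n <- 469                                # same length as the summer series
common <- rnorm(n, mean = 0, sd = 8)    # hypothetical effect shared by both sites
x <- 32 + common + rnorm(n, sd = 4)     # stand-in for the urban site
y <- 44 + common + rnorm(n, sd = 4)     # stand-in for the rural site
d <- x - y                              # the shared effect cancels in the difference

# Variances of the two series and of their differences
print(c(var_x = var(x), var_y = var(y), var_d = var(d)))

# Proportion of days on which the difference is negative
print(mean(d < 0))
```

The shared component contributes to var(x) and var(y) but not to var(d), which is why the differenced data are less variable when the two sites respond to common conditions.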
Choice of bin size for a histogram
Constructing a histogram smooths the data, and the width of the bins determines how
much smoothing is applied. Broad bins correspond to highly smoothed data, in which
much of the structure of the data set is lost. Narrow bins undersmooth the data,
leaving in random variation which obscures the structure of the data, but in a different
way.
Exercise 3.19 Choosing bin size for the summer ozone data. Examples of very wide
and very narrow bins are shown for the summer ozone data from the Leeds city centre
site.
par(mfrow=c(1,2))
x = Leeds.O3
hist(x,prob=T,col='yellow',breaks=2)
hist(x,prob=T,col='red',breaks=500)
Using very wide bins has obscured the structure of the data. So have the very narrow
bins: the right hand plot just shows the raw data, although that is surprisingly informative here.
The bin width in the earlier plot lies somewhere in between and was found by trial and error.
Heights of offshore waves at Newlyn
The data set waves gives the maximal levels (in metres) recorded over consecutive 15
hour windows, throughout the period 1971-77.
Typing in waves displays the whole vector.
Exercise 3.20
Find the length of this vector: length(waves) # 2894
Find the mean of the offshore wave heights: mean(waves) # 2.866
Display a histogram of the offshore wave heights: hist(waves)
[Figure: histogram titled "Offshore waves"; x-axis: Wave height; y-axis: Frequency.]
Describe the shape of this distribution and the range of this variable: Asymmetric, long right tail, all
What does the y-axis of this plot represent? Counts of observations that fall in each bin.
Scale the histogram to have area 1. hist(waves,prob=TRUE)
What does the y-axis of this plot represent now? Relative frequency per unit of wave height,
so that the total area is 1. The x-axis shows wave heights measured in metres.
3.5
Empirical cdf
The cumulative distribution function (cdf) of a random variable X is
\[ F(x) = P(X \le x), \quad \text{for } -\infty < x < \infty, \]
whether discrete or continuous. Define the indicator function
\[ I(X \le z) = \begin{cases} 1 & \text{if } X \le z, \\ 0 & \text{otherwise.} \end{cases} \]
Exercise 3.21
Result: (Unbiased estimate of cdf.) Show, for any fixed z, the expected value of
I(X ≤ z) is F (z).
Sol:
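\begin{align*}
E[I(X \le z)] &= \int_{-\infty}^{\infty} I(x \le z) f(x)\,dx && \text{(def } E\text{)} \\
&= \int_{-\infty}^{z} 1 \cdot f(x)\,dx + \int_{z}^{\infty} 0 \cdot f(x)\,dx \\
&= P(X \le z) + 0 \\
&= F(z). && \text{(def } F\text{)}
\end{align*}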
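The result can also be checked by simulation. The sketch below assumes, purely for illustration, that X is standard normal and that z = 0.5; neither choice comes from the notes. Averaging the indicator over many simulated values should then be close to F(z) = pnorm(z).

```r
set.seed(105)                 # arbitrary seed, for reproducibility
z <- 0.5                      # illustrative choice of z
x <- rnorm(100000)            # assumption: X ~ N(0, 1)

est <- mean(x <= z)           # sample average of the indicator I(X <= z)
print(c(simulated = est, exact = pnorm(z)))
```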
Definition: The empirical cdf is defined as
\[ \tilde F(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \le x). \]
Result: The ecdf can be calculated from F̃ (x) = (1/n) × (number of i such that xi ≤ x). Each observation has an equal weight 1/n in this computation.
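The counting rule translates directly into R. In this sketch ecdf_at is a hypothetical helper written for illustration; it is compared with R's built-in ecdf() on a small sample.

```r
x <- c(2, 3, 4, 1, 2)

# F~(t) = proportion of observations that are <= t
ecdf_at <- function(t, data) mean(data <= t)

pts <- c(0.5, 1.5, 2.5, 3.5, 4.5)
by_hand  <- sapply(pts, ecdf_at, data = x)
built_in <- ecdf(x)(pts)      # ecdf() returns a function, evaluated here at pts

print(by_hand)                # 0.0 0.2 0.6 0.8 1.0
print(all.equal(by_hand, built_in))
```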
Exercise 3.22 5 realisations of a rv X are {2, 3, 4, 1, 2}. Compute F̃ (x) at x = .5, 1.5, 2.5, 3.5, 4.5.
How would the calculation change if the points x = 0, 1, 2, 3, 4 are used?
Sol:
F̃ (0.5) = 0/5
F̃ (1.5) = 1/5
F̃ (2.5) = 3/5
F̃ (3.5) = 4/5
F̃ (4.5) = 5/5.
Not much changes: F̃ (0) = F̃ (0.5), F̃ (1) = F̃ (1.5),. . . . This suggests that the ecdf is
a step function.
Properties of the ecdf
The empirical cdf F̃ (x) is a proper cdf and
• is a step function with jumps at the data points;
• F̃ (x) = 1 if x ≥ max(x1 , . . . , xn );
• F̃ (x) = 0 if x < min(x1 , . . . , xn ).
An alternative calculation of the ecdf
As the ecdf is a step function with jumps at the data points, there is an easier way of
calculation. Take the realisations x1 , . . . , xn ; order them with the smallest first; label
these order statistics as x(1) , x(2) , . . . , x(n) so that
x(1) ≤ x(2) ≤ . . . ≤ x(n) .
The subscripts give the ranks of the data points.
x=c(2, 3, 4, 1, 2)
rank(x)    # 2.5 4.0 5.0 1.0 2.5
sort(x)    # order statistics
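Result: the ecdf can be evaluated at the order statistics,
\[ \tilde F(x_{(i)}) = \frac{i}{n}, \]
and for values of x in between,
\[ \tilde F(x) = \frac{i}{n}, \quad \text{where } x_{(i)} \le x < x_{(i+1)}. \]
Proof: The number of observations ≤ x(i) is i when there are no ties; with tied values, take i to be the largest rank among them.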
Exercise 3.23
For observations {2, 3, 4, 1, 2}, find F̃ (x) and sketch the plot.
x
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
F̃ (x)
Sol:
Order the data: 1, 2, 2, 3, 4.
x        0    0.5  1    1.5  2    2.5  3    3.5  4    4.5  5
F̃ (x)   0/5  0/5  1/5  1/5  3/5  3/5  4/5  4/5  5/5  5/5  5/5
Exercise 3.24 Summer ozone. Use the first 20 observations from Leeds city centre
summer ozone values to compute the ecdf.
n = 20
x = Leeds.O3[1:n]
xrank = sort(x)      # order the data
Fn = seq(1,n)/n      # a jump of 1/n
plot(xrank, Fn, type='s') ; grid()    # step function
plot(ecdf(x),pch='.') ; grid()        # ecdf is an R function
# for the whole data
par(mfrow=c(1,1))
x = Leeds.O3
plot(ecdf(x),pch='.') ; grid(12)      # ecdf is an R function
Draw some conclusions from the complete data.
Sol:
On about 60% of days the daily maximum was less than 35, and on about 20% of days the
daily maximum was greater than 40; there is a steady increase in the cdf between 20 and 40, and the
maximum stretches out to 80.
3.6
Summary statistics
In addition to visualising our data graphically, we can calculate some summary statistics
which capture important features of our data. Numerical summaries of the data can
• facilitate the comparison of different variables;
• help make clear statements about aspects of the data.
Mathematical notation
Recall the notation
\[ \sum_{i=1}^{n} g(i) = g(1) + g(2) + \dots + g(n-1) + g(n) \]
for any positive integer value of n and any function g. In statistics we often have to
do mathematics with sums of this form. The most common forms of this expression
encountered are:
\[ \sum_{i=1}^{n} x_i = x_1 + \dots + x_n \quad \text{and} \quad \sum_{i=1}^{n} x_i^2 = x_1^2 + \dots + x_n^2. \]
Sample mean
Consider a random variable X from which we obtain n realisations x1 , . . . , xn . To
emphasise some mathematical properties of averaging we may write the realisations as
a vector x = (x1 , . . . , xn ).
Definition: The sample mean of n observations x1 , . . . , xn is denoted by x̄, or by m(x),
and is obtained by summing all the xi and dividing by n:
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{and} \quad m(x) = \frac{1}{n} \sum_{i=1}^{n} x_i. \]
This measures the location of the sample. It is an estimate of the expectation E(X),
or the mean of X.
Sample variance and standard deviation
Definition: The sample variance of n observations x1 , . . . , xn is denoted s² and is given
by:
\[ s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2. \]
Note the divisor n. Many textbooks use the divisor (n − 1) instead of n here.
There are technical reasons for this but, for large values of n, it makes little difference.
The sample variance is a measure of spread of X and also an estimate of the variance
var(X). Ideally, a spread measure should have the same units as the original data.
Definition: The sample standard deviation of observations x1 , . . . , xn is s = √(s²). The
standard deviation σ of X is the square root of σ² = var(X); the sample standard
deviation estimates this value from the data.
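The choice of divisor matters when comparing hand calculations with R, because R's var() and sd() use the divisor n − 1. A minimal sketch on a small sample, with the names s2_n and s2_r introduced here for illustration:

```r
x <- c(2, 3, 4, 1, 2)
n <- length(x)

s2_n <- mean((x - mean(x))^2)    # divisor n, as in the definition above
s2_r <- var(x)                   # R's var() uses the divisor n - 1

print(c(divisor_n = s2_n, divisor_n_minus_1 = s2_r))   # 1.04 and 1.30

# The two versions differ only by the factor (n - 1)/n
print(all.equal(s2_n, (n - 1) / n * s2_r))
```

For n = 5 the factor (n − 1)/n is 0.8, a visible difference; for samples with thousands of observations it is very close to 1.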
Exercise 3.25 Waves. Find the sample mean of the wave height data: mean(waves) # 2.866.
Find the sample variance: var(waves) # 2.564049. (R's var() uses the divisor n − 1; with n = 2894 the difference from the divisor n is negligible.)
Use the sqrt() function to derive the sample standard deviation: sqrt(var(waves)) # 1.601265.
Exercise 3.26 Ozone data. Calculate summary statistics of O3 to look more closely for
differences between the locations and the seasons.
There are four groups, arising from the two levels of each of the two nominal variables
location and season. Standard deviations are in parentheses.
The means are
              summer           winter
Leeds city    31.78 (9.28)     20.52 (10.77)
Ladybower     43.63 (11.81)    29.24 (8.40)
Give the R code to compute these numbers. Draw conclusions from these summary
statistics.
Sol:
mean(Leeds.O3)                   # works once ozone.summer is attached
mean(ozone.summer$Leeds.O3)      # list
mean(ozone.winter$Leeds.O3)
sd(ozone.winter$Ladybower.O3)
The conclusions are comparative: The mean values for Ladybower are higher than for
Leeds. The summer mean values are higher than the winter ones. The spreads are
roughly the same.
Sample quantiles
Sample quantiles are calculated directly from the empirical cdf.
Definition: the pth sample quantile, x̃p , satisfies
\[ \tilde F(\tilde x_p) = p \quad \text{for } 0 < p < 1. \]
The median x̃0.5 corresponds to p = 0.5; it is another widely used measure of location.
The definition x̃p = F̃ −1 (p) does not work here because F̃ is a step function and so its
inverse is not defined.
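For the sample {2, 3, 4, 1, 2} the ecdf only takes the values 0, 0.2, 0.6, 0.8 and 1, so no x satisfies F̃ (x) = 0.5 exactly and a convention is needed. The sketch below uses one common convention, the smallest observation at which the ecdf reaches p; the helper sample_quantile is hypothetical, and other conventions (including R's default in quantile()) interpolate and can give different answers.

```r
x <- c(2, 3, 4, 1, 2)

# Smallest observed value whose ecdf is at least p (one convention among several)
sample_quantile <- function(p, data) {
  xs <- sort(data)
  xs[which(ecdf(data)(xs) >= p)[1]]
}

print(sample_quantile(0.5, x))                 # 2

# quantile() with type = 1 inverts the empirical cdf in the same way
print(quantile(x, probs = 0.5, type = 1))
```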
Exercise 3.27 Calculate the sample mean and the median, for each dataset, using this
code
stats = function(x){ c(mean(x),median(x)) }
stats(c(2, 4, 6, 8, 10))      # 6 6
stats(c(2, 4, 6, 8, 100))     # 24 6
stats(c(2, 4, 6, 8, 1000))    # 204 6
The lesson learnt is that the median is insensitive to outliers.
Exercise 3.28 Find the 0.6 quantile of the Leeds summer ozone daily maxima.
[Figure: empirical cdf F̃ (x) of the Leeds summer daily maxima, with the level p = 0.6 marked; x̃p lies in the interval (33,34).]
The 0.6 sample quantile lies in the interval (33, 34). quantile(Leeds.O3,prob=0.6) # 33
Exercise 3.29 The function quantile() calculates quantiles of a vector: quantile(waves).
The minimum, maximum and median values are 0.32m, 11.05m, 2.46m.
Compare the median to the mean: the mean is higher since the histogram is skewed to the right.
Exercise 3.30 Plot the empirical cdf of the waves data, plot(ecdf(waves),pch='.'); grid(21),
to answer the following.
Find the median of the wave height distribution: 2.5m approx.
Find the 0.1 and 0.9 quantiles of the wave height distribution: 1.2m and 5.1m approx.
Estimate the probability of a randomly selected wave being less than 1.7m: 0.25.
Find the wave height exceeded by 25% of the waves: 3.7m approx.
Box-and-whisker plots
These plots summarise the observations in terms of quantiles. They display the extremes (the whiskers), and the central values (the box defined by the quartiles and the
median).
Definition: the interquartile range is x̃0.75 − x̃0.25 . The length of the box is the interquartile range.
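The interquartile range can be computed from the quartiles or with R's IQR() function. The sketch below uses a small made-up sample with one extreme value; note that R's default quantile() interpolates between order statistics, so on other data its quartiles may differ slightly from those read off the ecdf. The use of boxplot.stats() to list the points R would draw beyond the whiskers is an addition for illustration.

```r
x <- c(2, 4, 6, 8, 100)                 # small illustrative sample

q <- quantile(x, probs = c(0.25, 0.5, 0.75))
iqr_by_hand <- unname(q[3] - q[1])      # upper quartile minus lower quartile

print(c(by_hand = iqr_by_hand, built_in = IQR(x)))   # both 4

# Points R's boxplot would plot individually, beyond the whiskers
# (by default more than 1.5 box lengths from the box)
print(boxplot.stats(x)$out)             # 100
```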
Exercise 3.31 Boxplot for the Ozone data.
load("./m105.Rdata")
attach(ozone.summer) ; names(ozone.summer)
par(mfrow=c(1,2))
hist(Leeds.O3); hist(Ladybower.O3)
boxplot(Leeds.O3,ylim=c(0,110));
boxplot(Ladybower.O3,ylim=c(0,110))
quantile(Ladybower.O3)
#
Features of the boxplots are: the thick line in the box is the median; the upper line
in the box is the 75% quantile, and the lower line is the 25% quantile; the minimum
and maximum are easily identified; and points appearing outside the limits may be
considered outliers. Summarise conclusions to be drawn from these boxplots.
Sol:
Skewness is shown as asymmetry of the box around the median; here it is only the
right hand tail that is long. The Ladybower distribution is a shift to the right of the
Leeds distribution. Comparison requires the same scales.
3.7
Bivariate relationships
Histograms and empirical distribution functions are useful methods for visualising a
single variable. However, with multivariate data, it is important to examine the relationships between variables as well as the structure of each variable by itself. The
scatterplot simply plots the value of one variable against another.
Definition: if (xi , yi) is a pair of observations on the same unit i, for i = 1, 2, . . . , n, the plot of
the points (xi , yi) is called a scatterplot.
Exercise 3.32 Ozone. Consider the effect of nitrogen dioxide (NO2 ) on ozone levels.
We focus on the Leeds city centre measurements. Use this R code to give scatterplots
of O3 and NO2 for summer and winter. Sketch the graph in your notes.
par(mfrow=c(1,2))
xsumm = ozone.summer$Leeds.O3
ysumm = ozone.summer$Leeds.NO2
lim = c(0,100) # vital for comparison
plot(xsumm,ysumm,type='n',xlim=lim,ylim=lim) ; grid()
points(xsumm,ysumm,col='red',pch='.',cex=2)
xwint = ozone.winter$Leeds.O3
ywint = ozone.winter$Leeds.NO2
plot(xwint,ywint,type='n',xlim=lim,ylim=lim) ; grid()
points(xwint,ywint,col='blue',pch='.',cex=2)
# stretch the graphic
Draw conclusions.
Sol:
Ozone. Similar joint distributions, main body slightly differently located. No obvious
relationship between x and y, perhaps winter (x=small,y) difference. Many outliers.
The sample correlation coefficient
Consider two rvs X and Y on which we have iid observations (x1 , y1), . . . , (xn , yn ).
Let m(x) denote the sample mean of the x = (xi ; i = 1, . . . , n), let s(x) denote the
sample standard deviation of the (xi ; i = 1, . . . , n). Similarly define m(y) and s(y).
Standardised versions of xi and yi are
\[ \frac{x_i - m(x)}{s(x)} \quad \text{and} \quad \frac{y_i - m(y)}{s(y)}. \]
Definition: the sample correlation coefficient r(x, y) is the average of the product of
these standardised values
\[ r(x, y) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - m(x)}{s(x)} \right) \left( \frac{y_i - m(y)}{s(y)} \right). \]
n = 20
x = runif(n) ; y = runif(n)
cor(x,y)
mean( (x-mean(x))/sd(x) * (y-mean(y))/sd(y) )
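# why are these different? sd() uses the divisor n-1, the definition above uses n
f = sqrt((n-1)/n)    # rescales sd() to the divisor-n standard deviation
mean( (x-mean(x))/(f*sd(x)) * (y-mean(y))/(f*sd(y)) )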
Result: (The correlation coefficient is invariant to standardisation, up to sign.) For given scalars
a, b, c, d, with a and c nonzero, and vector of ones 1 = (1, 1, . . . , 1),
r(ax + b1, cy + d1) = sign(ac)r(x, y).
Proof: See exercises.
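A numerical check is no substitute for the proof, but it makes the statement concrete. In the sketch below the data and the constants a, b, c0, d are arbitrary choices made for illustration (c0 plays the role of c, renamed so as not to mask R's c() function).

```r
set.seed(105)                           # arbitrary seed, for reproducibility
n <- 50
x <- runif(n)
y <- x + runif(n)                       # positively associated with x

a <- -2; b <- 3; c0 <- 5; d <- -1       # arbitrary constants, a and c0 nonzero

lhs <- cor(a * x + b, c0 * y + d)       # correlation after the linear changes
rhs <- sign(a * c0) * cor(x, y)         # sign(ac) times the original correlation

print(c(lhs = lhs, rhs = rhs))
print(all.equal(lhs, rhs))
```

Here a and c0 have opposite signs, so the transformed correlation is the negative of the original one.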
Result: The correlation coefficient always satisfies −1 ≤ r(x, y) ≤ 1.
Proof: Because of the invariance of the correlation coefficient to standardisation, take
x, y to have mean 0, and variance 1. Thus
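\[ \sum_i x_i = 0 \quad \text{and} \quad \sum_i x_i^2 = n, \]
and similarly for y. Consider the quadratic form
\begin{align*}
Q &= \frac{1}{n} \sum_i (x_i + y_i)^2 \\
&= \frac{1}{n} \sum_i (x_i^2 + y_i^2 + 2 x_i y_i) \\
&= \frac{1}{n} \sum_i x_i^2 + \frac{1}{n} \sum_i y_i^2 + \frac{2}{n} \sum_i x_i y_i \\
&= 1 + 1 + 2 r(x, y).
\end{align*}
Now Q ≥ 0, so that 0 ≤ 2 + 2r, and r ≥ −1. Similarly start with
\[ Q = \frac{1}{n} \sum_i (x_i - y_i)^2 \]
and find 0 ≤ 2 − 2r, so that r ≤ 1.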
The sample correlation coefficient is a measure of linear association, or clustering
around a line. Interpretation: r(x, y) = 0 gives no linear association, r(x, y) < 0 means
negative linear association, r(x, y) > 0 means positive linear association; when r(x, y)
is near ±1 the association is strong.
Exercise 3.33
Use this code to generate data with r = 0.5, roughly.
par(mfrow=c(1,1))
n = 400
z = rnorm(n)
x = z + rnorm(n); y = z + rnorm(n)
plot(x,y, type='p', pch='x')
cor(x,y)    # .47
Use other relations of x and y to z to give plots with r = −0.5, r = 0.9, r = 0, roughly.
Sol:
x = z + rnorm(n)   ; y = -z + rnorm(n)
plot(x,y, type='p', pch='x'); cor(x,y)    # -.52
x = 3*z + rnorm(n) ; y = 3*z + rnorm(n)
plot(x,y, type='p', pch='x'); cor(x,y)    # .91
x = rnorm(n)       ; y = rnorm(n)
plot(x,y, type='p', pch='x'); cor(x,y)    # .04
Exercise 3.34 The sample correlation coefficient is not appropriate for detecting nonlinear association.
x = z + rnorm(n) ; y = z^2 + rnorm(n)
plot(x,y,type='p', pch='x'); cor(x,y)    # -.04
Exercise 3.35 Ozone data. Calculate the sample correlation coefficients between O3
and NO2 for the ozone data. There are four groups, arising from the two levels of each
of the two nominal variables location and season.
xsc = ozone.summer$Leeds.O3        # summer in the city
ysc = ozone.summer$Leeds.NO2
xwc = ozone.winter$Leeds.O3        # winter
ywc = ozone.winter$Leeds.NO2
xsr = ozone.summer$Ladybower.O3    # rural
ysr = ozone.summer$Ladybower.NO2
xwr = ozone.winter$Ladybower.O3
ywr = ozone.winter$Ladybower.NO2
cor(xsc,ysc) ; cor(xsr,ysr)
cor(xwc,ywc) ; cor(xwr,ywr)
Collating the results gives
                       Summer    Winter
Leeds city             0.10      -0.24
Ladybower reservoir    0.25      -0.48
What conclusions can you draw from these statistics?
Sol:
The correlations between O3 and NO2 are small with only one being moderate. By
comparison with the earlier figure, one might worry about outliers and/or non-linearity.
The fear is that outliers may distort the value of the coefficient. plot(xwr,ywr,type='p')
shows an association, but a non-linear one.
Exercise 3.36 The sample correlation coefficient is an estimate of the population
correlation between X and Y , denoted corr(X, Y ). While its definition is beyond
math105, consider how one might start by arguing an analogy to the relation between
E(X) and x̄.
Sol:
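Compare
\[ \bar{x} = \frac{1}{n} \sum_i x_i \quad \text{(weighted average)} \]
and
\[ E(X) = \int_{x=-\infty}^{\infty} x\, f(x)\, dx \quad \text{(weighted average).} \]
Now taking the standardised variables,
\[ r = \frac{1}{n} \sum_i x_i y_i, \]
\[ E(XY) = \int_{x=-\infty}^{\infty} xy \cdot ?? \cdot dx\, ?? \]
\[ E(XY) = \int_{x=-\infty}^{\infty} \int_{y=-\infty}^{\infty} xy\, f(x, y)\, dx\, dy \quad \text{(conjecture).} \]
Need to define a joint pdf f (x, y).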
3.8
Chapter summary
An introduction to statistics and exploratory data analysis is developed in terms of
uncertainty, decision making and data. The symbiotic theories of probability and of
statistics are contrasted in terms of the before analysis and the after analysis of a
probability experiment.
Data drives statistics, and sources of variation between and within data sets are described. One source of random variation is sampling and the concept of the simple
random sample is introduced. Conceptual issues, such as the representative nature of
the sample, the population, and methodological issues such as how to define a large
sample, and other forms of sampling, are briefly discussed.
Given a data set the first step in statistics is to understand its context and subject it to
an exploratory data analysis in order to understand its structure and variability. Data
sets for waves, trees, and ozone are used as running examples throughout the chapter.
The histogram is perhaps the most well known graphical method of eda, and is one way
to portray distributions. We use it to construct an empirical estimate of the pmf or pdf
of the rv under study. However, the empirical cdf is just as practically important and
theoretically has pride of place. Summary statistics related to the data set are the well
known sample mean, variance and standard deviation, and the lesser known sample
quantiles. Boxplots, which are condensed summaries of the histogram, are based on
given quantiles. The Chapter ends with the extension to bivariate relationships and
the definition of the sample correlation coefficient.
Throughout, these statistical concepts are illustrated in the R language with special
emphasis on calculation, plotting and simulation.
Contents
1 Introduction to R
1.1 The tutorial
1.2 Chapter summary
2 Continuous random variables
2.1 Review of probability
2.2 Continuous and discrete rvs
2.3 Expected values
2.4 Standard continuous distributions
2.5 Quantiles and the cdf
2.6 Chapter summary
3 Statistics and exploratory data analysis
3.1 Uncertainty
3.2 Exploratory data analysis
3.3 Examples with associated data sets
3.4 Graphical methods
3.5 Empirical cdf
3.6 Summary statistics
3.7 Bivariate relationships
3.8 Chapter summary