MATH 105: [Probability and] Statistics
Joe Whittaker, B25 Fylde College
Department of Mathematics and Statistics, Lancaster University
April 2010
LUVLE: https://domino.lancs.ac.uk/09-10/MATH/MATH105.nsf

Organization

The module runs for five weeks, weeks 21-25, with four lectures a week, a weekly workshop and a weekly Lab100 help session.

Handouts:
• Course notes
• Exercises: Workshop, Quiz, Course Work.

Please bring both to the lectures and workshops. The notes have gaps which are to be filled in during the lectures. Your participation in the course, by taking part in experiments, contributing in lectures and workshops and responding to the questionnaire, is much appreciated.

Timetable

[Weekly timetable grid not reproduced in this transcript; lecture, workshop and lab times are listed below.]

Lectures: are held at 10am Tuesday, 11am Wednesday, 9am Thursday and 12 noon Friday.

Workshops: will be held in Management School Lecture Theatre 7. Lists of groups are posted outside the Maths and Stats Department Office in Fylde College. Workshops start in the first week.

Labs: continue in the lab100 stream: Monday at 10, 12, 4, 5; Tuesday at 9, 11, 12, 5. Labs start in the first week, and there is a test in week 24. Any problems, please see Julia in B4c Fylde.

Assessment
• 20% Course Work (10% quiz + 10% written),
• 30% end-of-term test (Friday week 25),
• 50% final exam.

Deadlines

Online quiz questions (labelled QZ) should be completed by 2pm on the following Wednesday. Homework questions (labelled CW) should be handed in by 2pm on the following Wednesday in your tutor's pigeonhole. Solutions are posted on the course webpage.

Labs

The Lab100 course is running in parallel with math105. Weekly help sessions are available. You are expected to have downloaded R on to your computer. The first lectures are on R, and are examinable in math105.

Preliminaries

The math105 course continues on from math104, very directly.
Firstly, you have met R in the lab100 work associated with math104. Both math104 and math105 require R, and the first chapter here goes over a tutorial introduction to some of the basic concepts of the language. Secondly, the extension of probability from discrete random variables, discussed in math104, to continuous random variables is discussed here. Both the discrete and the continuous cases are needed for statistics. The mathematical prerequisite for the analysis of continuous random variables is the integral calculus of math101. The third part of the course introduces the statistical methods which are required for tackling a range of applied problems. The focus is on strategies for data modelling rather than mathematical theory. However, there is some theory, and we aim to introduce basic concepts here, as the theory will be taught fully in later statistics courses. Data examples are used throughout the course, to illustrate the techniques that the course aims to teach you. The course data sets are on the LUVLE course web.

At the end of this course, you should be able to:
• understand the basic concepts and objects of the R language, including some elements of programming;
• define the basic concepts of continuous random variables, the probability density function and the cumulative distribution function;
• have familiarity with some standard continuous random variables, such as the Uniform, Exponential and Normal; be aware of their parameters and how these relate to expectations;
• use R to make computations and plots of the cdf, and of quantiles derived from it;
• use R to simulate from standard distributions;
• use graphical tools such as histograms, scatterplots, the empirical distribution function and the boxplot;
• calculate and understand numerical summary statistics such as the mean, median, variance, quantiles and the correlation coefficient;
• discuss a range of modelling assumptions that can play a part in statistical analysis.
Background reading

Although the lectures and these accompanying notes are self-contained, further details can be found in the following recommended texts:

Clarke, G.M. and Cooke, D. (1998). A Basic Course in Statistics. 4th ed, Arnold.
Daly, F., Hand, D., Jones, M., Lunn, A. and McConway, K. (1995). Elements of Statistics. Addison Wesley.
Lindsey, J. (1995). Introductory Statistics: A Modelling Approach. Oxford Science Publications.

Chapter 1  Introduction to R

R is a software package, and a language, that provides a statistical computing environment. R is open source and can be downloaded from http://www.r-project.org. More information on obtaining R, and this tutorial, can be found on our /department/info/intranet/com pages.

1.1 The tutorial

Objects

Type x = 3 or x <- 3 to create a new object called x which has the value 3. The operator = or <- is not the mathematical = but an assignment operator. Predict the value of y to understand what is happening:

x <- 6
x
x^2
y <- x*(4+x/2)
y                # the answer

# is a comment, and all to its right is ignored. The arithmetic operators + - * / ( ) work as expected. The hat is used for exponentiation, so 3^2 is 9.

Exercise 1.1 Create a new variable called z with the value "five cubed divided by seven plus two". Sol:

z = 5^3/(7+2)  # 13.88889
z = 5^3/7+2    # 19.85714

Use precedence to resolve ambiguity.

There are some tricks you can use to save on typing: whenever possible paste R code from the pdf file into a text editor; use the up and down arrow keys to recall and edit previous commands.

Functions

Most statements in R involve functions, and usually involve the use of round brackets (). Functions are ways of running commands in R on given inputs; there may or may not be an output.

y <- sin(pi/4)   # gives y the sine of pi/4
round(y)
ls()             # lists your objects
rm(y)            # removes y
q()              # quit R

At exit you are asked if you would like to save the objects you created.
If you answer "yes" all the objects will still be there the next time you start R.

Type ls without the (). Notice that this shows the code of the ls function, but does not run the function.

Vectors

Vectors are used to store more than one number in an object.

x <- c(pi, 1, 8.6, -1, 0)   # "c" function creates vector
y <- 1:5                    # y gets (1, 2, 3, 4, 5)
y <- seq(from=0,to=6,by=2)  # a sequence, takes 3 args
length(y)
z <- c("a","b","c")         # z gets 3 characters

Vectors are indexed using the square brackets [].

x[3]        # element 3 of x
x[c(4,1)]   # elements 4 and 1 of x
x[-3]       # x without element 3
x+2         # element by element arithmetic
round(x)    # rounds each element of x

Notice how functions work on vectors: they apply to each element of the vector.

Exercise 1.2 For each of the numbers 2, 3, 4, 5, 6 and 12 find the square of the number divided by 2. Sol:

x = c(2, 3, 4, 5, 6, 12)
x^2/2
(x/2)^2   # or is it?

Graphics: simple plots

The function plot() starts a new plot. Usually it requires a vector of x-coordinates and a vector of y-coordinates as input:

x <- 1:20
y <- x^3
plot(x,y)   # function with 2 (or more) arguments

A new graphics window pops up for the plot. Subsequent plots overwrite the current plot in this window. To get a line plot instead of a point plot, use the optional argument type="l" with plot():

plot(x,y,type="l")

You can add points or lines to an existing plot, using the points or lines functions:

plot(x,y)
points(rev(x),y)   # rev reverses the order
lines(x,8000-y)

Different characters for points need a pch= argument. Numbers give various symbols, and characters use that letter as a marker:

plot(x,y)
points(rev(x),y,pch=3)    # add crosses
points(x,8000-y,pch="x")  # a character

Change line widths with lwd=, or line styles with lty=. Colours are set with col=

plot(x,y, col="red")
lines(x,y,lwd=4)        # thick line
lines(rev(x),y,lty=2)   # dashed line

You can label your axes with the xlab= and ylab= arguments, and you can give your plot a title with the main= argument.
plot(x,y,xlab="X Is Across", ylab="Y Is Up", main="Main Title")

Exercise 1.3 Draw a blue circle, put a nice title on your graph but no axis labels. Hint: think radial.

theta <- seq(0, 2*pi, length=100)
x = cos(theta)
y = sin(theta)
plot(x,y,type='n',   # sets out axes, no points
     xlab="", ylab="", main="circle")
lines(x,y,col="blue")

Getting Help in R

You can start R's help system by typing help.start(), or by using the menus. Either one will start a web browser window showing the R help web page. If this doesn't work, go to http://stat.ethz.ch/R-manual/R-patched/doc/html/index.html or http://tinyurl.com/cny9k. E.g. to find out more about the seq function, either enter ?seq or use the menus.

Exercise 1.4 Use the text function to put the name of your favourite philosopher in the centre of your blue circle. Sol:

?text
text(0,0,'Plato')

Reading data into R

The read.table() function is used to read a data file into R. Save the file class96.dat in your home directory. Look at the file in your favourite text editor. Load it into R with a command such as

class96 <- read.table("h:/class96.dat")  # windows
class96 <- read.table("class96.dat")     # linux local dir
class96 <- read.table("~/class96.dat")   # linux top dir

Notice that the forward slash, not the backward slash, is used to delimit folders, even in windows.

Matrices

The class96 dataset contains the heights and weights of the students enrolled in GSSE401 in 1996. The four columns in the matrix are: number in list, height (cm), weight (kg), and sex, where 1=female, 2=male.
class(class96)      # data.frame, more than a matrix
class96             # displays the values
dim(class96)        # dimensions of the matrix
class96[3,4]        # element in row 3 column 4
class96[1:3,]       # first three rows and all the cols
class96[,c(1,3)]    # 1st and 3rd column
hist(class96[,2], main="Student Heights", xlab="cm")  # histogram of student heights

Headers and column names

names(class96)   # default names
names(class96) = c("number", "height", "weight", "gender")
class96
class96[1:3, c("height", "weight")]  # same as class96[1:3,c(2,3)]
class96$height[1:3]                  # list access

There are more options for read.table, e.g. to read in files separated by commas, use sep=",". The na.strings argument tells R how missing values are coded. R expects missing values to be written as NA. If the missing values are coded differently, a dot say, use na.strings=".".

Writing functions

Functions are of the generic form

name <- function(input args){ statements }

For example

myfun <- function(x){ plot(x, x^2-x) }

myfun takes one argument, x, and makes a graph with it. It expects x to be a vector with several values.

x = seq(-2, 3, len=100)
myfun(x)
myfun    # lists the code

Storing commands in files

Usually we write our R functions in separate files and then load them into R. Tradition has it that these files are given .R extensions. You can use any editor to write functions, though Emacs recognises R code and has some special R features. Create a new file called joe.R containing

theplot <- function() {
  x <- seq(-2, 2, len=1000)
  y <- log(abs(x))
  plot(x, y, main="the plot", lwd=2, col="green")
}

Notice that the statements of the function are enclosed in curly braces {}. There are three ways of loading this file into R:
1 typing source("joe.R");
2 if you're running R in emacs, pull up the file joe.R and press Ctrl-C Ctrl-L, or use the menus ESS -> Load File;
3 if your R interface has menus (such as the Gnome or windows GUI), select File -> source.

Having sourced in the file joe.R, run the function theplot().
Arguments

Change your function theplot to the following

theplot <- function(minx, maxx) {
  x <- seq(minx, maxx, len=1000)
  y <- log(abs(x))
  plot(x, y, main="the plot", lwd=2, col="green")
}

As you can probably guess, if you type theplot(-1, 3) the graph that results will have an x axis going from -1 to 3. You could get the same result by running

theplot(maxx=3, minx=-1)   # or
theplot(min=-1, ma=3)

If the first line of the file is changed to

theplot <- function(minx=-2, maxx=2) {

the default values of minx and maxx will be -2 and 2. Try running theplot(max=5).

Types of R objects

There is a difference between numbers and characters. 1 is a number. "one" and "1" are character strings. Try the following:

x <- c("1", "2")
x
x+1                # an error: x is not numeric
as.numeric(x) + 1
as.character(1:4)
class(x)           # gives class of an object

Logicals

Logical or binary data consists of TRUE (or T or 1) or FALSE (or F or 0). Predict the output from running R.

x <- 1:4
x == 2     # notice the double ==
x > 2
x != 2
x[x!=2]    # clever
x[x!=2] <- 7

Exercise 1.5 Write a function that changes every negative element in a vector to NA. Sol:

neg2na = function(x){ x[x<0] = NA; return(x) }
y = c(1,2,-3,4)
neg2na(y)

Lists

Lists are collections of stuff; an element of a list can be anything: a matrix, a vector, a function, or even another list.

x <- list(lenin = sum,
          marx = c("bourgeois", "class struggle"),
          engels = matrix(1:4, nrow=2))
x
x$marx
x[[2]]
x[[1]](1:4)    # is 1+2+3+4
x[["engels"]]
length(x) ; names(x) ; summary(x)

Most statistical functions in R return lists.

Nothing

NULL means nothing. NA is a missing value. NaN means not a number. Try the following to see if there is a difference between NULL and NA.

c()
c(1:3, NA)
c(1:3, NULL)
x <- c(NA, NA, 3, pi)
x == 3
is.na(x)
is.na(NULL)
is.null(x)
is.null(NULL)

Typing x==NA fails; you need is.na(x). Other functions to test the class of an object are is.list, is.matrix, is.logical, is.numeric.

Programming: Loops

The for loop is an important programming tool.
A simple loop is

for( x in c(4,2,6) ) { print(x) }

There are other kinds of loops which are more difficult to use but are faster than for loops.

The by function is quite clever. To find the mean height of the class96 students according to sex,

by(class96$height, class96$gender, mean)   # gender column, named above

Apply

To apply a function to every column of a matrix use apply, and for lists use lapply.

Exercise 1.6 Use apply to find the sum of each column of class96. Sol:

sum(class96)
apply(class96,2,sum)
mean(class96)          # mean is special [class96 has names]
apply(class96,2,mean)

If statements

Predict the output from the following code

for (x in 1:4) {
  if (x>3) { print("big") }
  else { print("small") }
}

Exercise 1.7 Write the code to draw four circles with decreasing radii. Put an if statement in the for loop to make the 2 smallest circles red. Sol:

theta = seq(0,2*pi,length=100)
x0 = cos(theta)
y0 = sin(theta)
plot(x0,y0,type='n')
for(i in 1:4) {
  r = (.8)^i
  x = r*x0
  y = r*y0
  if(i>2) { lines(x,y,col='red') }
  else    { lines(x,y,col='blue') }
}  # end for

Objects within functions

Any objects you create within a function die when the function ends.

dumbfun <- function() { x <- 1 }
x <- 2
dumbfun()
x

x only had the value 1 when the function was running. When the function ended, the value of x went back to 2.

Saving objects

save(x, y, file="xandy.RData")
load("xandy.RData")   # load the file, with x and y
save.image()          # save all objects to .RData

When you quit with the q() function, R runs save.image(). Whenever you start R, R runs load(".RData"). Emacs asks you what directory to start R in, because it will load the .RData file from that directory.

Multiple plots

The graphics device can display several small plots at a time instead of one big one. Use the par() (parameter) function:

par(mfrow=c(2,3))   # 2x3 array
for(plato in 1:6) { plot(1:5, pch=plato) }
mtext("The Republic", outer=T, line=-3, cex=2, col="red")

The screen clears when you try and plot the seventh graph.
To reset the plotting window back to normal, do par(mfrow=c(1,1)).

Printing plots

Plots can be printed directly, but it is best to save them to a file. This is particularly useful for essays and reports since these files can be easily read into standard word processors (LaTeX, Word, etc). Try the following:

par(mfrow=c(1,1))
plot(sin(1:1000))
pdf(file="sines.pdf", height=4, width=6)   # give file name and size
plot(sin(1:1000))                          # plot to file
dev.off()                                  # finished writing

Check this works: xpdf sines.pdf.

Functions for probability distributions

There are four functions related to standard distributions in R, prefixed by one of dpqr. For the Poisson distribution, the pmf is p(x) = exp(−λ)λ^x / x! for x = 0, 1, . . ..

dpois(2, lambda=1 )   # d gives the pmf, parameter lambda
exp(-1)*1^2/factorial(2)
ppois(2, lambda=1 )   # p gives the cdf
exp(-1)*(1^0 + 1^1/factorial(1) + 1^2/factorial(2))
qpois(.7, lambda=1 )  # q gives the quantile
rpois(10, lambda=1 )  # r gives random numbers

Plotting mass functions using barplot

There are some examples of this in math104/lab100 exercises.

Exercise 1.8 Make a free hand plot from this code.

par(mfrow=c(1,2))   # sets up subplots
barplot( dpois(0:10, lambda=1 ), names.arg=0:10, ylim=c(0,.4))
barplot( dpois(0:10, lambda=3 ), names.arg=0:10, ylim=c(0,.4))

Plotting the pdf

The probability density function (pdf), introduced in the next section, is the analogue of the pmf for a continuous rv.

Exercise 1.9
The exponential distribution is a good example, and has pdf f(x) = θ exp(−θx) for x > 0.

dexp(2, rate=1 )    # d gives the pdf, parameter rate=theta
exp(-2)
pexp(2, rate=1 )    # p gives the cdf
1-exp(-2)
qexp(.7, rate=1 )   # q gives the quantile
rexp(10, rate=1 )   # r gives random numbers
xval = seq(0,4,len=100)
f = dexp(xval, rate=1 )
F = pexp(xval, rate=1 )
plot( xval, f, type='n') ; grid()
lines(xval, f, col='red')
lines(xval, F, col='blue')

Accessing course datasets

Throughout the session, we will use some data examples, which you can download from the course webpage. Save the file in your working directory, using the filename m105.Rdata. Then in R, type

load("m105.Rdata"); ls()

or

load("YOURPATH/m105.Rdata")

This is needed for each new R session.

1.2 Chapter summary

The basic constructs of the R language are objects and functions, and are introduced by way of example. Examples of objects are vectors, matrices and dataframes. These can contain numbers, characters or mixtures of such. Examples of functions are methods of manipulating these objects, including extracting arithmetic summaries, plots and transformations. Some instances of writing functions are given, together with a brief summary of some programming constructs, such as the for loop and the if statement. Methods specific to plotting pdfs are given, and to reading data from file.

Chapter 2  Continuous random variables

2.1 Review of probability

Math104 introduced the concept of probability and of a discrete random variable. Here we review some of the basics and introduce continuous random variables.

Probability

Probability considers an experiment before it is performed. Probability, P, is a measure of the chance that an event may occur in the experiment. Tossing a coin or conducting an election survey are examples of experiments. An event, A, is a subset of the sample space, Ω, the set of all possible outcomes. Observing a tail in a coin throw or hearing a yes response to a survey question are both events.
Legitimate questions are then: What is the probability of seeing the tail twice in the experiment of tossing two coins? What is the probability of getting no positive responses in the survey?

The Axioms of Probability

Mathematically, probability is a function P which assigns to each event A in the sample space Ω a number P(A) in [0, 1] such that
• Axiom 1: P(A) ≥ 0 for all A ⊆ Ω;
• Axiom 2: P(Ω) = 1;
• Axiom 3: P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅, for any A, B ⊆ Ω.

For mathematicians probability is a function. In everyday English probability is closely associated with words such as chance, uncertainty, randomness, likelihood. If probability considers the experiment before it is performed, statistics considers the experiment after it is performed.

Examples of discrete random variables

The sample space Ω for a discrete rv X is countable, and we usually take it to be a subset of the integers or the non-negative integers.

Exercise 2.1 Give examples related to University, family and sport. Sol:
college membership, Ω = {Bow, Car, . . .};
exam grades, Ω = {A, B, C, D, E};
number of goals in a match, Ω = {0, 1, 2, . . .};
number of children in a family, same.

Probability mass function

Definition: The probability mass function (pmf) of a discrete random variable X is p(x), where p(x) = P(X = x) for x = 0, 1, 2, . . .

Result: (Properties of the pmf). The probability mass function p(x) satisfies
• 0 ≤ p(x) ≤ 1 for all x;
• Σ_{x=0}^{∞} p(x) = 1;
• for any event A, P(X ∈ A) = Σ_{x∈A} p(x).

For example,

P(a < X ≤ b) = P(X = a+1) + P(X = a+2) + · · · + P(X = b)
             = p(a+1) + p(a+2) + · · · + p(b).

Definition: The cumulative distribution function (cdf) is defined as F(x) = P(X ≤ x) for −∞ < x < ∞.

Result: The cdf simplifies to

F(x) = P(X ≤ x) = P(X ≤ int(x)) = Σ_{k=0}^{int(x)} P(X = k),

where int(x) denotes the largest integer smaller than or equal to x, e.g. int(5.2) = 5, int(3) = 3, int(−2.1) = −2. This is a step function, and is not continuous.
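This step-function behaviour can be checked numerically; a small sketch using the Poisson dpqr functions from Chapter 1 (the particular values of x and lambda here are illustrative choices, not from the notes):

```r
# For a discrete rv the cdf is flat between integers: F(x) = F(int(x))
ppois(5.2, lambda=1) == ppois(5, lambda=1)   # TRUE

# The jump in F at an integer k is the pmf value p(k)
ppois(3, lambda=1) - ppois(2, lambda=1)      # approximately dpois(3, lambda=1)
```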
Exercise 2.2 For a random variable X that takes values {0, 1} with probabilities θ, 1 − θ, obtain P(X ≤ x) for all x ≥ 0. Sol: Add graph.

P(X ≤ x) = 0 if x < 0;  θ if 0 ≤ x < 1;  1 if 1 ≤ x.

Exercise 2.3 Use this R code to plot the cdf at the points (−1, 0, 1, 2) when θ = .4. Draw the graph in your notes.

theta=.4
p0 = theta ; p1 = 1-theta
xval = c(-1, 0, 1, 2)
F = c( 0, p0, p0+p1, 1)
plot( xval ,F)
plot( xval ,F, type = 's')   # adds in step function
points(xval ,F)

2.2 Continuous and discrete rvs

A mathematical way of describing a probability experiment and its events is to define a random variable associated with it.

Definition: A random variable X is a function from the sample space Ω to the real numbers R (continuous) or to the integers Z (discrete).

Exercise 2.4 Experiment 1: In a presidential election with two candidates B and C, the possible outcomes are Ω = {B, C}. Define a random variable X that maps from Ω to {0, 1}: X(B) = 0, X(C) = 1. Then the probability of the event {C} is equivalent to P(X = 1).

Exercise 2.5 Experiment 2: A national air quality monitoring system automatically collects measurements of ozone level at designated sites. The possible outcomes are Ω = {x : x ≥ 0}. Define a random variable X to be the value of the measurement, X(x) = x, the identity map. Then the probability that the ozone level falls below a certain level c is given by P(X ≤ c).

Remarks on rvs
• A random variable (rv) X is a function that associates a unique number with each possible outcome of an experiment.
• Associated with each discrete random variable X is a probability mass function (pmf) p(x) from which probabilities of all possible events involving X may be computed.
• Associated with each continuous random variable X is a probability density function (pdf) f(x) from which probabilities of all possible events involving X may be computed.
• Associated with the pmf and with the pdf is the cumulative distribution function (cdf) F(x) that gives particular probabilities.
• A continuous rv is by definition one that has a continuous cdf. The cdf of a discrete rv is a step function.
• Associated with the pmf and the pdf are numerical summaries such as E(X), var(X) and, for continuous rvs, quantiles of F(x).
• Often in scientific investigation X represents the variable of main interest that can be measured or observed.

The cumulative distribution function (cdf)

In order to describe all possible outcomes of an experiment, we focus on an event of the basic form {X ≤ x} for fixed x, where x can take any value.

Exercise 2.6 Express a general event {a < X ≤ b} using the basic form with set operations. Sol: {a < X ≤ b} = {X ≤ b} ∩ {X ≤ a}^c.

If we have a rule of assigning probability to an event of the basic form, then the probability of any event can be determined.

Definition: For any discrete or continuous univariate random variable X, the cumulative distribution function, cdf, F : R → [0, 1], is defined by F(x) = P(X ≤ x). In terms of the original sample space the event {X ≤ x} is interpreted as {ω : X(ω) ≤ x}.

F is defined for −∞ < x < ∞, and we require F(−∞) = 0 and F(∞) = 1 to avoid having to deal with degenerate rvs.

xvals = seq(-6,6,length=100)
F = pnorm(xvals)
plot(xvals,F,type='n') ; grid()
lines(xvals,F,col='red')

It is a result that F is a non-decreasing function.

Exercise 2.7 Prove that for a ≤ b, P(a < X ≤ b) = F(b) − F(a). Sol: From above, {X ≤ b} = {a < X ≤ b} ∪ {X ≤ a}, a union of disjoint events, so

P({X ≤ b}) = P({a < X ≤ b}) + P({X ≤ a}),  or  F(b) = P(a < X ≤ b) + F(a).

Probability for continuous rvs

When the cdf F(x) = P(X ≤ x) is continuous, the outcomes of the experiment have to be measurements on a continuous scale, and the rv is said to be continuous. Examples include ozone level, weight, direction, waiting times, stock price, . . .

Result: (Zero probability). If X is a continuous rv, P(X = x) = 0 for all x.
Proof: P(x − h < X ≤ x + h) = F(x + h) − F(x − h) from above, so if F is continuous

P(X = x) = lim_{h→0} P(x − h < X ≤ x + h) = lim_{h→0} [F(x + h) − F(x − h)] = 0.

Therefore, unlike the discrete case, the probability of an event cannot be reduced to a sum over single events. To describe the probability of an event of a continuous random variable, we need new mathematical tools!

Probability density function

Assume the cdf is differentiable as well as continuous.

Definition: The probability density function, pdf, f(x) of a continuous random variable X is defined by

f(x) = (d/dx) F(x).

Result: (The cdf as a definite integral). The cdf satisfies

F(x) = ∫_{−∞}^{x} f(u) du.

Proof: Standard rules of integral calculus. The cdf is a definite integral of the pdf. (If discrete, the cdf is the definite sum of the pmf.)

Result: The probability density function f(x) satisfies
• f(x) ≥ 0 for all x;
• ∫_{−∞}^{∞} f(x) dx = 1;
• for any event A, P(X ∈ A) = ∫_{x∈A} f(x) dx.

However it may be that f(x) ≥ 1 for some x.

Interpretation of the pdf

Result: (Area under the curve.) Using calculus,

P(a < X ≤ b) = F(b) − F(a)
             = ∫_{−∞}^{b} f(x) dx − ∫_{−∞}^{a} f(x) dx
             = ∫_{a}^{b} f(x) dx,

but this is the area under the curve (x, f(x)) between (a, b]. Hence this area represents the probability that the rv X lies in this interval.

[Figure: example of a pdf; P(a < X ≤ b) is the area under the curve between a and b.]

Note that the density function f(x) itself does NOT represent the probability of any event.

Exercise 2.8 For a random variable X with cumulative distribution function

F(x) = 0 if x < 0;  x if 0 ≤ x ≤ 1;  1 if x > 1,

(a) find P(0.3 < X ≤ 0.5); (b) find the pdf of X; (c) sketch the pdf and shade the area under the curve between 0.3 and 0.5. Sol:
(a) P(0.3 < X ≤ 0.5) = F(0.5) − F(0.3) = 0.5 − 0.3 = 0.2.
(b) f(x) = (d/dx) F(x) = 1 if 0 ≤ x ≤ 1; 0 otherwise.
(c) Sketch.
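The area-under-the-curve result can be verified numerically for the cdf of Exercise 2.8, which is that of a Uniform(0,1) rv; this sketch uses R's integrate function, which is not otherwise covered in these notes:

```r
# F(b) - F(a) from the cdf
punif(0.5) - punif(0.3)                            # 0.2

# the same probability as the area under the pdf between 0.3 and 0.5
integrate(dunif, lower = 0.3, upper = 0.5)$value   # also 0.2, numerically
```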
Simulating rvs

It is desirable to do experiments with simulated data, where we know the true underlying distribution; this is never the case with real life data!

If a random variable X has the Uniform distribution on the interval (0, 1) then the pdf is f(x) = 1 for 0 < x < 1, and 0 otherwise. We write X ∼ Uniform(0, 1). The area under the curve is the probability

P(a < X < b) = ∫_{a}^{b} f(x) dx = b − a,  for 0 < a < b < 1.

[Figure: Uniform(0,1) density; the shaded area represents P(0.2 < X < 0.5).]

Exercise 2.9 Uniform. Simulate 1000 realisations of the rv X ∼ Uniform(0, 1) using runif. Draw the histogram. Plot the pdf on the range (−.5, 1.5) using the function dunif to give 100 points, and find the probability P(0.2 < X < 0.5) using the function punif.

x = runif(1000)   # r=rv unif=Uniform
hist(x, prob=T, breaks=20, col='yellow', xlim=c(-.5,1.5))
range = seq(-.5,1.5,length=100)   # plotting points
f = dunif(range)                  # d=pdf
plot(range, f, type='n')
lines(range, f)
punif(0.5) - punif(0.2)
y = (0.2<x) & (x<0.5)
sum(y)   # the frequency of 1's

Sol: Theoretically P(0.2 < X < 0.5) = 0.3. The relative number of points in (.2, .5) is 282/1000.

2.3 Expected values

Expectation

Definition: If X is a discrete rv with pmf p(x) on {0, 1, · · ·}, then the expected value of X is

µ = E[X] = Σ_{x=0}^{∞} x p(x).

If X is a continuous random variable with pdf f(x) on (−∞, ∞), then the expected value of X is

µ = E[X] = ∫_{−∞}^{∞} x f(x) dx.

We can think of this as an average of the different values that X may take, weighted according to their chance of occurrence.

Expectations of functions of rvs

Consider g(X) where g is a fixed function.

Definition: If X is a discrete rv with probability mass function p(x) on {0, 1, . . .}, then the expected value of g(X) is

E[g(X)] = Σ_{x=0}^{∞} g(x) p(x).

If X is a continuous rv with probability density function f(x) on (−∞, ∞), then the expected value of g(X) is

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx.

Exercise 2.10 Show that E[3] = 3.
Sol: Proof: We regard 3 as a constant function of X:

E[3] = ∫_{−∞}^{∞} 3 f(x) dx       (definition of E)
     = 3 ∫_{−∞}^{∞} f(x) dx       (calculus)
     = 3 [F(∞) − F(−∞)]           (result above)
     = 3 [1 − 0]                  (non-degenerate)
     = 3.

Exercise 2.11 Let X have the pdf f(x) = exp(−x) for all x ≥ 0. The expectation E[X] (with value µ) is

µ = E[X] = ∫_{0}^{∞} x exp(−x) dx
         = [−x exp(−x)]_{0}^{∞} + ∫_{0}^{∞} exp(−x) dx   (integration by parts)
         = 0 + [−exp(−x)]_{0}^{∞} = 0 − (−1) = 1.

Find E[X²] and E[(X − µ)²]. Sol:

E[X²] = ∫_{0}^{∞} x² exp(−x) dx
      = [−x² exp(−x)]_{0}^{∞} + ∫_{0}^{∞} 2x exp(−x) dx
      = 0 + 2 × 1 = 2,

E[(X − µ)²] = ∫_{0}^{∞} (x − 1)² exp(−x) dx
            = ∫_{0}^{∞} (x² − 2x + 1) exp(−x) dx
            = 2 − 2 × 1 + 1 = 1.

Properties of expectation

Result: (Linearity of expectation). If X has expectation E[X] and Y is a linear function of X, Y = aX + b, then Y has expectation E[Y] = a E[X] + b.

Result: More generally,

E[g(X) + h(X)] = E[g(X)] + E[h(X)]   (2.1)
E[c g(X)] = c E[g(X)]                (2.2)
E[aX + b] = a E[X] + b               (2.3)

Note that we proved them in MATH 104 for discrete random variables. Using linear properties of expectation, we may compute E[(X − a)²] by

E[(X − a)²] = E[X² − 2aX + a²]         (algebra)
            = E[X²] − E[2aX] + E[a²]   (by (2.1) twice)
            = E[X²] − 2aE[X] + a²      (by (2.3)).
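Setting a = µ in the expansion above gives the computing formula for E[(X − µ)²]; a one-line derivation, written here in LaTeX:

```latex
E[(X-\mu)^2] = E[X^2] - 2\mu\,E[X] + \mu^2
             = E[X^2] - 2\mu^2 + \mu^2
             = E[X^2] - \mu^2 .
```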
Hence, the standard deviation of Y is σY = |b|σ. Exercise 2.13 Why is the absolute value needed in the above expression? Sol: p Use counterexample: var(−3X) = 9 var(X), taking sqrt give 3 var(X), which is p not the same as −3 var(X). 23 0.4 Probability mass function 0.4 Probability mass function 0.3 µ = 0.83 0.0 0.1 y1 σ = 0.83 0.2 0.2 0.0 0.1 y1 0.3 µ = 2.5 σ = 1.1 0 1 2 3 4 5 0 2 3 x Density Density 4 5 µ = 0.83 1 µ = 2.5 0.3 1 x σ = 1.1 0.1 0.1 0.5 σ = 0.83 0 1 2 3 4 5 x x Means and standard deviations for discrete and continuous rvs. 2.4 Standard continuous distributions We specify several standard distributions in terms of given pdfs: the uniform, the exponential and the normal. Uniform This distribution is used to model variables that can take any value on a fixed interval, when the probability of occurrence does not vary over the interval. Definition: The pdf of a Uniform rv X, distributed on the interval (a, b) is given by: f (x; a, b) = 1 b−a 0 if a < x < b; otherwise, where the parameters are (a, b) and −∞ < a < b < ∞. This is written as X ∼ Uniform(a, b). We often write f (x) = 1/(b − a) for a < x < b, and suppress the fact that (i) there are other arguments, f (x; a, b) and (ii) f (x) = 0 when x < a or x > b. 24 1/(b−a) 0 P (a < X ≤ x0 ) x0 a b x pdf for Uniform(a, b) random variable. Shaded area represents P (a < X ≤ x0 ). Result: the expected value and variance of X ∼ Uniform(a, b) are E[X] = a+b , 2 var[X] = (b − a)2 . 12 Proof: E[X] = = Z xf (x) dx −∞ Z b a = ∞ x1/(b − a) dx b+a 1 [x2 /2]ba = . b−a 2 Similar calculations work for the variance. Exercise 2.14 Evaluate the pdf and the cdf of a Uniform rv with parameters a = −2, b = 2, at x = .5 and then plot on an interval. There is one ambiguity in the plot: identify. 
dunif(0.5, min=-2, max=2)   # pdf of Unif(-2,2) at x=0.5: f(0.5)=0.25
punif(0.5, min=-2, max=2)   # cdf of Unif(-2,2) at x=0.5: F(0.5)=0.625
xval = seq(-2.5, 2.5, length=101)
f = dunif(xval, -2, 2)      # pdf values
F = punif(xval, -2, 2)      # cdf values
plot(xval, F, type='n')
lines(xval, f, col='blue')
lines(xval, F, col='red')

Sol: The vertical lines on the pdf should not be there.

Exponential

This distribution is often used to model variables that are the times until specific events happen, when the events occur at random at a given rate over time.

Definition: The pdf of an Exponential rv X is

f(x; θ) = θ exp(−θx)    for x > 0,
        = 0             otherwise,

where the rate parameter θ > 0. This is written as X ∼ Exponential(θ) and θ ∈ (0, ∞).

Result: The cdf of X ∼ Exponential(θ) is F(x) = 1 − exp(−θx) for x > 0, and 0 otherwise.

Proof: For x > 0,

F(x) = ∫_{−∞}^x f(u) du
     = ∫_0^x f(u) du
     = ∫_0^x θ exp(−θu) du
     = [−exp(−θu)]_0^x
     = 1 − exp(−θx).

Result:

E[X] = 1/θ,    var(X) = 1/θ².

Proof: Seen above.

The parameter θ is known as the rate parameter because if X is the time until the next event occurs, then θ = 1/E[X] is the rate of occurrence.

Exercise 2.15 The value of θ influences the probability of different outcomes. How is the shape of the function related to the parameter θ? Which pdf in the figure has the lowest tail probability P(X > 10)?

xvals = seq(-.2, 6, length=100)
f1 = dexp(xvals, rate=1)
f2 = dexp(xvals, rate=2)
f3 = dexp(xvals, rate=1/2)
plot(xvals, f2, type='n') ; grid()
lines(xvals, f1)
lines(xvals, f2, col='red')
lines(xvals, f3, col='blue') ; grid()

Sol: As f(0) = θ, the highest curve at 0 is θ = 2 (its pdf exceeds 1) and the lowest curve at 0 is θ = 0.5. The exponential decay of the function is quicker for larger θ, so the smallest tail probability is when θ = 2.

Exercise 2.16 Evaluate the pdf and the cdf of an Exponential distribution and plot; give an eyeball estimate of P(X < 1).
xval = seq(-0.2, 4, length=100)
f = dexp(xval, rate=2)   # pdf
F = pexp(xval, rate=2)   # cdf
plot(xval, f, type='n', ylab='') ; grid()
lines(xval, f, col='red')
lines(xval, F, col='blue')

Sol: The pdf starts at (0, 2): a pdf is not a probability. From the cdf, P(X < 1) is about 0.9.

pexp(1, rate=2)   # 0.86

Exercise 2.17 Suppose that the time until the first goal is scored can be modelled by an Exponential distribution with rate parameter θ = 2/3 per hour. Write down the cdf. Find the probability that the time until the goal occurs is (i) more than 30 minutes away, (ii) between 30 and 50 minutes.

Sol: Let X be the random variable of the waiting time, in hours. Then X ∼ Exponential(2/3) and F(x) = 1 − exp(−(2/3)x).

(i) P(X > 1/2) = 1 − F(1/2) = exp(−(2/3) · 1/2) = 0.7165,
(ii) P(1/2 < X < 5/6) = F(5/6) − F(1/2) = exp(−(2/3) · 1/2) − exp(−(2/3) · 5/6) = 0.1428,

assuming no half time.

Normal distribution: background

(quoted from gqview weblib/Gauss.html)

The normal distribution was introduced by the French mathematician Abraham De Moivre in 1733. De Moivre used this distribution to approximate probabilities of winning in various games of chance involving coin tossing. It was later used by the German mathematician Carl Gauss to predict the location of astronomical bodies and became known as the Gaussian distribution. In the late nineteenth century statisticians started to believe that most data sets would have histograms with the Gaussian bell-shaped form, and that all normal data sets would follow this form, and so the curve came to be known as the normal curve.

This distribution is also known as the Gaussian distribution, after the German mathematician Carl Friedrich Gauss. The density was pictured on the German 10 mark note bearing Gauss's image!

Normal distribution

Definition: The pdf of a Normal random variable X is

f(x; µ, σ) = (1/(√(2π) σ)) exp( −(1/2) ((x − µ)/σ)² ),

where −∞ < x < ∞, and the parameters −∞ < µ < ∞ and 0 < σ. This is written as X ∼ N(µ, σ²) and θ = (µ, σ) ∈ Θ = (−∞, ∞) × (0, ∞).
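As a quick numerical sanity check of this definition (a sketch, not part of the notes), the pdf written above should integrate to one and have mean µ; R's integrate function can confirm this for, say, µ = 2 and σ = 2:

```r
# numerical check of the N(2, 4) density: total probability and mean
mu <- 2 ; sigma <- 2
f <- function(x) 1/(sqrt(2*pi)*sigma) * exp(-0.5*((x - mu)/sigma)^2)
integrate(f, -Inf, Inf)$value                    # total probability, about 1
integrate(function(x) x*f(x), -Inf, Inf)$value   # E[X], about 2
```

The same check works for any µ and σ, which is a useful habit whenever a density is typed in by hand.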
Result:

E[X] = µ,    var(X) = σ².

Proof: Too hard for math105.

The Normal distribution plays an important role in a result that is key to statistics, known as the central limit theorem. This theorem, discussed in Math230 and Math313, gives a theoretical basis to the empirical observation that many random phenomena seem to follow a Normal distribution.

Usually the mean parameter µ and the scale parameter σ are unknown, although sometimes it is assumed that σ is known, as this simplifies things considerably. These parameters are crucial in determining probabilities. Consider the figure

[Figure: pdfs for Normal(µ, σ²) random variables where µ = 0 and σ = 0.5, 1, 1.5.]

Exercise 2.18 Which one has the higher probability P(|X| > 3)?

xvals = seq(-4, 4, length=100)
f1 = dnorm(xvals, sd=1)
f2 = dnorm(xvals, sd=2)
f3 = dnorm(xvals, sd=1/2)
plot(xvals, f3, type='n') ; grid()
lines(xvals, f1)
lines(xvals, f2, col='red')
lines(xvals, f3, col='blue')

Sol: The larger σ, the more spread. So σ = 1.5 has the largest probability P(|X| > 3) and σ = 0.5 has the smallest.

Exercise 2.19 Complete the code to establish that the dnorm function gives the same result as direct calculation of the pdf when X ∼ N(2, 4).

xvals = seq(-4, 8, length=11)
pdf = dnorm(xvals, mean=2, sd=2)
f = 1/(sqrt(2*pi)*2)*
sum(f != pdf)   # 0 bingo

Sol: f = 1/(sqrt(2*pi)*2)*exp(-0.5*((xvals-2)/2)^2)

Normal cdf and quantiles

The normal cdf is

F(x) = ∫_{−∞}^x f(u) du = ∫_{−∞}^x (1/√(2πσ²)) exp( −(u − µ)²/(2σ²) ) du.

This does not have a closed form expression, so numerical evaluation is required if we want to obtain probabilities of the form P(X ≤ x), or quantiles.

Note that R functions for the Normal use the standard deviation σ, not the variance σ².
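To see why the σ-versus-σ² distinction matters in practice, compare these two calls (a small illustration, not in the notes); only the first corresponds to X ∼ N(2, 5):

```r
# X ~ N(2, 5) has variance 5, so the sd argument must be sqrt(5)
pnorm(0, mean = 2, sd = sqrt(5))   # P(X <= 0) = 0.1855, the intended model
pnorm(0, mean = 2, sd = 5)         # P(X <= 0) = 0.3446, but this is N(2, 25)
```

Passing the variance where R expects the standard deviation silently gives probabilities for a much more spread-out distribution.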
Exercise 2.20 Write down the numerical values of P(X ≤ x) corresponding to

pnorm(0, mean=2, sd=sqrt(5))   # X~N(2,5), P(X <= 0) = 0.1855467
pnorm(0, mean=2, sd=sqrt(3))   # X~N(2,3), P(X <= 0) = 0.1241065
1 - pnorm(-2, mean=0, sd=2)    # X~N(0,4), P(X > -2) = 0.8413447

Exercise 2.21 A normal distribution is proposed to model the variation in height of women, with parameters µ = 160 and σ² = 25, measured in cm. Find the proportion of tall women, defined as over 175cm tall, in terms of an integral.

Sol: Let H be the random variable of a woman's height; then H ∼ N(160, 25). So

P(H > 175) = ∫_{175}^∞ (1/√(2π · 25)) exp( −(x − 160)²/(2 · 25) ) dx.

In the above example we have expressed the proportion in terms of an integral and as the number of deviations from the mean. The integral is impossible to calculate analytically, so numerical evaluation is required to obtain probabilities or quantiles.

Standardization of the random variable

It is useful to express such probabilities in terms of a standardized random variable, with µ = 0 and σ = 1.

Result: If X ∼ N(µ, σ²) then

Z = (X − µ)/σ ∼ N(0, 1),

and conversely, if Z ∼ N(0, 1), then X = µ + σZ ∼ N(µ, σ²).

Proof: The formal proof will be given in math230; here it is sufficient to note that

E[Z] = 0,    var[Z] = 1.

Definition: A random variable Z is said to have a standard normal distribution, with mean 0 and standard deviation 1, if its pdf is given by

f(z) = (1/√(2π)) exp(−z²/2),

where −∞ < z < ∞; this is denoted by Z ∼ N(0, 1).

The cdf, the area under the curve, of the standard normal variable Z is given by

Φ(z) = P(Z ≤ z) = ∫_{−∞}^z (1/√(2π)) exp(−x²/2) dx.

Values of Φ(z) are obtained from a table of standard normal probabilities or from R:

for (z in c(-3.00,-2.33,-1.67,-1.00,-0.33,0.33,1.00,1.67,2.33,3.00)){
print( pnorm(z) )
}

z      -3.00  -2.33  -1.67  -1.00  -0.33   0.33   1.00   1.67   2.33   3.00
Φ(z)  0.0013 0.0098 0.0478 0.1587 0.3694 0.6306 0.8413 0.9522 0.9902 0.9987

Exercise 2.22 Repeat the previous example to illustrate the standardization procedure:

P(H > 175) = P( (H − 160)/5 > (175 − 160)/5 )    (cunning)
           = P(Z > 3)
           = 1 − P(Z ≤ 3) = 1 − Φ(3)
           = 1 − 0.9987 = 0.0013                 (from pnorm(3)).

The figure illustrates coverage properties of a Normal distribution.

[Figure: a Normal density marked at µ − 3σ, µ − 2σ, µ − σ, µ, µ + σ, µ + 2σ, µ + 3σ, with]

P(µ − σ < X < µ + σ) = 0.683
P(µ − 2σ < X < µ + 2σ) = 0.954
P(µ − 3σ < X < µ + 3σ) = 0.997

2.5 Quantiles and the cdf

Often interest is in the values of a continuous random variable which are not exceeded with a given probability, e.g. the income of the lower 10% of income tax payers or the score of the top 5% of students.

Quantiles

Let X be a random variable and p any value such that 0 ≤ p ≤ 1.

Definition: The pth quantile of the distribution of X is the value xp that satisfies

P(X ≤ xp) = p, or equivalently xp = F⁻¹(p),

where F⁻¹ is the inverse function of F. When p = 0.5, the quantile x0.5 is called the median.

When the cdf F is continuous the inverse function is uniquely defined. [Life is more problematic with step functions.]

p = 0.6
xp = qnorm(p, mean=2, sd=1)   # 2.2533
xvals = seq(-2, 5, length=100)
F = pnorm(xvals, mean=2, sd=1)
plot(xvals, F, type='n') ; grid()
lines(xvals, F)
abline(v=0, lty=3)
lines(c(0, xp), c(p, p), col='red')
lines(c(xp, xp), c(0, p), col='red')

[Figure: cumulative distribution function F(x), with a horizontal line at height p meeting the curve above xp.]

Quartiles

The quartiles of a distribution are the quantiles at which we can cut the distribution into four equally probable slices: (x0.25, x0.5, x0.75).
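In R the quartiles come straight from the quantile function; for instance (an illustration, not in the notes), for X ∼ N(2, 1):

```r
# quartiles of N(2, 1): three cuts giving four equally probable slices
qnorm(c(0.25, 0.5, 0.75), mean = 2, sd = 1)
# the median of a Normal is its mean, and the quartiles sit at
# mu -/+ 0.6745*sigma, since qnorm(0.75) = 0.6745 for the standard Normal
pnorm(qnorm(0.25, mean = 2, sd = 1), mean = 2, sd = 1)   # recovers 0.25
```

The last line illustrates that pnorm and qnorm are inverses of each other, exactly as P(X ≤ xp) = p and xp = F⁻¹(p).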
[Figure: Quartiles (x0.25, x0.5, x0.75) shown on the cdf, at heights 0.25, 0.5, 0.75, and on the pdf, cutting the density into four equally probable slices.]

Exercise 2.23 Suppose X ∼ Uniform(a, b). Find the cdf, sketch its graph, and give a formula for the p-th quantile xp.

a = -2 ; b = 4
xvals = seq(a-.5, b+.5, length=100)
F = punif(xvals, min=a, max=b)
plot(xvals, F, type='n') ; grid()
lines(xvals, F)
xmedian = qunif(0.5, min=a, max=b)

Sol:

F(x) = ∫_a^x 1/(b − a) du = (x − a)/(b − a)    for a ≤ x ≤ b.

So

xp = F⁻¹(p)        (by definition)
   = a + p(b − a).

Exercise 2.24 Find the mean and the median of X ∼ Uniform(a, b) and compare.

Sol:

E(X) = ∫_a^b x/(b − a) dx = (a + b)/2,
x0.5 = a + 0.5(b − a) = (a + b)/2,

the same as the mean.

Exercise 2.25 Suppose X ∼ Exponential(θ). Derive the cdf from the pdf, and find the median. Verify that the mean of X is 1/θ using this calculation, and compare to the median.

E(X) = ∫_0^∞ u f(u) du                              (definition of expectation)
     = [−u exp(−θu)]_0^∞ + ∫_0^∞ exp(−θu) du        (integration by parts)
     = 0 − 0 + (1/θ)[−exp(−θu)]_0^∞
     = 1/θ.

Sol: Evaluate the cdf:

F(x) = ∫_0^x f(u) du             (property of F)
     = ∫_0^x θ exp(−θu) du
     = [−exp(−θu)]_0^x
     = 1 − exp(−θx)              for x > 0.

For x < 0, F(x) = 0.

Quantiles: solving F(xp) = p gives xp = θ⁻¹ log((1 − p)⁻¹). For the median, p = 0.5, so the median is x0.5 = (1/θ) log 2.

Comparison: µ = 1/θ > (1/θ) log 2 = x0.5. The distribution is not symmetric, so the mean and the median are not the same. The median is smaller here because the distribution is skewed to the right: the long tail of larger values pulls the mean above the median.

Exercise 2.26 Sample 200 realisations of X ∼ N(2, 4) and plot a scaled histogram. Overlay the theoretical pdf on this diagram. Overplot the empirical and theoretical cdfs. Calculate the 0.25, 0.5, and 0.75 sample quantiles and compare to the theoretical values. Make a brief record of these results in your notes. The empirical cdf and sample quantiles are discussed in the next chapter.
par(mfrow=c(1,2))
x = rnorm(200, mean=2, sd=2)                # note sd=2
hist(x, prob=TRUE, breaks=20, col='yellow') # bell shaped or what
# overlay the true pdf to make comparison:
a = -5 ; b = 8                              # trial and error
xvals = seq(a, b, length=101)
pdf = dnorm(xvals, mean=2, sd=2)
lines(xvals, pdf, col='red')                # not bad
# now overlay the true cdf on the empirical cdf
plot(ecdf(x), pch='.')
F = pnorm(xvals, mean=2, sd=2)
lines(xvals, F, col='blue')                 # again good
quantile(x)                                 # sample quantiles
qnorm(c(0.25,0.5,0.75), mean=2, sd=2)       # close
min(x)                                      # smallest observation

Exercise 2.27 Complete the missing parts of the code.

runif(50, min=0, max=1)      # 50 obs Uniform(0, )
rnorm(20,  =0, sd=5)         # 20 obs Normal(0, )
rexp(100, rate=0.5)          # 100 obs Exponential(0.5)
rpois(200,  =3)              # 200 obs Poisson(3)
rbinom(35, size=6, prob=0.2) # 35 obs Binomial( , 0.2)
rgeom(150, prob=1-0.2)       # 150 obs Geometric(0.2)

The reason for the 1 − 0.2 in the Geometric case is that, unfortunately, in R the probability specified is the success probability, whereas the parameter θ in the pmf of a Geometric random variable is the failure probability.

Transformations of rvs

In certain examples it is easy to obtain the cdf of a transformed rv Y = g(X) by a change of variable.

Exercise 2.28 Show that if X ∼ Uniform(0, 1) and Y = −log(X), then Y ∼ Exp(1).

Sol: Proof: We need to find and identify the cdf of Y.

P(Y < y) = P(−log(X) < y)          (the key to this e.g.)
         = P(log(X) > −y)
         = P(X > exp(−y))          (monotonicity)
         = 1 − P(X ≤ exp(−y))
         = 1 − exp(−y)             (as X ∼ Uniform(0, 1)).

But this is the cdf of Y ∼ Exp(1).

Exercise 2.29 Run this code to empirically verify that if X ∼ Uniform(0, 1) and Y = −log(X), then Y ∼ Exp(1).

x = runif(10000)
y = -log(x)
hist(y, prob=T, breaks=40, col='yellow')
yvals = seq(-.1, 4, length=200)
f = dexp(yvals)
lines(yvals, f, col='red')

2.6 Chapter summary

The Chapter starts with a review of probability and its axioms, and then reviews discrete random variables, the pmf, expectation and application to standard distributions, all material included in math104.
The math105 course continues probability theory to cover the extension to continuous rvs. Their properties are determined by the cumulative distribution function (cdf), which in turn leads to the definition of the probability density function (pdf). Pmfs and pdfs are compared and contrasted.

Expectation, and its notion of a weighted average, is generalised to cover the continuous case and its properties are discussed. Important definitions for the mean, variance and standard deviation are given in terms of expectation. Standard continuous distributions, including the Uniform, Exponential and Normal distributions, are described. Quantiles are those values of the rv that cover a given probability, and are relatively easy to define for a continuous rv.

All these probabilistic concepts are illustrated throughout in the R language, with special emphasis on plotting and simulation.

Chapter 3

Statistics and exploratory data analysis

In our everyday lives, we are surrounded by uncertainty due to random variation. We often make decisions based on incomplete information. Mostly, we can cope with this level of uncertainty, but in situations where the decision is of particular importance, it can be informative to understand this uncertainty in greater detail, to aid the decision making. Statistics is unique in that it allows us to make formal statements quantifying uncertainty, and this provides a framework for decision making when faced with uncertainty.

3.1 Uncertainty

Sterling's slide has continued, with the pound falling close to $1.37... The pound also weakened against the euro, with the single currency now worth 94 pence.

If I am planning to make a trip abroad in summer, is it better to change the currency now than later?

Is there evidence of global warming, or is it simply random fluctuation? Would the answer affect your way of living?

Decision making

We follow many different routes, rational or irrational, to find an answer and to cope with such situations.
Often it is useful to obtain some evidence in order to decide what the answer should be. What sort of evidence would be useful in answering such questions?

For the UK economy, we may look at exchange rates over the past few months to figure out a trend, if any; we may want to include other factors that may explain the trend, or study similar periods in the past. To determine such factors or variables we may want to speak to economists.

For global warming, we may want to study the pattern in temperature over the past years in England, Europe or around the world. There may be other variables of interest, for example an increasing number of floods or storms. Discussion with climatologists or hydrologists would be helpful in deciding which variables should be considered.

What is data?

In statistical studies data refers to the information that is collected from experiments, surveys or observational studies. For example, by themselves 4, 3.5, 3.2 are not data but only a sequence of numbers. However, if we know these numbers are measurements of new-born babies' weights, then these numbers become data. Numbers require metadata to become data.

Probability and statistics

In Probability, we consider an experiment before it is performed. The measurements to be observed are modelled as random variables. We may deduce the probability of various outcomes of the experiment in terms of certain basic parameters.

In Statistics, we have to infer things about the values of the parameters from the observed outcomes, the realisations, of an experiment after it has been performed.

Is Friday 13th bad for your health?

Consider the following claim: I've heard that Friday 13th is unlucky; am I more likely to be involved in a car accident if I go out on Friday 13th than on any other day?

What kind of evidence would be helpful? Perhaps hospital admissions.
Suppose that data is available of emergency admissions to hospitals in the Southwest Thames region due to transport accidents, on six Friday 13ths, and corresponding emergency admissions due to transport accidents for the Friday 6th immediately before each Friday 13th:

Number              1   2   3   4   5   6
Accidents on 6th    9   6  11  11   3   5
Accidents on 13th  13  12  14  10   4  12

Does the data support the claim? Compare the number of accidents by finding the average (the unweighted mean) number of accidents on both days:

Average number of accidents = Total number of accidents / Total number of days,

so that

x̄6th = (9 + 6 + 11 + 11 + 3 + 5)/6 = 7.5

and

x̄13th = (13 + 12 + 14 + 10 + 4 + 12)/6 = 10.83.

Exercise 3.1 Referring to the Friday 13th example,

• Why compare, instead of focusing on accidents only on 13th Fridays? Need a baseline.
• Why have we chosen Friday 6th as the comparison day? Compare like with like.
• There are more accidents on Friday 13th than on Friday 6th, therefore I am more likely to be involved in a car accident if I go out on Friday 13th. Tentatively: yes.

What is this course about

• To illustrate scientific contexts where statistical issues may arise;
• to demonstrate where statistics can be useful, by showing the sort of questions it can answer, and the situations in which it is used;
• to understand sampling variation and quantify uncertainty;
• to introduce various exploratory tools and summary statistics for data analysis;
• to introduce specific techniques from statistical modelling and inference; and
• to apply all this to real data. Wow, and this as well!!

Sources of variation

Exercise 3.2 Toss a coin 10 times. How many heads are expected? Record the outcomes:

H, H, H, T, T, H, H, H, H, T

• Are you surprised that you didn't have exactly 5, half of the number of trials? Has the result changed your opinion about the coin?
• Are you surprised that your neighbours didn't have exactly the same number of heads as you did?
• Repeat the experiment another two times, on two further coins, and record the number of heads. Did you get the same number of heads each time?
• What would happen if you tossed 20 times?

You have witnessed sampling variation.

Exercise 3.3 Think back to the Friday 13th example. Is the higher chance of being in a car accident on Friday 13th due to sampling variation?

Sol: Possibly: but nearly all Friday 13ths had elevated accidents. The variation within Friday 13ths is not as great as that between Friday 6ths and Friday 13ths. The ultimate test: collect new data on Friday 13th dates. Later we introduce a statistical framework to evaluate how much evidence there is for a true difference.

Population and sample

In the Friday 13th example, our interest is not limited to those available dates. Ideally we consider all the possible accidents occurring on all Friday 13ths. We call the complete group of units, or people, under study the population.

• Population: the set of all individuals or units of interest, exactly defined.
• Sample: a subset of the population, chosen to be representative of the population.

Statistical inference is learning about the population through the behaviour of a sample.

Where is statistics used?

Statistics is used in a surprisingly diverse range of areas. Here is a small selection of the fields to which statistics contributes.
Environmental monitoring: for the setting of regulatory standards and in deciding whether these are being met;

Engineering: to gauge the quality of products used in manufacturing and building;

Agriculture: to understand field trials of new varieties and choose the crops that will grow best in particular conditions;

Economics: to describe unemployment and inflation, which are used by the government and by business to decide economic policies and form financial strategies;

Finance: risk management, and prediction of the future behaviour of the markets;

Pharmaceutical industry: to judge the clinical effectiveness and safety of new drugs before they can be licensed;

Insurance: in setting premium sizes, to reflect the underlying risk of the events that are being insured against;

Medicine: to assess the reliability of clinical trials reported in journals, and choose the most effective treatment for patients;

Ecology: to monitor population sizes and to model interactions between different species;

Business: market research is used to plan sales strategies.

The Sally Clark Case

Statistics has played a key role in many topical news issues, including the controversial court case of Sally Clark. The case is a famous example of the misuse (or misunderstanding) of statistics contributing to a miscarriage of justice. The Royal Statistical Society were so concerned that they wrote a press release, highlighting the statistical mistakes made.

Sally Clark was a mother convicted of murder when two of her babies died of 'Cot Death', the name given to the unexplained death of a young infant (SIDS). The paediatrician Sir Roy Meadow, acting as an expert witness for the prosecution in the case, famously claimed that the odds of two unexplained deaths in the same family was 1 in 73 million.

Where does this figure come from?

Exercise 3.4 The odds of a single unexplained death in an affluent, non-smoking family is estimated as 1 in 8500.
The figure 73 million comes from multiplying these odds by themselves: 8500 × 8500 ≈ 73 million. Is this a reasonable calculation?

Sol: It is only appropriate to multiply these odds together if the second death is independent of the first. This is not reasonable since the children have the same DNA.

A second problem

A second problem is known as the 'prosecutor's fallacy', which goes as follows: The chance of two unexplained deaths in the same family occurring by chance is 1 in 73 million. Therefore, the chance of Sally Clark being innocent is 1 in 73 million also.

What is wrong with this argument? The following analogy will help.

Exercise 3.5 The idea behind the British National Lottery is that 49 balls are placed in a machine, and 6 of them are drawn. Before the draw takes place, a punter pays 1 pound to place a guess on which six balls will be drawn. There is a prize of one million pounds available to a correct guess, but the chance of getting it right is 1 in 14 million. You decide to play, and, amazingly, all six of your numbers come up! You travel to the headquarters of the national lottery to claim your winnings, but instead ...

Sol: ... you are arrested, accused of cheating! The prosecuting lawyer argues: "The chance of getting all six balls correct by chance is 1 in 14 million. Therefore, the chance of the defendant being innocent is 1 in 14 million also."

Exercise 3.6 Formulate the Bayes calculation of the probability of innocence. Here is the code.

pb.a = 1/(14*10^6)   # P(B|A)
pb.acomp = 0.99      # P(B|A^c)
pa = 1-1/(10^6)      # P(A)
pa.b = pb.a*pa/( pb.a*pa + pb.acomp*(1-pa) )   # 0.0672

Sol: Let A = "innocence" and B = "six balls correct". We want P(A|B).

P(A|B) = P(B|A)P(A)/P(B)                                by Bayes,
       = P(B|A)P(A) / ( P(B|A)P(A) + P(B|A^c)P(A^c) )   by the theorem of total probability.

For the calculations, guesstimate:

P(B|A) = 1/(14 × 10⁶)
P(B|A^c) = 0.99
P(A) = 1 − 1/10⁶    prior probability of innocence.

Posterior probability P(A|B) ≈ 0.06729469, i.e. about 1 in 15.
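The posterior above leans heavily on the guesstimated inputs. Wrapping the calculation in a function (a sketch, not in the notes; the function name posterior is ours) shows how the answer moves with the prior P(A):

```r
# posterior probability of innocence P(A|B) as a function of the prior P(A)
posterior <- function(pa, pb.a = 1/(14*10^6), pb.acomp = 0.99) {
  pb.a * pa / (pb.a * pa + pb.acomp * (1 - pa))
}
posterior(1 - 1/10^6)   # the exercise's prior: about 0.067
posterior(1 - 1/10^3)   # a much weaker prior of innocence: about 7e-05
```

The conclusion is sensitive to the prior probability of innocence, which is exactly why such guesstimates need to be stated and defended.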
Data

In experiments and surveys certain specific attributes are measured on the units. These are called variables. For example, in the Friday 13th data, the unit is a Friday 13th, and the variable we measure is the number of accidents. The variable is a random variable if it is determined at random or by some random process. To apply probability theory we convert the measurements to numerical scales.

Types of data

Most random variables fall into the following two categories, depending on the characteristic and how it is measured:

Discrete: variables taking values in countable sets: e.g. gender, eye colour, college membership, exam grades (A, B, C, D, E), number of goals in a match, children in a family, ...

Continuous: variables taking values on some interval of the real line: e.g. height, weight, direction, time, ...

Sample survey data

We see that some data are useful in carrying out our investigation. But how do we choose data? What are the important considerations? Is there any limit to the amount of evidence that can be obtained from some given data? Think back to the data on Friday 13th: could we use it to decide whether car accidents were especially common on Fridays?

So if the evidence available is limited by the data we have, it makes sense that we should think very carefully about how we collect the data. If you are not collecting the data yourself, it is always important to understand how the data was collected, so that you are aware of any limitations that this may place on your analysis. To illustrate the idea, we begin with an extreme example.

Exercise 3.7 Student study: There is interest in estimating how many hours students spend studying every week. So you design a survey and find participants. Thinking to yourself where a good place would be to find students to fill in your survey, you have a brilliant idea ... the Library! You sit outside and stop students as they leave to fill in your questionnaire.
After some time you have enough results for analysis. You find that students spend, on average, 30 hours a week studying.

What is wrong with the way in which the study has been carried out?

• What is the population of interest for the survey? All UG students at UoL 2010.
• What property should the sample have? Be representative.

If you had stopped students outside the University Bar instead of the library, would you have got similar results? No. Can you think of a better way to collect data for your survey? Yes.

For a sample to be representative of the population requires a rigorous definition of the population. Other populations for this survey could be full time students, maths students, female students, ..., students in 1964, ... For what population is sampling by stopping people outside the library appropriate? Library users!

A representative sample reflects the characteristics and nature of the population. If the sample is not representative, we usually introduce a systematic error, called bias, into the calculation.

Exercise 3.8 Beach comber: A measure of how polluted British beaches are is the volume of residual plastic found on the beach. A survey is proposed to estimate this. Write down the issues that need to be addressed.

Sol: Issues:

How large is a large sample

The term n usually denotes the number of units or subjects in the sample. There are practical as well as statistical considerations in choosing the size of the sample. On the practical side, financial constraints may mean a sample has to be smaller than n = 1000. Some statistical considerations will be discussed later.

Random sample

The widely accepted method to obtain a representative sample of the population is by selecting a random sample. Statisticians like these.

A simple random sample of size n from a population is one in which each possible sample of that size has the same chance of being selected.
One method to ensure random sampling is to write the name of every member of the population on a slip of paper, place these slips into a hat, then draw out the required number for the sample. A more practical method has been developed using the computer, called a random number generator.

For an example of a pre-election poll, we may need n = 1000 random numbers between 1 and 40 million, for a sample of size n = 1000 out of the 40 million eligible voters in the UK. If we have all the voters written in a list, we can pick out the selected subjects for our sample.

sample(1:10, 4)   # 3 7 6 2

Other kinds of sampling

It is not always feasible to carry out sampling in a truly random fashion. It can be very expensive to contact 1000 randomly chosen people in a pre-election poll: geographically dispersed, difficult to reach, long delays. We may have to resort to a sampling method that is not random, for practical reasons. Provided we are careful, we can minimize the bias that is caused.

Exercise 3.9 Suppose we go to the city centre, stop passers-by in the street and ask who they are going to vote for in the next election. This is sometimes known as convenience sampling. An improved version is known as quota sampling. What kinds of bias may be introduced? Shoppers are not representative of voters. Does increasing the size of a sample decrease the bias?

Exercise 3.10 For the student study hours example, one survey collects 1000 responses, with convenience sampling, with interviews made outside the library, stopping random students. A second survey collects only 50 responses, with random sampling from a list of the entire student population of the University. Which study should we believe more? It depends on the population of interest.

It is almost always better to have a small, representative sample than a large biased sample. From here on we assume that the sample is random, and we study properties of simple random samples.
This greatly simplifies our mathematical treatment of the problem and provides insights into important statistical ideas used.

3.2 Exploratory data analysis

We introduced some examples of discrete and continuous random variables and studied their properties. If we knew the exact analytical form of the underlying distribution of interest (i.e. the population), there would be no need to collect data nor to make a statistical analysis. In reality this is rarely the case, especially at the beginning of an investigation, and even if there is a conjectured model for the data, we always need to check if it is consistent with the data.

Data and variability

Data is measured information and is fixed. But in representing the population it also carries uncertainty. This may be due to: inherent random variability in the characteristic of interest, e.g. a coin throw; measurement variability from one day to another, e.g. weight; sampling variability, e.g. one individual is selected into the sample, another is not.

In mathematical terms, in all three cases the characteristic being measured is represented by a random variable: e.g. X = today's weight of an individual, e.g. X = number of plastic bottles on the selected beach, e.g. X records 1 if the throw is a head.

Random variables and realisations

There is an important difference between a random variable and its realisation, the observation.

A random variable is always written in upper case and is a function with an associated probability distribution (pmf/pdf); e.g. X = Ozone level. An observation on a random variable is written in lower case and is just a number; e.g. x = observed value of Ozone.

A data set of size n may be considered in two ways:

X1, ..., Xn    random variables,
x1, ..., xn    given realizations.

The first is needed for probability and statistical modelling. The second is needed for exploratory data analysis.

Data analysis

The first stage in any analysis is to get to know the problem and the data.
The first stages of data analysis usually involve a variety of graphical procedures to visualise the data, and the calculation of a few simple summary numbers, or summary statistics, that capture key features of the data.

The variability in the data is a reflection of, and an approximation to, the true underlying distribution and its features. We need to care how good the approximation is.

Role of exploratory data analysis

There are three essential roles:

Finding errors and anomalies: missing data, outliers, changes of scale, ... However carefully data have been collected, it is always possible that they contain errors. Early detection of these errors can save time and confusion later on. These may be due to recording or transcription error or broken equipment, among other causes.

Suggesting subsequent analyses: plots of data and summary statistics give information on the location, scale and shape of the distribution, and on relationships between variables. This builds up a feeling for the structure of the data, which gives insight into subsequent statistical modelling.

Augmenting understanding of the applied problem: exploratory tools sharpen the scientific questions addressed. Context and scientific rationale for the analysis are paramount.

3.3 Examples with associated data sets

Each of these real life problems has an associated data set which we explore, to show the whole process involved in a detailed statistical analysis, from conception through to conclusion.
Offshore waves are induced by meteorological conditions and, though complex, they can be summarised by their height and their period. Here we concentrate on the excess heights of these waves over a threshold.

[Map of the Newlyn area: Eastings against Northings.]

The specific problem for the engineers is: given a small probability of exceedance, what is the wave height that is exceeded with that probability? How accurate is this estimate?

Diseased trees
In an ecological study of diseased trees, trees along transects through a plantation were examined and assessed as diseased or healthy. Data collection goes as follows. First a diseased tree is found. Then the number of neighbouring trees in an unbroken run of diseased trees along the transect is recorded. Ecologists are interested in the following: how does the disease spread between trees, and what is the probability that trees are infected by the disease? The observations made on a total of 109 runs of diseased trees are recorded in the table below. We use this data set to show the benefits of collecting more data. To do this we have broken the data down into the first 50 observations and the whole data set; we refer to these as the partial and full data sets respectively.

Run length                                 0   1   2   3   4   5
Number of runs in first 50 observations   31  16   2   0   1   0
Number of runs in all 109 observations    71  28   5   2   2   1

Urban and rural ozone
In the UK the Department for Environment, Food & Rural Affairs operates a national air quality monitoring system, with a network of sites at which air quality measurements are taken automatically. These measurements are used to summarise current air pollution levels, for forecasting of future levels, and to provide data for scientific research into the atmospheric processes behind the pollution. We look at ground-level ozone (O3).
Ozone: the background
This pollutant is not emitted directly into the atmosphere, but is produced by chemical reactions between nitrogen dioxide (NO2), hydrocarbons and sunlight. When present at high levels, ozone can irritate the eyes and air passages, causing breathing difficulties, and may increase susceptibility to infection. Ozone is toxic to some crops, vegetation and trees, and is a highly reactive chemical, capable of attacking surfaces, fabrics and rubber materials. Whereas nitrogen dioxide participates in the formation of ozone, nitric oxide (NO) destroys ozone to form oxygen and nitrogen dioxide. For this reason, ozone levels are not as high in urban areas (where high levels of NO are emitted from vehicles) as in rural areas. As the nitrogen oxides and hydrocarbons are transported out of urban areas, the ozone-destroying NO is oxidised to NO2, which participates in ozone formation. As sunlight provides the energy to initiate ozone formation, high levels of ozone are generally observed during hot, still, sunny, summertime weather in locations where the airmass has previously collected emissions of hydrocarbons and nitrogen oxides (e.g. urban areas with traffic). The resulting ozone pollution, or summertime smog, may persist for several days and be transported over long distances.

Ozone: the data
We focus on data from two monitoring sites:
- an urban site in Leeds city centre, and
- a rural site at Ladybower Reservoir, just west of Sheffield.
The data at each site are daily measurements of the maximum hourly mean concentration of O3 and NO2, recorded in parts per billion (ppb), from 1994 to 1998 inclusive. To focus on the question of whether there is any effect of season on ozone levels, we compare data from winter (November to February inclusive) and early summer (April to July inclusive). We address the following questions: How, if at all, does the distribution of ozone measurements vary between the urban and rural sites?
How, if at all, is the distribution of ozone measurements affected by season? How, if at all, does the presence of other pollutants affect the levels of measured ozone? The purpose of the statistical analysis is to provide an objective analysis of the data, by extracting the information in the data relevant to each of the scientific questions.

Comparing hospitals
League tables for many public institutions, such as schools, hospitals and even universities, try to compare the relative performances of the institutions. This very small example uses the outcomes of a difficult operation at two hospitals. Ten patients at each hospital underwent the operation. The patients were selected to make sure that they had similar severity of illness and other characteristics believed to influence the outcome of the operation. There is no connection between the two hospitals. Each operation was classified as successful or unsuccessful. The first hospital had nine out of ten successful operations and the second hospital had five out of ten successful. What can we conclude about the relative performances of the two hospitals?

R code for the data
The data sets are saved from R in the file m105.Rdata.

load("./m105.Rdata")   # linux directory
ls()                   # "barley" "ozone.summer" "ozone.winter"
                       # "waveExcesses" "waves"
# Ozone
names(ozone.summer)    # "Leeds.O3" "Leeds.NO2" "Ladybower.O3" "Ladybower.NO2"
attach(ozone.summer)
hist(Leeds.O3)

Population and sample: examples
In the Ozone problem, there are data from a number of days during 1994-1998. However, interest is not solely in the levels of ozone on the days on which measurements were taken. The objective of a statistical analysis is to learn about the relationships between variables, and perhaps to extrapolate to future dates.

Exercise 3.11 For each of the problem data sets state the population that we are trying to learn about:
Newlyn waves: All waves encountered offshore at Newlyn.
Ozone: Levels of ozone at the two locations given the time of year.
Diseased trees: All trees in similar forests.
Hospitals: Other operations at the two hospitals.

Exercise 3.12 For the diseased trees data set, define the variable of interest as X and give its possible range of values.
X = length of an unbroken run of diseased trees. Discrete: X ∈ {0, 1, 2, . . .}.

Exercise 3.13 For the hospital data set, define the variables of interest and their possible ranges of values:
X is the number of successful operations in the first hospital, Y the number in the second. Discrete: X, Y ∈ {0, 1, . . . , 10}.

3.4 Graphical methods
Graphical methods are needed for visualising multivariate and univariate data. If the data is high dimensional, then it can be difficult to visualise, since plots are two dimensional! Ways of overcoming this are an active area of computer science. Here the focus is on methods for examining the distribution of a single variable and relationships between pairs of variables.

Historical note – Florence Nightingale
Good graphical display is the important first step in any data analysis. Choosing how to do it is part science, part art, and sometimes part politics! Florence Nightingale was the first female Fellow of the Royal Statistical Society. She pioneered the use of statistics as an organised way of learning, leading to improvements in medical and surgical practices. She developed the polar-area diagram to dramatise the needless deaths caused by unsanitary conditions. Florence Nightingale revolutionised the idea that social phenomena could be objectively measured and subjected to mathematical analysis, innovating in the collection, interpretation, and graphical display of descriptive statistics.

Histograms
The standard histogram of observations on a variable displays the frequency, the number of observations, in each bin, where the bins divide up the range of the variable and are usually of equal width.
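As a minimal sketch of what this counting amounts to (the data and bin boundaries here are illustrative, not from the course data sets), the frequencies drawn by hist() can be reproduced directly with cut() and table():

```r
# Illustrative data and bins: (0,1], (1,2], (2,3], (3,4]
x <- c(0.2, 0.7, 1.1, 1.3, 2.4, 2.6, 2.9, 3.8)
breaks <- 0:4

# count the observations falling in each bin
counts <- table(cut(x, breaks = breaks))
print(counts)                         # 2 2 3 1

# hist() computes the same frequencies for these breaks
h <- hist(x, breaks = breaks, plot = FALSE)
print(h$counts)                       # 2 2 3 1
```

The point is that the histogram contains no more information than the bin boundaries and the counts.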
A technical definition is hard to write down, and requires a definition of the empirical cdf. A histogram displays the variability and the distribution of the variable. It may suggest one pdf rather than another as a possible statistical model for the variable. In a sense the histogram is an empirical pdf.

Exercise 3.14 Diseased trees. Plot histograms for the partial and full data sets and summarise the shape of the distributions displayed.

partial = c(31, 16, 2, 0, 1, 0)
full    = c(71, 28, 5, 2, 2, 1)
# barplot(full) is another way
# unbundle the data
Partial = rep(0:5, partial)
Full    = rep(0:5, full)
par(mfrow=c(1,2))
hist(Partial, xlab="Run length", ylab="Count", main="Partial",
     ylim=c(0,75), breaks=seq(-0.5,5.5,by=1), col='red')
hist(Full, xlab="Run length", ylab="Count", main="Full",
     ylim=c(0,75), breaks=seq(-0.5,5.5,by=1), col='blue')

Sol: Both indicate a geometric decay in the distribution of run lengths.

Scaled histogram
The histogram estimates the underlying pmf of a discrete variable or the pdf of a continuous variable. Recall that all pmfs sum to 1, and that all pdfs integrate to 1. It thus makes sense to plot histograms with relative frequency rather than raw frequency, and so respect this summation.

Exercise 3.15 Diseased trees.

?hist   # see the freq argument
hist(Partial, prob=TRUE, xlab="Run length", ylab="Rel freq", main="Partial",
     breaks=seq(-0.5,5.5,by=1), col='red')
hist(Full, prob=TRUE, xlab="Run length", ylab="Rel freq", main="Full",
     breaks=seq(-0.5,5.5,by=1), col='blue')

This histogram has area 1. The shape of the histogram does not change. The vertical axis now represents the relative frequency rather than the raw frequency. The benefit of rescaling is to compare distributions better.

Exercise 3.16 Ozone: Comparing histograms. The histograms of the summer ozone data for both sites are given by this code. We need to get the scales right for comparison; otherwise the conclusions can differ, e.g. on peakedness.
load("./m105.Rdata")
attach(ozone.summer) ; names(ozone.summer)
par(mfrow=c(1,2))
hist(Leeds.O3); hist(Ladybower.O3)
hist(Leeds.O3,prob=T); hist(Ladybower.O3,prob=T)
hist(Leeds.O3,prob=T,ylim=c(0,.05)); hist(Ladybower.O3,prob=T,ylim=c(0,.05))
hist(Leeds.O3,prob=T,ylim=c(0,.05),breaks=20); hist(Ladybower.O3,prob=T,ylim=c(0,.05),breaks=20)
hist(Leeds.O3,prob=T,ylim=c(0,.06),breaks=20); hist(Ladybower.O3,prob=T,ylim=c(0,.06),breaks=20)

These are clearly different, but the spread and shape of these histograms are sufficiently close to make it difficult to identify any obvious difference by eye. To really look at differences: consider differencing.

Exercise 3.17 Ozone: the differences. We have observations on the ozone level at each site, (xi, yi), for every day i = 1, . . . , n. Looking directly at the daily differences in ozone, di = xi − yi, removes variability (e.g. atmospheric conditions) common to the two locations.

par(mfrow=c(1,1))
length(Leeds.O3)   # 469
d = Leeds.O3 - Ladybower.O3
hist(d, freq=FALSE, col='yellow',   # ways to skin a cat
     xlab="difference", ylab="Rel freq", main="O3 differences")
grid()

Exercise 3.18 Conclusions drawn: The variability of these differenced data is less than the variability of the measurements made at the separate sites. So common factors that affect both sites, and influence ozone values, are removed from the differenced data. Differencing is only possible if measurements are collected on the same unit (here, the same day). Most differences are negative: measurements at Ladybower are larger than at Leeds. This supports scientific expectations that rural ozone levels are generally higher than urban levels.

Choice of bin size for a histogram
Constructing a histogram smooths the data, and the width of the bins determines how much smoothing is applied. Broad bins correspond to highly smoothed data, in which much of the structure of the data set is lost.
Narrow bins undersmooth the data, leaving in random variation which obscures the structure of the data in a different way.

Exercise 3.19 Choosing bin size for the summer ozone data. Examples of very wide and very narrow bins are shown for the summer ozone data from the Leeds city centre site.

par(mfrow=c(1,2))
x = Leeds.O3
hist(x,prob=T,col='yellow',breaks=2)
hist(x,prob=T,col='red',breaks=500)

Using a very large bin size has obscured the structure of the data. So has the very small bin size – the right-hand plot just shows the raw data, though this is surprisingly informative here. The earlier plot is somewhere in between, achieved by trial and error.

Heights of offshore waves at Newlyn
The data set waves gives the maximal levels (in metres) recorded over consecutive 15 hour windows, throughout the period 1971-77. Typing waves displays the whole vector.

Exercise 3.20
Find the length of this vector: length(waves) # 2894
Find the mean of the offshore wave heights: mean(waves) # 2.866
Display a histogram of the offshore wave heights: hist(waves)

[Histogram "Offshore waves": wave height (0 to 12 m) against frequency.]

Describe the shape of this distribution and the range of this variable: Asymmetric, with a long right tail; all values are positive.
What does the y-axis of this plot represent? Counts of observations that fall in each bin.
Scale the histogram to have area 1: hist(waves,prob=TRUE)
What does the y-axis of this plot represent now? Relative frequency. The x-axis shows wave heights measured in metres.

3.5 Empirical cdf
The cumulative distribution function (cdf) of a random variable X, whether discrete or continuous, is
F(x) = P(X ≤ x), for −∞ < x < ∞.
Define the indicator function
I(X ≤ z) = 1 if X ≤ z, and 0 otherwise.

Exercise 3.21 Result: (Unbiased estimate of cdf.) Show, for any fixed z, that the expected value of I(X ≤ z) is F(z).
Sol:
E[I(X ≤ z)] = ∫_{−∞}^{∞} I(x ≤ z) f(x) dx   (definition of E)
            = ∫_{−∞}^{z} 1·f(x) dx + ∫_{z}^{∞} 0·f(x) dx
            = P(X ≤ z) + 0 = F(z)   (definition of F).
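This unbiasedness result can also be seen empirically. A minimal sketch, assuming a simulated Uniform(0,1) sample so that F(z) = z:

```r
# For X ~ Uniform(0,1), F(z) = z: the average of the indicators I(x_i <= z)
# should be close to F(z) when the sample is large.
set.seed(1)
x <- runif(100000)
z <- 0.3
print(mean(x <= z))   # close to F(0.3) = 0.3
```

In R, mean(x <= z) is exactly the average of the indicator values, since the logical vector x <= z is coerced to 0s and 1s.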
Definition: The empirical cdf is defined as
F̃(x) = (1/n) Σ_{i=1}^{n} I(xi ≤ x).
Result: The ecdf can be calculated from F̃(x) = (1/n) × (number of i such that xi ≤ x). Each observation has an equal weight 1/n in this computation.

Exercise 3.22 5 realisations of a rv X are {2, 3, 4, 1, 2}. Compute F̃(x) at x = 0.5, 1.5, 2.5, 3.5, 4.5. How would the calculation change if the points x = 0, 1, 2, 3, 4 are used?
Sol:
F̃(0.5) = 0/5, F̃(1.5) = 1/5, F̃(2.5) = 3/5, F̃(3.5) = 4/5, F̃(4.5) = 5/5.
Not much change: F̃(0) = F̃(0.5), F̃(1) = F̃(1.5), . . . . But this implies that the ecdf is a step function.

Properties of the ecdf
The empirical cdf F̃(x) is a proper cdf and
• is a step function with jumps at the data points;
• F̃(x) = 1 if x ≥ max(x1, . . . , xn);
• F̃(x) = 0 if x < min(x1, . . . , xn).

An alternative calculation of the ecdf
As the ecdf is a step function with jumps at the data points, there is an easier way of calculation. Take the realisations x1, . . . , xn; order them with the smallest first; label these order statistics as x(1), x(2), . . . , x(n), so that x(1) ≤ x(2) ≤ . . . ≤ x(n). The subscripts give the ranks of the data points.

x = c(2, 3, 4, 1, 2)
rank(x)   # 2.5 4.0 5.0 1.0 2.5
sort(x)   # order statistics

Result: the ecdf can be evaluated at the order statistics,
F̃(x(i)) = i/n,
and for values of x in between,
F̃(x) = i/n, where x(i) ≤ x < x(i+1).
Proof: The number of x ≤ x(i) is i.

Exercise 3.23 For observations {2, 3, 4, 1, 2}, find F̃(x) and sketch the plot.
Sol: Order the data: 1, 2, 2, 3, 4.

x      0    0.5  1    1.5  2    2.5  3    3.5  4    4.5  5
F̃(x)  0/5  0/5  1/5  1/5  3/5  3/5  4/5  4/5  5/5  5/5  5/5

Exercise 3.24 Summer ozone. Use the first 20 observations of Leeds city centre summer ozone values to compute the ecdf.
n = 20
x = Leeds.O3[1:n]
xrank = sort(x)            # order the data
Fn = seq(1,n)/n            # a jump of 1/n
plot(xrank, Fn, type='s') ; grid()   # step function
plot(ecdf(x), pch='.') ; grid()      # ecdf is an R function
# for the whole data
par(mfrow=c(1,1))
x = Leeds.O3
plot(ecdf(x), pch='.') ; grid(12)

Draw some conclusions from the complete data.
Sol: On about 60% of days the daily maximum was less than 35, and on about 20% of days the daily maximum was greater than 40; there is a steady increase in the cdf between 20 and 40, and the maximum stretches out to 80.

3.6 Summary statistics
In addition to visualising our data graphically, we can calculate some summary statistics which capture important features of our data. Numerical summaries of the data can
• facilitate the comparison of different variables;
• help make clear statements about aspects of the data.

Mathematical notation
Recall the notation
Σ_{i=1}^{n} g(i) = g(1) + g(2) + . . . + g(n − 1) + g(n)
for any positive integer value of n and any function g. In statistics we often have to do mathematics with sums of this form. The most common forms of this expression encountered are:
Σ_{i=1}^{n} xi = x1 + . . . + xn   and   Σ_{i=1}^{n} xi² = x1² + . . . + xn².

Sample mean
Consider a random variable X from which we obtain n realisations x1, . . . , xn. To emphasise some mathematical properties of averaging we may write the realisations as a vector x = (x1, . . . , xn).
Definition: The sample mean of n observations x1, . . . , xn is denoted by x̄, or by m(x), and is obtained by summing all the xi and dividing by n:
x̄ = (1/n) Σ_{i=1}^{n} xi   and   m(x) = (1/n) Σ_{i=1}^{n} xi.
This measures the location of the sample. It is an estimate of the expectation E(X), or the mean of X.

Sample variance and standard deviation
Definition: The sample variance of n observations x1, . . . , xn is denoted s² and is given by:
s² = (1/n) Σ_{i=1}^{n} (xi − x̄)².
Note the divisor n. Many textbooks use the divisor (n − 1) instead of n here, which may seem odd.
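The two divisors are easy to compare numerically. A small sketch, on illustrative data (note that R's built-in var() uses the divisor n − 1):

```r
# Compare the divisor-n sample variance of the definition above
# with R's var(), which uses the divisor (n-1)
x <- c(2, 4, 6, 8, 10)
n <- length(x)
s2_n  <- mean((x - mean(x))^2)   # divisor n: equals 8 here
s2_n1 <- var(x)                  # divisor (n-1): equals 10 here
print(c(s2_n, s2_n1))
```

The two always differ by the factor (n − 1)/n, so s2_n equals (n-1)/n * var(x).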
There are technical reasons for this but, for large values of n, it makes little difference. The sample variance is a measure of the spread of X and also an estimate of the variance var(X). Ideally, a spread measure should have the same units as the original data.
Definition: The sample standard deviation of observations x1, . . . , xn is s = √(s²). The standard deviation σ of X is the square root of σ² = var(X); the sample standard deviation estimates this value from the data.

Exercise 3.25 Waves.
Find the sample mean of the wave height data: mean(waves) # 2.866
Find the sample variance: var(waves) # 2.564049
Use the sqrt() function to derive the sample standard deviation: sqrt(var(waves)) # 1.601265

Exercise 3.26 Ozone data. Calculate summary statistics of O3 to look more closely for differences between the locations and the seasons. There are four groups, arising from the two levels of each of the two nominal variables, location and season. The means, with standard deviations in parentheses, are

             summer          winter
Leeds city   31.78 (9.28)    20.52 (10.77)
Ladybower    43.63 (11.81)   29.24 (8.40)

Give the R code to compute these numbers. Draw conclusions from these summary statistics.
Sol:
mean(Leeds.O3)
mean(ozone.summer$Leeds.O3)   # list
mean(ozone.winter$Leeds.O3)
sd(ozone.winter$Ladybower.O3)
The conclusions are comparative: the mean values for Ladybower are higher than for Leeds; the summer mean values are higher than the winter ones; the spreads are roughly the same.

Sample quantiles
Sample quantiles are calculated directly from the empirical cdf.
Definition: the pth sample quantile, x̃p, satisfies F̃(x̃p) = p, for 0 < p < 1. The median x̃0.5 corresponds to p = 0.5; it is another widely used measure of location. The definition x̃p = F̃⁻¹(p) does not work directly here, because F̃ is a step function and so its inverse is not defined.
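Because F̃ is a step function, a convention is needed to pin down x̃p. One common choice (available in R as quantile() with type = 1) takes the smallest order statistic x(i) with F̃(x(i)) ≥ p, i.e. x(⌈np⌉). A minimal sketch, using the small data set from the ecdf exercises:

```r
# Smallest observation whose ecdf value reaches p: x_(ceiling(n*p))
x <- c(2, 3, 4, 1, 2)
n <- length(x)
p <- 0.5
q <- sort(x)[ceiling(n * p)]
print(q)                                  # 2 for these data
# the same convention is R's type-1 quantile
print(quantile(x, probs = p, type = 1))   # 2
```

Note that R's default quantile() uses a different convention (type 7), which interpolates between order statistics.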
Exercise 3.27 Calculate the sample mean and the median, for each data set, using this code:

stats = function(x){ c(mean(x), median(x)) }
stats(c(2, 4, 6, 8, 10))    # 6 6
stats(c(2, 4, 6, 8, 100))   # 24 6
stats(c(2, 4, 6, 8, 1000))  # 204 6

The lesson learnt is that the median is insensitive to outliers.

Exercise 3.28 Find the 0.6 quantile of the Leeds summer ozone daily maxima.

[Plot of the ecdf F̃(x) of the Leeds summer daily maxima, showing that for p = 0.6, x̃p lies in the interval (33, 34).]

The 0.6 sample quantile lies between (33, 34).
quantile(Leeds.O3, prob=0.6)   # 33

Exercise 3.29 The function quantile() calculates quantiles of a vector: quantile(waves). The minimum, maximum and median values are 0.32m, 11.05m and 2.46m. Compare the median to the mean: the mean is higher, since the histogram is skewed to the right.

Exercise 3.30 Plot the empirical cdf of the waves data,
plot(ecdf(waves), pch='.'); grid(21)
to answer the following.
Find the median of the wave height distribution: 2.5m approx.
Find the 0.1 and 0.9 quantiles of the wave height distribution: 1.2m and 5.1m approx.
Estimate the probability of a randomly selected wave being less than 1.7m: 0.25.
Find the wave height exceeded by 25% of the waves: 3.7m approx.

Box-and-whisker plots
These plots summarise the observations in terms of quantiles. They display the extremes (the whiskers) and the central values (the box defined by the quartiles and the median).
Definition: the interquartile range is x̃0.75 − x̃0.25. The length of the box is the interquartile range.

Exercise 3.31 Boxplot for the Ozone data.
load("./m105.Rdata")
attach(ozone.summer) ; names(ozone.summer)
par(mfrow=c(1,2))
hist(Leeds.O3); hist(Ladybower.O3)
boxplot(Leeds.O3, ylim=c(0,110)); boxplot(Ladybower.O3, ylim=c(0,110))
quantile(Ladybower.O3)

Features of the boxplots are: the thick line in the box is the median; the upper line of the box is the 75% quantile and the lower line is the 25% quantile; the minimum and maximum are easily identified; and points appearing outside the limits may be considered outliers. Summarise the conclusions to be drawn from these boxplots.
Sol: Skewness is shown as asymmetry of the box around the median; here it is only the right-hand tail that is long. The Ladybower distribution is a shift to the right of the Leeds distribution. Comparison requires the same scales.

3.7 Bivariate relationships
Histograms and empirical distribution functions are useful methods for visualising a single variable. However, with multivariate data, it is important to examine the relationships between variables as well as the structure of each variable by itself. The scatterplot simply plots the value of one variable against another.
Definition: if (xi, yi) are two observations on the same unit, i = 1, 2, . . . , n, the plot of the points (xi, yi) is called a scatterplot.

Exercise 3.32 Ozone. Consider the effect of nitrogen dioxide (NO2) on ozone levels. We focus on the Leeds city centre measurements. Use this R code to give scatterplots of O3 against NO2 for summer and winter. Sketch the graph in your notes.

par(mfrow=c(1,2))
xsumm = ozone.summer$Leeds.O3
ysumm = ozone.summer$Leeds.NO2
lim = c(0,100)   # vital for comparison
plot(xsumm, ysumm, type='n', xlim=lim, ylim=lim) ; grid()
points(xsumm, ysumm, col='red', pch='.', cex=2)
xwint = ozone.winter$Leeds.O3
ywint = ozone.winter$Leeds.NO2
plot(xwint, ywint, type='n', xlim=lim, ylim=lim) ; grid()
points(xwint, ywint, col='blue', pch='.', cex=2)   # stretch the graphic

Draw conclusions.
Sol: Ozone. Similar joint distributions, with the main body slightly differently located.
There is no obvious relationship between x and y, though perhaps in winter the small-x points behave differently. There are many outliers.

The sample correlation coefficient
Consider two rvs X and Y on which we have iid observations (x1, y1), . . . , (xn, yn). Let m(x) denote the sample mean of x = (xi; i = 1, . . . , n), and let s(x) denote the sample standard deviation of the (xi; i = 1, . . . , n). Similarly define m(y) and s(y). Standardised versions of xi and yi are
(xi − m(x))/s(x)   and   (yi − m(y))/s(y).
Definition: the sample correlation coefficient r(x, y) is the average of the product of these standardised values,
r(x, y) = (1/n) Σ_{i=1}^{n} [(xi − m(x))/s(x)] [(yi − m(y))/s(y)].

n = 20
x = runif(n) ; y = runif(n)
cor(x,y)
mean( (x-mean(x))/sd(x) * (y-mean(y))/sd(y) )
# why are these different?
f = sqrt((n-1)/n)   # converts sd(), which uses divisor (n-1), to divisor n
mean( (x-mean(x))/(f*sd(x)) * (y-mean(y))/(f*sd(y)) )

Result: (The correlation coefficient is invariant to standardisation.) For given scalars a, b, c, d and the vector of ones 1 = (1, 1, . . . , 1),
r(ax + b1, cy + d1) = sign(ac) r(x, y).
Proof: See exercises.

Result: The correlation coefficient always satisfies −1 ≤ r(x, y) ≤ 1.
Proof: Because of the invariance of the correlation coefficient to standardisation, take x, y to have mean 0 and variance 1. Thus
Σ_i xi = 0 and Σ_i xi² = n,
and similarly for y. Consider the quadratic form
Q = (1/n) Σ_i (xi + yi)²
  = (1/n) Σ_i (xi² + yi² + 2 xi yi)
  = (1/n) Σ_i xi² + (1/n) Σ_i yi² + (2/n) Σ_i xi yi
  = 1 + 1 + 2 r(x, y).
Now Q ≥ 0, so that 0 ≤ 2 + 2r, and r ≥ −1. Similarly start with Q = (1/n) Σ_i (xi − yi)² and find 0 ≤ 2 − 2r, so that r ≤ 1.

The sample correlation coefficient is a measure of linear association, or clustering around a line. Interpretation: r(x, y) = 0 gives no linear association, r(x, y) < 0 means negative linear association, and r(x, y) > 0 means positive linear association; when r(x, y) is near ±1 the association is strong.

Exercise 3.33 Use this code to generate data with r = 0.5, roughly.
par(mfrow=c(1,1))
n = 400
z = rnorm(n)
x = z + rnorm(n); y = z + rnorm(n)
plot(x, y, type='p', pch='x')
cor(x,y)   # .47

Use other relations of x and y to z to give plots with r = −0.5, r = 0.9, r = 0, roughly.
Sol:
x = z + rnorm(n)   ; y = -z + rnorm(n)
plot(x, y, type='p', pch='x'); cor(x,y)   # -.52
x = 3*z + rnorm(n) ; y = 3*z + rnorm(n)
plot(x, y, type='p', pch='x'); cor(x,y)   # .91
x = rnorm(n)       ; y = rnorm(n)
plot(x, y, type='p', pch='x'); cor(x,y)   # .04

Exercise 3.34 The sample correlation coefficient is not appropriate for detecting nonlinear association.
x = z + rnorm(n) ; y = z^2 + rnorm(n)
plot(x, y, type='p', pch='x'); cor(x,y)   # -.04

Exercise 3.35 Ozone data. Calculate the sample correlation coefficients between O3 and NO2 for the ozone data. There are four groups, arising from the two levels of each of the two nominal variables, location and season.

xsc = ozone.summer$Leeds.O3       # summer in the city
ysc = ozone.summer$Leeds.NO2
xwc = ozone.winter$Leeds.O3       # winter
ywc = ozone.winter$Leeds.NO2
xsr = ozone.summer$Ladybower.O3   # rural
ysr = ozone.summer$Ladybower.NO2
xwr = ozone.winter$Ladybower.O3
ywr = ozone.winter$Ladybower.NO2
cor(xsc,ysc) ; cor(xsr,ysr)
cor(xwc,ywc) ; cor(xwr,ywr)

Collating the results gives

                     Summer  Winter
Leeds city            0.10   -0.24
Ladybower reservoir   0.25   -0.48

What conclusions can you draw from these statistics?
Sol: The correlations between O3 and NO2 are small, with only one being moderate. By comparison with the earlier figure, one might worry about outliers and/or non-linearity. The fear is that outliers may distort the value of the coefficient. plot(xwr, ywr, type='p') shows association, but non-linear.

Exercise 3.36 The sample correlation coefficient is an estimate of the population correlation between X and Y, denoted corr(X, Y). While its definition is beyond math105, consider how one might start by arguing an analogy to the relation between E(X) and x̄.
Sol: Compare
x̄ = (1/n) Σ_i xi,   a weighted average, with
E(X) = ∫_{−∞}^{∞} x f(x) dx,   also a weighted average.
Now, taking the standardised variables,
r = (1/n) Σ_i xi yi,
and so
E(XY) = ∫ xy · ?? dx ??
E(XY) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} xy f(x, y) dx dy,   a conjecture.
We need to define a joint pdf f(x, y).

3.8 Chapter summary
An introduction to statistics and exploratory data analysis is developed in terms of uncertainty, decision making and data. The symbiotic theories of probability and of statistics are contrasted in terms of the before-analysis and the after-analysis of a probability experiment. Data drives statistics, and sources of variation between and within data sets are described. One source of random variation is sampling, and the concept of the simple random sample is introduced. Conceptual issues, such as the representative nature of the sample and the population, and methodological issues, such as how to define a large sample and other forms of sampling, are briefly discussed. Given a data set, the first step in statistics is to understand its context and subject it to an exploratory data analysis in order to understand its structure and variability. Data sets for waves, trees, and ozone are used as running examples throughout the chapter. The histogram is perhaps the best known graphical method of eda, and is one way to portray distributions. We use it to construct an empirical estimate of the pmf or pdf of the rv under study. However, the empirical cdf is just as important practically, and theoretically it has pride of place. Summary statistics related to the data set are the well known sample mean, variance and standard deviation, and the lesser known sample quantiles. Boxplots, which are condensed summaries of the histogram, are based on given quantiles. The chapter ends with the extension to bivariate relationships and the definition of the sample correlation coefficient. Throughout, these statistical concepts are illustrated in the R language, with special emphasis on calculation, plotting and simulation.
Contents

1 Introduction to R .............................. 1
  1.1 The tutorial ............................... 1
  1.2 Chapter summary ............................ 11

2 Continuous random variables ................... 13
  2.1 Review of probability ...................... 13
  2.2 Continuous and discrete rvs ................ 15
  2.3 Expected values ............................ 20
  2.4 Standard continuous distributions .......... 24
  2.5 Quantiles and the cdf ...................... 31
  2.6 Chapter summary ............................ 35

3 Statistics and exploratory data analysis ...... 37
  3.1 Uncertainty ................................ 37
  3.2 Exploratory data analysis .................. 45
  3.3 Examples with associated data sets ......... 46
  3.4 Graphical methods .......................... 50
  3.5 Empirical cdf .............................. 54
  3.6 Summary statistics ......................... 56
  3.7 Bivariate relationships .................... 60
  3.8 Chapter summary ............................ 63