January 31 - Welcome to Mathematics 243

Each class period, you will receive a one-page “outline” such as this one that outlines the topics of the day and also includes the homework assignment.

For today, first a note from your instructor: I will not be able to be in class on Monday, January 31, or Tuesday, February 1. I am serving on a panel at the National Science Foundation to review grant proposals. Normally, I do not cancel classes during the regular semester – your time is precious. However, this particular program at the NSF has been so important to Calvin in providing funds for curricular improvement that I thought it important that Calvin contribute to doing the work of evaluating grant proposals. The NSF makes funding decisions almost entirely on the basis of peer review. So I hope that you will excuse my absence (and read my policy on your skipping class!) and I hope that you will be ready to begin class on Thursday, February 3. So that you aren’t bored until then, there is a homework assignment listed below. Homework is due Friday, February 4. I can answer any questions that arise about the homework on Thursday. Michael Stob

Handouts that you should have in this packet
1. A one-page sheet, “syllabus,” with crucial data about the course on the front and a calendar on the reverse
2. A one-page sheet entitled “Important Information”
3. The first three sections of simpleR, the notes for the statistical computer program, R, that we will use this semester (I have provided the first few sections, but you can download the whole manual from the internet, so I will not copy any more of it.)

Homework, Due Friday, February 4
1. Read the “syllabus” and the “important information.” Be sure to note any questions that you have about these policies so that you can ask them in class on Thursday.
2. Find the course website: www.calvin.edu/~stob/M243 Explore the website. In particular, read the information on the website about the computer package R.
3.
Read sections 1.1 and 1.2 of the textbook. Pay particular attention to the definitions of terms that are in boldface in the text. Do problems 1.4,6,14,15,16. Note that problems are numbered consecutively within chapters, so problem 1.4 is problem number 4 of Chapter 1. It happens to be on page 20. All data for problems is available on the CD that came with your textbook. (Stem-and-leaf plots are a tool suited to hand analysis, but histograms should normally be drawn with computer software. For this assignment, you may draw the histograms by hand, but we will soon learn how to read homework data into R and draw the histograms with the software.)
4. Read sections 1 and 2 of simpleR. Do problems 2.1–2.6. You will need to use R to do this. Either find R in a computer lab or download it to your computer. You will want to download it if at all possible since you will be using R for most assignments in this class. (If you wish to install it on your computer but you have only a slow internet connection, I will give you a CD with a copy to install on Thursday. Currently, the Macintosh lab in North Hall Basement and the Engineering labs have R installed.)

February 3 – Data
1. Statistics is the science of data.
   (a) Three activities:
       i. collecting data
       ii. analyzing data
       iii. making inferences from data
   (b) Data are numbers in context.
   (c) A variable is a function defined on a set of objects (usually numerical).
   (d) Datasets consist of objects and variables.
2. Sampling from a population.
   (a) A population is a (well-defined) set of objects.
   (b) A sample is a subset of a population.
   (c) A simple random sample of size k from a population of size n is a sample chosen by a procedure for which any subset of size k has the same chance to be the sample chosen.
   (d) Making inferences about the population from the sample.
   (e) The R command sample(1:100,15,replace=F) chooses a random sample of size 15 from the set of numbers {1, 2, . . . , 99, 100} (without replacement).
3.
Questions about the course.
4. Questions about homework.

Homework, Due Tuesday, February 8
1. Read the supplementary notes, Sections 1 and 2.
2. Read Devore and Farnum, Section 4.2, pages 161–166.
3. Do problems 1.1,2.1-4 of the Supplementary Notes.

February 4 – Experiments
1. Experiments (versus observation)
2. Independent (treatment, factor) variables, dependent (response) variable
3. Statistics is about describing and explaining variation
4. Replication
5. Experimental error (variation when the independent variable is fixed)
6. Randomization
7. Blocking
8. The DOE mantra: Block what you know, randomize what you don’t.

Homework, Due Tuesday, February 8
1. Read Devore and Farnum, Section 4.3.
2. In a clinical test of cancer treatments, there is usually a control group and a treatment group. A very simple example of such a study can be found at http://www.stat.ucla.edu/cases/breast cancer/. Answer the questions that appear at that website.
3. The main Calvin basketball court has two baskets with very different visual backgrounds. Some people claim that it is more difficult to make free throws at the South basket than at the North basket. Describe a design of an experiment to test this hypothesis. Think about the roles of replication, randomization, and blocking in this particular experiment.

February 7 – Distributions
1. Today all data are from continuous variables
2. Relative frequency histograms (area corresponds to proportion of data)
3. Distributions
4. Interpretation of distributions – approximation, model
5. The exponential family: f(x) = λe^(−λx), x ≥ 0
6. The exponential distribution is sometimes used as a model for lifetime data and waiting times
7. R commands of note
   (a) dexp(c,lambda) gives the value of the density at c
   (b) pexp(c,lambda) gives the value of ∫₀^c λe^(−λx) dx

Homework, Due Friday, February 11
1. Read Devore and Farnum, Section 1.3
2. Do problems 19,20,22,23,24 in Section 1.3 of Devore and Farnum.

February 8 – More Continuous Distributions
1.
Today all data are from continuous variables
2. The normal distribution: f(x) = (1/√(2πσ²)) e^(−(1/2)((x−µ)/σ)²), −∞ < x < ∞
   (a) Qualitative properties: unimodal, symmetric, “bell-shaped”
   (b) Sample applications
   (c) R – dnorm(x,mu,sigma), pnorm(x,mu,sigma), qnorm(p,mu,sigma)
   (d) The standard normal (Z) distribution
3. The uniform distribution – dunif(x,a,b)
4. The Weibull distribution – dweibull(x,alpha,beta)
5. The beta distribution – dbeta(x,alpha,beta)

Homework, Due Friday, February 11
1. Read Devore and Farnum, Section 1.4 and “gaze” at Section 1.5.
2. Do problems 32,34,38,40 of Devore and Farnum Section 1.4 and problem 50 of Section 1.5

February 10 – Discrete Distributions
1. Discrete distributions.
2. Mass functions.
3. The Poisson distribution. (R pois)
4. The hypergeometric distribution. (R hyper) This is a three-parameter distribution with parameters called m, n, k. The distribution arises from sampling k elements without replacement from a population with m objects of one kind and n of the other. For each x, the mass function p(x) is the proportion of samples that have exactly x objects of the first kind. The mass function is
   p(x) = C(m,x) C(n,k−x) / C(m+n,k),  x = 0, 1, . . . , min(k, m)
   where C(a,b) = a!/(b!(a − b)!)

Homework
Read pp 29–30 and pp 51–53. Do the following problems:
1. Problem 56 of Section 1.6.
2. Suppose that a simple random sample of size 10 is chosen from a population of size 100. Suppose 60 members of the population are female. How often will such a sample include at least 6 females? (The hypergeometric distribution will be useful here.)

February 11 – Measures of center
1. mean (notation: x̄ is the mean of observations x₁, . . . , xₙ)
2. median (notation: x̃)
3. the median is resistant to the effect of outliers while the mean is not
4. trimmed means (notation: x̄ₚ)
5. mean of a distribution (notation: µ)
6. median of a distribution (notation: µ̃)
7.
mean and median of important distributions:

   Distribution     parameters   mean          median
   Exponential      λ            1/λ           (ln 2)/λ
   Normal           µ, σ         µ             µ
   Uniform          a, b         (a + b)/2     (a + b)/2
   Weibull          α, β
   Beta             α, β         α/(α + β)
   Poisson          λ            λ
   Hypergeometric   m, n, k      km/(m + n)

Homework
1. Read Section 2.1.
2. You do not have to do the problem about the hypergeometric distribution included on the outline of February 10.
3. Do problems 2.2,4,6,8 of Devore and Farnum.
4. The mean and median of the exponential distribution.
   (a) Verify the entries for mean and median in the above table for the exponential distribution.
   (b) According to the above table, which number is the greater of the median and the mean of the exponential distribution? Give an explanation in terms of the shape of the exponential distribution as to why this should be so.
5. Important properties of the mean.
   (a) Show that the mean of x₁, x₂, . . . , xₙ is the unique number c such that Σᵢ₌₁ⁿ (xᵢ − c) = 0.
   (b) Show that the mean of x₁, . . . , xₙ is the value of c that minimizes Σᵢ₌₁ⁿ (xᵢ − c)².

February 14 – Boxplots and percentiles
1. statistics is concerned with describing and explaining variation
2. quartiles and interquartile range (IQR)
3. five number summary
4. boxplots (box-and-whiskers)
5. percentiles
6. quartiles and percentiles of continuous distributions

Homework, Due Friday, February 18
1. Read Section 2.3 of Devore and Farnum.
2. Do problems 2.32,34,36,38,42
3. Suppose a variable has an exponential distribution with parameter λ = 1.
   (a) What are the lower quartile, upper quartile, and IQR of the distribution?
   (b) What percentage of values of this variable would be classified as outliers if outliers are defined as in the construction of the boxplot?
4. Suppose that x₁, . . . , xₙ are observations of a variable X and observations yᵢ are defined by yᵢ = αxᵢ + β where α and β are constants. How does the boxplot of the yᵢ compare to that of the xᵢ?

February 15 – Measures of spread
1.
(“sample”) variance of a set of observations (notation: s²)
2. (“sample”) standard deviation (notation: s)
3. Why n − 1 instead of n?
4. important (obvious) identity (which we use with c = x̄ and c = µ):
   Σᵢ₌₁ⁿ (xᵢ − c)² = Σᵢ₌₁ⁿ xᵢ² − 2c Σᵢ₌₁ⁿ xᵢ + nc²
5. Use c = x̄:
   Sxx = Σᵢ₌₁ⁿ (xᵢ − x̄)² = . . . = Σᵢ₌₁ⁿ xᵢ² − n x̄² = Σᵢ₌₁ⁿ xᵢ² − (Σᵢ₌₁ⁿ xᵢ)²/n
6. variance and standard deviation of distributions (notation: σ², σ)

   Distribution     mean         variance
   Normal           µ            σ²
   Exponential      1/λ          1/λ²
   Beta             α/(α + β)    αβ/((α + β + 1)(α + β)²)
   Uniform          (a + b)/2    (b − a)²/12
   Poisson          λ            λ
   Hypergeometric   km/(m + n)   kmn(m + n − k)/((m + n)²(m + n − 1))

Homework
1. Read Section 2.2 of Devore and Farnum.
2. Do problems 2.15,16,22,24,30
3. Show that “the average of the squares is greater than the square of the average” unless the data are constant. Formally, let yᵢ = xᵢ² for each i. Then show ȳ > x̄² unless x₁ = . . . = xₙ. (Hint: the above expression for Sxx can be employed to great effect here.)

February 17 – Randomness
1. Random “experiments.” Essential features:
   (a) more than one possible outcome,
   (b) the experiment is repeatable under (more or less) identical conditions,
   (c) the outcome that will obtain under any given repetition is uncertain.
2. Two canonical examples related to data collection.
   (a) A random sample of size m is to be chosen from a finite population of size n,
   (b) A number n of units are to be assigned at random to k experimental treatments.
3. Crucial terminology. Fix an experiment E.
   Definition 1. An outcome is one of the possible (atomic) results of the experiment.
   Definition 2. The sample space is the set of possible outcomes.
   Definition 3. An event is any subset of the sample space.
4. Example: for the first canonical example. Suppose that a random sample of size 30 is taken from the population of all Calvin senior students. Any specific collection of seniors of size 30 is an outcome. (There are boatloads of these.)
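As an aside, R’s choose() function counts these outcomes directly. A quick sketch, assuming (purely for illustration, since the outline does not give the class size) a senior class of 900 students:

```r
# Number of possible simple random samples of size 30 from a
# hypothetical population of 900 seniors (the 900 is made up).
choose(900, 30)
```

The result is astronomically large, which is the point of the remark above.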
Some example events include: the set of samples that have exactly 15 males and 15 females; the set of samples that have exactly 10 students with GPA greater than 3.00; the set of samples that have 30 varsity basketball players (there aren’t too many of these!).
5. A probability function is a rule that assigns to each event A of the experiment a real number P(A) such that 0 ≤ P(A) ≤ 1.
6. The probability P(A) is supposed to “predict” the long-run relative frequency of the occurrence of the event A if the experiment is repeated many times. In other words:
   P(A) ≈ (number of times A occurs)/(number of times experiment is repeated)
   with the approximation improving as the number of times the experiment is repeated increases.
7. Important: Probability statements are about what may occur in the long run and are statements before the fact about what the result of the process of doing the experiment might be. They are not statements about what has already occurred. What has already occurred is certain!

Homework, Due Tuesday, February 22
1. Read Devore and Farnum, Section 5.1.
2. Do problems 5.1,2,3,4,5 of Devore and Farnum.
3. Suppose that two numbers are to be chosen at random (without replacement) from the numbers 1, . . . , 10.
   (a) Make a systematic list of all the outcomes of this random experiment. How many are there?
   (b) Let A be the event such that the sum of the two numbers chosen is even. List the outcomes in A.
   (c) For the event A in part (b), what is a reasonable number to define P(A) to be? Why?
   (d) Now suppose that the two numbers are to be chosen at random with replacement. How many outcomes are there (you need not list them)? What is a reasonable number to assign P(A) in this case?

February 18 – Probability
1. Goal: given an experiment E, assign probabilities P(A) to all events A so that P(A) is the “limiting relative frequency” of the event if the experiment is repeated.
2. Helpful language of events. Use the language of sets.
(a) A ∪ B (or A or B) is the event that occurs if any outcome in A or B (or both) occurs
   (b) A ∩ B (or A and B) is the event that occurs if any outcome that is in both A and B occurs
   (c) A′ (or the complement of A) is the event that occurs if an outcome that is not in A occurs
3. Axioms for probability theory
   Axiom 1. For every event A, 0 ≤ P(A) ≤ 1.
   Axiom 2. If S is the sample space, P(S) = 1.
   Axiom 3. If A₁, A₂, . . . are mutually exclusive events (i.e., Aᵢ ∩ Aⱼ = ∅ for all i ≠ j), then P(∪ᵢ₌₁^∞ Aᵢ) = Σᵢ₌₁^∞ P(Aᵢ).
4. Simple case: if the sample space has finitely many outcomes o₁, . . . , oₖ that are equally likely, then assign P({oᵢ}) = 1/k for each outcome oᵢ.
5. Example: throw two dice once.
6. Example of simple case: sample m objects without replacement from a set of n objects. The number of equally likely outcomes is C(n,m).
7. An exhaustive (and exhausting) analysis of the example: sample 5 stones from a population consisting of 34 yellow, 25 blue, 4 green (63 total).

Homework, Due Tuesday, February 22
Read Devore and Farnum Section 5.2.
1. Assume that A and B are events and that P(A), P(B), and P(A ∩ B) are given. Find formulas for the probabilities of the following events in terms of these three numbers (hint: use the third axiom above; drawing a Venn diagram may be helpful):
   (a) Exactly one of A or B
   (b) neither A nor B
   (c) at least one of A or B
   (d) A and not B.
2. Consider the case of the 63 stones but the experiment of choosing 10 stones instead of 5.
   (a) How many different equally likely outcomes are there in the sample space?
   (b) For each number k = 0, 1, . . . , 10, what is the probability of choosing k yellow stones in the sample of 10?
   (c) In the case of sampling 5 stones from 63 stones, how many different outcomes are there if we replace each stone before selecting the next?

February 21 – Conditional Probability
1. Conditional probability. (The case of partial information.)
2. Notation: P(A|B), read “probability of event A given B”
3.
Definition: P(A|B) = P(A ∩ B)/P(B)
4. examples: two dice, the random senior
5. Be careful: P(A|B) is not necessarily equal to P(B|A).
6. Independence.
   (a) the intuitive content.
   (b) the formal definition: A and B are independent if P(A ∩ B) = P(A)P(B).
7. The difference between sampling with and without replacement amounts to independence.

Homework, Due Friday, February 25
1. Read Devore and Farnum Section 5.3.
2. Do problems 5.18,19,20,22 in Devore and Farnum.

February 22 – More on conditional probability
1. Example: AIDS tests are “99.5% accurate.”
2. Compound experiments.
3. Repeated trials of an experiment with two possible outcomes: “success” and “failure.”
   (a) The binomial mass function: n independent trials; p probability of “success.” The probability of k successes, p(k), is given by
       p(k) = C(n,k) pᵏ (1 − p)ⁿ⁻ᵏ,  0 ≤ k ≤ n
   (b) R commands: dbinom, pbinom

Homework, Due Friday, February 25
Read Devore and Farnum Section 1.6, pages 48–51.
1. Do problems 1.52 and 1.55 (note that these are in Chapter 1!)
2. Consider the following experiment. There are three cards: one is white on both sides, one is red on both sides, and one is white on one side and red on the other. The experiment consists of choosing a card at random and laying it on the table with a randomly chosen side face up (i.e., each card is equally likely and each side of the card chosen is equally likely). What is the probability that, if a red side is facing up, then the face-down side is also red?
3. In basketball, a “one-and-one” is executed as follows. A player shoots a free throw. If the player makes the free throw, one point is awarded and the player is allowed to shoot one more free throw for an additional point. If a player misses the first free throw, then the player does not shoot the second one. Thus a player can score either 0, 1, or 2 points on a one-and-one. Suppose that a player has a probability of 0.7 of making any free throw.
   (a) What is the probability that the player scores 0 points? 1 point? 2 points?
(b) In general, if the player is allowed to continue shooting until she misses, what is the probability that the player scores k points for any k?

February 24 – Random variables and distributions
1. Given an experiment E with sample space S, a random variable is a (real-valued) function defined on S. Use uppercase letters for random variables and lowercase letters for their values.
2. Examples of random variables.
3. Events correspond to possible sets of values of X.
4. A random variable has a distribution given by a density function (or a mass function).
5. Probability distributions as models for probabilities.
6. Many examples.
7. Interpretation of mean and variance of a probability distribution.

Homework, Due Friday, March 4
1. Read Devore and Farnum, Section 5.4, pages 212–217.
2. You might find it useful to obtain (from the website) and read Section 6 of simpleR.
3. Do problems 5.30, 5.32, 5.35.

February 25 – Properties of Distributions of Random Variables
1. Using R with distributions. (Generating random numbers: rnorm)
2. Suppose that the random variable X has a certain distribution (given by a density function f(x) or a mass function p(x)). Then the mean of the distribution, µ, is often denoted by µ_X and is computed by
   µ_X = Σₓ x p(x)  or  µ_X = ∫₋∞^∞ x f(x) dx
3. Interpretation: theoretical long-range average of X as the experiment is repeated a lot.
4. Making new random variables from old: given a real-valued function g, define a new random variable Y = g(X) by applying g to the value of X on any outcome.
5. Our favorite functions: X², Xⁿ, e^X, log X.
6. We could, in principle, compute a mass function or density function for g(X) given one for X.
7. Examples:
8. Computing µ_Y where Y = g(X):
   µ_g(X) = Σₓ g(x) p(x)  or  µ_g(X) = ∫₋∞^∞ g(x) f(x) dx
9. Note that σ²_X is simply a special case where Y = (X − µ)².
10. Alternate useful notation for µ_g(X) is E(g(X)), which is read “expected value of g(X).”
11. Properties:
   (a) E(cX) = cE(X) for any constant c.
(b) E(X + a) = E(X) + a for any constant a.

Homework, Due Friday, March 4
1. Use R to generate 100 random samples of size 100 from an exponential distribution with parameter 1. Find the mean of each of the 100 samples. Compute a five-number summary of the set of the 100 means.
2. Suppose that X is a normal random variable with µ = 0 and σ² = 1. Let Y = e^X.
   (a) Compute P(0 ≤ Y ≤ 1). (Hint: First find c such that Y ≤ 1 if and only if X < c.)
   (b) Compute P(0 ≤ Y ≤ 3).
3. Here’s a silly experiment for you. It has two steps: first, a four-sided die (with the sides labelled 1,2,3,4) is thrown and the number c that appears is noted. Then c coins are thrown and the number X of heads is recorded.
   (a) What is the expected number of coins thrown?
   (b) What is the mass function of X?
   (c) What is the mean of X?
   (d) In what way are your answers to parts (a) and (c) consistent?

February 28 – Distributions of two random variables
1. Experiments with more than one associated random variable. Examples.
2. Two discrete variables:
   (a) Example: throw 2 dice and record the smaller number S and the larger number L.
   (b) Two discrete variables X and Y have a joint mass function p(x, y).
   (c) From the joint mass function we can compute individual mass functions, means, etc.
   (d) The special case of independence – p(x, y) = p_X(x) p_Y(y).
3. Two continuous variables:
   (a) Example: Math SAT score M and Verbal SAT score V.
   (b) Two continuous variables X and Y have a joint density function f(x, y).
   (c) From the joint density function we can compute individual density functions, means, etc.
   (d) The special case of independence – f(x, y) = f_X(x) f_Y(y).
   (e) Another special case, the bivariate normal distribution.
4. Extend all of the above to more variables and to mixtures of discrete and continuous variables.

Homework, Due Friday, March 4
1. Read Devore and Farnum pages 218–220 and pages 146–149.
2. Do problem 3.41.
3.
Suppose that X and Y are continuous random variables with f(x, y) = 2, 0 ≤ x ≤ y ≤ 1. Find P(1/2 ≤ X). (Hint: Draw a picture.)

March 3 — Sampling Distributions
1. The two examples for the day:
   (a) How many raisins in a box?
   (b) Pick a senior at random and report her GPA.
2. The setting for the remainder of the course:
   (a) We have a random experiment with associated random variable X. We call X the population random variable.
   (b) The distribution of X is not completely known. Often we make assumptions about its shape (e.g., X has a normal distribution).
   (c) We repeat the experiment n times, independently. The random variable for the ith repetition is called Xᵢ.
   (d) The variables X₁, X₂, . . . , Xₙ are independent and identically distributed (i.i.d.). Such a collection of random variables is called a random sample from the population X.
   (e) The values x₁, x₂, . . . , xₙ of these variables are the data.
   (f) We want to use the data to make inferences about X.
   (g) We compute statistics: a statistic is a function of the sample. For example, X̄ is a function of the n random variables X₁, . . . , Xₙ. Other examples: S², S, IQR, X̃.
   (h) A statistic also has a distribution that is also not completely known (but it is known in terms of the distribution of X).
3. The $64,000 question: What does the value of the statistic tell us about the population random variable?
4. For a while, we will work on the very simple (but very important) question: What does X̄ tell us about µ_X?

Homework, Due Tuesday, March 8
1. You do not need to do the homework problem that is on the outline from Monday.
2. Read Devore and Farnum, Section 5.5.

March 4 — Sampling Distributions
1. The setting for the remainder of the course:
   (a) We have a random experiment with associated random variable X. We call X the population random variable.
   (b) The distribution of X is not completely known.
Often we make assumptions about its shape (e.g., X has a normal distribution).
   (c) We repeat the experiment n times, independently. The random variable for the ith repetition is called Xᵢ.
   (d) The variables X₁, X₂, . . . , Xₙ are independent and identically distributed (i.i.d.). Such a collection of random variables is called a random sample from the population X.
   (e) The values x₁, x₂, . . . , xₙ of these variables are the data.
   (f) We want to use the data to make inferences about X.
   (g) We compute statistics: a statistic is a function of the sample. For example, X̄ is a function of the n random variables X₁, . . . , Xₙ. Other examples: S², S, IQR, X̃.
   (h) A statistic also has a distribution that is also not completely known (but it is known in terms of the distribution of X).
2. The $64,000 question: What does the value of the statistic tell us about the population random variable?
3. For a while, we will work on the very simple (but very important) question: What does X̄ tell us about µ_X?
4. Key fact: as the sample size gets larger, the sample mean, X̄, gets better as an approximation to µ_X.
5. Empirical evidence.

Homework, Due Tuesday, March 8
1. Read Devore and Farnum, Section 5.5, again.
2. One problem but many parts. We are going to investigate the sampling distribution of the sample mean of the beta distribution. Assume that X is a random variable that has a beta distribution with α = 3 and β = 5. (Normally, of course, we do not know the distribution of X. That’s the whole problem. But we can explore what happens when we know what the answer is to help us understand what to do when we don’t know what the answer is.) The mean of this beta distribution is µ_X = 3/8. (In general, the mean of a beta distribution is α/(α + β).)
   (a) What is the IQR of this distribution? (Remember, using qbeta will help.)
   (b) Generate 1000 random numbers from this beta distribution. Each of these 1000 numbers could be considered a random sample of size 1.
Save these 1000 numbers in a vector called one. Give a five-number summary of these 1000 numbers. What is the IQR of these 1000 numbers? Compare this number to your answer in part (a).
   (c) Now generate 1000 random samples of size 2 from this distribution and compute the sample mean of each of those samples of size 2. Save these sample means in a vector called two. Give a five-number summary of these 1000 sample means. What is the IQR of these 1000 numbers?
   (d) Now generate 1000 random samples of size 5 from this distribution and again compute the sample mean of each of those samples of size 5. Save these sample means in a vector called five. Again, give a five-number summary of these 1000 sample means and compute the IQR.
   (e) Finally, repeat the process for samples of size 10. (Call the vector ten.)
   (f) Do a boxplot of the vectors one, two, five, ten on the same plot.
   (g) On your boxplot, sketch what you think the boxplot would look like if we took samples of size 20. Defend your sketch.

March 7 — Sampling Distribution of X̄
1. Setting:
   (a) X is a random variable with unknown distribution.
   (b) X₁, . . . , Xₙ is a random sample from X (independent and identically distributed random variables).
   (c) X̄ = (X₁ + · · · + Xₙ)/n, the sample mean, is an estimate of µ_X.
2. Important facts about the distribution of X̄.
   Theorem 1. If X is any distribution and X̄ is the sample mean of a sample of size n, then
   (a) µ_X̄ = µ_X (sometimes written E(X̄) = E(X))
   (b) σ²_X̄ = σ²_X/n (sometimes written Var(X̄) = Var(X)/n)
   Theorem 2. If X is a normal distribution, then the distribution of X̄ is also normal.
   Theorem 3 (The Central Limit Theorem). If X is a distribution and X̄ₙ is the sample mean of a sample of size n, then the limit of the distributions of X̄ₙ as n goes to infinity is a normal distribution.
3. The Central Limit Theorem says that if the sample size is “large enough,” then the distribution of the sample mean is approximately normal.
4.
The binomial distribution and its relation to the sample mean.

Homework, Due Friday, March 11
1. Read Devore and Farnum, Section 5.6.
2. Do problems 5.46,48,49,50,52

March 8 - Estimators
1. Setting:
   (a) X is a random variable with unknown distribution.
   (b) X₁, . . . , Xₙ is a random sample from X (independent and identically distributed random variables).
2. Problem. We want to estimate a parameter θ of the distribution of X.
3. Estimators and estimates.
   Definition 1. An estimator of θ is a statistic θ̂ used to estimate θ.
   Definition 2. An estimate of θ is the value of the estimator for a given sample.
   Examples of estimators: X̄ is an estimator of µ, S² is an estimator of σ², S is an estimator of σ.
4. Properties of estimators.
   (a) unbiased:
       Definition 3. An estimator θ̂ is unbiased if µ_θ̂ = θ.
       Examples: X̄ is an unbiased estimator of µ and S² is an unbiased estimator of σ², but S is a biased estimator of σ.
   (b) small variance: we would also like an estimator to have small variance. Important example: X̄ is the minimum variance unbiased estimator of µ!
   (c) consistent: (the estimator gets better as n gets larger)
       Definition 4. An estimator θ̂ₙ for θ that depends on the sample size n is consistent if for every ε > 0, lim_{n→∞} P(|θ̂ₙ − θ| > ε) = 0.
5. Important example – X̄ in disguise in the binomial distribution. If Y is a binomial random variable with parameters n (known) and π (unknown), estimate π by y/n. Y/n has mean π (is unbiased) and variance π(1 − π)/n.

Homework, Due Friday, March 11
1. Read Devore and Farnum, Section 7.1
2. Do problems 7.2,3,4,6 of Devore and Farnum.

March 10 - An introduction to confidence intervals
1. Setting:
   (a) X is a random variable with unknown distribution.
   (b) X₁, . . . , Xₙ is a random sample from X (independent and identically distributed random variables).
2. Key fact used: for large n, X̄ has a distribution that has mean µ, variance σ²/n, and that is approximately normal.
Therefore the following random variable is approximately normal with mean 0 and standard deviation 1:
   Z = (X̄ − µ)/(σ/√n)
3. Using Z and algebra we have
   P(X̄ − 1.96 σ/√n < µ < X̄ + 1.96 σ/√n) ≈ .95
   (The symbol ≈ means approximately equal; the approximation is because of the central limit theorem. If X is normal, then this probability statement is exact.)
4. But σ is not known. If n is large, use S to approximate σ. Then
   P(X̄ − 1.96 S/√n < µ < X̄ + 1.96 S/√n) ≈ .95
   There are now two approximations here, one using the CLT and the other using S to approximate σ.
5. The interval [x̄ − 1.96 s/√n, x̄ + 1.96 s/√n] is called a 95% confidence interval for µ. Important: our confidence is not in the interval but in the procedure for producing the interval. Approximately 95% of the 95% confidence intervals that we produce will successfully capture µ.
6. Other confidence intervals: other percentages; one-sided.
7. Work to do: What’s with all these approximations?

Homework, Due Tuesday, March 22
1. Read Devore and Farnum, Section 7.2
2. Do problems 7.8,9,10,12,14 of Devore and Farnum.

March 21 - Confidence intervals using t-distribution
1. Setting:
   (a) X is a random variable with unknown distribution.
   (b) X₁, . . . , Xₙ is a random sample from X (independent and identically distributed random variables).
2. Review. An approximate confidence interval for µ if n is large is given by
   X̄ ± z* S/√n
   where z* is chosen based on the level of confidence desired (for 95% use 1.96). The approximation is due to the CLT (if X is not normal) and to approximating σ by s.
3. Assume that X is normal. Then the distribution of (X̄ − µ)/(S/√n) is known exactly. It is a t-distribution with parameter n − 1.
4. The t-distribution has one parameter k called the degrees of freedom. The distribution is unimodal and symmetric with mean 0 and variance k/(k − 2) for k > 2.
5. An exact confidence interval for µ is given by
   X̄ ± t* S/√n
   where t* is chosen based on the level of confidence desired.
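The claim that roughly 95% of such intervals capture µ can be checked by simulation of the t-interval just described. A sketch in R (the normal population with µ = 10 and σ = 3, and the sample size 25, are arbitrary choices made up for illustration):

```r
# Build 1000 t-based 95% confidence intervals from samples of size 25
# drawn from a N(10, 3) population, and count how many capture mu.
mu <- 10
n  <- 25
covered <- replicate(1000, {
  x    <- rnorm(n, mean = mu, sd = 3)
  half <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  mean(x) - half < mu && mu < mean(x) + half
})
mean(covered)   # typically close to 0.95
```

Here the confidence is in the procedure: each run produces a different interval, and about 95% of them succeed.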
e.g., for a 95% confidence interval, t* = qt(.975,n-1).
6. For X not normal but n reasonably large, the t-distribution can be used to construct confidence intervals (and probably should be, instead of the Z distribution used above). The t-distribution is robust with respect to violation of the normality hypothesis.
7. The R command t.test computes confidence intervals using the t-distribution.

Homework, Due Thursday, March 24 (note the due date)
1. Read Devore and Farnum, Section 7.4, pages 313–316.
2. The problems of March 10 are due Thursday instead of Tuesday.
3. Do problems 7.36,38,39,40a
4. For the iris data, write 90% confidence intervals for the mean of the sepal length of each of the three species. Does it seem likely that the true means for the three species are actually equal?

March 22 - Confidence intervals using t-distribution, two samples
1. Setting:
   (a) X₁, X₂ are normal random variables with unknown distributions and means µ₁ and µ₂.
   (b) X₁₁, . . . , X₁ₙ is a random sample from X₁ and X₂₁, . . . , X₂ₘ is a random sample from X₂, and all variables are independent.
   (c) We want to make inferences about µ₁ − µ₂.
   (d) Usual application: two treatments or treatment and control group. Example: random dot stereograms.
2. Aside.
   Theorem 1. If Y and Z are independent, then the random variables Y ± Z have means µ_Y ± µ_Z and variance σ²_Y + σ²_Z.
   Application: X̄₁ − X̄₂ has mean µ₁ − µ₂ and variance σ₁²/n + σ₂²/m.
3. Crucial t-distribution fact: the random variable
   ((X̄₁ − X̄₂) − (µ₁ − µ₂)) / √(S₁²/n + S₂²/m)
   has a distribution that is approximately t with degrees of freedom df given by a nasty formula (see book page 322). The approximation is best when the variables are approximately normal, the variances of X₁ and X₂ are approximately equal, and the sample sizes are approximately equal. (The biggest deviations occur when the sample sizes are small and/or the variances are quite different.)
4.
4. Conclusion: (X̄1 − X̄2) ± t* √(S1²/n + S2²/m) gives a confidence interval for µ1 − µ2.
5. The R command t.test(x1,x2,...) computes these confidence intervals.
6. New Setting:
(a) X1, X2 are normal random variables with unknown distributions and means µ1 and µ2.
(b) X11, . . . , X1n is a random sample from X1 and X21, . . . , X2n is a random sample from X2, but X1i and X2i are dependent.
(c) Want to make inferences about µ1 − µ2.
(d) Usual application: two treatments with a blocking variable, so that X1i and X2i are not independent but rather represent the two treatments applied for a fixed value of a blocking variable.
7. Consider the random variable Y = X1 − X2. This random variable has mean µ1 − µ2. Consider Yi = X1i − X2i as a random sample from a population with mean µ1 − µ2 and unknown variance. Use the one-sample t-distribution.
8. Example: picking stocks by throwing darts.
Homework, Due Tuesday, March 29
1. Read Devore and Farnum, Section 7.5.
2. Do problems 7.49, 7.50, 7.52a.
3. Refer to the "darts" data introduced in class (and available from the data page of the class website). Write a confidence interval useful for determining whether the PROS rate of return was better than that of the Dow Jones Industrial Index (column DJIA of the data).
March 16 — Confidence intervals for proportions
1. Suppose x is the number of successes in n trials of a Bernoulli process with probability of success π. Then x has the binomial distribution with parameters n and π.
2. π̂ = p = x/n is an unbiased estimator of π.
3. The distribution of p has mean π and variance π(1 − π)/n.
4. For large n, the central limit theorem then implies that the following random variable is approximately normal with mean 0 and variance 1:
(p − π)/√(π(1 − π)/n)
so a 95% confidence interval can be found by solving
−1.96 < (p − π)/√(π(1 − π)/n) < 1.96
for an interval of the form a(p) < π < b(p).
5.
Three different confidence intervals based on this:
(a) Sloppy: replace π in the denominator by p (some books do this).
(b) R (prop.test): solve the inequality directly for an interval of the form a(p) < π < b(p). (Use the quadratic formula.)
(c) R (with continuity correction, the prop.test default): solve the inequality after realizing that x is really between x − 1/2 and x + 1/2.
6. Exact confidence intervals are also available in R (binom.test). These are called exact but are not really exact. They do have the property, however, that a 95% confidence interval is at least a 95% confidence interval.
7. Confidence intervals for differences in proportions can also be computed by the same approximations (R prop.test).
8. Determining sample sizes to ensure given confidence levels. (See Gallup poll.)
Homework
1. Read Devore and Farnum, Section 7.3, pages 303–306.
2. Do problems 7.22, 24, 26, 28.
March 28 — The bootstrap
1. In estimating a parameter θ we have the following two questions:
Question 1. What estimator θ̂ should we use?
Question 2. How accurate is it?
The answer to the second question could include an estimate of the variance of θ̂ and, preferably, information about the distribution of θ̂.
2. We have seen that for the mean µ we have good answers to the two questions. But for other parameters (e.g., the population median µ̃) or for small sample sizes we don't always have answers.
3. If we had many samples (so a sample of values of the estimator), we could estimate the variance of the estimator by the sample variance. But we only have one sample. Big idea: use the one sample that we have to generate lots of samples.
4. A bootstrap method for finding 95% confidence intervals. Suppose that x1, . . . , xn is a fixed random sample with the underlying distribution unknown. Let θ̂(x1, . . . , xn) be an estimator of θ.
(a) For a large B, choose B samples of size n from x1, . . . , xn with replacement. Denote the ith such sample by xi1, . . . , xin.
(b) Compute the B values of the estimator θ̂i = θ̂(xi1, . . . , xin).
(c) Compute the .025- and .975-quantiles l, u of the numbers θ̂1, . . . , θ̂B.
(d) The confidence interval for θ is (l, u).
5. Evidence suggests that the bootstrap works in many situations.
6. The theory is that the sample x1, . . . , xn gives us an approximation of the density function of the unknown random variable. The empirical density function determined by the sample is the density f such that f(xi) = 1/n for all i. Bootstrap samples are simply samples from a distribution that has the empirical density function for its density.
Homework
1. Read Devore and Farnum, Section 7.6, pages 334–337.
2. Do problems 7.56, 79.
March 29 — Linear Models
1. Setting: bivariate data: (x1, y1), (x2, y2), . . . , (xn, yn).
2. Problem: fit a function y = f(x) to the data.
3. Special case: fit a linear function y = a + bx.
4. Predicted value for fixed a, b: ŷi = a + bxi.
5. Residuals: ei = yi − ŷi.
6. Linear regression: choose a, b to minimize Σ ei².
7. Notation:
Sxx = Σ (xi − x̄)²
Syy = Σ (yi − ȳ)²
Sxy = Σ (xi − x̄)(yi − ȳ)
SSTot = Syy
SSResid = Σ ei²
SSRegress = Σ (ŷi − ȳ)²
8. The following minimize SSResid: b = Sxy/Sxx and a = ȳ − bx̄.
Homework
Read Devore and Farnum, Section 3.3, pages 114–117. A dataset containing some statistics on baseball teams competing in the 2003 American League baseball season can be found on the data page of the course website. Suppose that you want to predict the number of runs scored (R) by a team just from knowing how many home runs (HR) the team has.
1. Write R as a linear function of HR.
2. Report SSResid.
3. Plot the residuals as a function of HR.
4. Compute the predicted values for each of the teams. Make some comments on the fit. (For example, are there any values not particularly well fit? Do you have any explanations for that?)
April Fool's Day – Linear Regression
1.
(a) Given: bivariate data: (x1, y1), (x2, y2), . . . , (xn, yn).
(b) Problem: fit a line y = a + bx to the data.
2. Solution: choose a and b to minimize the sum of the squared residuals. That is, minimize
SSResid = Σ (yi − (a + bxi))² = Σ ei²
(Minimize by computing ∂/∂a and ∂/∂b, setting them equal to 0, and solving for a and b.) Result:
b = Sxy/Sxx = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²,  a = ȳ − bx̄
3. Why minimize SSResid as opposed to some other function of the ei? Hold that thought.
4. Notation:
Sxx = Σ (xi − x̄)², sx² = Sxx/(n − 1)
Syy = Σ (yi − ȳ)², sy² = Syy/(n − 1)
Sxy = Σ (xi − x̄)(yi − ȳ)
SSTot = Syy
SSRegress = Σ (ŷi − ȳ)²
5. The following relationship is not obvious (think algebra):
SSTot = SSRegress + SSResid
Interpreting this equation: we want to explain the variation measured by SSTot. Our explanation is measured by SSRegress. The unexplained variation is SSResid.
6. The coefficient of determination, denoted R², measures the explained variation:
R² = SSRegress/SSTot
Note that 0 ≤ R² ≤ 1. Higher values represent a better model. R² is usually reported as a percentage and often appears in expressions such as "x explains 56% of the variation in y."
7. The correlation coefficient r is defined by
r = Sxy / √(Sxx Syy)
Note that r² = R², so r = ±√R². Thus −1 ≤ r ≤ 1.
8. The regression equation can be rewritten as
(y − ȳ)/sy = r (x − x̄)/sx
Homework
Read Devore and Farnum, Section 3.3.
1. Suppose that we wish to fit a linear model without a constant: i.e., y = bx. Write down an expression for the value of b that minimizes the sum of squares of residuals in this case. (Hint: there is only one variable here, b, so this is a straightforward Mathematics 161 max–min problem.) R will compute b in this case as well with the command lm(y∼x-1). In this expression, 1 stands for the constant term and -1 therefore means leave it out. Alternatively we can write lm(y∼x+0).
2.
Refer to the same baseball dataset as the last homework, the 2003 American League baseball season. Can we predict the number of wins (W) that a team will have from the number of runs (R) that the team scores?
(a) Write W as a linear function of R.
(b) A better model takes into account the runs that a team's opponents have scored as well. Write W − L as a function of R − OR (here L is losses and OR is opponents' runs scored). Compare this model with the previous one by comparing the values of R² (not runs squared, but the coefficient of determination!).
(c) Why might it make sense, from the meaning of the variables W − L and R − OR, to use a linear model without a constant term as in problem 1? Write W − L as a linear function of R − OR without a constant term.
(d) Compare the results of parts (b) and (c) as to the goodness of the model. It is best, when comparing a model without a constant term to one with a constant term, to compare the residuals rather than the values of R².
April 4 – The linear model
1. Review:
(a) Given: bivariate data: (x1, y1), (x2, y2), . . . , (xn, yn).
(b) Problem: fit a line y = a + bx to the data.
(c) Solution: choose a and b to minimize the sum of the squared residuals. That is, minimize
SSResid = Σ (yi − (a + bxi))² = Σ ei²
(d) A measure of fit, R²:
R² = SSRegress/SSTot, where SSTot = Syy = Σ (yi − ȳ)² and SSRegress = Σ (ŷi − ȳ)²
2. A statistical model: y = α + βx + ε, where
(a) ε is a random variable with mean 0 and variance σ²
(b) α, β, σ² are (unknown) parameters
(c) additionally, when we want to write confidence intervals, ε is normal.
3. The data (x1, y1), . . . , (xn, yn) are assumed to arise by choosing a random sample ε1, . . . , εn. The random variables ε1, . . . , εn are assumed to be independent and all have the same distribution as ε. (Note that this really makes yi a random variable but xi is not.) We can write
µY|x = α + βx,  σ²Y|x = σ²
4.
This model is just a generalization of one that we have seen before: we could view a random sample y1, . . . , yn as arising from the model y = µ + ε.
Homework
Read Devore and Farnum, Section 11.1, pages 488–492. No problems!
April 5 – The linear model
1. The standard statistical linear model: yi = α + βxi + ei, where
(a) ei is a random variable with mean 0 and variance σ² (we should really use uppercase E; some authors use ε)
(b) α, β, σ² are (unknown) parameters
(c) the random variables e1, . . . , en are independent
(d) to write confidence intervals we will later assume that ei is normal.
2. Note that these assumptions imply that yi is a random variable (so we should write Yi) but xi is not. We have
µYi = α + βxi,  σ²Yi = σ²
3. This model is just a generalization of one that we have seen before: we could view a random sample y1, . . . , yn as arising from the model Yi = µ + ei.
4. Estimating parameters. The least squares estimates of α and β are good estimates. There is also a natural estimator of σ² (but note the n − 2):
β̂ = Sxy/Sxx,  α̂ = ȳ − β̂x̄,  σ̂² = se² = SSResid/(n − 2)
5. These estimators are each unbiased.
6. Variances:
σβ̂² = σ²/Σ (xi − x̄)² = σ²/Sxx,  σα̂² = σ²(1/n + x̄²/Sxx)
7. α̂ and β̂ are the "best" estimators in a large class of estimators. For the following theorem, we do not need to assume that the errors have a normal distribution (or even that they have the same distribution). We need only assume that they have mean 0 and the same variance σ², and that they are independent.
Theorem 1 (Gauss–Markov). The estimators α̂ and β̂ are linear functions of the yi. Among all unbiased estimators of α and β that are linear in the yi, α̂ and β̂ have minimum variance. (We say α̂ and β̂ are BLUE: best linear unbiased estimators.)
See other side for homework.
Homework
Read Devore and Farnum, Section 11.1, and Section 3.4 (pages 129–130).
1. If y is (approximately) linearly related to x, then x is (approximately) linearly related to y.
Use the state data to write Life.Exp as a function of Murder. Likewise write Murder as a linear function of Life.Exp.
(a) Plot the data and both lines on one plot. (You can use R to do this but it might require a bit of thought. Else use R to plot the data and one line and add the other line in by hand. You know how to sketch lines given their equations.)
(b) Why aren't the lines the same? (Obviously, if the data were on a straight line, the lines would be the same.)
2. Sections 11.1 and 3.4 show how to transform data given by (xi, yi) to (x′i, y′i) if the hypothesized relationship of x and y is a particular nonlinear relationship. Show how to define new variables x′, y′ to transform the following nonlinear relationships to linear relationships.
(a) y = a/(b + cx)
(b) y = ae^(−bx)
(c) y = ab^x
(d) y = x/(a + bx)
(e) y = 1/(1 + e^(bx))
3. Sometimes the experimenter has control over the points xi in the random experiment. If the experimenter can choose the points xi, she might very well choose these points to minimize the variance of the estimators α̂ and β̂. Suppose that the experiment will collect 10 observations (xi, yi) and can set the values of xi to be any real number in the interval [−5, 5]. (Hint: you might want to recall that Sxx = Σ xi² − nx̄².)
(a) How should an experimenter choose the ten values xi to minimize the variance of β̂?
(b) There are other considerations in choosing the values of xi. Why might the choice in part (a) not be a very good one?
April 7 – Estimating parameters in the linear model
1. The standard statistical linear model: yi = α + βxi + ei, where
(a) ei is a random variable with mean 0 and variance σ² (we should really use uppercase E; some authors use ε)
(b) α, β, σ² are (unknown) parameters
(c) the random variables e1, . . . , en are independent
(d) to write confidence intervals we will later assume that ei is normal.
2. Estimating parameters.
(a) Estimators:
β̂ = Sxy/Sxx,  α̂ = ȳ − β̂x̄,  σ̂² = se² = SSResid/(n − 2)
Important note: from now on we will use the book's convention and write a for α̂ and b for β̂. (We should really use uppercase for the estimators and then lowercase for the estimates.)
(b) These estimators are each unbiased.
(c) Variances:
σb² = σ²/Σ (xi − x̄)² = σ²/Sxx,  σa² = σ²(1/n + x̄²/Sxx)
3. a and b are the "best" estimators in a large class of estimators.
Theorem 1 (Gauss–Markov). The estimators a and b are linear functions of the yi. Among all unbiased estimators of α and β that are linear in the yi, a and b have minimum variance. (We say a and b are BLUE: best linear unbiased estimators.)
4. Writing confidence intervals for β.
(a) Assume that ei is normally distributed. Then b is normally distributed.
(b) Estimate σ² by se².
(c) Estimate σb² by sb² = se²/Sxx.
(d) sb is called the standard error of the estimate of β.
(e) Key fact: the statistic (b − β)/sb has a t-distribution with n − 2 degrees of freedom.
(f) Confidence interval. Let t be the appropriate critical value for the desired confidence level and n − 2 degrees of freedom. The following is a confidence interval for β: (b − t·sb, b + t·sb).
Homework
Read Devore and Farnum, Section 11.2, pages 501–502. No problems.
April 8 – Confidence intervals for parameters
1. The standard statistical linear model: yi = α + βxi + ei, where
(a) ei is a random variable with mean 0 and variance σ²
(b) α, β, σ² are (unknown) parameters
(c) the random variables e1, . . . , en are independent
(d) ei has a normal distribution (assumption made to construct confidence intervals).
2. Estimators:
Parameter | Estimate | Standard deviation | Standard deviation estimate
σ | se = √(SSResid/(n − 2)) | |
β | b = Sxy/Sxx | σ/√Sxx | sb = se/√Sxx
α | a = ȳ − bx̄ | σ√(1/n + x̄²/Sxx) | sa = se√(1/n + x̄²/Sxx)
α + βx | ŷ = a + bx | σ√(1/n + (x − x̄)²/Sxx) | sŷ = se√(1/n + (x − x̄)²/Sxx)
3. Confidence intervals. Let t* be the appropriate critical value for df = n − 2. The following are confidence intervals.
Parameter | Interval
α | a ± t*·sa
β | b ± t*·sb
α + βx | (a + bx) ± t*·sŷ
4. Predicting a single y. For a fixed x* we predict y* = α + βx* + e* by ŷ* = a + bx*. Then ŷ* − y* has mean 0 and variance σ²(1 + 1/n + (x* − x̄)²/Sxx). Thus a "prediction interval" for y* is
ŷ* ± t*·se·√(1 + 1/n + (x* − x̄)²/Sxx)
Homework
Read Devore and Farnum, Section 11.3.
1. Do problems 11.18, 22.
2. Return to the state data. Again use murder rate to predict life expectancy.
(a) Estimate β. Write a sentence that expresses the relationship between murder rate and life expectancy using this estimate.
(b) Estimate α. What is a 90% confidence interval for α? Write a sentence interpreting α.
(c) Suppose that the murder rate is 9.0. Write a 95% confidence interval for the mean life expectancy for such states.
(d) Suppose that the murder rate is 4.0. Write a 90% prediction interval for a "new" state that has murder rate 4.0.
April 11 - Don't fit a line if a line doesn't fit
1. Computing multiple confidence intervals may increase the chance of error.
2. Our estimates a and b for α and β are not independent. We can generate a 95% confidence ellipsoid for α and β (using the ellipse command of the ellipse package in R).
3. Three potential sources of problems in using the linear model:
(a) the standard linear statistical model is not right (linearity, homoscedasticity, independence)
(b) the distributional assumption (normality) is wrong
(c) the data are "bad."
4. Examples of line-fitting.
5. One important diagnostic tool: plot the residuals (versus x or y).
6. Influential data. (Use of the R commands dfbeta and dfbetas. dfbeta returns the amount the estimates of α and β would change if an individual observation were left out of the regression; dfbetas returns the same value, but the units are numbers of standard deviations.)
Homework
Read Devore and Farnum, Section 3.3, pages 123–125.
1. Do problem 3.26 of Devore and Farnum. The data can be found on the CD or in the built-in R dataset data(anscombe).
2.
Suppose that a researcher generates 95% confidence intervals for 10 different parameters as the result of the analysis of a set of data. Suppose that the random variables generating these confidence intervals are not necessarily independent. What confidence level should the researcher use to ensure that the probability that all the confidence intervals contain their respective parameters is at least 95%?
3. Use the built-in state dataset again. (Recall that you retrieve the data by data(state) and then s=data.frame(state.x77). The variable s will then be a data.frame with the desired data.)
(a) Write life expectancy as a linear function of illiteracy rate.
(b) Plot the residuals. Are there any features of the residual plot that indicate a violation of the linear model assumptions?
(c) Which observation has the most influence on the coefficients in the regression equation?
4. A built-in dataset obtained by data(women) gives the average weight for women of each height.
(a) Write weight as a linear function of height.
(b) Plot the residuals. Are there any features of the residual plot that indicate a violation of the linear model assumptions?
(c) Can you find a transformation of the data that fits a linear model better?
April 12 - Checking normality, other lines
1. Quantile–quantile plots: a visual check for normality (qqnorm, or plot in linear regression). In regression, the standardized residuals have a standard normal distribution if the hypothesis of normality is true.
2. Resistant regression. The standard choice is to minimize the sum of squares of the residuals:
Σ ei² = Σ (yi − ŷi)²
If f(e) = e², then the standard choice is to minimize Σ f(ei). Other choices for f are possible.
(a) f(e) = |e| gives what is called "least absolute deviations" (LAD) regression.
(b) The following is called Huber's method. It is essentially a compromise between least squares and LAD. (rlm in the MASS package implements this in R.)
For an appropriate choice of c, use the following function:
f(e) = e²/2 if |e| ≤ c,  f(e) = c|e| − c²/2 otherwise
(c) Another choice is to use f(e) = e² for the smallest residuals but f(e) = 0 for the others. This is analogous to the trimmed mean and is called "least trimmed squares." (ltsreg in the MASS package in R implements this.)
Homework
Read Devore and Farnum, Section 2.4 and Section 3.4, pages 133–134.
1. Continuing problem 3 of the last homework assignment,
(a) Plot life expectancy against illiteracy rate.
(b) Add to the plot the line produced by the regression.
(c) Add to the plot the line produced by rlm.
(d) Add to the plot the line produced by the lts regression.
(e) Comment on any significant differences that you see in the plots.
2. Continuing problem 3 of the last assignment, does the qq-plot of the standardized residuals indicate any departure from normality?
April 12 - Multiple linear regression
1. Setting:
(a) k independent variables x1, . . . , xk. One dependent variable y.
(b) n data points: (x11, . . . , xk1, y1), . . . , (x1i, . . . , xki, yi), . . . , (x1n, . . . , xkn, yn).
(c) Goal: fit a linear function y ≈ b0 + b1x1 + · · · + bkxk.
(d) Geometry: fit a k-dimensional plane to n points in (k + 1)-dimensional space.
(e) Remark: we write b0 instead of a so we can talk about the coefficients as b's. Note that b0 can be viewed as similar to the other bj's if we think of a new variable x0 where x0i = 1 for all i.
2. Least squares solution: choose b0, . . . , bk to minimize the sum of squares of residuals:
SSResid = Σ (yi − ŷi)², where ŷi = b0 + b1x1i + · · · + bkxki
3. "Interpretation" of coefficients: a change in xj of 1 unit means a change in y of βj units, all other variables xl being held constant. (But in applications, holding the other variables constant might not be realistic.)
4. Use of the R command lm to find the numbers b0, . . . , bk. (Example.)
5.
R² has the same meaning: the percentage of variation "explained" by the regression:
R² = SSRegress/SSTotal = 1 − SSResid/SSTotal
6. Generality of the linear model:
(a) We can use functions of variables as new variables. Example: polynomial regression (new variables x², x³, . . .).
(b) Variables can interact. Example: interaction terms (new variable x = xj·xl). Interpretation of interaction terms.
Homework
Read Devore and Farnum, Sections 3.5 and 3.4, pages 132–133.
1. Do problems 3.32, 34, 35, 36.
April 15 - The standard linear statistical model
1. Setting:
(a) k independent variables x1, . . . , xk. One dependent variable y.
(b) n data points: (x11, . . . , xk1, y1), . . . , (x1i, . . . , xki, yi), . . . , (x1n, . . . , xkn, yn).
(c) Goal: fit a linear function y ≈ b0 + b1x1 + · · · + bkxk.
(d) Least squares solution: choose b0, . . . , bk to minimize the sum of squares of residuals:
SSResid = Σ (yi − ŷi)², where ŷi = b0 + b1x1i + · · · + bkxki
2. The standard linear statistical model (lm in R):
(a) yi = β0 + β1x1i + · · · + βkxki + ei (so β0 + β1x1* + · · · + βkxk* is the mean value of y for a fixed tuple (x1*, . . . , xk*) of independent variables)
(b) the errors ei have mean 0 and variance σ², and are independent
(c) additionally, when confidence intervals are desired, the errors have a normal distribution.
3. The coefficients bj of the least squares solution are unbiased estimators of the parameters βj and have minimum variance among all linear unbiased estimators.
4. An unbiased estimator of σ² is SSResid/(n − (k + 1)). Thus we define se, the standard error, by
se = √(SSResid/(n − (k + 1)))
5. How many variables should one use? Some issues:
(a) Adding a variable will always increase R² since it will decrease SSResid.
(b) If a variable doesn't increase R² "very much" it might not be useful in the model. We will want to quantify this.
6. Side remark: inclusion of a categorical variable xj in the model by coding xj = 0, 1.
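A sketch of this 0–1 coding in R (the data below are invented for illustration; in practice R's factor handling does the coding automatically inside lm):

```r
# Invented data: y depends on a numeric x and a two-level group
x <- c(1, 2, 3, 4, 5, 6)
g <- c(0, 0, 0, 1, 1, 1)     # categorical variable coded 0/1 by hand
y <- c(2.1, 2.9, 4.2, 6.8, 7.9, 9.1)
fit <- lm(y ~ x + g)
coef(fit)                    # the coefficient of g shifts the mean for group 1
# lm(y ~ x + factor(g)) produces the same fitted coefficients
```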
Be careful in interpreting the coefficient βj!
Homework
Read Devore and Farnum, Section 11.4.
1. Do problems 11.26, 27, 28.
Patriot's Day – Multiple Linear Regression, Inferences
1. Setting:
(a) k independent variables x1, . . . , xk. One dependent variable y.
(b) n data points: (x11, . . . , xk1, y1), . . . , (x1i, . . . , xki, yi), . . . , (x1n, . . . , xkn, yn).
2. The standard linear statistical model (lm in R):
(a) yi = β0 + β1x1i + · · · + βkxki + ei (so β0 + β1x1* + · · · + βkxk* is the mean value of y for a fixed tuple (x1*, . . . , xk*) of independent variables)
(b) the errors ei have mean 0 and variance σ², and are independent
(c) the random variables ei have normal distributions.
3. Confidence intervals for the parameters βj.
(a) Statistical software gives estimates bj for βj and estimates sbj of the standard deviation of bj.
(b) Then (b − β)/sb has a t-distribution with n − (k + 1) degrees of freedom.
(c) A confidence interval for any β is b ± t*·sb, where t* is the critical value for a t-distribution with n − (k + 1) degrees of freedom.
(d) Similarly, we can get confidence intervals and prediction intervals for a fixed (x1*, . . . , xk*).
(e) Cautions: the declining confidence in multiple confidence intervals; the interpretation of a confidence interval in the presence of other coefficients.
4. An alternate way to interpret the output: testing the "hypothesis" that β = 0.
(a) If β = 0, then t = b/sb has a t-distribution with n − (k + 1) degrees of freedom.
(b) The probability that a random variable with a t-distribution is at least as great as t is called the p-value of the statistic.
(c) The p-value is a measure of how "surprising" it is that a t-value this extreme occurred. If the p-value is very small, we doubt that β = 0. Otherwise, we think that β = 0 is consistent with the data.
(d) Computing p-values is directly related to computing confidence intervals, and a p-value doesn't provide as much information as a confidence interval.
5. Adjusted R².
R² will increase as we add more variables to the model. To correct for this, there is a statistic called adjusted R²:
adj-R² = 1 − (SSResid/(n − (k + 1))) / (SSTotal/(n − 1))
Homework
Read Devore and Farnum, Section 11.5, pages 525–527 and 530–532.
1. In the data section of the course website is a datafile consisting of the scores of 32 students of Mathematics 222 on three tests and a final exam. In this question we investigate using the test scores to predict the final exam score. (After all, if the test scores do a good enough job, I wouldn't have to grade the final exam!)
(a) Write a linear function Exam = b0 + b1·Test1 + b2·Test2 + b3·Test3 that can be used to predict the final exam score from the three test scores.
(b) Write a 95% confidence interval for the parameter β1 in the model corresponding to our estimate b1.
(c) Do the p-values for the coefficients lead you to suspect that one or more of the βi are not very useful in the model? Explain.
(d) For each of the three independent variables, fit a linear function that does not include that variable. Compare the values of adj-R² for each of those models to each other and to the full model. Which model would you use to predict exam scores and why?
(e) If a student scores 85 on each test, what is the predicted exam score? What is a 90% confidence interval for that prediction?
April 19 — Choosing model variables
1. The standard linear statistical model (lm in R):
(a) yi = β0 + β1x1i + · · · + βkxki + ei
(b) the errors ei have mean 0 and variance σ², and are independent
(c) the random variables ei have normal distributions.
2. Problem: how many variables (of the k) should we keep? There is a tradeoff: more variables "explain" more variation but make a more complicated model.
3. Some solutions suggested by yesterday:
(a) Eliminate any variable with a "large" p-value (in other words, one for which the evidence in favor of β ≠ 0 is not strong).
(b) Find the collection of variables with the largest adjusted R².
4. One approach:
(a) Stepwise regression: add or subtract one variable at a time based on a criterion.
(b) We want to minimize some (increasing) function of SSResid and k. (As k or SSResid increases, the model becomes less desirable.)
(c) Candidates for the function to minimize (there are theoretical reasons for these):
AIC = n·ln(SSResid/n) + 2(k + 1)
BIC = n·ln(SSResid/n) + (k + 1)·ln n
(d) step in R does both AIC (the default) and BIC stepwise regression.
(e) Stepwise regression may not find the best model since it does not look at every subset of variables.
Homework
Read Devore and Farnum, Section 11.6, pages 538–545.
1. The data page at the course website has a dataset that includes statistics on five years of major-league baseball. We desire to take certain statistics computed on a per-game basis and use these to predict the number of runs per game (RG). The version of the dataset that is useful in this regard is the "per game" version. The variables are RG (runs), X1BG (singles), X2BG (doubles), X3BG (triples), HRG (home runs), SOG (strikeouts), SBG (stolen bases), CSG (caught stealing).
(a) Use all of the other variables in a linear function to predict RG.
(b) Based on the t-values in the preceding analysis, which variables are obvious candidates for removal from the model? Refit the model without these variables. Compute adjusted R² for the two models.
(c) Employ stepwise regression in R (step). Does step give the same result as the analysis in part (b)?
(d) Using the model from (c), what variable should be the next to remove? Refit the model without that variable and compare the adjusted R² of this model to the models of (b) and (c).
(e) If you know something about baseball, do the coefficients in the model of (c) "make sense"?
April 22 — Hypothesis Testing
1. Hypothesis testing is an alternate way to present inferences from data about the parameters of the underlying statistical model.
Hypothesis testing used to be used more heavily than it is now; in most circumstances there are better (more informative) inference methods, such as confidence intervals.
2. Hypotheses. A hypothesis proposes a possible state of affairs with respect to the distribution from which the data come. Examples:
(a) A hypothesis stating a fixed value of a parameter: µ = 0.
(b) A hypothesis stating a range of values of a parameter: µ ≥ 3.
(c) A hypothesis about the nature of the distribution itself: X has a normal distribution.
3. The hypotheses of an experiment.
(a) Null hypothesis. The null hypothesis, usually denoted H0, is generally a hypothesis that the data analysis is intended to investigate. It is usually thought of as the "default" or "status quo" hypothesis that we will accept unless the data give us substantial evidence against it.
(b) Alternate hypothesis. The alternate hypothesis, usually denoted H1 or Ha, is the hypothesis that we want to put forward as true if we have sufficient evidence against the null hypothesis.
(c) Possible decisions. On the basis of the data we will either reject H0 (in favor of H1) or fail to reject H0.
(d) Asymmetry. Note that H0 and H1 are not treated equally. The idea is that H0 is the default, and only if we are reasonably sure that H0 is false do we reject it in favor of H1. H0 is "innocent until proven guilty," and this metaphor from the criminal justice system is good to keep in mind.
4. How do we decide?
(a) We construct a "test statistic" T: that is, a number computed from the data that we will use to assess the likelihood that H0 is true.
(b) We announce a critical region: a set of possible values of the test statistic T that would cause us to reject H0.
(c) We choose the critical region so that
i. values of the test statistic in the critical region tend to support H1 over H0, and
ii. the size of the critical region is such that the probability of rejecting H0 if it is true is small.
5. Errors.
There are two types of errors, with extraordinarily obvious but uninteresting and uninformative names:
(a) Type I error. We reject H0 even though it is true. The probability of a Type I error is denoted by α.
(b) Type II error. We do not reject H0 even though it is false. The probability of a Type II error is denoted by β.
Both errors are bad, but the asymmetry of the hypotheses suggests that Type I errors are worse than Type II errors. Thus, we usually want to construct our test statistic and critical region so that α is small. This requires knowing the distribution of the test statistic T if H0 is true. This in turn restricts the kinds of hypotheses that we can choose for H0.
6. p-value. After the data are collected, we report the decision (reject or not) and, usually, the p-value, which is the probability that, if H0 is true, a value of the statistic at least as extreme in favor of H1 would occur.
Homework, Due Monday, April 25 (a short assignment but a short due-date)
1. Read Devore and Farnum, Section 8.1.
2. Do problems 8.2–8.
April 22 — Hypothesis Testing – M&M's
1. Advertised distribution of M&M colors:
brown 13%, red 13%, orange 20%, yellow 14%, blue 24%, green 16%
2. The multinomial distribution: k + 1 parameters n, π1, . . . , πk such that π1 + · · · + πk = 1. Interpretation: X1, . . . , Xk have a multinomial distribution if there are n independent trials of an experiment which can result in one of k distinct outcomes and Xi is the number of trials that result in the ith outcome.
3. Want to test the hypothesis
H0: π1 = .13, π2 = .14, π3 = .13, π4 = .24, π5 = .20, π6 = .16
or in general the hypothesis
H0: π1 = π1,0, . . . , πk = πk,0
4. Development of a test statistic. Given the result of the experiment x1, . . . , xk, we want a statistic T that measures deviation of the result from what would be expected if H0 is true.
5. Candidate:
6. Decision: reject H0 if T is too large.
7.
If the null hypothesis is true, then the distribution of T is approximately a distribution called the χ² (chi-square) distribution. It has one parameter, k − 1, called the degrees of freedom.
8. Decision: reject H0 if T exceeds the appropriate critical value of the chi-square distribution with k − 1 degrees of freedom.

Homework, Due Monday, May 2
1. Read Devore and Farnum 8.3, pages 373–376.
2. Do problems 8.42–44.

May 2 – Analysis of Variance
1. The setting for (one-way) analysis of variance (ANOVA).
(a) a dependent (response) variable (usually a continuous random variable)
(b) an independent (treatment, factor) variable that is categorical with several (at least two) categories called levels
(c) We want to know whether the different levels explain (some of) the variation in the dependent variable.
2. Related to what we have done before:
(a) difference in means between two different populations; the two-sample t-test is the case of two levels
(b) linear regression
3. Examples.
4. Useful graphical representations (side-by-side boxplots, stripcharts).
5. Explaining variation: pictures.
6. Formal setting:
(a) k (normal) populations with means µ1, . . . , µk respectively and common variance σ².
(b) The data xij is the jth observation of the ith population (1 ≤ i ≤ k, 1 ≤ j ≤ ni). The data are independent observations.
(c) The sample mean and standard deviation of the ith level are x̄i and si.
(d) The average of all the observations is x̿ (the grand mean).
7. The crucial role of randomization.

Homework, Due Friday, May 6
1. Read Devore and Farnum 9.1, pages 401–404.

The next three weeks
1. There will be no further homework assignments. Some problems will be assigned on current material, but these problems will form part of the take-home portion of the exam.
2. There will be an in-class portion to the final exam. It will be limited to questions taken from the first three tests (the in-class portions thereof), although I reserve the right to restate, modify, combine, or otherwise improve those questions. Additional information will be forthcoming.
3.
The take-home component will be distributed on Monday, May 9, and will be due on Thursday, May 19, at 5:00 PM.

May 2 – Analysis of Variance
1. Formal setting:
(a) k (normal) populations with means µ1, . . . , µk respectively and common variance σ².
(b) The data xij is the jth observation of the ith population (1 ≤ i ≤ k, 1 ≤ j ≤ ni). The data are independent observations.
(c) The sample mean and standard deviation of the ith level are x̄i and si.
(d) The average of all the observations is x̿ (the grand mean).
2. Sums of squares (inner sums over 1 ≤ j ≤ ni, outer sums over 1 ≤ i ≤ k):
   SSTotal = Σi Σj (xij − x̿)² = (n − 1)s²   (1)
   SSError = Σi Σj (xij − x̄i)² = Σi (ni − 1)si²   (2)
   SSTreatment = Σi Σj (x̄i − x̿)² = Σi ni (x̄i − x̿)²   (3)
3. The fundamental equation (the same as in linear regression, but read "Error" for "residual" and "Treatment" for "regression"):
   SSTotal = SSTreatment + SSError
4. Comparing SSTreatment to SSError. Define
   MSE = SSError/(n − k)   MSTr = SSTreatment/(k − 1)   F = MSTr/MSE
5. If H0: µ1 = · · · = µk is true, then F has an F-distribution with parameters (k − 1) and (n − k). Reject H0 if F is too large.
6. The role of randomization. Without assuming anything in particular about the underlying distributions, if the xij all have the same distribution (i.e., there is no difference in treatment effect by level), then the F statistic has a distribution that is approximately an F distribution.

Homework, Due Friday, May 6
1. Read Devore and Farnum 9.1, pages 403–406, and 9.2.

May 5 – Confidence intervals
1. Formal setting:
(a) k (normal) populations with means µ1, . . . , µk respectively and common variance σ².
(b) The data xij is the jth observation of the ith population (1 ≤ i ≤ k, 1 ≤ j ≤ ni). The data are independent observations.
(c) The sample mean and standard deviation of the ith level are x̄i and si.
(d) The average of all the observations is x̿ (the grand mean).
2. The problem of constructing confidence intervals: multiple comparisons.
3.
Basic facts:
(a) x̄i − x̄j has a normal distribution with mean µi − µj and variance σ²/ni + σ²/nj.
(b) se² = MSE is an unbiased estimator of σ².

4. Approach I: Bonferroni. If we want a family of k confidence intervals at 100(1 − α)%, construct 100(1 − α/k)% confidence intervals. Guaranteed, but conservative.

5. Approach II: Tukey Honest Significant Differences intervals. (Who was Tukey? What is honest?) Develop intervals based on the distribution of the range of a set of normally distributed numbers. Intervals for µi − µj have the form
   (x̄i − x̄j) ± q* √( (MSE/2)(1/ni + 1/nj) )
where q* is a critical value that depends on the studentized range distribution (two parameters, k and n − k).

Homework, Due Tuesday, May 10
1. Read Devore and Farnum Section 9.3, pages 416–420.
2. This is Problem 1 on the take-home final exam. Twenty-four rats (as identical as rats are) were each fed one of four diets (imaginatively labeled A, B, C, D). The time (in seconds) for their blood to coagulate was recorded. The data are available from the course webpage.
(a) Is it reasonable to believe that diet had no effect on blood coagulation? Explain.
(b) Identify which, if any, pairs of diets seem to cause different blood coagulation times.
(c) There are only six cases per diet. This is a very small number with which to check the assumptions underlying the techniques that you used to answer the first two parts. Nevertheless, do you see anything in the data that gives you any concerns about these assumptions?

May 6 – Two-way ANOVA
1. Problem
(a) a dependent (response) variable, x, which is continuous
(b) two different independent categorical variables (treatment and block, or two treatments)
(c) We want to know whether either or both categorical variables have an effect on the mean value of x.
2. Data: xijk with i ≤ a, j ≤ b, k ≤ nij. If all nij are equal, we call the experiment "balanced" and use r (for replications) to denote nij.
3.
Model (not in book):
   xijk = µ + αi + βj + eijk,  with Σ(i=1..a) αi = 0 and Σ(j=1..b) βj = 0,
where the errors eijk are independent, normally distributed, with mean 0 and variance σ².

4. Estimates (using different notation than that of the book; dots refer to the variables averaged over):
   parameter   estimate
   µ           x̄...
   αi          x̄i.. − x̄...
   βj          x̄.j. − x̄...
   σ²          MSE

5. Sum of squares analysis (sums over all i, j, k):
   SSTotal = Σ (xijk − x̄...)²
   SSA = Σ (x̄i.. − x̄...)²
   SSB = Σ (x̄.j. − x̄...)²
   SSE = Σ (xijk − (x̄i.. + x̄.j. − x̄...))²

6. Mean squares:
   MSE = SSE/(abr − a − b + 1)   MSA = SSA/(a − 1)   MSB = SSB/(b − 1)

7. F-tests: If H0: α1 = · · · = αa = 0 is true, then F = MSA/MSE has an F-distribution with the appropriate numbers of degrees of freedom.

8. TukeyHSD works.

Homework, Due Tuesday, May 10
1. Read Devore and Farnum Section 9.4.

May 9 – Two-way ANOVA with interaction
1. Problem
(a) a dependent (response) variable, x, which is continuous
(b) two different independent categorical variables (treatment and block, or two treatments)
(c) We want to know whether either or both categorical variables have an effect on the mean value of x.
2. Data: xijk with i ≤ a, j ≤ b, k ≤ r. Must have r > 1.
3. Model: xijk = µ + αi + βj + γij + eijk, with the usual assumptions about eijk.
4. Sum of squares analysis (sums over all i, j, k):
   SSTotal = Σ (xijk − x̄...)²
   SSA = Σ (x̄i.. − x̄...)²
   SSB = Σ (x̄.j. − x̄...)²
   SSAB = Σ (x̄ij. − (x̄i.. + x̄.j. − x̄...))²
   SSE = Σ (xijk − x̄ij.)²
5. Mean squares:
   MSE = SSE/(ab(r − 1))   MSA = SSA/(a − 1)   MSB = SSB/(b − 1)   MSAB = SSAB/((a − 1)(b − 1))
6. TukeyHSD works.

Homework
1. Read Devore and Farnum, Section 10.2.

R Reference Card
by Tom Short, EPRI PEAC, [email protected], 2004-11-07. Granted to the public domain. See www.Rpad.org for the source and latest version. Includes material from R for Beginners by Emmanuel Paradis (with permission).

Getting help
Most R functions have online documentation.
help(topic) documentation on topic
?topic id.
help.search("topic") search the help system
apropos("topic") the names of all objects in the search list matching the regular expression "topic"
help.start() start the HTML version of help
str(a) display the internal *str*ucture of an R object
summary(a) gives a "summary" of a, usually a statistical summary, but it is generic, meaning it has different operations for different classes of a
ls() show objects in the search path; specify pat="pat" to search on a pattern
ls.str() str() for each variable in the search path
dir() show files in the current directory
methods(a) shows S3 methods of a
methods(class=class(a)) lists all the methods to handle objects of class a

Input and output
load() load the datasets written with save
data(x) loads specified data sets
library(x) load add-on packages
read.table(file) reads a file in table format and creates a data frame from it; the default separator sep="" is any whitespace; use header=TRUE to read the first line as a header of column names; use as.is=TRUE to prevent character vectors from being converted to factors; use comment.char="" to prevent "#" from being interpreted as a comment; use skip=n to skip n lines before reading data; see the help for options on row naming, NA treatment, and others
read.csv("filename",header=TRUE) id. but with defaults set for reading comma-delimited files
read.delim("filename",header=TRUE) id. but with defaults set for reading tab-delimited files
read.fwf(file,widths,header=FALSE,sep="",as.is=FALSE) read a table of fixed width formatted data into a data frame; widths is an integer vector giving the widths of the fixed-width fields
save(file,...) saves the specified objects (...) in the XDR platform-independent binary format
save.image(file) saves all objects
cat(..., file="", sep=" ") prints the arguments after coercing to character; sep is the character separator between arguments
print(a, ...) prints its arguments; generic, meaning it can have different methods for different objects
format(x,...) format an R object for pretty printing
write.table(x,file="",row.names=TRUE,col.names=TRUE,sep=" ") prints x after converting to a data frame; if quote is TRUE, character or factor columns are surrounded by quotes ("); sep is the field separator; eol is the end-of-line separator; na is the string for missing values; use col.names=NA to add a blank column header to get the column headers aligned correctly for spreadsheet input
sink(file) output to file, until sink()
Most of the I/O functions have a file argument. This can often be a character string naming a file or a connection. file="" means the standard input or output. Connections can include files, pipes, zipped files, and R variables.
On Windows, the file connection can also be used with description="clipboard". To read a table copied from Excel, use x <- read.delim("clipboard"). To write a table to the clipboard for Excel, use write.table(x,"clipboard",sep="\t",col.names=NA).
For database interaction, see packages RODBC, DBI, RMySQL, RPgSQL, and ROracle. See packages XML, hdf5, netCDF for reading other file formats.

Data creation
c(...) generic function to combine arguments with the default forming a vector; with recursive=TRUE descends through lists combining all elements into one vector
from:to generates a sequence; ":" has operator priority; 1:4 + 1 is "2,3,4,5"
seq(from,to) generates a sequence; by= specifies increment; length= specifies desired length
seq(along=x) generates 1, 2, ..., length(x); useful for for loops
rep(x,times) replicate x times; use each= to repeat "each" element of x each times; rep(c(1,2,3),2) is 1 2 3 1 2 3; rep(c(1,2,3),each=2) is 1 1 2 2 3 3
data.frame(...) create a data frame of the named or unnamed arguments; data.frame(v=1:4,ch=c("a","B","c","d"),n=10); shorter vectors are recycled to the length of the longest
list(...) create a list of the named or unnamed arguments; list(a=c(1,2),b="hi",c=3i)
array(x,dim=) array with data x; specify dimensions like dim=c(3,4,2); elements of x recycle if x is not long enough
matrix(x,nrow=,ncol=) matrix; elements of x recycle
factor(x,levels=) encodes a vector x as a factor
gl(n,k,length=n*k,labels=1:n) generate levels (factors) by specifying the pattern of their levels; k is the number of levels, and n is the number of replications
expand.grid() a data frame from all combinations of the supplied vectors or factors
rbind(...) combine arguments by rows for matrices, data frames, and others
cbind(...) id. by columns

Slicing and extracting data
Indexing vectors
x[n] nth element
x[-n] all but the nth element
x[1:n] first n elements
x[-(1:n)] elements from n+1 to the end
x[c(1,4,2)] specific elements
x["name"] element named "name"
x[x > 3] all elements greater than 3
x[x > 3 & x < 5] all elements between 3 and 5
x[x %in% c("a","and","the")] elements in the given set
Indexing lists
x[n] list with elements n
x[[n]] nth element of the list
x[["name"]] element of the list named "name"
x$name id.
Indexing matrices
x[i,j] element at row i, column j
x[i,] row i
x[,j] column j
x[,c(1,3)] columns 1 and 3
x["name",] row named "name"
Indexing data frames (matrix indexing plus the following)
x[["name"]] column named "name"
x$name id.

Variable conversion
as.array(x), as.data.frame(x), as.numeric(x), as.logical(x), as.complex(x), as.character(x), ... convert type; for a complete list, use methods(as)

Variable information
is.na(x), is.null(x), is.array(x), is.data.frame(x), is.numeric(x), is.complex(x), is.character(x), ... test for type; for a complete list, use methods(is)
length(x) number of elements in x
dim(x) retrieve or set the dimension of an object; dim(x) <- c(3,2)
dimnames(x) retrieve or set the dimension names of an object
nrow(x) number of rows; NROW(x) is the same but treats a vector as a one-row matrix
ncol(x) and NCOL(x) id.
for columns
class(x) get or set the class of x; class(x) <- "myclass"
unclass(x) remove the class attribute of x
attr(x,which) get or set the attribute which of x
attributes(obj) get or set the list of attributes of obj

Data selection and manipulation
which.max(x) returns the index of the greatest element of x
which.min(x) returns the index of the smallest element of x
rev(x) reverses the elements of x
sort(x) sorts the elements of x in increasing order; to sort in decreasing order: rev(sort(x))
cut(x,breaks) divides x into intervals (factors); breaks is the number of cut intervals or a vector of cut points
match(x, y) returns a vector of the same length as x with the elements of x that are in y (NA otherwise)
which(x == a) returns a vector of the indices of x for which the comparison is true (TRUE); in this example, the values of i for which x[i] == a (the argument of this function must be a variable of mode logical)
choose(n, k) computes the number of combinations of k events among n repetitions = n!/[(n − k)!k!]
na.omit(x) suppresses the observations with missing data (NA) (suppresses the corresponding line if x is a matrix or a data frame)
na.fail(x) returns an error message if x contains at least one NA
unique(x) if x is a vector or a data frame, returns a similar object but with the duplicate elements suppressed
table(x) returns a table with the counts of the different values of x (typically for integers or factors)
subset(x, ...)
returns a selection of x with respect to criteria (..., typically comparisons: x$V1 < 10); if x is a data frame, the option select gives the variables to be kept or dropped using a minus sign
sample(x, size) resample randomly and without replacement size elements of the vector x; the option replace = TRUE allows resampling with replacement
prop.table(x,margin=) table entries as fraction of marginal table

Math
sin, cos, tan, asin, acos, atan, atan2, log, log10, exp
max(x) maximum of the elements of x
min(x) minimum of the elements of x
range(x) id., then c(min(x), max(x))
sum(x) sum of the elements of x
diff(x) lagged and iterated differences of vector x
prod(x) product of the elements of x
mean(x) mean of the elements of x
median(x) median of the elements of x
quantile(x,probs=) sample quantiles corresponding to the given probabilities (defaults to 0, .25, .5, .75, 1)
weighted.mean(x, w) mean of x with weights w
rank(x) ranks of the elements of x
var(x) or cov(x) variance of the elements of x (calculated on n − 1); if x is a matrix or a data frame, the variance-covariance matrix is calculated
sd(x) standard deviation of x
cor(x) correlation matrix of x if it is a matrix or a data frame (1 if x is a vector)
var(x, y) or cov(x, y) covariance between x and y, or between the columns of x and those of y if they are matrices or data frames
cor(x, y) linear correlation between x and y, or correlation matrix if they are matrices or data frames
round(x, n) rounds the elements of x to n decimals
log(x, base) computes the logarithm of x with base base
scale(x) if x is a matrix, centers and scales the data; to center only, use scale=FALSE; to scale only, use center=FALSE (by default center=TRUE, scale=TRUE)
pmin(x,y,...) a vector whose ith element is the minimum of x[i], y[i], . . .
pmax(x,y,...) id. for the maximum
cumsum(x) a vector whose ith element is the sum from x[1] to x[i]
cumprod(x) id. for the product
cummin(x) id. for the minimum
cummax(x) id.
for the maximum
union(x,y), intersect(x,y), setdiff(x,y), setequal(x,y), is.element(el,set) "set" functions
Re(x) real part of a complex number
Im(x) imaginary part
Mod(x) modulus; abs(x) is the same
Arg(x) angle in radians of the complex number
Conj(x) complex conjugate
convolve(x,y) compute several kinds of convolutions of two sequences
fft(x) Fast Fourier Transform of an array
mvfft(x) FFT of each column of a matrix
filter(x,filter) applies linear filtering to a univariate time series or to each series separately of a multivariate time series
Many math functions have a logical parameter na.rm=FALSE to specify missing data (NA) removal.

Matrices
t(x) transpose
diag(x) diagonal
%*% matrix multiplication
solve(a,b) solves a %*% x = b for x
solve(a) matrix inverse of a
rowsum(x) sum of rows for a matrix-like object; rowSums(x) is a faster version
colSums(x) id. for columns
rowMeans(x) fast version of row means
colMeans(x) id. for columns

Advanced data processing
apply(X,MARGIN,FUN=) a vector or array or list of values obtained by applying a function FUN to margins (MARGIN) of X
lapply(X,FUN) apply FUN to each element of the list X
tapply(X,INDEX,FUN=) apply FUN to each cell of a ragged array given by X with indexes INDEX
by(data,INDEX,FUN) apply FUN to data frame data subsetted by INDEX
merge(a,b) merge two data frames by common columns or row names
xtabs(a~b,data=x) a contingency table from cross-classifying factors
aggregate(x,by,FUN) splits the data frame x into subsets, computes summary statistics for each, and returns the result in a convenient form; by is a list of grouping elements, each as long as the variables in x
stack(x, ...) transform data available as separate columns in a data frame or list into a single column
unstack(x, ...) inverse of stack()
reshape(x, ...)
reshapes a data frame between "wide" format, with repeated measurements in separate columns of the same record, and "long" format, with the repeated measurements in separate records; use direction="wide" or direction="long"

Strings
paste(...) concatenate vectors after converting to character; sep= is the string to separate terms (a single space is the default); collapse= is an optional string to separate "collapsed" results
substr(x,start,stop) substrings in a character vector; can also assign, as substr(x, start, stop) <- value
strsplit(x,split) split x according to the substring split
grep(pattern,x) searches for matches to pattern within x; see ?regex
gsub(pattern,replacement,x) replacement of matches determined by regular expression matching; sub() is the same but only replaces the first occurrence
tolower(x) convert to lowercase
toupper(x) convert to uppercase
match(x,table) a vector of the positions of first matches for the elements of x among table
x %in% table id. but returns a logical vector
pmatch(x,table) partial matches for the elements of x among table
nchar(x) number of characters

Dates and Times
The class Date has dates without times. POSIXct has dates and times, including time zones. Comparisons (e.g. >), seq(), and difftime() are useful. Date also allows + and −. ?DateTimeClasses gives more information. See also package chron.
as.Date(s) and as.POSIXct(s) convert to the respective class; format(dt) converts to a string representation. The default string format is "2001-02-21". These accept a second argument to specify a format for conversion. Some common formats are:
%a, %A Abbreviated and full weekday name.
%b, %B Abbreviated and full month name.
%d Day of the month (01–31).
%H Hours (00–23).
%I Hours (01–12).
%j Day of year (001–366).
%m Month (01–12).
%M Minute (00–59).
%p AM/PM indicator.
%S Second as decimal number (00–61).
%U Week (00–53); the first Sunday as day 1 of week 1.
%w Weekday (0–6, Sunday is 0).
%W Week (00–53); the first Monday as day 1 of week 1.
%y Year without century (00–99). Don't use.
%Y Year with century.
%z (output only.) Offset from Greenwich; -0800 is 8 hours west of Greenwich.
%Z (output only.) Time zone as a character string (empty if not available).
Where leading zeros are shown they will be used on output but are optional on input. See ?strftime.

Plotting
plot(x) plot of the values of x (on the y-axis) ordered on the x-axis
plot(x, y) bivariate plot of x (on the x-axis) and y (on the y-axis)
hist(x) histogram of the frequencies of x
barplot(x) histogram of the values of x; use horiz=TRUE for horizontal bars
dotchart(x) if x is a data frame, plots a Cleveland dot plot (stacked plots line-by-line and column-by-column)
pie(x) circular pie-chart
boxplot(x) "box-and-whiskers" plot
sunflowerplot(x, y) id. as plot(), but points with similar coordinates are drawn as flowers whose number of petals represents the number of points
stripchart(x) plot of the values of x on a line (an alternative to boxplot() for small sample sizes)
coplot(x~y | z) bivariate plot of x and y for each value or interval of values of z
interaction.plot(f1, f2, y) if f1 and f2 are factors, plots the means of y (on the y-axis) with respect to the values of f1 (on the x-axis) and of f2 (different curves); the option fun allows choosing the summary statistic of y (by default fun=mean)
matplot(x,y) bivariate plot of the first column of x vs. the first column of y, the second column of x vs. the second column of y, etc.
fourfoldplot(x) visualizes, with quarters of circles, the association between two dichotomous variables for different populations (x must be an array with dim=c(2, 2, k), or a matrix with dim=c(2, 2) if k = 1)
assocplot(x) Cohen–Friendly graph showing the deviations from independence of rows and columns in a two-dimensional contingency table
mosaicplot(x) 'mosaic' graph of the residuals from a log-linear regression of a contingency table
pairs(x) if x is a matrix or a data frame, draws all possible bivariate plots between the columns of x
plot.ts(x) if x is an object of class "ts", plot of x with respect to time; x may be multivariate, but the series must have the same frequency and dates
ts.plot(x) id. but if x is multivariate the series may have different dates and must have the same frequency
qqnorm(x) quantiles of x with respect to the values expected under a normal law
qqplot(x, y) quantiles of y with respect to the quantiles of x
contour(x, y, z) contour plot (data are interpolated to draw the curves); x and y must be vectors and z must be a matrix so that dim(z)=c(length(x), length(y)) (x and y may be omitted)
filled.contour(x, y, z) id. but the areas between the contours are coloured, and a legend of the colours is drawn as well
image(x, y, z) id. but with colours (actual data are plotted)
persp(x, y, z) id. but in perspective (actual data are plotted)
stars(x) if x is a matrix or a data frame, draws a graph with segments or a star, where each row of x is represented by a star and the columns are the lengths of the segments
symbols(x, y, ...) draws, at the coordinates given by x and y, symbols (circles, squares, rectangles, stars, thermometers or "boxplots") whose sizes, colours, . . .
are specified by supplementary arguments
termplot(mod.obj) plot of the (partial) effects of a regression model (mod.obj)
The following parameters are common to many plotting functions:
add=FALSE if TRUE, superposes the plot on the previous one (if it exists)
axes=TRUE if FALSE, does not draw the axes and the box
type="p" specifies the type of plot: "p" points, "l" lines, "b" points connected by lines, "o" id. but the lines are over the points, "h" vertical lines, "s" steps (the data are represented by the top of the vertical lines), "S" id. but the data are represented by the bottom of the vertical lines
xlim=, ylim= specifies the lower and upper limits of the axes, for example with xlim=c(1, 10) or xlim=range(x)
xlab=, ylab= annotates the axes; must be variables of mode character
main= main title; must be a variable of mode character
sub= sub-title (written in a smaller font)

Low-level plotting commands
points(x, y) adds points (the option type= can be used)
lines(x, y) id. but with lines
text(x, y, labels, ...) adds text given by labels at coordinates (x,y); a typical use is: plot(x, y, type="n"); text(x, y, names)
mtext(text, side=3, line=0, ...) adds text given by text in the margin specified by side (see axis() below); line specifies the line from the plotting area
segments(x0, y0, x1, y1) draws lines from points (x0,y0) to points (x1,y1)
arrows(x0, y0, x1, y1, angle=30, code=2) id.
with arrows at points (x0,y0) if code=2, at points (x1,y1) if code=1, or both if code=3; angle controls the angle from the shaft of the arrow to the edge of the arrow head
abline(a,b) draws a line of slope b and intercept a
abline(h=y) draws a horizontal line at ordinate y
abline(v=x) draws a vertical line at abscissa x
abline(lm.obj) draws the regression line given by lm.obj
rect(x1, y1, x2, y2) draws a rectangle whose left, right, bottom, and top limits are x1, x2, y1, and y2, respectively
polygon(x, y) draws a polygon linking the points with coordinates given by x and y
legend(x, y, legend) adds the legend at the point (x,y) with the symbols given by legend
title() adds a title and optionally a sub-title
axis(side, vect) adds an axis at the bottom (side=1), on the left (2), at the top (3), or on the right (4); vect (optional) gives the abscissae (or ordinates) where tick-marks are drawn
rug(x) draws the data x on the x-axis as small vertical lines
locator(n, type="n", ...) returns the coordinates (x, y) after the user has clicked n times on the plot with the mouse; also draws symbols (type="p") or lines (type="l") with respect to optional graphic parameters (...); by default nothing is drawn (type="n")

Graphical parameters
These can be set globally with par(...); many can be passed as parameters to plotting commands.
adj controls text justification (0 left-justified, 0.5 centred, 1 right-justified)
bg specifies the colour of the background (e.g. bg="red", bg="blue", . . .;
the list of the 657 available colours is displayed with colors())
bty controls the type of box drawn around the plot; allowed values are "o", "l", "7", "c", "u" or "]" (the box looks like the corresponding character); if bty="n" the box is not drawn
cex a value controlling the size of texts and symbols with respect to the default; the following parameters have the same control for numbers on the axes (cex.axis), the axis labels (cex.lab), the title (cex.main), and the sub-title (cex.sub)
col controls the color of symbols and lines; use color names ("red", "blue"; see colors()) or as "#RRGGBB"; see rgb(), hsv(), gray(), and rainbow(); as for cex there are col.axis, col.lab, col.main, col.sub
font an integer which controls the style of text (1: normal, 2: italics, 3: bold, 4: bold italics); as for cex there are font.axis, font.lab, font.main, font.sub
las an integer which controls the orientation of the axis labels (0: parallel to the axes, 1: horizontal, 2: perpendicular to the axes, 3: vertical)
lty controls the type of lines; can be an integer or string (1: "solid", 2: "dashed", 3: "dotted", 4: "dotdash", 5: "longdash", 6: "twodash") or a string of up to eight characters (between "0" and "9") which specifies alternately the length, in points or pixels, of the drawn elements and the blanks; for example lty="44" will have the same effect as lty=2
lwd a numeric which controls the width of lines; default 1
mar a vector of 4 numeric values which control the space between the axes and the border of the graph, of the form c(bottom, left, top, right); the default values are c(5.1, 4.1, 4.1, 2.1)
mfcol a vector of the form c(nr,nc) which partitions the graphic window as a matrix of nr rows and nc columns; the plots are then drawn in columns
mfrow id. but the plots are drawn by row
pch controls the type of symbol, either an integer between 1 and 25 or any single character within "" (the original card shows a chart of the 25 plotting symbols here)
ps an integer which controls the size in points of texts and symbols
pty a character which specifies the type of the plotting region: "s" square, "m" maximal
tck a value which specifies the length of tick-marks on the axes as a fraction of the smaller of the width or height of the plot; if tck=1 a grid is drawn
tcl a value which specifies the length of tick-marks on the axes as a fraction of the height of a line of text (by default tcl=-0.5)
xaxt if xaxt="n" the x-axis is set but not drawn (useful in conjunction with axis(side=1, ...))
yaxt if yaxt="n" the y-axis is set but not drawn (useful in conjunction with axis(side=2, ...))

Lattice (Trellis) graphics
xyplot(y~x) bivariate plots (with many functionalities)
barchart(y~x) histogram of the values of y with respect to those of x
dotplot(y~x) Cleveland dot plot (stacked plots line-by-line and column-by-column)
densityplot(~x) density functions plot
histogram(~x) histogram of the frequencies of x
bwplot(y~x) "box-and-whiskers" plot
qqmath(~x) quantiles of x with respect to the values expected under a theoretical distribution
stripplot(y~x) single-dimension plot; x must be numeric, y may be a factor
qq(y~x) quantiles to compare two distributions; x must be numeric, y may be numeric, character, or factor but must have two 'levels'
splom(~x) matrix of bivariate plots
parallel(~x) parallel coordinates plot
levelplot(z~x*y|g1*g2) coloured plot of the values of z at the coordinates given by x and y (x, y and z are all of the same length)
wireframe(z~x*y|g1*g2) 3d surface plot
cloud(z~x*y|g1*g2) 3d scatter plot
In the normal Lattice formula, y~x|g1*g2 has combinations of optional conditioning variables g1 and g2 plotted on separate panels. Lattice functions take many of the same arguments as base graphics, plus data= (the data frame for the formula variables) and subset= (for subsetting). Use panel= to define a custom panel function (see apropos("panel") and ?llines).
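A minimal sketch of the Lattice formula interface described above; the data frame and variable names here are invented for illustration:

```r
library(lattice)  # lattice ships with the standard R distribution

# Hypothetical data: a response y against x, with a grouping factor g
d <- data.frame(x = rep(1:10, 2),
                y = c(1:10, 2 * (1:10)),
                g = rep(c("A", "B"), each = 10))

# The conditioning bar "|" gives one panel per level of g
p <- xyplot(y ~ x | g, data = d, type = "b")
print(p)  # lattice objects must be printed to produce the graph
```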
Lattice functions return an object of class trellis and have to be print-ed to produce the graph. Use print(xyplot(...)) inside functions where automatic printing doesn't work. Use lattice.theme and lset to change Lattice defaults.

Optimization and model fitting
optim(par, fn, method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN")) general-purpose optimization; par is initial values, fn is the function to optimize (normally minimize)
nlm(f,p) minimize function f using a Newton-type algorithm with starting values p
lm(formula) fit linear models; formula is typically of the form response ~ termA + termB + ...; use I(x*y) + I(x^2) for terms made of nonlinear components
glm(formula,family=) fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution; family is a description of the error distribution and link function to be used in the model; see ?family
nls(formula) nonlinear least-squares estimates of the nonlinear model parameters
approx(x,y=) linearly interpolate given data points; x can be an xy plotting structure
spline(x,y=) cubic spline interpolation
loess(formula) fit a polynomial surface using local fitting
Many of the formula-based modeling functions have several common arguments: data= the data frame for the formula variables, subset= a subset of variables used in the fit, na.action= action for missing values ("na.fail", "na.omit", or a function).
The following generics often apply to model fitting functions:
predict(fit,...)
predictions from fit based on input data
df.residual(fit) returns the number of residual degrees of freedom
coef(fit) returns the estimated coefficients (sometimes with their standard errors)
residuals(fit) returns the residuals
deviance(fit) returns the deviance
fitted(fit) returns the fitted values
logLik(fit) computes the logarithm of the likelihood and the number of parameters
AIC(fit) computes the Akaike information criterion (AIC)
Statistics
aov(formula) analysis of variance model
anova(fit,...) analysis of variance (or deviance) tables for one or more fitted model objects
density(x) kernel density estimates of x
binom.test(), pairwise.t.test(), power.t.test(), prop.test(), t.test(), ...: use help.search("test")
Distributions
rnorm(n, mean=0, sd=1) Gaussian (normal)
rexp(n, rate=1) exponential
rgamma(n, shape, scale=1) gamma
rpois(n, lambda) Poisson
rweibull(n, shape, scale=1) Weibull
rcauchy(n, location=0, scale=1) Cauchy
rbeta(n, shape1, shape2) beta
rt(n, df) 'Student' (t)
rf(n, df1, df2) Fisher–Snedecor (F)
rchisq(n, df) Pearson (χ²)
rbinom(n, size, prob) binomial
rgeom(n, prob) geometric
rhyper(nn, m, n, k) hypergeometric
rlogis(n, location=0, scale=1) logistic
rlnorm(n, meanlog=0, sdlog=1) lognormal
rnbinom(n, size, prob) negative binomial
runif(n, min=0, max=1) uniform
rwilcox(nn, m, n), rsignrank(nn, n) Wilcoxon's statistics
All these functions can be used by replacing the letter r with d, p or q to get, respectively, the probability density (dfunc(x, ...)), the cumulative probability (pfunc(x, ...)), and the quantile (qfunc(p, ...), with 0 < p < 1).
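A minimal sketch tying together the extractor generics and the r/d/p/q naming scheme above, assuming R's built-in cars data frame (the variables dist and speed come from that data set; the fit is purely illustrative):

```r
# fit a straight line: stopping distance as a function of speed
fit <- lm(dist ~ speed, data = cars)

# the generics from the list above, applied to the fitted model
coef(fit)          # estimated intercept and slope
df.residual(fit)   # residual degrees of freedom: 50 observations - 2 coefficients
head(fitted(fit))  # first few fitted values
AIC(fit)           # Akaike information criterion

# the d/p/q/r scheme for the standard normal distribution
dnorm(0)           # density at 0, i.e. 1/sqrt(2*pi)
pnorm(1.96)        # cumulative probability P(X <= 1.96), about 0.975
qnorm(0.975)       # quantile function inverts pnorm, about 1.96
rnorm(3)           # three random draws
```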
Programming
function( arglist ) expr function definition
return(value)
if(cond) expr
if(cond) cons.expr else alt.expr
for(var in seq) expr
while(cond) expr
repeat expr
break
next
Use braces {} around compound statements
ifelse(test, yes, no) a value with the same shape as test filled with elements from either yes or no
do.call(funname, args) executes a function call from the name of the function and a list of arguments to be passed to it
Continuous distributions in R
Distribution   parameters   R suffix
exponential    λ            exp
normal         µ, σ         norm
uniform        a, b         unif
Weibull        α, β         weibull
beta           α, β         beta
1. Computing density functions. Use dexp, dnorm, etc. dexp(2,3) gives the value of the density of the exponential at x = 2 with λ = 3. dexp(x,3) computes the values of the density for all values in vector x.
2. Computing proportions. Use pexp, pnorm, etc. pexp(2,3) gives the integral ∫_{0}^{2} f(x) dx where f is the density of the exponential with λ = 3. (This is the proportion of values of the variable for which x ≤ 2.)
3. Computing percentiles. Use qexp, qnorm, etc. qexp(.7,3) gives the number q such that ∫_{0}^{q} f(x) dx = .7, where f(x) is the density of the exponential with parameter 3.
4. Generating random numbers. Use rexp, rnorm, etc. rexp(10,3) gives 10 random numbers from an exponential distribution with λ = 3. That is, the random numbers are expected to occur according to the relative frequencies given by the exponential distribution.
You can also use a vector for the first argument in each of these commands.
Solution to Problem 2.30, page 78
The grader found considerable confusion on this problem and finally gave up in reading it. Here is a solution, with some generality added. You are given a distribution (in this case an exponential distribution with parameter λ) and are asked to find the proportion of the population that exceeds the mean by more than 1 standard deviation.
If f(x) is the density for the distribution, that means that you are to compute
    ∫_{µ+σ}^{∞} f(x) dx
To compute this integral, we need to know µ and σ. For the distribution in question, µ = 1/λ. We need only compute σ. Now σ² for any distribution is given by
    σ² = ∫_{−∞}^{∞} (x − µ)² f(x) dx
which by the hint in the problem is equal to
    σ² = ∫_{−∞}^{∞} x² f(x) dx − µ²
In the special case of this problem, then, we need to compute
    σ² = ∫_{0}^{∞} x² λe^{−λx} dx − 1/λ²
since µ = 1/λ. The integral, after integrating by parts twice, evaluates to 2/λ², so σ² = 2/λ² − 1/λ² = 1/λ². Thus σ = 1/λ. Therefore, since µ + σ = 2/λ, the answer to the problem is
    ∫_{2/λ}^{∞} λe^{−λx} dx
This is a dead easy integral to evaluate. Its value is e^{−2}.
Study Sheet for Test 1 — Tuesday, March 1
1. The test will cover all material from the course through that of Tuesday, February 22 (including the homework assigned on that day).
2. The textbook sections covered include 1.1–1.6, 2.1–2.3, 4.3, 5.1–5.3. On occasion we covered topics not in the textbook, and we left out a few things that were in the textbook. The test also covers the supplementary notes Sections 1 and 2.
3. The daily outlines are the best guide to what we covered. You should be familiar with all the terminology listed on those sheets.
4. You will not be expected to know the exact formulas for the density functions of important distributions. But you will be expected to know about distributions in general (what a density function, mean, variance, etc. are). And you will be expected to know how a particular distribution (like the normal or exponential) is shaped and what a particular distribution might be a model for. You should be able to compute relative proportions, means, and variances given a (fairly simple) density function.
5. I'll supply a copy of the table on the front cover of the book. Know how to use it to answer questions about arbitrary normal distributions.
6. I won't ask you to compute statistics or draw things like histograms or boxplots on the fly.
But you should know how they are computed and drawn. In particular, I might show you R output and ask for interpretation. 7. To repeat the policy on missed tests, you may miss this test for any reason you deem appropriate. But there are no makeup tests. If you miss the test or score higher on the final exam than on this test, your final exam score will replace the first test score. 8. There are two sections of this course so the same exam will be given at 9 AM and 1:30 PM. You may take it at either time. Discussing the test with another student after you see it but before 2:30 PM is a serious breach of trust and is punishable by death. 9. You may use calculators to calculate things. But you may not use the calculator to retrieve stored information such as definitions or the like.