Applied Data Analysis
Spring 2017
The North Remembers
The smile makes me nervous
Karen Albert
Thursdays, 4-5 PM (Hark 302)

Lecture outline
1. Random variables
2. Probability distributions
3. Uniform distribution
4. Standard normal distribution
5. Normal distribution

Whence data?
How do we translate outcomes into numbers? That is, how do we relate events in the set of outcomes to the set of numbers?
political outcomes → numbers

Aside: functions
A function is a rule that relates inputs to outputs.
Example: f(x) = x²
• f is the name of the function
• x stands in for the input
• x² is the output

f(x) = x²
Input    -4   -2    0    2    4
Output   16    4    0    4   16

Aside: functions
Functions relate inputs to outputs subject to two conditions:
1. it must work for every possible input value
2. it must relate each input to only one output
A rule that relates both -4 and 4 to 16 is a function. A rule that relates 4 to both 8 and 16 is not a function.

Random variables
A random variable is a function that relates political outcomes to numbers.
Random variables have nothing to do with:
• randomness
• probabilities

Probability distributions
A probability distribution lists the possible values of a random variable and their probabilities.
Two kinds:
• Discrete (probability mass function)
• Continuous (probability density function)
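As a minimal sketch of the function idea above, the R code below writes f(x) = x² as a function and then defines a random variable the same way, as a rule that sends each outcome to exactly one number. The outcome labels and the 0/1 coding (`outcomes`, `X`) are hypothetical, chosen only for illustration.

```r
# f(x) = x^2: every input gets exactly one output
f <- function(x) x^2
inputs <- c(-4, -2, 0, 2, 4)
f(inputs)    # 16 4 0 4 16

# A random variable is also a function: it maps each outcome to one number.
# The outcome labels and the 0/1 coding below are hypothetical.
outcomes <- c("incumbent loses", "incumbent wins")
X <- function(outcome) ifelse(outcome == "incumbent wins", 1, 0)
X(outcomes)  # 0 1
```

Note that nothing in `X` involves chance; probabilities enter only once a probability distribution is attached to its values.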
Probability mass functions
A probability mass function assigns a probability to each possible value of a discrete variable.
Properties
• 0 ≤ Pr(y) ≤ 1 (each probability is between 0 and 1)
• $\sum_{\text{all } y} \Pr(y) = 1$ (the probabilities must sum to 1)

Example
Let y be the answer to “What do you think is the ideal number of children for a family to have?”

y        0      1      2      3      4      5      Total
Pr(y)    0.01   0.03   0.60   0.23   0.12   0.01   1.00

Probability density functions
A probability density function assigns probabilities to intervals of a continuous variable.
Properties
• f(x) ≥ 0 (the density is never negative, although it may exceed 1)
• $\int_{\text{all } x} f(x)\,dx = 1$ (the total area under the curve is 1)

Example
The Uniform distribution assigns all intervals of the same length within [a, b] equal probability.
$f(x) = \frac{1}{b - a}, \quad a \le x \le b$

Drawing a Uniform
# density of U(1, 3), drawn for x from 0 to 4
curve(dunif(x,1,3),0,4)
[Figure: density dunif(x, 1, 3) plotted against x from 0 to 4]

Finding areas under the Uniform
Assume that X ∼ U(1, 3). What is Pr(X < 1.4)?
[Figure: Uniform Density, dunif(x, 1, 3) against x from 0 to 4, with the box to the left of 1.4 shaded]

Solution
What is Pr(X < 1.4)?
The area of the shaded box is base × height.
$\Pr(X < 1.4) = (1.4 - 1.0) \times \frac{1}{3 - 1} = 0.4 \times 0.5 = 0.2$

Solution, another way
# base times height, by hand
(1.4-1)*(1/(3-1))
## [1] 0.2
# the same area from the Uniform distribution function
punif(1.4,1,3,lower.tail=TRUE)
## [1] 0.2

Simulating a Uniform
library(MASS)  # provides truehist()
x <- runif(1000,1,3)
truehist(x,prob=TRUE)
[Figure: histogram of the 1,000 simulated draws on the density scale, x from 1.0 to 3.0]

Simulating a Uniform
# proportion of simulated draws below 1.4 (changes from run to run)
d <- sum(x<1.4)
d/1000
## [1] 0.18

Practice
Assume X ∼ U(6, 10). What is Pr(X > 7)?

Practice: draw it!
# corners of the shaded rectangle from x = 7 to x = 10
x.cord <- c(7,10,10,7)
y.cord <- c(0,0,0.25,.25)
curve(dunif(x,6,10),5,11)
polygon(x.cord,y.cord,col='skyblue')
[Figure: density dunif(x, 6, 10) against x from 5 to 11, with the region from 7 to 10 shaded]

Practice: Answer
Assume X ∼ U(6, 10). What is Pr(X > 7)?
punif(7,6,10,lower.tail=FALSE)
## [1] 0.75

The standard normal
[Figure: Normal Density, dnorm(x, 0, 1) against x from -4 to 4]

Drawing the Standard Normal
curve(dnorm(x,0,1),-5,5)
[Figure: the resulting curve, dnorm(x, 0, 1) against x]

Some history
• Introduced by Abraham de Moivre in the 1738 edition of his book “The Doctrine of Chances.”
• The result was extended by Laplace.
• Then Gauss used the curve in his discussion of regression.

So why is it called the Normal or Gaussian?
The term “bell curve” came from Jouffret in 1872, and the name “normal distribution” was coined independently by Peirce, Galton, and Lexis around 1875.
They named it the normal because lots of things (but not everything!) are normally distributed.

Thus, Stigler’s Law of Eponymy
No scientific discovery is named after its original discoverer.

Pedantic digression: Eponym
One who gives, or is supposed to give, his name to a people, place, or institution; e.g. among the Greeks, the heroes who were looked upon as ancestors or founders of tribes or cities.
“Pelops is the eponym or name-giver of the Peloponnesus.” Grote (1869)

Are variables really normally distributed?
• When there is reason to suspect the presence of a large number of small effects acting additively, it is reasonable to assume normality.
• Example: test scores. The IQ score of an individual comprises many small effects, including genes and environmental factors.
• Not everything is normally distributed; lifetimes are not normally distributed. Think about the lifetimes of lightbulbs....
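The "many small effects acting additively" idea can be illustrated with a small simulation. The sketch below uses assumed settings that do not come from the lecture (50 independent Uniform(-1, 1) effects per individual, 10,000 individuals, and a fixed seed). It adds up the effects and checks how much of the standardized total falls within 1 and 2 standard deviations of the mean.

```r
set.seed(2017)  # assumed seed, only so the run is reproducible

n_people  <- 10000  # hypothetical number of individuals
n_effects <- 50     # hypothetical number of small effects per individual

# Each row is one individual; each column is one small effect.
effects <- matrix(runif(n_people * n_effects, -1, 1), nrow = n_people)
total <- rowSums(effects)

# Standardize the totals so they are counted in standard deviations.
z <- (total - mean(total)) / sd(total)

# Share of individuals within 1 and within 2 standard deviations
within1 <- mean(abs(z) < 1)
within2 <- mean(abs(z) < 2)
c(within1, within2)
```

If the totals are close to normally distributed, the two proportions should come out near 0.68 and 0.95, even though each individual effect is Uniform rather than normal.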
Facts about the standard normal distribution
• symmetric about zero
• the area under the curve is 100%
• the above is not surprising because the vertical axis is the density scale
• the curve is always above the horizontal axis

More facts about the standard normal
• the curve stretches from negative infinity to positive infinity
• almost all of the data is between -4 and 4
• about 68% of the data lie within 1 standard deviation of the mean
• about 95% of the data lie within 2 standard deviations of the mean
• about 99.7% of the data lie within 3 standard deviations of the mean

There are lots of normal distributions
Normal distributions are characterized by their mean (where they are centered) and their standard deviation (how spread out they are).
Only the standard normal has a mean of 0 and a standard deviation of 1. (All, however, are symmetric and bell-shaped.)
Notation: N(0, 1²).
The characteristics of a probability distribution are known as parameters. The normal distribution has two parameters: the mean and the standard deviation.

Finding areas under the normal
Like the uniform distribution, the normal distribution is described by an equation. It is somewhat more complicated, however:
$\Pr(X < x) = \int_{-\infty}^{x} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(t - \mu)^2}{2\sigma^2}\right) dt$
Fortunately, there are tables of normal probabilities and computers.

Practice with the 68%, 95%, 99.7% rule
Assume X ∼ N(6, 2²). Pr(X < 4)?
# outline of the shaded region to the left of 4
cord.x <- c(-12,seq(-12,4,0.01),4)
cord.y <- c(0,dnorm(seq(-12,4,0.01),6,2),0)
curve(dnorm(x,6,2),0,12)
polygon(cord.x,cord.y,col='skyblue')
[Figure: density dnorm(x, 6, 2) against x from 0 to 12, with the area to the left of 4 shaded]

Practice with the 68%, 95%, 99.7% rule
Assume X ∼ N(6, 2²). Pr(X > 4)?
# outline of the shaded region to the right of 4
cord.x <- c(4,seq(4,12,0.01),12)
cord.y <- c(0,dnorm(seq(4,12,0.01),6,2),0)
curve(dnorm(x,6,2),0,12)
polygon(cord.x,cord.y,col='skyblue')
[Figure: density dnorm(x, 6, 2) against x from 0 to 12, with the area to the right of 4 shaded]

Practice with the 68%, 95%, 99.7% rule
Assume X ∼ N(6, 2²). Pr(2 < X < 10)?
# outline of the shaded region between 2 and 10
cord.x <- c(2,seq(2,10,0.01),10)
cord.y <- c(0,dnorm(seq(2,10,0.01),6,2),0)
curve(dnorm(x,6,2),0,12)
polygon(cord.x,cord.y,col='skyblue')
[Figure: density dnorm(x, 6, 2) against x from 0 to 12, with the area between 2 and 10 shaded]

Practice with the 68%, 95%, 99.7% rule
Assume X ∼ N(6, 2²).
• Pr(X < 4)? 16%
• Pr(X > 4)? 84%
• Pr(2 < X < 10)? 95%

The Old Way: The Normal table
[Figure: table of standard normal probabilities]
Pr(X > 1)?
Pr(X > 0)?
30th percentile?

The New Way: R
Pr(X > 1)?
pnorm(1,0,1,lower.tail=FALSE)
## [1] 0.1586553
Pr(X > 0)?
pnorm(0,0,1,lower.tail=FALSE)
## [1] 0.5
30th percentile?
qnorm(0.30,0,1)
## [1] -0.5244005

The horizontal axis
The horizontal axis of the standard normal is measured in standard deviations away from the mean.
When we use a Normal distribution that is not standard, we still count in standard deviations. That count is called a Z-score.

Z-scores
Z-scores count up how many standard deviations away from the mean a particular point is.
$z = \frac{x - \mu}{\sigma}$
The Z-score is the point of interest minus the mean, all divided by the standard deviation.

The Z-score in action
[Figure: two density plots. X ∼ N(20, 10²): dnorm(x, 20, 10) against x from -20 to 60. X ∼ N(0, 1): dnorm(x, 0, 1) against x from -4 to 4.]

Questions I can ask about the normal....
1. What percentage of the data is less (greater) than a certain number?
2. What percentage of the data is between two points?
3. Given a certain percentage, what number corresponds to it?

Question 1
IQ scores are scaled so that the mean is 100 and the standard deviation is 15. The scores are approximately normally distributed. To be eligible for membership in MENSA, an individual must have an IQ above 130. What proportion of the population would qualify for MENSA membership?
Question 1
# outline of the shaded region above 130
cord.x <- c(130,seq(130,160,0.01),160)
cord.y <- c(0,dnorm(seq(130,160,0.01),100,15),0)
curve(dnorm(x,100,15),40,160)
polygon(cord.x,cord.y,col='skyblue')
[Figure: density dnorm(x, 100, 15) against x from 40 to 160, with the area above 130 shaded]

Question 1 solution
First, put it on the standard normal scale.
$z = \frac{130 - 100}{15} = 2$
Second, look up “2” on the normal table. Pr(Z ≥ 2) = 0.0228 or 2.28%.

Question 1 solution
# on the standard normal scale
pnorm(2,0,1,lower.tail=FALSE)
## [1] 0.02275013
# or directly on the IQ scale
pnorm(130,100,15,lower.tail=FALSE)
## [1] 0.02275013

Question 2
In poor countries, the growth of children can be an important indicator of general levels of nutrition and health. Data in the paper “The Osteological Paradox: Problems of Inferring Prehistoric Health from Skeletal Samples” suggest that a reasonable model for the population distribution of the height of 5-year-old children is a normal distribution with a mean of 100 cm and a standard deviation of 6 cm. What proportion of the population has heights between 94 cm and 112 cm?

Question 2
# outline of the shaded region between 94 and 112
cord.x <- c(94,seq(94,112,0.01),112)
cord.y <- c(0,dnorm(seq(94,112,0.01),100,6),0)
curve(dnorm(x,100,6),76,124)
polygon(cord.x,cord.y,col='skyblue')
[Figure: density dnorm(x, 100, 6) against x from 76 to 124, with the area between 94 and 112 shaded]

Question 2 solution
# area below 112 minus area below 94
pnorm(112,100,6,lower.tail=TRUE)-pnorm(94,100,6,lower.tail=TRUE)
## [1] 0.8185946

Check your answer
• We expect 68% within 1 sd of the mean.
• We expect 95% within 2 sds of the mean.
• Our interval runs from 1 sd below the mean (94 cm) to 2 sds above it (112 cm), so the percentage we are looking for must be between these two.
• Is it? Yes. 81.86% is between 68% and 95%.
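The benchmarks used in this check can themselves be computed with pnorm. The sketch below uses a hypothetical helper named `within_k_sd` (not part of base R) to find the area within k standard deviations of the mean. It then confirms that the Question 2 answer falls between the 1 sd and 2 sd values.

```r
# Area within k standard deviations of the mean of a normal distribution
# (within_k_sd is a hypothetical helper, not part of base R).
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

rule <- sapply(1:3, within_k_sd)
round(rule, 4)   # 0.6827 0.9545 0.9973

# Question 2: heights between 94 cm and 112 cm when X ~ N(100, 6^2)
answer <- pnorm(112, 100, 6) - pnorm(94, 100, 6)

# The same area on the standard normal scale, using Z-scores
z_low  <- (94 - 100) / 6    # -1
z_high <- (112 - 100) / 6   #  2
answer_z <- pnorm(z_high) - pnorm(z_low)

# The answer should sit between the 1 sd and 2 sd benchmarks.
rule[1] < answer && answer < rule[2]   # TRUE
```

Working on the original scale or on the Z-score scale gives the same area, which is why `answer` and `answer_z` agree.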
Question 3
The distribution of the length of time required for students to complete telephone registration is well approximated by a normal distribution with a mean of 12 minutes and a standard deviation of 2 minutes. The University would like to choose an automatic disconnect time such that only 1% of the students will be disconnected while they are still attempting to register. What time should be chosen?

Question 3 solution
What does the top 1% mean?
Disconnecting those that take the longest.
# shade the upper tail (drawn here from 17 to 20)
cord.x <- c(17,seq(17,20,0.01),20)
cord.y <- c(0,dnorm(seq(17,20,0.01),12,2),0)
curve(dnorm(x,12,2),4,20)
polygon(cord.x,cord.y,col='skyblue')
[Figure: density dnorm(x, 12, 2) against x from 4 to 20, with the upper tail shaded]

Question 3 solution
# the time with 1% of the area above it
qnorm(0.01,12,2,lower.tail=FALSE)
## [1] 16.6527

Check your answer
• We expect 2.5% of the data to be above 16 minutes.
• We expect 0.15% of the data to be above 18 minutes.
• Our answer must be between these two.
• Is it? Yes. 16.65 is between 16 and 18.

What did we learn?
• Random variables
• Probability distributions
• Uniform distribution
• Standard normal
• Eponyms
• Normal distribution
• Empirical rule
• Finding areas under the Normal
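As a closing sketch that ties the recap back to Question 3, the code below recomputes the disconnect time and then checks it two ways. The variable names (`cutoff`, `tail_16`, `tail_18`) are chosen here for illustration.

```r
# Question 3: completion time X ~ N(12, 2^2); find the cutoff for the top 1%
cutoff <- qnorm(0.01, 12, 2, lower.tail = FALSE)

# pnorm undoes qnorm: the area above the cutoff is 0.01 again
tail_at_cutoff <- pnorm(cutoff, 12, 2, lower.tail = FALSE)

# Exact upper-tail areas at 2 and 3 standard deviations above the mean
tail_16 <- pnorm(16, 12, 2, lower.tail = FALSE)
tail_18 <- pnorm(18, 12, 2, lower.tail = FALSE)

round(c(cutoff = cutoff, tail_16 = tail_16, tail_18 = tail_18), 4)
```

The exact tail areas, about 2.28% above 16 minutes and 0.13% above 18 minutes, are slightly smaller than the 2.5% and 0.15% given by the 68%, 95%, 99.7% rule. Because 1% sits between them, a cutoff between 16 and 18 minutes is what we should expect.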