Applied Data Analysis
Spring 2017
The North Remembers
The smile makes me nervous
Karen Albert
[email protected]
Thursdays, 4-5 PM (Hark 302)
Lecture outline
1. Random variables
2. Probability distributions
3. Uniform distribution
4. Standard normal distribution
5. Normal distribution
Whence data?
How do we translate outcomes into numbers?
That is, how do we relate events in the set of outcomes to the
set of numbers?
political outcomes → numbers
Aside: functions
A function is a rule that relates inputs to outputs.
Example:
f(x) = x²
• f is the name of the function
• x stands in for the input
• x² is the output
Input    Output
-4       16
-2       4
0        0
2        4
4        16
Aside: functions
Functions relate inputs to outputs subject to two conditions:
1. it must work for every possible input value
2. it must relate each input to only one output
A rule that relates both -4 and 4 to 16 is a function.
A rule that relates 4 to both 8 and 16 is not a function.
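The two conditions can be tried out on the squaring rule from the example. This is only a minimal sketch in R; the names f, inputs, and outputs are chosen here for illustration.

```r
# The rule from the example: square the input
f <- function(x) x^2

inputs  <- c(-4, -2, 0, 2, 4)
outputs <- f(inputs)

# Each input is related to exactly one output
data.frame(input = inputs, output = outputs)

# Two different inputs may share an output (-4 and 4 both give 16);
# the rule is still a function
f(-4) == f(4)
```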
Random variables
A random variable is a function that relates political outcomes to
numbers.
Random variables have nothing to do with:
• randomness
• probabilities
Probability distributions
A probability distribution lists the possible values of a random
variable and their probabilities.
Two kinds:
• Discrete (probability mass functions)
• Continuous (probability density functions)
Probability mass functions
A probability mass function assigns a probability to each
possible value of a discrete variable.
Properties
• 0 ≤ Pr(y) ≤ 1 (each prob. is between 0 and 1)
• Σ_{all y} Pr(y) = 1 (the probs. must sum to 1)
Example
Let y be the answer to “What do you think is the ideal number
of children for a family to have?”
y        Pr(y)
0        0.01
1        0.03
2        0.60
3        0.23
4        0.12
5        0.01
Total    1.00
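The two properties of a probability mass function can be checked directly on the table above. A minimal sketch in R; the variable names are illustrative, and the "3 or more" question is added here only as an example of summing probabilities.

```r
# Probability mass function from the table: ideal number of children
y  <- 0:5
pr <- c(0.01, 0.03, 0.60, 0.23, 0.12, 0.01)

# Property 1: every probability lies between 0 and 1
all(pr >= 0 & pr <= 1)

# Property 2: the probabilities sum to 1 (all.equal tolerates rounding error)
isTRUE(all.equal(sum(pr), 1))

# Probability that the answer is 3 or more: 0.23 + 0.12 + 0.01
pr_three_plus <- sum(pr[y >= 3])
pr_three_plus
```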
Probability density functions
A probability density function assigns probabilities to intervals
of a continuous variable.
Properties
• f(x) ≥ 0 (the density is never negative; it is not itself a probability and can exceed 1)
• ∫ f(x) dx = 1 (the total area under the curve must equal 1)
Example
The Uniform distribution assigns all intervals of the same length
equal probability.
f(x) = 1 / (b − a),  a ≤ x ≤ b
Drawing a Uniform
curve(dunif(x,1,3),0,4)
[Figure: density of U(1, 3); horizontal axis x from 0 to 4, vertical axis dunif(x, 1, 3)]
Finding areas under the Uniform
Assume that X ∼ U(1, 3). What is Pr(X < 1.4)?
[Figure: "Uniform Density"; dunif(x, 1, 3) plotted for x from 0 to 4, with a shaded box marking the area of interest]
Solution
What is Pr(X < 1.4)?
The area of the shaded box is base × height.
Pr(X < 1.4) = (1.4 − 1.0) × 1 / (3 − 1)
            = 0.4 × 0.5
            = 0.2
Solution, another way
(1.4-1)*(1/(3-1))
## [1] 0.2
punif(1.4,1,3,lower.tail=TRUE)
## [1] 0.2
Simulating a Uniform
library(MASS)            # provides truehist()
x <- runif(1000,1,3)     # 1,000 random draws from U(1, 3)
truehist(x,prob=TRUE)    # histogram on the density scale
[Figure: histogram of x; horizontal axis from 1.0 to 3.0, vertical axis (density) from 0.0 to 0.6]
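Simulating a Uniform
d <- sum(x<1.4)    # number of simulated draws below 1.4
d/1000             # share of draws below 1.4, an estimate of Pr(X < 1.4)
## [1] 0.18
(No seed is set, so the estimate varies from run to run; it should land near the exact answer of 0.2.)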
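To see the simulated share settle toward the exact area, the same idea can be repeated with more draws. A minimal sketch; the seed and the number of draws are arbitrary choices made here so the run can be repeated, not values from the lecture.

```r
set.seed(2017)                    # arbitrary seed, so the run can be repeated
n <- 100000                       # more draws than the 1,000 used above
x_sim <- runif(n, 1, 3)

estimate <- mean(x_sim < 1.4)     # share of draws below 1.4
exact    <- punif(1.4, 1, 3)      # exact area under U(1, 3) to the left of 1.4

c(estimate = estimate, exact = exact)
```

With this many draws the estimate typically agrees with the exact value to about two decimal places.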
Practice
Assume X ∼ U(6, 10).
What is Pr(X > 7)?
Practice: draw it!
x.cord <- c(7,10,10,7)
y.cord <- c(0,0,0.25,.25)
curve(dunif(x,6,10),5,11)
polygon(x.cord,y.cord,col='skyblue')
[Figure: density of U(6, 10) for x from 5 to 11, with the box from 7 to 10 shaded]
Practice: Answer
Assume X ∼ U(6, 10).
What is Pr(X > 7)?
punif(7,6,10,lower.tail=FALSE)
## [1] 0.75
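The same answer follows from the base × height reasoning used for U(1, 3). A short sketch, with variable names chosen here for illustration:

```r
a <- 6; b <- 10                   # endpoints of U(6, 10)

height   <- 1 / (b - a)           # the density is constant at 0.25
by_hand  <- (b - 7) * height      # base x height for the box from 7 to 10
by_punif <- punif(7, a, b, lower.tail = FALSE)

c(by_hand = by_hand, by_punif = by_punif)
```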
The standard normal
[Figure: "Normal Density"; dnorm(x, 0, 1) plotted for x from -4 to 4]
Drawing the Standard Normal
curve(dnorm(x,0,1),-5,5)
[Figure: standard normal density; horizontal axis x, vertical axis dnorm(x, 0, 1) from 0.0 to 0.4]
Some history
• Introduced by Abraham de Moivre and included in the 1738 edition of
his book “The Doctrine of Chances.”
• The result was extended by Laplace.
• Then Gauss used the curve in his discussion of regression.
So why is it called the Normal or Gaussian?
The term “bell curve” came from Jouffret in 1872, and the name
“normal distribution” was coined independently by Peirce,
Galton, and Lexis in 1875.
They named it the normal because lots of things (but not
everything!) are normally distributed.
Thus,
Stigler’s Law of Eponymy
No scientific discovery is
named after its original
discoverer.
Pedantic digression—Eponym
One who gives, or is supposed to give, his name to a people,
place, or institution; e.g. among the Greeks, the heroes who
were looked upon as ancestors or founders of tribes or cities.
“Pelops is the eponym or name-giver of the Peloponnesus.”
Grote (1869)
Are variables really normally distributed?
• When there is reason to suspect the presence of a large
number of small effects acting additively, it is reasonable to
assume normality.
• Example: test scores—the IQ score of an individual
comprises many small effects including genes and
environmental factors.
• Not everything is normally distributed; lifetimes are not
normally distributed. Think about the lifetimes of
lightbulbs....
Facts about the standard normal distribution
• symmetric about zero
• the area under the curve is 100%
• the above is not surprising because the vertical axis is the
density scale
• the curve is always above the horizontal axis
More facts about the standard normal
• the curve stretches from positive infinity to negative infinity
• almost all of the data is between -4 and 4
• 68% of the data lie within 1 standard deviation of the mean
• 95% of the data lie within 2 standard deviations of the
mean
• 99.7% of the data lie within 3 standard deviations of the
mean
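The three percentages can be recovered from pnorm. A minimal sketch; within_k is a helper name introduced here for illustration.

```r
# Area within k standard deviations of the mean of a standard normal
within_k <- function(k) pnorm(k) - pnorm(-k)

areas <- sapply(1:3, within_k)
round(areas, 4)
## [1] 0.6827 0.9545 0.9973
```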
There are lots of normal distributions
Normal distributions are characterized by their mean (where
they are centered) and their standard deviation (how spread out
they are). Only the standard normal has a mean of 0 and a
standard deviation of 1. (All, however, are symmetric and
bell-shaped.)
Notation: N(0, 1²).
The characteristics of a probability distribution are known as
parameters. The normal distribution has two parameters: the
mean and the standard deviation.
Finding areas under the normal
Like the uniform distribution, the normal distribution is described
by an equation. It is somewhat more complicated, however:
Pr(X < x) = ∫_{−∞}^{x} 1 / (σ √(2π)) · exp( −(t − µ)² / (2σ²) ) dt
Fortunately, there are tables of normal probabilities and
computers.
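As a rough check that the integral and the built-in function agree, the density can be typed in from the formula and integrated numerically. This is only a sketch: normal_density is a name made up here, and the values µ = 6, σ = 2, and the upper limit 4 are illustrative choices.

```r
# Normal density written out from the formula
normal_density <- function(t, mu, sigma) {
  exp(-(t - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
}

mu <- 6; sigma <- 2               # illustrative values

# Numerically integrate the density from -Inf up to 4
by_integral <- integrate(normal_density, lower = -Inf, upper = 4,
                         mu = mu, sigma = sigma)$value
by_pnorm    <- pnorm(4, mu, sigma)

c(by_integral = by_integral, by_pnorm = by_pnorm)
```

The two numbers should match up to the tolerance of the numerical integration.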
Practice with the 68%, 95%, 99.7% rule
Assume X ∼ N(6, 2²). Pr(X < 4)?
cord.x <- c(-12,seq(-12,4,0.01),4)
cord.y <- c(0,dnorm(seq(-12,4,0.01),6,2),0)
curve(dnorm(x,6,2),0,12)
polygon(cord.x,cord.y,col='skyblue')
[Figure: density of N(6, 2²) for x from 0 to 12, with the area to the left of 4 shaded]
Practice with the 68%, 95%, 99.7% rule
Assume X ∼ N(6, 2²). Pr(X > 4)?
cord.x <- c(4,seq(4,12,0.01),12)
cord.y <- c(0,dnorm(seq(4,12,0.01),6,2),0)
curve(dnorm(x,6,2),0,12)
polygon(cord.x,cord.y,col='skyblue')
[Figure: density of N(6, 2²) for x from 0 to 12, with the area to the right of 4 shaded]
Practice with the 68%, 95%, 99.7% rule
Assume X ∼ N(6, 2²). Pr(2 < X < 10)?
cord.x <- c(2,seq(2,10,0.01),10)
cord.y <- c(0,dnorm(seq(2,10,0.01),6,2),0)
curve(dnorm(x,6,2),0,12)
polygon(cord.x,cord.y,col='skyblue')
[Figure: density of N(6, 2²) for x from 0 to 12, with the area between 2 and 10 shaded]
Practice with the 68%, 95%, 99.7% rule
Assume X ∼ N(6, 2²).
• Pr(X < 4)? 16%
• Pr(X > 4)? 84%
• Pr(2 < X < 10)? 95%
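The rule-of-thumb answers can be set beside the exact areas. A minimal sketch; the variable names are illustrative.

```r
mu <- 6; sigma <- 2

# Rule of thumb: 68% lies within 1 sd, so the other 32% is split between two tails
rule_below_4 <- (1 - 0.68) / 2    # 0.16
rule_above_4 <- 1 - rule_below_4  # 0.84

# Exact areas from R
exact_below_4 <- pnorm(4, mu, sigma)
exact_above_4 <- pnorm(4, mu, sigma, lower.tail = FALSE)
exact_2_to_10 <- pnorm(10, mu, sigma) - pnorm(2, mu, sigma)

round(c(exact_below_4, exact_above_4, exact_2_to_10), 4)
## [1] 0.1587 0.8413 0.9545
```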
The Old Way: The Normal table
Pr(X > 1)?
Pr (X > 0)?
30th percentile?
The New Way: R
Pr(X > 1)?
pnorm(1,0,1,lower.tail=FALSE)
## [1] 0.1586553
Pr(X > 0)?
pnorm(0,0,1,lower.tail=FALSE)
## [1] 0.5
30th percentile?
qnorm(0.30,0,1)
## [1] -0.5244005
The horizontal axis
The horizontal axis of the standard normal is standard
deviations (away from the mean).
When we use a Normal distribution that is not standard, we
count in standard deviations.
That number is called a Z-score.
Z-scores
Z-scores count up how many standard deviations away from
the mean a particular point is.
z = (x − µ) / σ
The Z-score is the point of interest minus the mean, with that
difference then divided by the standard deviation.
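The formula translates directly into R. A minimal sketch: z_score is a helper name introduced here, and the point x = 40 under N(20, 10²) is an illustrative choice.

```r
z_score <- function(x, mu, sigma) (x - mu) / sigma

z <- z_score(40, mu = 20, sigma = 10)   # 40 is 2 sds above the mean of 20
z

# The area below 40 under N(20, 10^2) equals the area below z under N(0, 1)
pnorm(40, 20, 10)
pnorm(z, 0, 1)
```

Converting to a Z-score changes only the scale of the horizontal axis, which is why the two areas agree.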
The Z-score in action
[Figure: two panels. X ∼ N(20, 10²): dnorm(x, 20, 10) plotted for x from -20 to 60. X ∼ N(0, 1): dnorm(x, 0, 1) plotted for x from -4 to 4.]
Questions I can ask about the normal....
1. What percentage of the data is less (greater) than a certain
number?
2. What percentage of the data is between two points?
3. Given a certain percentage, what number corresponds to
it?
Question 1
IQ scores are scaled so that the mean is 100 and the standard
deviation is 15. The scores are approximately normally
distributed.
To be eligible for membership in MENSA, an individual must
have an IQ above 130. What proportion of the population would
qualify for MENSA membership?
Question 1
cord.x <- c(130,seq(130,160,0.01),160)
cord.y <- c(0,dnorm(seq(130,160,0.01),100,15),0)
curve(dnorm(x,100,15),40,160)
polygon(cord.x,cord.y,col='skyblue')
[Figure: density of N(100, 15²) for x from 40 to 160, with the area above 130 shaded]
Question 1 solution
First, put it on the standard normal scale.
z = (130 − 100) / 15 = 2
Second, look up “2” on the normal table.
Pr(Z ≥ 2) = 0.0228 or 2.28%.
Question 1 solution
pnorm(2,0,1,lower.tail=FALSE)
## [1] 0.02275013
pnorm(130,100,15,lower.tail=FALSE)
## [1] 0.02275013
Question 2
In poor countries, the growth of children can be an important
indicator of general levels of nutrition and health. Data in the
paper “The Osteological Paradox: Problems of Inferring
Prehistoric Health from Skeletal Samples” suggests that a
reasonable model for the population distribution of the height of
5-year old children is a normal distribution with mean 100 cm
and a standard deviation of 6 cm.
What proportion of the population has heights between 94 cm
and 112 cm?
Question 2
cord.x <- c(94,seq(94,112,0.01),112)
cord.y <- c(0,dnorm(seq(94,112,0.01),100,6),0)
curve(dnorm(x,100,6),76,124)
polygon(cord.x,cord.y,col='skyblue')
[Figure: density of N(100, 6²) for x from 76 to 124, with the area between 94 and 112 shaded]
Question 2 solution
pnorm(112,100,6,lower.tail=TRUE)-pnorm(94,100,6,lower.tail=TRUE)
## [1] 0.8185946
Check your answer
• We expect 68% within 1 sd of the mean.
• We expect 95% within 2 sds of the mean.
• The percentage we are looking for must be between these
two.
• Is it? Yes. 81.86% is between 68% and 95%.
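The check works because 94 cm is one standard deviation below the mean and 112 cm is two above it. A short sketch of that reasoning, with illustrative variable names:

```r
mu <- 100; sigma <- 6

z_low  <- (94  - mu) / sigma      # -1: one sd below the mean
z_high <- (112 - mu) / sigma      #  2: two sds above the mean

# Rule-of-thumb estimate: half of 68% plus half of 95%
rule_estimate <- 0.68 / 2 + 0.95 / 2

# Exact area on the standard normal scale
exact <- pnorm(z_high) - pnorm(z_low)

c(rule_estimate = rule_estimate, exact = exact)
```

The rule-of-thumb estimate of 81.5% sits close to the exact 81.86%.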
Question 3
The distribution of the length of time required for students to
complete telephone registration is well approximated by a
normal distribution with a mean of 12 minutes and a standard
deviation of 2 minutes.
The University would like to choose an automatic disconnect
time such that only 1% of the students will be disconnected while
they are still attempting to register. What time should be
chosen?
Question 3 solution
What does the top 1% mean?
Disconnecting those that take the longest.
cord.x <- c(17,seq(17,20,0.01),20)
cord.y <- c(0,dnorm(seq(17,20,0.01),12,2),0)
curve(dnorm(x,12,2),4,20)
polygon(cord.x,cord.y,col='skyblue')
[Figure: density of N(12, 2²) for x from 4 to 20, with the upper tail above 17 shaded]
Question 3 solution
qnorm(0.01,12,2,lower.tail=FALSE)
## [1] 16.6527
Check your answer
• We expect 2.5% of the data to be above 16 minutes.
• We expect 0.15% of the data to be above 18 minutes.
• Our answer must be between these two.
• Is it? Yes. 16.65 is between 16 and 18.
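The cutoff can also be reached through the Z-score, and then confirmed by asking how much area lies above it. A minimal sketch; the variable names are illustrative.

```r
mu <- 12; sigma <- 2

# 99th percentile on the standard normal scale, then back to minutes
z_99   <- qnorm(0.99)
cutoff <- mu + z_99 * sigma

# Same answer directly from qnorm with the upper tail
direct <- qnorm(0.01, mu, sigma, lower.tail = FALSE)

# Share of students still registering at the cutoff: should be 1%
tail_share <- pnorm(cutoff, mu, sigma, lower.tail = FALSE)

c(cutoff = cutoff, direct = direct, tail_share = tail_share)
```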
What did we learn?
• Random variables
• Probability distributions
• Uniform distribution
• Standard normal
• Eponyms
• Normal distribution
• Empirical rule
• Finding areas under the Normal