Probability and the Normal Curve
MSc Module 6: Introduction to Quantitative Research Methods
Kenneth Benoit
February 17, 2010
Basic rules of probability
- Probability is the cornerstone of decision-making in quantitative research: in particular, how to judge evidence given a specific hypothesis
- The specific question with research is the following: how likely was it that I obtained this sample of data, given my hypothesis?
- Probability lies on the unit interval [0, 1], even though in common parlance we may refer to probabilities as percentages
- Probability axioms:
  1. For any event E: 0 ≤ Pr(E) ≤ 1
  2. The probability that something occurs is 1.0: Pr(Ω) = 1
  3. Any countable sequence of pairwise disjoint events E1, E2, . . . satisfies Pr(E1 ∪ E2 ∪ · · ·) = ∑i Pr(Ei)
- For instance: for a coin toss, Pr(heads) = 0.5, Pr(tails) = 0.5
Computing probabilities
- Probability refers to the relative likelihood of occurrence of any given outcome or event
- Alternatively, the probability associated with an event is the number of times that event can occur, relative to the total number of times any event can occur:

  Pr(given outcome) = (number of times outcome can occur) / (total number of times any outcome can occur)

- Example: If a classroom contains 20 Democrats and 10 Republicans, then the probability that a randomly selected student will be a Democrat is 20/(20 + 10) = .667
- Another example: The probability of picking the ace of spades in a single draw from a deck of cards is 1/52; the probability of drawing any ace is 4/52
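These probabilities can be checked directly in R, and the same numbers emerge as long-run relative frequencies when we simulate random draws (a quick sketch; the `students` vector is invented for illustration):

```r
# Direct calculation: favourable outcomes / total outcomes
20 / (20 + 10)   # Democrat from a class of 20 D and 10 R
4 / 52           # any ace in a single draw from a deck

# The same probability as a long-run relative frequency
set.seed(1)
students <- c(rep("D", 20), rep("R", 10))
mean(sample(students, 1e5, replace = TRUE) == "D")  # close to .667
```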
Probability distributions
- A probability distribution is directly analogous to a frequency distribution, except that it is based on theory (probability theory) rather than on what is observed in sample data
- In a probability distribution, we determine the possible values of outcomes, and compute the probability associated with each outcome
Probability versus frequency distributions
- The difference is that a frequency distribution depends on a sample. Example: flip two coins 10 times (Table 5.3 LF&F):

  > table(rbinom(10,2,.5))/10
    0   1   2
  0.3 0.6 0.1

- If we increase the coin flips to 1000, we get:

  > table(rbinom(1000,2,.5))/1000
      0     1     2
  0.266 0.496 0.238

- A probability distribution is like a frequency distribution where N = ∞
- Note: Your exact numbers will differ!
[Figure: barplot of the empirical distribution of heads (0, 1, 2) in 50 tosses of 2 coins]
[Figure: barplot of the empirical distribution of heads (0, 1, 2) in 1,000 tosses of 2 coins]
Mean and variance of a probability distribution
- Remember the formula for a mean: X̄ = ∑X / N
- We can already compute the mean for any given frequency distribution
- Returning to the two coin flip example:

  > # Means of frequency distributions of 2 coin tosses
  > mean(rbinom(10,2,.5))
  [1] 0.9
  > mean(rbinom(100,2,.5))
  [1] 1.04
  > mean(rbinom(1000,2,.5))
  [1] 0.961
  > mean(rbinom(10000,2,.5))
  [1] 0.9987

- If we performed this experiment an infinite number of times, we would expect the average to be 1.0. This is why we call the mean of a probability distribution an expected value
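The expected value can also be computed directly from the theoretical probabilities, with no simulation at all. A short sketch using the binomial probabilities for two fair coin flips:

```r
# Theoretical distribution of heads in 2 tosses: Pr(0), Pr(1), Pr(2)
p <- dbinom(0:2, size = 2, prob = 0.5)   # 0.25 0.50 0.25
# Expected value: each outcome weighted by its probability
sum(0:2 * p)                             # exactly 1
```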
Greeks and Romans
- For sample statistics, we use Roman characters
- For population parameters, we use Greek characters

  Quantity             Sample notation    Population notation
  Mean                 X̄                  µ
  Standard deviation   s                  σ
  Variance             s²                 σ²
The normal curve
- The single most important probability distribution in all of statistics
- Features:
  - symmetrical
  - continuous
  - unimodal
  - follows a specific mathematical form involving two parameters

[Figure: the standard normal density dnorm(x), plotted for x from -4 to 4]
The area under the normal curve
- Remember that the normal distribution describes a (continuous) probability distribution; empirical distributions (may) only approximate it
- To use the normal distribution in solving problems, we calculate probabilities in a probability distribution that comes from integrating the normal curve
- The normal distribution has the density

  f(x) = (1 / (σ√(2π))) e^(−(x − µ)² / (2σ²))

  where
  - σ > 0 is the standard deviation
  - µ is the expected value
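R's dnorm() evaluates exactly this density, so we can check the formula by writing it out by hand (a quick sketch; the values of µ and σ are arbitrary):

```r
mu <- 10; sigma <- 1.5; x <- 12
# The density formula written out term by term
f <- (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
# Difference from the built-in density: essentially zero
f - dnorm(x, mean = mu, sd = sigma)
```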
The area under the normal curve (cont.)
- Typically we consider the area relative to standard deviations from the mean
- A constant proportion of the total area under the normal curve will lie between the mean and any given distance measured in units of σ
- For instance, the area under the normal curve between the mean and the point 1σ above the mean always contains 34.13% of cases
- The same σ distance below the mean contains the identical proportion of cases
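The 34.13% figure can be recovered from pnorm(), which gives cumulative areas under the standard normal curve:

```r
# Area between the mean (z = 0) and z = 1
pnorm(1) - pnorm(0)    # 0.3413447, i.e. 34.13% of cases
# By symmetry, the same area lies between z = -1 and the mean
pnorm(0) - pnorm(-1)   # 0.3413447
```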
Area under the normal curve in σ distances
Determining exact areas under the normal curve
- To determine the probability that a random variable X with a normal distribution is less than or equal to x, we evaluate the cumulative distribution function of the normal probability distribution at x
- The cdf of the normal distribution is:

  Φµ,σ²(x) = (1 / (σ√(2π))) ∫₋∞ˣ e^(−(u − µ)² / (2σ²)) du

- This equation (and the integral in the cdf) makes it possible for us to determine the total area under the curve for any given distance from the mean µ
Determining exact areas under the normal curve
- For instance, what if we wanted to know the percentage of cases lying beyond 1.4 σ units above the mean?
- We can do this in R, or use Table A from Appendix C of LF&F
- In R:

  > 1-pnorm(1.4)
  [1] 0.08075666
  > round((1-pnorm(1.4))*100, 2)
  [1] 8.08
“Standardized” scores
- We can transform any distribution into a set of standard deviations from the mean; this is called a z score or standardized score
- The z score measures the direction and degree to which any given raw score deviates from the mean of a distribution, on a scale of σ units
- Formula for computing a z score:

  zi = (Xi − µ) / σ

- In order to convert any raw scores into scores that can be assessed using the normal curve, we convert them into σ units (or "z scores") by standardizing them
- The normal curve can be used in conjunction with z scores and Table A (or pnorm) to determine the probability of obtaining any raw score in a distribution
Example
- Assume we have a normal distribution of hourly wages: the mean wage is $10, and the standard deviation is $1.5
- We wish to find the probability of an hourly wage of $12
- What is the probability of obtaining a score between the mean of $10 and the value of $12?
Example continued
Steps:
1. Standardize the raw score of $12: z = (12 − 10)/1.5 = 1.33
2. Use Table A to find the proportion of the total area under the normal curve between z = 1.33 and the mean. This is p = .4082

Alternatively, we could have used R:

> pnorm(1.33)-pnorm(0)
[1] 0.4082409

Standard normal curve: a special version of the normal curve with µ = 0, σ² = 1
Example variation
- What if we had wanted to find the probability that a wage might be greater than the observed value (in this case, $12)?
- In this case, we would integrate from 1.33 to +∞
- In R, this is easy:

  > 1-pnorm(1.33)
  [1] 0.09175914
Example variation 2
- We could also have obtained the probability that a wage would be either less than $8 or more than $12
- The transformation would then yield −1.33 and 1.33
- In R:

  > pnorm(1.33) - pnorm(-1.33)
  [1] 0.8164817

- This means that 1 − .8165 = .1835 of the area lies below $8 and above $12

[Figure: standard normal density dnorm(x), with the region between z = −1.33 and z = 1.33 marked]
Standardizing scores in R
> x <- c(1, 4, 5, 7, 14, 0, 21)
> (stdx <- (x - mean(x)) / sd(x))
[1] -0.8518410 -0.4543152 -0.3218066 -0.0567894  0.8707708 -0.9843496  1.7983310
> scale(x)
           [,1]
[1,] -0.8518410
[2,] -0.4543152
[3,] -0.3218066
[4,] -0.0567894
[5,]  0.8707708
[6,] -0.9843496
[7,]  1.7983310
attr(,"scaled:center")
[1] 7.428571
attr(,"scaled:scale")
[1] 7.54668
Methods for determining normality in samples
- Visual inspection of the kernel density
- Some tests also exist, e.g. the Anderson-Darling test, the Kolmogorov-Smirnov test, and the Pearson χ² test
- Q-Q plot to visualize normality
  - "quantile-quantile" plot: plots ranked samples against a similar number of ranked samples from a normal distribution
  - normality shows up as a straight-line correspondence
  - In R: qqnorm()
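A minimal sketch of these checks in base R, on a deliberately non-normal simulated sample (the Anderson-Darling test is not in base R, so the Kolmogorov-Smirnov test is shown instead):

```r
set.seed(42)
x <- rexp(200)                       # right-skewed sample, clearly non-normal

plot(density(x))                     # visual inspection of the kernel density
qqnorm(x); qqline(x)                 # Q-Q plot: curvature signals non-normality
ks.test(as.numeric(scale(x)), "pnorm")  # K-S test against the standard normal
```

A small p-value from ks.test() indicates departure from normality; for a truly normal sample the Q-Q points would fall along the qqline.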
Normal Q-Q plot

[Figure: Normal Q-Q Plot of spend_total; Theoretical Quantiles (x-axis, -3 to 3) against Sample Quantiles (y-axis, 0 to 50,000)]

> load("dail2002.Rdata")
> attach(dail2002)
> qqnorm(spend_total)
> qqline(spend_total)
Central Limit Theorem
- States that as the sample size N increases, the distribution of the sample means will be normally distributed with mean µ and variance σ²/N, no matter what the shape of the original distribution
- It is the second fundamental theorem of probability (the law of large numbers is the first)
- This allows us to make probability statements about sample means, against a theoretical (probability) distribution of means that we might have drawn, since that probability distribution is normally distributed
- We could illustrate this with the spend_total example from the dail2002.Rdata dataset
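The dail2002 data are not reproduced here, but a simulation sketch makes the same point: draw repeated samples from a clearly non-normal population and look at the distribution of their means.

```r
set.seed(1)
N <- 100                                  # sample size
# Exponential population: mu = 1, sigma = 1, heavily right-skewed
means <- replicate(5000, mean(rexp(N, rate = 1)))

hist(means, breaks = 40)                  # looks approximately normal
mean(means)                               # close to mu = 1
sd(means)                                 # close to sigma / sqrt(N) = 0.1
```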