Download Section 5 Part 1: Distributions for Numeric Data Normal Distribution

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Statistics wikipedia , lookup

History of statistics wikipedia , lookup

Probability wikipedia , lookup

Probability interpretations wikipedia , lookup

Transcript
Section 5
Part 1: Distributions for Numeric
Data
Normal Distribution
Probability Review
Section 4 covered these Probability topics
– Addition rule for mutually exclusive events
– Joint, Marginal and Conditional probability for two
non-mutually exclusive events
– Identifying Independent events
– Screening test measures as examples of
conditional probabilities
– Application of Bayes’ Theorem to calculate PPV
and NPV
PubH 6414 Section 5 Part 1
2
Section 5 Part 1 Outline
•
•
•
•
Probability Distributions
Normal Distribution
Standard Normal (Z) Distribution
R functions for normal distribution
probabilities
– pnorm
– qnorm
PubH 6414 Section 5 Part 1
3
Probability Distributions
• Any characteristic that can be measured or
categorized is called a variable.
• If the variable can assume a number of different
values such that any particular outcome is
determined by chance it is called a random variable.
• Every random variable has a corresponding
probability distribution.
• Probability distribution: The value the data takes on
and how often it takes on those values.
PubH 6414 Section 5 Part 1
4
Random Nominal Variable
• Blood type is a random nominal variable
• The blood type of a randomly selected
individual is unknown but the distribution of
blood types in the population can be
described
Distribution of Blood types in the US
Probability
50%
40%
30%
20%
10%
0%
O
A
B
AB
Blood Type
PubH 6414 Section 5 Part 1
5
Probability Distributions for Numerical Data
• Continuous Data can take on any value within the
range of possible values so describing the
distribution of continuous data in a table is not very
practical.
• One solution is a histogram.
PubH 6414 Section 5 Part 1
6
Numerical Data: Duration of Eruptions for Old Faithful
PubH 6414 Section 5 Part 1
7
Histogram
PubH 6414 Section 5 Part 1
8
Probability Density Curve
• Definition: a probability density curve is a curve
describing a continuous probability distribution
• Unlike a histogram of continuous data, the
probability density curve is a smooth line.
• Thought experiment: Imagine a histogram with the
width of the intervals getting smaller and smaller and
the sample size getting larger and larger.
PubH 6414 Section 5 Part 1
9
The Histogram and the Probability Density Curve
PubH 6414 Section 5 Part 1
10
0
.1
Percentage
.2
.3
.4
The Histogram and the Probability Density Curve
80
100
120
140
Systolic BP (mmHg)
160
180
The Probability Density Curve for BP values in the entire population of men –
there are no bars because the population is infinite.
PubH 6414 Section 5 Part 1
11
Area under a Probability Density Curve
• The probability density is a smooth idealized curve
that shows the shape of the distribution of a random
variable in the population
• The total area under a probability density curve = 1.0
• The probability density curve in the systolic blood
pressure example has the bell-shape of a normal
distribution. Not all probability density curves are bell
shaped.
PubH 6414 Section 5 Part 1
12
Shapes of Probability Density Curves
– Symmetric
– Right Skewed
– Left Skewed
– Unimodal
– Bimodal
– Multimodal
PubH 6414 Section 5 Part 1
13
Normal Distribution
• The Normal Distribution is also called
the Gaussian Distribution after
Karl Friedrich Gauss, a German
mathematician (1777 – 1855)
• Characteristics of any Normal Distribution
– Bell-shaped curve
– Unimodal – peak is at the mean
– Symmetric about the mean
– Mean = Median = Mode
– ‘Tails’ of the curve extend to infinity in both
directions
PubH 6414 Section 5 Part 1
14
Normal Distribution
PubH 6414 Section 5 Part 1
15
Symbol Notation
A convention in statistics notation is to use Roman letters
for sample statistics and Greek letters for population
parameters. Since the density curve describes the population,
Greek letters are used for the mean (mu) and SD (sigma)
Symbol
Mean
Standard
Deviation
Sample
X
s
PubH 6414 Section 5 Part 1
Density
Curve
µ
σ
16
Describing Normal Distributions
• Every Normal distribution is uniquely described by
it’s mean (µ) and standard deviation (σ)
• The Notation for a normal distribution is
N(µ, σ)
• N(125, 4) refers to a normal distribution with
mean = 125 and standard deviation = 4. What is
the variance (σ2) for this distribution? ________
PubH 6414 Section 5 Part 1
17
Normal density with
mean=5 and σ =1
Two normal densities with different
mean values and same σ
Two normal densities with different σ and the
same mean
PubH 6414 Section 5 Part 1
18
The 68-95-99.7 Approximation for all
Normal Distributions
Regardless of the mean and standard deviation of the normal
distribution:
• 68% of the observations fall within one standard deviation
of the mean
• 95% of the observations fall within approximately* two
standard deviations of the mean
• 99.7% of the observations fall within three standard
deviations of the mean
* 95% of the observations fall within 1.96 SD of the mean
PubH 6414 Section 5 Part 1
19
Distributions of Blood Pressure
.4
µ= 125 mmHG
σ = 14 mmHG
.3
68%
.2
95%
99.7%
.1
0
83
97
111
125
139
153
167
The 68-95-99.7 rule applied to the distribution
of systolic blood pressure in men.
PubH 6414 Section 5 Part 1
20
Calculating Probabilities from a Normal
Distribution Curve
• The total area under the curve = 1.0 which is the
total probability
• Areas for intervals under the curve can be
interpreted as probability
• What is the probability that a man has blood
pressure between 111 and 139 mmHg?
PubH 6414 Section 5 Part 1
21
Areas under the Curve
• What if you wanted to find the probability of a man
having SBP < 105 mmHg?
The 68-95-99.7
rule can’t be used
to find this area
under the curve!
83
97 111
125
139
153 167
SBP in mmHg
PubH 6414 Section 5 Part 1
22
Calculating the Areas under the Curve
• Use integral Calculus with formula on slide 15.
• Table A-2 in the text is a table of areas under the
standard normal curve – the normal distribution with
mean = 0 and standard deviation = 1
• The pnorm function in R can be used to find the area
under a normal distribution density curve.
PubH 6414 Section 5 Part 1
23
Using pnorm in R
To find P(X< x*):
> pnorm(x*, µ, σ)
To find P(X>x*):
> 1- pnorm(x*, µ, σ)
PubH 6414 Section 5 Part 1
24
Using pnorm in R
• What is the probability that a randomly selected man has SBP
< 105 mmHg?
Interpretation?
> pnorm(105,125,14)
[1] 0.07656373
PubH 6414 Section 5 Part 1
25
Areas under the Curve
• What if you wanted to find the probability of a man
having SBP > 150?
83
97 111
125
139
153 167
SBP in mmHg
PubH 6414 Section 5 Part 1
26
Using pnorm in R
• What is the probability that a man has SBP > 150 mmHg?
Interpretation?
> 1-pnorm(150,125,14)
[1] 0.03707277
PubH 6414 Section 5 Part 1
>
27
Areas under the Curve
• What if you wanted to find the probability of a man
having SBP between 115 and 135?
83
97 111
125
139
153 167
SBP in mmHg
PubH 6414 Section 5 Part 1
28
Using pnorm in R
• What is the probability that a man has SBP between 115 and
135 mmHg?
Interpretation???
> pnorm(135,125,14)-pnorm(115,125,14)
[1] 0.5249495
PubH 6414 Section 5 Part 1
29
Standard Normal Distribution
• The Standard Normal Distribution is the normal
distribution with
− x2
1
– Mean = 0
f ( x) =
e 2
2π
– Standard deviation = 1
Z ~ N(0,1)
PubH 6414 Section 5 Part 1
30
Formula for Z-score
Z=
X −µ
σ
The value of Z tells us how many standard deviations above
or below the mean the observed values falls.
PubH 6414 Section 5 Part 1
31
Standard Normal Scores
•
Z = 1: The observation lies one SD above the mean
•
Z = 2: The observation is two SD above the mean
•
Z = -1: The observation lies 1 SD below the mean
•
Z = -2: The observation lies 2 SD below the mean
What does Z = 0 represent?
PubH 6414 Section 5 Part 1
32
Standard Normal Scores
Recall: Male Systolic Blood Pressure;
X ~ N(125, 14)
For SBP = 105 mmHg – what is the Z-score?
Area > 1.79
Interpretation?
PubH 6414 Section 5 Part 1
33
Standard Normal Scores
What is the probability of a Z score less than -1.43?
Interpretation?
> pnorm(-1.43,0,1)
[1] 0.07635851
Notice this is the same as: What is the probability that
selected
Area a
> randomly
1.79
man has SBP < 105 mmHg?
> pnorm(105,125,14)
[1] 0.07656373
PubH 6414 Section 5 Part 1
34
Standard Normal Scores
Recall: Male Systolic Blood Pressure;
X ~ N(125, 14)
For SBP = 150 mmHg – what is the Z-score?
Area > 1.79
Interpretation?
PubH 6414 Section 5 Part 1
35
Standard Normal Scores
What is the probability of a Z score greater than 1.79?
Interpretation?
> 1-pnorm(1.79,0,1)
[1] 0.03672696
Notice this is the same as: What is the probability that a man has SBP > 150
mmHg?
> 1-pnorm(150,125,14)
[1] 0.03707277
PubH 6414 Section 5 Part 1
36
Z scores – why do it?
– z-scores are used to find probabilities from standard
normal tables such as Table A-2 in the text appendix
– The z-score represents the number of standard deviations
an observation is from the mean which can be useful in
understanding and visualizing the data.
– Probabilities from the standard normal distribution are
used in confidence intervals and hypothesis tests
PubH 6414 Section 5 Part 1
37
The Inverse problem
• What if you instead of finding the area for a z-score you want
to know the z-score for a specified area?
• For example, the top 10% of the midterms scores are how
many standard deviations above the mean?
PubH 6414 Section 5 Part 1
38
Inverse problem: Midterm Score
• Find a z* value such that the probability of obtaining a
larger z score = 0.10.
What is this z score?
PubH 6414 Section 5 Part 1
39
Using qnorm in R
The qnorm finds lower percentiles.
qnorm(lower percentile, µ, σ)
PubH 6414 Section 5 Part 1
40
Inverse problem: Midterm Score
• Find a z* value such that the probability of obtaining a
larger z score = 0.10.
> qnorm(.90,0,1)
[1] 1.281552
What is this z score?
PubH 6414 Section 5 Part 1
41
Inverse problem: Midterm Score
The top 10% of the midterms scores are how many standard
deviations above the mean?
PubH 6414 Section 5 Part 1
42
Inverse problem: Serum Sodium
The lowest 25% of the serum sodium levels in healthy adults are
how many standard deviations below the mean?
PubH 6414 Section 5 Part 1
43
Inverse problem: Serum Sodium
• Find a z-value such that the probability of obtaining a
smaller z score = 0.25
>qnorm(0.25,0,1)
[1] -0.6744898
What is this z-score?
PubH 6414 Section 5 Part 1
44
Inverse Problem: Serum Sodium
What Serum sodium (mEq/L) level corresponds
to 0.67 standard deviations below the mean?
Serum sodium ~ N(141,3)
PubH 6414 Section 5 Part 1
45
Inverse Problem: Serum Sodium
What Serum sodium (mEq/L) level corresponds
to 0.67 standard deviations below the mean?
Serum sodium ~ N(141,3)
> qnorm(0.25,141,3)
[1] 138.9765
PubH 6414 Section 5 Part 1
46