Download PROBABILITY AND PROBABILITY DISTRIBUTIONS

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Theoretical Probability
Distributions
PhD Özgür Tosun
THEORETICAL PROBABILITY DISTRIBUTIONS
probability is a measure of chance
probability distributions help us to study the
probabilities associated with outcomes of
the variable under study
probability theory is the foundation for
statistical
inference.
A
probability
distribution is a device for indicating the
values that a random variable may have
THEORETICAL PROBABILITY DISTRIBUTIONS
Several theoretical probability distributions
are important in biostatistics:
I) Binomial
II) Poisson
III)Normal
Discrete probability distributions:
Variable takes only integer values.
Continuous probability distribution:
Variable has values measured on a
continuous scale.
THE BINOMIAL DISTRIBUTION:
•Variable
outcomes
has
only
binary/dichotomous
(male – female; diseased – not diseased;
positive – negative) denoted A and B.
•The probability of A is denoted by p.
P(A) = p
and
P(B)= 1-p
THE BINOMIAL DISTRIBUTION:
•When an experiment is repeated n times, p
remains constant (outcome is independent
from one trial to another)
Such a variable is said to
BINOMIAL DISTRIBUTION.
follow
Daniel Bernoulli
a
The question is:
What is the probability that outcome A
occurs x times?
Or
What proportion of n outcomes will be A?
The probability of x outcomes in a group of
size n, if each outcome has probability p and
is independent from all outcomes is given by
Binomial Probability Function:
n  x
P(A  x)    p (1  p)n-x
 x
Example
For families with 5 children each, what is the
probability that
i) There will be one male child?
 5
P(A  1)    0.501 (1  0.50) 51  0.16
1 
Among families with 5 children each, 0.16
have one male child.
ii) There will be at least one male children?
P(A  1)  P(A  1)  P(A  2)  P(A  3)  P(A  4)  P(A  5)
5 
    0.5A (1  0.5) 5A
A1  A 
 1 - P(A  0)
5
5 
 1 -   0.5 0 (1 - 0.5)5
 0
 1 - 0.03
 0.97
Using the probabilities associated with
possible outcomes, we can draw a probability
distribution for the event under study:
,4
,3
PROBABILITY
,2
,1
0,0
,00
1,00
2,00
3,00
NO. OF MALE CHILDREN
4,00
5,00
Example:
Among men with localized prostate tumor and a
PSA<10, the 5-year survival is known to be 0.8. We
can use Binimial Distribution to calculate the
probability that any particular number (A), out of
n, will survive 5 years. For example for a new
series of 6 such men:
Non will survive 5 years : P(A=0)=0,000064
Only 1 will survive 5 years : P(A=1)=0,0015
2 will survive 5 years : P(A=2)=0,015
3 will survive 5 years : P(A=3)=0,082
4 will sıurvive 5 years : P(A=4)=0,246
5 will survive 5 years : P(A=5)=0,393
All will survive 5 years : P(A=6)=0,262
,5
,4
,3
PROBABILITY
,2
,1
0,0
0
1
2
3
4
NO. SURVIVING 5 YEARS
Binomial Distribution for n=6 and p=0.8
5
6
THE POISSON DISTRIBUTION:
Like the Binomial, Poisson distribution is a discrete
distribution applicable when the outcome is the
“number of times an event occurs”.
Instead of the probability of an outcome, if average
number of occurrence of the event is given,
associated probabilities can be calculated by using the
Poisson Distribution Function which is defined as:
  lambda
P( X  A) 
Simeon D. Poisson (1781-1840)
A 
e
A!
Example.
If the average number of hospitalizations
for a group of patients is calculated as
3.22, the probability that a patient in the
group has zero hospitalizations is
3.220 e 3.22
P(A  0) 
 0.04
0!
The probability that a patient has exactly
one hospitalization is
3.22 e
P(A  1) 
1!
1
3.22
 0.129
The probability that a patient will be
hospitalized more than 3 times, since the
upper limit is unknown, is calculated as
P(A>3)=1-P(A3)
EXAMPLE
Normal Distribution
Abraham de Moivre (1667-1754)
Karl F. Gauss (1777-1855)
NORMAL DISTRIBUTION
Normal (Gaussian) distribution is the most famous probability
distribution of continuous variables.
The two parameters of the normal distribution are the mean (μ)
and the standard deviation (σ).
The graph has a familiar bell-shaped curve.
The function of normal distribution curve is as follows:
1
f ( x) 
e
 2
1  xi   
 

2  
2
   xi  
The normal distribution is completely defined by the mean
and standard deviation of a set of quantitative data:
 The mean determines the location of the curve on the
x axis of a graph
 The standard deviation determines the height of the
curve on the y axis
There are an infinite number of normal distributions- one for
every possible combination of a mean and standard deviation
Examples of Normal Distributions
Pr(X) on the y-axis refers to either frequency or
probability.
Examples of Normal Distributions
25
Frequency
Frequency
20
15
10
5
0
//
55
Mean
60
65
70
75
80
85
90
Heart Rate (BPM)
Many (but not all) continuous variables are approximately normally
distributed. Generally, as sample size increases, the shape of a frequency
distribution becomes more normally distributed.
Frequency
of
occurrence
Mode, Median, Mean
When data are normally distributed, the mode, median, and
mean are identical and are located at the center of the
distribution.
Quantitative variables may also have a skewed
distribution:
When distributions are skewed, they have more extreme
values in one direction than the other, resulting in a long
tail on one side of the distribution.
The direction of the tail determines whether a
distribution is positively or negatively skewed.
A positively skewed distribution has a long tail on the
right, or positive side of the curve.
A negatively skewed distribution has the tail on the left, or
negative side of the curve.
For a normally distributed variable:
~68.3% of the observations lie between the mean and  1 standard
deviation
~95.4% lie between the mean and  2 standard deviations
~99.7% lie between the mean and  3 standard deviations
Mode,
Median,
Mean
68.3 %
95.4 %
99.7 %
3
2
1

1
2
3
68.26%
95.44%
99.74%
  3   2   

     2   3
P(     x     )  0.6826
P(   2  x    2 )  0.9544
P(   3  x    3 )  0.9974
Sex HR Sex HR Sex HR Sex HR Sex HR Sex HR Sex HR
F
55
M
66
F
70
M
73
F
77
M
79
F
82
M
57
F
M
59
F
67
F
70
M
73
F
77
M
79
M
82
67
M
70
M
73
F
77
M
79
F
83
F
61
F
68
M
70
M
73
M
77
F
80
M
83
M
61
F
68
F
71
F
74
M
77
F
80
M
83
M
62
F
68
F
71
F
74
F
78
M
80
F
84
M
62
M
68
M
71
F
74
F
78
F
81
F
84
F
63
F
69
M
71
M
74
F
78
F
81
M
85
F
64
M
69
F
72
F
75
F
78
F
81
F
86
M
64
M
69
M
72
F
75
M
78
M
81
F
86
M
64
M
69
F
73
M
75
M
78
F
82
M
89
M
66
F
70
M
73
M
76
M
79
F
82
M
89
25
Mean HR = 74.0 bpm
SD = 7.5 bpm
Mean  1SD = 74.0  7.5
= 66.5-81.5 bpm
Mean  2SD = 74.0  15.0
= 59.0-89.0 bpm
20
Frequency
For the heart rate data for 84 adults:
15
10
5
0
//
55
60
65
70
75
80
Heart Rate (BPM)
85
90
Mean  3SD = 74.0  22.5
= 51.5-96.5 bpm
HR Data:
• 57/84 (67.9%) subjects are between mean ± 1SD
• 82/84 (97.6%) are between mean ± 2SD
• 84/84 (100%) are between mean ± 3SD
100
+3 SD
95
90
+2 SD
Heart rate (bpm)
85
+ 1SD
80
75
Mean
70
-1 SD
65
-2 SD
60
55
-3 SD
50
45
0
4
8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 68 72 76 80 84
Subject number
The “normal” range in medical measurements is the
central 95% of the values for a reference population,
and is usually determined from large samples
representative of the population.
The central 95% is approximately the mean  2 sd*
Some examples of established reference ranges are:
Serum
“Normal” range
fasting glucose
sodium
triglycerides
70-110 mg/dL
135-146 mEq/L
35-160 mg/dL
Note: The value is actually 1.96 sd but for convenience this is usually
rounded to 2 sd.
The Standard Normal Distribution
A normal distribution with a mean of 0, and standard
deviation of 1
The distribution is also called the z distribution
Any normal distribution can be converted to the standard
normal distribution using the z transformation.
Each value in a distribution is converted to the number of
standard deviations the value is from the mean.
The transformed value is called a z score.
Formula for the z transformation
x 
z

Once the data are transformed to z-scores, the standard
normal distribution can be used to determine areas under
the curve for any normal distribution.
Example of a z-transformation
If the population mean heart rate is 74 bpm, and the standard
deviation is 7.5, the z score for an individual with HR = 80 bpm
is:
x 
80  
6
z


 0.8



The individual’s HR of 80 bpm is
0.8 standard deviations above the mean.
The z-value can be looked up in a table for
the standard normal distribution to
determine the lower and upper areas
defined by a z-score of 0.8 (the areas are
the lower 78.8% and upper 21.2%)
Using Table
.0082 is
the area
under
N(0,1)
left of z
= -2.40
.0080 is the
area under
N(0,1) left of
z = -2.41
(…)
0.0069 is the
area under
N(0,1) left of
z = -2.46
The standard Normal distribution
Because all Normal distributions share the same properties, we can
standardize our data to transform any Normal curve N (, ) into the
standard Normal curve N (0,1).
N(64.5, 2.5)
N(0,1)
=>
x
z
Standardized height (no units)
For each x we calculate a new value, z (called a z-score).
The total area under the normal distribution curve is 1:
90% of the area is between ± 1.645 sd
95% of the area is between ± 1.960 sd
99% of the area is between ± 2.575 sd
Area = 90%
Area = 95%
Area = 99%
-2.575
-1.645
-1.960
0
+1.645
+2.575
+1.960
The Normal Distribution &
Confidence Intervals
90% of the area is between ± 1.645 sd
95% of the area is between ± 1.960 sd
99% of the area is between ± 2.575 sd
These are the most commonly used areas for defining
Confidence Intervals
which are used in inferential statistics to estimate population
values from sample data
If a certain interval is a 95% confidence interval, then we can say
that if we repeated the procedure of drawing random samples and
computing confidence intervals over and over again, 95% of those
confidence intervals include the true value from the population.
Birth weight (xi)
3200
3450
2980
4100
2900
3500
:
3400
=3300 ; =600
Zi=
xi  3300
600
-0.167
0.25
-0.533
1.333
-0.667
0.333
:
0.167
=0 ; =1.0
If it is known that the birth weights of
infants are normally distributed with a mean
of 3300gr and a standard deviation of 600gr,
what is the probability that a randomly
selected infant will weigh less than 3000gr?
zi =
xi - m
s
3000 - 3300
=
= -0.5
600
P(Xi £ 3000) = P(Zi £ -0.5) = 0.31
More than 3000gr? Ans: 0.19+0.50=0.69
0.00
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.0
0.0000
0.0040
0.0080
0.0120
0.0160
0.0199
0.0239
0.0279
0.0319
0.0359
0.1
0.0398
0.0438
0.0478
0.0517
0.0557
0.0596
0.0636
0.0675
0.0714
0.0753
0.2
0.0793
0.0832
0.0871
0.0910
0.0948
0.0987
0.1026
0.1064
0.1103
0.1141
0.3
0.1179
0.1217
0.1255
0.1293
0.1331
0.1368
0.1406
0.1443
0.1480
0.1517
0.4
0.1554
0.1591
0.1628
0.1664
0.1700
0.1736
0.1772
0.1808
0.1844
0.1879
0.5
0.1915
0.1950
0.1985
0.2019
0.2054
0.2088
0.2123
0.2157
0.2190
0.2224
0.6
0.2257
0.2291
0.2324
0.2357
0.2389
0.2422
0.2454
0.2486
0.2517
0.2549
0.7
0.2580
0.2611
0.2642
0.2673
0.2704
0.2734
0.2764
0.2794
0.2823
0.2852
0.8
0.2881
0.2910
0.2939
0.2967
0.2995
0.3023
0.3051
0.3078
0.3106
0.3133
0.9
0.3159
0.3186
0.3212
0.3238
0.3264
0.3289
0.3315
0.3340
0.3365
0.3389
1.0
0.3413
0.3438
0.3461
0.3485
0.3508
0.3531
0.3554
0.3577
0.3599
0.3621
1.1
0.3643
0.3665
0.3686
0.3708
0.3729
0.3749
0.3770
0.3790
0.3810
0.3830
1.2
0.3849
0.3869
0.3888
0.3907
0.3925
0.3944
0.3962
0.3980
0.3997
0.4015
1.3
0.4032
0.4049
0.4066
0.4082
0.4099
0.4115
0.4131
0.4147
0.4162
0.4177
1.4
0.4192
0.4207
0.4222
0.4236
0.4251
0.4265
0.4279
0.4292
0.4306
0.4319
1.5
0.4332
0.4345
0.4357
0.4370
0.4382
0.4394
0.4406
0.4418
0.4429
0.4441
1.6
0.4452
0.4463
0.4474
0.4484
0.4495
0.4505
0.4515
0.4525
0.4535
0.4545
1.7
0.4554
0.4564
0.4573
0.4582
0.4591
0.4599
0.4608
0.4616
0.4625
0.4633
1.8
0.4641
0.4649
0.4656
0.4664
0.4671
0.4678
0.4686
0.4693
0.4699
0.4706
1.9
0.4713
0.4719
0.4726
0.4732
0.4738
0.4744
0.4750
0.4756
0.4761
0.4767
2.0
0.4772
0.4778
0.4783
0.4788
0.4793
0.4798
0.4803
0.4808
0.4812
0.4817
2.1
0.4821
0.4826
0.4830
0.4834
0.4838
0.4842
0.4846
0.4850
0.4854
0.4857
2.2
0.4861
0.4864
0.4868
0.4871
0.4875
0.4878
0.4881
0.4884
0.4887
0.4890
2.3
0.4893
0.4896
0.4898
0.4901
0.4904
0.4906
0.4909
0.4911
0.4913
0.4916
2.4
0.4918
0.4920
0.4922
0.4925
0.4927
0.4929
0.4931
0.4932
0.4934
0.4936
2.5
0.4938
0.4940
0.4941
0.4943
0.4945
0.4946
0.4948
0.4949
0.4951
0.4952
2.6
0.4953
0.4955
0.4956
0.4957
0.4959
0.4960
0.4961
0.4962
0.4963
0.4964
2.7
0.4965
0.4966
0.4967
0.4968
0.4969
0.4970
0.4971
0.4972
0.4973
0.4974
2.8
0.4974
0.4975
0.4976
0.4977
0.4977
0.4978
0.4979
0.4979
0.4980
0.4981
2.9
0.4981
0.4982
0.4982
0.4983
0.4984
0.4984
0.4985
0.4985
0.4986
0.4986
3.0
0.4987
0.4987
0.4987
0.4988
0.4988
0.4989
0.4989
0.4989
0.4990
0.4990
Area between 0
and z
If the mean and the standard deviation of
the BMI of adult women are 24 and 6 units
respectively, what proportion of women will
have BMI>30 (what proportion of women will
be clssified as obese)?
zi =
xi - m
s
30 - 24
=
= 1.0
6
P(Xi ³ 30) = P(Zi ³ 1.0) = 0.16
16% of the adult women will be classified as
obese.
18  22
P( x  18)  P( z 
)  P( z  1)
4
 0,5  0,3413
 0,1587
Example
• In a normal curve with mean = 30, s = 5, what
is the proportion of scores below 27?
27  30
Z 27 
 0.6
5
Smaller portion of a Z of 0.6 is
0.2743
Mean to Z equals 0.2257 and
-4
-3
-2
-1
0
27
1
2
3
4
0.5 - 0.2257 = 0.2743
Portion  27%
Example
• In a normal curve with mean = 30, s = 5, what
is the proportion of scores fall between 26 and
26  30
35?
Z 26 
.3413
.2881
5
 0.8
35  30
Z 35 
1
5
Mean to a Z of 0.8 is 0.2881
Mean to a Z of 1 is 0.3413
0.2881 + 0.3413 = 0.6294
-4
-3
-2
-1
26
0
1
2
3
4
Portion = 62.94% or  63%
Example
• The Stanford-Binet IQ test has a mean of 100 and
a SD of 15, how many people (out of 1000 ) have
IQs between 120 and 140?
Z140
140  100

 2.66
15
Mean to a Z of 2.66 is 0.4961
.4082
Z120
120  100

 1.33
15
Mean to a Z of 1.33 is 0.4082
.4961
0.4961 - 0.4082 = 0.0879
Portion = 8.79% or  9%
-4
-3
-2
-1
0
1
120
2
3
140
4
0.0879 * 1000 = 87.9 or  88
people
When the numbers are on the same side of
the mean: subtract
-
=
Related documents