Chapter 3: Probability & Distributions
By Farrokh Alemi, PhD
Version of Tuesday, February 07, 2017
Portions of this chapter were written by Munir Ahmed and edited by Nancy Freeborne DrPH
All statistical analyses follow an organized set of steps. A first step is the examination of univariate data, which we describe in this chapter. Univariate data analysis involves the evaluation of one variable at a time. This type of analysis helps health care managers, clinicians, and researchers assess patterns. Examples of how managers or clinicians can use univariate data analysis are below:
1. Counting the number of nursing home patients who fell in a given year;
2. Assessing the most common time of day for medication errors at a hospital;
3. Counting the number of children admitted with an influenza diagnosis in the month of January.
Variables
A variable measures the extent to which a concept is present, the degree to which an object is available, or the extent of a person's characteristic. Variables are sometimes referred to as attributes, clues, or characteristics. Here are several example variables used in healthcare:



Cost:
• Billed amount for medical service
• Billed for surgical procedure

Patient satisfaction:
• Score from 1 to 10 reflecting very dissatisfied to very satisfied

Disease status:
• Presence of diabetes
• Presence of hypertension
• Count of symptoms of depression
Variables have at least two levels, and some have many levels. The values of a variable change from one level to another. A variable never has only one value on all occasions, but it could have a single value within a particular analysis or evaluation. In that case, we refer to the variable as a constant. For example, if we are looking at emergency department visits, the emergency room itself could be a constant, while the time of visit would be a variable.
If a variable can assume only two levels, then the variable is binary. An example of a binary variable is the occurrence of an adverse event, such as wrong-side surgery. Either the surgery has been done or it has not. If the event has occurred, the variable takes a value of 1; otherwise it is 0. Sometimes binary variables are referred to as indicators. If there are more than two levels, but still a countable set, then the variable is discrete or categorical. An example of a categorical variable is race. Here are examples of categorical variables commonly used in healthcare.
Race:
• American Indian
• Black or African American
• White
• Latino or Hispanic

Insurance status:
• Medicaid
• Medicare
• Private insurance

Marital status:
• Never married
• Divorced
• Married
• Widowed
If the levels of a variable show an order, then the variable is ordinal. A variable can be both ordinal and categorical. For example, age in decades is an ordinal categorical variable. In this example, we are categorizing patients as being in their 30s, 40s, or 50s, but we are not looking at their exact age. By contrast, if the variable can assume any real number in a range, then the variable is continuous. An example of a continuous variable is cost. A continuous variable is also an interval variable, meaning that the values of the variable represent the magnitude of the presence of the variable. Cost is an interval variable because an operation that costs $12,000 is twice as expensive as an operation that costs $6,000. A count of anything is an interval scale, so the count of patients with adverse events is an interval scale, even though the underlying variable is binary. A ratio variable is an interval variable in which 0 is a valid value, meaning that none of the quantity is present. For example, the number of patients signing up for Medicaid can be considered a ratio variable, as a 0 would mean none signed up. The count of days post admission, known as length of stay, is a continuous variable. Ordinal scales cannot be averaged since they do not show the magnitude of the variable. There are exceptions to this rule, and sometimes ordinal scales are treated as if they were interval scales. Satisfaction with care is typically rated on an ordinal scale but treated as if the scale were interval; thus, we see reports of average satisfaction.
Some variables have nominal levels; unlike continuous variables, their levels are not in any particular order and simply represent different concepts. For example, racial categories (White, Black, Asian, etc.) are nominal, meaning that the categories are not in any particular order.
In healthcare, a set of variables is typically used to measure outcomes of care. These include cost of care, access to care, satisfaction with care, mortality, and morbidity. Cost of care is a continuous variable. Above or below average cost is a binary representation of cost: if above-average cost has a value of 1 and below-average cost has a value of 0, then we can count the percent of patients who have above-average cost. Access to care is sometimes measured in days until the next appointment; a count of days is obviously an interval scale. Mortality is binary, as one is either dead or alive, although in electronic health records the date of death is sometimes entered erroneously, leading to bizarre situations where the data show deceased patients continuing to show up for visits. Probability of death is typically referred to as the patient's prognosis or severity of illness. Any probability is also an interval scale. Finally, morbidities are typically measured on an ordinal scale (e.g., the Barthel index breaks extent of function into different activities, each rated as 0, 5, 10, or 15, with higher scores indicating greater independence). Strictly speaking, morbidity scales are ordinal and cannot be averaged, but again the literature includes exceptions, where averages of ordinal scales are reported.
Univariate data analyses can aid managers and clinicians in assessing their clinics, hospitals, nursing homes, and the like, providing information that allows them to (1) assess quality, (2) monitor costs, and (3) track illnesses. While univariate data analysis does not allow for assessment of causation, it allows health care managers to see trends in data that can help them assess their clinics or hospitals.
Sample
A population of interest is a group of individuals who share a key feature, e.g., all diabetic patients or all employees who are 40 years old. Often it is not possible to examine all members of a population, and thus it is conventional to evaluate a sample of the population. A sample is an organized subset of the population. Statistics are calculated on the sample, and if the sample is representative of the entire population, then the same statistics can be construed as relevant to the population. A sample is judged to be representative of the entire population if it reflects the various subgroups in the population proportionally. For instance, if the population is made up of 200,000 persons and the proportion of persons having income greater than $100,000 per year is 20%, one would need to assure that the sample had about the same proportion of persons at that income level. A large sample may not be representative, and care should be taken to organize the sample in ways that do not introduce bias in who is included in the sample.
There are different ways to sample data, and the choice of sampling is influenced by several factors. In analysis of data from electronic health records, typically all patients in the population are included in the analysis. This is called a complete sample. When working with a complete sample, findings apply directly to the population because statistics are calculated at the population level. Use of a complete sample is ideal for making conclusions because everyone in the population is included, but when a large amount of data is analyzed, managers may need to account for the higher cost of data analysis. One can also randomly sample the data. When doing experiments, random sampling allows one to assign patients randomly to experimental and control groups. This random assignment increases the probability that experimental and control groups do not differ in characteristics other than the intervention. In analysis of data from electronic health records, a random sample of patients may be taken in order to reduce the size of the data to be analyzed. This type of random sampling from all available data is done to reduce computational difficulties. An important method of sampling is stratification. In stratified sampling, patients with certain characteristics are over-sampled so that rare conditions can be over-represented in the sample. Adaptive sampling is a method in which initial samples are used to determine the size and parameters of subsequent samples. A convenience sample is a subset of the population chosen in order to facilitate the analysis and not to generalize to the entire population.
Sample elements in healthcare analyses are typically patients. It is generally assumed that patients are independent sample elements; that is, patient characteristics do not depend on each other. The diseases and treatment decisions of one patient do not depend on the condition of another patient. This is usually true, except, for example, when parents bring their children to care: when a family member brings two sick siblings to the pediatrician, they are seen back to back, and the diseases of one may be informative about the other. Another example of a non-independent sample is when a patient has a contagious disease, as his or her illness will affect the probability of illness in subsequent patients.
In this book, we will show a variable as X; if there are multiple variables, we may show them as X1, X2, ..., Xm. In a sample of n cases, a variable assumes n values, one for each case. We show the value of the variable X in the sample as Xi, where the index i refers to the value of X in the ith case in the sample. Throughout the book, the index i is reserved to indicate the ith case.
Average
In a sample of data, variables may have many different values and one way to understand
the general tendency of a variable is to average the data. An average is a single value that is
representative of all values that a variable can take. In most contexts the term average refers to a
very specific measure of center called the arithmetic mean or simply the mean. The arithmetic
mean is defined as the sum of all observations divided by their number. Thus, in order to
calculate the arithmetic mean we add up all values of our variable of interest and then divide that
sum by the total number of values that were added. When calculated from the population, the arithmetic mean is called the population mean (μ), and when calculated from a sample it is called the sample mean (X̄). Clearly this statistic cannot be calculated for qualitative (nominal or ordinal) data. For any variable X that is measured on an interval or ratio scale, the sample arithmetic mean X̄ can be calculated from n observations using the following formula:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
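This formula translates directly into code. Below is a minimal sketch in Python; the three example values are arbitrary and serve only to illustrate the calculation.

```python
# Minimal sketch of the sample arithmetic mean: sum the observations
# and divide by their count. The example values are arbitrary.
observations = [5, 2, 5]

sample_mean = sum(observations) / len(observations)
print(sample_mean)  # 4.0
```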
The estimate of the sample mean represented by the mathematical expression above is a point estimate, since it provides us with a single estimate, or guess, of the population mean. Naturally, for different samples the value of the sample mean is expected to vary. An estimate of the variation in the value of any sample statistic is called the standard error. The standard error of the arithmetic mean is discussed in a later chapter, where we discuss the central limit theorem.
The expression for the arithmetic mean discussed so far gives the same weight to all observations in the dataset. When this is not the case, a weighted arithmetic mean should be calculated. For the n observations of variable X, the weighted arithmetic mean is the sum of the products of the values Xi and their corresponding weights wi, divided by the sum of the wi:
$$\bar{X} = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}$$
Weighted estimates can be used to correct asymmetries in a sample that occur because a portion of the data is missing. When the sample is not representative of the target population, weights can be used to correct the bias. For example, let's assume that a hospital manager wants to understand satisfaction with emergency room service. To use readily available data, the manager relies on patient reviews posted to the internet each month. After the initial examination of comments on satisfaction, the emergency room clinicians actively ask patients to leave more reviews. In the first month there were only 20 comments, and in the second month 30 comments were left. The hospital manager would like to weight the data so that Month 2 has the same total number of comments as Month 1. Table 1 displays the data. To calculate the weighted values, we multiply the comments left in Month 2 by the weight (20/30). This ensures that the total in the weighted column is 20, the same as Month 1. After this weighting, 6.67 positive comments were left in Month 2, which exceeds the 5 positive comments left in Month 1.
Table 1: Comments Posted to the Internet

            Month 1   Month 2   Weight   Weighted Month 2
Positive    5         10        0.67     6.67
Negative    15        20        0.67     13.33
Total       20        30                 20.00
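The weighting in Table 1 can be reproduced with a short script. This is a minimal sketch in Python; the dictionary names are ours, and the counts come from the table above.

```python
# Reproduce the weighting in Table 1: scale Month 2 comments so that the
# Month 2 total matches the Month 1 total of 20 comments.
month1 = {"positive": 5, "negative": 15}
month2 = {"positive": 10, "negative": 20}

weight = sum(month1.values()) / sum(month2.values())  # 20 / 30, about 0.67

weighted_month2 = {k: v * weight for k, v in month2.items()}
print(round(weighted_month2["positive"], 2))    # 6.67
print(round(weighted_month2["negative"], 2))    # 13.33
print(round(sum(weighted_month2.values()), 2))  # 20.0
```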
Expected Values
A concept closely related to the average is the expected value of a variable. If the values of a variable are mutually exclusive and exhaustive (and they always are, or should be), then the expected value of the variable is the sum, over all levels, of the probability of observing a particular level times the value of that level:

$$E(X) = \sum_i p(X = x_i)\, x_i$$
The formula for the expected value is similar to the weighted average, with the weights replaced by the probabilities of the events, which sum to 1. A common expected value used in the analysis of healthcare data is the Case Mix Index. The Case Mix Index is typically calculated over Diagnosis Related Groups (DRGs). Since the DRGs are mutually exclusive and exhaustive, the expected value can be used to measure the Case Mix of a hospital. One can calculate the Case Mix Index by summing the products of the hospital's probability of seeing a particular DRG and the national estimate of length of stay for that DRG:
$$\text{Case Mix Index} = \sum_i p(\text{DRG}_i) \times \text{Length of Stay for DRG}_i$$
Note that while the probabilities reflect the experience of the hospital, the length of stay reflects the national experience. Since the length of stay is calculated from national estimates, the Case Mix Index shows the extent to which the hospital is seeing patients who require a longer length of stay. Thus, it provides an estimate of the difficulty and severity of the patients seen in the hospital.
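As a sketch of this calculation, the snippet below computes a Case Mix Index from a hospital's DRG mix and national length-of-stay figures. The DRG labels, probabilities, and lengths of stay are made up for illustration.

```python
# Sketch of a Case Mix Index as an expected value: sum over DRGs of the
# hospital's probability of seeing the DRG times the national average
# length of stay for that DRG. The data below are hypothetical.
hospital_drg_probability = {"DRG A": 0.50, "DRG B": 0.30, "DRG C": 0.20}
national_length_of_stay = {"DRG A": 2.0, "DRG B": 4.5, "DRG C": 7.0}

case_mix_index = sum(
    hospital_drg_probability[drg] * national_length_of_stay[drg]
    for drg in hospital_drg_probability
)
print(round(case_mix_index, 2))  # 0.5*2.0 + 0.3*4.5 + 0.2*7.0 = 3.75
```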
We can demonstrate the concept of expected values with hypothetical data from an insurer. This insurer wants to anticipate the expected cost across different groups of potential members. The expected cost is calculated as the sum of the products of the probability that a member joins the insurer and the cost he or she may incur. The insurer has 8 different types of patients who are likely to use the insurance, and each type has a different probability of joining and a different cost (see Table 2). The average cost across all 8 categories is $6,125, but this cost is not weighted for the likelihood that the patients will join the insurance. Once weighted, the average cost increases to $7,150 because the more expensive potential members seem to be more likely to enroll in the insurance.
Table 2: Cost and Likelihood of Utilization of an Insurance Plan

Description                      Probability of Joining   Cost      Weighted Cost
Urban, 60+ years, Male           0.2                      $10,000   $2,000
Urban, 60+ years, Female         0.2                      $10,000   $2,000
Urban, 40-60 years, Male         0.1                      $8,000    $800
Urban, 40-60 years, Female       0.1                      $7,000    $700
Suburban, 60+ years, Male        0.2                      $6,000    $1,200
Suburban, 60+ years, Female      0.05                     $5,000    $250
Suburban, 40-60 years, Male      0.05                     $2,000    $100
Suburban, 40-60 years, Female    0.1                      $1,000    $100
Average                                                   $6,125    $7,150
In our hypothetical example, the combinations depicted (the rows in the table) are assumed to be mutually exclusive and exhaustive; thus, the probabilities of the events add up to 1. The average cost can be readily corrected by weighting each type of member. For example, suburban males aged 60 and over have a 20% chance of enrolling in the insurance and a $6,000 cost. The total contribution of this type of member to the weighted average cost is 0.20 × $6,000 = $1,200. In contrast, a suburban female aged 40 to 60 has a 10% chance of enrolling in the insurance package and a cost of $1,000 per year, so her contribution to the weighted average is 0.10 × $1,000 = $100. In this manner, the cost is weighted by the likelihood that the member will enroll.
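The expected cost in Table 2 can be checked with a few lines of code. This is a sketch using the table's probabilities and costs; the list structure and variable names are ours.

```python
# Expected cost from Table 2: each member type's probability of joining
# times its cost, summed over all (mutually exclusive and exhaustive) types.
member_types = [
    # (probability of joining, cost)
    (0.20, 10_000),  # Urban, 60+, Male
    (0.20, 10_000),  # Urban, 60+, Female
    (0.10,  8_000),  # Urban, 40-60, Male
    (0.10,  7_000),  # Urban, 40-60, Female
    (0.20,  6_000),  # Suburban, 60+, Male
    (0.05,  5_000),  # Suburban, 60+, Female
    (0.05,  2_000),  # Suburban, 40-60, Male
    (0.10,  1_000),  # Suburban, 40-60, Female
]

simple_average = sum(cost for _, cost in member_types) / len(member_types)
expected_cost = sum(p * cost for p, cost in member_types)

print(round(simple_average, 2))  # 6125.0, the unweighted average
print(round(expected_cost, 2))   # 7150.0, the expected (weighted) cost
```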
Standard Deviation
This section describes the concept and measurement of standard deviation; the description is based on the book OpenIntro Statistics. A variable,
by definition, has values that vary across subjects in a sample. Standard deviation is an estimate
of the variation in the values of a variable around that variable's measure of center (or average).
Two variables can have the same average but observations can be spread tightly or widely
around the average. In Figure 1, all three plots have the same average but different dispersion.
The observations in plot A have the smallest dispersion around the mean.
Figure 1: Dispersion around Average
Dispersion matters a great deal, as we are not telling the complete story if we report only the averages. Consider the average hospitalization cost calculated for two different populations: a group of patients who have been hospitalized, and the members of an insurance plan. Compared to members of the health plan, patients in the hospital have a tight dispersion around the average hospitalization cost. A portion of the health plan's members are never hospitalized and thus have 0 hospitalization costs. In contrast, all patients in the hospital have some hospital costs, so there are no zeros. The dispersion around the average hospitalization cost will be larger for plan members than for hospitalized patients.
The mean describes the center of the data, but the variability in the data is also important. Figure 1 shows three sets of data, all with the same mean. Plot A shows a tight spread of data around the mean. Plot C shows a large spread around the mean. Plot B is in between the two.
To measure the spread of data we start with the deviation. The deviation is the difference between the ith observation and the mean: (Xi − μ). In this formula, μ is the mean of the population, which is often unknown. The mean of the sample, as described earlier, is shown as X̄ and can be calculated by summing the observations in the sample and dividing by the count of the observations. You can think of the deviation as how far off course we are if we deviate from our main road and end up at the observation point. We can deviate to the right of the mean or to the left, so the difference between the mean and an observation can be positive or negative. One idea for measuring the spread around the mean is to calculate the sum of deviations for all points in the population. This is problematic: if we just sum the deviations, positive deviations will cancel out negative deviations and create the impression that the spread is smaller than it is. To avoid this, we first square the deviations and then sum them. The measure of spread calculated in this way is called the variance. It is defined as the average of the squared deviations over every point in the population. It is calculated as:
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2$$
The standard deviation is the square root of the variance and can be calculated from the
observations in the sample. The standard deviation is useful when considering how close the data
are to the mean of the sample.
$$s = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}$$
To estimate the standard deviation, first calculate the deviation for each point. In Table 3, we have three observations: 5, 2, and 5. The mean is 4. The deviations from the mean are 1, minus 2, and 1. Note that the sum of the deviations is always 0. Once the deviations have been calculated, you square them. The sum of the squared deviations is 6. Last, you divide the sum by the number of observations minus one and take the square root of the result. The sum is 6 and there are 3 observations, so the division yields 6 / (3 − 1) = 3, and the square root of 3 is approximately 1.7. This value is referred to as the standard deviation.
Table 3: Steps in Calculating Standard Deviations

Observations   Mean   Deviation   Squared Deviation
5              4       1           1
2              4      -2           4
5              4       1           1
Total                  0           6

Number of Observations = 3
Standard Deviation = √(6 / (3 − 1)) ≈ 1.7
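The steps in Table 3 can be verified with a few lines of Python; the statistics module's stdev function uses the same n − 1 denominator.

```python
import statistics

# Reproduce Table 3: deviations from the mean, squared deviations, and the
# sample standard deviation with the n - 1 denominator.
observations = [5, 2, 5]
mean = sum(observations) / len(observations)   # 4.0

deviations = [x - mean for x in observations]  # [1.0, -2.0, 1.0]
squared = [d ** 2 for d in deviations]         # [1.0, 4.0, 1.0]

sample_sd = (sum(squared) / (len(observations) - 1)) ** 0.5
print(round(sample_sd, 1))                       # 1.7
print(round(statistics.stdev(observations), 1))  # same result from the library
```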
Standard deviation is a measure of the spread of data around the mean. If the spread is small (Plot A in Figure 1), the standard deviation will be small. If the spread is large (Plot C in Figure 1), the standard deviation will be large. Usually about 70% of the data fall within one standard deviation of the mean, and about 95% fall within two standard deviations. This is not always the case, and the percentage of observations that fall within a certain number of standard deviations depends on the shape of the distribution.
It should be noted that the standard deviation has the same unit of measurement as
variable X. It should also be clear that this statistic can only be calculated for interval and ratio
variables.
When the values of a variable X are grouped into a frequency distribution, the expression for the sample standard deviation becomes:

$$S = \sqrt{\frac{\sum_{i=1}^{n} f_i\,(X_i - \bar{X})^2}{\sum_{i=1}^{n} f_i}}$$
When all the values in the sample are not weighted identically, the expression for the sample standard deviation becomes:

$$S = \sqrt{\frac{\sum_{i=1}^{n} w_i\,(X_i - \bar{X})^2}{\sum_{i=1}^{n} w_i}}$$
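A small helper following the weighted expression above is sketched below. We assume X̄ is the weighted mean, which the formula does not state explicitly; the example values and weights are made up, and frequencies can be passed as the weights to cover the grouped case.

```python
from math import sqrt

def weighted_sd(values, weights):
    """Weighted standard deviation following the expression above:
    sqrt( sum(w_i * (x_i - x_bar)^2) / sum(w_i) ),
    where x_bar is taken to be the weighted mean (an assumption)."""
    total_weight = sum(weights)
    x_bar = sum(w * x for x, w in zip(values, weights)) / total_weight
    variance = sum(w * (x - x_bar) ** 2 for x, w in zip(values, weights)) / total_weight
    return sqrt(variance)

# Frequencies act as weights, so the same function covers grouped data.
# The values and weights below are made up for illustration.
print(round(weighted_sd([1, 2, 3], [10, 5, 5]), 3))  # about 0.829
```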
Transformation of Variables
In statistics, one often has to transform a variable so that it follows the assumptions made in the analytical phase. For example, when dealing with cost, the data often are not distributed in a fashion that can be used directly in the analysis: some members of an insurance plan have 0 cost. To obtain a smoother distribution of cost, analysts often transform the cost data. In any transformation, the average and standard deviation of the variable change.
Linear transformations of variables X and Y affect the mean and variance of their linear combination. The expected value of the sum (or difference) of linearly transformed variables X and Y is given by:

$$\mu_{(a+bX)\pm(c+dY)} = \mu_{a+bX} \pm \mu_{c+dY} = (a + b\mu_X) \pm (c + d\mu_Y)$$

The variance of the sum (or difference) of linearly transformed variables X and Y that are not independent is given by:

$$\sigma^2_{(a+bX)\pm(c+dY)} = \sigma^2_{a+bX} + \sigma^2_{c+dY} \pm 2\sigma_{(a+bX),(c+dY)} = b^2\sigma^2_X + d^2\sigma^2_Y \pm 2bd\,\sigma_{XY}$$

When X and Y are independent, the expression for the variance of the sum of their linear transformations is given as follows:

$$\sigma^2_{(a+bX)\pm(c+dY)} = \sigma^2_{a+bX} + \sigma^2_{c+dY} = b^2\sigma^2_X + d^2\sigma^2_Y$$
Example 1: The joint probability distribution of random variables X and Y is provided in Table 4. Calculate the expected value and variance of the difference between their linear transformations 2 + 3X and 1 + Y.
Table 4: Probability of Two Random Variables

X       Y       p(X, Y)
0      -1       0.2
1       0       0.3
2       1       0.5
Total           1
First, we calculate the expected value, variance, and covariance for X and Y.

$$\mu_X = 0.2(0) + 0.3(1) + 0.5(2) = 1.3$$

$$\mu_Y = 0.2(-1) + 0.3(0) + 0.5(1) = 0.3$$

$$\sigma^2_X = 0.2(0 - 1.3)^2 + 0.3(1 - 1.3)^2 + 0.5(2 - 1.3)^2 = 0.61$$

$$\sigma^2_Y = 0.2(-1 - 0.3)^2 + 0.3(0 - 0.3)^2 + 0.5(1 - 0.3)^2 = 0.61$$

$$\sigma_{XY} = 0.2(0 - 1.3)(-1 - 0.3) + 0.3(1 - 1.3)(0 - 0.3) + 0.5(2 - 1.3)(1 - 0.3) = 0.61$$

Next, we calculate the mean and variance of the difference between the linear transformations of X and Y.

$$\mu_{(2+3X)-(1+Y)} = (2 + 3\mu_X) - (1 + \mu_Y) = (2 + 3 \times 1.3) - (1 + 0.3) = 4.6$$

$$\sigma^2_{(2+3X)-(1+Y)} = b^2\sigma^2_X + d^2\sigma^2_Y - 2bd\,\sigma_{XY} = 3^2(0.61) + 1^2(0.61) - 2(3)(1)(0.61) = 2.44$$
Transformation of Variables using Excel:
The computation of the expected value and variance of the difference between linear combinations of X and Y is shown in Figure 2 below using Excel. Figure 3 shows all the embedded formulas.

Figure 2: Expected value and variance of the difference between linear combinations of X and Y, computed in Excel

Figure 3: Embedded Excel formulas used in the computation shown in Figure 2
Probability
Probability is a quantitative measure of uncertainty. When we are sure that an event will occur, we say that it has a probability of 1. When we are sure that an event will not occur, we assign it a probability of zero. When we are completely in the dark, we give it a probability of 0.5, or note that it has a 50% chance of occurrence. Values between 0 and 1 measure how uncertain we are about the occurrence of an event. More specifically, the probability of an event is the likelihood or chance that the event will occur. There are three basic types of probability:

• Theoretical probability
• Empirical probability
• Subjective or personal probability
Theoretical probabilities are rooted in theory and can be calculated without actually observing data. This is possible if the process responsible for generating an event is known. Theoretical probabilities are typically associated with games of chance such as a coin toss, the roll of a die, or drawing a card from a standard deck of 52 playing cards, and involve mutually exclusive and equally likely events. Events are mutually exclusive if only one of all possible events can occur at any given point in time. Events are equally likely if the probability of occurrence of each event is the same. If there are a total of n mutually exclusive and equally likely possible events and m of these are event A, then the probability of event A, denoted by p(A), is given by:

$$p(A) = \frac{m}{n}$$
Example 2: Three hospitals are competing for an HMO contract. What is the chance that one of them will be awarded the contract? In this scenario the total number of possible events is 3: any of the 3 participants can be awarded the contract, and one must be awarded it. Not knowing anything else about each hospital's bid, we can assume that they have equal chances of getting the award; therefore the probability is:

$$p(\text{Award}) = \frac{m}{n} = \frac{1}{3} \approx 0.33 = 33\%$$
Subjective probabilities are based on personal experience and consequently tend to vary across individuals. One can assess the subjective probability that a group of experts assigns to an event and use this consensus to plan future expansion of a service. Subjective probabilities can be used when theoretical probabilities are not applicable and there is insufficient observed data to form empirical probabilities.
Empirical probabilities are rooted in observed data. Typically, frequency distributions and contingency tables are used to organize observed data, which in turn are used to obtain probability estimates. Thus, empirical probabilities are essentially relative frequencies associated with the occurrence of one or more events. For an event A of interest, the probability of A, p(A), is given by:

$$p(A) = \frac{f}{n}$$

where f is the frequency with which outcome A occurs in the observed data, and n is the sum of the frequencies corresponding to all possible outcomes.
Example 3: An administrator observed the data in Table 5, collected on newly admitted patients over a period of 12 months, and wants to know the probability that the next patient will have a cost overrun.
Table 5: Cost Overruns

Cost overrun    Frequency
Over            399
Under           665
Total           1064
In this example, past data suggest that 399 patients out of a total of 1,064 had a cost overrun. The corresponding relative frequency is 399/1064 = 0.375. Thus, there is a 37.5% chance that the next patient will have a cost overrun.
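As a quick sketch, the relative frequency in Table 5 can be computed as follows; the dictionary keys are ours.

```python
# Empirical probability from Table 5: relative frequency of cost overruns.
frequencies = {"over": 399, "under": 665}
total = sum(frequencies.values())   # 1064

p_overrun = frequencies["over"] / total
print(round(p_overrun, 3))          # 0.375
```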
All probabilities, whether subjective, theoretical, or empirical, follow three simple rules; furthermore, any set of scores that follows these rules is a probability function:
1. The probability of an event is a number between 0 and 1.
2. Some event will certainly happen, so the sum of the probabilities of all events is 1.
3. The probability that one of two mutually exclusive events occurs equals the sum of the probability of each occurring.
A probability function can be defined by these rules, but such rules provide no intuition about what a probability function is. One way to think of probability is as the frequency of observing an event among all possible events. This is called the frequentist definition of probability. There are also other ways of defining probabilities, e.g., the subjective definition, where probability is the strength of one's belief that an event will occur. Savage (1954) and DeFinetti (1964) argued that the rules of probability can work with uncertainties expressed as strength of opinion. A frequentist definition is not possible for events that have not occurred or that do not have repeated observations. In this book we rely on the frequentist definition of probability: the prevalence of the event among the possible events.
The probability of a small business failing is then the number of business failures divided by the number of small businesses. The probability of an iatrogenic infection (one caused by care in a healthcare setting) in the last month in our hospital is the number of patients who had an iatrogenic infection in our hospital last month divided by the number of patients in our hospital during the last month. If one can count the events of interest, then the probability of the event can be calculated, even for rare adverse events. The daily probability of wrong-side surgery in our facility is the number of days on which a wrong-side surgery was reported divided by the number of days on which we had surgeries. Graphically, we can show a probability by drawing a square whose area is proportional to the number of possible events and a circle representing all the ways in which event A occurs; the ratio of the circle to the square is then the probability of A (see Figure 4).
Probability Calculus
Notice that for any event of interest A, the probability of occurrence of A is the complement of the probability that event A will not occur, and is calculated as:

$$p(A) = 1 - p(A'), \qquad p(A') = 1 - p(A)$$

where p(A′) is the probability that event A will not occur. The probabilities of two mutually exclusive and exhaustive events can be calculated from the same relationship, with A and A′ replaced by the two mutually exclusive and exhaustive events.
There are two basic rules related to probability: the multiplication rule and the addition rule. These can be used to calculate the joint probability of two or more events. The exact probability formula depends on whether or not the events are independent. Events are independent when the probability of occurrence of one event does not affect the probability of occurrence of the other event. On the other hand, events are not independent, or their probabilities are conditional, when the probability of occurrence of one event does affect the probability of occurrence of the other event. For any two events A and B that are independent, the multiplication rule is:

$$p(A \cap B) = p(A)\,p(B)$$

where ∩ is the intersection operator and p(A ∩ B) is the probability that both A and B occur. For two events A and B that are not independent, the multiplication rule takes the following form:

$$p(A \cap B) = p(A)\,p(B|A)$$

where p(B|A) is the conditional probability of occurrence of event B given that event A has occurred. If B occurs first, then the multiplication rule takes the following form:

$$p(A \cap B) = p(B)\,p(A|B)$$
The addition rule of probability can be used to calculate the probability that one event or another occurs. The probability of one of two events A and B occurring is calculated by first summing all the possible ways in which event A will occur plus all the ways in which event B will occur, minus all the possible ways in which both events A and B will occur (this term is subtracted because it is double counted in the previous sums). This sum is divided by all the possible ways that any event can occur. This is represented in mathematical terms as:

$$p(A \text{ or } B) = p(A) + p(B) - p(A \,\&\, B)$$

Graphically, the concept can be shown as the yellow and red areas in Figure 5 divided by the blue area.
Figure 5: Graphical Representation of the Probability of A or B
Similarly, the probability that A and B co-occur corresponds to the overlap between A and B and can be shown graphically as the red area divided by all possible events (the rectangular blue area) in Figure 6.
Figure 6: Graphical Representation of Probability A and B
Repeated application of the definition of probability gives us a simple calculus for combining the uncertainty of two or more events. We can now ask questions such as: "What is the probability that frail elderly (age > 75) or infants will join our HMO?" According to our formulas this can be calculated as:

p(Frail Elderly or Infants) = p(Frail Elderly) + p(Infants) − p(Frail Elderly & Infants)

Since the chance of being both frail elderly and an infant is zero (i.e., the two events are mutually exclusive), we can rewrite the formula as:

p(Frail Elderly or Infants) = p(Frail Elderly) + p(Infants)
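A minimal sketch of the addition rule applied to this example is shown below; the HMO membership counts are made up for illustration.

```python
# Addition rule: p(A or B) = p(A) + p(B) - p(A and B).
# Hypothetical membership counts for an HMO; the numbers are made up.
total_members = 10_000
frail_elderly = 1_200  # members over age 75
infants = 800          # members under age 1
both = 0               # no one is both frail elderly and an infant

p_frail = frail_elderly / total_members
p_infant = infants / total_members
p_both = both / total_members

p_frail_or_infant = p_frail + p_infant - p_both
print(round(p_frail_or_infant, 2))  # 0.2, since the events are mutually exclusive
```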
Distribution
A distribution gives the probability of observing various levels of a variable. In future chapters, we introduce many different distributions, including the Normal, Binomial, Poisson, and Uniform distributions. These are distributions where the shape of the distribution is assumed to follow a formula, and statistical parameters are used to fit the formula to the data. Such distributions are called parametric distributions. In addition to parametric distributions, one can also examine the observed distribution of the data, typically through a histogram.
A histogram for discrete variables can be created by counting the number of times each level of the variable occurs. The frequency associated with a specific value of a discrete variable is the total number of times that value is observed in the data. Frequencies can be converted into relative frequencies by dividing each individual frequency by the total number of observations. These relative frequencies are often multiplied by 100 in order to transform them into percentages. Visual tools for summarizing discrete variables include the bar chart and the pie chart, and, when the number of observations is large, the histogram. For illustration, we have provided the frequency distribution of the age of employees working at a large hospital located in a metropolitan area in the following table. Notice that we have also added relative frequency and cumulative frequency columns to this table. The pie and bar charts corresponding to the frequency distribution of age are shown in the two figures following the table.
In addition to the frequency distribution, common measures of center such as the mean, median, and mode, and measures of dispersion such as the range and standard deviation, can be calculated in order to summarize a numeric variable that has discrete values.
Example 4: Table 6 provides the frequency of observing different values of age. Relative frequency is calculated by dividing each frequency by the total and can be interpreted as a probability. Cumulative frequency shows the number of employees falling in a given age group or any lower age group.
Table 6: Distribution of Age

Age            Frequency   Relative Frequency   Cumulative Frequency
Less than 21   41          0.01                 41
21 – 25        183         0.05                 224
26 – 30        333         0.10                 557
31 – 35        470         0.14                 1027
36 – 40        521         0.15                 1548
41 – 45        653         0.19                 2201
46 – 50        482         0.14                 2683
51 – 55        360         0.11                 3043
56 – 60        151         0.04                 3194
61 – 65        122         0.04                 3316
More than 65   52          0.02                 3368
Total          3368        1
Histogram
One popular method to display discrete data is to construct a histogram. Histograms can also display continuous variables after they are broken into discrete ranges. A histogram can be thought of conceptually as a bar chart: in it, we count the number of times the variable X falls within discrete ranges. Since adjacent class intervals do not have any gaps between them, the corresponding bars are also constructed without any gaps between them. The optimal class interval size is often based on a mathematical formula, such as the one proposed by Sturges (1926):
$$C = \frac{X_{Max} - X_{Min}}{1 + 3.322 \log(n)}$$

where C is the optimal class interval size, which is a function of the maximum value of X, X_Max, the minimum value of X, X_Min, and the total number of observations, n.
Example 5: Table 7 lists the household size of 25 patients currently residing in a hospital ward. Construct a histogram of these values.

Table 7: Household Size

1   2   3   2   4
5   4   3   4   3
5   3   1   2   4
3   3   8   2   7
2   1   2   4   3
In order to construct the histogram, the values of X should first be organized into a frequency distribution. The optimal class interval size can be obtained by applying Sturges' formula:

$$C = \frac{8 - 1}{1 + 3.322 \log(25)} \approx 1.2$$
Note that in order to ensure continuity across adjacent class intervals, the optimal class interval
size should be calculated to one more digit after the decimal point than the original precision of
X. In our example, since X values are positive integers the class interval size has been rounded to
one digit after the decimal point.
The next step is to construct a frequency distribution of X based on the optimal class
interval size, C. The lower limit of the lowest class interval can be set equal to the minimum
value of X. This frequency distribution is shown in Table 8.
Table 8: Frequency Distribution for Household Size

Class interval   Frequency (f)
1.0 – 2.2        9
2.2 – 3.4        7
3.4 – 4.6        5
4.6 – 5.8        2
5.8 – 7.0        0
7.0 – 8.2        2
Total            25
The final step is to construct a bar chart of the frequency distribution of class intervals of X
without any empty space between adjacent bars. The resulting graph is the histogram of X. Such
graphs can be easily constructed in common spreadsheet programs such as Excel.
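As an alternative sketch to the spreadsheet steps described next, the frequency distribution in Table 8 can be reproduced with a short script. The class boundaries follow Table 8, and the bin counting relies on numpy's histogram function.

```python
import numpy as np

# Household sizes from Table 7.
x = [1, 2, 3, 2, 4,
     5, 4, 3, 4, 3,
     5, 3, 1, 2, 4,
     3, 3, 8, 2, 7,
     2, 1, 2, 4, 3]

# Sturges-based class interval size, about 1.2 for these 25 values.
c = (max(x) - min(x)) / (1 + 3.322 * np.log10(len(x)))

# Class boundaries starting at the minimum value, as in Table 8.
edges = [1.0, 2.2, 3.4, 4.6, 5.8, 7.0, 8.2]
counts, _ = np.histogram(x, bins=edges)

for lower, upper, f in zip(edges[:-1], edges[1:], counts):
    print(f"{lower:.1f} - {upper:.1f}: {f}")
# Prints the frequencies 9, 7, 5, 2, 0, 2, matching Table 8.
```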
How to Make a Histogram in Excel
Inside Excel there are two ways to make a histogram. One way is easy, but many of the intermediate steps are automated and one does not get an intuition about how histograms are constructed; the other way is manual and slower, but it shows how a histogram is built. We first discuss how to do it the hard way and then show the easy, automated steps for constructing a histogram. In order to create a histogram in Excel, the following steps need to be performed:
1. Enter values of variable X into a blank worksheet in Excel. These values do not need to
be in a single column.
2. Compute optimal class interval size, C.
3. Construct a frequency distribution of X based on C.
4. Calculate the frequency corresponding to each individual class interval.
5. Graph the frequency distribution with class intervals on the X-axis and corresponding
frequencies (f) on the Y-axis.
Figure 7 shows the result of these steps in Excel.
Figure 7: Results of Steps in Preparing a Histogram
It should be noted that the frequency distribution shown in the previous figure was constructed
with the aid of several embedded formulas in Excel. These embedded formulas automate the
frequency distribution construction and minimize the chance of any mistakes. Figure 8 shows all
embedded formulas used in the histogram construction along with a brief description of each
formula.
Figure 8: Excel Formulas for Calculating Frequency Distribution
There is an easier way to construct a histogram, where all of the steps are automated.
One can use the Histogram tool of the Analysis ToolPak in Microsoft Office Excel. Before you
can use the Histogram tool, make sure that the Analysis ToolPak Add-in is installed. You can do
so in three steps:
1. On the File menu, click Options.
2. In the Manage list, select Excel Add-ins, and then click Go.
3. In the Add-Ins dialog box, make sure that the Analysis ToolPak check box under Add-Ins available is selected, and then click OK. If you do not see Analysis ToolPak in the Add-Ins dialog box, re-run Microsoft Excel Setup and add this component to the list of installed items.
Microsoft provides detailed instructions on how to install the ToolPak on the web. Once the ToolPak is installed, follow these steps to create a histogram:
1. On Microsoft Excel's Data tab, click Data Analysis in the Analysis group.
2. In the Data Analysis dialog box, click Histogram, and then click OK.
3. Enter the input and output ranges in the Histogram dialog box. If you want to set the bin range, do so; otherwise Excel will select an appropriate bin range. Complete the output options by selecting an area where the output will be displayed. Select a chart output.
In these procedures, you do not need to estimate the size of the bins or enter formulas to count the number of times a value falls inside a bin. The Histogram tool does all these steps automatically.
Transformation of Distributions
The log transformation is a widely used method to address skewed distributions. Skewed data are seen in non-symmetric distributions, where more of the data occur below or above the average. Cost data are a good example, as individuals with no cost skew the distribution; in these situations, cost data are often transformed using the log function.
In Figure 9, the left side shows the distribution of the original data. Notice that more of the data fall below the average than above it. The right side shows the log transformation of the same data; the distribution of the transformed data is much more symmetric. Data transformations must be applied cautiously, making sure that the findings on the transformed data are relevant to the original data.
Figure 9: Effect of Log Transformation on Skewness
(Taken with permission from Log-transformation and its implications for data analysis
Shanghai Arch Psychiatry. 2014 Apr; 26(2): 105–109.)
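A minimal sketch of a log transformation on right-skewed cost data is shown below. The cost values are made up; log1p (the log of 1 + x) is used here only because some members have zero cost and log(0) is undefined, which is one common workaround rather than the chapter's prescription.

```python
import numpy as np

# Sketch of a log transformation on right-skewed cost data.
# The costs below are hypothetical; note the spike of zero-cost members.
costs = np.array([0, 0, 120, 250, 300, 450, 800, 1_500, 6_000, 40_000])

log_costs = np.log1p(costs)  # log(1 + cost) avoids log(0)

print(np.mean(costs), np.median(costs))  # mean far above the median: skewed
print(np.round(log_costs, 2))            # values on a much more compact, symmetric scale
```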
Observing Variables over Time
Managers are often interested in knowing whether their units are improving or getting worse. A common method of examining a variable is to look at its values over time. These types of data are analyzed using a statistical control chart. The purpose of a control chart is to discipline intuitions. Most people read too much into their successes and attribute their failures to external events. For example, Figure 10 shows a control chart. There are considerable variations among patient experiences, some positive and some negative. Many providers and managers read too much into an occasional complaint and fail to reserve judgment until they see patterns among patient experiences. A control chart can help us understand if there is a pattern among patient experiences. In a control chart, one monitors progress over time. A plot is created where the X-axis is time since the start and the Y-axis is the outcome one is monitoring. To decide whether outcomes are different from historical patterns, the upper (UCL) and lower (LCL) control limits are calculated. These limits are set so that, if the historical pattern has continued, 99% of the data will fall within them. The upper and lower control limits are calculated using mathematical formulas that are specific to the type of outcome one is monitoring. In later chapters we describe the various assumptions made and which type of control chart is best for different data.
Figure 10: Structure of a Typical Control Chart
A control chart is useful in many different ways. Points outside the limits are unusual and mark a departure from historical patterns. If tracking patient experiences, a point outside the limits indicates an experience different from historical patterns. Two points in Figure 10 fall below the LCL and therefore mark a departure from historical patterns. All other points do not indicate any real change, even though there is a lot of variation. These small fluctuations are random and not different from the historical patterns. If the data fall within the control limits, then despite day-to-day variation there has not been any change. If the process is producing the results you need, then you want your data to fall within the limits.
Minimum Observations for Control Charts
The more data you have, the more precision you have in constructing the upper and lower control limits. Not all data are used for calculation of the control limits. Often, the limits are based on the pre-intervention period, and subsequent post-intervention observations are compared to the pre-intervention limits. When a manager makes a change in the process, patient experiences after that point have been affected by the change. In these circumstances, the limits are set based on the pre-intervention data, and post-intervention findings are compared to limits calculated from the pre-intervention period. If any points fall outside the limits, then one can conclude that the intervention has changed the pattern of patient experiences. See Figure 11 for an example of limits set based on pre-intervention periods. In this figure, the solid red lines indicate the calculated upper and lower control limits, and the dashed lines show the extension of these limits into the post-intervention period.
Figure 11: An Example of Limits set based on Pre-intervention Periods
Compare the chart in Figure 11 with the chart in Figure 10. Both are based on the same data, but in Figure 11 the limits are based on the first 7 days, before the intervention. Figure 11 shows that the post-intervention data are lower than the LCL and therefore a significant change has occurred. When Figure 11 is compared to Figure 10, we see that more points are outside the limits in Figure 11. By setting the limits from pre-intervention patterns, we were able to detect the improvements since the intervention more accurately. The amount of data used in constructing the control limits depends on the timing of the intervention and changes in the underlying process. Use at least 7 data points before the start of the intervention to set the control limits. Of course, one should use more data points to get a more stable picture of the process. As one uses more data points, one is going back further in time, and the more distant the data, the less relevant they are to the current situation. There is a practical limit to how far back one should go. Taking data from years ago may make the analysis less accurate if the underlying process has since changed and the older data are no longer relevant.
23