Chapter 3: Probability & Distributions

By Farrokh Alemi, PhD. Version of Tuesday, February 07, 2017. Portions of this chapter were written by Munir Ahmed and edited by Nancy Freeborne, DrPH.

All statistical analyses follow an organized set of steps. A first step in analysis is the examination of univariate data, which we describe in this chapter. Univariate data analysis involves the evaluation of one variable at a time. Analysis of this type helps health care managers, clinicians, and researchers assess patterns. Examples of how managers or clinicians can use univariate data analysis include:

1. Counting the number of nursing home patients who fell in a given year;
2. Assessing the most common time of day for medication errors at a hospital;
3. Counting the number of children admitted with an influenza diagnosis in the month of January.

Variables

A variable measures the extent to which a concept is present, the degree to which an object is available, or the extent of a person's characteristic. Sometimes variables are referred to as attributes, clues, or characteristics. Here are several example variables used in healthcare:

Cost:
• Billed amount for medical service
• Billed amount for surgical procedure

Patient satisfaction:
• Score from 1 to 10, reflecting very dissatisfied to very satisfied

Disease status:
• Presence of diabetes
• Presence of hypertension
• Count of symptoms of depression

Variables have at least two levels, and some have many levels. The values of a variable change from one level to another. A variable never has only one value on all occasions, but it could have one value for a particular set of analyses or evaluations. In that case, we refer to the variable as a constant. For example, if we are looking at emergency department visits, the constant variable could be one emergency room, while the non-constant variable would be the time of visit.

If a variable can assume only two levels, then the variable is binary. An example of a binary variable is the occurrence of an adverse event, such as wrong-side surgery. Either the surgery has been done or it has not. If the event has occurred, the variable has a value of 1; otherwise it is 0. Sometimes binary variables are referred to as indicators.

If there are more than two, but a countable set of, levels, then the variable is discrete or categorical. An example of a categorical variable is race. Here are examples of categorical variables commonly used in healthcare:

Race:
• American Indian
• Black or African American
• White
• Latino or Hispanic

Insurance status:
• Medicaid
• Medicare
• Private insurance

Marital status:
• Never married
• Divorced
• Married
• Widowed

If variable levels show an order, then the variable is ordinal. A variable can be both ordinal and categorical. For example, age in decades is an ordinal categorical variable. In this example, we are categorizing patients as in their 30s, 40s, or 50s, but we are not looking at their exact age. By contrast, if the variable can assume any real number in a range, then the variable is continuous. An example of a continuous variable is cost. A continuous variable is also an interval variable, meaning that the values of the variable represent the magnitude of the presence of the variable. Cost is an interval variable because an operation that costs $12,000 is twice as expensive as an operation that costs $6,000. A count of anything is an interval scale, so the count of patients with adverse events is an interval scale, even though the underlying adverse-event variable itself is binary.
A ratio variable is an interval variable in which 0 is a valid value, meaning specifically that none of the quantity is present. For example, the number of patients signing up for Medicaid can be considered a ratio variable, because a value of 0 means that no one signed up. The count of days post admission, known as length of stay, is a continuous variable. Ordinal scales cannot be averaged, since they do not show the magnitude of the variable. There are exceptions to this rule, and sometimes ordinal scales are treated as if they were interval scales. Satisfaction with care is typically rated on an ordinal scale but treated as if the scale were interval; thus, we see reports of average satisfaction. Some variables have nominal levels; unlike continuous variables, their levels are not in any particular order and simply represent different concepts. For example, racial categories (White, Black, Asian, etc.) are nominal, meaning that racial categories are not in any particular order.

In healthcare, a set of variables is typically used to measure outcomes of care. These include cost of care, access to care, satisfaction with care, mortality, and morbidity. Cost of care is a continuous variable. Above or below average cost is a binary representation of cost. If above-average cost has a value of 1 and below-average cost has a value of 0, then we can count the percent of patients who have above-average cost. Access to care is sometimes measured in days until the next appointment; a count of days is obviously an interval scale. Mortality is binary, as one is either dead or alive. (In electronic health records, the date of death is sometimes entered erroneously, leading to bizarre situations where the data show dead patients continuing to show up for visits.) The probability of death is typically referred to as the patient's prognosis or severity of illness. Any probability is also an interval scale. Finally, morbidities are typically measured on ordinal scales (e.g., the Barthel index breaks extent of function into different activities, and each activity is rated as 0, 5, 10, or 15). Strictly speaking, morbidity scales are ordinal and cannot be averaged, but again the literature includes exceptions, where averages of ordinal scales are reported.

Univariate data analyses can aid managers and clinicians in assessing their clinics, hospitals, nursing homes, and the like so that they can obtain information that will allow them to: 1) assess quality, 2) monitor costs, and 3) track illnesses. While univariate data analysis does not allow for assessment of causation, it allows health care managers to see trends in data, which can help them with the assessment of their clinics or hospitals.

Sample

A population of interest is a group of individuals who share a key feature, e.g., all diabetic patients or all employees who are 40 years old. Often it is not possible to examine all members of a population, and thus it is conventional to evaluate a sample of the population. A sample is an organized subset of the population. Statistics are calculated on the sample, and if the sample is representative of the entire population, then the same statistics can be construed as relevant to the population. A sample is judged to be representative of the entire population if it reflects the various subgroups in the population proportionally.
For instance, if the population is made up of 200,000 persons and the proportion of persons having income greater than $100,000 per year is 20%, one would need to ensure that the sample has about the same proportion of persons at that income level. A large sample may not be representative, and care should be taken to organize the sample in ways that do not introduce a bias in who is included in the sample.

There are different ways to sample data, and the choice of sampling is influenced by several factors. In the analysis of data from electronic health records, typically all patients in the population are included in the analysis. This is called a complete sample. When working with a complete sample, any findings generalize to the population because they are calculated at the population level. Use of a complete sample is ideal for drawing conclusions because everyone in the population is included, but when a large amount of data is analyzed, managers may need to account for the higher cost of data analysis.

One can also randomly sample the data. When doing experiments, random sampling allows one to assign patients randomly to experimental and control groups. This random assignment increases the probability that experimental and control groups do not differ in characteristics other than the intervention. In the analysis of data from electronic health records, a random sample of patients may be taken in order to reduce the size of the data to be analyzed. This type of random sampling from all available data is done to reduce computational difficulties. An important method of sampling is stratification. In stratified sampling, patients with certain characteristics are over-sampled so that rare conditions can be over-represented in the sample. Adaptive sampling is a method in which initial samples are used to determine the size and parameters of subsequent samples. A convenience sample is a subset of the population chosen to facilitate the analysis, not to generalize to the entire population.

Sample elements in healthcare analyses are typically patients. It is generally assumed that patients are independent sample elements; that is, patient characteristics do not depend on each other. The diseases and treatment decisions of one patient do not depend on the conditions of another patient. This is usually true, except, for example, when parents bring their children to care. When a family member brings two sick siblings to the pediatrician, they are seen back to back, and the disease of one might inform the diagnosis of the other. Another example of a non-independent sample is a patient with a contagious disease, as his or her illness will affect the probability of illness of subsequent patients.
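To make the sampling strategies above concrete, here is a minimal Python sketch of complete, simple random, and stratified sampling from a hypothetical patient list. The patient records, the 10% sampling fraction, and the condition field are illustrative assumptions, not part of the chapter.

```python
import random
from collections import defaultdict

# Hypothetical patient records: (patient_id, condition). Illustrative only.
population = [(i, "diabetes" if i % 20 == 0 else "other") for i in range(1, 201)]

# Complete sample: every patient in the population is analyzed.
complete_sample = population

# Simple random sample: take 10% of patients to reduce computation.
random.seed(1)
random_sample = random.sample(population, k=len(population) // 10)

# Stratified sample: over-sample the rare condition so it is well represented.
by_condition = defaultdict(list)
for patient in population:
    by_condition[patient[1]].append(patient)
stratified_sample = (
    by_condition["diabetes"]                          # keep all rare cases
    + random.sample(by_condition["other"], k=10)      # plus a few common cases
)

print(len(complete_sample), len(random_sample), len(stratified_sample))
```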
In this book, we will show a variable as X; if there are multiple variables, we may show them as $X_1, X_2, \dots, X_m$. In a sample of n cases, a variable assumes n values, one for each case. We show the value of the variable X in the sample as $X_i$, where the index i refers to the value of X in the ith case in the sample. Throughout the book, the index i is reserved to indicate the ith case.

Average

In a sample of data, variables may have many different values, and one way to understand the general tendency of a variable is to average the data. An average is a single value that is representative of all values that a variable can take. In most contexts the term average refers to a very specific measure of center called the arithmetic mean, or simply the mean.

The arithmetic mean is defined as the sum of all observations divided by their number. Thus, in order to calculate the arithmetic mean, we add up all values of our variable of interest and then divide that sum by the total number of values that were added. When calculated from the population, the arithmetic mean is called the population mean (µ), and when calculated from the sample it is called the sample mean ($\bar{X}$). Clearly this statistic cannot be calculated for qualitative (nominal or ordinal) data. For any variable X that is measured on an interval or ratio scale, the sample arithmetic mean $\bar{X}$ can be calculated from n observations using the following formula:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

The estimate of the sample mean represented by the mathematical expression above is a point estimate, since it provides us with a single estimate, or guess, of the population mean. Naturally, for different samples the value of the sample mean is expected to vary. An estimate of the variation in the value of any sample statistic is called the standard error. The standard error of the arithmetic mean is discussed in a later chapter, where we discuss the central limit theorem.

The expression for the arithmetic mean discussed so far gives the same weight to all observations in the dataset. When this is not the case, a weighted arithmetic mean should be calculated. For the n observations of variable X, the weighted arithmetic mean is the sum of the products of the values $X_i$ and their corresponding weights $w_i$, divided by the sum of the $w_i$:

$$\bar{X} = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}$$

Weighted estimates can be used to correct asymmetries in a sample that occur due to a portion of the data being missing. When the sample is not representative of the target population, weights can be used to correct the bias. For example, let's assume that a hospital manager wants to understand satisfaction with emergency room service. To use readily available data, the manager relies on patient reviews posted to the internet in a month. After the initial examination of comments on satisfaction, the emergency room clinicians actively ask patients to leave more reviews. In the first time period, only 20 comments were left, and in the subsequent period 30 comments were left. The hospital manager would like to weight the data so that Month 2 has the same number of comments as Month 1. Table 1 displays the data. To calculate the weighted values, we multiply the comments left in Month 2 by the weight (20/30). This ensures that the total in the weighted column is 20, the same as Month 1. After this weighting, 6.67 positive comments were left in Month 2, which exceeds the 5 positive comments left in Month 1.

Table 1: Comments Posted to the Internet

                     Positive   Negative   Total
Month 1                  5         15        20
Month 2                 10         20        30
Weight                 0.67       0.67
Weighted Month 2       6.67      13.33     20.00

Expected Values

A concept closely related to averages is the expected value of a variable. If the values of a variable are mutually exclusive and exhaustive (and they always are, or should be), then the expected value of the variable is the sum, over all levels, of the probability of observing a particular level of the variable times its value:

$$E(X) = \sum_{i} p(X = x_i)\, x_i$$

The formula for the expected value is similar to the weighted average, where the weights are replaced with the probabilities of the events and the sum of the probabilities is 1.
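The short Python sketch below reproduces the Table 1 weighting and the expected-value formula just described; the variable names and the small expected-value example at the end are illustrative, not part of the chapter's data.

```python
# Weighted counts for Table 1: scale Month 2 comments so the total matches Month 1.
month1 = {"positive": 5, "negative": 15}
month2 = {"positive": 10, "negative": 20}

weight = sum(month1.values()) / sum(month2.values())          # 20 / 30 = 0.67
weighted_month2 = {k: round(v * weight, 2) for k, v in month2.items()}
print(weighted_month2)                       # {'positive': 6.67, 'negative': 13.33}

# Expected value: sum of (probability of each level) * (value of that level).
def expected_value(levels):
    """levels is a list of (probability, value) pairs covering mutually
    exclusive and exhaustive outcomes, so the probabilities sum to 1."""
    return sum(p * x for p, x in levels)

print(expected_value([(0.2, 0), (0.3, 1), (0.5, 2)]))         # 1.3
```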
A common expected value used in the analysis of healthcare data is the Case Mix Index. The Case Mix Index is typically calculated over Diagnosis Related Groups (DRGs). Since the DRGs are mutually exclusive and exhaustive, the expected value can be used to measure the case mix of a hospital. One can calculate the Case Mix Index by summing the products of the hospital's probability of seeing a particular DRG and the national estimate of length of stay for that DRG:

$$\text{Case Mix Index} = \sum_{i} p(\text{DRG}_i) \times \text{Length of Stay for DRG}_i$$

Note that while the probabilities reflect the experience of the hospital, the length of stay reflects the national experience. Since the length of stay is calculated from national estimates, the Case Mix Index shows the extent to which the hospital is seeing patients who require a longer length of stay. Thus, it provides an estimate of the difficulty and severity of the patients seen in the hospital.

We can demonstrate the concept of expected values through hypothetical data from an insurer. This insurer wants to anticipate the expected cost across different groups of potential members. The expected cost is calculated as the sum of the products of the probability of a member joining the insurer and the cost he or she may incur. The insurer has 8 different types of patients who are likely to use the insurance, and each type has a different probability of joining and a different cost (see Table 2). The average cost for all 8 categories is $6,125, but this cost is not weighted for the likelihood that the patients will join the insurance. Once weighted, the average cost increases to $7,150, because the more expensive potential members seem to be more likely to enroll in the insurance.

Table 2: Cost and Likelihood of Utilization of an Insurance Plan

Description                       Probability of Joining      Cost     Weighted Cost
Urban, 60+ years, Male                    0.20              $10,000       $2,000
Urban, 60+ years, Female                  0.20              $10,000       $2,000
Urban, 40-60 years, Male                  0.10               $8,000         $800
Urban, 40-60 years, Female                0.10               $7,000         $700
Suburban, 60+ years, Male                 0.20               $6,000       $1,200
Suburban, 60+ years, Female               0.05               $5,000         $250
Suburban, 40-60 years, Male               0.05               $2,000         $100
Suburban, 40-60 years, Female             0.10               $1,000         $100
Average                                                      $6,125       $7,150

In our hypothetical example, the combinations depicted (rows in the table) are assumed to be mutually exclusive and exhaustive. Thus the probabilities of the events add up to 1. The average cost can be readily corrected by weighting each type of member. For example, suburban males aged 60+ have a 20% chance of enrolling in the insurance and a $6,000 cost. The total contribution of this type of member to the weighted average cost is 0.20 × $6,000 = $1,200. In contrast, a suburban female who is 40-60 years old has a 10% chance of enrolling in the insurance package and a cost of $1,000 per year, so her contribution to the weighted average is 0.10 × $1,000 = $100. In this manner, the cost is weighted by the likelihood that the member will enroll.
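A minimal Python sketch of the Table 2 calculation is shown below; the list of (probability, cost) pairs is taken directly from the table.

```python
# (probability of joining, cost) for the 8 member types in Table 2.
member_types = [
    (0.20, 10_000), (0.20, 10_000), (0.10, 8_000), (0.10, 7_000),
    (0.20, 6_000), (0.05, 5_000), (0.05, 2_000), (0.10, 1_000),
]

simple_average = sum(cost for _, cost in member_types) / len(member_types)
expected_cost = sum(p * cost for p, cost in member_types)   # expected value

print(simple_average)   # 6125.0  (unweighted average of the 8 costs)
print(expected_cost)    # 7150.0  (cost weighted by the probability of joining)
```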
Standard Deviation

This section describes the concept and measurement of standard deviation. (This description of standard deviation is based on the book OpenIntro Statistics.) A variable, by definition, has values that vary across subjects in a sample. Standard deviation is an estimate of the variation in the values of a variable around that variable's measure of center (or average). Two variables can have the same average, but observations can be spread tightly or widely around the average. In Figure 1, all three plots have the same average but different dispersion. The observations in plot A have the smallest dispersion around the mean.

Figure 1: Dispersion around Average

Dispersion matters a great deal, as we are not telling the complete story if we just report the averages. Consider the average hospitalization costs calculated for two different populations: a group of patients who have been hospitalized, and the members of an insurance plan, only some of whom have been hospitalized. Compared to members of the health plan, patients in the hospital have a tight dispersion around the average hospitalization cost. A portion of the members of the health plan are never hospitalized; thus they will have 0 hospitalization costs. In contrast, all patients in the hospital will have some hospital costs, so there will be no zeros. The dispersion around the average hospitalization cost will be larger for members than for patients.

The mean describes the center of the data, but the variability in the data is also important. In Figure 1, plot A shows a tight spread of data around the mean, plot C shows a large spread, and plot B is in between the two. To measure the spread of data we start with the deviation. Deviation is the difference between the ith observation and the mean: $(X_i - \mu)$. In this formula $\mu$ is the mean of the population, which is often unknown. The mean of the sample, as described earlier, is shown as $\bar{X}$ and can be calculated by summing the observations in the sample and dividing by the count of the observations. You can think of deviation as how far off course we are if we deviate from our main road and end up at the observation point. We can deviate to the right of the mean or to the left, so the difference between the observation and the mean can be positive or negative.

One idea for measuring the spread around the mean is to calculate the sum of the deviations for all points in the population. This is problematic: if we just sum the deviations, some positive deviations will cancel out negative deviations and create the impression that the spread is less than it is. To avoid this, we first square the deviations and then sum them. The measure of spread calculated in this way is called the variance. It is defined as the average of the squared deviations over every point in the population:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2$$

The standard deviation is the square root of the variance and can be calculated from the observations in the sample. The standard deviation is useful when considering how close the data are to the mean of the sample:

$$s = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}$$

To estimate the standard deviation, first calculate the deviation for each point. In Table 3, we have three observations: 5, 2, and 5. The mean is 4. The deviations from the mean are 1, minus 2, and 1. Note that the total sum of the deviations is always 0. Once the deviations have been calculated, you square them. The sum of the squared deviations is 6. Last, you divide the sum by the number of observations minus one and take the square root. The sum is 6 and there are 3 observations, so the division yields 3, and the square root of 3 is 1.7. This value is referred to as the standard deviation.

Table 3: Steps in Calculating Standard Deviations

Observations   Mean   Deviation   Squared Deviation
     5           4        1               1
     2           4       -2               4
     5           4        1               1
                      Total = 0            6

Number of Observations = 3
Standard Deviation = $\sqrt{6/(3-1)}$ = 1.7

Standard deviation is a measure of the spread of data around the mean. If the spread is small (plot A in Figure 1), then the standard deviation will be small. If the spread is large (plot C in Figure 1), then the standard deviation will be large.
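The calculation in Table 3 can be reproduced with a few lines of Python; the statistics module's stdev function uses the same n − 1 denominator as the sample formula above.

```python
import math
import statistics

observations = [5, 2, 5]                                # the three observations from Table 3
mean = sum(observations) / len(observations)            # 4.0
deviations = [x - mean for x in observations]           # [1.0, -2.0, 1.0], sums to 0
sum_squared = sum(d ** 2 for d in deviations)           # 6.0

sample_sd = math.sqrt(sum_squared / (len(observations) - 1))
print(round(sample_sd, 1))                              # 1.7
print(round(statistics.stdev(observations), 1))         # 1.7, same result
```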
Usually about 70% of the data fall within one standard deviation of the mean, and about 95% of the data fall within two standard deviations of the mean. This is not always the case, and the percentage of observations that fall within a certain number of standard deviations depends on the shape of the distribution. It should be noted that the standard deviation has the same unit of measurement as the variable X. It should also be clear that this statistic can only be calculated for interval and ratio variables.

When values of a variable X are grouped into a frequency distribution with frequencies $f_i$, the expression for the sample standard deviation becomes:

$$S = \sqrt{\frac{\sum_{i=1}^{n} f_i (X_i - \bar{X})^2}{\sum_{i=1}^{n} f_i}}$$

When the values in the sample are not weighted identically, the expression for the sample standard deviation becomes:

$$S = \sqrt{\frac{\sum_{i=1}^{n} w_i (X_i - \bar{X})^2}{\sum_{i=1}^{n} w_i}}$$

Transformation of Variables

In statistics, one often has to transform a variable so that it follows the assumptions made in the analytical phase. For example, when dealing with cost, the data are often not distributed in a fashion that can be directly used in the analysis: some members of an insurance plan have 0 cost. To have a smooth distribution of cost, analysts often transform cost data. In any transformation, the average and standard deviation of the variable change. Linear transformations of variables X and Y affect the mean and variance of their linear combination. The expected value of the sum (or difference) of linearly transformed variables X and Y is given by:

$$\mu_{(a+bX) \pm (c+dY)} = \mu_{a+bX} \pm \mu_{c+dY} = (a + b\mu_X) \pm (c + d\mu_Y)$$

The variance of the sum (or difference) of linearly transformed variables X and Y that are not independent is given by:

$$\sigma^2_{(a+bX) \pm (c+dY)} = b^2\sigma^2_X + d^2\sigma^2_Y \pm 2bd\,\sigma_{XY}$$

When X and Y are independent, the expression for the variance of the sum (or difference) of their linear transformations is:

$$\sigma^2_{(a+bX) \pm (c+dY)} = b^2\sigma^2_X + d^2\sigma^2_Y$$

Example 1: The probability distribution of random variables X and Y is provided in Table 4. Calculate the expected value and variance of the difference between their linear transformations, 2 + 3X and 1 + Y.

Table 4: Probability of Two Random Variables

  X      Y     p(X, Y)
  0     -1       0.2
  1      0       0.3
  2      1       0.5
Total            1

First, we calculate the expected value, variance, and covariance of X and Y:

$$\mu_X = 0.2(0) + 0.3(1) + 0.5(2) = 1.3$$
$$\mu_Y = 0.2(-1) + 0.3(0) + 0.5(1) = 0.3$$
$$\sigma^2_X = 0.2(0 - 1.3)^2 + 0.3(1 - 1.3)^2 + 0.5(2 - 1.3)^2 = 0.61$$
$$\sigma^2_Y = 0.2(-1 - 0.3)^2 + 0.3(0 - 0.3)^2 + 0.5(1 - 0.3)^2 = 0.61$$
$$\sigma_{XY} = 0.2(0 - 1.3)(-1 - 0.3) + 0.3(1 - 1.3)(0 - 0.3) + 0.5(2 - 1.3)(1 - 0.3) = 0.61$$

Next, we calculate the mean and variance of the difference between the linear transformations of X and Y:

$$\mu_{(2+3X)-(1+Y)} = (2 + 3\mu_X) - (1 + \mu_Y) = (2 + 3 \times 1.3) - (1 + 0.3) = 4.6$$
$$\sigma^2_{(2+3X)-(1+Y)} = b^2\sigma^2_X + d^2\sigma^2_Y - 2bd\,\sigma_{XY} = 3^2(0.61) + 1^2(0.61) - 2(3)(1)(0.61) = 2.44$$

Transformation of Variables using Excel: The computation of the expected value and variance of the difference between the linear combinations of X and Y is shown in Figure 2 below using Excel. Figure 3 shows all the embedded formulas.

Figure 2: Excel computation of the expected value and variance of the difference

Figure 3: Embedded Excel formulas for the computation
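The Example 1 calculation can also be checked with a short Python sketch; the joint distribution below is taken from Table 4, and the variable names are illustrative.

```python
# Joint distribution from Table 4: (x, y, p(X, Y)).
joint = [(0, -1, 0.2), (1, 0, 0.3), (2, 1, 0.5)]

mean_x = sum(p * x for x, y, p in joint)                      # 1.3
mean_y = sum(p * y for x, y, p in joint)                      # 0.3
var_x = sum(p * (x - mean_x) ** 2 for x, y, p in joint)       # 0.61
var_y = sum(p * (y - mean_y) ** 2 for x, y, p in joint)       # 0.61
cov_xy = sum(p * (x - mean_x) * (y - mean_y) for x, y, p in joint)  # 0.61

# Difference of linear transformations (2 + 3X) - (1 + Y), so b = 3 and d = 1.
b, d = 3, 1
mean_diff = (2 + b * mean_x) - (1 + d * mean_y)               # 4.6
var_diff = b**2 * var_x + d**2 * var_y - 2 * b * d * cov_xy   # 2.44

print(round(mean_diff, 2), round(var_diff, 2))
```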
Probability

Probability is a quantitative measure of uncertainty. When we are sure that an event will occur, we say that it has a probability of 1. When we are sure that an event will not occur, we assign it a probability of zero. When we are completely in the dark, we give it a probability of 0.5, or note that it has a 50% chance of occurrence. Values between 0 and 1 measure how uncertain we are about the occurrence of an event. More specifically, the probability of an event is the likelihood or chance that the event will occur.

There are three basic types of probability:

• Theoretical probability
• Empirical probability
• Subjective or personal probability

Theoretical probabilities are rooted in theory and can be calculated without actually observing data. This is possible if the process responsible for generating an event is known. Theoretical probabilities are typically associated with games of chance, such as a coin toss, the roll of a die, or drawing a card from a standard deck of 52 playing cards, and involve mutually exclusive and equally likely events. Events are mutually exclusive if only one of all possible events can occur at any given point in time. Events are equally likely if the probability of occurrence of each event is the same. If there are a total of n mutually exclusive and equally likely possible events, and m of these are event A, then the probability of event A, denoted by p(A), is given by:

$$p(A) = \frac{m}{n}$$

Example 2: Three hospitals are competing for an HMO contract. What is the chance that one of them will be awarded the contract? In this scenario the total number of possible events is 3: any of the 3 participants can be awarded the contract, and 1 must be awarded it. Not knowing anything else about each hospital's bid, we can assume that they have equal chances of getting the award; therefore the probability is:

$$p(\text{Award}) = \frac{m}{n} = \frac{1}{3} = 0.33 = 33\%$$

Subjective probabilities are based on personal experience and consequently tend to vary across individuals. One can assess the subjective probability that a group of experts assigns to an event and use this consensus to plan the future expansion of a service. Subjective probabilities can be used when theoretical probabilities are not applicable and there is insufficient observed data to form empirical probabilities.

Empirical probabilities are rooted in observed data. Typically, frequency distributions and contingency tables are used to organize observed data, which in turn are used to obtain probability estimates. Thus, empirical probabilities are essentially relative frequencies associated with the occurrence of one or more events. For an event A of interest, the probability of A, p(A), is given by:

$$p(A) = \frac{f}{n}$$

where f is the frequency with which outcome A occurs in the observed data, and n is the sum of the frequencies corresponding to all possible outcomes.

Example 3: An administrator observed the data in Table 5, related to newly admitted patients over a period of 12 months, and wants to know the probability that the next patient will have a cost overrun.

Table 5: Cost Overruns

Cost overrun   Frequency
Over              399
Under             665
Total            1064

In this example, past data suggest that 399 patients out of a total of 1,064 had a cost overrun. The corresponding relative frequency is 399/1064 = 0.375. Thus, there is a 37.5% chance that the next patient will have a cost overrun.

All probabilities, whether subjective, theoretical, or empirical, follow three simple rules, and any set of scores that follows these rules is a probability function (the sketch after this list illustrates the rules with the data from Table 5):

1. The probability of an event is a positive number between 0 and 1.
2. One of the events certainly will happen, so the sum of the probabilities of all events is 1.
3. The probability of either of two mutually exclusive events occurring equals the sum of the probability of each occurring.
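Here is a minimal Python sketch of the empirical probability from Table 5 and a check of the three rules; the dictionary of frequencies mirrors the table.

```python
# Frequencies from Table 5: cost overruns among newly admitted patients.
frequencies = {"over": 399, "under": 665}
n = sum(frequencies.values())                       # 1064

# Empirical probability of each outcome: relative frequency f / n.
probabilities = {k: f / n for k, f in frequencies.items()}
print(round(probabilities["over"], 3))              # 0.375

# Rule 1: each probability lies between 0 and 1.
assert all(0 <= p <= 1 for p in probabilities.values())
# Rule 2: the probabilities of all possible outcomes sum to 1.
assert abs(sum(probabilities.values()) - 1) < 1e-9
# Rule 3: for mutually exclusive events, p(A or B) = p(A) + p(B).
print(probabilities["over"] + probabilities["under"])    # 1.0
```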
A probability function can be defined by these rules, but such rules provide little intuition about what a probability function is. One way to think of probability is as the frequency of observing an event among all possible events. This is called the frequentist definition of probability. There are also other ways of defining probabilities, e.g., the subjective definition, where probability is the strength of one's belief that an event will occur. Savage (1954) and DeFinetti (1964) argued that the rules of probabilities can work with uncertainties expressed as strength of opinion. A frequentist definition is not possible for events that have not occurred or do not have repeated observations. In this book we rely on the frequentist definition of probability: the prevalence of the event among the possible events. The probability of a small business failing is then the number of business failures divided by the number of small businesses. The probability of an iatrogenic infection (one caused by a mistake made in a healthcare setting) in the last month in our hospital is the number of patients who had an iatrogenic infection in our hospital last month divided by the number of patients in our hospital during the last month. If one can count the events of interest, then the probability of the event can be calculated. Even for rare adverse events, the probability of the event can be calculated. The daily probability of wrong-side surgery in our facility is then the number of days in which a wrong-side surgery was reported divided by the number of days in which we had surgeries.

Graphically, we can show a probability by defining the square to be proportional to the number of possible events and the circle as all the ways in which event A occurs; the ratio of the circle to the square is then the probability of A (see Figure 4).

Figure 4: A Graphical Representation of Probability

Probability Calculus

Notice that for any event of interest A, the probability of occurrence of A is the complement of the probability that event A will not occur:

$$p(A) = 1 - p(A')$$
$$p(A') = 1 - p(A)$$

where p(A′) is the probability that event A will not occur. The probability of either of two mutually exclusive and exhaustive events can be calculated from the same set of relationships, where A and A′ are replaced with the two mutually exclusive and exhaustive events.

There are two basic rules related to probability: the multiplication rule and the addition rule. The multiplication rule can be used to calculate the joint probability of two or more events. The exact probability formula depends on whether or not the events are independent. Events are independent when the probability of occurrence of one event does not affect the probability of occurrence of the other event. On the other hand, events are not independent, or their probabilities are conditional, when the probability of occurrence of one event does affect the probability of occurrence of the other event. For any two events A and B that are independent, the multiplication rule is:

$$p(A \cap B) = p(A)\,p(B)$$

where ∩ is the intersection operator and p(A ∩ B) is the probability that both A and B occur. For two events A and B that are not independent, the multiplication rule takes the following form:

$$p(A \cap B) = p(A)\,p(B|A)$$

where p(B|A) is the conditional probability of the occurrence of event B given that event A has occurred. If B occurred first, then the multiplication rule takes the following form:

$$p(A \cap B) = p(B)\,p(A|B)$$
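The multiplication rule can be illustrated with a short Python sketch; the event probabilities below are made-up numbers for illustration, not data from the chapter.

```python
# Illustrative probabilities (not from the chapter's data).
p_a = 0.30          # p(A): patient is diabetic
p_b_given_a = 0.50  # p(B|A): hypertension given diabetes
p_b = 0.25          # p(B): hypertension overall

# Multiplication rule for dependent events: p(A and B) = p(A) * p(B|A).
p_a_and_b = p_a * p_b_given_a
print(p_a_and_b)                     # 0.15

# If A and B were independent, the joint probability would instead be:
print(p_a * p_b)                     # 0.075

# Events are independent exactly when p(B|A) equals p(B).
print(p_b_given_a == p_b)            # False, so A and B are not independent here
```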
The addition rule of probability can be used to calculate the probability that one or another event occurs. The probability of one of two events A and B occurring is calculated by first summing all the possible ways in which event A will occur plus all the ways in which event B will occur, minus all the possible ways in which both events A and B will occur (this term is subtracted because it is double counted in the previous sums). This sum is divided by all the possible ways that any event can occur. This is represented in mathematical terms as:

$$p(A \text{ or } B) = p(A) + p(B) - p(A \text{ and } B)$$

Graphically, the concept can be shown as the yellow and red areas in Figure 5 divided by the blue area.

Figure 5: Graphical Representation of the Probability of A or B

Similarly, the probability that A and B co-occur corresponds to the overlap between A and B and can be shown graphically as the red area divided by all possible events (the rectangular blue area) in Figure 6.

Figure 6: Graphical Representation of the Probability of A and B

Repeated application of the definition of probability gives us a simple calculus for combining the uncertainty of two or more events. We can now ask questions such as: "What is the probability that frail elderly (age > 75) or infants will join our HMO?" According to our formulas this can be calculated as:

$$p(\text{Frail Elderly or Infants}) = p(\text{Frail Elderly}) + p(\text{Infants}) - p(\text{Frail Elderly and Infants})$$

Since the chance of being both frail elderly and an infant is zero (i.e., the two events are mutually exclusive), we can re-write the formula as:

$$p(\text{Frail Elderly or Infants}) = p(\text{Frail Elderly}) + p(\text{Infants})$$

Distribution

A distribution gives the probability of observing various levels of a variable. In future chapters, we introduce many different distributions, including the Normal, Binomial, Poisson, and Uniform distributions. These are distributions where the shape of the distribution is assumed to follow a formula, and a statistical parameter is used to fit the data to the formula. These distributions are called parametric distributions. In addition to parametric distributions, one can also examine the observed distribution of data, typically through a histogram.

A histogram for discrete variables can be created by counting the number of times each level of the variable occurs. The frequency associated with a specific value of a discrete variable is the total number of times that value is observed in the data. Frequencies can be converted into relative frequencies by dividing each individual frequency by the total number of observations. These relative frequencies are often multiplied by 100 in order to transform them into percentages. Visual tools for summarizing discrete variables include the bar chart, the pie chart, and, when the number of observations is large, the histogram. For illustration, we have provided the frequency distribution of the age of employees working at a large hospital located in a metropolitan area in the following table. Notice that we have also added relative frequency and cumulative frequency columns to this table. The pie and bar charts corresponding to the frequency distribution of age are shown in the two figures following the table. In addition to the frequency distribution, common measures of center such as the mean, median, and mode, and measures of dispersion such as the range and standard deviation, can be calculated in order to summarize a numeric variable that has discrete values.

Example 4: Table 6 provides the frequency of observing different values of age. Relative frequency is calculated by dividing each frequency by the total and has the same meaning as probability.
Cumulative frequency shows the number of employees falling in each age level and all lower age levels.

Table 6: Distribution of Age

Age            Frequency   Relative Frequency   Cumulative Frequency
Less than 21       41            0.01                   41
21 – 25           183            0.05                  224
26 – 30           333            0.10                  557
31 – 35           470            0.14                 1027
36 – 40           521            0.15                 1548
41 – 45           653            0.19                 2201
46 – 50           482            0.14                 2683
51 – 55           360            0.11                 3043
56 – 60           151            0.04                 3194
61 – 65           122            0.04                 3316
More than 65       52            0.02                 3368
Total            3368            1

Histogram

One popular method to display discrete data is to construct a histogram. Histograms can also display continuous variables after they are broken into discrete ranges. A histogram can be conceptually thought of as a bar chart in which we count the number of times the variable X falls within discrete ranges. Since adjacent class intervals do not have any gaps between them, the corresponding bars are also constructed without any gaps in between. The optimal class interval size is often based on a mathematical formula, such as the one proposed by Sturges (1926):

$$C = \frac{X_{Max} - X_{Min}}{1 + 3.322 \log n}$$

where C is the optimal class interval size, which is a function of the maximum value of X ($X_{Max}$), the minimum value of X ($X_{Min}$), and the total number of observations, n.

Example 5: Table 7 lists the household size of 25 patients currently residing in a hospital ward. Construct a histogram of these values.

Table 7: Household Size

1   2   3   2   4
5   4   3   4   3
5   3   1   2   4
3   3   8   2   7
2   1   2   4   3

In order to construct the histogram, the values of X should first be organized into a frequency distribution. The optimal class interval size can be obtained by applying Sturges' formula:

$$C = \frac{8 - 1}{1 + 3.322 \log 25} = 1.2$$

Note that in order to ensure continuity across adjacent class intervals, the optimal class interval size should be calculated to one more digit after the decimal point than the original precision of X. In our example, since the X values are positive integers, the class interval size has been rounded to one digit after the decimal point. The next step is to construct a frequency distribution of X based on the optimal class interval size, C. The lower limit of the lowest class interval can be set equal to the minimum value of X. This frequency distribution is shown in Table 8.

Table 8: Frequency Distribution for Household Size

Class interval   Frequency (f)
1.0 – 2.2              9
2.2 – 3.4              7
3.4 – 4.6              5
4.6 – 5.8              2
5.8 – 7.0              0
7.0 – 8.2              2
Total                 25

The final step is to construct a bar chart of the frequency distribution of the class intervals of X without any empty space between adjacent bars. The resulting graph is the histogram of X. Such graphs can be easily constructed in common spreadsheet programs such as Excel.

How to Make a Histogram in Excel

Inside Excel there are two ways to make a histogram. One way is manual: the intermediate steps are carried out by hand, which gives an intuition for how histograms are constructed. The other way is faster and automated. We first discuss the manual approach and then show the automated steps for constructing a histogram. In order to create a histogram in Excel, the following steps need to be performed:

1. Enter the values of variable X into a blank worksheet in Excel. These values do not need to be in a single column.
2. Compute the optimal class interval size, C.
3. Construct a frequency distribution of X based on C.
4. Calculate the frequency corresponding to each individual class interval.
5. Graph the frequency distribution with class intervals on the X-axis and corresponding frequencies (f) on the Y-axis.

Figure 7 shows the result of these steps in Excel.

Figure 7: Results of Steps in Preparing a Histogram
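For readers who prefer code to spreadsheets, the same frequency distribution can be built in a few lines of Python. This is a sketch of the manual steps above, using the Table 7 household sizes; it prints the bin counts rather than drawing the chart.

```python
import math

# Household sizes of the 25 patients in Table 7.
x = [1, 2, 3, 2, 4, 5, 4, 3, 4, 3, 5, 3, 1, 2, 4, 3, 3, 8, 2, 7, 2, 1, 2, 4, 3]

# Sturges' formula for the class interval size, rounded to one decimal place.
c = (max(x) - min(x)) / (1 + 3.322 * math.log10(len(x)))
c = round(c, 1)                                        # 1.2

# Build class intervals starting at the minimum value and count observations.
lower = min(x)
while lower <= max(x):
    upper = lower + c
    count = sum(1 for v in x if lower <= v < upper)
    print(f"{lower:.1f} - {upper:.1f}: {count}")
    lower = round(upper, 1)
```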
It should be noted that the frequency distribution shown in Figure 7 was constructed with the aid of several embedded formulas in Excel. These embedded formulas automate the construction of the frequency distribution and minimize the chance of mistakes. Figure 8 shows all the embedded formulas used in the histogram construction, along with a brief description of each formula.

Figure 8: Excel Formulas for Calculating the Frequency Distribution

There is an easier way to construct a histogram, where all of the steps are automated. One can use the Histogram tool of the Analysis ToolPak in Microsoft Office Excel. Before you can use the Histogram tool, make sure that the Analysis ToolPak Add-in is installed. You can do so in three steps:

1. On the File menu, click Options.
2. In the Manage list, select Excel Add-ins, and then click Go.
3. In the Add-Ins dialog box, make sure that the Analysis ToolPak check box under Add-Ins available is selected, and then click OK. If you do not see Analysis ToolPak in the Add-Ins dialog box, re-run Microsoft Excel Setup and add this component to the list of installed items.

Microsoft provides detailed instructions on how to install the ToolPak on the web. Once the ToolPak is installed, follow these steps to create a histogram:

1. On Microsoft Excel's Data tab, click Data Analysis in the Analysis group.
2. In the Data Analysis dialog box, click Histogram, and then click OK.
3. Enter the input and output ranges in the Histogram dialog box. If you want to set the bin range, do so; otherwise Excel will select an appropriate bin range. Complete the output options by selecting an area where the output will be displayed, and select a chart output.

In these procedures, you do not need to estimate the size of the bins or enter formulas to count the number of times a value falls inside a bin. The Histogram tool does all these steps automatically.

Transformation of Distributions

The log transformation is a widely used method to address skewed distributions. Skewed data are seen in non-symmetric distributions, where more of the data occur below or above the average. Cost data are a good example, as individuals with no cost skew the distribution; in these situations cost data are transformed using the log function. In Figure 9, the distribution of the original data is shown on the left side. Notice that more data fall below the average than above it. On the right side, we see the log transformation of the same data. Now the data are nearly linear. Data transformations must be applied cautiously, making sure that the findings on the transformed data are relevant for the original data.

Figure 9: Effect of Log Transformation on Skewness (Taken with permission from "Log-transformation and its implications for data analysis," Shanghai Arch Psychiatry. 2014 Apr; 26(2): 105–109.)
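Below is a minimal Python sketch of a log transformation of skewed cost data. The cost values are made up, and log(1 + x) is used so that zero costs remain defined, which is one common variant of the transformation described above.

```python
import math

# Hypothetical annual costs: many small values, a few very large ones (right-skewed).
costs = [0, 0, 120, 150, 200, 250, 400, 800, 5_000, 60_000]

# log(1 + x) keeps zero-cost members defined while compressing the long right tail.
log_costs = [math.log1p(c) for c in costs]

mean_raw = sum(costs) / len(costs)
mean_log = sum(log_costs) / len(log_costs)
print(round(mean_raw, 1))    # dominated by the single $60,000 member
print(round(mean_log, 2))    # much less influenced by the extreme value
```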
Observing Variables over Time

Managers are often interested in knowing whether their units are improving or getting worse. A common method of examining a variable is to look at its values over time. These types of data are analyzed using a statistical control chart. The purpose of a control chart is to discipline intuitions. Most people read too much into their successes and attribute their failures to external events. For example, Figure 10 shows a control chart. There is considerable variation among patient experiences, some positive and some negative. Many providers and managers read too much into an occasional complaint and fail to reserve judgment until they see patterns among patient experiences. A control chart can help us understand whether there is a pattern among patient experiences.

In a control chart, one monitors progress over time. A plot is created where the X-axis is time since the start and the Y-axis is the outcome one is monitoring. To decide whether outcomes are different from historical patterns, the upper control limit (UCL) and lower control limit (LCL) are calculated. These limits are organized in such a way that if the historical pattern has continued, then 99% of the time the data will fall within these limits. The upper and lower control limits are calculated using mathematical formulas that are specific to the type of outcome one is monitoring. In later chapters we describe the various assumptions made and which type of control chart is best for different data.

Figure 10: Structure of a Typical Control Chart

A control chart is useful in many different ways. Points outside the limits are unusual and mark a departure from historical patterns. If tracking patient experiences, a point outside the limits indicates an experience different from historical patterns. Two points in Figure 10 fall below the LCL and therefore mark a departure from historical patterns. All other points do not indicate any real change, even though there is a lot of variation. These small fluctuations are random and not different from the historical patterns. If the data fall within the control limits, then despite day-to-day variations, there has not been any change. If the process is producing the results you need, then you want your data to fall within the limits.

Minimum Observations for Control Charts

The more data you have, the more precision you have in constructing the upper and lower control limits. Not all data are used for the calculation of control limits. Often, the limits are based on the pre-intervention period; subsequent post-intervention observations are then compared to the pre-intervention limits. When a manager makes a change in the process, patient experiences may be affected by the change. In these circumstances, the limits are set based on the pre-intervention data, and post-intervention findings are compared to limits calculated from the pre-intervention period. If any points fall outside the limits, then one can conclude that the intervention has changed the pattern of patient experiences. See Figure 11 for an example of limits set based on pre-intervention periods. In this figure, the solid red lines indicate the calculated upper and lower control limits, and the dashed lines show the extension of these limits to the post-intervention period.

Figure 11: An Example of Limits Set Based on Pre-intervention Periods

Compare the chart in Figure 11 with the chart in Figure 10. Both are based on the same data, but in Figure 11 the limits are based on the first 7 days, before the intervention. Figure 11 shows that the post-intervention data are lower than the LCL and therefore a significant change has occurred. When Figure 11 is compared to Figure 10, we see that more points are outside the limits in Figure 11. By setting the limits to pre-intervention patterns, we were able to detect the improvements since the intervention more accurately. The length of data used in the construction of the control limits depends on the timing of the intervention and changes in the underlying process. Use at least 7 data points before the start of the intervention to set the control limits.
Of course, one should use more data points to get a more stable picture of the process. As one uses more data points, one is going back further in time. The more distant the data, the less relevant they are to the current situation. There is a practical limit to how far back one should go. Taking data from years ago may make the analysis less accurate if the underlying process has changed since then and the older data are no longer relevant.
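To close, here is a minimal Python sketch of the control-limit logic described above. It sets limits from seven pre-intervention observations at the mean plus or minus three standard deviations (a common choice that covers roughly 99% of the data for many charts) and then flags post-intervention points outside those limits. The daily satisfaction scores are made-up numbers, and real control charts use formulas specific to the type of outcome, as the chapter notes.

```python
import statistics

# Hypothetical daily satisfaction scores (0-100): 7 pre-intervention days, then 7 post.
pre_intervention = [72, 68, 75, 70, 69, 74, 71]
post_intervention = [78, 80, 79, 83, 81, 84, 82]

# Set the limits from the pre-intervention period only.
center = statistics.mean(pre_intervention)
spread = statistics.stdev(pre_intervention)
ucl = center + 3 * spread      # upper control limit
lcl = center - 3 * spread      # lower control limit

# Flag post-intervention days that fall outside the historical limits.
for day, score in enumerate(post_intervention, start=1):
    flag = "outside limits" if score > ucl or score < lcl else "within limits"
    print(f"Post day {day}: {score} ({flag})")
```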