Using Statistical Data to Make Decisions
Module 1: Fundamentals of Data Analysis
Dr. Mugdim Pašić
Dr. Tom Ilvento
University of Delaware
Sarajevo Graduate School of Business
Statistics are an important tool for many fields: business, the physical sciences, economics and the
social sciences, engineering and the biological
sciences. They enable us to examine and test important
research questions concerning individual variables and the
relationships among a set of variables. The results, if used
properly, can help make difficult decisions.
Many students approach statistics with some fear and
trepidation. Common concerns involve anxiety over math
skills and a feeling of distrust for the relevance of statistics.
In terms of the former concern, modern desktop and laptop
computers have made most of the calculations of statistics
easy and painless. These tools enable us to focus on the
more important aspects of good data analysis practice and
interpreting results. As for the latter concern of relevance, I
believe we see examples of the importance of statistics each
and every day. For example:
How do they know how much snow has fallen given
snow is so difficult to measure? They actually have a
standard protocol and take an average of several
measurements.
Can a business make decisions about the future by
analyzing data from the past? Yes! Prediction of the
future, though always with some uncertainty, is a core use of
statistics by businesses. Making informed decisions with
data is a value-added opportunity for businesses.
Can we ever get a good measurement of crowd size at
war or policy protests? I don’t have an answer for this one,
but estimates vary wildly!
Can a sales team make decisions on new products from
a sample of consumers? Yes, we can make estimates.
Marketing research is a big user of statistics to help make
decisions on price, product attributes, and the formation of
new products.
How do drug trials lead to the acceptance of a new drug?
Statistics are a big part of this process! Ultimate acceptance
of a new drug for sale requires an elaborate experimental
trial which is analyzed by statistical models. Millions of
dollars ride on the outcome of the statistical analysis.
Key Objectives
• Understand the difference between the descriptive and inferential aspects of statistics
• Understand the concept of a random sample, measurement, and levels of measurement
• Understand the use of basic summary statistics of measures of central tendency and measures of dispersion

In this Module We Will Be:
• Describing data using summary measures of Central Tendency and Dispersion
• Looking at graphical displays of data: box plots, stem and leaf, and time series smoothing
• Transforming data with logs, inversions, trimming and dealing with outliers
For more information, contact:
Tom Ilvento
213 Townsend Hall, Newark, DE 19717
302-831-6773
[email protected]
The focus of this course is on understanding the basics of
statistics. I would like you to gain an appreciation for how
descriptive and inferential statistics are used in your business
or field; how to analyze a set of data; how to present the data
and make meaningful and coherent conclusions to others;
and how to critique the use of statistics by others.
WHAT ARE STATISTICS?
There are many conceptions of statistics and what the term
means. Statistics can refer to the data itself, as in “the
government released the latest statistics on unemployment,”
to a field of study in mathematics, and to a set of tools used
by many disciplines to analyze data.
In its broadest sense, statistics is the science of data. It
refers to:
• Collecting data
• Classifying, summarizing, and organizing data
• Analysis of data
• Interpretation of data
Descriptive versus Inferential Statistics. We make a
distinction between two main approaches in the use of
statistics for data analysis, descriptive versus inferential
statistics. Both approaches are related to each other, but the
distinction between them is important to note.
Descriptive statistics uses measures and graphs to
summarize the data with an emphasis on parsimony. Our
strategy is to find summary measures which describe the
data adequately and succinctly, be they a percentage,
average, or a standard deviation. Descriptive statistics also
involves describing the relationships between variables or
sets of variables through the use of very sophisticated
techniques, such as correlation, regression, factor analysis,
logistic regression and probit analysis.
Inferential statistics involves many of the same techniques
used in descriptive statistics, but takes it a step further. Now
we use these techniques to make estimates, decisions,
predictions, or generalizations about a population from a
smaller subset of data called a sample. The sample can be
a subset of a population at a point in time or a sample of the
population in time or space.
Descriptive statistics uses measures and graphs to summarize the data with an emphasis on parsimony.

Inferential statistics uses some of the same techniques to make estimates, decisions, predictions, or generalizations about a population from a smaller subset or sample.
Inferential statistics is a powerful tool for research. It
enables us to make statements about a large group from a
much smaller sample. Thus, we can survey a sample of
1,000 people and make good generalizations about 280
million people in the U.S.
Sampling. A census is when we collect data on all elements
in a population. Sometimes it is difficult or impossible to get
information on the entire population. An alternative is to take
a sample of the population. A sample is a subset of the units
or elements of a population. Sampling saves time, money,
and other resources (computation time). In some cases, it
may actually be impossible to collect information on every
element of the population and sampling becomes a
reasonable alternative.
A valuable property of a sample is that it is representative of
the population.
By this we mean that the sample
characteristics resemble those possessed by the population.
Inferential statistics require a sample to be representative of
the population, and that can be done when the sample is
drawn through a random process. A random sample is one in
which each element or unit has the same chance of being selected.
Classic statistical inference requires that the sample be
selected through a random process.
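As a rough illustration (added here, not part of the original text), the following Python sketch draws a simple random sample from a hypothetical population list, so that each element has the same chance of being selected:

```python
import random

# Hypothetical population: ID numbers for 10,000 customers (illustrative only)
population = list(range(1, 10_001))

random.seed(42)  # fixed seed so the example is reproducible
sample = random.sample(population, k=1000)  # each element has an equal chance of selection

print(len(sample), sample[:5])
```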
Measurement. Measurement is the process of assigning a
number to variables of the individual elements of the
population (or sample). Measurement is a bigger issue than
many think.
Some measurement seems relatively
straightforward - distance, weight, dollars spent. However,
measurement always comes with some error and perhaps
even bias.
With measurement we must also deal with issues of validity
(are we measuring what we think we are measuring) and
reliability (is the measuring device consistent). A user of data
is responsible for asking questions and in some cases doing
preliminary analysis to determine if the measurement is valid
and consistent. The process of measurement is often
complex – don’t take it for granted.
Levels of Measurement. There are various ways to
characterize measurement of variables. An easy dichotomy
in measurement is qualitative versus quantitative data.
Qualitative data do not follow a natural numerical scale and
thus are classified into categories such as male or female;
customers versus noncustomers; and race (white, African
American, Asian, and so forth). Quantitative data use
measures that are recorded on a naturally occurring scale,
such as age, income, or time.
Only by taking a random sample can we have confidence when we make an inference from a sample to the population.
Measurement is the process of
assigning a number to variables
of the individual elements of the
population (or sample).
The process of measurement is
often complex – don’t take it for
granted.
A more elaborate description involves three levels of
measurement - nominal, ordinal, and continuous. Nominal
(or categorical) measures have no implied order or
superiority and can be thought of as qualitative. A middle
ground is ordinal data, where there is an implied order or
rank, but the distance between units is not well specified.
Rankings, opinion questions that use ordered categories
such as strongly agree to strongly disagree, and variables
that use an ordered scale from one to ten are examples of
ordinal data. Continuous data are the same as quantitative
data.
Qualitative data do not follow a
natural numerical scale, while
quantitative data use measures
that are recorded on a naturally
occurring scale.
Levels of measurement are not a trivial matter in the use of statistics.
Many statistical techniques are predicated on certain levels
of measurement of the variables involved. Some techniques
or formulas assume a certain level is used and misusing a
statistical technique can lead to results that are biased or
misleading.
GRAPHING DATA
Excel and other data management software allow us
numerous ways to graph and display data. Along with
summary statistics, graphs help “tell the story” of the data
and lead to insight or explanation. Like all statistics, the
validity of a graph depends upon the user. It is easy to
distort a graph by manipulating the scale, collapsing data, or
choosing misleading ways to represent the data. For
example, even a small change in a measurement over time
can look large if you adjust the scale on the axis.
A good strategy for all graphing of data is to provide sufficient
information and context to let the reader judge for him or
herself. A good protocol to follow is to:
• Give a caption or title describing the graph
• Identify the source of the data
• Label the axes, bars, or pie slices
• Give an indication of the measurement level (e.g., in $ or $1,000s)
• Identify the scale of the axes, including the starting point
• Provide a context for the graph in the narrative of the report
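As an added illustration (not from the original module, and with made-up data), the sketch below applies several points of this protocol in Python with matplotlib: a title, labeled axes with the measurement level, an explicit axis starting point, and a source note:

```python
import matplotlib.pyplot as plt

# Hypothetical quarterly sales figures, in $1,000s (illustrative data only)
quarters = ["Q1", "Q2", "Q3", "Q4"]
sales = [420, 455, 610, 580]

fig, ax = plt.subplots()
ax.bar(quarters, sales)
ax.set_title("Quarterly Sales, 2023 (hypothetical data)")      # caption or title
ax.set_xlabel("Quarter")                                       # label the axes
ax.set_ylabel("Sales ($1,000s)")                               # indicate the measurement level
ax.set_ylim(0, 700)                                            # make the axis starting point explicit
fig.text(0.01, 0.01, "Source: illustrative data", fontsize=8)  # identify the source
plt.show()
```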
Graphs help “tell the story” of
the data and lead to insight or
explanation.
But, the validity of a graph or
chart depends upon the care in
setting it up correctly.
Graphs of Qualitative Data. Pie charts and bar charts are
the most frequently used graphs of qualitative data. The
graph will depict the frequency or relative frequency of the
categories in the variable. For example, a pie chart can
depict the percentages of customers in different credit card
bins at a period in time. Pie charts have limitations on how
many categories can be represented (more than five begins
to be a problem). Both pie charts and bar charts can present
a categorical variable broken down by a second variable so
that you can compare the distribution of the categories
across the groups.
Graphs of Quantitative Data. There are several useful
graphing techniques to show the distribution of a quantitative
variable. These include a histogram, a box plot, and a stem
and leaf plot. Many of these graphs require some decisions
from the user that may affect the shape of the distribution. A
final graph that we will look at is the scatter plot, which shows
the relationship between two quantitative variables.
Histograms. A histogram is a depiction of a quantitative
variable broken down into categories reflecting the range of
the variable. The histogram bars represent the relative
number (or percentage) of observations in each category. By
taking a continuous variable and breaking it into ordinal
categories we lose some information, but the loss may be
tempered by the potential gain in insight provided by the
graph. However, decisions made by the user, such as the
width and number of the categories, can influence the shape
of the histogram. Care must be taken not to distort the data
with too few or too many categories.
The easiest approach in deciding the width of the category
intervals (also referred to as “bins” in Excel) is to determine
the range of the data (maximum value minus the minimum
value) and divide by the number of categories desired
(minimum of five and a maximum of 15). The number of
categories is constrained by the number of observations in
your variable - the more observations the more categories
possible (the default in Excel is to take the square root of the
number of observations). This approach would provide equal
width intervals, but the frequencies within each interval would
not necessarily be equal. A limitation of this approach is the
intervals may not reflect key thresholds or values important
to decision-making. You may have to tweak the bin ranges
to better reflect your needs.
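A minimal Python sketch of this rule of thumb (added for illustration; the mpg values are randomly generated stand-ins, not the module's data set) takes the square root of the number of observations for the bin count and divides the range by that count:

```python
import math
import random

random.seed(1)
# Hypothetical mpg readings for 100 cars, roughly between 30 and 45 (illustrative only)
mpg = [round(random.uniform(30, 44.9), 1) for _ in range(100)]

n_bins = round(math.sqrt(len(mpg)))       # rule of thumb: square root of n
width = (max(mpg) - min(mpg)) / n_bins    # range divided by the number of bins
print(f"{n_bins} bins, each {width:.2f} mpg wide")

# Count the observations falling in each equal-width interval
edges = [min(mpg) + i * width for i in range(n_bins + 1)]
counts = [sum(lo <= x < hi for x in mpg) for lo, hi in zip(edges[:-1], edges[1:])]
counts[-1] += mpg.count(max(mpg))         # place the maximum value in the last bin
print(counts)
```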
The following is an example of a histogram from Excel. The
data are miles per gallon (mpg) of 100 sub-compact cars.
The default under Excel is 10 intervals (square root of 100).
Histograms provide a good
visual of the distribution of a
variable and can help identify
outliers, multiple modes in the
data, and the skew of the data.
The process of deciding the
number and width of intervals in
histograms can result in a misleading depiction of the
data.
The interval width for the histogram is determined from the
range of the data. The maximum is 44.9 and the minimum is
30. Thus, the interval width calculated by Excel is:
Interval Width = (44.9 - 30) / 10 = 1.49
Excel created the following bin table and the resulting
histogram. The bin values represent the minimum value in
the interval, and the frequencies are the number of
observations up to the next bin value. For example, there
are 9 observations between 34.47 and 35.96 mpg.

Bin      Frequency
30       1
31.49    0
32.98    5
34.47    9
35.96    14
37.45    33
38.94    18
40.43    12
41.92    6
43.41    2

I made one modification to this table. Because of rounding,
there was one extra category which Excel labeled “More.” I
simply added that last value into the final interval, and the
graph below reflects this change. Notice that the graph labels
for the X-axis in Excel are identical to the bin values in
the table. These labels could be cleaned up in Excel if
desired. You are also free to modify the graph in other ways,
including setting the width of the gap between bars.
Figure 1. Excel histogram example of MPG using system defaults for intervals (X-axis: MPG bins; Y-axis: Frequency)

There are many other ways to set the number of categories and the interval width. For example, intervals could be set at particular thresholds, so that the frequencies for each interval are equal, or based on positional measures such as percentiles. There isn't a single right way, but the choice of intervals will influence the shape of the graph and care should be taken.
Excel can help you design histograms as part of the Data Analysis feature under the Tools menu.
Pros of Histograms
• Good visual depiction of the distribution of a variable, showing shape, modes, skew and outliers
• Can be graphed for a small sample size or a large sample size
• Most software programs (including Excel) provide an easy means to construct and display a histogram

Cons of Histograms
• Requires a user decision of the number of intervals and the width of the intervals
• Choices made by a user can distort the graph
Box Plots. Box Plots (also known as Box and Whisker
Plots) are a graphical way to depict the center and spread
of the data based on the median, quartiles, and the
interquartile range. These are called positional measures
of the center and the spread. A box plot provides a
convenient way to look at the spread of a variable and is
especially useful in comparing the spread of a continuous
variable for two or more groups. The calculation and
formation of a box plot is best left to computer programs.
Unfortunately, Excel does not provide a box plot graph, but
many add-in programs for Excel do provide this feature.
Box Plots, also called Box and Whisker Plots, are based on a “Five Number Summary” of statistics based on position, including the median and quartiles.
The box plot is based on a five-number summary - the
minimum, first quartile (Q1), median (Q2), third quartile
(Q3), and the maximum. From Q3 and Q1 we can calculate a
sixth number, the interquartile range (IQR). These numbers
provide the way to formulate the box and the whiskers of the
plot. The table below shows the five numbers for the mpg
data set of 100 sub-compact cars.

Five-Number Summary
Minimum          30.0
First Quartile   35.6
Median           37.0
Third Quartile   38.4
Maximum          44.9
The dimensions of the box represent the IQR, going from
the first quartile (Q1) to the third quartile (Q3). The
box often has a center line which represents the median
value, and occasionally the mean is depicted to show the
difference between the mean and the median. The
whiskers, lines on either side of the box, reflect a distance
of 1.5 IQR to the left of Q1 and to the right of Q3. For
a variable that follows a symmetrical, bell-shaped curve,
most of the values will fall within 1.5 IQR of the first and
third quartiles. Values outside of the whiskers are
considered outliers. Some programs will depict mild
outliers (between 1.5 and 3 IQR from Q1 and Q3) and
extreme outliers (more than 3 IQR from Q1 and Q3).
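Because Excel does not provide a box plot natively (as noted earlier), the positional measures behind one are easy to compute directly. The sketch below (an added illustration with randomly generated stand-in data, not the actual mpg file) finds the five-number summary, the IQR, and the 1.5 IQR whisker fences:

```python
import numpy as np

# Hypothetical mpg values standing in for the 100 sub-compact cars (illustrative only)
rng = np.random.default_rng(7)
mpg = rng.uniform(30.0, 44.9, size=100)

minimum, q1, median, q3, maximum = np.percentile(mpg, [0, 25, 50, 75, 100])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # left whisker limit
upper_fence = q3 + 1.5 * iqr   # right whisker limit

print("Five-number summary:", minimum, q1, median, q3, maximum)
print("IQR:", iqr)
# Values beyond the fences would be drawn as individual outlier points
outliers = mpg[(mpg < lower_fence) | (mpg > upper_fence)]
print("Outliers:", outliers)
```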
Pros of Box Plots
• Good visual depiction of the distribution of a variable, showing shape, modes, skew and outliers
• There is a uniform approach to constructing box plots - no user decisions
• Can be graphed for a small or large sample size, although it may be difficult to show outliers when the data set is large
• Excellent approach for comparing the distribution of a variable across two or more sub-groups

Cons of Box Plots
• Excel cannot construct a Box Plot without an add-in program
Figure 2. Box Plot of MPG of 100 Sub-Compact Cars (horizontal scale: 25 to 50 mpg)
Figure 3. Box Plots of Amount Spent on Catalog Sales by Home Ownership (groups: Amount Spent Total, Renter, Home Owner; vertical scale $0 to $7,000)
Box plots are particularly good at comparing the center
and spread of a variable for two or more groups. The
graph at the top of the page shows a box plot of catalog
sales for audio and video electronic entertainment for
customers who own their home and those who rent. The
Box Plot is constructed by XLSTAT, an Excel add-in.
XLSTAT box plots show mild outliers as open points
and extreme outliers as solid points. XLSTAT allows
several user options, such as how the plots are oriented
and the inclusion of the median and mean values (mean
values are on the top in this graph).

The graph shows the distribution for all customers,
those who rent, and those who own their homes. The
plots show the data are skewed towards several extreme
outliers of customers that have made large purchases of
equipment. The group that owns their home has higher
expenditures, more spread in the data, and more outliers.
Stem and Leaf Plots. Another approach to graphing
continuous data is the Stem and Leaf plot. This approach
tends to work best with small to mid-sized data sets (up to
150 observations). The Stem and Leaf plot uses a clever
approach by using the data itself to make the graph. The
graph below is a depiction of the MPG data of 100
sub-compact cars. For this graph the stems are the whole
numbers and the leaves are a single decimal place.
Figure 4. Stem-and-Leaf Display for MPG
Stem unit: whole number (leaf unit: 0.1 mpg)

30 | 0
31 | 8
32 | 5799
33 | 126899
34 | 024588
35 | 01235667899
36 | 01233445566777888999
37 | 000011122334456677899
38 | 0122345678
39 | 00345789
40 | 0123557
41 | 002
42 | 1
43 |
44 | 9
Pros of Stem and Leaf Plots
• Good visual depiction of the distribution of a variable, showing shape, modes, skew and outliers
• The plot actually uses the data itself to make the graph
• There are fewer user or program decisions when compared to a histogram
The Stem and Leaf plot provides a good graphical picture
of a variable’s distribution, showing the shape, range,
skew, and outliers. In order to construct a Stem and Leaf
plot the user (or a software program) must make some
decisions as well as manipulate the data. Some data do
not lend themselves to a stem and leaf plot, particularly
when the choices for leaves are limited.
The three key steps in constructing a Stem and Leaf plot
are:
1. Sort the data
2. Choose the stems
3. Add the leaves
The stems are the initial digit in the values, such as 1 in
the number 10; 10 in the number 10.6; or 2 in the number
215. It is helpful to look at the sorted data and the range of
the variable to decide the appropriate stems. The stems
can be one, two, or more digits. For example, the stem for
215 could be 2 or 21.
Once the stems are set, the leaves are simply the
remaining digits in the numbers. In most cases it will be
one digit, but it is possible to use more than one digit for
leaves. If you are constructing a stem and leaf plot by
hand, make sure the distance between digits is uniform
and large enough to show the separate observations.
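To make the three steps concrete, here is a rough Python sketch (an added illustration with a short, made-up data set): it sorts the values, takes the whole number as the stem, and uses the first decimal digit as the leaf:

```python
from collections import defaultdict

# A few hypothetical mpg values (illustrative only)
data = [30.0, 31.8, 32.5, 32.7, 32.9, 33.1, 34.0, 35.6, 37.0, 38.4, 44.9]

leaves = defaultdict(list)
for value in sorted(data):                  # step 1: sort the data
    stem = int(value)                       # step 2: the stem is the whole number
    leaf = int(round((value - stem) * 10))  # step 3: the leaf is the first decimal digit
    leaves[stem].append(leaf)

for stem in range(min(leaves), max(leaves) + 1):
    print(f"{stem:3d} | {''.join(str(leaf) for leaf in leaves[stem])}")
```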
Cons of Stem and Leaf Plots
• Excel cannot construct a Stem and Leaf Plot without an add-in program
• Limited to small and medium sized data sets - difficult to produce when the sample size is over 150
• The user (or program) must make some decisions that can influence the shape of the graph
Scatter Plots. When graphing two continuous variables
we often use a scatter plot to show how the variables vary
together. Scatter Plots also provide a useful way to show
how data vary over time. Most spreadsheet programs
provide an easy mechanism to make a scatter plot (also
called XY scatterplot).
In scatter plots we tend to think of one of the variables
as a dependent variable and label it Y. The dependent
variable is the variable you wish to “explain” or understand
by knowing something about the independent variable
(denoted as X). The dependent variable (Y) tends to be on
the vertical axis and the independent variable (X) is on the
horizontal axis of the plot.
Scatter plots provide a visual representation of the
relationship between two variables. As such, they provide a
useful first step in more sophisticated analysis strategies,
such as regression. Excel provides mechanisms to
include a trend line or best fitting line based on regression
or an alternative curve fitting procedure. The following
graph shows the relationship between 2001 average state
SAT scores and the percent of high school students that
take the SAT test. The graph clearly shows a linear
relationship between the two variables: the greater the
percentage of high school seniors who take the test, the
lower the average state SAT score. Using options in
Excel, we added a regression line, the regression
equation, and R² (a measure of the fit of the data).
Figure 5. A Scatter Plot using Excel of the relationship between the 2001 average state SAT scores and the percent of high school seniors taking the test (Y-axis: Average SAT, Math + Verbal, 900 to 1250; X-axis: Percent Taking, 0 to 100; fitted line: y = -2.133x + 1145.3, R² = 0.7657)
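Outside of Excel, a comparable scatter plot with a least-squares trend line can be sketched in Python; the state-level values below are invented placeholders, not the actual 2001 SAT data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical (percent taking, average SAT) pairs for a handful of states (illustrative only)
pct_taking = np.array([5, 10, 25, 50, 70, 80])
avg_sat = np.array([1140, 1120, 1085, 1040, 1000, 985])

slope, intercept = np.polyfit(pct_taking, avg_sat, deg=1)   # least-squares fitted line

plt.scatter(pct_taking, avg_sat)                      # X: independent, Y: dependent variable
plt.plot(pct_taking, slope * pct_taking + intercept)  # add the trend line
plt.xlabel("Percent Taking")
plt.ylabel("Average SAT (Math + Verbal)")
plt.title(f"Hypothetical data: y = {slope:.3f}x + {intercept:.1f}")
plt.show()
```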
Scatter Plots are an effective way to graph the relationship between two continuous variables. They will show the strength and direction of the relationship. However, with too many observations (e.g., more than 1,000) it is difficult to plot the data and still see a relationship.

Excel can provide you with many ways to dress up the scatter plot, such as adding detail to the chart or adding a regression or trend line.
Scatter plots are also very useful in graphing data over
time. In these plots the x-axis is typically the time element.
The data can be left as a scatter of data points or the
points can be connected by a line. The following graph
shows the annual percentage change in the Consumer
Price Index over the last century. The graph also includes
a five-year running average to show how this approach
“smooths” the data.
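A five-year running (moving) average like the one in the figure can be computed with a few lines of Python; the annual percent changes below are fabricated for illustration, not actual CPI figures:

```python
# Hypothetical annual percent changes (illustrative only, not actual CPI data)
pct_change = [2.1, 5.4, -1.3, 0.8, 3.2, 6.1, 4.0, 2.5, 1.1, 3.8]

window = 5
smoothed = []
for i in range(len(pct_change) - window + 1):
    avg = sum(pct_change[i:i + window]) / window   # average of each 5-year window
    smoothed.append(round(avg, 2))

print(smoothed)   # one smoothed value per 5-year window
```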
Figure 6. A graph of Annual Percentage Change in the Consumer Price Index Including 5-Year Smoothing (Y-axis: Percent Change; years 1910 to 2010; series: Raw Data and 5-Yr Avg)
CENTRAL TENDENCY OF DATA
A useful concept when summarizing data is to find some
way to measure the center of the data. The central
tendency of a variable is the tendency of the data to cluster
or center about certain numerical values. Central
tendency is in contrast to another concept which will be
discussed shortly, variability or the spread of the data. For
central tendency we will focus on the mean, the mode, and
the median.
The Mean. The arithmetic mean or mean is the sum of the
measurements divided by the number of measurements
contained in the data set. For a sample we use $\bar{x}$ (x-bar).
For a population, we use the Greek letter $\mu$ (mu).
The formula for the mean is given as:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad \text{or equivalently} \qquad \bar{x} = \sum_{i=1}^{n} x_i \left(\frac{1}{n}\right)$$
The first formula is the more familiar formula and reflects
that the mean is the average observation. The second
The central tendency of a
variable is the tendency of the
data to cluster or center about
certain numerical values.
formula yields the same result and emphasizes the mean
is a weighted summation with the weights being the
probability of each observation in the data set (i.e., 1/n)
and as such is an expectation of a probability distribution.
As a measure of central tendency the mean has several
advantages (and disadvantages) over other measures.
The first is that the mean uses information of all the values
in a variable - all the values of the variable are added
together and divided by the sample size. We can make
inferences from a sample to a population for the mean; some descriptive statistics of central tendency do not have
inferential properties. The mean forms the basis for a
number of other statistics known as Product Moment
Statistics, which includes the variance, correlation, and
regression coefficients. But, the mean is sensitive to
outliers and extremes in the data. The mean is “pulled”
toward extreme values in the data and it is not as
“resistant” as other measures of central tendency.
The mean has two important mathematical properties that
are important in statistics. The first is that mathematically,
the sum of the deviations about the mean equals zero.
The second property is that the sum of squared deviations
about the mean is a minimum. The latter is called the
Least Squares property. It means that the sum of squared
deviations around the mean is smaller than around any
other value. The least squares property is exploited when
looking at the spread of the data and in regression.
The Median. The median is the middle value when the
measurements are arranged in ascending order. It is a
positional measure because it is based on the middle case
in a variable. In order to find the median value, we first
must sort the data in ascending or descending order, find
the position of the middle value, and then read that value.
The median is an intuitive measure of central tendency - the value at the middle of the ordered data. However, the
median is more tedious to compute by hand because it
requires you to sort the data. Fortunately, spreadsheets
and statistical software packages calculate the median for
us rather easily.
The median has very limited inferential properties, so it is
not used when making inferences from a sample or in
hypothesis testing. Nonetheless, the median is often used
in skewed data because it is not as sensitive to outliers.
The median is often the preferred measure of the center in
data with extreme values, such as income.
Properties of the Mean
• The mean uses information of all the values in a variable
• We can make inferences from a sample to a population for the mean
• The mean forms the basis for a number of other statistics known as Product Moment Statistics
• The mean is sensitive to outliers and extremes in the data
The median is a positional
measure of the center of a
variable and is preferred when
the data contain extreme
values. For example, it is
common to report the median
income rather than mean
income.
The median is also referred to as the 50th percentile.
Other ordered measures include percentiles, deciles,
quintiles, and quartiles. Quartiles, used in box plots,
represent the values at the 25th, 50th, and 75th percentiles.
Some software programs use Q1 for the 25th percentile,
Q3 for the 75th percentile, and Q2 (50th percentile) for the
median.
Mode. The mode is the most frequently occurring value in a
variable. As a measure of the center, the mode is less
useful than the mean or median. However, it can provide
some insights to the most common value in a variable and
the shape of a distribution. In some cases there are
multiple “modes” referred to as Bi-Modal or Tri-Modal.
Multiple modes or groupings around a value may reflect
different groups within a variable. Figure 7 shows a bimodal distribution with a histogram of student weight.
In continuous level data, there may not be any single value
that is the most frequent. The mode may make more
sense in reference to qualitative data. With a qualitative
variable, we refer to the Modal Class or Category which
represents the category with the most responses.
Figure 7. Histogram of weight of 312 students showing a bi-modal distribution that reflects differences between males and females (Y-axis: Frequency)
Comparing the Mean, Median, and Mode. If we have a
variable with a distribution that reflects a symmetrical, bell
shaped curve, the mean, median, and mode would be very
similar to one another. The normal distribution is a very
special bell shaped curve where the mean, median, and
mode are equal to each other by definition. The
symmetrical, bell shaped curve is important in statistics
because it more easily allows us to make probability
statements about the distribution of a variable.
The skew of the data reflects a tail in the distribution pulled
by extreme values, either high or low. The following three
simple rules are useful in making a quick assessment of
the distribution of a variable.
1. If the mean is larger than the median, the data are
skewed to the right with some extreme high values. In
this case the mean is being pulled up by extreme
values in the data.
2. If the mean is smaller than the median, data are
skewed to the left with some extreme low values. In
this case the mean is being pulled down by the
extreme low values in the data.
3. If the mean and the median are very close to each
other it is likely (though not guaranteed) that the
distribution is symmetric and mound shaped.
Simply comparing the mean to the median can give us a
sense of the presence of extreme values or outliers, and in
which direction we can expect the skew.
In a symmetrical, bell shaped
distribution, the mean,
median, and mode would be
very similar to one another.
Simply comparing the mean to
the median gives a sense of
the presence of extreme values
or outliers, and in which
direction we can expect the
skew.
MEASURES OF VARIABILITY OF DATA
Central tendency only tells part of the story when
describing a variable. Another aspect of data is the spread
or variability of data. There are several intuitive measures
of spread of data, including the range, the inter-quartile
range (IQR), the variance, the standard deviation, and the
coefficient of variation.
Variability of the data reflects
the spread of the data around
some center value, usually the
mean.
Range. The range is the difference between the highest
and lowest value in the data. The range provides a
sense of the extremes in the data. It is an order statistic
and depends upon the two most extreme values in the
data. As such, the range may be seriously influenced by
outliers.
Inter-Quartile Range. An alternative to the range is the
Inter-quartile range, which is the difference between the 3rd
quartile (Q3 or 75th percentile) and the 1st quartile (Q1 or
25th percentile). The inter-quartile range provides a sense
of the range in the middle of the data and is not as
sensitive to extreme values in the data.
Variance. The variance is the average squared deviation
around the mean. By deviation we refer to the difference
of a particular value of a variable from the mean of the
variable. The concept of deviations around the mean can
be intuitively appealing as a measure of spread of the
data. If the mean is a good measure of central tendency,
then it is reasonable to ask how different (or how far away)
is a particular value of a variable (X) from the mean of X.
Taking this a step further, we might ask what is the
average distance of all values in the variable from the
mean. However, because of the property that the sum of
deviations around the mean always equals zero, we need
to square the deviations around the mean and take an
average squared deviation.
Let's look at the formula for the variance. We will use the
Greek symbol $\sigma^2$ (sigma squared) to represent the variance of a
population. The sample term for the variance will be $s^2$.

$$\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$$
To put it into words, the numerator reflects the sum of the
square of the calculation of each value in the variable
minus the mean. The numerator is called the Total Sum
of Squares (a term we will see later in regression).
The Total Sum of Squares is
the sum of the squared
deviations of each value in a
variable around the mean. It
is the numerator of the
formula for the variance.
Since we take the square of the deviations around the mean, the
numerator will always be a positive term. Once we divide
by n (or n-1 as we will show later), the number of
observations, the variance reflects the average squared
deviation around the mean. Another way to describe the
variance is that it is the Mean Squared Deviation.
Like the mean, the variance is sensitive to outliers in the
data. In fact, because the terms are squared, the variance
can be extremely sensitive to outliers. When you square
large numbers you get much larger numbers. Care should
be taken when using and interpreting the variance with
data that is highly skewed with high or low outliers.
Standard Deviation. Average squared deviations around
the mean are awkward to discuss and interpret. However,
if we take the square root of the variance we have a value
that is no longer in squared terms. This new term is the
standard deviation, or the average deviation around the
mean. We use the Greek term $\sigma$ (sigma) to represent the
population standard deviation and the term s to represent
the sample standard deviation.
Properties of the Variance
• It is also known as the Mean Squared Deviation
• The numerator is known as the Total Sum of Squares
• When dealing with a sample we divide by (n-1) to adjust for degrees of freedom
• The variance is sensitive to extreme values - outliers have a large effect on the variance
The Variance and Standard Deviation with Sample
Data. When we are dealing with a sample of the
population, and our ultimate goal is some sort of inference,
the formula for the variance and standard deviation must
change. As noted earlier, when we are dealing with a
population we use the Greek term $\sigma^2$ (sigma squared) and
when we are dealing with a sample we use s2. However,
when dealing with a sample the formula must change to
reflect an adjustment due to degrees of freedom. The
adjustment involves using n-1 in the denominator of the
formula. We will use the following formula for the variance
(and the square root of this formula for the standard
deviation) almost exclusively for the rest of this course.
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$$
Degrees of freedom is an important concept in inferential
statistics and it will be seen again in regression analysis.
While it is a difficult concept to comprehend at this level,
think of it as an adjustment when dealing with a sample.
Using n in the formula for $s^2$ tends to underestimate $\sigma^2$ for
the population. Note that the adjustment makes more of a
difference when the sample size is small (less than 30)
than when the sample is large (greater than 1,000). Using
(n-1) in the formula is the default in Excel and most
calculators.
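A small Python check (an added illustration with made-up measurements) shows the population and sample versions of the variance, dividing by n versus n-1:

```python
import statistics

# Hypothetical sample of 6 measurements (illustrative only)
x = [4.1, 5.0, 5.3, 6.2, 6.8, 7.4]

pop_var = statistics.pvariance(x)    # divides by n
samp_var = statistics.variance(x)    # divides by n - 1 (degrees-of-freedom adjustment)
print(pop_var, samp_var)             # the sample version is slightly larger

print(statistics.stdev(x))           # sample standard deviation = square root of the sample variance
```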
When dealing with a sample of data, we adjust the formula for the variance for the degrees of freedom by dividing by n-1.
Coefficient of Variation. Another way to express the
standard deviation is in relation to the mean. The
Coefficient of Variation (CV) is the ratio of the standard
deviation to the absolute value of the mean, usually
expressed as a percentage. By taking a ratio, we express
the standard deviation relative to the mean and it provides
a way to say how much variability there is in a variable
relative to the size of the mean. The higher the
percentage, the more variability.
The CV is particularly useful when comparing the
variability of different variables. For example, suppose we
had a data set on customers and we want to compare the
variability of education level and their income. It would not
be useful to compare the standard deviations because the
metric on income is so much larger. However, we could
compare the CVs for each variable and talk about which
variable has more variability. The CV formula is given
below.
$$CV = \frac{s}{|\bar{x}|} \times 100$$
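A minimal sketch of the comparison described above, using made-up education and income figures (an added illustration, not data from the module):

```python
import statistics

# Hypothetical customer data (illustrative only)
education_years = [10, 12, 12, 14, 16, 16, 18]
income = [28_000, 35_000, 41_000, 52_000, 67_000, 88_000, 120_000]

def cv(values):
    # Coefficient of variation: standard deviation relative to the absolute mean, as a percentage
    return statistics.stdev(values) / abs(statistics.mean(values)) * 100

print(f"CV of education = {cv(education_years):.1f}%")
print(f"CV of income    = {cv(income):.1f}%")   # comparable even though the units differ
```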
Interpreting the Standard Deviation. If our variable is
symmetrical and mound shaped in its distribution, we can
use the Empirical Rule to make some statements to
interpret the standard deviation. By symmetrical we mean
that the distribution is the same (or reasonably close) to
the left and right of the mean. By mound shaped we mean
that the largest proportion of the observations are centered
around the middle of the distribution, and the mean,
median, and mode of the variable are close in value. The
following histogram (Figure 8) of miles per gallon (MPG) of
100 compact cars can be thought of as a symmetrical,
mound shaped distribution with a mean, median, and
mode of 37 and a standard deviation of 2.4.
Figure 8. Histogram of MPG of compact cars that represents a symmetrical, mound shaped distribution (Y-axis: Frequency)
If our variable is symmetrical
and mound shaped we can
use the Empirical Rule to
interpret the standard
deviation, and give us an
indication if a value is an
outlier.
If our variable is symmetrical and mound shaped, the
Empirical Rule tells us that approximately 68% of the
observations should be plus or minus one standard
deviation (34% above or below); 95% should be within plus
or minus 2 standard deviations, and nearly all the
observations (99.7%) should be plus or minus 3 standard
deviations around the mean. We can express this as:
• 68% of the observations are within ±1s of the mean
• 95% of the observations are within ±2s of the mean
• 99.7% of the observations are within ±3s of the mean
This rule allows us to say how likely or unlikely it would be
to find a value that is a certain number of standard
deviations away from the mean. For the MPG example,
we can say that we would expect 68% of the cars to be
within:
A value more than three
standard deviations above or
below the mean is unusual,
especially if the distribution is
symmetrical and mound-shaped.
37 ± 2.4 = 34.6 mpg to 39.4 mpg
We could also say that we would expect 34% of the cars
(one half of the 68%) to have a mpg between 37 and 39.4.
The Empirical Rule also gives us a rule of thumb to
determine if a value is an outlier. If a value is more than
three standard deviations away from the mean, it is
extremely rare. In a probabilistic framework, we would say
that it is possible, but not very probable. Thus, if we had a
compact car that gets less than 29.8 mpg or more than
44.2 mpg we might ask questions. Perhaps it is a
performance car that is part of a different population of
compact cars, or if it is on the high end a specialty hybrid
that is unique. Or, someone could have made a mistake in
measuring mpg or in entering the data in a computer. The
fact that a value is extreme does not make it wrong or bad,
but it should cause us to ask questions and examine it
further.
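Using the module's mpg figures (mean 37, standard deviation 2.4), a tiny Python sketch of this rule of thumb might look like the following; the candidate readings being screened are hypothetical:

```python
mean, s = 37.0, 2.4               # mean and standard deviation from the mpg example
candidates = [36.5, 44.5, 52.0]   # hypothetical new readings to screen

for value in candidates:
    z = (value - mean) / s        # distance from the mean in standard deviations
    if abs(z) > 3:
        print(f"{value} mpg is {z:.1f} standard deviations from the mean - examine it further.")
    else:
        print(f"{value} mpg is within 3 standard deviations of the mean.")
```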
The fact that a value is
extreme does not make it
wrong or bad, but it should
cause us to ask questions and
examine it further.
TRANSFORMATIONS OF DATA
There are times when we will want to transform our data
into another form that is more useable. Reasons to
transform data are to reduce the impact of extreme values
in the data, to make a nonlinear relationship linear, to
present data in a more easily interpretable manner, or to
make adjustments based on a third factor. The latter
reason involves weighting schemes such as per capita,
seasonal adjustments, or adjustments for inflation. Each
of these methods has strengths and weaknesses, and no
Transforming data before
analysis can be a useful way to
better see what is going on in
the data.
method will perfectly solve all the data problems they were
designed to address. However, transforming data at times
provides a useful way to present a clearer picture. I will
talk of three methods to transform the data: creating z-scores, log transformations, and weighting the data.
Z-scores. The z-score approach is a method of
transforming data to reflect relative standing of the value in
relation to the mean. A z-score is calculated by
subtracting the mean from a value and then dividing by the
standard deviation.
$$z_i = \frac{x_i - \bar{x}}{s}$$
The result represents the distance between a given
measurement X and its mean, expressed in standard
deviations. A positive z-score means that measurement is
larger than the mean while a negative z-score means that
it is smaller than the mean. By dividing through by the
standard deviation we are able to say how far away a
value is from its mean in a relative way.
If we were to convert an entire variable to z-scores - take
each value, subtract the mean, and divide by the standard
deviation - we would create a new variable that has a
mean equal to zero and a standard deviation equal to one.
The new variable would be in standardized units and thus
would allow us to compare different values to each other in
terms of how many standard deviations away from the
mean they are.
A z-score transformation does not change the order of the
data or the shape of the distribution of the data. This is
because we are subtracting and dividing through by
constant values (i.e., the mean and standard deviation).
Using a z-score transformation can help in the interpretation of a
variable, comparison of variables measured on different
scales, and in cases of variables whose measurement is
somewhat contrived and arbitrary, such as an index.
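A short sketch of the full-variable transformation described above (an added illustration with invented scores; it uses the sample standard deviation):

```python
import statistics

# Hypothetical test scores (illustrative only)
scores = [55, 61, 68, 70, 74, 79, 85, 92]

mean = statistics.mean(scores)
s = statistics.stdev(scores)
z_scores = [(x - mean) / s for x in scores]   # subtract the mean, divide by the standard deviation

print([round(z, 2) for z in z_scores])
print(round(statistics.mean(z_scores), 10))   # the new mean is (essentially) zero
print(round(statistics.stdev(z_scores), 10))  # the new standard deviation is one
```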
Log Transformation. A popular transformation of data is
a log transformation. In most cases we use the natural
logarithm, or base e. The value of e is 2.7183, and the
natural log of a value is the power to which e must be raised
to equal that number. For example:
When transforming data, be
careful in the interpretation of
the newly transformed data,
especially if the transformation
changes the order of the data.
Z-scores are an effective way to re-express a value to reflect its position relative to the mean, in units of the standard deviation of the data.
The log of 10 = 2.3026, which means $2.7183^{2.3026} \approx 10$
Look at the following natural logs of numbers:

log 10        = 2.3026
log 100       = 4.6052
log 1,000     = 6.9078
log 10,000    = 9.2103
log 100,000   = 11.5129
log 1,000,000 = 13.8155
Logarithmic transformations
are useful when the data has a
large variability and as a result
is skewed toward high or low
values.
With the use of the natural logarithm we can reduce the
variability of a number while still maintaining the original
order of the data. By this I mean that the largest numbers
are still larger, but we greatly reduce the variability
between larger numbers and smaller numbers. Thus, log
transformations can be used to reduce the variability in
variables that have a large range.
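A brief, added Python illustration of this effect (the prices are invented, not the apartment data analyzed below):

```python
import math

# Hypothetical, widely spread prices (illustrative only)
prices = [80_000, 120_000, 250_000, 480_000, 950_000]

logged = [math.log(p) for p in prices]       # natural log (base e) of each price

print(max(prices) - min(prices))             # raw range is very large
print(round(max(logged) - min(logged), 3))   # logged range is far smaller
print([round(v, 2) for v in logged])         # the order of the data is preserved
```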
Let's look at an example of the price of 25 apartment
buildings in a city. The goal of the analysis is to better
understand the price of the building as influenced by such
factors as square footage, number of apartments, and age
and condition of the building. The original data for price
shows a large variability from a minimum of $79,300 to a
maximum of $950,000. The summary statistics are given
in the table below, along with the same statistics for the
natural log of price.
Summary Statistics for Apartment Price and Log of Apartment Price

                     Price             Ln Price
Mean                 290573.52         12.359
Standard Error       42305.83          0.134
Median               268000.00         12.499
Mode                 #N/A              #N/A
Standard Deviation   211529.15         0.671
Sample Variance      44744581138.09    0.450
Kurtosis             2.80              -0.674
Skewness             1.61              0.258
Range                870700.00         2.483
Minimum              79300.00          11.281
Maximum              950000.00         13.764
Sum                  7264338.00        308.973
Count                25.00             25.00
Log transformations are used
in growth models, finance
models of instantaneous rates
of change, and economic
models of constant elasticities.
If we look at a histogram of the original data we can see
that price data are skewed and that the mean is being
pulled by several extreme values in the data set.
Figure 9. Histogram of Apartment Building Price showing a skewed distribution (Y-axis: Frequency)
By taking the natural logarithm of the price we change the
distribution of the variable, reduce the variability, and often
make the variable more normally distributed. If we look at
the transformed price in Figure 10 we can see that the
influence of the extreme values is reduced.
Figure 10. Histogram of the Natural Log of Apartment Price (Y-axis: Frequency)
The main problem with a log
transformation is that we
change the measurement units
and make it harder to interpret
the statistics - they are now
expressed in log units.
Log transformations are very useful tools for data analysis.
They are used in population growth models, instantaneous
rates of change in finance, and in constant elasticity
models in economics. Log transformations can be used in
regression analysis to transform a nonlinear relationship
into a linear relationship in the parameters. The main
caution with a log transformation is that the data are
changed and it is not easy to interpret or compare results
to the original data.
Weighting Schemes. The last transformation discussed
here is various weighting schemes. These are strategies
to weight the data, usually by multiplying or dividing by
another variable, to adjust the original data. Examples of
this include putting a value on a per capita basis (dividing
by the population), adjusting for inflation by using a
Consumer Price Index (CPI), or adjusting time series data
by dividing through by seasonal averages.
If the weights are a constant (every data value is adjusted
by the same weight) the impact is small. For example,
expressing dollar figures as per $1,000 or per $1,000,000.
In fact, this transformation will have no meaningful impact
in advanced techniques such as correlation and
regression. By this we mean that the analysis will not
change in substance or conclusion simply because we
express a variable in $1,000s rather than in actual dollar
terms. This is a good thing and is a strength of regression
analysis.
However, if there is a unique weight for each value, the
impact of this transformation can be substantial. Because
of this, using weights as an adjustment should be justified
on a past practice or theoretical basis. This transformation
can change the distribution and order of the data, so the
change is more radical than other transformations.
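As an added illustration of a unique per-year weight, the sketch below deflates nominal payments by an index so they are expressed in constant dollars; the payment and index values are made-up stand-ins, not the actual flood insurance or CPI figures:

```python
# Hypothetical nominal payments ($ millions) and price index values by year (illustrative only)
payments = {1978: 148, 1990: 210, 2002: 339}
price_index = {1978: 65.2, 1990: 130.7, 2002: 179.9}

base_year = 2002
for year, nominal in payments.items():
    real = nominal * price_index[base_year] / price_index[year]   # each year gets its own weight
    print(f"{year}: ${nominal}M nominal -> ${real:.0f}M in {base_year} dollars")
```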
Let’s look at an example of flood insurance payments over
time in the U.S., from 1978 to 2002. There has been an
upward trend in payments over time, from nearly $148
million in 1978 to $339 million in 2002. There is a great
deal of fluctuation from year to year based on rainfall and
natural occurrences. However, part of the trend over time
is also due to inflation and the change in the value of
money. In fact, one might ask if there really is an upward
trend in payments once we adjust for inflation.
The graph below (Figure 11) shows the trend since 1978.
You can clearly see an upward trend in payments, but
there are fluctuations from year to year. Although not
shown, a linear trend model was fit to the data; it explains
about 33% of the variability in payments. The line in the
graph is generated from that regression model.
If the weights are a constant
the impact is small on most
analyses techniques, such as
regression.
However, if there is a unique
weight for each value, as is the
case of adjusting for annual
inflation rates, the impact of
this transformation can be
substantial.
Figure 11. Linear trend of U.S. Flood Insurance Payments, 1978 to 2002
The next graph (Figure 12) shows the same trend, only
this time the payments are adjusted for inflation using the
Consumer Price Index (CPI) from the Bureau of Economic
Analysis. The shape of the graph is very similar to the
previous graph, but there are some important differences.
Figure 12 also shows an upward trend, but the trend line is
not as steep and the amount the model explains drops to
only 7%. This result shows that the upward trend in
payments is not nearly as steep or noticeable once we
adjust for inflation. In other words, part of the trend was
due to the CPI and not simply an upward payout trend.
Adjusting for inflation helped clarify the trend.
Figure 12. Linear Trend of U.S. Flood Insurance Payments Adjusted for Inflation, 1978 to 2002
CONCLUSIONS
This module was designed to help you gain a basic
understanding of descriptive statistics, graphing, and
transformations as a way to better understand your data.
These techniques serve as basic building blocks for
analysis and form the foundation of more sophisticated
techniques such as correlation and regression. In fact,
much of what regression is about is explaining variability in
a dependent variable, and how independent variables
influence or “explain” the variability. Throughout this
course we continually use some of the techniques in
Module 1 as a starting point for analysis.