Using Statistical Data to Make Decisions
Module 1: Fundamentals of Data Analysis
Dr. Mugdim Pašić
Dr. Tom Ilvento
University of Delaware
Sarajevo Graduate School of Business
Statistics are an important tool for many fields: business, the physical sciences, economics and the
social sciences, engineering and the biological
sciences. They enable us to examine and test important
research questions concerning individual variables and the
relationships among a set of variables. The results, if used
properly, can help make difficult decisions.
Many students approach statistics with some fear and
trepidation. Common concerns involve anxiety over math
skills and a feeling of distrust for the relevance of statistics.
In terms of the former concern, modern desktop and laptop
computers have made most of the calculations of statistics
easy and painless. These tools enable us to focus on the
more important aspects of good data analysis practice and
interpreting results. As for the latter concern of relevance, I
believe we see examples of the importance of statistics each
and every day. For example:
How do they know how much snow has fallen given
snow is so difficult to measure? They actually have a
standard protocol and take an average of several
measurements.
Can a business make decisions about the future by
analyzing data from the past? Yes! Prediction of the
future, though always with some uncertainty, is a core use of
statistics by businesses. Making informed decisions with
data is a value-added opportunity for businesses.
Can we ever get a good measurement of crowd size at
war or policy protests? I don’t have an answer for this one,
but estimates vary wildly!
Can a sales team make decisions on new products from
a sample of consumers? Yes, we can make estimates.
Marketing research is a big user of statistics to help make
decisions on price, product attributes, and the formation of
new products.
How do drug trials lead to the acceptance of a new drug?
Statistics are a big part of this process! Ultimate acceptance
of a new drug for sale requires an elaborate experimental
trial which is analyzed by statistical models. Millions of
dollars ride on the outcome of the statistical analysis.
Key Objectives
• Understand the difference between the descriptive and inferential aspects of statistics
• Understand the concept of a random sample, measurement, and levels of measurement
• Understand the use of basic summary statistics of measures of central tendency and measures of dispersion

In this Module We Will Be:
• Describing data using summary measures of Central Tendency and Dispersion
• Looking at graphical displays of data: box plots, stem and leaf, and time series smoothing
• Transforming data with logs, inversions, trimming and dealing with outliers
For more information, contact:
Tom Ilvento
213 Townsend Hall, Newark, DE 19717
302-831-6773
[email protected]
The focus of this course is on understanding the basics of
statistics. I would like you to gain an appreciation for how
descriptive and inferential statistics are used in your business
or field; how to analyze a set of data; how to present the data
and make meaningful and coherent conclusions to others;
and how to critique the use of statistics by others.
WHAT ARE STATISTICS?
There are many conceptions of statistics and what the term
means. Statistics can refer to the data itself, as in “the
government released the latest statistics on unemployment,”
to a field of study in mathematics, and to a set of tools used
by many disciplines to analyze data.
In its broadest sense, statistics is the science of data. It
refers to:
• Collecting data
• Classifying, summarizing, and organizing data
• Analysis of data
• Interpretation of data
Descriptive versus Inferential Statistics. We make a
distinction between two main approaches in the use of
statistics for data analysis, descriptive versus inferential
statistics. Both approaches are related to each other, but the
distinction between them is important to note.
Descriptive statistics uses measures and graphs to
summarize the data with an emphasis on parsimony. Our
strategy is to find summary measures which describe the
data adequately and succinctly, be they a percentage,
average, or a standard deviation. Descriptive statistics also
involves describing the relationships between variables or
sets of variables through the use of very sophisticated
techniques, such as correlation, regression, factor analysis,
logistic regression and probit analysis.
Inferential statistics involves many of the same techniques
used in descriptive statistics, but takes it a step further. Now
we use these techniques to make estimates, decisions,
predictions, or generalizations about a population from a
smaller subset of data called a sample. The sample can be
a subset of a population at a point in time or a sample of the
population in time or space.
Descriptive statistics uses measures and graphs to summarize the data with an emphasis on parsimony.

Inferential statistics uses some of the same techniques to make estimates, decisions, predictions, or generalizations about a population from a smaller subset or sample.
Inferential statistics is a powerful tool for research. It
enables us to make statements about a large group from a
much smaller sample. Thus, we can survey a sample of
1,000 people and make good generalizations about 280
million people in the U.S.
Sampling. A census is when we collect data on all elements
in a population. Sometimes it is difficult or impossible to get
information on the entire population. An alternative is to take
a sample of the population. A sample is a subset of the units
or elements of a population. Sampling saves time, money,
and other resources (computation time). In some cases, it
may actually be impossible to collect information on every
element of the population and sampling becomes a
reasonable alternative.
A valuable property of a sample is that it is representative of
the population.
By this we mean that the sample
characteristics resemble those possessed by the population.
Inferential statistics require a sample to be representative of
the population, and that can be done when the sample is
drawn through a random process. A random sample is one in
which each element or unit has the same chance of being selected.
Classic statistical inference requires that the sample be
selected through a random process.
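As a rough illustration (added here, not part of the original text), the following Python sketch draws a simple random sample from a hypothetical population list, so that each element has the same chance of being selected:

```python
import random

# Hypothetical population: ID numbers for 10,000 customers (illustrative only)
population = list(range(1, 10_001))

random.seed(42)  # fixed seed so the example is reproducible
sample = random.sample(population, k=1000)  # each element has an equal chance of selection

print(len(sample), sample[:5])
```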
Measurement. Measurement is the process of assigning a
number to variables of the individual elements of the
population (or sample). Measurement is a bigger issue than
many think.
Some measurement seems relatively
straightforward - distance, weight, dollars spent. However,
measurement always comes with some error and perhaps
even bias.
With measurement we must also deal with issues of validity
(are we measuring what we think we are measuring) and
reliability (is the measuring device consistent). A user of data
is responsible for asking questions and in some cases doing
preliminary analysis to determine if the measurement is valid
and consistent. The process of measurement is often
complex – don’t take it for granted.
Levels of Measurement. There are various ways to
characterize measurement of variables. An easy dichotomy
in measurement is qualitative versus quantitative data.
Qualitative data do not follow a natural numerical scale and
thus are classified into categories such as male or female;
customers versus noncustomers; and race (white, African
American, Asian, and so forth). Quantitative data use
measures that are recorded on a naturally occurring scale,
such as age, income, or time.
Only by taking a random sample can we have confidence when we make an inference from a sample to the population.
Measurement is the process of
assigning a number to variables
of the individual elements of the
population (or sample).
The process of measurement is
often complex – don’t take it for
granted.
A more elaborate description involves three levels of
measurement - nominal, ordinal, and continuous. Nominal
(or categorical) measures have no implied order or
superiority and can be thought of as qualitative. A middle
ground is ordinal data, where there is an implied order or
rank, but the distance between units is not well specified.
Rankings, opinion questions that use ordered categories
such as strongly agree to strongly disagree, and variables
that use an ordered scale from one to ten are examples of
ordinal data. Continuous data are the same as quantitative
data.
Qualitative data do not follow a
natural numerical scale, while
quantitative data use measures
that are recorded on a naturally
occurring scale.
Levels of measurement are not a trivial matter in the use of statistics.
Many statistical techniques are predicated on certain levels
of measurement of the variables involved. Some techniques
or formulas assume a certain level is used and misusing a
statistical technique can lead to results that are biased or
misleading.
GRAPHING DATA
Excel and other data management software allow us
numerous ways to graph and display data. Along with
summary statistics, graphs help “tell the story” of the data
and lead to insight or explanation. Like all statistics, the
validity of a graph depends upon the user. It is easy to
distort a graph by manipulating the scale, collapsing data, or
choosing misleading ways to represent the data. For
example, even a small change in a measurement over time
can look large if you adjust the scale on the axis.
A good strategy for all graphing of data is to provide sufficient
information and context to let the reader judge for him or
herself. A good protocol to follow is to:
• Give a caption or title describing the graph
• Identify the source of the data
• Label the axes, bars, or pie slices
• Give an indication of the measurement level (e.g., in $ or $1,000s)
• Identify the scale of the axes, including the starting point
• Provide a context for the graph in the narrative of the report
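As an added illustration (not from the original module, and with made-up data), the sketch below applies several points of this protocol in Python with matplotlib: a title, labeled axes with the measurement level, an explicit axis starting point, and a source note:

```python
import matplotlib.pyplot as plt

# Hypothetical quarterly sales figures, in $1,000s (illustrative data only)
quarters = ["Q1", "Q2", "Q3", "Q4"]
sales = [420, 455, 610, 580]

fig, ax = plt.subplots()
ax.bar(quarters, sales)
ax.set_title("Quarterly Sales, 2023 (hypothetical data)")      # caption or title
ax.set_xlabel("Quarter")                                       # label the axes
ax.set_ylabel("Sales ($1,000s)")                               # indicate the measurement level
ax.set_ylim(0, 700)                                            # make the axis starting point explicit
fig.text(0.01, 0.01, "Source: illustrative data", fontsize=8)  # identify the source
plt.show()
```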
Graphs help “tell the story” of
the data and lead to insight or
explanation.
But, the validity of a graph or
chart depends upon the care in
setting it up correctly.
Graphs of Qualitative Data. Pie charts and bar charts are
the most frequently used graphs of qualitative data. The
graph will depict the frequency or relative frequency of the
categories in the variable. For example, a pie chart can
depict the percentages of customers in different credit card
bins at a period in time. Pie charts have limitations on how
many categories can be represented (more than five begins
to be a problem). Both pie charts and bar charts can present
a categorical variable broken down by a second variable so
that you can compare the distribution of the categories
across the groups.
Graphs of Quantitative Data. There are several useful
graphing techniques to show the distribution of a quantitative
variable. These include a histogram, a box plot, and a stem
and leaf plot. Many of these graphs require some decisions
from the user that may affect the shape of the distribution. A
final graph that we will look at is the scatter plot, which shows
the relationship between two quantitative variables.
Histograms. A histogram is a depiction of a quantitative
variable broken down into categories reflecting the range of
the variable. The histogram bars represent the relative
number (or percentage) of observations in each category. By
taking a continuous variable and breaking it into ordinal
categories we lose some information, but the loss may be
tempered by the potential gain in insight provided by the
graph. However, decisions made by the user, such as the
width and number of the categories, can influence the shape
of the histogram. Care must be taken not to distort the data
with too few or too many categories.
The easiest approach in deciding the width of the category
intervals (also referred to as “bins” in Excel) is to determine
the range of the data (maximum value minus the minimum
value) and divide by the number of categories desired
(minimum of five and a maximum of 15). The number of
categories is constrained by the number of observations in
your variable - the more observations the more categories
possible (the default in Excel is to take the square root of the
number of observations). This approach would provide equal
width intervals, but the frequencies within each interval would
not necessarily be equal. A limitation of this approach is the
intervals may not reflect key thresholds or values important
to decision-making. You may have to tweak the bin ranges
to better reflect your needs.
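A minimal Python sketch of this rule of thumb (added for illustration; the mpg values are randomly generated stand-ins, not the module's data set) takes the square root of the number of observations for the bin count and divides the range by that count:

```python
import math
import random

random.seed(1)
# Hypothetical mpg readings for 100 cars, roughly between 30 and 45 (illustrative only)
mpg = [round(random.uniform(30, 44.9), 1) for _ in range(100)]

n_bins = round(math.sqrt(len(mpg)))       # rule of thumb: square root of n
width = (max(mpg) - min(mpg)) / n_bins    # range divided by the number of bins
print(f"{n_bins} bins, each {width:.2f} mpg wide")

# Count the observations falling in each equal-width interval
edges = [min(mpg) + i * width for i in range(n_bins + 1)]
counts = [sum(lo <= x < hi for x in mpg) for lo, hi in zip(edges[:-1], edges[1:])]
counts[-1] += mpg.count(max(mpg))         # place the maximum value in the last bin
print(counts)
```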
The following is an example of a histogram from Excel. The
data are miles per gallon (mpg) of 100 sub-compact cars.
The default under Excel is 10 intervals (square root of 100).
Histograms provide a good
visual of the distribution of a
variable and can help identify
outliers, multiple modes in the
data, and the skew of the data.
The process of deciding the
number and width of intervals in
histograms can result in a misleading depiction of the
data.
The interval width for the histogram is determined from the
range of the data. The maximum is 44.9 and the minimum is
30. Thus, the interval width calculated by Excel is:
Interval Width = (44.9 - 30) / 10 = 1.49
Excel created the following bin table and the resulting
histogram. The bin values represent the minimum value in
the interval, and the frequencies are the number of
observations up to the next bin value. For example, there
are 9 observations between 34.47 and 35.96 mpg.

Bin      Frequency
30       1
31.49    0
32.98    5
34.47    9
35.96    14
37.45    33
38.94    18
40.43    12
41.92    6
43.41    2

I made one modification to this table. Because of rounding,
there was one extra category which Excel labeled “More.” I
simply added that last value into the final interval, and the
graph below reflects this change. Notice that the graph labels
for the X-axis in Excel are identical to the bin values in
the table. These labels could be cleaned up in Excel if
desired. You are also free to modify the graph in other ways,
including setting the width of the gap between bars.
Figure 1. Excel histogram example of MPG using system defaults for intervals (X-axis: MPG bins; Y-axis: Frequency)

There are many other ways to set the number of categories and the interval width. For example, intervals could be set at particular thresholds, so that the frequencies for each interval are equal, or based on positional measures such as percentiles. There isn't a single right way, but the choice of intervals will influence the shape of the graph and care should be taken.
Excel can help you design histograms as part of the Data Analysis feature under the Tools menu.
Pros of Histograms
• Good visual depiction of the distribution of a variable, showing shape, modes, skew and outliers
• Can be graphed for a small sample size or a large sample size
• Most software programs (including Excel) provide an easy means to construct and display a histogram

Cons of Histograms
• Requires a user decision of the number of intervals and the width of the intervals
• Choices made by a user can distort the graph
Box Plots. Box Plots (also known as Box and Whisker
Plots) are a graphical way to depict the center and spread
of the data based on the median, quartiles, and the
interquartile range. These are called positional measures
of the center and the spread. A box plot provides a
convenient way to look at the spread of a variable and is
especially useful in comparing the spread of a continuous
variable for two or more groups. The calculation and
formation of a box plot is best left to computer programs.
Unfortunately, Excel does not provide a box plot graph, but
many add-in programs for Excel do provide this feature.
Box Plots, also called Box and Whisker Plots, are based on a “Five Number Summary” of statistics based on position, including the median and quartiles.
The box plot is based on a five-number summary - the
minimum, first quartile (Q1), median (Q2), third quartile
(Q3), and the maximum. From Q3 and Q1 we can calculate a
sixth number, the interquartile range (IQR). These numbers
provide the way to formulate the box and the whiskers of the
plot. The table below shows the five numbers for the mpg
data set of 100 sub-compact cars.

Five-Number Summary
Minimum          30.0
First Quartile   35.6
Median           37.0
Third Quartile   38.4
Maximum          44.9
The dimensions of the box represent the IQR, going from
the first quartile (Q1) to the third quartile (Q3). The
box often has a center line which represents the median
value, and occasionally the mean is depicted to show the
difference between the mean and the median. The
whiskers, lines on either side of the box, reflect a distance
of 1.5 IQR to the left of Q1 and to the right of Q3. For
a variable that follows a symmetrical, bell-shaped curve,
most of the values will fall within 1.5 IQR of the first and
third quartiles. Values outside of the whiskers are
considered outliers. Some programs will depict mild
outliers (between 1.5 and 3 IQR from Q1 and Q3) and
extreme outliers (more than 3 IQR from Q1 and Q3).
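Because Excel does not provide a box plot natively (as noted earlier), the positional measures behind one are easy to compute directly. The sketch below (an added illustration with randomly generated stand-in data, not the actual mpg file) finds the five-number summary, the IQR, and the 1.5 IQR whisker fences:

```python
import numpy as np

# Hypothetical mpg values standing in for the 100 sub-compact cars (illustrative only)
rng = np.random.default_rng(7)
mpg = rng.uniform(30.0, 44.9, size=100)

minimum, q1, median, q3, maximum = np.percentile(mpg, [0, 25, 50, 75, 100])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # left whisker limit
upper_fence = q3 + 1.5 * iqr   # right whisker limit

print("Five-number summary:", minimum, q1, median, q3, maximum)
print("IQR:", iqr)
# Values beyond the fences would be drawn as individual outlier points
outliers = mpg[(mpg < lower_fence) | (mpg > upper_fence)]
print("Outliers:", outliers)
```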
Pros of Box Plots
• Good visual depiction of the distribution of a variable, showing shape, modes, skew and outliers
• There is a uniform approach to constructing box plots - no user decisions
• Can be graphed for a small or large sample size, although it may be difficult to show outliers when the data set is large
• Excellent approach for comparing the distribution of a variable across two or more sub-groups

Cons of Box Plots
• Excel cannot construct a Box Plot without an add-in program
Figure 2. Box Plot of MPG of 100 Sub-Compact Cars (horizontal scale: 25 to 50 mpg)
Figure 3. Box Plots of Amount Spent on Catalog Sales by Home Ownership (groups: Amount Spent Total, Renter, Home Owner; vertical scale $0 to $7,000)
Box plots are particularly good at comparing the center
and spread of a variable for two or more groups. The
graph at the top of the page shows a box plot of catalog
sales for audio and video electronic entertainment for
customers who own their home and those who rent. The
Box Plot is constructed by XLSTAT, an Excel add-in.
XLSTAT box plots show mild outliers as open points
and extreme outliers as solid points. XLSTAT allows
several user options, such as how the plots are oriented
and the inclusion of the median and mean values (mean
values are on the top in this graph).

The graph shows the distribution for all customers,
those who rent, and those who own their homes. The
plots show the data are skewed towards several extreme
outliers of customers that have made large purchases of
equipment. The group that owns their home has higher
expenditures, more spread in the data, and more outliers.
Stem and Leaf Plots. Another approach to graphing
continuous data is the Stem and Leaf plot. This approach
tends to work best with small to mid-sized data sets (up to
150 observations). The Stem and Leaf plot uses a clever
approach by using the data itself to make the graph. The
graph below is a depiction of the MPG data of 100
sub-compact cars. For this graph the stems are the whole
numbers and the leaves are a single decimal place.
Figure 4. Stem-and-Leaf Display for MPG
Stem unit: whole number (leaf unit: 0.1 mpg)

30 | 0
31 | 8
32 | 5799
33 | 126899
34 | 024588
35 | 01235667899
36 | 01233445566777888999
37 | 000011122334456677899
38 | 0122345678
39 | 00345789
40 | 0123557
41 | 002
42 | 1
43 |
44 | 9
Pros of Stem and Leaf Plots
• Good visual depiction of the distribution of a variable, showing shape, modes, skew and outliers
• The plot actually uses the data itself to make the graph
• There are fewer user or program decisions when compared to a histogram
The Stem and Leaf plot provides a good graphical picture
of a variable’s distribution, showing the shape, range,
skew, and outliers. In order to construct a Stem and Leaf
plot the user (or a software program) must make some
decisions as well as manipulate the data. Some data do
not lend themselves to a stem and leaf plot, particularly
when the choices for leaves are limited.
The three key steps in constructing a Stem and Leaf plot
are:
1. Sort the data
2. Choose the stems
3. Add the leaves
The stems are the initial digit in the values, such as 1 in
the number 10; 10 in the number 10.6; or 2 in the number
215. It is helpful to look at the sorted data and the range of
the variable to decide the appropriate stems. The stems
can be one, two, or more digits. For example, the stem for
215 could be 2 or 21.
Once the stems are set, the leaves are simply the
remaining digits in the numbers. In most cases it will be
one digit, but it is possible to use more than one digit for
leaves. If you are constructing a stem and leaf plot by
hand, make sure the distance between digits is uniform
and large enough to show the separate observations.
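To make the three steps concrete, here is a rough Python sketch (an added illustration with a short, made-up data set): it sorts the values, takes the whole number as the stem, and uses the first decimal digit as the leaf:

```python
from collections import defaultdict

# A few hypothetical mpg values (illustrative only)
data = [30.0, 31.8, 32.5, 32.7, 32.9, 33.1, 34.0, 35.6, 37.0, 38.4, 44.9]

leaves = defaultdict(list)
for value in sorted(data):                  # step 1: sort the data
    stem = int(value)                       # step 2: the stem is the whole number
    leaf = int(round((value - stem) * 10))  # step 3: the leaf is the first decimal digit
    leaves[stem].append(leaf)

for stem in range(min(leaves), max(leaves) + 1):
    print(f"{stem:3d} | {''.join(str(leaf) for leaf in leaves[stem])}")
```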
Cons of Stem and Leaf Plots
• Excel cannot construct a Stem and Leaf Plot without an add-in program
• Limited to small and medium sized data sets - difficult to produce when the sample size is over 150
• The user (or program) must make some decisions that can influence the shape of the graph
Scatter Plots. When graphing two continuous variables
we often use a scatter plot to show how the variables vary
together. Scatter Plots also provide a useful way to show
how data vary over time. Most spreadsheet programs
provide an easy mechanism to make a scatter plot (also
called XY scatterplot).
In scatter plots we tend to think of one of the variables
as a dependent variable and label it Y. The dependent
variable is the variable you wish to “explain” or understand
by knowing something about the independent variable
(denoted as X). The dependent variable (Y) tends to be on
the vertical axis and the independent variable (X) is on the
horizontal axis of the plot.
Scatter plots provide a visual representation of the
relationship between two variables. As such, they provide a
useful first step in more sophisticated analysis strategies,
such as regression. Excel provides mechanisms to
include a trend line or best fitting line based on regression
or an alternative curve fitting procedure. The following
graph shows the relationship between 2001 average state
SAT scores and the percent of high school students that
take the SAT test. The graph clearly shows a linear
relationship between the two variables: the greater the
percentage of high school seniors who take the test, the
lower the average state SAT score. Using options in
Excel, we added a regression line, the regression
equation, and R² (a measure of the fit of the data).
Figure 5. A Scatter Plot using Excel of the relationship between the 2001 average state SAT scores and the percent of high school seniors taking the test (Y-axis: Average SAT, Math + Verbal, 900 to 1250; X-axis: Percent Taking, 0 to 100; fitted line: y = -2.133x + 1145.3, R² = 0.7657)
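Outside of Excel, a comparable scatter plot with a least-squares trend line can be sketched in Python; the state-level values below are invented placeholders, not the actual 2001 SAT data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical (percent taking, average SAT) pairs for a handful of states (illustrative only)
pct_taking = np.array([5, 10, 25, 50, 70, 80])
avg_sat = np.array([1140, 1120, 1085, 1040, 1000, 985])

slope, intercept = np.polyfit(pct_taking, avg_sat, deg=1)   # least-squares fitted line

plt.scatter(pct_taking, avg_sat)                      # X: independent, Y: dependent variable
plt.plot(pct_taking, slope * pct_taking + intercept)  # add the trend line
plt.xlabel("Percent Taking")
plt.ylabel("Average SAT (Math + Verbal)")
plt.title(f"Hypothetical data: y = {slope:.3f}x + {intercept:.1f}")
plt.show()
```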
Scatter Plots are an effective way to graph the relationship between two continuous variables. They will show the strength and direction of the relationship. However, with too many observations (e.g., more than 1,000) it is difficult to plot the data and still see a relationship.

Excel can provide you with many ways to dress up the scatter plot, such as adding detail to the chart or adding a regression or trend line.
Scatter plots are also very useful in graphing data over
time. In these plots the x-axis is typically the time element.
The data can be left as a scatter of data points or the
points can be connected by a line. The following graph
shows the annual percentage change in the Consumer
Price Index over the last century. The graph also includes
a five-year running average to show how this approach
“smooths” the data.
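A five-year running (moving) average like the one in the figure can be computed with a few lines of Python; the annual percent changes below are fabricated for illustration, not actual CPI figures:

```python
# Hypothetical annual percent changes (illustrative only, not actual CPI data)
pct_change = [2.1, 5.4, -1.3, 0.8, 3.2, 6.1, 4.0, 2.5, 1.1, 3.8]

window = 5
smoothed = []
for i in range(len(pct_change) - window + 1):
    avg = sum(pct_change[i:i + window]) / window   # average of each 5-year window
    smoothed.append(round(avg, 2))

print(smoothed)   # one smoothed value per 5-year window
```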
Figure 6. A graph of Annual Percentage Change in the Consumer Price Index Including 5-Year Smoothing (Y-axis: Percent Change; years 1910 to 2010; series: Raw Data and 5-Yr Avg)
CENTRAL TENDENCY OF DATA
A useful concept when summarizing data is to find some
way to measure the center of the data. The central
tendency of a variable is the tendency of the data to cluster
or center about certain numerical values. Central
tendency is in contrast to another concept which will be
discussed shortly, variability or the spread of the data. For
central tendency we will focus on the mean, the mode, and
the median.
The Mean. The arithmetic mean or mean is the sum of the
measurements divided by the number of measurements
contained in the data set. For a sample we use $\bar{x}$ (x-bar).
For a population, we use the Greek letter $\mu$ (mu).
The formula for the mean is given as:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad \text{or equivalently} \qquad \bar{x} = \sum_{i=1}^{n} x_i \left(\frac{1}{n}\right)$$
The first formula is the more familiar formula and reflects
that the mean is the average observation. The second
The central tendency of a
variable is the tendency of the
data to cluster or center about
certain numerical values.
formula yields the same result and emphasizes the mean
is a weighted summation with the weights being the
probability of each observation in the data set (i.e., 1/n)
and as such is an expectation of a probability distribution.
As a measure of central tendency the mean has several
advantages (and disadvantages) over other measures.
The first is that the mean uses information of all the values
in a variable - all the values of the variable are added
together and divided by the sample size. We can make
inferences from a sample to a population for the mean; some descriptive statistics of central tendency do not have
inferential properties. The mean forms the basis for a
number of other statistics known as Product Moment
Statistics, which includes the variance, correlation, and
regression coefficients. But, the mean is sensitive to
outliers and extremes in the data. The mean is “pulled”
toward extreme values in the data and it is not as
“resistant” as other measures of central tendency.
The mean has two important mathematical properties that
are important in statistics. The first is that mathematically,
the sum of the deviations about the mean equals zero.
The second property is that the sum of squared deviations
about the mean is a minimum. The latter is called the
Least Squares property. It means that the sum of squared
deviations around the mean is smaller than around any
other value. The least squares property is exploited when
looking at the spread of the data and in regression.
The Median. The median is the middle value when the
measurements are arranged in ascending order. It is a
positional measure because it is based on the middle case
in a variable. In order to find the median value, we first
must sort the data in ascending or descending order, find
the position of the middle value, and then read that value.
The median is an intuitive measure of central tendency - the value at the middle of the ordered data. However, the
median is more tedious to compute by hand because it
requires you to sort the data. Fortunately, spreadsheets
and statistical software packages calculate the median for
us rather easily.
The median has very limited inferential properties, so it is
not used when making inferences from a sample or in
hypothesis testing. Nonetheless, the median is often used
in skewed data because it is not as sensitive to outliers.
The median is often the preferred measure of the center in
data with extreme values, such as income.
Properties of the Mean
• The mean uses information of all the values in a variable
• We can make inferences from a sample to a population for the mean
• The mean forms the basis for a number of other statistics known as Product Moment Statistics
• The mean is sensitive to outliers and extremes in the data
The median is a positional
measure of the center of a
variable and is preferred when
the data contain extreme
values. For example, it is
common to report the median
income rather than mean
income.
The median is also referred to as the 50th percentile.
Other ordered measures include percentiles, deciles,
quintiles, and quartiles. Quartiles, used in box plots,
represent the values at the 25th, 50th, and 75th percentiles.
Some software programs use Q1 for the 25th percentile,
Q3 for the 75th percentile, and Q2 (50th percentile) for the
median.
Mode. The mode is the most frequently occurring value in a
variable. As a measure of the center, the mode is less
useful than the mean or median. However, it can provide
some insights to the most common value in a variable and
the shape of a distribution. In some cases there are
multiple “modes” referred to as Bi-Modal or Tri-Modal.
Multiple modes or groupings around a value may reflect
different groups within a variable. Figure 7 shows a bimodal distribution with a histogram of student weight.
In continuous level data, there may not be any single value
that is the most frequent. The mode may make more
sense in reference to qualitative data. With a qualitative
variable, we refer to the Modal Class or Category which
represents the category with the most responses.
Figure 7. Histogram of weight of 312 students showing a bi-modal distribution that reflects differences between males and females (Y-axis: Frequency)
Comparing the Mean, Median, and Mode. If we have a
variable with a distribution that reflects a symmetrical, bell
shaped curve, the mean, median, and mode would be very
similar to one another. The normal distribution is a very
special bell shaped curve where the mean, median, and
mode are equal to each other by definition. The
symmetrical, bell shaped curve is important in statistics
because it more easily allows us to make probability
statements about the distribution of a variable.
The skew of the data reflects a tail in the distribution pulled
by extreme values, either high or low. The following three
simple rules are useful in making a quick assessment of
the distribution of a variable.
1. If the mean is larger than the median, the data are
skewed to the right with some extreme high values. In
this case the mean is being pulled up by extreme
values in the data.
2. If the mean is smaller than the median, data are
skewed to the left with some extreme low values. In
this case the mean is being pulled down by the
extreme low values in the data.
3. If the mean and the median are very close to each
other it is likely (though not guaranteed) that the
distribution is symmetric and mound shaped.
Simply comparing the mean to the median can give us a
sense of the presence of extreme values or outliers, and in
which direction we can expect the skew.
In a symmetrical, bell shaped
distribution, the mean,
median, and mode would be
very similar to one another.
Simply comparing the mean to
the median gives a sense of
the presence of extreme values
or outliers, and in which
direction we can expect the
skew.
MEASURES OF VARIABILITY OF DATA
Central tendency only tells part of the story when
describing a variable. Another aspect of data is the spread
or variability of data. There are several intuitive measures
of spread of data, including the range, the inter-quartile
range (IQR), the variance, the standard deviation, and the
coefficient of variation.
Variability of the data reflects
the spread of the data around
some center value, usually the
mean.
Range. The range is the difference between the highest
and lowest value in the data. The range provides a
sense of the extremes in the data. It is an order statistic
and depends upon the two most extreme values in the
data. As such, the range may be seriously influenced by
outliers.
Inter-Quartile Range. An alternative to the range is the
Inter-quartile range, which is the difference between the 3rd
quartile (Q3 or 75th percentile) and the 1st quartile (Q1 or
25th percentile). The inter-quartile range provides a sense
of the range in the middle of the data and is not as
sensitive to extreme values in the data.
Variance. The variance is the average squared deviation
around the mean. By deviation we refer to the difference
of a particular value of a variable from the mean of the
variable. The concept of deviations around the mean can
be intuitively appealing as a measure of spread of the
data. If the mean is a good measure of central tendency,
then it is reasonable to ask how different (or how far away)
is a particular value of a variable (X) from the mean of X.
Taking this a step further, we might ask what is the
average distance of all values in the variable from the
mean. However, because of the property that the sum of
deviations around the mean always equals zero, we need
to square the deviations around the mean and take an
average squared deviation.
Let's look at the formula for the variance. We will use the
Greek symbol $\sigma^2$ (sigma squared) to represent the variance of a
population. The sample term for the variance will be $s^2$.

$$\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$$
To put it into words, the numerator reflects the sum of the
square of the calculation of each value in the variable
minus the mean. The numerator is called the Total Sum
of Squares (a term we will see later in regression).
The Total Sum of Squares is
the sum of the squared
deviations of each value in a
variable around the mean. It
is the numerator of the
formula for the variance.
Since we take the square of the deviations around the mean, the
numerator will always be a positive term. Once we divide
by n (or n-1 as we will show later), the number of
observations, the variance reflects the average squared
deviation around the mean. Another way to describe the
variance is that it is the Mean Squared Deviation.
Like the mean, the variance is sensitive to outliers in the
data. In fact, because the terms are squared, the variance
can be extremely sensitive to outliers. When you square
large numbers you get much larger numbers. Care should
be taken when using and interpreting the variance with
data that is highly skewed with high or low outliers.
Standard Deviation. Average squared deviations around
the mean are awkward to discuss and interpret. However,
if we take the square root of the variance we have a value
that is no longer in squared terms. This new term is the
standard deviation, or the average deviation around the
mean. We use the Greek term $\sigma$ (sigma) to represent the
population standard deviation and the term s to represent
the sample standard deviation.
Properties of the Variance
• It is also known as the Mean Squared Deviation
• The numerator is known as the Total Sum of Squares
• When dealing with a sample we divide by (n-1) to adjust for degrees of freedom
• The variance is sensitive to extreme values - outliers have a large effect on the variance
The Variance and Standard Deviation with Sample
Data. When we are dealing with a sample of the
population, and our ultimate goal is some sort of inference,
the formula for the variance and standard deviation must
change. As noted earlier, when we are dealing with a
population we use the Greek term $\sigma^2$ (sigma squared) and
when we are dealing with a sample we use s2. However,
when dealing with a sample the formula must change to
reflect an adjustment due to degrees of freedom. The
adjustment involves using n-1 in the denominator of the
formula. We will use the following formula for the variance
(and the square root of this formula for the standard
deviation) almost exclusively for the rest of this course.
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$$
Degrees of freedom is an important concept in inferential
statistics and it will be seen again in regression analysis.
While it is a difficult concept to comprehend at this level,
think of it as an adjustment when dealing with a sample.
Using n in the formula for $s^2$ tends to underestimate $\sigma^2$ for
the population. Note that the adjustment makes more of a
difference when the sample size is small (less than 30)
than when the sample is large (greater than 1,000). Using
(n-1) in the formula is the default in Excel and most
calculators.
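A small Python check (an added illustration with made-up measurements) shows the population and sample versions of the variance, dividing by n versus n-1:

```python
import statistics

# Hypothetical sample of 6 measurements (illustrative only)
x = [4.1, 5.0, 5.3, 6.2, 6.8, 7.4]

pop_var = statistics.pvariance(x)    # divides by n
samp_var = statistics.variance(x)    # divides by n - 1 (degrees-of-freedom adjustment)
print(pop_var, samp_var)             # the sample version is slightly larger

print(statistics.stdev(x))           # sample standard deviation = square root of the sample variance
```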
When dealing with a sample of data, we adjust the formula for the variance for the degrees of freedom by dividing by n-1.
Coefficient of Variation. Another way to express the
standard deviation is in relation to the mean. The
Coefficient of Variation (CV) is the ratio of the standard
deviation to the absolute value of the mean, usually
expressed as a percentage. By taking a ratio, we express
the standard deviation relative to the mean and it provides
a way to say how much variability there is in a variable
relative to the size of the mean. The higher the
percentage, the more variability.
The CV is particularly useful when comparing the
variability of different variables. For example, suppose we
had a data set on customers and we want to compare the
variability of education level and their income. It would not
be useful to compare the standard deviations because the
metric on income is so much larger. However, we could
compare the CVs for each variable and talk about which
variable has more variability. The CV formula is given
below.
$$CV = \frac{s}{|\bar{x}|} \times 100$$
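A minimal sketch of the comparison described above, using made-up education and income figures (an added illustration, not data from the module):

```python
import statistics

# Hypothetical customer data (illustrative only)
education_years = [10, 12, 12, 14, 16, 16, 18]
income = [28_000, 35_000, 41_000, 52_000, 67_000, 88_000, 120_000]

def cv(values):
    # Coefficient of variation: standard deviation relative to the absolute mean, as a percentage
    return statistics.stdev(values) / abs(statistics.mean(values)) * 100

print(f"CV of education = {cv(education_years):.1f}%")
print(f"CV of income    = {cv(income):.1f}%")   # comparable even though the units differ
```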
Interpreting the Standard Deviation. If our variable is
symmetrical and mound shaped in its distribution, we can
use the Empirical Rule to make some statements to
interpret the standard deviation. By symmetrical we mean
that the distribution is the same (or reasonably close) to
the left and right of the mean. By mound shaped we mean
that the largest proportion of the observations are centered
around the middle of the distribution, and the mean,
median, and mode of the variable are close in value. The
following histogram (Figure 8) of miles per gallon (MPG) of
100 compact cars can be thought of as a symmetrical,
mound shaped distribution with a mean, median, and
mode of 37 and a standard deviation of 2.4.
Figure 8. Histogram of MPG of compact cars that represents a symmetrical, mound shaped distribution (Y-axis: Frequency)
If our variable is symmetrical
and mound shaped we can
use the Empirical Rule to
interpret the standard
deviation, and give us an
indication if a value is an
outlier.
If our variable is symmetrical and mound shaped, the
Empirical Rule tells us that approximately 68% of the
observations should be plus or minus one standard
deviation (34% above or below); 95% should be within plus
or minus 2 standard deviations, and nearly all the
observations (99.7%) should be plus or minus 3 standard
deviations around the mean. We can express this as:
• 68% of the observations are within ±1s of the mean
• 95% of the observations are within ±2s of the mean
• 99.7% of the observations are within ±3s of the mean
This rule allows us to say how likely or unlikely it would be
to find a value that is a certain number of standard
deviations away from the mean. For the MPG example,
we can say that we would expect 68% of the cars to be
within:
A value more than three
standard deviations above or
below the mean is unusual,
especially if the distribution is
symmetrical and mound-shaped.
37 ± 2.4 = 34.6 mpg to 39.4 mpg
We could also say that we would expect 34% of the cars
(one half of the 68%) to have a mpg between 37 and 39.4.
The Empirical Rule also gives us a rule of thumb to
determine if a value is an outlier. If a value is more than
three standard deviations away from the mean, it is
extremely rare. In a probabilistic framework, we would say
that it is possible, but not very probable. Thus, if we had a
compact car that gets less than 29.8 mpg or more than
44.2 mpg we might ask questions. Perhaps it is a
performance car that is part of a different population of
compact cars, or if it is on the high end a specialty hybrid
that is unique. Or, someone could have made a mistake in
measuring mpg or in entering the data in a computer. The
fact that a value is extreme does not make it wrong or bad,
but it should cause us to ask questions and examine it
further.
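Using the module's mpg figures (mean 37, standard deviation 2.4), a tiny Python sketch of this rule of thumb might look like the following; the candidate readings being screened are hypothetical:

```python
mean, s = 37.0, 2.4               # mean and standard deviation from the mpg example
candidates = [36.5, 44.5, 52.0]   # hypothetical new readings to screen

for value in candidates:
    z = (value - mean) / s        # distance from the mean in standard deviations
    if abs(z) > 3:
        print(f"{value} mpg is {z:.1f} standard deviations from the mean - examine it further.")
    else:
        print(f"{value} mpg is within 3 standard deviations of the mean.")
```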
The fact that a value is
extreme does not make it
wrong or bad, but it should
cause us to ask questions and
examine it further.
TRANSFORMATIONS OF DATA
There are times when we will want to transform our data
into another form that is more useable. Reasons to
transform data are to reduce the impact of extreme values
in the data, to make a nonlinear relationship linear, to
present data in a more easily interpretable manner, or to
make adjustments based on a third factor. The latter
reason involves weighting schemes such as per capita,
seasonal adjustments, or adjustments for inflation. Each
of these methods has strengths and weaknesses, and no
Transforming data before
analysis can be a useful way to
better see what is going on in
the data.
method will perfectly solve all the data problems they were
designed to address. However, transforming data at times
provides a useful way to present a clearer picture. I will
talk of three methods to transform the data: creating z-scores, log transformations, and weighting the data.
Z-scores. The z-score approach is a method of
transforming data to reflect relative standing of the value in
relation to the mean. A z-score is calculated by
subtracting the mean from a value and then dividing by the
standard deviation.
$$z_i = \frac{x_i - \bar{x}}{s}$$
The result represents the distance between a given
measurement X and its mean, expressed in standard
deviations. A positive z-score means that measurement is
larger than the mean while a negative z-score means that
it is smaller than the mean. By dividing through by the
standard deviation we are able to say how far away a
value is from its mean in a relative way.
If we were to convert an entire variable to z-scores - take
each value, subtract the mean, and divide by the standard
deviation - we would create a new variable that has a
mean equal to zero and a standard deviation equal to one.
The new variable would be in standardized units and thus
would allow us to compare different values to each other in
terms of how many standard deviations away from the
mean they are.
A z-score transformation does not change the order of the
data or the shape of the distribution of the data. This is
because we are subtracting and dividing through by
constant values (i.e., the mean and standard deviation).
Using a z-score transformation can help in the interpretation of a
variable, comparison of variables measured on different
scales, and in cases of variables whose measurement is
somewhat contrived and arbitrary, such as an index.
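A short sketch of the full-variable transformation described above (an added illustration with invented scores; it uses the sample standard deviation):

```python
import statistics

# Hypothetical test scores (illustrative only)
scores = [55, 61, 68, 70, 74, 79, 85, 92]

mean = statistics.mean(scores)
s = statistics.stdev(scores)
z_scores = [(x - mean) / s for x in scores]   # subtract the mean, divide by the standard deviation

print([round(z, 2) for z in z_scores])
print(round(statistics.mean(z_scores), 10))   # the new mean is (essentially) zero
print(round(statistics.stdev(z_scores), 10))  # the new standard deviation is one
```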
Log Transformation. A popular transformation of data is
a log transformation. In most cases we use the natural
logarithm, or base e. The value of e is 2.7183, and the
natural log of a value is the power to which e must be raised
to equal that number. For example:
When transforming data, be
careful in the interpretation of
the newly transformed data,
especially if the transformation
changes the order of the data.
Z-scores are an effective way to re-express a value to reflect its position relative to the mean, in units of the standard deviation of the data.
The log of 10 = 2.3026, which means $2.7183^{2.3026} \approx 10$
Look at the following natural logs of numbers:

log 10        = 2.3026
log 100       = 4.6052
log 1,000     = 6.9078
log 10,000    = 9.2103
log 100,000   = 11.5129
log 1,000,000 = 13.8155
Logarithmic transformations
are useful when the data has a
large variability and as a result
is skewed toward high or low
values.
With the use of the natural logarithm we can reduce the
variability of a number while still maintaining the original
order of the data. By this I mean that the largest numbers
are still larger, but we greatly reduce the variability
between larger numbers and smaller numbers. Thus, log
transformations can be used to reduce the variability in
variables that have a large range.
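A brief, added Python illustration of this effect (the prices are invented, not the apartment data analyzed below):

```python
import math

# Hypothetical, widely spread prices (illustrative only)
prices = [80_000, 120_000, 250_000, 480_000, 950_000]

logged = [math.log(p) for p in prices]       # natural log (base e) of each price

print(max(prices) - min(prices))             # raw range is very large
print(round(max(logged) - min(logged), 3))   # logged range is far smaller
print([round(v, 2) for v in logged])         # the order of the data is preserved
```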
Let's look at an example of the price of 25 apartment
buildings in a city. The goal of the analysis is to better
understand the price of the building as influenced by such
factors as square footage, number of apartments, and age
and condition of the building. The original data for price
shows a large variability from a minimum of $79,300 to a
maximum of $950,000. The summary statistics are given
in the table below, along with the same statistics for the
natural log of price.
Summary Statistics for Apartment Price and Log of Apartment Price

                     Price             Ln Price
Mean                 290573.52         12.359
Standard Error       42305.83          0.134
Median               268000.00         12.499
Mode                 #N/A              #N/A
Standard Deviation   211529.15         0.671
Sample Variance      44744581138.09    0.450
Kurtosis             2.80              -0.674
Skewness             1.61              0.258
Range                870700.00         2.483
Minimum              79300.00          11.281
Maximum              950000.00         13.764
Sum                  7264338.00        308.973
Count                25.00             25.00
Log transformations are used
in growth models, finance
models of instantaneous rates
of change, and economic
models of constant elasticities.
If we look at a histogram of the original data we can see
that price data are skewed and that the mean is being
pulled by several extreme values in the data set.
Figure 9. Histogram of Apartment Building Price showing a skewed distribution (Y-axis: Frequency)
By taking the natural logarithm of the price we change the
distribution of the variable, reduce the variability, and often
make the variable more normally distributed. If we look at
the transformed price in Figure 10 we can see that the
influence of the extreme values is reduced.
Figure 10. Histogram of the Natural Log of Apartment Price (Y-axis: Frequency)
The main problem with a log
transformation is that we
change the measurement units
and make it harder to interpret
the statistics - they are now
expressed in log units.
Log transformations are very useful tools for data analysis.
They are used in population growth models, instantaneous
rates of change in finance, and in constant elasticity
models in economics. Log transformations can be used in
regression analysis to transform a nonlinear relationship
into a linear relationship in the parameters. The main
caution with a log transformation is that the data are
changed and it is not easy to interpret or compare results
to the original data.
Weighting Schemes. The last transformation discussed
here is various weighting schemes. These are strategies
to weight the data, usually by multiplying or dividing by
another variable, to adjust the original data. Examples of
this include putting a value on a per capita basis (dividing
by the population), adjusting for inflation by using a
Consumer Price Index (CPI), or adjusting time series data
by dividing through by seasonal averages.
If the weights are a constant (every data value is adjusted
by the same weight) the impact is small. For example,
expressing dollar figures as per $1,000 or per $1,000,000.
In fact, this transformation will have no meaningful impact
in advanced techniques such as correlation and
regression. By this we mean that the analysis will not
change in substance or conclusion simply because we
express a variable in $1,000s rather than in actual dollar
terms. This is a good thing and is a strength of regression
analysis.
However, if there is a unique weight for each value, the
impact of this transformation can be substantial. Because
of this, using weights as an adjustment should be justified
on a past practice or theoretical basis. This transformation
can change the distribution and order of the data, so the
change is more radical than other transformations.
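As an added illustration of a unique per-year weight, the sketch below deflates nominal payments by an index so they are expressed in constant dollars; the payment and index values are made-up stand-ins, not the actual flood insurance or CPI figures:

```python
# Hypothetical nominal payments ($ millions) and price index values by year (illustrative only)
payments = {1978: 148, 1990: 210, 2002: 339}
price_index = {1978: 65.2, 1990: 130.7, 2002: 179.9}

base_year = 2002
for year, nominal in payments.items():
    real = nominal * price_index[base_year] / price_index[year]   # each year gets its own weight
    print(f"{year}: ${nominal}M nominal -> ${real:.0f}M in {base_year} dollars")
```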
Let’s look at an example of flood insurance payments over
time in the U.S., from 1978 to 2002. There has been an
upward trend in payments over time, from nearly $148
million in 1978 to $339 million in 2002. There is a great
deal of fluctuation from year to year based on rainfall and
natural occurrences. However, part of the trend over time
is also due to inflation and the change in the value of
money. In fact, one might ask if there really is an upward
trend in payments once we adjust for inflation.
The graph below (Figure 11) shows the trend since 1978.
You can clearly see an upward trend in payments, but
there are fluctuations from year to year. Although not
shown, a linear trend model was fit to the data; it explains
about 33% of the variability in payments. The line in the
graph is generated from that regression model.
If the weights are a constant
the impact is small on most
analyses techniques, such as
regression.
However, if there is a unique
weight for each value, as is the
case of adjusting for annual
inflation rates, the impact of
this transformation can be
substantial.
Figure 11. Linear trend of U.S. Flood Insurance Payments, 1978 to 2002
The next graph (Figure 12) shows the same trend, only
this time the payments are adjusted for inflation using the
Consumer Price Index (CPI) from the Bureau of Economic
Analysis. The shape of the graph is very similar to the
previous graph, but there are some important differences.
Figure 12 also shows an upward trend, but the trend line is
not as steep and the amount the model explains drops to
only 7%. This result shows that the upward trend in
payments is not nearly as steep or noticeable once we
adjust for inflation. In other words, part of the trend was
due to the CPI and not simply an upward payout trend.
Adjusting for inflation helped clarify the trend.
Figure 12. Linear Trend of U.S. Flood Insurance Payments Adjusted for Inflation, 1978 to 2002
CONCLUSIONS
This module was designed to help you gain a basic
understanding of descriptive statistics, graphing, and
transformations as a way to better understand your data.
These techniques serve as basic building blocks for
analysis and form the foundation of more sophisticated
techniques such as correlation and regression. In fact,
much of what regression is about is explaining variability in
a dependent variable, and how independent variables
influence or “explain” the variability. Throughout this
course we continually use some of the techniques in
Module 1 as a starting point for analysis.