SMK University of Applied Social Sciences
Laura Saltyte
BASICS OF APPLIED STATISTICS
Course Handbook
Klaipeda
2015
Laura Saltyte
BASICS OF APPLIED STATISTICS
Methodological Handbook
Approved by the decision of the Academic Board of SMK University of Applied Social Sciences,
15th April 2014, No. 4.
Layout by Jurate Banyte-Gudeliene
The publication is financed within the project “Joint Degree Study Programme ‘Technology and
Innovation Management’ Preparation and Implementation”, No. VP1-2.2-SMM-07-K-02-087,
funded in accordance with measure VP1-2.2-SMM-07-K “Improvement of Study Quality,
Development of Internationalization” of priority 2 “Lifelong Learning” of the Human
Resources Development Action Programme 2007–2013.
© Laura Saltyte, 2015
© SMK University of Applied Social Sciences, 2015
ISBN 978-9955-648-27-7
Contents
Introduction
Population and Sample
Data types and measurements
Getting started with SPSS
Self-test questions
Descriptive statistics
Frequency distributions. Graphs
Measures of central tendency
Measures of variation
Standard Scores and Normal distribution
Descriptive statistics in SPSS
Self-test questions
Exercises
Hypothesis testing
The concepts of hypothesis testing
t Ratio or Student’s t
Hypothesis for one sample
Testing the difference of two means (independent and paired samples)
Two independent samples
Two dependent samples
Hypothesis testing in SPSS
Self-test questions
Exercises
Correlation analysis
Self-test questions
Exercises
Regression analysis
Linear regression
Multiple regression
Adequacy of regression model
Correlation and regression analysis in SPSS
Self-test questions
Exercises
Time series analysis
Components of time series
Trend in SPSS
Self-test questions
Exercises
Appendix 1. t values
Appendix 2. Chi square values
Appendix 3. F values
References
Introduction
Statistics has two related meanings: collections of data associated with human enterprises, and a
method that can be used to analyse such data, that is, to organize and make sense of a large
amount of material. Statistics is the study of how to collect, organize, analyse, and
interpret numerical information from data. Statistics is necessary in:
• sports,
• the stock market,
• traffic,
• and hundreds of other human activities.
Like most people, you probably feel that it is important to "take control of your life."
But what does this mean? Partly, it means being able to properly evaluate the data and
claims that bombard you every day. If you cannot distinguish good from faulty reasoning,
then you are vulnerable to manipulation and to decisions that are not in your best
interest. Statistics provides tools that you need in order to react intelligently to
information you hear or read. In this sense, statistics is one of the most important things
that you can study [10].
Statistics are often presented in an effort to add credibility to an argument or
advice. You can see this by paying attention to television advertisements. Many of the
numbers thrown about in this way do not represent careful statistical analysis. They can
be misleading and push you into decisions that you might find cause to regret. For these
reasons, learning about statistics is a long step towards taking control of your life. (It is
not, of course, the only step needed for this purpose.) The present handbook is designed
to help you learn statistical essentials.
The aim of this methodological handbook is to introduce students to data
measurement scales, types of data, and data collection and coding techniques. The
characteristics of data location (mean, mode, median) and data dispersion (variance,
standard deviation, range) are analysed. The course unit discusses statistical
relationship indicators and linear and nonlinear regression. The students learn about
parametric hypothesis testing. Graphical analysis of data is performed during the course
unit studies. Data analysis software tools and their use in solving typical data analysis
tasks are introduced to students. After completion of the course unit, the students are able
to select the models that are most appropriate for the available data and to interpret
the obtained results.
The handbook consists of 6 chapters: Introduction; Descriptive statistics;
Hypothesis testing; Correlation analysis; Regression analysis and Time series analysis.
Also guidelines on how to use one of the most popular packages for statistical calculation
(SPSS) are given at the end of each topic.
Keywords: population, sample, discrete and continuous data, nominal scale,
ordinal scale, interval scale, ratio scale.
Population and Sample
Before we study specific statistical descriptions, let me define the terms population
and sample [10].
A population is a group of phenomena that have something in common. The term
often refers to a group of people, as in the following examples:
• all registered voters in Klaipeda;
• all members of the International Machinists Union;
• all Lithuanians who played basketball at least once in the past year.
But populations can refer to things as well as people:
• all widgets produced last Tuesday by the Acme Widget Company;
• all daily maximum temperatures in July for major Lithuanian cities;
• all basal ganglia cells from a particular rhesus monkey.
Often, researchers want to know things about populations but do not have data for
every person or thing in the population. If a company's customer service division wanted
to learn whether its customers were satisfied, it would not be practical (or perhaps even
possible) to contact every individual who purchased a product. Instead, the company
might select a sample of the population. A sample is a smaller group of members of a
population selected to represent the population. In order to use statistics to learn things
about the population, the sample must be random. A random sample is one in which every
member of a population has an equal chance of being selected. The most commonly used
sample is a simple random sample. It requires that every possible sample of the selected
size has an equal chance of being used [10] (see figure 1).
A parameter is a characteristic of a population. A statistic is a characteristic of a
sample. Inferential statistics enables you to make an educated guess about a population
parameter based on a statistic computed from a sample randomly drawn from that
population. Usually the population size is denoted by the letter N and the sample size by the
letter n.
Figure 1. Population and sample.
Source: www.boundless.com
Statistical procedures can be divided into two major categories: descriptive
statistics and inferential statistics.
Descriptive statistics includes statistical procedures that we use to describe the
population we are studying. The data could be collected from either a sample or a
population, but the results help us organize and describe data. Descriptive statistics can
only be used to describe the group that is being studied. That is, the results cannot be
generalized to any larger group.
Descriptive statistics are useful and serviceable if you do not need to extend your
results to any larger group. However, much of social sciences tend to include studies that
give us “universal” truths about segments of the population, such as all parents, all
women, all victims, etc.
Frequency distributions, measures of central tendency (mean, median, and mode),
and graphs like pie charts and bar charts that describe the data are all examples of
descriptive statistics.
Inferential statistics is concerned with making predictions or inferences about a
population from observations and analyses of a sample. That is, we can take the results of
an analysis using a sample and can generalize it to the larger population that the sample
represents. In order to do this, however, it is imperative that the sample is representative
of the group to which it is being generalized.
To address this issue of generalization, we have tests of significance. A Chi-square
or t-test, for example, can tell us the probability that the results of our analysis on the
sample are representative of the population that the sample represents. In other words,
these tests of significance tell us the probability that the results of the analysis could have
occurred by chance when there is no relationship at all between the variables we studied
in the population we studied.
Examples of inferential statistics include regression analysis, ANOVA, correlation
analysis, survival analysis, etc. [11].
Data types and measurements
We can classify data into two types: continuous and discrete. Meters, centimetres,
millimetres; kilos, grams, milligrams are examples of continuous data. With these we can
make measurements of varying degrees of precision.
Discrete or discontinuous data are based on measurements that can be expressed
only in whole units (counting of people, number of words spelled correctly, number of
cars passing a point, etc.).
Normally, when one hears the term measurement, one may think in terms of
measuring the length of something (e.g., the length of a piece of wood) or measuring a
quantity of something (e.g., a cup of flour). This represents a limited use of the term
measurement. In statistics, the term measurement is used more broadly and is more
appropriately termed scales of measurement. Scales of measurement refer to ways in
which variables/numbers are defined and categorized. Each scale of measurement has
certain properties which in turn determine the appropriateness for use of certain
statistical analyses. The four scales of measurement are nominal, ordinal, interval, and
ratio [5].
Nominal scale – measure of identity, i.e. it classifies individuals into categories
(religious preference: (1) Protestant, (2) Catholic, (3) Jewish, (4) Hinduism, (5) other, (6)
none). Just simple statistical methods are used with nominal data.
Ordinal scale – measures are arranged from the highest to lowest or vice versa. In
this scale we can compare which is larger or smaller, harder or softer etc., but measures
don’t tell how much. Statistically not much can be done, but more than with nominal data.
Interval scale provides numbers that reflect differences among items. With interval
scales the measurement units are equal (Fahrenheit and Celsius thermometers; time as
reckoned our calendar; scores of intelligence test). Many statistical methods can be used
with interval scale.
Ratio scale – the basic difference between this and interval scale is that ratio
scales have an absolute zero (length, width, weight, capacity etc.). All statistical methods
can be used.
Sometimes the interval and ratio scales are together called quantitative scales [5].
Getting started with SPSS
SPSS stands for "Statistical Package for the Social Sciences". It is a very powerful
program that can do all of the statistics that you are ever likely to want to use. It is actually
fairly easy to use, but because it can do statistics for "grownups" as well as novices, it may
seem quite daunting at first. It presents you with a bewildering array of options that you
will probably never need to use. When it comes to giving you statistical results, it will give
you what you want - as well as a lot of extra stuff that you may not need! The secret to
using SPSS is to take it one small step at a time. These series of hand-outs are aimed at
showing you how to use SPSS to do the statistics referred to in the lectures. There are now
many helpful books which explain using SPSS well: the only catch with these is that SPSS
exists in various versions, and so, depending on which book you get, it may not
correspond exactly to what happens with the version we will be using (version 20.0) [7].
1. Starting up SPSS:
Double-click on the SPSS icon. After a few seconds, a window like the following
should appear on your screen. This is the "Data View" window. It has the default name
"Untitled1".
At the top of this window, there is a row of commands, (File, Edit, Help, etc.)
Clicking on any of these will produce a drop-down menu, much as in other programs that
you might be familiar with (such as Word or Excel). At this stage, many of the options on
the menus will appear quite meaningless to you. "File", "Edit", "Analyze" and "Window"
are the options that we will use the most. As you might expect, "File" enables you to open
and save files, and "Edit" enables you to cut, copy and paste things. We will use "Analyze"
to perform various statistical tests. "Help" provides you with information about SPSS [7].
If you click on the tab at the bottom left of the window, it switches to a new
window, the "Variable View". You can toggle between these two windows at any time. You
use the Data View window when you want to input data, and you use the Variable View
window to change various properties of the data.
There is a third window, the "Output Window": this will contain the results of any
statistical analyses that you perform. (The Output Window will only be available once you
have some output to see, so it won't actually be accessible just yet). Each window has the
controls for SPSS at the top of the screen (the words "File", "Edit" and so on, and the row
of icons beneath them). Most of the SPSS controls will remain visible all the time, but you
can switch between the three windows whenever you like. You switch between the "Data
View" and "Variable View" windows by using the tabs at the bottom left of either of these
two windows. You switch between these two windows and the "Output Window" by
clicking on "window" at the top left of the screen and selecting the one that you want from
the menu that appears.
2. Entering data:
In SPSS, each row of the grid is a "case" and each column is a "variable". To make
this clear, imagine we have the heights and weights of six people. Each person is a
separate case. We have three variables: height, weight and sex. We could therefore enter
the data in such a way that each row represents a different individual: one column has the
height data, another column has the weight data, and a third column tells SPSS whether
the person was male or female. To enter values, make sure you have the "Data view"
window selected. Move the cursor to the square into which you wish to make an entry,
and click on it. Enter the value, followed by a press on the "enter" key. To move around the
grid, you can use the arrow keys or the mouse [7].
First of all you need to prepare SPSS for entering the data. So you need to switch
from Data View to Variable View.
Each row in this window contains information about one of the variables (one of
the columns) in the "Data View" window. Change the Name of variable "var0001" to
"name"; change the name of "var0002" to "height"; change "var0003" to "weight"; and
change "var0004" to "sex". In the version of SPSS that we are using, the variable name can
be any combination of letters and numbers, but it must not contain a space or any other
symbols. You can add a more informative title to a variable, one that can include
punctuation marks and spaces, by entering it in the box that is entitled Label. It is very
important to do this, as SPSS will show these labels in your output, and they will make the
output much easier to understand. I have many data files with variable names like
"qw1325bc", which made sense at the time I wrote the file but are now quite meaningless
to me because I didn't label the variables!
Type and Values:
SPSS can treat an entry to a cell as a sequence of characters (a "string") or as a
number. "Tom" is a string, while his height and weight are numbers. For "sex", I have used
numbers as strings: "1" represents "male" and "2" represents "female", but these are
arbitrary labels. I could have used any two numbers (say "5" for "male" and "0" for
"female") and SPSS would have been equally happy. It is easy to forget which number
represents which condition. However, if you click on the "Values" column, a small grey box
appears. Click on this, and a dialog box pops up. You can associate a label with each
number - thus, in this case, you can tell SPSS that "1" means "male" and "2" means
"female". (Don't forget to click on "add" each time you enter a label).
Width:
Width merely specifies the width of each column in the "Data View" window. By
default, a column that contains numbers is eight characters wide. However, an annoying
complication of this version of SPSS is that, if the first entry in a column is a string of a
certain length, SPSS assumes that all of the subsequent cells will contain strings of the
same length. This is why the names start with "Matilda": had I started the column with
"Tom", SPSS would assume that all of the strings in this column are going to be 3
characters long. Consequently "Dick" would have been truncated to "Dic", "Harry" to "Har"
and so on. You can stop SPSS doing this by changing the column width to suit the length of
your strings - or, as I did here, by making sure that the longest string goes in the first cell
of a column! [7]
Decimals:
If the variable is a number, this column specifies how many decimal places each
case will be displayed to. So, if you select 0, Tom's height will appear as "2000"; if you
select 2, it will appear as "2000.00"; if you select 3, his height will be "2000.000"; and so
on. If the variable is a string, this is irrelevant, and so SPSS will show a 0 in the relevant
cell of the "Variable View" window [7].
Missing:
Sometimes a data-set is incomplete - perhaps someone forgot to tell you their
height, for example. If a numerical entry is blank, SPSS treats it as missing data.
However, sometimes you might want to enter a code for missing values - perhaps "999" to
show that the value is missing because the person forgot to enter it, and "99" to show that
it is missing because the person refused to give it. "Missing" enables you to do this. At the
moment, it is the simplest to show missing data by leaving blank the relevant cell in the
"Data View" window [7].
Measure:
This will be explained more fully in the statistics lectures. Essentially this column
shows what kind of data SPSS thinks is in the column: nominal (a name, i.e. a string),
ordinal (rating data) or scale (interval or ratio data). At the moment, it will suffice to keep
clear the distinction between using numbers as numbers, and using them as labels
(strings).
Now we are ready to enter the data. You need to go back to the Data View window. Let’s say
we have information about six people. Each column contains data of a particular kind, a
"variable" (in this instance, "gender"); each row contains a single person’s set of data, a
"case".
In the column "sex" you can enter numbers (1 or 2, as appropriate), and they will be
displayed as "male" or "female".
3. Saving a file:
Once you have entered your data, always save the data before doing anything else.
Saving the file will save you a lot of heartache in the future. It is really demoralising to
spend hours typing numbers in, only to lose the lot by some accident or mistake. Click on
"File" of the SPSS controls (it is top left). A menu will appear. Now click on "Save As", and
enter a filename in the top left-hand box - where it says " *.sav". Any combination of letters
and numbers will do, but let's call the file "chicken". SPSS will automatically add the suffix
".sav". The ".sav" bit is important, as it tells SPSS that this file is a data-file of a kind that it
likes. (SPSS can read other types of data-file as well, but it is simplest to stick with .sav
types for the moment). So, all you have to do is type "chicken" (or, in future, any filename
you choose) in the box to the right of "filename".
Press "enter", and SPSS will save the data into a file. You can now carry on, secure in
the knowledge that whatever happens, your data will be safe on the computer; or you can
quit SPSS and come back another time. To quit SPSS, click on "exit", which is at the bottom
of the "File" menu.
When you save your data, an "Output Window" will open automatically, containing
information that you've successfully saved the file. If this is all it contains, just close it
without saving it. However if you have run some statistical analyses and hence have some
output in the "Output Window", you can save this in a separate file. To do so, make the
"Output Window" the active window (as described earlier), and then click on "File" and
then "Save As" in the same way as for saving the data. This time, SPSS will prompt you to
supply a filename ending in ".spo", to show that it is an output file rather than a data file.
Thus, you could call it "chicken.spo", and SPSS will then save the contents of the "Output"
window as a file. This file can be read into a word-processor such as Microsoft Word, and
then treated like any other text document [7].
Self-test questions
1. What is the main difference between the nominal and ordinal scales?
2. What is the main difference between descriptive and inferential statistics?
3. What can the "Values" column in SPSS be used for?
4. What is the population if the goal of the research is to explore students’ views of Lithuanian
high schools?
5. Would the variable "Age" be an interval or a scale variable?
Descriptive statistics
It is well known that a picture tells more than a thousand words. The same applies to
any serious data analysis. The first step of data analysis is to summarize the data by
drawing plots and charts as well as by computing some descriptive statistics. These tools
essentially aim to provide a better understanding of how frequent the distinct data values
are, and how much variability there is around a typical value in the data.
After finishing this chapter, students will be able to collect, systematize and analyse
characteristics defining social-economic phenomena. The aim of this chapter is to explain
how to systemize data, what kinds of descriptive statistics can be used and how to use
them; the Normal curve and its applications will also be explained.
Keywords: frequency table, grouped data, central tendency, variation, standard
scores.
Frequency distributions. Graphs
Often to make our data more interpretable and convenient, we set up a frequency
distribution and draw graphs of various kinds to represent the data [6].
A frequency table reports the number of times that a given observation occurs or, if
based on relative terms, the frequency of that value divided by the number of
observations in the sample. Usually a frequency table is applied to categorical (discrete)
data with no more than 10-15 different categories.
Example
A company in the transformation industry classifies the individuals at managerial
positions according to their university degree.
1 – Accountant;
2 – Administrator;
3 – Economist;
4 – Engineer;
5 – Lawyer;
6 – Physicist;
Given data: 1, 2, 3, 6, 2, 3, 4, 5, 4, 2, 3, 4, 3, 4, 4, 5, 4, 4
Frequency table:

Degree       Counts  Frequencies  Percentage
Accounting   1       1/18         5,56%
Business     3       1/6          16,67%
Economics    4       2/9          22,22%
Engineering  7       7/18         38,89%
Law          2       1/9          11,11%
Physics      1       1/18         5,56%
The corresponding plot for this type of categorical data is a bar or pie chart (see Figures 1
and 2).
Figure 1. Bar plot of the frequency table
Figure 2. Pie plot of the frequency table
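The same table can be reproduced programmatically. The following is a minimal Python sketch (not part of the original handbook), using the degree codes from the example above and the degree names as they appear in the table:

from collections import Counter

# Degree codes of the 18 managers, as listed in the example above
data = [1, 2, 3, 6, 2, 3, 4, 5, 4, 2, 3, 4, 3, 4, 4, 5, 4, 4]
labels = {1: "Accounting", 2: "Business", 3: "Economics",
          4: "Engineering", 5: "Law", 6: "Physics"}

n = len(data)
counts = Counter(data)
for code in sorted(counts):
    c = counts[code]
    # count, relative frequency, and percentage, as in the table above
    print(f"{labels[code]:<12} {c:>2}  {c}/{n}  {100 * c / n:.2f}%")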
If the sample size is big, and measurements are made on an interval or ratio scale,
a frequency table of grouped data can be used. Before making a frequency table of grouped
data some rules should be followed:
• we seldom use fewer than 6 or more than 15 classes (intervals). The exact
number we use in a given situation will depend on the nature, magnitude, and
range of the data;
• we always make sure that each item (measurement or observation) goes into one
and only one class (interval);
• whenever possible, we make the classes (intervals) the same length; that is, we
make them cover equal ranges of values;
• if a set of data contains a few values, which are much greater than or much smaller
than the rest, open classes are quite useful in reducing the number of classes
required to accommodate the data.
The following scheme can be used to construct intervals of equal length:
• find x_min and x_max;
• determine the number of intervals (k = 6 … 15);
• calculate the interval length h = (x_max − x_min)/k;
• calculate break points for the intervals: c_i = c_{i−1} + h (c_0 = x_min).
Example
Consider a sample of College graduates, whose first salaries (in 1000 Lt per annum)
after graduating are as follows:
140 150 75 96 96 86 99 100 86 87 89 95 122 125 95 95 96 97 97 150 97 98 99 95
132 99 99 100 100 105 110 110 110 115 97 98 120 95 135 160
1. x_min = 75; x_max = 160;
2. let‘s make 5 intervals (k = 5);
3. interval length: h = (160 − 75)/5 = 17;
4. interval break points:
c_0 = 75; c_1 = 75 + 17 = 92; c_2 = 92 + 17 = 109; c_3 = 109 + 17 = 126;
c_4 = 126 + 17 = 143; c_5 = 143 + 17 = 160.
Frequency table for grouped data:

Intervals   Counts  Frequencies  Percentage  Middle points
[75;92)     5       1/8          12,50       83,5
[92;109)    22      11/20        55,00       100,5
[109;126)   7       7/40         17,50       117,5
[126;143)   3       3/40         7,50        134,5
[143;160]   3       3/40         7,50        151,5

Middle points are calculated as follows:
x_i0 = (c_{i−1} + c_i)/2
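As an illustration (a Python sketch, not part of the original handbook), the grouping scheme above can be carried out on the salary data:

salaries = [140, 150, 75, 96, 96, 86, 99, 100, 86, 87, 89, 95, 122, 125, 95,
            95, 96, 97, 97, 150, 97, 98, 99, 95, 132, 99, 99, 100, 100, 105,
            110, 110, 110, 115, 97, 98, 120, 95, 135, 160]

k = 5                                          # chosen number of intervals
h = (max(salaries) - min(salaries)) / k        # h = (160 - 75)/5 = 17
breaks = [min(salaries) + i * h for i in range(k + 1)]  # 75, 92, 109, 126, 143, 160

for lo, up in zip(breaks, breaks[1:]):
    last = (up == breaks[-1])                  # the last interval is closed: [143;160]
    count = sum(lo <= x < up or (last and x == up) for x in salaries)
    print(f"[{lo:g};{up:g}{']' if last else ')'} count={count} midpoint={(lo + up) / 2:g}")
# counts: 5, 22, 7, 3, 3 - matching the table above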
The corresponding plot for this type of grouped data is a histogram. A histogram is
constructed by representing the measurements of observations that are grouped on a
horizontal scale, the class frequencies (or corresponding percentage) on vertical scale,
and drawing rectangles whose bases equal the class interval and whose heights are
determined by the corresponding class frequencies (percentage).
Figure 3. Histogram of the grouped salary data
An alternative, although less widely used form of graphical presentation is the
frequency polygon. Here the class frequencies are plotted at the class marks and the
successive points are connected by means of straight lines.
Figure 4. Frequency polygon of the grouped salary data
Measures of central tendency
A measure of central tendency is a single value that attempts to describe a set of
data by identifying the central position within that set of data. As such, measures of
central tendency are sometimes called measures of central location. They are also classed
as summary statistics. The mean (often called the average) is most likely the measure of
central tendency that you are most familiar with, but there are others, such as the median
and the mode.
The mean, median and mode are all valid measures of central tendency, but under
different conditions, some measures of central tendency become more appropriate to use
than others. In the following sections, we will look at the mean, mode and median, and
learn how to calculate them and under what conditions they are most appropriate to be
used [12].
The Mean
The mean (or average) is the most popular and well known measure of central
tendency. It can be used with both discrete and continuous data, although its use is most
often with continuous data. The mean is equal to the sum of all the values in the data set
divided by the number of values in the data set. Therefore, if we have n values in a data set
and they have values x_1, x_2, …, x_n, the sample mean, usually denoted by x̄ (pronounced
"x bar"), is:
x̄ = (x_1 + x_2 + ⋯ + x_n)/n
This formula is usually written in a slightly different manner using the Greek capital
letter Σ, pronounced "sigma", which means "sum of…":
x̄ = (Σ x_i)/n
If we have a frequency table, a slightly different formula can be used:
x̄ = (Σ f_i x_i)/n
Here the f_i are the frequencies.
The formula for grouped data would be:
x̄ = (Σ m_i x_i0)/n
Here the m_i are the frequencies of the grouped data and the x_i0 are the middle points.
Example
Suppose that we have the grades of 50 students in a course of elementary statistics:

Grades (x_i)  Counts (f_i)  f_i x_i
5             12            60
6             18            108
7             13            91
8             4             32
9             2             18
10            1             10
              n = 50        Σ f_i x_i = 319

x̄ = 319/50 = 6,38
Example
Suppose that we have grouped data:

Age groups  Counts (m_i)  Midpoints (x_i0)  m_i x_i0
[25;30)     1             27,5              27,5
[30;35)     0             32,5              0
[35;40)     3             37,5              112,5
[40;45)     6             42,5              255
[45;50)     6             47,5              285
[50;55)     6             52,5              315
[55;60)     7             57,5              402,5
[60;65)     4             62,5              250
[65;70)     4             67,5              270
[70;75)     1             72,5              72,5
[75;80)     1             77,5              77,5
[80;85]     1             82,5              82,5
            n = 40                          Σ m_i x_i0 = 2150

x̄ = 2150/40 = 53,75
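To make the three formulas concrete, here is a small Python sketch (not from the handbook) that reproduces the means computed above:

# Plain data: x̄ = Σ x_i / n
raw = [10, 12, 15, 18, 20]
mean_raw = sum(raw) / len(raw)

# Frequency table (value -> frequency): x̄ = Σ f_i x_i / n
grades = {5: 12, 6: 18, 7: 13, 8: 4, 9: 2, 10: 1}
n = sum(grades.values())
mean_freq = sum(f * x for x, f in grades.items()) / n

# Grouped data as (midpoint, count) pairs: x̄ = Σ m_i x_i0 / n
groups = [(27.5, 1), (32.5, 0), (37.5, 3), (42.5, 6), (47.5, 6), (52.5, 6),
          (57.5, 7), (62.5, 4), (67.5, 4), (72.5, 1), (77.5, 1), (82.5, 1)]
n_g = sum(m for _, m in groups)
mean_grouped = sum(m * x0 for x0, m in groups) / n_g

print(mean_raw, mean_freq, mean_grouped)   # 15.0 6.38 53.75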
The mean is essentially a model of your data set. It is the value that is most
common. You will notice, however, that the mean is not often one of the actual values that
you have observed in your data set [12].
When not to use the mean
The mean has one main disadvantage: it is particularly susceptible to the influence
of outliers. These are values that are unusual compared to the rest of the data set by being
especially small or large in numerical value. For example, consider the wages of staff at a
factory below:
Staff    1     2     3     4     5     6     7     8     9     10
Salary   1500  1800  1600  1400  1500  1500  1200  1700  9000  9500

The mean would be
x̄ = 30700/10 = 3070 Lt
and this mean doesn't describe the real situation in the company.
The Median
The median is the middle score for a set of data that has been arranged in order of
magnitude. The median is less affected by outliers and skewed data. In order to calculate
the median, suppose we have the data below:
65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92
We first need to rearrange that data into order of magnitude (smallest first):
14, 35, 45, 55, 55, 56, 56, 65, 87, 89, 92
Our median mark is the middle mark - in this case, 56 (highlighted in bold). It is the
middle mark because there are 5 scores before it and 5 scores after it. This works fine
when you have an odd number of scores, but what happens when you have an even
number of scores? What if you had only 10 scores? Well, you simply have to take the
middle two scores and average the result. So, if we look at the example below:
65, 55, 89, 56, 35, 14, 56, 55, 87, 45
We again rearrange that data into order of magnitude (smallest first):
14, 35, 45, 55, 55, 56, 56, 65, 87, 89
Only now we have to take the 5th and 6th score in our data set and average them to
get a median of 55.5 [12].
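In Python, the built-in statistics module applies exactly this rule; a quick sketch (not part of the handbook):

import statistics

odd = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]   # 11 scores
even = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]      # 10 scores

print(statistics.median(odd))    # 56: the single middle value
print(statistics.median(even))   # 55.5: average of the two middle values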
The Mode
The mode is the most frequent score in our data set. On a bar chart or histogram it is
represented by the highest bar. You can, therefore, sometimes consider the mode as being
the most popular option.
Normally, the mode is used for categorical data where we wish to know which is
the most common category, as illustrated below:
Figure 5. Bar chart of a categorical variable (car, bus, train, bicycle); the mode is the tallest bar
One of the problems with the mode is that it is not unique, so it leaves us with
problems when we have two or more values that share the highest frequency, such as
below:
Figure 6. A distribution in which two values share the highest frequency
Another problem with the mode is that it will not provide us with a very good
measure of central tendency when the most common mark is far away from the rest of the
data in the data set, as depicted in the diagram below:
Figure 7. A distribution whose mode lies far from the rest of the data
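A short sketch (not from the handbook) showing how this non-uniqueness surfaces in code; in Python 3.8+, statistics.multimode returns every value that ties for the highest frequency:

import statistics

data = [1, 2, 2, 3, 3, 4]
print(statistics.mode(data))       # 2: the first mode encountered
print(statistics.multimode(data))  # [2, 3]: both values share the highest frequency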
Summary of when to use the mean, median and mode
Please use the following summary table to know what the best measure of central
tendency is with respect to the different types of variable [12].
Table 1. When to use the mean, median and mode

Type of Variable  Best measure of central tendency
Nominal           Mode
Ordinal           Median or Mode
Interval/Ratio    Mean
Measures of variation
Measures of variability indicate the degree to which the scores in a distribution are
spread out. Larger numbers indicate greater variability of scores. Sometimes the word
dispersion is substituted for variability, and you will find that term used in some statistics
texts. We will divide our discussion of measures of variability into three categories: the
range, the variance, and the standard deviation.
The Range
The Range is the distance from the lowest score to the highest score. The range is very
unstable, because it depends on only two scores. If one of those scores moves further
from the distribution, the range will increase even though the typical variability among
the scores has changed very little.
The Variance
The variance is the average of the squared deviations between the individual scores and
the mean. The larger the variance, the more variability there is among the scores. When
comparing two samples with the same unit of measurement, the variances are comparable
even though the sample sizes may be different. The notation that is used for the variance
is a lowercase s². The computational formula for the variance is shown below.
s² = (1/(n − 1)) ∙ (Σ x_i² − n x̄²)
Variance for a frequency table:
s² = (1/(n − 1)) ∙ (Σ f_i x_i² − n x̄²)
Variance for grouped data:
s² = (1/(n − 1)) ∙ (Σ m_i (x_i0)² − n x̄²)
Definitions for f_i, m_i, and x_i0 can be found above.
Did you notice that the variance formula does not divide by n, but instead divides by
(n − 1)? The denominator (n − 1) in this equation is called the degrees of freedom. It is a
concept that you will hear about again and again in statistics. The reason that the variance
formula divides the sum of squared deviations from the mean by (n − 1) is that dividing
by n would produce a biased estimate of the population variance, and that bias is removed
by dividing by (n − 1).
The Standard deviation
The variance has some excellent statistical properties, but it is hard for most
students to conceptualize. To start with, the unit of measurement for the mean is the same
as the unit of measurement for the score. For example, if we compute the mean age of the
sample and find that it is 28.7 years, that mean is on the same scale as the individual ages
of our participants. But the variance is in squared units. For example, we might find that
the variance is 100 years². Can you even imagine what the unit "years squared"
represents? Most people can't. But there is a measure of variability that is in the same
units as the mean. It is called the standard deviation, and it is the square root of the
variance (see the formula below). So if the variance was 100 years², the standard
deviation would be 10 years. Since we used the symbol s² to indicate variance, you might
not be surprised that we use the lowercase letter s to indicate the standard deviation. You
will see in our discussion of relative scores how valuable the standard deviation can be.
s = √(s²)
At this point, many students assume that the variance is just a step in computing the
standard deviation, because the standard deviation seems like it is much more useful and
understandable. In fact, you will use the standard deviation for description purposes only
and will use the variance for all your other statistical tasks.
Example
Let's say we have the data set 10, 12, 15, 18, 20.
The Range would be 20 − 10 = 10.
The Variance (here x̄ = 15):
s² = (1/4) ∙ (10² + 12² + 15² + 18² + 20² − 5 ∙ 15²) = (1193 − 1125)/4 = 17
The Standard deviation:
s = √17 ≈ 4,12
Example
For the frequency table of grades used above:

Grades (x_i)  Counts (f_i)  f_i x_i²
5             12            300
6             18            648
7             13            637
8             4             256
9             2             162
10            1             100
              n = 50        Σ f_i x_i² = 2103

The variance: s² = (1/49) ∙ (2103 − 50 ∙ 6,38²) ≈ 1,38;
here 6,38 is the average, calculated before.
The standard deviation: s = √1,38 ≈ 1,18.
Example
For the given grouped data calculate the measures of variability.

Age groups  Counts (m_i)  Midpoints (x_i0)  m_i (x_i0)²
[25;30)     1             27,5              756,25
[30;35)     0             32,5              0
[35;40)     3             37,5              4218,75
[40;45)     6             42,5              10837,5
[45;50)     6             47,5              13537,5
[50;55)     6             52,5              16537,5
[55;60)     7             57,5              23143,75
[60;65)     4             62,5              15625
[65;70)     4             67,5              18225
[70;75)     1             72,5              5256,25
[75;80)     1             77,5              6006,25
[80;85]     1             82,5              6806,25
            n = 40                          Σ m_i (x_i0)² = 120950

The variance:
s² = (1/39) ∙ (120950 − 40 ∙ 53,75²) ≈ 138,14
here 53,75 is the average.
The standard deviation:
s = √138,14 ≈ 11,75.
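These computations can be checked with Python's statistics module, which uses the same (n − 1) denominator; a sketch (not part of the handbook):

import statistics

data = [10, 12, 15, 18, 20]
print(statistics.variance(data))   # 17.0
print(statistics.stdev(data))      # ~4.12

# For grouped data, expand each midpoint by its count and reuse the functions.
groups = [(27.5, 1), (37.5, 3), (42.5, 6), (47.5, 6), (52.5, 6), (57.5, 7),
          (62.5, 4), (67.5, 4), (72.5, 1), (77.5, 1), (82.5, 1)]
ages = [x0 for x0, m in groups for _ in range(m)]
print(statistics.variance(ages))   # ~138.14
print(statistics.stdev(ages))      # ~11.75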
Standard Scores and Normal distribution
Each quantitative variable can be transformed into standard scores. The formula for a
standard score is
z = (x − x̄)/s
where x is any raw score or unit of measurement, x̄ is the mean and s the standard
deviation of the distribution of scores.
First we calculate the deviation from the mean, and then divide it by the standard
deviation. When we change raw scores to standard scores, we are expressing them in
standard deviation units. These standard scores tell us how many standard deviation units
any given raw score deviates from the mean. Practically all of the cases fall within 3
standard deviations on either side of the mean. z scores have a mean of zero and a
standard deviation equal to 1. Any time we see a standard score, we should be able to
place exactly where an individual falls in a distribution. For example, a student with a
z score of 2.5 is 2.5 standard deviations above the mean on that test distribution and has
a very good score.
Example
The scores of students on three elementary school tests are presented.

Student        Geography  Spelling  Arithmetic
A              60         140       40
B              72         100       36
C              46         110       24
Etc…           …          …         …
Mean           60         100       22
St. Deviation  10         20        6

The standard scores (here rescaled to a mean of 50 and a standard deviation of 10,
i.e. 50 + 10z) will then be:

Student  Geography  Spelling  Arithmetic  Average
A        50         70        80          67
B        62         50        73          62
C        36         55        53          48
Etc…     …          …         …           …
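A small Python sketch (not part of the handbook) reproducing student A's row; the 50 + 10z rescaling is inferred from the table above:

def standard_score(x, mean, sd):
    return (x - mean) / sd

# (raw score, test mean, test standard deviation) for student A
tests = {"Geography": (60, 60, 10),
         "Spelling": (140, 100, 20),
         "Arithmetic": (40, 22, 6)}

for name, (x, mean, sd) in tests.items():
    z = standard_score(x, mean, sd)
    print(f"{name}: z = {z:.2f}, rescaled score = {50 + 10 * z:.0f}")
# Geography: z = 0.00 -> 50; Spelling: z = 2.00 -> 70; Arithmetic: z = 3.00 -> 80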
There are many probability distributions in statistics, developed to analyse
different types of problem. Several of them are covered here and the most important of
them is the Normal distribution, which we now turn to. It was discovered by the German
mathematician Gauss in the nineteenth century (hence it is also known as the Gaussian
distribution). Many random variables turn out to be normally distributed. Men’s (or
women’s) heights are normally distributed. IQ (the measure of intelligence) is also
normally distributed. Another example is of a machine producing (say) bolts with a
nominal length of 5 cm which will actually produce bolts of slightly varying length (these
differences would probably be extremely small) due to factors such as wear in the
machinery, slight variations in the pressure of the lubricant, etc. These would result in
bolts whose length varies, in accordance with the Normal distribution. This sort of process
is extremely common, with the result that the Normal distribution often occurs in
everyday situations. The Normal distribution tends to arise when a random variable is the
result of many independent, random influences added together, none of which dominates
the others. A man’s height is the result of many genetic influences, plus environmental
factors such as diet, etc. As a result, height is normally distributed. If one takes the height
of men and women together, the result is not a Normal distribution, however. This is
because there is one influence which dominates the others: gender. Men are, on average,
taller than women. Many variables familiar in economics are not Normal however –
incomes, for example (although the logarithm of income is approximately Normal). We
shall learn techniques to deal with such circumstances in due course.
Figure 8. Normal distribution
Having introduced the idea of the Normal distribution, what does it look like? It is
presented below in graphical and then mathematical forms. Unlike the Binomial, the
Normal distribution applies to continuous random variables such as height and a typical
Normal distribution is illustrated in Figure 8. Since the Normal distribution is a
continuous one it can be evaluated for all values of x, not just for integers. The figure
illustrates the main features of the distribution:
• It is unimodal, having a single, central peak. If this were men’s heights it would
illustrate the fact that most men are clustered around the average height, with a
few very tall and a few very short people.
• It is symmetric, the left and right halves being mirror images of each other.
• It is bell-shaped.
• It extends continuously over all the values of x from minus infinity to plus infinity,
although the value of f (x) becomes extremely small as these values are
approached (the pages of this book being of only finite width, this last
characteristic is not faithfully reproduced!). This also demonstrates that most
empirical distributions (such as men’s heights) can only be an approximation to
the theoretical ideal, although the approximation is close and good enough for
practical purposes.
In mathematical terms the formula for the Normal distribution is (x is the random
variable)
f(x) = (1/(σ√(2π))) ∙ exp(−(x − μ)²/(2σ²))
The mathematical formulation is not as formidable as it appears. μ and σ are the
parameters of the distribution; π is 3.1416 and e is 2.7183. If the formula is evaluated
using different values of x, the values of f(x) obtained will map out a Normal distribution.
Fortunately, as we shall see, we do not need to use the mathematical formula in most
practical problems. The Normal is a family of distributions differing from one another only
in the values of the parameters μ and σ. Several Normal distributions are drawn in Figures
9 - 11 for different values of the parameters. Whatever value of μ is chosen turns out to be
the centre of the distribution. As the distribution is symmetric, μ is its mean. The effect of
varying σ is to narrow (small σ) or widen (large σ) the distribution. σ turns out to be the
standard deviation of the distribution. The Normal is a two-parameter family of
distributions, and once the mean μ and the standard deviation σ (or equivalently the
variance σ²) are known, the whole of the distribution can be drawn.
Figure 9. Graph of normal distribution with mean 175 and standard deviation 9.6
Figure 10. Graph of normal distribution with mean 175 and standard deviation 15.3
Figure 11. Graph of normal distribution with mean 40 and standard deviation 25
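The density can be evaluated directly from the formula; a sketch (not part of the handbook) using the parameter pairs of Figures 9-11:

import math

def normal_pdf(x, mu, sigma):
    # f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

for mu, sigma in [(175, 9.6), (175, 15.3), (40, 25)]:
    # the peak of the curve sits at x = mu; larger sigma gives a lower, wider peak
    print(f"N({mu}, {sigma ** 2:g}): f({mu}) = {normal_pdf(mu, mu, sigma):.4f}")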
The shorthand notation for a Normal distribution is
x ~ N(μ, σ²)
meaning “the variable x is Normally distributed with mean μ and variance σ²“.
Use of the Normal distribution can be illustrated using a simple example. The height
of adult males is Normally distributed with mean height μ = 174 cm and standard
deviation σ = 9.6 cm. Let x represent the height of adult males; then x ~ N(174, 92.16),
and this is illustrated in Figure 12. Note that the shorthand notation contains the variance
rather than the standard deviation. What is the probability that a randomly selected man
is taller than 180 cm? If all men are equally likely to be selected, this is equivalent to
asking what proportion of men are over 180 cm in height. This is given by the area under
the Normal distribution to the right of x = 180, i.e. the shaded area in Figure 12. The
further from the mean of 174, the smaller the area in the tail of the distribution. One way
to find this area would be to make use of the mathematical formula above, but this
requires the use of sophisticated mathematics.
Figure 12. Men's height distribution
Since this is a frequently encountered problem, the answers have been set out in
the tables of the standard Normal distribution. We can simply look up the solution.
However, since there is an infinite number of Normal distributions (one for every
combination of μ and σ²), it would be an impossible task to tabulate
them all. The standard Normal distribution, which has a mean of zero and a variance of one,
is therefore used to represent all Normal distributions. Before the table can be consulted,
therefore, the data have to be transformed so that they accord with the standard Normal
distribution.
The required transformation is the z score, which was introduced above. This
measures the distance between the value of interest (180) and the mean, measured in
terms of standard deviations. Therefore we calculate
z = (x − μ)/σ = (180 − 174)/9.6 = 0.625
and z is a Normally distributed random variable with mean 0 and variance 1, i.e. z ~ N(0,
1), with density f(z) = (1/√(2π)) ∙ exp(−z²/2). This transformation shifts the original
distribution μ units to the left and then adjusts the dispersion by dividing through by σ,
resulting in a mean of 0 and variance 1. z is Normally distributed because x is Normally
distributed. The transformation above retains the Normal distribution shape, despite the
changes to mean and variance. If x followed some other distribution then z would not be
Normal either.
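A sketch (not part of the handbook) computing the proportion of men taller than 180 cm via the standard Normal CDF, Φ(z) = ½(1 + erf(z/√2)):

import math

def phi(z):
    # standard Normal cumulative distribution function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = (180 - 174) / 9.6      # = 0.625
print(1 - phi(z))          # ~0.266: about 27% of men are taller than 180 cm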
Descriptive statistics in SPSS
Click Analyze->Descriptive statistics->Descriptives
The screen should now look like this:
The box on the left shows variables for which you could produce descriptive
statistics such as means, etc. (Note that SPSS doesn't show you the first column, as that
contains data in the form of words, and you can't calculate means on this kind of data!
However, we have fooled it with the variable "sex". SPSS would allow us to work out
descriptive statistics for "sex", even though they would be meaningless - remember we
used "1" and "2" merely as labels for "male" and "female" respectively, so they are not
really "numbers" in an arithmetical sense at all). You can move any or all of the variable
names on the left, into the box on the right. Highlight a variable name by clicking on it, and
then click on the little arrow between the boxes. If variables are moved to the box on the
right, SPSS will calculate basic descriptive statistics on them. As a default option, SPSS will
work out the mean, standard deviation, and minimum and maximum values for each
variable placed in the right-most box. If you want other descriptive statistics, try clicking
on the "Options" button. For this example, we will content ourselves with the default
statistics, so move "height" and "weight" into the right-hand box, and then press "OK".
The screen will abruptly switch from the data window, to the "Output Window",
and the statistics will be displayed.
The first column in the table, "N", tells you how many valid observations there
were, basically a reassurance that SPSS has used as many participants' data in its
calculations as we thought it was going to, and that it hasn't dropped participants from the
analysis because they had missing data. (With a small data-set like ours, this isn't too
useful, but if you had zillions of entries, it is always possible you made a mistake in
entering the data and failed to notice).
There follow means, standard deviations and minimum and maximum values for
our two variables of height and weight.
Self-test questions
1. Given the data set 1200, 1400, 6000, 1900, 2100. Which central tendency characteristic
would you calculate?
2. How do you understand the 3 sigma rule? Give an example.
3. Give some examples when only the Mode can be calculated.
4. Give some examples when it is better to calculate the Mean and when the Median.
5. When can a histogram be used, and when a bar plot?
Exercises
1. Given data about car sales over 26 days. Make a frequency table and a graph.
3, 5, 7, 8, 2, 2, 2, 4, 5, 3, 3, 3, 5, 8, 7, 5, 4, 3, 2, 5, 6, 7, 8, 5, 4, 5
2. Given data about the number of customers in a shop. Make a frequency table of grouped
data.
316 357 385 345 345 301 398 376 318 351
234 356 436 395 368 347 230 341 345 387
361 341 345 343 324 385 464 243 451 379
435 371 279 348 375 359 326 332 351 381
3. The following are the scores of 60 students on a 100-item spelling test:
84 74 83 46 80 57 59 94 76 72
52 77 48 48 61 65 86 65 73 54
74 64 60 63 68 41 66 55 46 75
76 64 68 67 68 27 67 53 68 78
59 72 71 67 68 62 58 69 54 62
64 72 61 67 39 57 57 75 69 61
Make frequency table for grouped data
4. Calculate averages and standard deviation of the following test scores
13 11 10 9 8 6 4 12 11 10 9 7 6 4 12 11 10 8 7 5 4 11 10 9 8 6 4 4 0 9 8 7 5 4
5. A group of seniors majoring in psychology made the following scores on the verbal test
of the Graduate Record Examination. Calculate the averages and the standard deviation of
these scores.
750 640 600 570 540 490 450 400 700 630 590 570 540 490 440 380 680 630 590 560
530 480 440 360 660 610 580 560 500 470 430 350 650 600 570 540 490 470 420 320
Hypothesis testing
We use inferential statistics because it allows us to measure behaviour in samples
to learn more about the behaviour in populations that are often too large or inaccessible.
We use samples because we know how they are related to populations. For example,
suppose the average score on a standardized exam in a given population is 1,000. In our
example, if we select a random sample from this population with a mean of 1,000, then on
average, the value of a sample mean will equal 1,000. In behavioural research, we select
samples to learn more about populations of interest to us. In terms of the mean, we
measure a sample mean to learn more about the mean in a population. Therefore, we will
use the sample mean to describe the population mean. We begin by stating the value of a
population mean, and then we select a sample and measure the mean in that sample. On
average, the value of the sample mean will equal the population mean. The larger the
difference or discrepancy between the sample mean and population mean, the less likely it
is that we could have selected that sample mean, if the value of the population mean is
correct.
The method in which we select samples to learn more about characteristics in a
given population is called hypothesis testing. Hypothesis testing is really a systematic way
to test claims or ideas about a group or population. To illustrate, suppose we read an
article stating that children in Lithuania watch an average of 3 hours of TV per week.
To test whether this claim is true, we record the time (in hours) that a group of 20
Lithuanian children (the sample), among all children in Lithuania (the population),
watch TV. The mean we measure for these 20 children is a sample mean. We can then
compare the sample mean we select to the population mean stated in the article.
Hypothesis testing or significance testing is a method for testing a claim or hypothesis
about a parameter in a population, using data measured in a sample. In this method, we
test some hypothesis by determining the likelihood that a sample statistic could have been
selected, if the hypothesis regarding the population parameter were true [8].
Different symbols are used to denote sample statistics and population parameters.

Table 2. Notations for sample and population parameters

                     Sample   Population
Mean                 x̄        μ
Standard deviation   s        σ
Variance             s²       σ²
After finishing this chapter, students will be able to choose an appropriate hypothesis, formulate conclusions based on the results of statistical analysis, and make decisions based on those results.
Keywords: null hypothesis, alternative hypothesis, paired samples, test statistic, alpha level.
The concepts of hypothesis testing
The goal of hypothesis testing is to determine whether a claim about a population parameter, such as the mean, is likely to be true. In this section, we describe the four steps
of hypothesis testing:
• state the hypotheses,
• set the criteria for a decision,
• compute the test statistic,
• make a decision.
State the hypotheses. We begin by stating the value of a population mean in a null
hypothesis, which we presume is true. For the children watching TV example, we state the
null hypothesis that children in Lithuania watch an average of 3 hours of TV per week.
This is a starting point so that we can decide whether this is likely to be true, similar to the
presumption of innocence in a courtroom. When a defendant is on trial, the jury starts by
assuming that the defendant is innocent. The basis of the decision is to determine whether
this assumption is true. Likewise, in hypothesis testing, we start by assuming that the
hypothesis or claim we are testing is true. This is stated in the null hypothesis. The basis of
the decision is to determine whether this assumption is likely to be true [8].
The null hypothesis (H₀), stated as the null, is a statement about a population
parameter, such as the population mean, that is assumed to be true. The null hypothesis is
a starting point. We will test whether the value stated in the null hypothesis is likely to be
true.
Keep in mind that the only reason we are testing the null hypothesis is because we
think it is wrong. We state what we think is wrong about the null hypothesis in an
alternative hypothesis (H₁). For the children watching TV example, we may have reason
to believe that children watch more than (>) or less than (<) 3 hours of TV per week.
When we are uncertain of the direction, we can state that the value in the null hypothesis
is not equal to (≠) 3 hours. In a courtroom, since the defendant is assumed to be innocent
(this is the null hypothesis so to speak), the burden is on a prosecutor to conduct a trial to
show evidence that the defendant is not innocent. In a similar way, we assume the null
hypothesis is true, placing the burden on the researcher to conduct a study to show
evidence that the null hypothesis is unlikely to be true. Regardless, we always make a
decision about the null hypothesis (that it is likely or unlikely to be true). The alternative
hypothesis is needed for Step 2.
An alternative hypothesis (H₁) is a statement that directly contradicts a null hypothesis by stating that the actual value of a population parameter is less than, greater than, or not equal to the value stated in the null hypothesis. The alternative hypothesis states what we think is wrong about the null hypothesis, which is needed for Step 2.
A decision made in hypothesis testing centres on the null hypothesis.
Set the criteria for a decision. To set the criteria for a decision, we state the level of
significance for a test. This is similar to the criterion that jurors use in a criminal trial.
Jurors decide whether the evidence presented shows guilt beyond a reasonable doubt
(this is the criterion). Likewise, in hypothesis testing, we collect data to show that the null
hypothesis is not true, based on the likelihood of selecting a sample mean from a
population (the likelihood is the criterion). The likelihood or level of significance is
typically set at 5% in behavioural research studies. When the probability of obtaining a
sample mean is less than 5% if the null hypothesis were true, then we conclude that the
sample we selected is too unlikely and so we reject the null hypothesis [8].
Level of significance, or significance level, refers to a criterion of judgment upon
which a decision is made regarding the value stated in a null hypothesis. The criterion is
based on the probability of obtaining a statistic measured in a sample if the value stated in
the null hypothesis were true. In behavioural science, the criterion or level of significance
is typically set at 5%. When the probability of obtaining a sample mean is less than 5% if
the null hypothesis were true, then we reject the value stated in the null hypothesis.
The alternative hypothesis establishes where to place the level of significance.
Remember that we know that the sample mean will equal the population mean on average
if the null hypothesis is true. All other possible values of the sample mean are normally
distributed (central limit theorem). The empirical rule tells us that at least 95% of all
sample means fall within about 2 standard deviations (SD) of the population mean,
meaning that there is less than a 5% probability of obtaining a sample mean that is
beyond 2 SD from the population mean. For the children watching TV example, we can
look for the probability of obtaining a sample mean beyond 2 SD in the upper tail (greater
than 3), the lower tail (less than 3), or both tails (not equal to 3). Figure 13 shows that the
alternative hypothesis is used to determine which tail or tails to place the level of
significance for a hypothesis test [8].
Figure 13. Alternative hypothesis and alpha level
Compute the test statistic. Suppose we measure a sample mean equal to 4 hours per
week that children watch TV. To make a decision, we need to evaluate how likely this
sample outcome is, if the population mean stated by the null hypothesis (3 hours per
week) is true. We use a test statistic to determine this likelihood. Specifically, a test
statistic tells us how far, or how many standard deviations, a sample mean is from the
population mean. The larger the value of the test statistic, the further the distance, or
number of standard deviations, a sample mean is from the population mean stated in the
null hypothesis. The value of the test statistic is used to make a decision in Step 4.
The test statistic is a mathematical formula that allows researchers to determine
the likelihood of obtaining sample outcomes if the null hypothesis were true. The value of
the test statistic is used to make a decision regarding the null hypothesis [8].
Make a decision. We use the value of the test statistic to make a decision about the
null hypothesis. The decision is based on the probability of obtaining a sample mean,
given that the value stated in the null hypothesis is true. If the probability of obtaining a
sample mean is less than 5% when the null hypothesis is true, then the decision is to reject
the null hypothesis. If the probability of obtaining a sample mean is greater than 5% when
the null hypothesis is true, then the decision is to retain the null hypothesis. In sum, there
are two decisions a researcher can make:
• reject the null hypothesis. The sample mean is associated with a low probability of
occurrence when the null hypothesis is true;
• retain the null hypothesis. The sample mean is associated with a high probability
of occurrence when the null hypothesis is true.
The probability of obtaining a sample mean, given that the value stated in the null
hypothesis is true, is stated by the p value. The p value is a probability: It varies between 0
and 1 and can never be negative. In Step 2, we stated the criterion or probability of
obtaining a sample mean at which point we will decide to reject the value stated in the
null hypothesis, which is typically set at 5% in behavioural research.
To make a decision, we compare the p value to the criterion we set in Step 2. A p
value is the probability of obtaining a sample outcome, given that the value stated in the
null hypothesis is true. The p value for obtaining a sample outcome is compared to the
level of significance.
Significance, or statistical significance, describes a decision made concerning a value
stated in the null hypothesis. When the null hypothesis is rejected, we reach significance.
When the null hypothesis is retained, we fail to reach significance. When the p value is less
than 5% (p < .05), we reject the null hypothesis. We will refer to p < .05 as the criterion
for deciding to reject the null hypothesis, although note that when p = .05, the decision is
also to reject the null hypothesis. When the p value is greater than 5% (p > .05), we retain
the null hypothesis. The decision to reject or retain the null hypothesis is called
significance. When the p value is less than .05, we reach significance; the decision is to
reject the null hypothesis. When the p value is greater than .05, we fail to reach
significance; the decision is to retain the null hypothesis [8].
One more important concept in hypothesis testing is degrees of freedom (df). Degrees of freedom means freedom to vary. Suppose we have six scores, and the mean of these six scores must be 10, so the scores must sum to 60. The first five scores are free to vary, but the sixth score is then fixed so that the mean equals 10.
Examples
Given 10, 12, 18, 16, 4, the sixth score must equal 0.
Given 2, 8, 4, 6, 10, the sixth score must equal 30.
In each case we have 5 degrees of freedom.
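To make the constraint concrete, here is a minimal Python sketch (illustrative only, not part of the handbook's examples):

scores = [10, 12, 18, 16, 4]           # five freely chosen scores
target_mean = 10
n = 6
# the sixth score is forced by the constraint: sum of scores = n * mean
sixth = n * target_mean - sum(scores)
print(sixth)                            # 0, so only n - 1 = 5 scores were free to vary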
The next sections present the different types of hypotheses for one or two samples.
t Ratio or Student’s t
Usually, Student's t distribution is used for hypothesis testing. The t distributions were discovered by William S. Gosset in 1908. Gosset was a statistician employed by the Guinness brewing company, which had stipulated that he not publish under his own name. He therefore wrote under the pen name ``Student.'' These distributions arise in the following situation.
Suppose we have a simple random sample of size n drawn from a Normal population with mean μ and standard deviation σ. Let x̄ denote the sample mean and s the sample standard deviation. Then the quantity

t = (x̄ − μ) / (s / √n)     (1)

has a t distribution with n − 1 degrees of freedom.
Note that there is a different t distribution for each sample size; in other words, it is a class of distributions. When we speak of a specific t distribution, we have to specify the degrees of freedom.
The t density curves are symmetric and bell-shaped like the normal distribution and have their peak at 0. However, the spread is greater than that of the standard normal distribution. This is because in formula (1) the denominator contains s rather than σ. Since s is a random quantity varying from sample to sample, the variability in t is greater, resulting in a larger spread [3].
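A short Python sketch (assuming scipy is available) shows how the critical value of the t distribution depends on the degrees of freedom and approaches the normal value as df grows:

from scipy import stats

# two-sided 5% critical values t_crit such that P(|T| > t_crit) = 0.05
for df in (5, 10, 30, 1000):
    print(df, round(stats.t.ppf(1 - 0.025, df), 3))
# the output approaches the standard normal critical value 1.96 as df increases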
Hypothesis for one sample
The one-sample t-test is used when we want to know whether our sample comes from a particular population but we do not have full population information available to us. For instance, we may want to know if a particular sample of college students is similar to or different from college students in general. The one-sample t-test is used only for tests of the sample mean. Thus, our hypothesis tests whether the average of our sample (x̄) suggests that our students come from a population with a known mean (μ) or whether it comes from a different population [3].
The statistical hypotheses for one-sample t-tests take one of the following forms. In the hypotheses below, μ refers to the mean of the population from which the study sample was drawn; m is replaced by the actual (hypothesized) value of the population mean.

H₀: μ = m
H₁: μ ≠ m      – two-sided alternative

H₀: μ = m
H₁: μ > m (μ < m)      – one-sided alternatives
The criterion for a decision is the critical t value, which depends on df and the significance level α (usually 5%). Critical values (t_crit = t_α(n − 1)) can be found in Appendix 2.

Statistical test:

t = ((x̄ − m) / s) ∙ √(n − 1)
Decision making. The decision depends on the test statistic and the alternative hypothesis.

Table 3. Decision making rules

H₁        Reject H₀                   Retain H₀
μ ≠ m     |t| > t_{α/2}(n − 1)        |t| ≤ t_{α/2}(n − 1)
μ > m     t > t_α(n − 1)              t ≤ t_α(n − 1)
μ < m     t < −t_α(n − 1)             t ≥ −t_α(n − 1)
Example
Let us check the hypothesis that the mean number of customers in the shop is less than 400 per day.
Given data:

316 230 318 341 345 357 387 351 361 341
385 345 234 343 324 345 385 356 464 243
345 451 436 379 435 301 371 395 279 348
398 375 368 359 326 376 332 347 351 381

H₀: μ = 400
H₁: μ < 400

x̄ = 353,1; s = 50,78

t = ((353,1 − 400) / 50,78) ∙ √39 = −5,84

t_crit = t_{0,05}(39) = 2,021

As t = −5,84 < −t_{0,05}(39) = −2,021, the null hypothesis should be rejected.
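The same test can be reproduced in Python with scipy (a sketch assuming scipy >= 1.6; note that scipy computes the statistic with √n rather than √(n − 1), so the value may differ slightly from the hand calculation above):

from scipy import stats

customers = [316, 230, 318, 341, 345, 357, 387, 351, 361, 341,
             385, 345, 234, 343, 324, 345, 385, 356, 464, 243,
             345, 451, 436, 379, 435, 301, 371, 395, 279, 348,
             398, 375, 368, 359, 326, 376, 332, 347, 351, 381]

# one-sided test of H0: mu = 400 against H1: mu < 400
t_stat, p_value = stats.ttest_1samp(customers, popmean=400, alternative='less')
print(t_stat, p_value)   # a strongly negative t and a tiny p value: reject H0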
Testing the difference of two means
(independent and paired samples)
The following example illustrates the differences between independent samples (as
encountered so far) and dependent samples where slightly different methods of analysis are
required. The example also illustrates how a particular problem can often be analysed by a
variety of statistical methods. A company introduces a training programme to raise the
productivity of its clerical workers, which is measured by the number of invoices processed
per day. The company wants to know if the training programme is effective. How should it
evaluate the programme? There is a variety of ways of going about the task, as follows [3]:
• take two (random) samples of workers, one trained and one not trained, and
compare their productivity;
• take a sample of workers and compare their productivity before and after training;
• take two samples of workers, one to be trained and the other not. Compare the
improvement of the trained workers with any change in the other group’s
performance over the same time period.
We shall go through each method in turn, pointing out any possible difficulties.
Two independent samples
Assumptions:
• the two samples (x and y) are random samples independently drawn from normal distributions;
• the variances are the same (homogeneity of variance).
The statistical hypotheses:

H₀: μ_x = μ_y
H₁: μ_x ≠ μ_y      – two-sided alternative

H₀: μ_x = μ_y
H₁: μ_x > μ_y (μ_x < μ_y)      – one-sided alternatives
The criterion for a decision is the critical t value, which depends on df and the significance level α (usually 5%). Critical values (t_crit = t_α(n + m − 2)) can be found in Appendix 2. Here n is the size of the first sample and m the size of the second.

Statistical test:

t = (x̄ − ȳ) / √(s_x²(n − 1) + s_y²(m − 1)) ∙ √(m ∙ n ∙ (m + n − 2) / (m + n))
Decision making. The decision depends on the test statistic and the alternative hypothesis.

Table 4. Decision making rules

H₁            Reject H₀                        Retain H₀
μ_x ≠ μ_y     |t| > t_{α/2}(n + m − 2)         |t| ≤ t_{α/2}(n + m − 2)
μ_x > μ_y     t > t_α(n + m − 2)               t ≤ t_α(n + m − 2)
μ_x < μ_y     t < −t_α(n + m − 2)              t ≥ −t_α(n + m − 2)
Example
Given data about men's and women's salaries, check the hypothesis that men's average salary is higher than women's.

Men's salary (Lt):    2500 2600 3100 4200 1200 1900 1800 1500 2700 2300 2900 3500 3300 3600 3400 4800 2600 2900
Women's salary (Lt):  2400 2500 2300 3100 3200 3900 1200 1300 1500 1800 2100 3400 3800 3500 3900 2100

H₀: μ_x = μ_y
H₁: μ_x > μ_y

Here x is the men's sample and y the women's sample.

x̄ = 2822,22; s_x = 916,87
ȳ = 2625; s_y = 931,31

t = (2822,22 − 2625) / √(916,87² ∙ 17 + 931,31² ∙ 15) ∙ √(18 ∙ 16 ∙ (18 + 16 − 2) / (18 + 16)) = 0,873

t_crit = t_{0,05}(32) = 2,042

As t = 0,873 < t_{0,05}(32) = 2,042, the null hypothesis should be retained – the average salaries are equal.
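A sketch of the same comparison with scipy's pooled-variance two-sample test (assuming scipy >= 1.6):

from scipy import stats

men = [2500, 2600, 3100, 4200, 1200, 1900, 1800, 1500, 2700,
       2300, 2900, 3500, 3300, 3600, 3400, 4800, 2600, 2900]
women = [2400, 2500, 2300, 3100, 3200, 3900, 1200, 1300, 1500,
         1800, 2100, 3400, 3800, 3500, 3900, 2100]

# H0: mu_x = mu_y against H1: mu_x > mu_y, assuming equal variances
t_stat, p_value = stats.ttest_ind(men, women, equal_var=True, alternative='greater')
print(t_stat, p_value)   # p > 0.05, so H0 is retained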
Two dependent samples
The dependent t-test (also called the paired t-test or paired-samples t-test)
compares the means of two related groups to detect whether there are any statistically
significant differences between these means.
A dependent t-test is an example of a "within-subjects" or "repeated-measures"
statistical test. This indicates that the same subjects are tested more than once. Thus, in
the dependent t-test, "related groups" indicates that the same subjects are present in both
groups. The reason that it is possible to have the same subjects in each group is because
each subject has been measured on two occasions on the same dependent variable. For
example, you might have measured 10 individuals' (subjects') performance in a spelling
test (the dependent variable) before and after they underwent a new form of
computerised teaching method to improve spelling. You would like to know if the
computer training improved their spelling performance. Here, we can use a dependent t-
test because we have two related groups. The first related group consists of the subjects at
the beginning (prior to) the computerised spell training and the second related group
consists of the same subjects, but now at the end of the computerised training.
The statistical hypotheses:

H₀: μ_x = μ_y
H₁: μ_x ≠ μ_y      – two-sided alternative

H₀: μ_x = μ_y
H₁: μ_x > μ_y (μ_x < μ_y)      – one-sided alternatives
The criterion for a decision is the critical t value, which depends on df and the significance level α (usually 5%). Critical values (t_crit = t_α(n − 1)) can be found in Appendix 2. Here n is the number of pairs.
Statistical test:

t = d̄ / √(s_d² / n),

where d_i = x_i − y_i are the differences, d̄ their mean and s_d their standard deviation.
Decision making. The decision depends on the test statistic and the alternative hypothesis.

Table 5. Decision making rules

H₁            Reject H₀               Retain H₀
μ_x ≠ μ_y     |t| > t_{α/2}(n − 1)    |t| ≤ t_{α/2}(n − 1)
μ_x < μ_y     t < −t_α(n − 1)         t ≥ −t_α(n − 1)
μ_x > μ_y     t > t_α(n − 1)          t ≤ t_α(n − 1)
Example
Let's check whether sales before an advertisement are lower than after it. Here X – sales before the advertisement; Y – sales after the advertisement (see the table below).

H₀: μ_x = μ_y
H₁: μ_x < μ_y

X:          18  16  18  12  20  17  18  20  22  20  10   8  20
Y:          20  22  24  10  25  19  20  21  23  20  10  12  22
d = X − Y:  −2  −6  −6   2  −5  −2  −2  −1  −1   0   0  −4  −2

d̄ = −2,23; s_d = 2,42

t = −2,23 / (2,42 / √13) = −3,32

As t = −3,32 < −t_{0,05}(12) = −2,179, the null hypothesis should be rejected – the advertisement was effective.
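The same paired test in Python (a sketch assuming scipy >= 1.6):

from scipy import stats

before = [18, 16, 18, 12, 20, 17, 18, 20, 22, 20, 10, 8, 20]
after = [20, 22, 24, 10, 25, 19, 20, 21, 23, 20, 10, 12, 22]

# H0: mu_x = mu_y against H1: mu_x < mu_y on the paired differences
t_stat, p_value = stats.ttest_rel(before, after, alternative='less')
print(t_stat, p_value)   # t = -3.32, p < 0.05: reject H0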
Hypothesis testing in SPSS
One sample t test
Check the hypothesis that the average statistics test score is less than 70:

H₀: μ = 70
H₁: μ < 70
Click Analyze-> Compare Means -> One-Sample T Test:
You will be presented with the One-Sample T Test dialogue box, as shown below:
Transfer the dependent variable, Statistics test score, into the Test Variable(s) box. Enter the population mean you are comparing the sample against in the Test Value: box, by changing the current value of "0" to "70". You will end up with the following screen:
You will receive the following results:

One-Sample Statistics
                        N    Mean      Std. Deviation   Std. Error Mean
Statistics test score   30   73,4333   11,16856         2,03909

One-Sample Test (Test Value = 70)
                        t       df   Sig. (2-tailed)   Mean Difference   95% CI of the Difference (Lower; Upper)
Statistics test score   1,684   29   ,103              3,43333           (−,7371; 7,6037)
The table One-Sample Statistics shows general information about the sample. The table One-Sample Test presents the value of the test statistic (column t) and the p value (column Sig. (2-tailed)). Rule for decision making:

H₁        Reject H₀
μ ≠ m     p < α
μ > m     p < 2α
μ < m     p < 2α

Here p = 0,103 > 2 ∙ 0,05, so H₀ is retained.
Two independent samples
Check the hypothesis that men's (x) average salary is higher than women's (y):
H₀: μ_x = μ_y
H₁: μ_x > μ_y
Click Analyze -> Compare Means -> Independent-Samples T Test
You will be presented with the Independent-Samples T Test dialogue box, as shown below:
Transfer the dependent variable, Salarie, into the Test Variable(s) box, and transfer the independent variable, Gender, into the Grouping Variable box. You will end up with the following screen:
You then need to define the groups (gender). Click on the Define Groups button. You will be presented with the Define Groups dialogue box. Enter 1 (Men) into the Group 1: box and enter 2 (Women) into the Group 2: box.
Click Continue and then OK. You will receive the following results:
The first table, Group Statistics, presents general information about the samples (N – sample size, mean and standard deviation of men's and women's salaries respectively). Don't forget that this is only information about the sample, while all conclusions we make are about the whole population.

Group Statistics
         Gender   N    Mean        Std. Deviation   Std. Error Mean
Salarie  Male     11   1872,7273   682,77509        205,86443
         Female   16   2293,7500   635,05249        158,76312
The second table is the Independent Samples Test.
To find out which row to read from, look at the large column labelled Levene's Test for Equality of Variances. This test determines whether the two conditions have about the same or different amounts of variability between scores. You will see two smaller columns labelled F and Sig. (p value). Look in the Sig. (p value) column; it will have one value, which you use to determine which row to read from. In this example, the value in the Sig. (p value) column is 0,833.
A value greater than .05 means that the variability in your two conditions is about the same – the scores in one condition do not vary much more than the scores in the second condition. Put scientifically, the variability in the two conditions is not significantly different. This is a good thing. In this example, the Sig. value is greater than .05, so we read from the first row (Equal variances assumed).
There you can find the value of the test statistic (column t) and the p value (column Sig. (2-tailed)). Rule for decision making:
H₁            Reject H₀
μ_x ≠ μ_y     p < α
μ_x > μ_y     p < 2α
μ_x < μ_y     p < 2α
Independent Samples Test

Levene's Test for Equality of Variances: F = ,045; Sig. = ,833

t-test for Equality of Means:
                              t       df       Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI of the Difference (Lower; Upper)
Equal variances assumed       −1,64   25       ,113              −421,02273        256,37429               (−949,03546; 106,99001)
Equal variances not assumed   −1,61   20,579   ,121              −421,02273        259,97287               (−962,33962; 120,29416)
Here p = 0,113 > 2 ∙ 0,05, so H₀ is retained. That means that men's average salary is equal to women's average salary.
Two dependent (paired) samples
Check the hypothesis that sales before the advertisement are lower than after:

H₀: μ_x = μ_y
H₁: μ_x < μ_y
Click Analyze -> Compare Means -> Paired-Samples T Test
Select the pair Sales before and Sales after as shown below, and click OK.
You will receive the following results:
The first table, Paired Samples Statistics, presents general information about the samples (N – sample size, mean and standard deviation of sales before and after the advertisement respectively). Don't forget that this is only information about the sample, while all conclusions we make are about the whole population.

Paired Samples Statistics
                                      Mean      N    Std. Deviation   Std. Error Mean
Pair 1   Sales before advertisement   16,8462   13   4,27875          1,18671
         Sales after advertisement    19,0769   13   5,10656          1,41630
The second table is the Paired Samples Test.
In the Paired Differences part you will find descriptive statistics (mean and standard deviation) calculated for the differences of the data (Sales before advertisement − Sales after advertisement in this case).
In the last three columns you can find the value of the test statistic (column t), the degrees of freedom (df) and the p value (column Sig. (2-tailed)). Rule for decision making:

H₁            Reject H₀
μ_x ≠ μ_y     p < α
μ_x > μ_y     p < 2α
μ_x < μ_y     p < 2α

Paired Samples Test
                                        Paired Differences                                            t       df   Sig. (2-tailed)
                                        Mean    Std. Deviation   Std. Error Mean   95% CI (Lower; Upper)
Pair 1   Sales before advertisement −   −2,23   2,42             ,67133            (−3,69; −,76)      −3,32   12   ,006
         Sales after advertisement
Here p = 0,006 < 2 ∙ 0,05, so H₀ should be rejected. This means that sales before the advertisement were lower than after it; the advertisement was successful.
Self-test questions
1. In order to compare men's and women's average monthly income, we apply
a) the hypothesis testing for two independent samples;
b) the hypothesis testing for two dependent samples;
c) analysis of variance.
2. There are data on demand before and after an advertisement.

Before:  111  115  114  118
After:   116  120  121  119
In order to determine whether the advertising was effective, we apply:
a) the hypothesis testing for two independent samples;
b) the hypothesis testing for two dependent samples;
c) regression analysis.
3. In order to determine whether the average salary is the same, staff in Shopping Centre A, Shopping Centre B and Shopping Centre C were interviewed. We apply:
a) hypothesis testing,
b) post-hoc test,
c) analysis of variance (ANOVA).
Exercises
1. Check the hypothesis that the mean number of customers in the shop is less than 400 per day.

316 357 385 345 345 301 398 376 318 351
234 356 436 395 368 347 230 341 345 387
361 341 345 343 324 385 464 243 451 379
435 371 279 348 375 359 326 332 351 381

2. Given the following data about daily car sales, check the hypothesis that the number of cars sold per day is not more than 5.

5, 7, 8, 2, 2, 2, 4, 5, 3, 3, 3, 5, 8, 7, 5, 4, 3, 2, 5, 6, 7, 8, 5, 4, 5
3. Compare the results on a statistics test of students in Economics and in Computer Programming.

Economics students:
57 60 55 58 91 83 82 95 89 77 73 70 71 87 85 98 73 75 68 64 75 50 49 59 62

Computer programming students:
58 64 76 55 59 59 68 79 52 81 58 71 75 51 57 93 86 43 80 87

Check the hypothesis that Economics students are better in statistics than Computer Programming students.
4. Suppose a sample of 10 workers is tested before and after training, with the following data being relevant. On average, the workers process 25,5 invoices per day after training compared to 23,5 before. The question is whether this difference is significant, given that the sample size is quite small.

Worker   1    2    3    4    5    6    7    8    9    10
Before   21   24   23   25   28   17   24   22   24   27
After    23   27   24   28   29   21   24   25   26   28
5. Data on shampoo sales before and after an advertisement. Was the advertisement effective?

Before:  32 35 31 38 38 39 32 37 35 33 38 39 38 35 32
After:   38 36 34 40 38 37 35 38 39 33 40 39 40 36 36
6. A group of students’ marks on two tests, before and after instruction, were as follows:
Student   1   2   3   4   5   6   7   8   9   10  11  12
Before    14  16  11  8   20  19  6   11  13  16  9   13
After     15  18  15  11  19  18  9   12  16  16  12  13
Test the hypothesis that the instruction had no effect, using both the independent
sample and paired sample methods. Compare the two results.
Correlation analysis
Correlation is a measure of the relationship between two variables. The goal of this chapter is to explain when we need correlation analysis and how to apply it properly. After finishing this chapter students will be able to choose the right correlation coefficient, will know how to calculate it, and will be able to make decisions based on the results of correlation analysis.
Keywords: Pearson, Spearman, significance, size of relationship.
Examples of correlation analysis:
• high grades in English tend to be associated with high grades in foreign languages;
• both of these tend to be associated with high scores on an intelligence test;
• the correlation between the price at which products are sold and the amount available for sale.
Such relationships do not necessarily imply that one is the cause of the other. In some situations, we find that two variables are related because they are both related to, or caused by, a third variable.
The correlation coefficient tells us two things:
• direction of relationship,
• size of relationship.
When two variables are positively related, as one increases, the other also increases, e.g. intelligence scores and academic grades. Other variables are inversely related: as one increases, the other decreases, e.g. speed of an automobile and miles per litre of gasoline.
The symbol of the correlation coefficient is r. The size of r varies from −1 to 1. If r is negative (−1 ≤ r < 0), the variables are related inversely; if r is positive (0 < r ≤ 1), the variables are related positively; if r = 0, the variables are not (linearly) related.
The most popular correlation coefficient is Pearson's r. It shows the linear dependence between two quantitative variables:

r = (mean(xy) − x̄ ∙ ȳ) / (s_x ∙ s_y),

where mean(xy) denotes the mean of the products x ∙ y.
Example
Scores of 35 university students on two statistics tests:

1st test: 80 95 94 101 105 89 106 92 105 107 111 114 83 112 91 88 105 106 105 80 85 93 85 92 90 89 85 96 85 98 101 106 112 93 110
2nd test: 61 28 74 46 44 38 72 41 49 69 82 76 39 64 77 50 55 86 63 31 57 70 43 70 54 51 58 63 73 71 76 76 59 71 59

First we calculate the necessary descriptive statistics:

x̄ = 96,83; ȳ = 59,89; s_x = 10,11; s_y = 14,91; mean(xy) = 5861,4

Then Pearson's correlation coefficient is:

r = (5861,4 − 96,83 ∙ 59,89) / (10,11 ∙ 14,91) = 0,42
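Because the original pairing of the 35 scores cannot be fully recovered from the layout above, here is a sketch of the same formula applied to the temperature and ice-cream data used later in this handbook (assuming scipy is available; the handbook's formula uses population, i.e. n-denominator, standard deviations):

import math
from scipy import stats

temp = [25, 26, 24, 26, 24, 26, 22, 23, 27, 20, 20, 22, 28, 22, 26]
ice = [116, 120, 115, 119, 115, 118, 111, 113, 121, 108, 109, 110, 122, 113, 121]
n = len(temp)

mean_x = sum(temp) / n
mean_y = sum(ice) / n
mean_xy = sum(x * y for x, y in zip(temp, ice)) / n
# population (biased) standard deviations, matching the handbook's formula
s_x = math.sqrt(sum((x - mean_x) ** 2 for x in temp) / n)
s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ice) / n)

r = (mean_xy - mean_x * mean_y) / (s_x * s_y)
print(round(r, 2))                 # about 0.98
print(stats.pearsonr(temp, ice))   # the same r, with a p value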
On occasion it is inappropriate or impossible to calculate the correlation coefficient as described above, and an alternative approach is required. Sometimes the original data are unavailable but the ranks are. For example, schools may be ranked in terms of their exam results, but the actual pass rates are not available. Similarly, they may be ranked in terms of spending per pupil, with actual spending levels unavailable. Although the original data are missing, one can still test for an association between spending and exam success by calculating the correlation between the ranks. If extra spending improves exam performance, schools ranked higher on spending should also be ranked higher on exam success, leading to a positive correlation. Then, instead of Pearson's coefficient, the Spearman rank-order correlation coefficient is calculated:

r_s = 1 − 6 Σ d_i² / (n³ − n),

where the d_i are the differences between ranks.

Example
We have the scores on tests X and Y for seven individuals:
Test X   Test Y   Rank of X   Rank of Y   d     d²
18       24       1           4           −3    9
17       28       2           2           0     0
14       30       3           1           2     4
13       26       4           3           1     1
12       22       5           5           0     0
10       18       6           6           0     0
8        15       7           7           0     0
Sum                                             14

r_s = 1 − (6 ∙ 14) / (7³ − 7) = 0,75
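A sketch of the same calculation with scipy, which computes Spearman's coefficient as Pearson's r on the ranks:

from scipy import stats

x = [18, 17, 14, 13, 12, 10, 8]
y = [24, 28, 30, 26, 22, 18, 15]

rho, p_value = stats.spearmanr(x, y)
print(round(rho, 2))   # 0.75, matching the hand calculation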
Significance of Correlation Coefficient

These results come from a (small) sample, one of many that could have been collected. Once again we can ask: what can we infer about the population from the sample? Assuming the sample was drawn at random (which may not be justified), we can use the principles of hypothesis testing. As usual, there are two possibilities:
• the truth is that there is no correlation (in the population) and our sample exhibits such a large (absolute) value by chance;
• there really is a correlation between the two variables and the sample correctly reflects this.
Denoting the true but unknown population correlation coefficient by ρ (the Greek letter 'rho'), the possibilities can be expressed in terms of a hypothesis test:

H₀: ρ = 0
H₁: ρ ≠ 0

Test statistic:

t = r √(n − 2) / √(1 − r²),

which has a t distribution with n − 2 degrees of freedom. The five steps of the test procedure are therefore:
• Write down the null and alternative hypotheses (shown above).
• Choose the significance level of the test: 5% by convention.
• Look up the critical value of the test for n − 2 degrees of freedom.
• Calculate the test statistic using the equation above.
• Compare the test statistic with the critical value. If |t| > t_crit = t_{0,05}(n − 2), H₀ is rejected: there is a less than 5% chance of the sample evidence occurring if the null hypothesis were true.
Example
For the statistics tests example above:

t = 0,42 √(35 − 2) / √(1 − 0,42²) = 2,73,
t_{0,05}(33) = 2,042,

so |t| = 2,73 > t_{0,05}(33) = 2,042 and H₀ is rejected – the correlation is significant.
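A sketch of this significance test in Python (assuming scipy):

import math
from scipy import stats

r, n = 0.42, 35
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
t_crit = stats.t.ppf(1 - 0.025, n - 2)   # two-sided 5% critical value
print(round(t_stat, 2), round(t_crit, 3))
# |t| exceeds the critical value, so the correlation is significant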
Self-test questions
1. To determine the relationship between age and salary we calculate:
a) the chi-square coefficient
b) Pearson's correlation coefficient
c) Spearman's correlation coefficient
2. The correlation coefficient is equal to 0,785. Make conclusions about the size and direction of the relationship.
3. Which coefficient would you calculate to describe the relationship between sex and salary? Why?
4. How do you understand the significance of a correlation?
Exercises
1. Given data about temperature and the number of ice creams sold, is there a significant correlation between these variables?

Temp:       25  26  24  26  24  26  22  23  27  20  20  22  28  22  26
Ice cream:  116 120 115 119 115 118 111 113 121 108 109 110 122 113 121
2. The following table shows how a panel of nutrition experts and a panel of heads of
household ranked 15 breakfast foods on their palatability. Calculate r as a measure of
the consistency of the two rankings.
Nutrition experts:    3   7   11  9   1   4   10  8   5   13  12  2   15  6   14
Heads of household:   5   4   8   14  2   6   12  7   1   15  9   3   10  11  13
3. The following are the rankings which three judges gave to the work of ten corporate accounting departments' trainees. Is there a relationship between the three judges' opinions?

Judge A:  6   4   2   5   9   3   1   8   10  7
Judge B:  2   5   4   8   10  1   6   9   7   3
Judge C:  7   3   1   2   10  6   4   9   8   5
Regression analysis
Correlation and regression are techniques for investigating the statistical
relationship between two, or more, variables.
Regression analysis is a more sophisticated way of examining the relationship
between two (or more) variables than is correlation. The major differences between
correlation and regression are the following:
• Regression can investigate the relationships between two or more variables.
• A direction of causality is asserted, from the explanatory variable (or variables) to
the dependent variable.
• The influence of each explanatory variable upon the dependent variable is
measured.
• The significance of each explanatory variable can be ascertained.
Let's say we are interested in whether the number of ice creams sold depends on the temperature. In this example we assert that the direction of causality is from the temperature (X) to the ice cream sales (Y) and not vice versa. The temperature is therefore the explanatory variable (also referred to as the independent or exogenous variable) and the ice cream sales are the dependent variable (also called the explained or endogenous variable).
Regression analysis describes this causal relationship by fitting a straight line
drawn through the data, which best summarises them. It is sometimes called ‘the line of
best fit’ for this reason. This is illustrated in Figure 14 for the Ice Cream and Temperature
data. Note that (by convention) the explanatory variable is placed on the horizontal axis,
the explained on the vertical. This regression line is upward sloping (its derivation will be
explained shortly) for the same reason that the correlation coefficient is positive, i.e. high
values of Y are generally associated with high values of X and vice versa. Since the
regression line summarises knowledge of the relationship between X and Y, it can be used
to predict the value of Y given any particular value of X [9].
Figure 14. Regression line (ice cream sales plotted against temperature).
After finishing this chapter students will know the difference between linear and multiple regression, will be able to fit a regression model, and will be able to formulate conclusions and make decisions based on the results of statistical analysis.
Keywords: linear regression, determination, multiple regression, standardized
variables.
Linear regression
The simplest and most popular model is the linear regression model. A simple linear regression model is a regression model where the dependent variable is continuous, explained by a single exogenous variable, and linear in the parameters.
Linear regression model:

y = a + bx,

where y is the dependent variable; x is the independent (exogenous) variable; a and b are fixed coefficients to be estimated; a measures the intercept of the regression.
Coefficients a and b can be found by using the formulas:

b = r ∙ s_y / s_x;   a = ȳ − b ∙ x̄.
Multiple regression
Multiple regression is a statistical technique that allows us to predict someone’s
score on one variable on the basis of their scores on several other variables. An example
might help. Suppose we were interested in predicting how much an individual enjoys their
job. Variables such as salary, extent of academic qualifications, age, sex, number of years in
full-time employment and socioeconomic status might all contribute towards job
satisfaction. If we collected data on all of these variables, perhaps by surveying a few
hundred members of the public, we would be able to see how many and which of these
variables gave rise to the most accurate prediction of job satisfaction. We might find that
job satisfaction is most accurately predicted by type of occupation, salary and years in full-time employment, with the other variables not helping us to predict job satisfaction [9].
When using multiple regression in psychology, many researchers use the term
“independent variables” to identify those variables that they think will influence some
other “dependent variable”. We prefer to use the term “predictor variables” for those
variables that may be useful in predicting the scores on another variable that we call the
“criterion variable”. Thus, in our example above, type of occupation, salary and years in
full-time employment would emerge as significant predictor variables, which allow us to
estimate the criterion variable – how satisfied someone is likely to be with their job. As we
have pointed out before, human behaviour is inherently noisy and therefore it is not
possible to produce totally accurate predictions, but multiple regression allows us to
identify a set of predictor variables which together provide a useful estimate of a
participant’s likely score on a criterion variable [9].
When should I use multiple regression?
1. You can use this statistical technique when exploring linear relationships
between the predictor and criterion variables – that is, when the relationship follows a
straight line. (To examine non-linear relationships, special techniques can be used.)
2. The criterion variable that you are seeking to predict should be measured on a
continuous scale (such as interval or ratio scale). There is a separate regression method
called logistic regression that can be used for dichotomous dependent variables (not
covered here).
3. The predictor variables that you select should be measured on a ratio or interval scale.
4. Multiple regression requires a large number of observations. The number of
cases (participants) must substantially exceed the number of predictor variables you are
using in your regression. The absolute minimum is that you have five times as many
participants as predictor variables. A more acceptable ratio is 10:1, but some people argue
that this should be as high as 40:1 for some statistical selection methods.
Multiple regression model:

y = a + b₁x₁ + b₂x₂ + ⋯ + b_k x_k

In a multiple regression model the predictor variables can be measured in different scales, e.g. we may fit a model which describes how salary can be explained by experience (measured in years) and IQ. In such a case it is quite difficult to interpret the results, and impossible to decide which explanatory (predictor) variable has the stronger influence on the dependent variable. As an alternative, the standardized values β_i can be used. Standardized values are dimensionless; a higher value shows a bigger influence.
Adequacy of regression model
Before using a regression model for prediction it needs to be checked. Usually the determination coefficient (R²) is calculated and hypotheses about the coefficients b_i are checked.
The R² coefficient describes how well the variability of the data is explained by the fitted regression model. It ranges from 0 to 1; a higher coefficient indicates a better fit. In the case of linear regression, R² is the square of Pearson's correlation.
If all fitted coefficients b_i are equal to 0, the regression model does not fit. First of all we check the hypothesis:

H₀: all b_i = 0
H₁: at least one b_i ≠ 0

If the null hypothesis is rejected, then for each i = 1, …, k the hypothesis

H₀: b_i = 0
H₁: b_i ≠ 0

is checked. All statistically insignificant variables should be removed from the model and a new model fitted.
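A minimal sketch of fitting a multiple regression by least squares with numpy (the data and the helper function here are hypothetical, invented only to show the call; SPSS or a package such as statsmodels would add the significance tests discussed above):

import numpy as np

def fit_multiple_regression(X, y):
    """Return intercept a and coefficient vector b for y = a + X @ b."""
    X1 = np.column_stack([np.ones(len(X)), X])   # prepend a column of ones for the intercept
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef[0], coef[1:]

# hypothetical example: two predictors, one criterion
X = np.array([[3, 2], [2, 1], [4, 3], [3, 2], [5, 3], [2, 1]], dtype=float)
y = np.array([250, 180, 320, 260, 400, 190], dtype=float)
a, b = fit_multiple_regression(X, y)
print(a, b)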
Example
Using the data about temperature and ice cream sales, a linear regression model will be fitted.

Temperature:  25  26  24  26  24  26  22  23  27  20  20  22  28  22  26
Ice cream:    116 120 115 119 115 118 111 113 121 108 109 110 122 113 121

The corresponding chart is shown in Figure 14. Temperature is x, as it is the explanatory (independent) variable, and ice cream is y. The corresponding descriptive statistics are:

x̄ = 24,07; ȳ = 115,40; s_x = 2,41; s_y = 4,50; mean(xy) = 2787,87

Pearson's correlation:

r = (2787,87 − 24,07 ∙ 115,40) / (2,41 ∙ 4,50) = 0,98

Linear regression coefficients:

b = 0,98 ∙ 4,50 / 2,41 = 1,82
a = 115,40 − 1,82 ∙ 24,07 = 71,49

Then the fitted linear regression model is:

y = 71,49 + 1,82x

with R² = 0,96.
The model shows that when the temperature increases by one degree, ice cream sales increase by 1,82 portions. Model adequacy is 96%. The regression line is shown in Figure 14.
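A Python sketch reproducing these coefficient formulas on the same data (assuming numpy; np.std uses the n-denominator by default, which cancels in the ratio s_y / s_x):

import numpy as np

temp = np.array([25, 26, 24, 26, 24, 26, 22, 23, 27, 20, 20, 22, 28, 22, 26])
ice = np.array([116, 120, 115, 119, 115, 118, 111, 113, 121, 108, 109, 110, 122, 113, 121])

r = np.corrcoef(temp, ice)[0, 1]
b = r * ice.std() / temp.std()      # slope: b = r * s_y / s_x
a = ice.mean() - b * temp.mean()    # intercept: a = y-bar - b * x-bar
print(round(a, 2), round(b, 2))     # about 71.5 and 1.82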
The hypothesis about coefficient b is easier to check by using SPSS.
Correlation and regression analysis in SPSS
The following data show the number of bedrooms, the number of baths and prices
at which one-family houses sold recently.
To calculate a correlation coefficient click Analyze -> Correlate -> Bivariate.
Select at least two variables and choose the correlation coefficient (Pearson or Spearman). In this case the variables are quantitative, therefore the Pearson coefficient was selected.
After clicking OK, the correlation table will appear:
Correlations
                                          Number of bedrooms   Price (Lt)
Number of bedrooms   Pearson Correlation  1                    ,971**
                     Sig. (2-tailed)                           ,000
                     N                    31                   31
Price (Lt)           Pearson Correlation  ,971**               1
                     Sig. (2-tailed)      ,000
                     N                    31                   31
**. Correlation is significant at the 0.01 level (2-tailed).
Pearson Correlation is the correlation coefficient (i.e. r); the correlation is significant if Sig. (2-tailed) (the p value) is less than 0,05. Significant correlations are also flagged with stars (**).
The correlation between Number of bedrooms and Price is significant (Sig. = 0,000 < 0,05). The correlation is positive and very strong (r = 0,971).
A correlation matrix can also be created. Click Analyze -> Correlate -> Bivariate and select all the variables which should appear in the correlation matrix.
After clicking OK, the correlation matrix will appear:

Correlations
                                          Number of bedrooms   Price (Lt)   Number of bath
Number of bedrooms   Pearson Correlation  1                    ,971**       ,772**
                     Sig. (2-tailed)                           ,000         ,000
                     N                    31                   31           31
Price (Lt)           Pearson Correlation  ,971**               1            ,742**
                     Sig. (2-tailed)      ,000                              ,000
                     N                    31                   31           31
Number of bath       Pearson Correlation  ,772**               ,742**       1
                     Sig. (2-tailed)      ,000                 ,000
                     N                    31                   31           31
**. Correlation is significant at the 0.01 level (2-tailed).
In the correlation matrix you can find the correlations between all possible pairs; usually the upper triangle is analysed (in the lower part you would find the same coefficients). All correlations are significant, strong and positive.
Linear regression
Click Analyze -> Regression -> Curve Estimation.
Select the dependent variable (y) and the independent (explanatory) variable (x). We will fit a model which shows how the price of a house depends on the number of bedrooms in the house.
After clicking OK the results will appear:

Model Summary and Parameter Estimates
Dependent Variable: Price (Lt)

Equation   Model Summary                            Parameter Estimates
           R Square   F         df1   df2   Sig.    Constant    b1
Linear     ,942       471,100   1     29    ,000    11161,290   79666,667

The independent variable is Number of bedrooms.

R Square – the R² coefficient. In the analysed case it is equal to 0,942, thus the model fits well.
Sig. (p value) – if less than 0,05, the model is significant (coefficient b is not equal to 0).
Parameter estimates: Constant – a; b1 – b.
Therefore the regression model is: y = 11161,29 + 79666,67x.
A regression chart with the fitted regression line will also be presented.
Multiple regression
Click Analyze -> Regression -> Linear.
A dialogue box will appear.
Select the dependent variable (y) and all independent variables (x's). In this case we fit a multiple regression model which shows how the house price (y) depends on the number of bedrooms (x₁) and the number of baths (x₂).
After clicking OK you will get results.
Model Summary
R – the multiple correlation coefficient, which can be interpreted in the same way as Pearson's correlation.
R Square – the R² coefficient; in this case it is 0,942.

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       ,971a   ,942       ,938                20437,352
a. Predictors: (Constant), Number of bath, Number of bedrooms
ANOVA
Here the hypothesis

H₀: all b_i = 0
H₁: at least one b_i ≠ 0

is checked. If Sig. (p value) is less than 0,05, the null hypothesis should be rejected, i.e. at least one b_i ≠ 0. If the null hypothesis were retained, there would be no sense in analysing the model (it would not be statistically significant). Here Sig. = 0,000 < 0,05, so the null hypothesis is rejected.

ANOVAa
Model 1       Sum of Squares       df   Mean Square       F         Sig.
Regression    190429003141,219     2    95214501570,609   227,957   ,000b
Residual      11695190407,168      28   417685371,685
Total         202124193548,387     30
a. Dependent Variable: Price (Lt)
b. Predictors: (Constant), Number of bath, Number of bedrooms
Coefficients
The Unstandardized Coefficients (column B) show the multiple regression coefficients:
(Constant) – a
Number of bedrooms – b₁
Number of bath – b₂
The multiple regression model is:

y = 11624,72 + 80789,96x₁ − 1773,62x₂

In the column Standardized Coefficients (Beta) the β_i are shown: β₁ = 0,984 and β₂ = −0,018; as β₁ is much higher, Number of bedrooms has the stronger influence on the Price.
But before making a decision the column Sig. needs to be analysed. Here the hypotheses

H₀: b_i = 0
H₁: b_i ≠ 0, for i = 1, 2

are checked. The p value (Sig.) next to Number of bedrooms is 0,000 < 0,05, which shows that coefficient b₁ is significant, but the p value (Sig.) next to Number of bath is 0,806 > 0,05, which shows that coefficient b₂ is insignificant. This means the model needs to be refitted, since only Number of bedrooms has a significant influence on the Price. In this case we would get the linear regression model analysed before.

Coefficientsa
Model 1              Unstandardized Coefficients   Standardized Coefficients   t        Sig.
                     B            Std. Error       Beta
(Constant)           11624,720    11927,863                                    ,975     ,338
Number of bedrooms   80789,959    5869,749         ,984                        13,764   ,000
Number of bath       −1773,620    7154,410         −,018                       −,248    ,806
a. Dependent Variable: Price (Lt)
Self-test questions
1. Regression analysis can be applied to ... variables:
a) rank,
b) quantitative,
c) nominal.
2. Given the equation y = 12,3 + 1,2x, where y – income and x – advertising expenditure, this is:
a) a linear regression equation,
b) a linear trend equation,
c) a multiple regression equation.
3. Which model, linear or multiple regression, is more accurate? Why?
4. Given the equation Income = 3,5 + 0,7 ∙ Advertisement, how would you interpret the number 0,7?
Exercises
1. The following sample data show the demand for a product (in thousands of units) and
its price (in Litas) charged in 12 different market areas.
X 18 16 18 12 20 17 18 20 22 20 10 8 20
Y 20 22 24 10 25 19 20 21 23 20 10 12 22
2. The following sample data show the average annual yield of wheat (in bushels per acre)
in a given country and the annual rainfall (in centimeters) measured from September to
August:
Rainfall (x):         8,8   10,3  15,9  13,1  12,9  7,2   11,3  18,6  14,3  9,1   8,9   11,4  12,5
Yield of wheat (y):   39,6  42,5  69,3  52,4  60,5  26,7  50,2  78,6  59,1  41,3  40,5  43,5  48,8
3. Suppose that we are given the following sample data to study the relationship between the grades students get in a certain examination, their IQs, and the number of hours they studied for the test.

Hours:  8    6    5    9    11   6    13   7    10   5    5    4    18   11   15   13   2    15   8    14
IQ:     98   105  99   108  118  99   94   110  109  112  116  107  97   105  100  101  99   103  114  109
Grade:  56   73   44   71   79   85   72   47   70   58   54   63   94   74   85   87   33   92   65   81
4. The following sample data were collected to determine the relationship between two processing variables and the current gain of a certain kind of transistor.

Diffusion time:     1,5  2,4  2,5  2    0,5  0,7  1,2  1,6  2,6  1,8  0,3  2,1  0,9  1,7
Sheet resistance:   66   111  88   78   69   66   141  123  93   128  105  99   74   71
Current gain:       5,3  8,1  7,8  7,2  7,4  6,5  9,8  12,6 10,8 13,1 9,1  10,4 7,7  7,8
Time series analysis
A time series is a collection of observations of well-defined data items obtained
through repeated measurements over time. For example, measuring the value of retail
sales each month of the year would comprise a time series. This is because sales revenue
is well defined, and consistently measured at equally spaced intervals. Data collected
irregularly or only once are not time series.
After finishing this chapter students will be able to build a system of statistical indicators taking into account the nature and specifics of the socio-economic phenomenon and the aims of the research, and will be able to create a time series model and use it for analysis and forecasting.
Keywords: trend, seasonality, irregular component, moving average, exponential
smoothing.
Components of time series
An observed time series can be decomposed into three components: the trend (long
term direction), the seasonal (systematic, calendar related movements) and the irregular
(unsystematic, short term fluctuations).
The trend is the smooth or regular underlying movement of a series over a fairly long period of time.
Seasonal variation is the movement in a time series which recurs year after year in the same months (or the same quarters) of the year with more or less the same intensity.
Irregular variation comprises fluctuations from the trend, seasonal or cyclical components, caused by special events.
Stock and flow series
Time series can be classified into two different types: stock and flow. A stock series
is a measure of certain attributes at a point in time and can be thought of as “stocktakes”.
For example, the Monthly Labour Force Survey is a stock measure because it takes stock of
whether a person was employed in the reference week.
Flow series are series which are a measure of activity over a given period. For
example, surveys of Retail Trade activity. Manufacturing is also a flow measure because a
certain amount is produced each day, and then these amounts are summed to give a total
value for production for a given reporting period.
The main difference between a stock and a flow series is that flow series can
contain effects related to the calendar (trading day effects). Both types of series can still
be seasonally adjusted using the same seasonal adjustment process.
Seasonal effects
A seasonal effect is a systematic and calendar related effect. Some examples include
the sharp escalation in most Retail series which occurs around December in response to
the Christmas period, or an increase in water consumption in summer due to warmer
weather. Other seasonal effects include trading day effects (the number of working or
trading days in a given month differs from year to year which will impact upon the level of
activity in that month) and moving holidays (the timing of holidays such as Easter varies,
so the effects of the holiday will be experienced in different periods each year).
Seasonal adjustment
Seasonal adjustment is the process of estimating and then removing from a time
series influences that are systematic and calendar related. Observed data needs to be
seasonally adjusted as seasonal effects can conceal both the true underlying movement in
the series, as well as certain non-seasonal characteristics which may be of interest to
analysts [1].
Smoothing methods
In cases in which the time series is fairly stable and has no significant trend,
seasonal, or cyclical effects, one can use smoothing methods to average out the irregular
component of the time series [1].
Common smoothing methods are:
• Moving Averages
• Centered Moving Averages
• Weighted Moving Averages
• Exponential Smoothing
The moving averages method consists of computing an average of the most recent n data values of the series and using this average for forecasting the value of the time series for the next period. Moving averages are useful if one can assume that the item to be forecast will stay fairly steady over time. A moving average is a series of arithmetic means; used only for smoothing, it provides an overall impression of the data over time:

Moving Average = Σ(most recent n data values) / n
Example

Month       Actual Sales   Three-Month Moving Average
January     10             –
February    12             –
March       16             –
April       13             (10 + 12 + 16)/3 = 12,67
May         17             (12 + 16 + 13)/3 = 13,67
June        19             (16 + 13 + 17)/3 = 15,33
July        15             16,33
August      20             17,00
September   22             18,00
October     19             19,00
November    21             20,33
December    19             20,67
(next January)             (19 + 21 + 19)/3 = 19,67

(The accompanying chart plots the actual sales against the moving-average forecast.)
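A short Python sketch of the three-month moving-average forecast (plain lists, no extra libraries):

sales = [10, 12, 16, 13, 17, 19, 15, 20, 22, 19, 21, 19]
n = 3

# the forecast for each period is the mean of the n most recent observations
forecasts = [sum(sales[i - n:i]) / n for i in range(n, len(sales) + 1)]
print([round(f, 2) for f in forecasts])   # 12.67, 13.67, ..., 19.67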
The centered moving average method consists of computing an average of n periods' data and associating it with the midpoint of those periods. For example, the average for periods 5, 6 and 7 is associated with period 6. This methodology is useful in the process of computing seasonal indexes [1].

May    17
June   19   (17 + 19 + 15)/3 = 17
July   15
The weighted moving averages method is used when a trend is present; since older data are usually less important, the more recent observations are typically given more weight than older observations. Weights are based on intuition, often lie between 0 and 1, and sum to 1,0:

WMA = Σ(weight for period i ∙ value in period i) / Σ(weights)
Example

Month       Actual Sales   Weighted Moving Average (weights 1, 2, 3)
January     10             –
February    12             –
March       16             –
April       13             (1∙10 + 2∙12 + 3∙16)/6 = 13,67
May         17             (1∙12 + 2∙16 + 3∙13)/6 = 13,83
June        19             15,50
July        15             17,33
August      20             16,67
September   22             18,17
October     19             20,17
November    21             20,17
December    19             (1∙22 + 2∙19 + 3∙21)/6 = 20,50

(The accompanying chart plots the actual sales against the weighted-moving-average forecast.)
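The weighted version differs only in the averaging step; a sketch using the weights 1, 2, 3 from the example above:

sales = [10, 12, 16, 13, 17, 19, 15, 20, 22, 19, 21, 19]
weights = [1, 2, 3]   # the most recent observation gets the largest weight

wma = [sum(w * s for w, s in zip(weights, sales[i - 3:i])) / sum(weights)
       for i in range(3, len(sales))]
print([round(f, 2) for f in wma])   # 13.67, 13.83, ..., 20.5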
Disadvantages and properties of moving average methods:
• increasing n makes the forecast less sensitive to changes;
• they do not forecast trends well;
• they require sufficient historical data;
• moving averages and weighted moving averages are effective in smoothing out sudden fluctuations in the demand pattern in order to provide stable estimates;
• they require maintaining extensive records of past data;
• exponential smoothing, by contrast, requires little record keeping of past data.
Exponential smoothing
Exponential smoothing is probably the most widely used class of procedures for smoothing discrete time series in order to forecast the immediate future. This popularity can be attributed to its simplicity, its computational efficiency, the ease of adjusting its responsiveness to changes in the process being forecast, and its reasonable accuracy.
The idea of exponential smoothing is to smooth the original series the way the moving average does and to use the smoothed series in forecasting future values of the variable of interest. In exponential smoothing, however, we want to allow the more recent values of the series to have greater influence on the forecast of future values than the more distant observations.
Exponential smoothing is a simple and pragmatic approach to forecasting, whereby the forecast is constructed from an exponentially weighted average of past observations. The largest weight is given to the present observation, less weight to the immediately preceding observation, even less weight to the observation before that, and so on (exponential decay of the influence of past data) [1].
Exponential smoothing model:

F_{t+1} = α ∙ y_t + (1 − α) ∙ F_t,

where:
F_{t+1} – forecast value for period t + 1;
y_t – actual value for period t;
F_t – forecast value for period t;
α – alpha (the smoothing constant).
Example (α = 0,2; the first forecast is initialised with the first actual value)

Quarter (t)   Sales y_t   Forecast F_t
1             23          -
2             40          23
3             25          (0,2*40+0,8*23) = 26,4
4             27          (0,2*25+0,8*26,4) = 26,12
5             32          26,30
6             48          27,44
7             33          31,55
8             37          31,84
9             37          32,87
10            50          (0,2*37+0,8*32,87) = 33,70
[Chart: Exponential Smoothing - actual and forecast values, quarters 1-10]
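The recursion is short enough to sketch in Python (the function name is ours; α = 0,2 as in the example, and the first forecast is initialised with the first actual value, as in the table):

def exponential_smoothing(y, alpha=0.2):
    # F[t+1] = alpha*y[t] + (1 - alpha)*F[t], starting from F[2] = y[1]
    forecasts = [y[0]]
    for actual in y[1:-1]:
        forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])
    return forecasts  # forecasts for periods 2..len(y)

sales = [23, 40, 25, 27, 32, 48, 33, 37, 37, 50]
print([round(f, 2) for f in exponential_smoothing(sales)])
# [23, 26.4, 26.12, 26.3, 27.44, 31.55, 31.84, 32.87, 33.7]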
Measures of Forecast Accuracy
Mean Squared Error (MSE). The average of the squared forecast errors for the historical data is calculated, and the forecasting method or parameter(s) that minimize this mean squared error are then selected.
Mean Absolute Deviation (MAD). The mean of the absolute values of all forecast errors is calculated, and the forecasting method or parameter(s) that minimize this measure are selected. The mean absolute deviation is less sensitive to individual large forecast errors than the mean squared error.
You may choose either of the above criteria for evaluating the accuracy of a method (or parameter).
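For concreteness, the two criteria can be written as follows (a standard formulation consistent with the worked example below, where n is the number of forecast errors, y_t the actual and F_t the forecast value):

\[
MSE = \frac{1}{n}\sum_{t}(y_t - F_t)^2 , \qquad
MAD = \frac{1}{n}\sum_{t}|y_t - F_t| .
\]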
Example
MSE for the Exponential Smoothing method

Quarter (t)   Sales y_t   Forecast F_t   (y_t − F_t)²
1             23          -              -
2             40          23             289
3             25          26,4           1,96
4             27          26,12          0,77
5             32          26,30          32,54
6             48          27,44          422,85
7             33          31,55          2,10
8             37          31,84          26,63
9             37          32,87          17,04
10            50          33,70          265,78
Sum                                      1058,67

MSE = 1058,67/9 = 117,63
The forecast accuracy in the original units of the data is shown by the root of the MSE (RMSE):

RMSE = √117,63 ≈ 10,85.
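A Python sketch of the accuracy computation (variable names are ours). The squared errors in the table above were computed from the unrounded forecasts, so the script rebuilds them at full precision rather than reusing the rounded values:

from math import sqrt

alpha = 0.2
sales = [23, 40, 25, 27, 32, 48, 33, 37, 37, 50]

forecasts = [sales[0]]          # full-precision forecasts for periods 2..10
for actual in sales[1:-1]:
    forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])

errors = [a - f for a, f in zip(sales[1:], forecasts)]
mse = sum(e ** 2 for e in errors) / len(errors)
mad = sum(abs(e) for e in errors) / len(errors)
print(round(mse, 2), round(sqrt(mse), 2), round(mad, 2))
# 117.63 10.85 8.07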
Linear trend

The linear trend is the simplest time series model; its equation is

\[
y = m + bt .
\]

The coefficients m and b can be found by solving the system of linear equations:

\[
\begin{cases}
nm + b\sum t = \sum y \\
m\sum t + b\sum t^{2} = \sum ty
\end{cases}
\]
Example
For the years 2000 to 2009 Corporation X reported the following annual revenue.

Year    t    Revenue y (mln Lt)   t²    y*t
2000    1    10,3                 1     10,3
2001    2    13,5                 4     27
2002    3    13,7                 9     41,1
2003    4    14,2                 16    56,8
2004    5    15                   25    75
2005    6    15,1                 36    90,6
2006    7    16,3                 49    114,1
2007    8    17,5                 64    140
2008    9    19                   81    171
2009    10   18,3                 100   183
Total   55   152,9                385   908,9

[Chart: revenue against time - the series looks similar to a linear trend]

The system of linear equations is

\[
\begin{cases}
10m + 55b = 152{,}9 \\
55m + 385b = 908{,}9
\end{cases}
\]

from which m = 10,76 and b = 0,82. The trend equation is therefore y = 10,76 + 0,82t.
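The same coefficients can be obtained programmatically. A minimal Python sketch solving the 2×2 system by Cramer's rule (variable names are ours):

revenue = [10.3, 13.5, 13.7, 14.2, 15, 15.1, 16.3, 17.5, 19, 18.3]
t = list(range(1, len(revenue) + 1))

n = len(revenue)
sum_t = sum(t)                                    # 55
sum_t2 = sum(v * v for v in t)                    # 385
sum_y = sum(revenue)                              # 152.9
sum_ty = sum(v * y for v, y in zip(t, revenue))   # 908.9

# n*m + sum_t*b = sum_y ;  sum_t*m + sum_t2*b = sum_ty
det = n * sum_t2 - sum_t * sum_t
m = (sum_y * sum_t2 - sum_t * sum_ty) / det
b = (n * sum_ty - sum_t * sum_y) / det
print(round(m, 2), round(b, 2))  # 10.76 0.82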
Trend in SPSS
We are given data about Production (in units) from 1997 to 2011. To fit a linear trend click Analyze->Regression->Curve Estimation. In the Dependent list select Production; as Independent select "Time". In the Models list choose "Linear". After clicking OK you will receive output similar to linear regression results, and the interpretation is also very similar. In the Model Summary and Parameter Estimates table you can find R², which here is very low (0,078), and Sig., which is 0,312 > 0,05, i.e. the linear model is not adequate. The chart shows the observed data and the linear model (a straight line), which does not fit at all.
Model Summary and Parameter Estimates
Dependent Variable: Production

            Model Summary                          Parameter Estimates
Equation    R Square   F       df1   df2   Sig.    Constant   b1
Linear      ,078       1,107   1     13    ,312    238,238    2,129
But the chart shows that the dependence is not linear; it appears to be quadratic. A quadratic trend model can also be fitted in SPSS. Click Analyze->Regression->Curve Estimation. In the Dependent list select Production; as Independent select "Time". In Models leave "Linear" and also select "Quadratic" (so the two models can be compared). The R² coefficient of the quadratic model (0,830) is much higher than that of the linear model, and the Sig. of the quadratic model is 0,000, which is less than 0,05. That means the quadratic model is an adequate model for the analysed data.
Model Summary and Parameter Estimates
Dependent Variable: Production

            Model Summary                           Parameter Estimates
Equation    R Square   F        df1   df2   Sig.    Constant   b1        b2
Linear      ,078       1,107    1     13    ,312    238,238    2,129
Quadratic   ,830       29,341   2     12    ,000    316,033    -25,328   1,716

The quadratic trend model has the expression

\[
y = m + b_1 t + b_2 t^2 .
\]

The fitted model for the analysed data is

y = 316,03 − 25,33t + 1,72t²

The chart also shows that the quadratic trend model fits much better than the linear one.
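The linear/quadratic comparison can also be sketched outside SPSS. The Production series itself is not listed in this handbook, so the numbers below are hypothetical, chosen only to show the mechanics; numpy.polyfit performs the least-squares fit:

import numpy as np

# hypothetical U-shaped production series for 1997-2011 (15 observations);
# the actual data from the SPSS example are not reproduced here
production = np.array([320, 290, 260, 240, 225, 215, 210, 212,
                       220, 235, 255, 280, 310, 345, 385], dtype=float)
t = np.arange(1, len(production) + 1)

b1_lin, b0_lin = np.polyfit(t, production, 1)   # linear:    y = b0 + b1*t
b2, b1, b0 = np.polyfit(t, production, 2)       # quadratic: y = b0 + b1*t + b2*t^2
print(f"linear:    y = {b0_lin:.2f} + {b1_lin:.2f}*t")
print(f"quadratic: y = {b0:.2f} + {b1:.2f}*t + {b2:.2f}*t^2")

For a U-shaped series like this, the quadratic fit is far better than the linear one, mirroring the R² comparison in the SPSS output.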
Self-test questions
1. Describe and visualize graphically the seasonal component of the time series.
2. Determine whether the time series is characterized by a linear trend:

Period   1      2      3      4      5      6      7      8
Value    21,3   21,9   21,5   21,8   21,3   21,7   22     21,4
3. How do you understand the exponential smoothing method?
4. When is it better to use centred moving averages instead of ordinary moving averages?
5. What is the difference between MSE and RMSE? Which one would you use to check the accuracy of a prediction?
Exercises
Choose the best model for each of the following time series.
1. Data about average fuel price (rows: month, columns: year)

Month   2008   2009   2010   2011
1       2,32   3,21   4,04   3,44
2       2,34   3,32   3,99   3,68
3       2,35   3,35   3,97   3,71
4       2,41   3,43   3,91   3,93
5       2,58   3,51   3,84   4,04
6       2,63   3,78   3,83   4,18
7       2,71   3,83   3,79   4,21
8       2,95   3,91   3,77   4,33
9       3,01   3,99   3,63   4,39
10      3,03   4,02   3,58   4,41
11      3,14   4,05   3,39   4,43
12      3,18   4,13   3,38   4,48
2. Data about Customers in the shop (thous., by quarter)

Year   I Q   II Q   III Q   IV Q
2004   3,1   5,4    6,9     7,3
2005   3,3   5,2    6,1     8,1
2006   2,9   6,3    6,9     7,3
2007   3,2   5,4    6,5     7,7
2008   3,1   5,4    6,9     7,3
2009   3,6   5,8    6,7     7,5

3. Data about Profit

Year   Profit
1994   2,05
1995   2,33
1996   2,66
1997   3,03
1998   3,45
1999   3,93
2000   4,47
2001   5,09
2002   5,80
2003   6,60
2004   7,52
2005   8,57
2006   9,76
2007   11,11
2008   12,65
2009   14,41
2010   16,41
2011   18,69
4. Data about Computer sales

Year   Computer sales (thous.)
1990   13,4
1991   13,5
1992   13,6
1993   13,5
1994   14,7
1995   15
1996   15,3
1997   15,8
1998   15,5
1999   16,1
2000   16,4
2001   16
2002   16,8
2003   17
2004   17,3
2005   17,5
2006   18
2007   21,3
2008   22,6
2009   21,5
2010   23,4
2011   23,3
Appendix 1. t values
Appendix 2. Chi square values
Appendix 3. F values
References
1. http://www.statoek.wiso.uni-goettingen.de/veranstaltungen/graduateseminar/SmoothingMethods_Narodzonek-Karpowska.pdf
2. Fernandez M. Statistics for Business and Economics. 2009, Ventus Publishing ApS.
3. Tyrrell S. SPSS: Stats Practically Short and Simple. 2009, Ventus Publishing ApS.
4. Smith R. Applied Statistics and Econometrics: Notes and Exercises. 2009, Birkbeck.
5. Barrow M. Statistics for Economics, Accounting and Business Studies. 2009, Pearson Education Limited.
6. Kenny D. A. Statistics for the social and behavioural sciences. 1987.
7. http://www.sussex.ac.uk/Users/grahamh/RM1web/SPSShdt1-2012.pdf
8. http://www.sagepub.com/upm-data/40007_Chapter8.pdf
9. http://www.law.uchicago.edu/files/files/20.Sykes_.Regression.pdf
10. http://www.cliffsnotes.com/math/statistics/sampling/populations-samples-parameters-and-statistics
11. http://sociology.about.com/od/Statistics/a/Descriptive-inferential-statistics.htm
12. https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php