SMK University of Applied Social Sciences

Laura Saltyte

BASICS OF APPLIED STATISTICS
Course Handbook

Klaipeda 2015

Approved by the decision of the Academic Board of SMK University of Applied Social Sciences, 15th April 2014, No. 4.

Layout by Jurate Banyte-Gudeliene

The publication is financed within the project "Joint Degree Study Programme 'Technology and Innovation Management' Preparation and Implementation", No. VP1-2.2-SMM-07-K-02-087, funded in accordance with measure VP1-2.2-SMM-07-K "Improvement of Study Quality, Development of Internationalization" of priority 2 "Lifelong Learning" of the Action Programme of Human Resources Development 2007–2013.

© Laura Saltyte, 2015
© SMK University of Applied Social Sciences, 2015

ISBN 978-9955-648-27-7

Contents

Introduction
  Population and Sample
  Data types and measurements
  Getting started with SPSS
  Self-test questions
Descriptive statistics
  Frequency distributions. Graphs
  Measures of central tendency
  Measures of variation
  Standard Scores and Normal distribution
  Descriptive statistics in SPSS
  Self-test questions
  Exercises
Hypotheses testing
  The concepts of hypothesis testing
  t Ratio or Student's t
    Hypothesis for one sample
    Testing the difference of two means (independent and paired samples)
    Two independent samples
    Two dependent samples
  Hypothesis testing in SPSS
  Self-test questions
  Exercises
Correlation analysis
  Self-test questions
  Exercises
Regression analysis
  Linear regression
  Multiple regression
  Adequacy of regression model
  Correlation and regression analysis in SPSS
  Self-test questions
  Exercises
Time series analysis
  Components of time series
  Trend in SPSS
  Self-test questions
  Exercises
Appendix 1. t values
Appendix 2. Chi square values
Appendix 3. F values
References

Introduction

Statistics are collections of data associated with human enterprises. Statistics is also a method that can be used to analyse data.
That is, to organize and make sense of a large amount of material. Statistics is the study of how to collect, organize, analyse, and interpret the numerical information in data. Statistics is needed in:
• sports,
• the stock market,
• traffic,
• and hundreds of other human activities.

Like most people, you probably feel that it is important to "take control of your life." But what does this mean? Partly, it means being able to properly evaluate the data and claims that bombard you every day. If you cannot distinguish good from faulty reasoning, then you are vulnerable to manipulation and to decisions that are not in your best interest. Statistics provides tools that you need in order to react intelligently to information you hear or read. In this sense, statistics is one of the most important things that you can study [10].

Statistics are often presented in an effort to add credibility to an argument or advice. You can see this by paying attention to television advertisements. Many of the numbers thrown about in this way do not represent careful statistical analysis. They can be misleading and push you into decisions that you might find cause to regret. For these reasons, learning about statistics is a long step towards taking control of your life. (It is not, of course, the only step needed for this purpose.) The present handbook is designed to help you learn statistical essentials.

The aim of this methodological handbook is to introduce students to data measurement scales, types of data, and data collection and coding techniques. The characteristics of data location (mean, mode, median) and data spread (dispersion, standard deviation, range) are analysed. The course unit discusses indicators of statistical relationship, and linear and nonlinear regression. The students learn about hypothesis testing for parameters. Graphical analysis of data is performed during the course unit studies.
Data analysis software tools and their use in solving typical data analysis tasks are introduced to students. After completing the course unit, students are able to select the models that are most appropriate for the available data and to interpret the obtained results.

The handbook consists of six chapters: Introduction, Descriptive statistics, Hypotheses testing, Correlation analysis, Regression analysis, and Time series analysis. Guidelines on how to use one of the most popular packages for statistical calculation (SPSS) are also given at the end of each topic.

Keywords: population, sample, discrete and continuous data, nominal scale, ordinal scale, interval scale, ratio scale.

Population and Sample

Before we study specific statistical descriptions, let us define the terms population and sample [10]. A population is a group of phenomena that have something in common. The term often refers to a group of people, as in the following examples:
• all registered voters in Klaipeda;
• all members of the International Machinists Union;
• all Lithuanians who played basketball at least once in the past year.

But populations can refer to things as well as people:
• all widgets produced last Tuesday by the Acme Widget Company;
• all daily maximum temperatures in July for major Lithuanian cities;
• all basal ganglia cells from a particular rhesus monkey.

Often, researchers want to know things about populations but do not have data for every person or thing in the population. If a company's customer service division wanted to learn whether its customers were satisfied, it would not be practical (or perhaps even possible) to contact every individual who purchased a product. Instead, the company might select a sample of the population. A sample is a smaller group of members of a population selected to represent the population. In order to use statistics to learn things about the population, the sample must be random.
A random sample is one in which every member of the population has an equal chance of being selected. The most commonly used sample is a simple random sample. It requires that every possible sample of the selected size has an equal chance of being used [10] (see Figure 1).

A parameter is a characteristic of a population. A statistic is a characteristic of a sample. Inferential statistics enables you to make an educated guess about a population parameter based on a statistic computed from a sample randomly drawn from that population. Usually the population size is denoted by the letter N and the sample size by the letter n.

Figure 1. Population and sample. Source: www.boundless.com

Statistical procedures can be divided into two major categories: descriptive statistics and inferential statistics.

Descriptive statistics includes statistical procedures that we use to describe the population we are studying. The data could be collected from either a sample or a population, but the results help us organize and describe the data. Descriptive statistics can only be used to describe the group that is being studied; that is, the results cannot be generalized to any larger group. Descriptive statistics are useful and serviceable if you do not need to extend your results to any larger group. However, much of social science tends to include studies that aim to give us "universal" truths about segments of the population, such as all parents, all women, all victims, etc. Frequency distributions, measures of central tendency (mean, median, and mode), and graphs like pie charts and bar charts that describe the data are all examples of descriptive statistics.

Inferential statistics is concerned with making predictions or inferences about a population from observations and analyses of a sample. That is, we can take the results of an analysis using a sample and generalize them to the larger population that the sample represents.
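The parameter/statistic distinction can be made concrete with a short simulation. The sketch below uses Python's standard library rather than SPSS, and the population values are made up purely for illustration:

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical population: ages of N = 1000 people (invented data)
population = [random.randint(18, 80) for _ in range(1000)]
N = len(population)

# Parameter: a characteristic of the whole population
population_mean = sum(population) / N

# Simple random sample of size n = 50: random.sample draws without
# replacement, so every subset of 50 members is equally likely
sample = random.sample(population, 50)
n = len(sample)

# Statistic: the same characteristic computed from the sample;
# it estimates (but rarely exactly equals) the parameter
sample_mean = sum(sample) / n

print(population_mean, sample_mean)
```

Running this repeatedly with different seeds shows the sample mean scattering around the population mean, which is exactly the situation inferential statistics quantifies.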
In order to do this, however, it is imperative that the sample is representative of the group to which it is being generalized. To address this issue of generalization, we have tests of significance. A chi-square test or t-test, for example, can tell us the probability that the results of our analysis on the sample are representative of the population that the sample represents. In other words, these tests of significance tell us the probability that the results of the analysis could have occurred by chance when there is no relationship at all between the variables we studied in the population. Examples of inferential statistics include regression analysis, ANOVA, correlation analysis, survival analysis, etc. [11]

Data types and measurements

We can classify data into two types: continuous and discrete. Metres, centimetres and millimetres, or kilograms, grams and milligrams, are examples of continuous data. With these we can make measurements of varying degrees of precision. Discrete or discontinuous data are based on measurements that can be expressed only in whole units (counts of people, the number of words spelled correctly, the number of cars passing a point, etc.).

Normally, when people hear the term measurement, they may think of measuring the length of something (e.g., the length of a piece of wood) or measuring a quantity of something (e.g., a cup of flour). This represents a limited use of the term. In statistics, the term measurement is used more broadly and is more appropriately termed scales of measurement. Scales of measurement refer to the ways in which variables/numbers are defined and categorized. Each scale of measurement has certain properties which in turn determine the appropriateness of certain statistical analyses. The four scales of measurement are nominal, ordinal, interval, and ratio [5].

Nominal scale – a measure of identity, i.e.
it classifies individuals into categories (e.g. religious preference: (1) Protestant, (2) Catholic, (3) Jewish, (4) Hinduism, (5) other, (6) none). Only simple statistical methods can be used with nominal data.

Ordinal scale – measures are arranged from highest to lowest or vice versa. On this scale we can compare which is larger or smaller, harder or softer, etc., but the measures do not tell us by how much. Statistically, not much can be done, but more than with nominal data.

Interval scale – provides numbers that reflect differences among items. With interval scales the measurement units are equal (Fahrenheit and Celsius thermometers; time as reckoned by our calendar; scores on an intelligence test). Many statistical methods can be used with interval-scale data.

Ratio scale – the basic difference between this and the interval scale is that ratio scales have an absolute zero (length, width, weight, capacity, etc.). All statistical methods can be used. Sometimes the interval and ratio scales are together called the quantitative scale [5].

Getting started with SPSS

SPSS stands for "Statistical Package for the Social Sciences". It is a very powerful program that can do all of the statistics that you are ever likely to want to use. It is actually fairly easy to use, but because it can do statistics for "grown-ups" as well as novices, it may seem quite daunting at first. It presents you with a bewildering array of options that you will probably never need to use. When it comes to giving you statistical results, it will give you what you want – as well as a lot of extra material that you may not need! The secret to using SPSS is to take it one small step at a time. This series of hand-outs is aimed at showing you how to use SPSS to do the statistics referred to in the lectures.
There are now many helpful books which explain how to use SPSS well; the only catch with these is that SPSS exists in various versions, so, depending on which book you get, it may not correspond exactly to what happens with the version we will be using (version 20.0) [7].

1. Starting up SPSS:

Double-click on the SPSS icon. After a few seconds, the "Data View" window should appear on your screen. It has the default name "Untitled1". At the top of this window there is a row of commands (File, Edit, Help, etc.). Clicking on any of these will produce a drop-down menu, much as in other programs that you might be familiar with (such as Word or Excel). At this stage, many of the options on the menus will appear quite meaningless to you. "File", "Edit", "Analyze" and "Window" are the options that we will use the most. As you might expect, "File" enables you to open and save files, and "Edit" enables you to cut, copy and paste things. We will use "Analyze" to perform various statistical tests. "Help" provides you with information about SPSS [7].

If you click on the tab at the bottom left of the window, it switches to a new window, the "Variable View". You can toggle between these two windows at any time. You use the Data View window when you want to input data, and you use the Variable View window to change various properties of the data. There is a third window, the "Output Window": this will contain the results of any statistical analyses that you perform. (The Output Window will only be available once you have some output to see, so it won't actually be accessible just yet.)

Each window has the controls for SPSS at the top of the screen (the words "File", "Edit" and so on, and the row of icons beneath them). Most of the SPSS controls will remain visible all the time, but you can switch between the three windows whenever you like.
You switch between the "Data View" and "Variable View" windows by using the tabs at the bottom left of either of these two windows. You switch between these two windows and the "Output Window" by clicking on "Window" at the top of the screen and selecting the one that you want from the menu that appears.

2. Entering data:

In SPSS, each row of the grid is a "case" and each column is a "variable". To make this clear, imagine we have the heights and weights of six people. Each person is a separate case. We have three variables: height, weight and sex. We could therefore enter the data in such a way that each row represents a different individual: one column has the height data, another column has the weight data, and a third column tells SPSS whether the person was male or female. To enter values, make sure you have the "Data View" window selected. Move the cursor to the square into which you wish to make an entry, and click on it. Enter the value, followed by a press of the "enter" key. To move around the grid, you can use the arrow keys or the mouse [7].

First of all, you need to prepare SPSS for entering the data, so switch from Data View to Variable View. Each row in this window contains information about one of the variables (one of the columns) in the "Data View" window. Change the Name of variable "var0001" to "name"; change the name of "var0002" to "height"; change "var0003" to "weight"; and change "var0004" to "sex". In the version of SPSS that we are using, the variable name can be any combination of letters and numbers, but it must not contain a space or any other symbols. You can add a more informative title to a variable, one that can include punctuation marks and spaces, by entering it in the box that is entitled Label. It is very important to do this, as SPSS will show these labels in your output, and they will make the output much easier to understand.
I have many data files with variable names like "qw1325bc", which made sense at the time I wrote the file but are now quite meaningless to me because I didn't label the variables!

Type and Values: SPSS can treat an entry in a cell as a sequence of characters (a "string") or as a number. "Tom" is a string, while his height and weight are numbers. For "sex", I have used numbers as labels: "1" represents "male" and "2" represents "female", but these are arbitrary labels. I could have used any two numbers (say "5" for "male" and "0" for "female") and SPSS would have been equally happy. It is easy to forget which number represents which condition. However, if you click on the "Values" column, a small grey box appears. Click on this, and a dialog box pops up. You can associate a label with each number – thus, in this case, you can tell SPSS that "1" means "male" and "2" means "female". (Don't forget to click on "Add" each time you enter a label.)

Width: Width merely specifies the width of each column in the "Data View" window. By default, a column that contains numbers is eight characters wide. However, an annoying complication of this version of SPSS is that, if the first entry in a column is a string of a certain length, SPSS assumes that all of the subsequent cells will contain strings of the same length. This is why the names start with "Matilda": had I started the column with "Tom", SPSS would assume that all of the strings in this column are going to be 3 characters long. Consequently "Dick" would have been truncated to "Dic", "Harry" to "Har" and so on. You can stop SPSS doing this by changing the column width to suit the length of your strings – or, as I did here, by making sure that the longest string goes in the first cell of a column! [7]

Decimals: If the variable is a number, this column specifies how many decimal places each case will be displayed to.
So, if you select 0, Tom's height will appear as "2000"; if you select 2, it will appear as "2000.00"; if you select 3, it will be "2000.000"; and so on. If the variable is a string, this is irrelevant, and so SPSS will show a 0 in the relevant cell of the "Variable View" window [7].

Missing: Sometimes a data set is incomplete – perhaps someone forgot to tell you their height, for example. If a numerical entry is blank, SPSS assumes the data are missing. However, sometimes you might want to enter a code for missing values – perhaps "999" to show that the value is missing because the person forgot to enter it, and "99" to show that it is missing because the person refused to give it. "Missing" enables you to do this. At the moment, it is simplest to show missing data by leaving the relevant cell blank in the "Data View" window [7].

Measure: This will be explained more fully in the statistics lectures. Essentially, this column shows what kind of data SPSS thinks is in the column: nominal (a name, i.e. a string), ordinal (rating data) or scale (interval or ratio data). At the moment, it will suffice to keep clear the distinction between using numbers as numbers, and using them as labels (strings).

Now we are ready to enter the data, so go back to the Data View window. Let's say we have information about six people. Each column is a "variable", containing data of a particular kind (for instance, "sex"); each row is a "case", containing a single person's set of data. In the column "sex" you can enter the numbers (1 or 2, as appropriate), and they will be displayed as "male" or "female".

3. Saving a file:

Once you have entered your data, always save the data before doing anything else. Saving the file will save you a lot of heartache in the future. It is really demoralising to spend hours typing numbers in, only to lose the lot through some accident or mistake. Click on "File" in the SPSS controls (it is at the top left). A menu will appear.
Now click on "Save As", and enter a filename in the top left-hand box – where it says "*.sav". Any combination of letters and numbers will do, but let's call the file "chicken". SPSS will automatically add the suffix ".sav". The ".sav" part is important, as it tells SPSS that this file is a data file of a kind that it likes. (SPSS can read other types of data file as well, but it is simplest to stick with .sav files for the moment.) So, all you have to do is type "chicken" (or, in future, any filename you choose) in the box to the right of "filename". Press "enter", and SPSS will save the data into a file. You can now carry on, secure in the knowledge that whatever happens, your data will be safe on the computer; or you can quit SPSS and come back another time. To quit SPSS, click on "Exit", which is at the bottom of the "File" menu.

When you save your data, an "Output Window" will open automatically, containing information that you have successfully saved the file. If this is all it contains, just close it without saving it. However, if you have run some statistical analyses and hence have some output in the "Output Window", you can save this in a separate file. To do so, make the "Output Window" the active window (as described earlier), and then click on "File" and then "Save As" in the same way as for saving the data. This time, SPSS will prompt you to supply a filename ending in ".spo", to show that it is an output file rather than a data file. Thus, you could call it "chicken.spo", and SPSS will then save the contents of the "Output Window" as a file. This file can be read into a word processor such as Microsoft Word, and then treated like any other text document [7].

Self-test questions

1. What is the main difference between the nominal and ordinal scales?
2. What is the main difference between descriptive and inferential statistics?
3. What can the "Values" column in SPSS be used for?
4.
What is the population if the goal of the research is to explore students' views of Lithuanian high schools?
5. Will the variable "Age" be an interval or a scale variable?

Descriptive statistics

It is well known that a picture tells more than a thousand words. The same applies to any serious data analysis. The first step of data analysis is to summarize the data by drawing plots and charts as well as by computing some descriptive statistics. These tools essentially aim to provide a better understanding of how frequent the distinct data values are, and how much variability there is around a typical value in the data. After finishing this chapter, students will be able to collect, systematize and analyse characteristics defining social-economic phenomena. The aim of this chapter is to show how to systematize data and which descriptive statistics can be used and how to use them; the normal curve and its applications will also be explained.

Keywords: frequency table, grouped data, central tendency, variation, standard scores.

Frequency distributions. Graphs

Often, to make our data more interpretable and convenient, we set up a frequency distribution and draw graphs of various kinds to represent the data [6]. A frequency table reports the number of times that a given observation occurs or, if expressed in relative terms, the frequency of that value divided by the number of observations in the sample. Usually a frequency table is applied to categorical (discrete) data with no more than 10–15 different categories.

Example

A company in the transformation industry classifies the individuals at managerial positions according to their university degree.
1 – Accountant; 2 – Administrator; 3 – Economist; 4 – Engineer; 5 – Lawyer; 6 – Physicist.

Given data: 1, 2, 3, 6, 2, 3, 4, 5, 4, 2, 3, 4, 3, 4, 4, 5, 4, 4

Frequency table:

Degree         Counts  Frequencies  Percentage
Accountant     1       1/18         5,56%
Administrator  3       1/6          16,67%
Economist      4       2/9          22,22%
Engineer       7       7/18         38,89%
Lawyer         2       1/9          11,11%
Physicist      1       1/18         5,56%

The corresponding plots for this type of categorical data are the bar chart and the pie chart (see Figures 1 and 2).

Figure 1. Bar plot of the frequency table

Figure 2. Pie plot of the frequency table

If the sample size is big and the measurements are made on an interval or ratio scale, a frequency table of grouped data can be used. Before making a frequency table of grouped data, some rules should be followed:
• we seldom use fewer than 6 or more than 15 classes (intervals); the exact number we use in a given situation will depend on the nature, magnitude, and range of the data;
• we always make sure that each item (measurement or observation) goes into one and only one class (interval);
• whenever possible, we make the classes (intervals) the same length; that is, we make them cover equal ranges of values;
• if a set of data contains a few values which are much greater than or much smaller than the rest, open classes are quite useful in reducing the number of classes required to accommodate the data.

The following scheme can be used to construct intervals of equal length:
• find x_min and x_max;
• determine the number of intervals (k = 6…15);
• calculate the length of an interval, h = (x_max − x_min) / k;
• calculate the break points of the intervals, c_i = c_(i−1) + h (with c_0 = x_min).
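The frequency table for the degree data can be reproduced in a few lines of Python (used here purely as an illustration in place of SPSS; the category labels follow the coding list above):

```python
from collections import Counter
from fractions import Fraction

# University-degree codes from the example above
data = [1, 2, 3, 6, 2, 3, 4, 5, 4, 2, 3, 4, 3, 4, 4, 5, 4, 4]
labels = {1: "Accountant", 2: "Administrator", 3: "Economist",
          4: "Engineer", 5: "Lawyer", 6: "Physicist"}

n = len(data)            # 18 observations
counts = Counter(data)   # occurrences of each code

# One row per category: count, relative frequency, percentage
for code in sorted(counts):
    count = counts[code]
    freq = Fraction(count, n)   # e.g. 7/18 for Engineer
    pct = 100 * count / n
    print(f"{labels[code]:13s} {count:2d} {str(freq):5s} {pct:5.2f}%")
```

The `Fraction` values match the "Frequencies" column of the table, and the percentages agree with the 5,56% … 38,89% figures shown above (up to the decimal separator).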
Example

Consider a sample of 40 college graduates whose first salaries (in 1000 Lt per annum) after graduating are as follows:

140 150 75 96 96 86 99 100 86 87 89 95 122 125 95 95 96 97 97 150
97 98 99 95 132 99 99 100 100 105 110 110 110 115 97 98 120 95 135 160

1. x_min = 75; x_max = 160;
2. let's make 5 intervals (k = 5);
3. interval length: h = (160 − 75) / 5 = 17;
4. interval break points: c_0 = 75; c_1 = 75 + 17 = 92; c_2 = 92 + 17 = 109; c_3 = 109 + 17 = 126; c_4 = 126 + 17 = 143; c_5 = 143 + 17 = 160.

Frequency table for the grouped data:

Intervals   Counts  Frequencies  Percentage  Middle points
[75;92)     5       1/8          12,50       83,5
[92;109)    22      11/20        55,00       100,5
[109;126)   7       7/40         17,50       117,5
[126;143)   3       3/40         7,50        134,5
[143;160]   3       3/40         7,50        151,5

Middle points are calculated as follows: x_i0 = (c_(i−1) + c_i) / 2.

The corresponding plot for this type of grouped data is the histogram. A histogram is constructed by representing the grouped measurements or observations on a horizontal scale and the class frequencies (or corresponding percentages) on a vertical scale, and drawing rectangles whose bases equal the class intervals and whose heights are determined by the corresponding class frequencies (percentages).

Figure 3. Histogram

An alternative, although less widely used, form of graphical presentation is the frequency polygon. Here the class frequencies are plotted at the class marks and the successive points are connected by straight lines.

Figure 4. Polygon

Measures of central tendency

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. As such, measures of central tendency are sometimes called measures of central location.
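The grouping scheme above is easy to check mechanically. The sketch below (plain Python, offered as an illustration rather than as part of the handbook's SPSS workflow) rebuilds the break points, class counts and middle points for the salary data:

```python
# First salaries (in 1000 Lt per annum) from the example above
salaries = [140, 150, 75, 96, 96, 86, 99, 100, 86, 87, 89, 95, 122, 125,
            95, 95, 96, 97, 97, 150, 97, 98, 99, 95, 132, 99, 99, 100,
            100, 105, 110, 110, 110, 115, 97, 98, 120, 95, 135, 160]

x_min, x_max = min(salaries), max(salaries)
k = 5                        # chosen number of intervals
h = (x_max - x_min) // k     # interval length: (160 - 75) / 5 = 17

# Break points c_0 .. c_k and the class counts
breaks = [x_min + i * h for i in range(k + 1)]   # [75, 92, 109, 126, 143, 160]
counts = [0] * k
for x in salaries:
    for i in range(k):
        # intervals are [c_i; c_(i+1)); the last one, [143;160], is closed
        if breaks[i] <= x < breaks[i + 1] or (i == k - 1 and x == breaks[k]):
            counts[i] += 1
            break

midpoints = [(breaks[i] + breaks[i + 1]) / 2 for i in range(k)]
print(counts)     # [5, 22, 7, 3, 3]
print(midpoints)  # [83.5, 100.5, 117.5, 134.5, 151.5]
```

The counts and middle points agree with the frequency table above; passing `counts` to any plotting tool then gives the histogram of Figure 3.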
They are also classed as summary statistics. The mean (often called the average) is most likely the measure of central tendency that you are most familiar with, but there are others, such as the median and the mode. The mean, median and mode are all valid measures of central tendency, but under different conditions some measures become more appropriate to use than others. In the following sections, we will look at the mean, mode and median, learn how to calculate them, and see under what conditions each is most appropriate [12].

The Mean
The mean (or average) is the most popular and well-known measure of central tendency. It can be used with both discrete and continuous data, although it is most often used with continuous data. The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. Therefore, if we have n values x_1, x_2, …, x_n, the sample mean, usually denoted by x̄ (pronounced "x bar"), is:

x̄ = (x_1 + x_2 + … + x_n)/n

This formula is usually written in a slightly different manner using the Greek capital letter Σ, pronounced "sigma", which means "sum of":

x̄ = (Σ x_i)/n

If we have a frequency table, a slightly different formula is used:

x̄ = (Σ f_i x_i)/n

Here f_i are the frequencies (counts).

The formula for grouped data is:

x̄ = (Σ m_i x_i0)/n

Here m_i are the counts of the grouped data and x_i0 are the interval middle points.
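As a quick illustration of these formulas (my own sketch, in Python), the raw-data mean and the grouped-data mean of the salary example give close but not identical answers, since grouping replaces each value by its interval midpoint:

```python
# First salaries of 40 college graduates (in 1000 Lt), from the grouping example
salaries = [140, 150, 75, 96, 96, 86, 99, 100, 86, 87, 89, 95, 122, 125,
            95, 95, 96, 97, 97, 150, 97, 98, 99, 95, 132, 99, 99, 100,
            100, 105, 110, 110, 110, 115, 97, 98, 120, 95, 135, 160]
n = len(salaries)
mean_raw = sum(salaries) / n                      # x_bar = sum(x_i) / n

counts    = [5, 22, 7, 3, 3]                      # m_i from the grouped table
midpoints = [83.5, 100.5, 117.5, 134.5, 151.5]    # x_i0
mean_grouped = sum(m * x0 for m, x0 in zip(counts, midpoints)) / sum(counts)

print(mean_raw)      # 106.25
print(mean_grouped)  # 107.725
```

The grouped formula is an approximation: it is used when only the frequency table, not the raw data, is available.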
Example
Suppose that we have the grades of 50 students in a course of elementary statistics.

Grades (x_i) | Counts (f_i) | f_i x_i
5            | 12           | 60
6            | 18           | 108
7            | 13           | 91
8            | 4            | 32
9            | 2            | 18
10           | 1            | 10
             | n = 50       | Σ f_i x_i = 319

x̄ = 319/50 = 6.38

Example
Suppose that we have grouped data.

Age groups | Counts (m_i) | Midpoints (x_i0) | m_i x_i0
[25;30)    | 1            | 27.5             | 27.5
[30;35)    | 0            | 32.5             | 0
[35;40)    | 3            | 37.5             | 112.5
[40;45)    | 6            | 42.5             | 255
[45;50)    | 6            | 47.5             | 285
[50;55)    | 6            | 52.5             | 315
[55;60)    | 7            | 57.5             | 402.5
[60;65)    | 4            | 62.5             | 250
[65;70)    | 4            | 67.5             | 270
[70;75)    | 1            | 72.5             | 72.5
[75;80)    | 1            | 77.5             | 77.5
[80;85]    | 1            | 82.5             | 82.5
           | n = 40       |                  | Σ m_i x_i0 = 2150

x̄ = 2150/40 = 53.75

The mean is essentially a model of your data set: a single value that summarizes all of the data. You will notice, however, that the mean is often not one of the actual values observed in your data set [12].

When not to use the mean
The mean has one main disadvantage: it is particularly susceptible to the influence of outliers. These are values that are unusual compared to the rest of the data set, being especially small or large in numerical value. For example, consider the wages of staff at a factory below:

Staff  | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10
Salary | 1500 | 1800 | 1600 | 1400 | 1500 | 1500 | 1200 | 1700 | 9000 | 9500

The mean would be

x̄ = 30700/10 = 3070 Lt

and this mean does not describe the real situation in the company: eight of the ten salaries are below it.

The Median
The median is the middle score of a data set that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. In order to calculate the median, suppose we have the data below:

65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92

We first need to rearrange the data into order of magnitude (smallest first):

14, 35, 45, 55, 55, 56, 56, 65, 87, 89, 92

Our median is the middle value, in this case 56. It is the middle value because there are 5 scores before it and 5 scores after it.
This works fine when you have an odd number of scores, but what happens when you have an even number of scores? What if you had only 10 scores? Well, you simply take the middle two scores and average them. So, if we look at the example below:

65, 55, 89, 56, 35, 14, 56, 55, 87, 45

We again rearrange the data into order of magnitude (smallest first):

14, 35, 45, 55, 55, 56, 56, 65, 87, 89

Only now we have to take the 5th and 6th scores (55 and 56) and average them to get a median of 55.5 [12].

The Mode
The mode is the most frequent score in our data set. It corresponds to the highest bar in a bar chart or histogram. You can, therefore, sometimes consider the mode as being the most popular option. Normally, the mode is used for categorical data where we wish to know which is the most common category, as illustrated below:

Figure 5. Bar chart of transport choices (Car, Bus, Train, Bicycle), with the mode at the highest bar

One of the problems with the mode is that it is not unique, which leaves us with problems when two or more values share the highest frequency, such as below:

Figure 6. A bimodal distribution: two values share the highest frequency

Another problem with the mode is that it will not provide us with a very good measure of central tendency when the most common value is far away from the rest of the data, as depicted below:

Figure 7. A distribution whose mode lies far from the bulk of the data

Summary of when to use the mean, median and mode
Please use the following summary table to choose the best measure of central tendency with respect to the different types of variable [12].

Table 1. When to use the mean, median and mode

Type of Variable | Best measure of central tendency
Nominal          | Mode
Ordinal          | Median or Mode
Interval/Ratio   | Mean

Measures of variation
Measures of variability indicate the degree to which the scores in a distribution are spread out. Larger numbers indicate greater variability of scores.
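The three measures of central tendency just summarized can be computed with Python's statistics module (a sketch of my own; multimode requires Python 3.8+), using the median examples from above:

```python
import statistics

scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

print(statistics.mean(scores))         # arithmetic mean
print(statistics.median(scores))       # middle of the 11 sorted values: 56
print(statistics.multimode(scores))    # [55, 56]: the mode is not unique here
print(statistics.median(scores[:10]))  # even n: average of the two middle values, 55.5
```

Note that multimode reports both 55 and 56, which is exactly the non-uniqueness problem the text describes.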
Sometimes the word dispersion is substituted for variability, and you will find that term used in some statistics texts. We will divide our discussion of measures of variability into three categories: the range, the variance, and the standard deviation.

The Range
The range is the distance from the lowest score to the highest score. The range is very unstable, because it depends on only two scores: if one of those scores moves further from the distribution, the range will increase even though the typical variability among the scores has changed very little.

The Variance
The variance is the average of the squared deviations between the individual scores and the mean, s² = Σ(x_i − x̄)²/(n − 1). The larger the variance, the more variability there is among the scores. When comparing two samples with the same unit of measurement, the variances are comparable even though the sample sizes may be different. The notation used for the variance is a lowercase s². A convenient computational form of the formula is shown below:

s² = (Σ x_i² − n x̄²)/(n − 1)

Variance for a frequency table:

s² = (Σ f_i x_i² − n x̄²)/(n − 1)

Variance for grouped data:

s² = (Σ m_i (x_i0)² − n x̄²)/(n − 1)

Definitions for f_i, m_i, and x_i0 can be found above. Did you notice that the variance formula does not divide by n, but instead divides by (n − 1)? The denominator (n − 1) in this equation is called the degrees of freedom. It is a concept that you will hear about again and again in statistics. The reason that the variance formula divides the sum of squared deviations from the mean by (n − 1) is that dividing by n would produce a biased estimate of the population variance, and that bias is removed by dividing by (n − 1).

The Standard deviation
The variance has some excellent statistical properties, but it is hard for most students to conceptualize.
To start with, the unit of measurement for the mean is the same as the unit of measurement for the scores. For example, if we compute the mean age of the sample and find that it is 28.7 years, that mean is on the same scale as the individual ages of our participants. But the variance is in squared units. For example, we might find that the variance is 100 years². Can you even imagine what the unit "years squared" represents? Most people can't. But there is a measure of variability that is in the same units as the mean. It is called the standard deviation, and it is the square root of the variance (see the formula below). So if the variance were 100 years², the standard deviation would be 10 years. Since we used the symbol s² to indicate the variance, you might not be surprised that we use the lowercase letter s to indicate the standard deviation. You will see in our discussion of relative scores how valuable the standard deviation can be.

s = √s²

At this point, many students assume that the variance is just a step in computing the standard deviation, because the standard deviation seems much more useful and understandable. In fact, you will use the standard deviation for descriptive purposes, and the variance for most of your other statistical tasks.

Example
Let's say we have the data set 10, 12, 15, 18, 20 (n = 5, x̄ = 15).

The range: 20 − 10 = 10

The variance:

s² = (10² + 12² + 15² + 18² + 20² − 5·15²)/4 = (1193 − 1125)/4 = 17

The standard deviation:

s = √17 ≈ 4.12

Example

Grades (x_i) | Counts (f_i) | f_i x_i²
5            | 12           | 300
6            | 18           | 648
7            | 13           | 637
8            | 4            | 256
9            | 2            | 162
10           | 1            | 100
             | n = 50       | Σ f_i x_i² = 2103

The variance:

s² = (2103 − 50·6.38²)/49 ≈ 1.38

here 6.38 is the average calculated before.

The standard deviation: s = √1.38 ≈ 1.18.
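A short check of these computations (my own sketch, Python standard library) for the data set 10, 12, 15, 18, 20:

```python
import math
import statistics

data = [10, 12, 15, 18, 20]
n = len(data)
mean = sum(data) / n                                  # 15.0

data_range = max(data) - min(data)                    # 10
variance = (sum(x * x for x in data) - n * mean ** 2) / (n - 1)
std_dev = math.sqrt(variance)

print(data_range, variance, round(std_dev, 2))        # 10 17.0 4.12

# statistics.variance / statistics.stdev use the same (n - 1) denominator:
print(statistics.variance(data), statistics.stdev(data))
```

Note that the standard deviation (about 4.12) is a plausible fraction of the range (10); a standard deviation close to the range would signal an arithmetic mistake.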
Example
For the grouped data below, calculate the measures of variability.

Age groups | Counts (m_i) | Midpoints (x_i0) | m_i (x_i0)²
[25;30)    | 1            | 27.5             | 756.25
[30;35)    | 0            | 32.5             | 0
[35;40)    | 3            | 37.5             | 4218.75
[40;45)    | 6            | 42.5             | 10837.5
[45;50)    | 6            | 47.5             | 13537.5
[50;55)    | 6            | 52.5             | 16537.5
[55;60)    | 7            | 57.5             | 23143.75
[60;65)    | 4            | 62.5             | 15625
[65;70)    | 4            | 67.5             | 18225
[70;75)    | 1            | 72.5             | 5256.25
[75;80)    | 1            | 77.5             | 6006.25
[80;85]    | 1            | 82.5             | 6806.25
           | n = 40       |                  | Σ m_i (x_i0)² = 120950

The variance:

s² = (120950 − 40·53.75²)/39 ≈ 138.14

here 53.75 is the average.

The standard deviation:

s = √138.14 ≈ 11.75.

Standard Scores and Normal distribution
Each quantitative variable can be transformed into standard scores. The formula for a standard score is

z = (x − x̄)/s

where x is any raw score or unit of measurement, x̄ is the mean, and s is the standard deviation of the distribution of scores. First we calculate the deviation from the mean, and then divide it by the standard deviation. When we change raw scores to standard scores, we are expressing them in standard deviation units. These standard scores tell us how many standard deviation units any given raw score deviates from the mean. Practically all of the cases fall within 3 standard deviations on either side of the mean. The distribution of z scores has a mean of zero and a standard deviation equal to 1. Any time we see a standard score, we should be able to place exactly where an individual falls in a distribution. For example, a student with a z score of 2.5 is 2.5 standard deviations above the mean on that test distribution and has a very good score.

Example
The scores of students on three elementary school tests are presented below.

Student       | Geography | Spelling | Arithmetic
A             | 60        | 140      | 40
B             | 72        | 100      | 36
C             | 46        | 110      | 24
Etc.          | …         | …        | …
Mean          | 60        | 100      | 22
St. deviation | 10        | 20       | 6

The standard scores, re-expressed on a common scale with mean 50 and standard deviation 10 (i.e. 50 + 10z), will then be:

Student | Geography | Spelling | Arithmetic | Average
A       | 50        | 70       | 80         | 67
B       | 62        | 50       | 73         | 62
C       | 36        | 55       | 53         | 48
Etc.    | …         | …        | …          | …

There are many probability distributions in statistics, developed to analyse different types of problem. Several of them are covered here, and the most important of them is the Normal distribution, which we now turn to. It was discovered by the German mathematician Gauss in the nineteenth century (hence it is also known as the Gaussian distribution). Many random variables turn out to be normally distributed. Men's (or women's) heights are normally distributed. IQ (the measure of intelligence) is also normally distributed. Another example is a machine producing (say) bolts with a nominal length of 5 cm, which will actually produce bolts of slightly varying length (these differences would probably be extremely small) due to factors such as wear in the machinery, slight variations in the pressure of the lubricant, etc. These would result in bolts whose lengths vary in accordance with the Normal distribution. This sort of process is extremely common, with the result that the Normal distribution often occurs in everyday situations. The Normal distribution tends to arise when a random variable is the result of many independent, random influences added together, none of which dominates the others. A man's height is the result of many genetic influences, plus environmental factors such as diet, etc. As a result, height is normally distributed. If one takes the heights of men and women together, the result is not a Normal distribution, however. This is because there is one influence which dominates the others: gender. Men are, on average, taller than women. Many variables familiar in economics are not Normal, however; incomes, for example (although the logarithm of income is approximately Normal).
We shall learn techniques to deal with such circumstances in due course.

Figure 8. The Normal distribution

Having introduced the idea of the Normal distribution, what does it look like? It is presented below in graphical and then mathematical forms. Unlike the Binomial, the Normal distribution applies to continuous random variables such as height, and a typical Normal distribution is illustrated in Figure 8. Since the Normal distribution is a continuous one, it can be evaluated for all values of x, not just for integers. The figure illustrates the main features of the distribution:
• It is unimodal, having a single, central peak. If this were men's heights, it would illustrate the fact that most men are clustered around the average height, with a few very tall and a few very short people.
• It is symmetric, the left and right halves being mirror images of each other.
• It is bell-shaped.
• It extends continuously over all the values of x from minus infinity to plus infinity, although the value of f(x) becomes extremely small as these values are approached (the pages of this book being of only finite width, this last characteristic is not faithfully reproduced!). This also demonstrates that most empirical distributions (such as men's heights) can only be an approximation to the theoretical ideal, although the approximation is close and good enough for practical purposes.

In mathematical terms the formula for the Normal distribution is (x is the random variable):

f(x) = (1/(σ√(2π))) · exp(−(x − μ)²/(2σ²))

The mathematical formulation is not as formidable as it appears. μ and σ are the parameters of the distribution; π is 3.1416 and e is 2.7183. If the formula is evaluated using different values of x, the values of f(x) obtained will map out a Normal distribution. Fortunately, as we shall see, we do not need to use the mathematical formula in most practical problems.
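Still, the formula is easy to evaluate directly. A sketch of my own, using the height parameters (mean 174 or 175, standard deviations 9.6 and 15.3) that appear in this chapter's examples and figures:

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density f(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x-mu)^2 / (2*sigma^2))."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# The peak height at x = mu is 1/(sigma*sqrt(2*pi)): a smaller sigma gives a
# taller, narrower curve, which is what the figures below illustrate.
print(round(normal_pdf(174, 174, 9.6), 4))   # 0.0416
print(round(normal_pdf(174, 174, 15.3), 4))  # 0.0261
```

Evaluating normal_pdf over a grid of x values and plotting the result would reproduce the bell curves shown in the figures.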
The Normal is a family of distributions differing from one another only in the values of the parameters μ and σ. Several Normal distributions are drawn in Figures 9 to 11 for different values of the parameters. Whatever value of μ is chosen turns out to be the centre of the distribution. As the distribution is symmetric, μ is its mean. The effect of varying σ is to narrow (small σ) or widen (large σ) the distribution; σ turns out to be the standard deviation of the distribution. The Normal is a two-parameter family of distributions, and once the mean μ and the standard deviation σ (or equivalently the variance σ²) are known, the whole of the distribution can be drawn.

Figure 9. Graph of a normal distribution with mean 175 and standard deviation 9.6
Figure 10. Graph of a normal distribution with mean 175 and standard deviation 15.3
Figure 11. Graph of a normal distribution with mean 40 and standard deviation 25

The shorthand notation for a Normal distribution is x ~ N(μ, σ²), meaning "the variable x is Normally distributed with mean μ and variance σ²". Use of the Normal distribution can be illustrated with a simple example. The height of adult males is Normally distributed with mean height μ = 174 cm and standard deviation σ = 9.6 cm. Let x represent the height of adult males; then x ~ N(174, 92.16), and this is illustrated in Figure 12. Note that the shorthand notation contains the variance rather than the standard deviation. What is the probability that a randomly selected man is taller than 180 cm? If all men are equally likely to be selected, this is equivalent to asking what proportion of men are over 180 cm in height.
This is given by the area under the Normal distribution to the right of x = 180, i.e. the shaded area in Figure 12. The further from the mean of 174, the smaller the area in the tail of the distribution. One way to find this area would be to use the mathematical formula for the Normal density given above, but this requires the use of sophisticated mathematics.

Figure 12. Men's height distribution

Since this is a frequently encountered problem, the answers have been set out in tables of the standard Normal distribution; we can simply look up the solution. However, since there is an infinite number of Normal distributions (one for every combination of μ and σ²), it would be an impossible task to tabulate them all. The standard Normal distribution, which has a mean of zero and a variance of one, is therefore used to represent all Normal distributions. Before the table can be consulted, the data have to be transformed so that they accord with the standard Normal distribution. The required transformation is the z score, which was introduced above. This measures the distance between the value of interest (180) and the mean, measured in terms of standard deviations. Therefore we calculate

z = (x − μ)/σ = (180 − 174)/9.6 = 0.625

The density of the standard Normal distribution is

f(z) = (1/√(2π)) · exp(−z²/2)

and z is a Normally distributed random variable with mean 0 and variance 1, i.e. z ~ N(0, 1). This transformation shifts the original distribution μ units to the left and then adjusts the dispersion by dividing through by σ, resulting in a mean of 0 and a variance of 1. z is Normally distributed because x is Normally distributed; the transformation retains the Normal distribution shape, despite the changes to the mean and variance. If x followed some other distribution, then z would not be Normal either.
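The table look-up can also be reproduced in code: the standard Normal cumulative distribution function is available through the error function in Python's math module. A sketch for the height question above:

```python
import math

def normal_tail(x, mu, sigma):
    """P(X > x) for X ~ N(mu, sigma^2), via the standard Normal CDF."""
    z = (x - mu) / sigma                      # standardize: z ~ N(0, 1)
    return 0.5 * math.erfc(z / math.sqrt(2))  # 1 - Phi(z)

# Proportion of men taller than 180 cm, with X ~ N(174, 9.6^2):
p = normal_tail(180, 174, 9.6)
print(round(p, 3))   # 0.266, i.e. roughly 27% of men
```

This agrees with the value a standard Normal table gives for z = 0.625.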
Descriptive statistics in SPSS
Click Analyze -> Descriptive Statistics -> Descriptives. The screen should now look like this:

The box on the left shows the variables for which you could produce descriptive statistics such as means, etc. (Note that SPSS doesn't show you the first column, as that contains data in the form of words, and you can't calculate means on this kind of data! However, we have fooled it with the variable "sex": SPSS would allow us to work out descriptive statistics for "sex", even though they would be meaningless. Remember we used "1" and "2" merely as labels for "male" and "female" respectively, so they are not really "numbers" in an arithmetical sense at all.) You can move any or all of the variable names on the left into the box on the right. Highlight a variable name by clicking on it, and then click on the little arrow between the boxes. If variables are moved to the box on the right, SPSS will calculate basic descriptive statistics on them. As a default option, SPSS will work out the mean, standard deviation, and minimum and maximum values for each variable placed in the right-hand box. If you want other descriptive statistics, try clicking on the "Options" button. For this example, we will content ourselves with the default statistics, so move "height" and "weight" into the right-hand box, and then press "OK". The screen will switch from the data window to the Output Window, where the statistics are displayed. The first column in the table, "N", tells you how many valid observations there were. This is basically a reassurance that SPSS has used as many participants' data in its calculations as we thought it would, and that it hasn't dropped participants from the analysis because they had missing data. (With a small data set like ours, this isn't too useful, but if you had zillions of entries, it is always possible you made a mistake in entering the data and failed to notice.)
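For readers without SPSS, the same default summary (N, minimum, maximum, mean, standard deviation) can be sketched with Python's standard library. The height and weight values below are made-up illustration data, not the data set used in the SPSS walkthrough:

```python
import statistics

def descriptives(values):
    """Summary matching the SPSS Descriptives defaults: N, min, max, mean, SD."""
    return {
        "N": len(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Mean": statistics.mean(values),
        "Std. Deviation": statistics.stdev(values),  # (n - 1) denominator, as in SPSS
    }

height = [174, 180, 168, 165, 190, 177]   # hypothetical sample (cm)
weight = [70, 82, 61, 59, 95, 74]         # hypothetical sample (kg)

for name, values in [("height", height), ("weight", weight)]:
    print(name, descriptives(values))
```

As in SPSS, the "N" entry is a quick check that no observations were silently dropped.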
There follow the means, standard deviations, and minimum and maximum values for our two variables of height and weight.

Self-test questions
1. Given the data set 1200, 1400, 6000, 1900, 2100, which central tendency characteristics would you calculate?
2. How do you understand the 3 sigma rule? Give an example.
3. Give some examples when only the mode can be calculated.
4. Give some examples when it is better to calculate the mean, and when the median.
5. When can a histogram be used, and when a bar plot?

Exercises
1. Given data about daily car sales, make a frequency table and a graph:
3, 5, 7, 8, 2, 2, 2, 4, 5, 3, 3, 3, 5, 8, 7, 5, 4, 3, 2, 5, 6, 7, 8, 5, 4, 5
2. Given data about the number of customers in a shop, make a frequency table of grouped data:
316 357 385 345 345 301 398 376 318 351 234 356 436 395 368 347 230 341 345 387 361 341 345 343 324 385 464 243 451 379 435 371 279 348 375 359 326 332 351 381
3. The following are the scores of 60 students on a 100-item spelling test. Make a frequency table for grouped data:
84 74 83 46 80 57 59 94 76 72 52 77 48 48 61 65 86 65 73 54 74 64 60 63 68 41 66 55 46 75 76 64 68 67 68 27 67 53 68 78 59 72 71 67 68 62 58 69 54 62 64 72 61 67 39 57 57 75 69 61
4. Calculate the averages and standard deviation of the following test scores:
13 11 10 9 8 6 4 12 11 10 9 7 6 4 12 11 10 8 7 5 4 11 10 9 8 6 4 4 0 9 8 7 5 4
5. A group of seniors majoring in psychology made the following scores on the verbal test of the Graduate Record Examination. Calculate the averages and standard deviation of these scores:
750 640 600 570 540 490 450 400 700 630 590 570 540 490 440 380 680 630 590 560 530 480 440 360 660 610 580 560 500 470 430 350 650 600 570 540 490 470 420 320

Hypothesis testing
We use inferential statistics because it allows us to measure behaviour in samples to learn more about behaviour in populations that are often too large or inaccessible. We use samples because we know how they are related to populations.
For example, suppose the average score on a standardized exam in a given population is 1,000. If we select a random sample from this population, then on average the value of the sample mean will equal 1,000. In behavioural research, we select samples to learn more about populations of interest to us. In terms of the mean, we measure a sample mean to learn more about the mean in a population; therefore, we use the sample mean to describe the population mean. We begin by stating the value of a population mean, and then we select a sample and measure the mean in that sample. On average, the value of the sample mean will equal the population mean. The larger the difference or discrepancy between the sample mean and the population mean, the less likely it is that we could have selected that sample mean if the stated value of the population mean were correct. The method by which we select samples to learn more about characteristics in a given population is called hypothesis testing. Hypothesis testing is really a systematic way to test claims or ideas about a group or population. To illustrate, suppose we read an article stating that children in Lithuania watch an average of 3 hours of TV per week. To test whether this claim is true, we record the time (in hours) that a group of 20 Lithuanian children (the sample), among all children in Lithuania (the population), watch TV. The mean we measure for these 20 children is a sample mean. We can then compare the sample mean we measured to the population mean stated in the article. Hypothesis testing, or significance testing, is a method for testing a claim or hypothesis about a parameter in a population, using data measured in a sample. In this method, we test some hypothesis by determining the likelihood that a sample statistic could have been selected, if the hypothesis regarding the population parameter were true [8]. Different symbols are used to denote the parameters of a sample and of a population.
Table 2. Notation for population and sample parameters

           | Mean | Standard deviation | Variance
Sample     | x̄    | s                  | s²
Population | μ    | σ                  | σ²

After finishing this chapter, students will be able to choose an appropriate hypothesis, to formulate conclusions based on the results of statistical analysis, and to make decisions based on those results.

Keywords: null hypothesis, alternative hypothesis, paired samples, test statistic, alpha level.

The concepts of hypothesis testing
The goal of hypothesis testing is to determine the likelihood that a value stated for a population parameter, such as the mean, is true. In this section, we describe the four steps of hypothesis testing:
• state the hypotheses,
• set the criteria for a decision,
• compute the test statistic,
• make a decision.

State the hypotheses. We begin by stating the value of a population mean in a null hypothesis, which we presume is true. For the children watching TV example, we state the null hypothesis that children in Lithuania watch an average of 3 hours of TV per week. This is a starting point so that we can decide whether this is likely to be true, similar to the presumption of innocence in a courtroom. When a defendant is on trial, the jury starts by assuming that the defendant is innocent. The basis of the decision is to determine whether this assumption is true. Likewise, in hypothesis testing, we start by assuming that the hypothesis or claim we are testing is true. This is stated in the null hypothesis. The basis of the decision is to determine whether this assumption is likely to be true [8]. The null hypothesis (H0), stated as the null, is a statement about a population parameter, such as the population mean, that is assumed to be true. The null hypothesis is a starting point: we will test whether the value stated in the null hypothesis is likely to be true. Keep in mind that the only reason we are testing the null hypothesis is because we think it is wrong.
We state what we think is wrong about the null hypothesis in an alternative hypothesis (H1). For the children watching TV example, we may have reason to believe that children watch more than (>) or less than (<) 3 hours of TV per week. When we are uncertain of the direction, we can state that the value in the null hypothesis is not equal to (≠) 3 hours. In a courtroom, since the defendant is assumed to be innocent (this is the null hypothesis, so to speak), the burden is on the prosecutor to conduct a trial and show evidence that the defendant is not innocent. In a similar way, we assume the null hypothesis is true, placing the burden on the researcher to conduct a study to show evidence that the null hypothesis is unlikely to be true. Regardless, we always make a decision about the null hypothesis (that it is likely or unlikely to be true). An alternative hypothesis (H1) is a statement that directly contradicts a null hypothesis by stating that the actual value of a population parameter is less than, greater than, or not equal to the value stated in the null hypothesis. The alternative hypothesis states what we think is wrong about the null hypothesis; it is needed for Step 2. A decision made in hypothesis testing centres on the null hypothesis.

Set the criteria for a decision. To set the criteria for a decision, we state the level of significance for a test. This is similar to the criterion that jurors use in a criminal trial. Jurors decide whether the evidence presented shows guilt beyond a reasonable doubt (this is the criterion). Likewise, in hypothesis testing, we collect data to show that the null hypothesis is not true, based on the likelihood of selecting a sample mean from a population (the likelihood is the criterion). The likelihood or level of significance is typically set at 5% in behavioural research studies.
When the probability of obtaining a sample mean is less than 5% if the null hypothesis were true, we conclude that the sample we selected is too unlikely, and so we reject the null hypothesis [8]. The level of significance, or significance level, refers to a criterion of judgement upon which a decision is made regarding the value stated in a null hypothesis. The criterion is based on the probability of obtaining a statistic measured in a sample if the value stated in the null hypothesis were true. In behavioural science, the criterion or level of significance is typically set at 5%. When the probability of obtaining a sample mean is less than 5% if the null hypothesis were true, we reject the value stated in the null hypothesis. The alternative hypothesis establishes where to place the level of significance. Remember that the sample mean will equal the population mean on average if the null hypothesis is true. All other possible values of the sample mean are normally distributed (central limit theorem). The empirical rule tells us that at least 95% of all sample means fall within about 2 standard deviations (SD) of the population mean, meaning that there is less than a 5% probability of obtaining a sample mean that is beyond 2 SD from the population mean. For the children watching TV example, we can look for the probability of obtaining a sample mean beyond 2 SD in the upper tail (greater than 3), the lower tail (less than 3), or both tails (not equal to 3). Figure 13 shows that the alternative hypothesis is used to determine in which tail or tails to place the level of significance for a hypothesis test [8].

Figure 13. Alternative hypothesis and alpha level

Compute the test statistic. Suppose we measure a sample mean equal to 4 hours per week that children watch TV. To make a decision, we need to evaluate how likely this sample outcome is if the population mean stated by the null hypothesis (3 hours per week) is true.
We use a test statistic to determine this likelihood. Specifically, a test statistic tells us how far, in standard deviations, a sample mean is from the population mean. The larger the value of the test statistic, the further the distance, or number of standard deviations, a sample mean is from the population mean stated in the null hypothesis. The value of the test statistic is used to make a decision in Step 4. The test statistic is a mathematical formula that allows researchers to determine the likelihood of obtaining sample outcomes if the null hypothesis were true. The value of the test statistic is used to make a decision regarding the null hypothesis [8].

Make a decision. We use the value of the test statistic to make a decision about the null hypothesis. The decision is based on the probability of obtaining a sample mean, given that the value stated in the null hypothesis is true. If the probability of obtaining the sample mean is less than 5% when the null hypothesis is true, then the decision is to reject the null hypothesis. If the probability is greater than 5% when the null hypothesis is true, then the decision is to retain the null hypothesis. In sum, there are two decisions a researcher can make:
• reject the null hypothesis: the sample mean is associated with a low probability of occurrence when the null hypothesis is true;
• retain the null hypothesis: the sample mean is associated with a high probability of occurrence when the null hypothesis is true.

The probability of obtaining a sample mean, given that the value stated in the null hypothesis is true, is stated by the p value. The p value is a probability: it varies between 0 and 1 and can never be negative. In Step 2, we stated the criterion, the probability of obtaining a sample mean at which we will decide to reject the value stated in the null hypothesis, which is typically set at 5% in behavioural research.
To make a decision, we compare the p value to the criterion we set in Step 2. A p value is the probability of obtaining a sample outcome, given that the value stated in the null hypothesis is true. The p value for obtaining a sample outcome is compared to the level of significance. Significance, or statistical significance, describes a decision made concerning a value stated in the null hypothesis. When the null hypothesis is rejected, we reach significance; when the null hypothesis is retained, we fail to reach significance. When the p value is less than 5% (p < .05), we reject the null hypothesis. We will refer to p < .05 as the criterion for deciding to reject the null hypothesis, although note that when p = .05, the decision is also to reject the null hypothesis. When the p value is greater than 5% (p > .05), we retain the null hypothesis [8].

One more important definition in hypothesis testing is degrees of freedom (df). Degrees of freedom means freedom to vary. Suppose we have six scores whose mean must equal 10, so the six scores must sum to 60. The first five scores are free to take any values, but the sixth score is then fixed so that the mean is 10. For example, if the first five scores are 10, 12, 18, 16, 4 (sum 60), the sixth score must equal 0; if they are 2, 8, 4, 6, 10 (sum 30), the sixth score must equal 30. In each case we have 5 degrees of freedom. The next chapters present the different types of hypotheses for one or two samples.

t ratio, or Student's t

For hypothesis testing, Student's t distribution is usually used. The t distributions were discovered by William S. Gosset in 1908. Gosset was a statistician employed by the Guinness brewing company, which had stipulated that he not publish under his own name.
He therefore wrote under the pen name "Student". These distributions arise in the following situation. Suppose we have a simple random sample of size n drawn from a normal population with mean μ and standard deviation σ. Let x̄ denote the sample mean and s the sample standard deviation. Then the quantity

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} \qquad (1)$$

has a t distribution with n − 1 degrees of freedom. Note that there is a different t distribution for each sample size; in other words, it is a class of distributions. When we speak of a specific t distribution, we have to specify the degrees of freedom. The t density curves are symmetric and bell-shaped like the normal distribution and have their peak at 0. However, their spread is greater than that of the standard normal distribution. This is because the denominator of formula (1) contains s rather than σ. Since s is a random quantity varying from sample to sample, the variability in t is greater, resulting in a larger spread [3].

Hypothesis for one sample

The one-sample t test is used when we want to know whether our sample comes from a particular population but we do not have full population information available to us. For instance, we may want to know if a particular sample of college students is similar to or different from college students in general. The one-sample t test is used only for tests of the sample mean. Thus, our hypothesis tests whether the average of our sample (x̄) suggests that our students come from a population with a known mean (μ) or from a different population [3]. The statistical hypotheses for one-sample t tests take one of the following forms. In the equations below, μ refers to the population from which the study sample was drawn; m is replaced by the actual value of the population mean.
H₀: μ = m, H₁: μ ≠ m – two-sided alternative
H₀: μ = m, H₁: μ > m (or μ < m) – one-sided alternatives

The criterion for a decision is the critical t value, which depends on the degrees of freedom and the significance level α (usually 5%). Critical values (t_crit = t_α(n − 1)) can be found in Appendix 2.

Statistical test:

$$t = \frac{\bar{x} - m}{s/\sqrt{n}}$$

Decision making. The decision depends on the value of the test statistic and the alternative hypothesis.

Table 3. Decision making rules

H₁     | Reject H₀            | Retain H₀
μ ≠ m  | |t| > t_{α/2}(n − 1) | |t| ≤ t_{α/2}(n − 1)
μ > m  | t > t_α(n − 1)       | t ≤ t_α(n − 1)
μ < m  | t < −t_α(n − 1)      | t ≥ −t_α(n − 1)

Example. Let us check the hypothesis that the mean number of customers in the shop is less than 400 per day. Given data:

316 230 318 341 345 357 387 351 361 341
385 345 234 343 324 345 385 356 464 243
345 451 436 379 435 301 371 395 279 348
398 375 368 359 326 376 332 347 351 381

H₀: μ = 400, H₁: μ < 400
x̄ = 353,1; s = 50,78; n = 40

$$t = \frac{353,1 - 400}{50,78/\sqrt{40}} = -5,84$$

t_crit = t_{0,05}(39) = 2,021

As t = −5,84 < −t_{0,05}(39) = −2,021, the null hypothesis should be rejected.

Testing the difference of two means (independent and paired samples)

The following example illustrates the differences between independent samples (as encountered so far) and dependent samples, where slightly different methods of analysis are required. The example also illustrates how a particular problem can often be analysed by a variety of statistical methods. A company introduces a training programme to raise the productivity of its clerical workers, which is measured by the number of invoices processed per day. The company wants to know if the training programme is effective. How should it evaluate the programme?
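The one-sample example above can be reproduced in a few lines of Python (a sketch using only the standard library; the critical value 2,021 is taken from the handbook's Appendix 2, not computed):

```python
from statistics import mean, stdev
from math import sqrt

customers = [316, 230, 318, 341, 345, 357, 387, 351, 361, 341,
             385, 345, 234, 343, 324, 345, 385, 356, 464, 243,
             345, 451, 436, 379, 435, 301, 371, 395, 279, 348,
             398, 375, 368, 359, 326, 376, 332, 347, 351, 381]

m = 400                  # hypothesised population mean (H0)
n = len(customers)
x_bar = mean(customers)  # 353.1
s = stdev(customers)     # about 50.78 (sample standard deviation)

t = (x_bar - m) / (s / sqrt(n))
print(round(t, 2))       # -5.84

t_crit = 2.021           # from Appendix 2, df = n - 1 = 39
if t < -t_crit:          # H1: mu < 400, lower-tail test
    print("reject H0")
```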
There is a variety of ways of going about the task, as follows [3]:
• take two (random) samples of workers, one trained and one not trained, and compare their productivity;
• take a sample of workers and compare their productivity before and after training;
• take two samples of workers, one to be trained and the other not, and compare the improvement of the trained workers with any change in the other group's performance over the same time period.

We shall go through each method in turn, pointing out any possible difficulties.

Two independent samples

Assumptions:
• the two samples (x and y) are random samples independently drawn from normal distributions;
• the variances are the same (homogeneity of variance).

The statistical hypotheses:
• H₀: μx = μy, H₁: μx ≠ μy – two-sided alternative
• H₀: μx = μy, H₁: μx > μy (or μx < μy) – one-sided alternatives

The criterion for a decision is the critical t value, which depends on the degrees of freedom and the significance level α (usually 5%). Critical values (t_crit = t_α(n + m − 2)) can be found in Appendix 2. Here n is the 1st sample size and m is the 2nd sample size.

Statistical test:

$$t = \frac{\bar{x} - \bar{y}}{\sqrt{s_x^2(n-1) + s_y^2(m-1)}} \sqrt{\frac{nm(n+m-2)}{n+m}}$$

Decision making. The decision depends on the value of the test statistic and the alternative hypothesis.

Table 4. Decision making rules

H₁       | Reject H₀                | Retain H₀
μx ≠ μy  | |t| > t_{α/2}(n + m − 2) | |t| ≤ t_{α/2}(n + m − 2)
μx > μy  | t > t_α(n + m − 2)       | t ≤ t_α(n + m − 2)
μx < μy  | t < −t_α(n + m − 2)      | t ≥ −t_α(n + m − 2)

Example. Given data about men's and women's salaries, check the hypothesis that men's salary is bigger than women's.
Men's salary (Lt): 2500 2600 3100 4200 1200 1900 1800 1500 2700 2300 2900 3500 3300 3600 3400 4800 2600 2900
Women's salary (Lt): 2400 2500 2300 3100 3200 3900 1200 1300 1500 1800 2100 3400 3800 3500 3900 2100

H₀: μx = μy, H₁: μx > μy

Here x is the men's sample (n = 18) and y is the women's sample (m = 16).

x̄ = 2822,22; s_x = 916,87; ȳ = 2625; s_y = 931,31

$$t = \frac{2822,22 - 2625}{\sqrt{916,87^2 \cdot 17 + 931,31^2 \cdot 15}} \sqrt{\frac{18 \cdot 16 \cdot (18 + 16 - 2)}{18 + 16}} = 0,62$$

t_crit = t_{0,05}(32) = 2,042

As t = 0,62 < t_{0,05}(32) = 2,042, the null hypothesis should be retained – the average salaries are equal.

Two dependent samples

The dependent t-test (also called the paired t-test or paired-samples t-test) compares the means of two related groups to detect whether there are any statistically significant differences between these means. A dependent t-test is an example of a "within-subjects" or "repeated-measures" statistical test, which indicates that the same subjects are tested more than once. Thus, in the dependent t-test, "related groups" indicates that the same subjects are present in both groups. The reason that it is possible to have the same subjects in each group is that each subject has been measured on two occasions on the same dependent variable. For example, you might have measured 10 individuals' (subjects') performance in a spelling test (the dependent variable) before and after they underwent a new form of computerised teaching method to improve spelling. You would like to know if the computer training improved their spelling performance. Here we can use a dependent t-test because we have two related groups. The first related group consists of the subjects prior to the computerised spelling training, and the second related group consists of the same subjects at the end of the computerised training.
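Returning to the salary example above, the pooled two-sample t statistic can be computed directly from the formula (a sketch, standard library only; the critical value 2,042 is taken from Appendix 2):

```python
from statistics import mean, stdev
from math import sqrt

men   = [2500, 2600, 3100, 4200, 1200, 1900, 1800, 1500, 2700,
         2300, 2900, 3500, 3300, 3600, 3400, 4800, 2600, 2900]
women = [2400, 2500, 2300, 3100, 3200, 3900, 1200, 1300, 1500,
         1800, 2100, 3400, 3800, 3500, 3900, 2100]

n, m = len(men), len(women)
sx, sy = stdev(men), stdev(women)   # sample standard deviations

# Pooled two-sample t statistic, as in the formula above.
t = ((mean(men) - mean(women))
     / sqrt(sx ** 2 * (n - 1) + sy ** 2 * (m - 1))
     * sqrt(n * m * (n + m - 2) / (n + m)))
print(round(t, 2))  # 0.62

t_crit = 2.042      # Appendix 2, df = n + m - 2 = 32
print("reject H0" if t > t_crit else "retain H0")  # retain H0
```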
The statistical hypotheses:
• H₀: μx = μy, H₁: μx ≠ μy – two-sided alternative
• H₀: μx = μy, H₁: μx > μy (or μx < μy) – one-sided alternatives

The criterion for a decision is the critical t value, which depends on the degrees of freedom and the significance level α (usually 5%). Critical values (t_crit = t_α(n − 1)) can be found in Appendix 2. Here n is the number of pairs.

Statistical test:

$$t = \frac{\bar{d}}{\sqrt{s_d^2/n}},$$

where d_i = x_i − y_i are the differences.

Decision making. The decision depends on the value of the test statistic and the alternative hypothesis.

Table 5. Decision making rules

H₁       | Reject H₀            | Retain H₀
μx ≠ μy  | |t| > t_{α/2}(n − 1) | |t| ≤ t_{α/2}(n − 1)
μx > μy  | t > t_α(n − 1)       | t ≤ t_α(n − 1)
μx < μy  | t < −t_α(n − 1)      | t ≥ −t_α(n − 1)

Example. Let's check if sales before an advertisement are less than after. Here X – sales before the advertisement; Y – sales after the advertisement (see table below).

H₀: μx = μy, H₁: μx < μy

X:         18 16 18 12 20 17 18 20 22 20 10  8 20
Y:         20 22 24 10 25 19 20 21 23 20 10 12 22
d = X − Y: −2 −6 −6  2 −5 −2 −2 −1 −1  0  0 −4 −2

d̄ = −2,23; s_d = 2,42

$$t = \frac{-2,23}{2,42/\sqrt{13}} = -3,32$$

As t = −3,32 < −t_{0,05}(12) = −2,179, the null hypothesis should be rejected – the advertisement was effective.

Hypothesis testing in SPSS

One sample t test

Check the hypothesis that the average score on the statistics test is more than 70:

H₀: μ = 70, H₁: μ > 70

Click Analyze -> Compare Means -> One-Sample T Test. You will be presented with the One-Sample T Test dialogue box, as shown below. Transfer the dependent variable, Statistics test score, into the Test Variable(s) box. Enter the population mean you are comparing the sample against in the Test Value box, by changing the current value of "0" to "70".
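As a cross-check before moving to the SPSS output, the paired example above can be verified directly (a sketch, standard library only; the critical value 2,179 is taken from Appendix 2):

```python
from statistics import mean, stdev
from math import sqrt

before = [18, 16, 18, 12, 20, 17, 18, 20, 22, 20, 10, 8, 20]
after  = [20, 22, 24, 10, 25, 19, 20, 21, 23, 20, 10, 12, 22]

d = [x - y for x, y in zip(before, after)]   # differences X - Y
n = len(d)

t = mean(d) / (stdev(d) / sqrt(n))
print(round(t, 2))   # -3.32

t_crit = 2.179       # Appendix 2, df = n - 1 = 12
print("reject H0" if t < -t_crit else "retain H0")  # reject H0
```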
You will end up with the following screen and will receive the following results:

One-Sample Statistics
                      | N  | Mean    | Std. Deviation | Std. Error Mean
Statistics test score | 30 | 73,4333 | 11,16856       | 2,03909

One-Sample Test (Test Value = 70)
                      | t     | df | Sig. (2-tailed) | Mean Difference | 95% CI Lower | 95% CI Upper
Statistics test score | 1,684 | 29 | ,103            | 3,43333         | −,7371       | 7,6037

The table One-Sample Statistics shows general information about the sample. The table One-Sample Test presents the value of the test statistic (column t) and the p value (column Sig. 2-tailed). Rule for decision making (p denotes the two-tailed Sig. value):

H₁    | Reject H₀
μ ≠ m | p < α
μ > m | p < 2α (and t > 0)
μ < m | p < 2α (and t < 0)

Here p = 0,103 > 2 · 0,05, so H₀ is retained.

Two independent samples

Check the hypothesis that men's (x) average salary is bigger than women's (y):

H₀: μx = μy, H₁: μx > μy

Click Analyze -> Compare Means -> Independent-Samples T Test. You will be presented with the Independent-Samples T Test dialogue box, as shown below. Transfer the dependent variable, Salarie, into the Test Variable(s) box, and transfer the independent variable, Gender, into the Grouping Variable box. You then need to define the groups (gender): click on the Define Groups button, and in the Define Groups dialogue box enter 1 (Male) into the Group 1 box and 2 (Female) into the Group 2 box. Click Continue and OK. You will receive the following results. The first table, Group Statistics, presents general information about the samples (N – sample size, mean and standard deviation of men's and women's salaries respectively). Don't forget that this is only information about the sample, while all conclusions are made about the whole population.

Group Statistics
Gender | N  | Mean      | Std. Deviation | Std. Error Mean
Male   | 11 | 1872,7273 | 682,77509      | 205,86443
Female | 16 | 2293,7500 | 635,05249      | 158,76312

The second table is the Independent Samples Test.
To find out which row to read from, look at the large column labelled Levene's Test for Equality of Variances. This is a test that determines whether the two conditions have about the same or different amounts of variability between scores. You will see two smaller columns labelled F and Sig. (p value). Look in the Sig. column; it has one value, which you use to determine which row to read from. In this example the value in the Sig. column is 0,833. A value greater than 0,05 means that the variability in your two conditions is about the same: the scores in one condition do not vary much more than the scores in the second condition. Put scientifically, the variability in the two conditions is not significantly different, which is a good thing. In this example the Sig. value is greater than 0,05, so we read from the first row (Equal variances assumed). There you can find the value of the test statistic (column t) and the p value (column Sig. 2-tailed). Rule for decision making (p denotes the two-tailed Sig. value):

H₁      | Reject H₀
μx ≠ μy | p < α
μx > μy | p < 2α
μx < μy | p < 2α

Independent Samples Test
                            | Levene's F | Levene's Sig. | t     | df     | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% CI Lower | 95% CI Upper
Equal variances assumed     | ,045       | ,833          | −1,64 | 25     | ,113            | −421,02273      | 256,37429             | −949,03546   | 106,99001
Equal variances not assumed |            |               | −1,61 | 20,579 | ,121            | −421,02273      | 259,97287             | −962,33962   | 120,29416

Here p = 0,113 > 2 · 0,05, so H₀ is retained. That means that men's average salary is equal to women's average salary.

Two dependent (paired) samples

Check the hypothesis that sales before the advertisement are less than after:

H₀: μx = μy, H₁: μx < μy

Click Analyze -> Compare Means -> Paired-Samples T Test. Select the pair Sales before and Sales after, as shown below, and press OK.
You will receive the following results. The first table, Paired Samples Statistics, presents general information about the samples (N – sample size, mean and standard deviation of sales before and after the advertisement respectively). Don't forget that this is only information about the sample, while all conclusions are made about the whole population.

Paired Samples Statistics
Pair 1                     | Mean    | N  | Std. Deviation | Std. Error Mean
Sales before advertisement | 16,8462 | 13 | 4,27875        | 1,18671
Sales after advertisement  | 19,0769 | 13 | 5,10656        | 1,41630

The second table is the Paired Samples Test. In the Paired Differences part you will find descriptive statistics (mean and standard deviation) calculated for the differences of the data (Sales before advertisement − Sales after advertisement in this case). In the last three columns you can find the value of the test statistic (column t), the degrees of freedom (df) and the p value (column Sig. 2-tailed). Rule for decision making (p denotes the two-tailed Sig. value):

H₁      | Reject H₀
μx ≠ μy | p < α
μx > μy | p < 2α
μx < μy | p < 2α

Paired Samples Test (Sales before advertisement − Sales after advertisement)
Mean  | Std. Deviation | Std. Error Mean | 95% CI Lower | 95% CI Upper | t     | df | Sig. (2-tailed)
−2,23 | 2,42           | ,67133          | −3,69        | −,76         | −3,32 | 12 | ,006

Here p = 0,006 < 2 · 0,05, so H₀ should be rejected. This means that sales before the advertisement were lower than after; the advertisement was successful.

Self-test questions

1. In order to compare men's and women's average monthly income, we apply:
a) the hypothesis testing for two independent samples;
b) the hypothesis testing for two dependent samples;
c) analysis of variance.

2. There are data on demand before and after an advertisement.
Before: 111 115 114 118
After: 116 120 121 119
In order to determine whether the advertising was effective, we apply:
a) the hypothesis testing for two independent samples;
b) the hypothesis testing for two dependent samples;
c) regression analysis.

3.
In order to determine whether the average salary is the same, staff in Shopping Centre A, Shopping Centre B and Shopping Centre C were interviewed. We apply:
a) hypothesis testing;
b) a post-hoc test;
c) analysis of variance (ANOVA).

Exercises

1. Check the hypothesis that the mean number of customers in the shop is less than 400 per day.

316 230 318 341 345 357 387 351 361 341
385 345 234 343 324 345 385 356 464 243
345 451 436 379 435 301 371 395 279 348
398 375 368 359 326 376 332 347 351 381

2. Data about car sales in 20 days:

5, 7, 8, 2, 2, 2, 4, 5, 3, 3, 3, 5, 8, 7, 5, 4, 3, 2, 5, 6, 7, 8, 5, 4, 5

Check the hypothesis that the number of cars sold per day is not more than 5.

3. Compare the results on a statistics test of students in Economics and Computer Programming.

Economics students: 57 60 55 58 91 83 82 95 89 77 73 70 71 87 85 98 73 75 68 64 75 50 49 59 62
Computer programming students: 58 64 76 55 59 59 68 79 52 81 58 71 75 51 57 93 86 43 80 87

Check the hypothesis that Economics students are better at statistics than Computer Programming students.

4. Suppose a group of 10 workers is trained and compared to a group of 10 non-trained workers: trained workers process 25.5 invoices per day on average, compared to only 21 by non-trained workers. The question is whether this difference is significant, given that the sample sizes are quite small. Now consider the situation where a sample of workers is tested before and after training. The sample data are as follows:

Worker: 1  2  3  4  5  6  7  8  9  10
Before: 21 24 23 25 28 17 24 22 24 27
After:  23 27 24 28 29 21 24 25 26 28

5. Data on shampoo sales before and after an advertisement. Was the advertisement effective?

Before: 32 35 31 38 38 39 32 37 35 33 38 39 38 35 32
After:  38 36 34 40 38 37 35 38 39 33 40 39 40 36 36

6.
A group of students' marks on two tests, before and after instruction, were as follows:

Student: 1  2  3  4  5  6  7  8  9  10 11 12
Before:  14 16 11 8  20 19 6  11 13 16 9  13
After:   15 18 15 11 19 18 9  12 16 16 12 13

Test the hypothesis that the instruction had no effect, using both the independent-sample and paired-sample methods. Compare the two results.

Correlation analysis

Correlation is a measure of the relationship between two variables. The goal of this chapter is to explain when correlation analysis is needed and how to apply it properly. After finishing this chapter students will be able to choose the right correlation coefficient, will know how to calculate it and will be able to make decisions based on correlation analysis results.

Keywords: Pearson, Spearman, significance, size of relationship.

Examples of correlation analysis:
• high grades in English tend to be associated with high grades in foreign languages;
• both of these tend to be associated with high scores on intelligence tests;
• correlation between the price at which products are sold and the amount available for sale.

Such relationships do not necessarily imply that one variable is the cause of the other. In some situations we find that two variables are related because they are both related to, or caused by, a third variable. A correlation coefficient tells us two things:
• the direction of the relationship;
• the size of the relationship.

When two variables are positively related, as one increases, the other also increases, e.g. intelligence scores and academic grades. Other variables are inversely related: as one increases, the other decreases, e.g. the speed of an automobile and miles per litre of gasoline. The symbol of the correlation coefficient is r. The size of r varies from −1 to 1. If r is negative (−1 < r < 0), the variables are inversely related; if r is positive (0 < r < 1), the variables are positively related; if r = 0, the variables are not related. The most popular correlation coefficient is Pearson's r.
It shows the linear dependence between two quantitative variables:

$$r = \frac{\overline{xy} - \bar{x} \cdot \bar{y}}{s_x \cdot s_y}$$

where $\overline{xy}$ is the mean of the products $x_i y_i$, and $s_x$, $s_y$ are the standard deviations calculated with the n denominator.

Example. Scores of 35 university students on two statistics tests:

1st test: 80 95 94 101 105 89 106 92 105 107 111 114 83 112 91 88 105 106 105 80 85 93 85 92 90 89 85 96 85 98 101 106 112 93 110
2nd test: 61 28 74 46 44 38 72 41 49 69 82 76 39 64 77 50 55 86 63 31 57 70 43 70 54 51 58 63 73 71 76 76 59 71 59

First we calculate the necessary descriptive statistics:

x̄ = 96,83; ȳ = 59,89; s_x = 10,11; s_y = 14,91; $\overline{xy}$ = 5861,4

Then Pearson's correlation coefficient is:

$$r = \frac{5861,4 - 96,83 \cdot 59,89}{10,11 \cdot 14,91} = 0,42$$

On occasion it is inappropriate or impossible to calculate the correlation coefficient as described above, and an alternative approach is required. Sometimes the original data are unavailable but the ranks are. For example, schools may be ranked in terms of their exam results, but the actual pass rates are not available. Similarly, they may be ranked in terms of spending per pupil, with actual spending levels unavailable. Although the original data are missing, one can still test for an association between spending and exam success by calculating the correlation between the ranks. If extra spending improves exam performance, schools ranked higher on spending should also be ranked higher on exam success, leading to a positive correlation. In that case, instead of Pearson's coefficient, Spearman's rank-order correlation coefficient is calculated, where d_i are the differences between the ranks.
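Pearson's formula above can be implemented in a few lines. As an illustration, the temperature and ice-cream data used later in this handbook are taken; note that the formula assumes standard deviations computed with the n denominator, which is `statistics.pstdev` in Python:

```python
from statistics import mean, pstdev

temp = [25, 26, 24, 26, 24, 26, 22, 23, 27, 20, 20, 22, 28, 22, 26]
sales = [116, 120, 115, 119, 115, 118, 111, 113, 121,
         108, 109, 110, 122, 113, 121]

# r = (mean of products - product of means) / (s_x * s_y),
# with population-style (n denominator) standard deviations.
mean_xy = mean(x * y for x, y in zip(temp, sales))
r = (mean_xy - mean(temp) * mean(sales)) / (pstdev(temp) * pstdev(sales))
print(round(r, 2))  # 0.98
```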
$$r = 1 - \frac{6 \sum d_i^2}{n^3 - n}$$

Example. We have the scores on tests X and Y for seven individuals:

Test X:    18 17 14 13 12 10 8
Test Y:    24 28 30 26 22 18 15
Rank of X: 1  2  3  4  5  6  7
Rank of Y: 4  2  1  3  5  6  7
d_i:       −3 0  2  1  0  0  0
d_i²:      9  0  4  1  0  0  0   (sum: 14)

$$r = 1 - \frac{6 \cdot 14}{7^3 - 7} = 0,75$$

Significance of the correlation coefficient

These results come from a (small) sample, one of many that could have been collected. Once again we can ask: what can we infer about the population from the sample? Assuming the sample was drawn at random (which may not be justified), we can use the principles of hypothesis testing. As usual, there are two possibilities:
• the truth is that there is no correlation (in the population) and our sample exhibits such a large (absolute) value by chance;
• there really is a correlation between the two variables and the sample correctly reflects this.

Denoting the true but unknown population correlation coefficient by ρ (the Greek letter "rho"), the possibilities can be expressed as a hypothesis test:

H₀: ρ = 0, H₁: ρ ≠ 0

Test statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

which has a t distribution with n − 2 degrees of freedom. The five steps of the test procedure are therefore:
• write down the null and alternative hypotheses (shown above);
• choose the significance level of the test: 5% by convention;
• look up the critical value of the test for n − 2 degrees of freedom;
• calculate the test statistic using the equation above;
• compare the test statistic with the critical value. If |t| > t_crit = t_{0,05}(n − 2), H₀ is rejected: there is a less than 5% chance of the sample evidence occurring if the null hypothesis were true. Otherwise H₀ is retained.

Example. For the statistics tests example above (r = 0,42, n = 35):

$$t = \frac{0,42\sqrt{35-2}}{\sqrt{1-0,42^2}} = 2,66, \qquad t_{0,05}(33) = 2,042,$$

so |t| = 2,66 > t_{0,05}(33) = 2,042 and H₀ is rejected – the correlation is significant.
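The rank computation and the significance test can be sketched as follows (standard library only; the `ranks` helper is defined here for the no-ties case and is not a library function):

```python
from math import sqrt

x = [18, 17, 14, 13, 12, 10, 8]
y = [24, 28, 30, 26, 22, 18, 15]

def ranks(scores):
    """Rank scores in descending order: rank 1 = highest (assumes no ties)."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(v) + 1 for v in scores]

rx, ry = ranks(x), ranks(y)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # sum of squared rank differences
n = len(x)

r = 1 - 6 * d2 / (n ** 3 - n)
print(r)  # 0.75

# Significance: t = r * sqrt(n - 2) / sqrt(1 - r^2), df = n - 2 = 5
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
```

With only seven pairs, this t value falls below the two-tailed 5% critical value for 5 degrees of freedom (2,571 from a standard t table), so the rank correlation would not be judged significant at this sample size.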
Self-test questions

1. To determine the relationship between age and salary we calculate:
a) the chi-square coefficient;
b) Pearson's correlation coefficient;
c) Spearman's correlation coefficient.

2. The correlation coefficient is equal to 0,785. Please draw conclusions about the size and direction of the relationship.

3. Which coefficient would you calculate to describe the relationship between sex and salary? Why?

4. How do you understand the significance of a correlation?

Exercises

1. Given data about temperature and the number of ice creams sold. Is there a significant correlation between these variables?

Temp:      25  26  24  26  24  26  22  23  27  20  20  22  28  22  26
Ice cream: 116 120 115 119 115 118 111 113 121 108 109 110 122 113 121

2. The following table shows how a panel of nutrition experts and a panel of heads of household ranked 15 breakfast foods on their palatability. Calculate r as a measure of the consistency of the two rankings.

Nutrition experts:  3 7 11 9  1 4 10 8 5 13 12 2 15 6  14
Heads of household: 5 4 8  14 2 6 12 7 1 15 9  3 10 11 13

3. The following are the rankings which three judges gave to the work of ten corporate accounting department trainees. Is there a relation between the opinions of these three judges?

Judge A: 6 4 2 5 9  3 1 8 10 7
Judge B: 2 5 4 8 10 1 6 9 7  3
Judge C: 7 3 1 2 10 6 4 9 8  5

Regression analysis

Correlation and regression are techniques for investigating the statistical relationship between two, or more, variables. Regression analysis is a more sophisticated way of examining the relationship between two (or more) variables than is correlation. The major differences between correlation and regression are the following:
• regression can investigate the relationships between two or more variables;
• a direction of causality is asserted, from the explanatory variable (or variables) to the dependent variable;
• the influence of each explanatory variable upon the dependent variable is measured;
• the significance of each explanatory variable can be ascertained.
Let's say we are interested in whether the number of ice creams sold depends on temperature. In this example we assert that the direction of causality is from the temperature (X) to the ice cream sales (Y) and not vice versa. The temperature is therefore the explanatory variable (also referred to as the independent or exogenous variable) and the ice cream sales is the dependent variable (also called the explained or endogenous variable). Regression analysis describes this causal relationship by fitting a straight line drawn through the data which best summarises them; it is sometimes called "the line of best fit" for this reason. This is illustrated in Figure 14 for the ice cream and temperature data. Note that (by convention) the explanatory variable is placed on the horizontal axis, the explained on the vertical. The regression line is upward sloping (its derivation will be explained shortly) for the same reason that the correlation coefficient is positive, i.e. high values of Y are generally associated with high values of X and vice versa. Since the regression line summarises knowledge of the relationship between X and Y, it can be used to predict the value of Y given any particular value of X [9].

Figure 14. Regression line (ice cream sales on the vertical axis vs. temperature on the horizontal axis).

After finishing this chapter students will know the difference between linear and multiple regression, will be able to fit a regression model, to formulate conclusions based on the results of statistical analysis, and to make decisions based on those results.

Keywords: linear regression, determination, multiple regression, standardized variables.

Linear regression

The simplest and most popular model is the linear regression model. A simple linear regression model is a regression model where the dependent variable is continuous, explained by a single exogenous variable, and linear in the parameters.
Linear regression model:

$$y = a + bx$$

where y is the dependent variable; x is the independent (exogenous) variable; a and b are fixed coefficients to be estimated; a measures the intercept of the regression line. The coefficients a and b can be found using the formulas:

$$b = r\frac{s_y}{s_x}; \qquad a = \bar{y} - b\bar{x}.$$

Multiple regression

Multiple regression is a statistical technique that allows us to predict someone's score on one variable on the basis of their scores on several other variables. An example might help. Suppose we were interested in predicting how much an individual enjoys their job. Variables such as salary, extent of academic qualifications, age, sex, number of years in full-time employment and socioeconomic status might all contribute towards job satisfaction. If we collected data on all of these variables, perhaps by surveying a few hundred members of the public, we would be able to see how many and which of these variables gave rise to the most accurate prediction of job satisfaction. We might find that job satisfaction is most accurately predicted by type of occupation, salary and years in full-time employment, with the other variables not helping us to predict job satisfaction [9]. When using multiple regression in psychology, many researchers use the term "independent variables" to identify those variables that they think will influence some other "dependent variable". We prefer to use the term "predictor variables" for those variables that may be useful in predicting the scores on another variable that we call the "criterion variable". Thus, in our example above, type of occupation, salary and years in full-time employment would emerge as significant predictor variables, which allow us to estimate the criterion variable – how satisfied someone is likely to be with their job.
As we have pointed out before, human behaviour is inherently noisy and therefore it is not possible to produce totally accurate predictions, but multiple regression allows us to identify a set of predictor variables which together provide a useful estimate of a participant's likely score on a criterion variable [9].

When should I use multiple regression?

1. You can use this statistical technique when exploring linear relationships between the predictor and criterion variables – that is, when the relationship follows a straight line. (To examine non-linear relationships, special techniques can be used.)
2. The criterion variable that you are seeking to predict should be measured on a continuous scale (such as an interval or ratio scale). There is a separate regression method called logistic regression that can be used for dichotomous dependent variables (not covered here).
3. The predictor variables that you select should be measured on a ratio or interval scale.
4. Multiple regression requires a large number of observations. The number of cases (participants) must substantially exceed the number of predictor variables you are using in your regression. The absolute minimum is that you have five times as many participants as predictor variables. A more acceptable ratio is 10:1, but some people argue that this should be as high as 40:1 for some statistical selection methods.

Multiple regression model:

$$y = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

In a multiple regression model the predictor variables can be measured on different scales; for example, we may fit a model which describes how salary is explained by experience (measured in years) and IQ. In such a case it is quite difficult to interpret the results and impossible to decide which explanatory (predictor) variable has the stronger influence on the dependent variable. As an alternative, standardized coefficients β_i can be used. Standardized coefficients are dimensionless; a higher absolute value shows a bigger influence.
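A minimal multiple-regression sketch with two predictors, solving the normal equations by hand (all data here are hypothetical, invented purely for illustration; the standardized coefficients are obtained as β_i = b_i · s_{x_i}/s_y):

```python
from statistics import mean, stdev

# Hypothetical data, invented for illustration only:
# monthly salary (y) explained by years of experience (x1) and IQ (x2).
x1 = [1, 3, 4, 6, 8, 10, 12, 15]
x2 = [100, 105, 98, 110, 120, 108, 115, 125]
y = [1500, 1800, 1700, 2200, 2600, 2500, 2900, 3400]

mx1, mx2, my = mean(x1), mean(x2), mean(y)

def cross(u, mu, v, mv):
    """Centered cross-product sum."""
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

S11, S22 = cross(x1, mx1, x1, mx1), cross(x2, mx2, x2, mx2)
S12 = cross(x1, mx1, x2, mx2)
S1y, S2y = cross(x1, mx1, y, my), cross(x2, mx2, y, my)

# Normal equations for the slopes, solved by Cramer's rule:
#   S11*b1 + S12*b2 = S1y
#   S12*b1 + S22*b2 = S2y
det = S11 * S22 - S12 ** 2
b1 = (S1y * S22 - S2y * S12) / det
b2 = (S2y * S11 - S1y * S12) / det
a = my - b1 * mx1 - b2 * mx2

# Standardized (dimensionless) coefficients for comparing influence:
beta1 = b1 * stdev(x1) / stdev(y)
beta2 = b2 * stdev(x2) / stdev(y)

residuals = [yi - (a + b1 * u + b2 * v) for yi, u, v in zip(y, x1, x2)]
```

With an intercept included, the residuals of a least-squares fit average to zero and are uncorrelated with each predictor, which is a convenient internal check that the normal equations were solved correctly.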
Adequacy of regression model

Before prediction, the regression model needs to be checked. Usually the coefficient of determination (R²) is calculated and hypotheses about the coefficients b_i are tested. The R² coefficient describes how well the variability of the data is explained by the fitted regression model. It ranges from 0 to 1, and a higher coefficient indicates a better fit. In the case of linear regression, R² is the square of Pearson's correlation. If all fitted coefficients b_i are equal to 0, the regression model does not fit. First of all we test the hypothesis:

H₀: all b_i = 0
H₁: at least one b_i ≠ 0

If the null hypothesis is rejected, then for each i = 1, …, k the hypothesis

H₀: b_i = 0
H₁: b_i ≠ 0

is tested. All statistically insignificant variables should be removed from the model and a new model fitted.

Example

Using the data below about Temperature and Ice cream sales, a linear regression model will be fitted.

Temperature:  25  26  24  26  24  26  22  23  27  20  20  22  28  22  26
Ice cream:   116 120 115 119 115 118 111 113 121 108 109 110 122 113 121

The corresponding chart is shown in figure 14. Temperature is x, as it is the explanatory (independent) variable, and Ice cream is y. The corresponding descriptive statistics are:

x̄ = 24,07;  ȳ = 115,40;  s_x = 2,41;  s_y = 4,50;  mean of products x·y = 2787,87

Pearson's correlation:

r = (2787,87 − 24,07 · 115,40) / (2,41 · 4,50) = 0,98

Linear regression coefficients:

b = 0,98 · 4,50 / 2,41 = 1,82
a = 115,40 − 1,82 · 24,07 = 71,49

The fitted linear regression model is then y = 71,49 + 1,82x, with R² = 0,96. The model shows that when temperature increases by one degree, ice cream sales increase by 1,82 portions. Model adequacy is 96%. The regression line is shown in figure 14. The hypothesis about coefficient b is easier to test using SPSS.
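The arithmetic of this example is easy to reproduce in a few lines of Python, using the temperature and ice-cream data from the table above. (Carrying full precision rather than rounding intermediate values, r comes out near 0,976 and R² near 0,95; the handbook's 0,98 and 0,96 reflect intermediate rounding.)

```python
from statistics import mean, pstdev

temperature = [25, 26, 24, 26, 24, 26, 22, 23, 27, 20, 20, 22, 28, 22, 26]
ice_cream = [116, 120, 115, 119, 115, 118, 111, 113, 121,
             108, 109, 110, 122, 113, 121]

x_bar, y_bar = mean(temperature), mean(ice_cream)   # 24,07 and 115,40
sx, sy = pstdev(temperature), pstdev(ice_cream)     # 2,41 and 4,50
xy_bar = mean(t * c for t, c in zip(temperature, ice_cream))  # 2787,87

r = (xy_bar - x_bar * y_bar) / (sx * sy)   # Pearson's correlation
b = r * sy / sx                            # slope, approx. 1,82
a = y_bar - b * x_bar                      # intercept, approx. 71,49
```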
Correlation and regression analysis in SPSS

The following data show the number of bedrooms, the number of baths and the prices at which one-family houses sold recently. To calculate a correlation coefficient click Analyze->Correlate->Bivariate. Select at least two variables and choose a correlation coefficient (Pearson or Spearman). In this case the variables are quantitative, therefore the Pearson coefficient was selected. After clicking OK, the correlation table will appear:

Correlations
                                          Number of bedrooms   Price (Lt)
Number of bedrooms  Pearson Correlation   1                    ,971**
                    Sig. (2-tailed)                            ,000
                    N                     31                   31
Price (Lt)          Pearson Correlation   ,971**               1
                    Sig. (2-tailed)       ,000
                    N                     31                   31
**. Correlation is significant at the 0.01 level (2-tailed).

Pearson Correlation is the correlation coefficient (i.e. r); the correlation is significant if Sig. (2-tailed) (the p value) is less than 0,05. Significant correlations are also flagged with stars (**). The correlation between Number of bedrooms and Price is significant (Sig.=0,000<0,05); it is positive and very strong (r=0,971).

A correlation matrix can also be created. Click Analyze->Correlate->Bivariate and select all variables which should appear in the correlation matrix. After clicking OK, the correlation matrix will appear:

Correlations
                                          Number of bedrooms   Price (Lt)   Number of bath
Number of bedrooms  Pearson Correlation   1                    ,971**       ,772**
                    Sig. (2-tailed)                            ,000         ,000
                    N                     31                   31           31
Price (Lt)          Pearson Correlation   ,971**               1            ,742**
                    Sig. (2-tailed)       ,000                              ,000
                    N                     31                   31           31
Number of bath      Pearson Correlation   ,772**               ,742**       1
                    Sig. (2-tailed)       ,000                 ,000
                    N                     31                   31           31
**. Correlation is significant at the 0.01 level (2-tailed).

In the correlation matrix you can find the correlations between all possible pairs; usually the upper triangle is analysed (the lower part contains the same coefficients). All correlations are significant, strong and positive.

Linear regression

Click Analyze->Regression->Curve Estimation. Select the dependent variable (y) and the independent (explanatory) variable (x).
We will fit a model which shows how the Price of a house depends on the Number of bedrooms in the house. After clicking OK the results will appear:

Model Summary and Parameter Estimates
Dependent Variable: Price (Lt)
Equation: Linear
          Model Summary                    Parameter Estimates
R Square  F        df1  df2  Sig.         Constant     b1
,942      471,100  1    29   ,000         11161,290    79666,667
The independent variable is Number of bedrooms.

R Square is the R² coefficient. In the analysed case it is equal to 0,942, thus the model fits well. Sig. (the p value): if it is less than 0,05, the model is significant (coefficient b is not equal to 0). Parameter estimates: Constant is a; b1 is b. Therefore the regression model will be: y = 11161,29 + 79666,67x. A regression chart with the fitted regression line will also be presented.

Multiple regression

Click Analyze->Regression->Linear. A dialogue box will appear. Select the dependent variable (y) and all independent variables (x's). In this case we fit a multiple regression model which shows how House price (y) depends on the number of Bedrooms (x₁) and the number of Baths (x₂). After clicking OK you will get the results.

Model Summary

R is the multiple correlation coefficient, which can be interpreted in the same way as Pearson's correlation. R Square is the R² coefficient; in this case it is 0,942.

Model Summary
Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
1      ,971a  ,942      ,938               20437,352
a. Predictors: (Constant), Number of bath, Number of bedrooms

ANOVA

Here the hypothesis

H₀: all b_i = 0
H₁: at least one b_i ≠ 0

is tested. If Sig. (the p value) is less than 0,05 the null hypothesis should be rejected, i.e. at least one b_i ≠ 0. If the null hypothesis were retained, there would be no sense in analysing the model further (it is not statistically significant). Here Sig.=0,000<0,05, so the null hypothesis is rejected.

ANOVA table, sums of squares (Model 1): Regression 190429003141,219; Residual 11695190407,168; Total 202124193548,387.
The rest of the ANOVAᵃ table (Dependent Variable: Price (Lt); Predictors: (Constant), Number of bath, Number of bedrooms) gives df = 2; 28; 30, Mean Square = 95214501570,609 and 417685371,685, F = 227,957, Sig. = ,000.

Coefficients

Unstandardized Coefficients (column B) shows the multiple regression coefficients: (Constant) is a; Number of bedrooms is b₁; Number of bath is b₂. The multiple regression model would be:

y = 11624,72 + 80789,96x₁ − 1773,62x₂

The column Standardized Coefficients (Beta) shows the β_i: β₁ = 0,984 and β₂ = −0,018. As β₁ is much higher, Number of bedrooms has the stronger influence on the Price. But before making a decision, the column Sig. needs to be analysed. Here the hypotheses

H₀: b_i = 0
H₁: b_i ≠ 0, for i = 1, 2

are tested. The p value (Sig.) next to Number of bedrooms is 0,000<0,05, which shows that coefficient b₁ is significant, but the p value next to Number of baths is 0,806>0,05, which shows that coefficient b₂ is insignificant. This means the model needs to be refitted: only Number of bedrooms has a significant influence on the Price. In this case we would get the linear regression model analysed before.

Coefficientsᵃ
Model 1             B          Std. Error   Beta    t        Sig.
(Constant)          11624,720  11927,863            ,975     ,338
Number of bedrooms  80789,959  5869,749     ,984    13,764   ,000
Number of bath      -1773,620  7154,410     -,018   -,248    ,806
a. Dependent Variable: Price (Lt)

Self-test questions
1. Regression analysis can be applied to ... variables: a) rank; b) quantitative; c) nominal.
2. Given the equation y = 12.3 + 1.2x, where y is income and x is advertising expenditure, this is: a) a linear regression equation; b) a linear trend equation; c) a multiple regression equation.
3. Which model, linear or multiple regression, is more accurate? Why?
4. Given the equation Income = 3,5 + 0,7·Advertisement, how would you interpret the number 0,7?

Exercises

1.
The following sample data show the demand for a product (in thousands of units) and its price (in Litas) charged in 12 different market areas.

X:  18  16  18  12  20  17  18  20  22  20  10   8  20
Y:  20  22  24  10  25  19  20  21  23  20  10  12  22

2. The following sample data show the average annual yield of wheat (in bushels per acre) in a given country and the annual rainfall (in centimeters) measured from September to August:

Rainfall (x):        8,8  10,3  15,9  13,1  12,9   7,2  11,3  18,6  14,3   9,1   8,9  11,4  12,5
Yield of wheat (y): 39,6  42,5  69,3  52,4  60,5  26,7  50,2  78,6  59,1  41,3  40,5  43,5  48,8

3. Suppose that we are given the following sample data to study the relationship between the grades students get in a certain examination, their IQs, and the number of hours they studied for the test:

Hours  IQ   Grade     Hours  IQ   Grade
8      98   56        6      105  73
5      99   44        9      108  71
11     118  79        6      99   85
13     94   72        7      110  47
10     109  70        5      112  58
5      116  54        4      107  63
18     97   94        11     105  74
15     100  85        13     101  87
2      99   33        15     103  92
8      114  65        14     109  81

4. The following sample data were collected to determine the relationship between two processing variables and the current gain of a certain kind of transistor:

Diffusion time  Sheet resistance  Current gain
1,5             66                5,3
2,4             111               8,1
2,5             88                7,8
2               78                7,2
0,5             69                7,4
0,7             66                6,5
1,2             141               9,8
1,6             123               12,6
2,6             93                10,8
1,8             128               13,1
0,3             105               9,1
2,1             99                10,4
0,9             74                7,7
1,7             71                7,8

Time series analysis

A time series is a collection of observations of well-defined data items obtained through repeated measurements over time. For example, measuring the value of retail sales each month of the year would comprise a time series, because sales revenue is well defined and consistently measured at equally spaced intervals. Data collected irregularly or only once are not a time series. After finishing this chapter students will be able to build a system of statistical indicators taking into account the nature and specifics of the socio-economic phenomenon and the aims of the research.
They will also be able to create a time series model and use it for analysis and forecasting.

Keywords: trend, seasonality, irregular component, moving average, exponential smoothing.

Components of time series

An observed time series can be decomposed into three components: the trend (long-term direction), the seasonal component (systematic, calendar-related movements) and the irregular component (unsystematic, short-term fluctuations).

The trend is the smooth or regular underlying movement of a series over a fairly long period of time.

Seasonal variation is the movement in a time series which recurs year after year in the same months (or the same quarters) of the year with more or less the same intensity.

Irregular variation consists of fluctuations apart from the trend, seasonal or cyclical components, caused by special events.

Stock and flow series

Time series can be classified into two different types: stock and flow. A stock series is a measure of certain attributes at a point in time and can be thought of as a "stocktake". For example, the Monthly Labour Force Survey is a stock measure because it takes stock of whether a person was employed in the reference week. Flow series are series which measure activity over a given period; examples include surveys of Retail Trade activity. Manufacturing is also a flow measure because a certain amount is produced each day, and these amounts are then summed to give a total value of production for a given reporting period.

The main difference between a stock and a flow series is that a flow series can contain effects related to the calendar (trading day effects). Both types of series can be seasonally adjusted using the same seasonal adjustment process.

Seasonal effects

A seasonal effect is a systematic, calendar-related effect. Examples include the sharp escalation in most retail series around December in response to the Christmas period, or an increase in water consumption in summer due to warmer weather.
Other seasonal effects include trading day effects (the number of working or trading days in a given month differs from year to year, which will impact upon the level of activity in that month) and moving holidays (the timing of holidays such as Easter varies, so the effects of the holiday are experienced in different periods each year).

Seasonal adjustment

Seasonal adjustment is the process of estimating and then removing from a time series influences that are systematic and calendar related. Observed data need to be seasonally adjusted because seasonal effects can conceal both the true underlying movement in the series and certain non-seasonal characteristics which may be of interest to analysts [1].

Smoothing methods

When a time series is fairly stable and has no significant trend, seasonal or cyclical effects, smoothing methods can be used to average out the irregular component of the time series [1]. Common smoothing methods are:
• Moving Averages
• Centered Moving Averages
• Weighted Moving Averages
• Exponential Smoothing

The moving averages method consists of computing an average of the most recent n data values of the series and using this average to forecast the value of the time series for the next period. Moving averages are useful if one can assume that the item to be forecast will stay fairly steady over time.
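The moving-average family of methods can be written out as short functions. A sketch in Python, using the monthly sales series from the examples that follow (the function names are ours, and the simple variant is shown alongside the weighted variant described below):

```python
sales = [10, 12, 16, 13, 17, 19, 15, 20, 22, 19, 21, 19]

def moving_average(data, n):
    """Average of each window of the n most recent values."""
    return [sum(data[i - n:i]) / n for i in range(n, len(data) + 1)]

def weighted_moving_average(data, weights):
    """Weighted window average; the last weight applies to the newest value."""
    n = len(weights)
    return [sum(w * v for w, v in zip(weights, data[i - n:i])) / sum(weights)
            for i in range(n, len(data) + 1)]

ma = moving_average(sales, 3)                    # starts (10+12+16)/3 = 12.67
wma = weighted_moving_average(sales, [1, 2, 3])  # starts (1*10+2*12+3*16)/6 = 13.67
```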
A moving average is a series of arithmetic means; used only for smoothing, it provides an overall impression of the data over time.

Moving Average = (sum of the n most recent data values) / n

Example

Month      Actual Sales  Three-Month Moving Average (forecast)
January    10            –
February   12            –
March      16            –
April      13            (10+12+16)/3 = 12,67
May        17            (12+16+13)/3 = 13,67
June       19            (16+13+17)/3 = 15,33
July       15            16,33
August     20            17,00
September  22            18,00
October    19            19,00
November   21            20,33
December   19            20,67
(next Jan) –             (19+21+19)/3 = 19,67

(Chart: actual sales against the moving-average forecast.)

The centered moving average method consists of computing an average of n periods' data and associating it with the midpoint of those periods. For example, the average for periods 5, 6 and 7 is associated with period 6. This methodology is useful when computing seasonal indexes [1]. Using May, June and July from the example: (17+19+15)/3 = 17 is associated with June.

The weighted moving averages method is used when a trend is present. Older data are usually less important, so the more recent observations are typically given more weight than older ones. The weights are based on intuition; they often lie between 0 and 1 and sum to 1.

WMA = Σ(weight for period i × value in period i) / Σ(weights)

Example (weights 1, 2, 3, with weight 3 on the most recent month)

Month      Actual Sales  Weighted Moving Average (forecast)
January    10            –
February   12            –
March      16            –
April      13            (1·10+2·12+3·16)/6 = 13,67
May        17            (1·12+2·16+3·13)/6 = 13,83
June       19            15,50
July       15            17,33
August     20            16,67
September  22            18,17
October    19            20,17
November   21            20,17
December   19            (1·22+2·19+3·21)/6 = 20,50

(Chart: actual sales against the weighted moving-average forecast.)

Disadvantages of M.A.
methods:
• increasing n makes the forecast less sensitive to changes;
• they do not forecast trends well;
• they require sufficient historical data;
• moving averages and weighted moving averages are effective in smoothing out sudden fluctuations in the demand pattern in order to provide stable estimates, but
• they require maintaining extensive records of past data, whereas
• exponential smoothing requires little record keeping of past data.

Exponential smoothing

Exponential smoothing is probably the most widely used class of procedures for smoothing discrete time series in order to forecast the immediate future. This popularity can be attributed to its simplicity, its computational efficiency, the ease of adjusting its responsiveness to changes in the process being forecast, and its reasonable accuracy. The idea of exponential smoothing is to smooth the original series the way the moving average does and to use the smoothed series in forecasting future values of the variable of interest. In exponential smoothing, however, we want to allow the more recent values of the series to have greater influence on the forecast of future values than the more distant observations.

Exponential smoothing is a simple and pragmatic approach to forecasting, whereby the forecast is constructed from an exponentially weighted average of past observations. The largest weight is given to the present observation, less weight to the immediately preceding observation, even less weight to the observation before that, and so on (exponential decay of the influence of past data) [1].

Exponential smoothing model:

F(t+1) = α·y(t) + (1 − α)·F(t),

where:
F(t+1) is the forecast value for period t + 1;
y(t) is the actual value for period t;
F(t) is the forecast value for period t;
α is alpha (the smoothing constant).
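The recursion above takes only a few lines of code. A sketch in Python with α = 0,2, using the quarterly sales series from the worked example that follows; the MSE and RMSE accuracy measures discussed below are computed as well:

```python
from math import sqrt

sales = [23, 40, 25, 27, 32, 48, 33, 37, 37, 50]
alpha = 0.2

# F(t+1) = alpha*y(t) + (1 - alpha)*F(t); the first forecast is set to y(1).
forecasts = [sales[0]]                 # forecast for quarter 2 is 23
for y in sales[1:-1]:                  # produce forecasts for quarters 3..10
    forecasts.append(alpha * y + (1 - alpha) * forecasts[-1])

# Forecast errors over quarters 2..10 and the resulting accuracy measures.
errors = [y - f for y, f in zip(sales[1:], forecasts)]
mse = sum(e * e for e in errors) / len(errors)   # Mean Squared Error
rmse = sqrt(mse)                                 # Root Mean Squared Error
```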
Example (α = 0,2)

Quarter (t)  Sales y(t)  Forecast
1            23          –
2            40          23
3            25          0,2·40 + 0,8·23 = 26,4
4            27          0,2·25 + 0,8·26,4 = 26,12
5            32          26,30
6            48          27,44
7            33          31,55
8            37          31,84
9            37          32,87
10           50          0,2·37 + 0,8·32,87 = 33,70

(Chart: actual sales against the exponentially smoothed forecast.)

Measures of Forecast Accuracy

Mean Squared Error (MSE). The average of the squared forecast errors for the historical data is calculated. The forecasting method or parameter(s) which minimize this mean squared error are then selected.

Mean Absolute Deviation (MAD). The mean of the absolute values of all forecast errors is calculated, and the forecasting method or parameter(s) which minimize this measure are selected. The mean absolute deviation is less sensitive to individual large forecast errors than the mean squared error.

You may choose either of the above criteria for evaluating the accuracy of a method (or parameter).

Example: MSE for the exponential smoothing method

Quarter (t)  Sales  Forecast  Squared error
2            40     23        289
3            25     26,4      1,96
4            27     26,12     0,77
5            32     26,30     32,54
6            48     27,44     422,85
7            33     31,55     2,10
8            37     31,84     26,63
9            37     32,87     17,04
10           50     33,70     265,78
Sum                           1058,67

MSE = 1058,67 / 9 = 117,63

The real accuracy is shown by the root of the MSE (RMSE): RMSE = √117,63 ≈ 10,85.

Linear trend
Linear trend is the simplest time series model; its equation is

y = a + bt

Coefficients a and b can be found by solving the system of linear equations:

n·a + b·Σt = Σy
a·Σt + b·Σt² = Σt·y

Example

For the years 2000 to 2009 Corporation X reported annual revenue:

Year   Revenue (mln Lt)  t    t²   y·t
2000   10,3              1    1    10,3
2001   13,5              2    4    27
2002   13,7              3    9    41,1
2003   14,2              4    16   56,8
2004   15                5    25   75
2005   15,1              6    36   90,6
2006   16,3              7    49   114,1
2007   17,5              8    64   140
2008   19                9    81   171
2009   18,3              10   100  183
Total  152,9             55   385  908,9

(Chart: Revenue against Time; the pattern looks similar to a linear trend.)

The system of linear equations:

10a + 55b = 152,9
55a + 385b = 908,9

From here a = 10,76 and b = 0,82, so the equation is y = 10,76 + 0,82t.

Trend in SPSS

Given data about Production (in units) from 1997 to 2011, to fit a linear trend click Analyze->Regression->Curve Estimation. In the Dependent list select Production; as Independent select "Time". In the Models list choose "Linear". After clicking OK you will receive results similar to linear regression results, and the interpretation is also very similar. In the Model Summary and Parameter Estimates table you can find R², which now is very low (0,078), and Sig., which is 0,312>0,05, i.e. the linear model is not adequate. In the chart you see the observed data and the linear model (a straight line), which does not fit at all.

Model Summary and Parameter Estimates
Dependent Variable: Production
Equation: Linear
R Square  F      df1  df2  Sig.   Constant  b1
,078      1,107  1    13   ,312   238,238   2,129

But from the chart we see that the dependence is not linear; it seems to be quadratic. A quadratic trend model can also be fitted using SPSS. Click Analyze->Regression->Curve Estimation. In the Dependent list select Production; as Independent select "Time". In Models leave "Linear" and also select "Quadratic" (then we can compare the two models).
The R² coefficient of the quadratic model (0,830) is much higher than that of the linear model. Also, the Sig. of the quadratic model is 0,000, which is less than 0,05. That means the quadratic model is an adequate model for the analysed data.

Model Summary and Parameter Estimates
Dependent Variable: Production
Equation   R Square  F       df1  df2  Sig.   Constant  b1       b2
Linear     ,078      1,107   1    13   ,312   238,238   2,129
Quadratic  ,830      29,341  2    12   ,000   316,033   -25,328  1,716

The quadratic trend model has the expression:

y = a + b₁t + b₂t²

The fitted model for the analysed data is: y = 316,03 − 25,33t + 1,72t². The chart also shows that the quadratic trend model fits much better than the linear one.

Self-test questions
1. Describe and visualize graphically the seasonal component of a time series.
2. Determine whether the time series is characterized by a linear trend:
Period  1     2     3     4     5     6     7    8
Value   21,3  21,9  21,5  21,8  21,3  21,7  22   21,4
3. How do you understand the exponential smoothing method?
4. When is it better to use centred moving averages instead of moving averages?
5. What is the difference between MSE and RMSE? Which one would you use to check the accuracy of a prediction?

Exercises

Choose the best models for the given time series data.

1. Data about average fuel price

Year  Average fuel price by month (1–12)
2008  2,32  2,34  2,35  2,41  2,58  2,63  2,71  2,95  3,01  3,03  3,14  3,18
2009  3,21  3,32  3,35  3,43  3,51  3,78  3,83  3,91  3,99  4,02  4,05  4,13
2010  4,04  3,99  3,97  3,91  3,84  3,83  3,79  3,77  3,63  3,58  3,39  3,38
2011  3,44  3,68  3,71  3,93  4,04  4,18  4,21  4,33  4,39  4,41  4,43  4,48

2.
Data about Customers in the shop

Year  Customers (thous.) by quarter (I–IV)
2004  3,1  5,4  6,9  7,3
2005  3,3  5,2  6,1  8,1
2006  2,9  6,3  6,9  7,3
2007  3,2  5,4  6,5  7,7
2008  3,1  5,4  6,9  7,3
2009  3,6  5,8  6,7  7,5

3. Data about Profit

Year    1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007   2008   2009   2010   2011
Profit  2,05  2,33  2,66  3,03  3,45  3,93  4,47  5,09  5,80  6,60  7,52  8,57  9,76  11,11  12,65  14,41  16,41  18,69

4. Data about Computer sales

Year            1990  1991  1992  1993  1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011
Sales (thous.)  13,4  13,5  13,6  13,5  14,7  15    15,3  15,8  15,5  16,1  16,4  16    16,8  17    17,3  17,5  18    21,3  22,6  21,5  23,4  23,3

Appendix 1. t values
Appendix 2. Chi square values
Appendix 3. F values

References
1. http://www.statoek.wiso.uni-goettingen.de/veranstaltungen/graduateseminar/SmoothingMethods_Narodzonek-Karpowska.pdf
2. Fernandez M. Statistics for Business and Economics. 2009, Ventus Publishing ApS
3. Tyrrell S. SPSS: Stats Practically Short and Simple. 2009, Ventus Publishing ApS
4. Smith R. Applied Statistics and Econometrics: Notes and Exercises. 2009, Birkbeck
5. Barrow M. Statistics for Economics, Accounting and Business Studies. 2009, Pearson Education Limited
6. Kenny D. A. Statistics for the Social and Behavioural Sciences. 1987
7. http://www.sussex.ac.uk/Users/grahamh/RM1web/SPSShdt1-2012.pdf
8. http://www.sagepub.com/upm-data/40007_Chapter8.pdf
9. http://www.law.uchicago.edu/files/files/20.Sykes_.Regression.pdf
10. http://www.cliffsnotes.com/math/statistics/sampling/populations-samples-parameters-and-statistics
11. http://sociology.about.com/od/Statistics/a/Descriptive-inferential-statistics.htm
12. https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php