An approach to Descriptive Statistics through real situations

MaMaEuSch
Management Mathematics for European Schools
94342 - CP - 1 - 2001 - 1 - DE - COMENIUS - C21
http://www.mathematik.unikl.de/~mamaeusch

Paula Lagares Barreiro, Federico Perea Rojas-Marcos, Justo Puerto Albandoz
University of Seville

This project has been carried out with the partial support of the European Community in the framework of the Sokrates programme. The content does not necessarily reflect the position of the European Community, nor does it involve any responsibility on the part of the European Community.

Contents

1 One-dimensional Descriptive Statistics
  1.1 Objectives
  1.2 The example: an opinion poll
  1.3 Population and samples
  1.4 Types of statistical variables: quantitative (discrete and continuous) and qualitative
  1.5 Frequency tables: absolute, relative and percentage frequencies
  1.6 Graphical methods
    1.6.1 Bar graph
    1.6.2 Histogram
    1.6.3 Frequency polygon
    1.6.4 Pie Chart
    1.6.5 Pictogram
    1.6.6 Stem and leaf plot
    1.6.7 Some remarks
  1.7 Measures of central tendency: mean, median, mode, quantiles
  1.8 Measures of variability: Range, variance, standard deviation
  1.9 Joint use of the mean and the standard deviation: Tchebicheff's theorem, Pearson's coefficient of variation, z-scores
    1.9.1 Tchebicheff's theorem
    1.9.2 Pearson's coefficient of variation
    1.9.3 Z-scores
2 Analysis of the opinion poll
  2.1 Conclusions
3 Two-dimensional Descriptive Statistics
  3.1 Objectives
  3.2 The example: an opinion poll
  3.3 Introduction and simple tables
  3.4 Frequency tables, marginal distributions and conditional distributions
  3.5 Scatterplots
  3.6 Functional dependence and statistical dependence
  3.7 Covariance
  3.8 Linear correlation
  3.9 Regression lines

Chapter 1 One-dimensional Descriptive Statistics

We are going to study an opinion poll. You will fill in a questionnaire, so that we can see what you think about many topics, and we will study some characteristics such as height, number of brothers/sisters, etc.
We will check whether your opinions coincide with those of your friends, and also whether there are many people in your classroom with characteristics similar to yours. For instance, how many of your classmates are taller than you? And how many of them have the same number of brothers/sisters as you? Before continuing, we state the main objectives that we want to achieve in this chapter.

1.1 Objectives

• To distinguish the different types of statistics.
• To determine which type of statistical procedure we should use, depending on the type of data that we are studying.
• To get to know the concepts of central tendency and variability of a set of data.
• To determine the parameters of a statistical distribution.
• To study the coefficient of variation.
• To motivate through information given in examples and exercises about social, ecological, economic topics, etc.

1.2 The example: an opinion poll

From now on, we will work with an opinion poll. We want to know some things about the students in your class. We will ask you for some personal data, and then you will give us some information and opinions about many topics, such as sports, food, etc. Our poll will be anonymous, so that each of you can feel free to answer without worrying about who reads those opinions later. With these data, we will pose some interesting questions about ourselves as a group, which we can perhaps use as a guide to answer other questions about a wider group of people. For instance,

• Which is the most frequent height in your class?
• Can you consider your weekly pay normal compared with that of your classmates?
• How many of you practice sports often? How many have breakfast before coming to school?
• What kind of food do you eat most: fruit, milk, coffee, fish, ...?

We will see that, by analyzing the answers we get in the poll, you will be able to answer all the questions we have posed.
Surely, at the end of this chapter we will have all the answers. But first of all, we are going to present the concepts that you will need.

1.3 Population and samples

Before answering all those questions, we have to clarify some things. Who do we want to get information about? We have already said that we want to know things about the students of your level, so our population will not be only the students of this class, but all the students of your level. But it would take too long to ask all those students, so we have decided to take a representative group from all the classrooms of your level: your class, in this case. So you are the sample. Furthermore, each member of the population is called a data point.

Let us make some comments about what we have just said. First of all, we may want to study some characteristic of animals, plants or things, for instance the battery life of mobile phones; in this case the population is not "human", but consists of the different types of mobile phones. Moreover, there are situations in which sampling is even more justified than in our case, for different reasons. If we want to know how all Spanish people will vote, we cannot ask every inhabitant older than 18, because there are millions of them and that would take a lot of time and money. To study, for example, the average life of light bulbs, we cannot test all of them, because each test means that a bulb burns out; this is an example of a situation in which sampling means destroying a data point. Therefore, sampling is justified in many situations for reasons of time, money or destruction of the data points.

Exercise 1.3.1 The University studies demand poll in Andalusia was carried out in 2001 to find out what the 65356 high school students wanted to study and why. To that end, data from 8500 students from all over Andalusia were collected. Could you say which are the sample and the population in this example?
Which are the reasons for choosing a sample in this example?

1.4 Types of statistical variables: quantitative (discrete and continuous) and qualitative

In order to answer many of our questions correctly, the first thing we have to do is decide which kind of method to apply to our data. Notice that not all the data we can collect are of the same kind. For instance, think about the answers to three questions of our poll:

1. The answer to the question sex (male or female).
2. The answer to the question number of brothers/sisters.
3. The answer to the question height.

The first thing we notice is that the answer to the first question is not numerical, whereas the answers to questions two and three are numerical. The characteristic corresponding to the first question is called qualitative, whereas the ones related to questions two and three are called quantitative. It is easy to see that quantitative variables allow operations that we cannot perform with qualitative characteristics. We call the different possibilities of a qualitative variable categories, and those of a quantitative variable values.

Let us now see the differences between variables 2 and 3, because this is a little more complicated. The variable number of brothers/sisters takes numerical values that we can call "isolated", 0, 1, 2, 3, ..., but it cannot take any value between two of them; for instance, it cannot take the value 3.5. This does not happen with the variable height. In fact, height can take any value between certain limits, and we can measure it as precisely as we want. We say that height can take any value in an interval. So the variable in case 2 is called discrete and the variable in case 3 is called continuous.

Exercise 1.4.1 Decide whether these variables are qualitative or quantitative and, if they are quantitative, whether they are discrete or continuous.

1. Number of babies born in a day.
2. Blood group of a person.
3. Time needed to solve a problem.
4. Number of questions in an exam.
5. Temperature of a person.
6. Political party voted for in the last elections.
7. Number of goals scored by a player in a season.

1.5 Frequency tables: absolute, relative and percentage frequencies

It is now time to start processing the data we have collected with our poll. The data that we have about number of brothers/sisters are

  0 1 3 2 0 1 0 1 1 2 2 3 1 2 1 1 1 1 0 0 4 2 3 1 2 1 2 1 1 0

while for the weights we have

  52 66 54 70 46 62 59 68 49 50 77 57 63 67 58 54 52 47 74 72 80 82 60 75 53 55 69 67 50 52

We can pose a lot of questions: how many of my classmates have the same number of brothers/sisters as I have? How many have more than me? And fewer? How many of my classmates weigh more than me? And less? To answer these questions, we have to count how many times each answer appears. Let us start with the number of brothers/sisters. This is what we have:

  0   ||||| |            → 6
  1   ||||| ||||| |||    → 13
  2   ||||| ||           → 7
  3   |||                → 3
  4   |                  → 1

So we now know that there are 13 people who have 1 brother/sister. This number is called the absolute frequency and we denote it by $n_i$. And how many people have at most 1 brother/sister? In our case, these are the people who have 0 or 1 brother/sister, that is, 6 + 13 = 19. This number is called the cumulative absolute frequency and we denote it by $N_i$. We can now write the table of absolute and cumulative frequencies:

  N. bro/sis   absolute fr.   cum. absolute fr.
  0            6              6
  1            13             13 + 6 = 19
  2            7              13 + 6 + 7 = 26
  3            3              13 + 6 + 7 + 3 = 29
  4            1              13 + 6 + 7 + 3 + 1 = 30

To calculate the cumulative frequencies correctly, it is important to put the values of the characteristic in order from the smallest to the biggest. We are now going to define other kinds of frequencies, because it is interesting to know what proportion of the total a particular value represents; that is how we can compare it with other populations.
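The tally and the cumulative column above can also be obtained with a short program. The following is a minimal Python sketch, assuming the 30 answers are typed in as a list; the names `siblings`, `absolute` and `cumulative` are chosen here only for illustration and are not part of the poll.

```python
from collections import Counter
from itertools import accumulate

# Answers to "number of brothers/sisters", in the order they were collected.
siblings = [0, 1, 3, 2, 0, 1, 0, 1, 1, 2, 2, 3, 1, 2, 1,
            1, 1, 1, 0, 0, 4, 2, 3, 1, 2, 1, 2, 1, 1, 0]

# Absolute frequency n_i: how many times each value appears.
counts = Counter(siblings)
values = sorted(counts)                 # values from smallest to biggest
absolute = [counts[v] for v in values]

# Cumulative absolute frequency N_i: running total of the n_i.
cumulative = list(accumulate(absolute))

for v, n_i, N_i in zip(values, absolute, cumulative):
    print(v, n_i, N_i)
```

Sorting the values before accumulating is what makes each running total mean "at most this many brothers/sisters".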
For instance, in our case there are 6 students with 0 brothers/sisters, but we have also asked a group of 50 people, and there we found 9 people with 0 brothers/sisters. In which of the two groups is there a bigger proportion of people with no brothers/sisters? It is easy to see that the proportions are

\[ \frac{6}{30} = 0.2 \quad \text{and} \quad \frac{9}{50} = 0.18 \]

So the proportion is bigger in our group of 30 people. This proportion is called the relative frequency and we denote it by $f_i$. If we express it as a percentage (multiplying by 100) we get the percentage frequency, which in our case is 20% and 18% respectively. We denote these frequencies by $p_i$. We now add all these frequencies to our table and we get

  Bro/sis   absolute fr.   relative fr.     percentage fr.   cum. abs. fr.             cum. rel. fr.
  0         6              6/30 = 0.2       20%              6                         0.2
  1         13             13/30 ≈ 0.433    43.3%            13 + 6 = 19               0.633
  2         7              7/30 ≈ 0.233     23.3%            13 + 6 + 7 = 26           0.867
  3         3              3/30 = 0.1       10%              13 + 6 + 7 + 3 = 29       0.967
  4         1              1/30 ≈ 0.033     3.3%             13 + 6 + 7 + 3 + 1 = 30   1

Let us now analyze the weight data. We count the different values:

  Weight      46 47 49 50 52 53 54 55 57 58 59 60 62
  Frequency    1  1  1  2  3  1  2  1  1  1  1  1  1

  Weight      63 66 67 68 69 70 72 74 75 77 80 82
  Frequency    1  1  2  1  1  1  1  1  1  1  1  1

As you can see, most of the values have frequency 1 and our variable takes 25 different values. Those are too many different values to represent in a table (even more so when we only have 30 data). How can we get a table that represents the distribution of the data better? It seems logical to group similar data in intervals. There is a complete theory about how to group data properly. These are the main points we want to stress:

• The number of classes should be neither too high (around 6 to 8 is the maximum number we usually work with) nor too low (it makes no sense to group into 2 or 3 classes, because we would lose a lot of information).
• Except perhaps for the extreme classes, all the intervals should have the same width, because otherwise the information can be misinterpreted.

Can you imagine which intervals we are looking for? You can start by thinking about the number of classes you want to have. Note that between the highest value (82) and the lowest value (46) there is a difference of 36 kg. For instance, if we want to group into 6 classes, the width of each interval should be 36/6 = 6. So we obtain the following intervals: [46,52], (52,58], (58,64], (64,70], (70,76], (76,82]. Now we have a possible classification though, of course, there are many more. In some analyses you may find that the first interval is of the kind "smaller than 52" and the last one "greater than 76". Such an interval is treated as having the same width as the others for the purpose of calculations. Once the grouping is decided, we can calculate the frequencies:

  Weight     absolute fr.   relative fr.   percentage fr.   cum. abs. fr.   cum. rel. fr.
  [46,52]    8              0.267          26.7%            8               0.267
  (52,58]    6              0.2            20%              14              0.467
  (58,64]    4              0.133          13.3%            18              0.6
  (64,70]    6              0.2            20%              24              0.8
  (70,76]    3              0.1            10%              27              0.9
  (76,82]    3              0.1            10%              30              1

Moreover, when we work with grouped data we need to choose a representative of each interval. We call it the class mark, and it is the midpoint of the interval (the lower limit plus the upper limit, divided by 2).

Exercise 1.5.1 Make the frequency table for the variable "answers to question 1.3" and for the answers to the question "height", deciding beforehand whether it is necessary to group the data in intervals or not.

1.6 Graphical methods

Once we have the frequency tables, imagine that your teacher asks you to present to the rest of the students the conclusions you have obtained.
You can present your frequency tables and talk about the main conclusions, but is there a way of presenting the data so that the main conclusions can be seen more easily? As you may suppose, the answer to this question is yes. Maybe you have seen in books, or mainly in the media, that data are usually presented graphically, so that they are more attractive to people and also easier to analyze. In this section we want to show the different types of graphs, and we are going to stress how important it is to choose the right type of graph for the data we are working with. Now that we have the frequency tables for the variables weight and number of brothers/sisters, we are going to use them to introduce the different graphs.

1.6.1 Bar graph

The first kind of graph we are going to study is the bar graph. This graph is used for qualitative variables and for discrete variables that are not grouped in intervals. We already know that number of brothers/sisters is a discrete variable, so let us see how to build a bar graph using those data. On the OX axis we place the categories if we have a qualitative variable, or the values if we have a discrete variable; in our example, those values are 0, 1, 2, 3 and 4. Over each of these values we place a rectangle or bar of equal base, with a height proportional to the corresponding frequency. In our case, we get a graph like this:

Figure 1.1: brothers/sisters (vertical bars)

Sometimes this graph is also presented with horizontal bars, like this:

Figure 1.2: brothers/sisters (horizontal bars)

1.6.2 Histogram

A histogram is a graph very similar to the bar graph, but it is used for variables grouped in intervals. We are going to build a histogram for the variable weight.
Like the previous one, it is built by representing the intervals on the OX axis; over each of them we place a rectangle whose base has the same width as the interval and whose height is such that the area of the rectangle is proportional to the frequency of the interval. In this kind of graph the areas of the rectangles are very important, because a bar no longer corresponds to a single point: its width represents our interval. So, if all our intervals have the same width, the height can simply be the frequency; if not, we have to modify the height in order to keep the area proportional to the frequency. Our histogram for the variable weight, which we have already grouped, is:

Figure 1.3: weight (histogram)

We can also represent it with horizontal rectangles:

Figure 1.4: weight (histogram)

Surely you have seen a population pyramid in the media at some time. Notice that a population pyramid is in fact two horizontal histograms (one for women and another for men) in which we represent the number of inhabitants grouped by age.

1.6.3 Frequency polygon

The next type of graph that we are going to define is the frequency polygon. This graph is used for quantitative variables, discrete or continuous. In order to draw it, we start from the bar graph or the histogram, depending on whether the variable is not grouped or grouped. We join with a line the midpoints of the upper bases of the rectangles in the bar graph or the histogram. In our two examples, for the number of brothers/sisters we get the following graph:

Figure 1.5: brothers/sisters (frequency polygon)

The case of the weight is a little different. In this situation, the area under the line represents the data we have, as in the histogram, because we are talking about the whole width of the interval. The graph looks like this:

Figure 1.6: weight (frequency polygon)

All the graphs that we have seen so far can also be drawn for relative frequencies and for cumulative frequencies.
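The area rule for histograms described in Section 1.6.2 can be checked numerically. The following is a minimal Python sketch, not a drawing routine: the function name `bar_heights` is chosen here for illustration, and the regrouping in which the last two weight classes are merged into (70, 82] is a hypothetical example that does not appear in the poll analysis.

```python
def bar_heights(edges, counts):
    """Height of each histogram bar so that area = absolute frequency."""
    heights = []
    for lower, upper, n_i in zip(edges, edges[1:], counts):
        width = upper - lower
        heights.append(n_i / width)     # frequency per unit of width
    return heights

# Grouping of the weight data from Section 1.5 (all widths equal to 6).
equal = bar_heights([46, 52, 58, 64, 70, 76, 82], [8, 6, 4, 6, 3, 3])

# Hypothetical regrouping: the last two classes merged into (70, 82].
merged = bar_heights([46, 52, 58, 64, 70, 82], [8, 6, 4, 6, 6])

print(equal)
print(merged)
```

With equal widths every height is the frequency divided by 6, so the bars keep the same proportions as the frequencies. In the merged grouping, the class (70, 82] has the same frequency as (64, 70] but is twice as wide, so its bar is only half as tall.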
1.6.4 Pie Chart

The next type of graph that we are going to present is a well-known one, the pie chart. In a pie chart, we assign to each category or value a sector of a circle in such a way that its area is proportional to the frequency. This graph is usually used for qualitative variables and for discrete variables that are not grouped.

Figure 1.7: brothers/sisters (pie chart)

1.6.5 Pictogram

Pictograms are a kind of graph that is very frequent in the media. They are graphs in which a picture related to the variable is used to represent the frequencies. But we have to stress something again: the size of the picture (and not only its height) has to be proportional to the frequency that we want to represent. It is usual to write the frequency beside the picture as well, to avoid mistakes.

1.6.6 Stem and leaf plot

There is a representation that lies between a graph and a count of the data: the stem and leaf plot. We are going to see how to make it through the example of the weight. Recall that the data we had are:

  52 66 54 70 46 62 59 68 49 50 77 57 63 67 58 54 52 47 74 72 80 82 60 75 53 55 69 67 50 52

In a stem and leaf plot, the first thing we have to do is write in a column the different tens digits that we can find in the data. In our example, as our values range between 46 and 82, we have to write 4, 5, 6, 7 and 8 in the following way:

  4 |
  5 |
  6 |
  7 |
  8 |

Next, we take the first observation, 52, and we place its units digit beside the corresponding tens digit, that is:

  4 |
  5 | 2
  6 |
  7 |
  8 |

We keep placing the units digits beside the tens digits for the rest of the data. What we get is something like this:

  4 | 697
  5 | 249078423502
  6 | 62837097
  7 | 07425
  8 | 02

You can notice that we have something similar (but not equal) to a bar graph or a histogram.
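The construction just described can be written as a short program. The following is a minimal Python sketch, assuming that all data are non-negative whole numbers so that the tens part is the stem and the units digit is the leaf; the function name `stem_and_leaf` is chosen here for illustration.

```python
def stem_and_leaf(data):
    """Group the units digits (leaves) by tens part (stem), in data order."""
    plot = {}
    for value in data:
        stem, leaf = divmod(value, 10)   # 52 -> stem 5, leaf 2
        plot.setdefault(stem, []).append(leaf)
    return plot

weights = [52, 66, 54, 70, 46, 62, 59, 68, 49, 50, 77, 57, 63, 67, 58,
           54, 52, 47, 74, 72, 80, 82, 60, 75, 53, 55, 69, 67, 50, 52]

plot = stem_and_leaf(weights)

# Print every stem between the smallest and the biggest, even if it is empty.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(stem, "|", leaves)
```

The leaves are kept in the order in which the data were collected, as in the plot above; sorting each list of leaves would give the ordered version of the plot.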
Obviously we could have made it vertically, and we would have something like this:

      2
      0
      5
      3
      2 7
      4 9
      8 0
      7 7 5
      0 3 2
    7 9 8 4
    9 4 2 7 2
    6 2 6 0 0
    ---------
    4 5 6 7 8

That looks like a histogram or a bar graph, though it is not. But the stem and leaf plot can be taken as an approximation to the distribution of the data. In fact, we have only divided in tens (from 40 to 49, from 50 to 59, ...), but we could divide in groups of 5 (from 40 to 44, from 45 to 49, from 50 to 54, ...) just by writing each tens digit twice: beside the first one we place the units digits between 0 and 4, and beside the second one the units digits between 5 and 9. In our example, and for the horizontal case, we would have:

  4 |
  4 | 697
  5 | 24042302
  5 | 9785
  6 | 230
  6 | 68797
  7 | 042
  7 | 75
  8 | 02
  8 |

1.6.7 Some remarks

Imagine that you see the following two graphs showing the profits of two companies. Which one would you choose to be your company?

Figure 1.8: profits (company 1 and company 2)

Most of you may choose company 2, because it surely looks better than company 1, but in fact the data in the two graphs are the same. We have only changed the scale of the OY axis. We will make some remarks before starting the next section. Graphs are a very useful tool and they make it easier to draw conclusions from our data, but it is necessary to draw them properly in order to avoid mistakes. It is very important to keep the proportions among the pictures we represent and to make sure that the axis scales are also kept in proportion, because small changes in the scales make big differences in appearance and the graph can be misunderstood.

1.7 Measures of central tendency: mean, median, mode, quantiles

Let us suppose now that we are planning a trip with the whole class and we want to earn some money, so we have decided to sell t-shirts, but we do not know which price is appropriate. The only thing we know is that we pay 4 euros for each of them.
We would like to make a profit, but we cannot set a high price because we want everybody to buy our t-shirts. We think that the weekly pay is a good reference for what the students can afford. So we are going to use the weekly pay data that we have:

  6 8 10 5 15 20 9 10 9 9 20 15 12 6 15 12 10 25 20 30 15 12 9 20 6 9 10 25 9 9

We have 30 values, but we need a single value to represent them all. Which value can we choose? A first solution might be to choose an intermediate value among all the data we have. To get it, we add up all the numbers and divide by the total number of data:

\[ \bar{x} = \frac{6 + 8 + 10 + 5 + 15 + 20 + 9 + 10 + 9 + 9 + 20 + 15 + 12 + 6 + 15 + 12 + 10 + 25 + 20 + 30 + 15 + 12 + 9 + 20 + 6 + 9 + 10 + 25 + 9 + 9}{30} = \frac{390}{30} = 13 \]

Now we have a first possible price, 13 euros. The number we have just calculated is called the mean. But there are more possibilities; for instance, we can choose the most frequent value to represent our data. In our example, the most frequent value is 9, which can also be a good choice for a price. We call the most frequent value the mode. But neither of these two numbers says anything about the number of people who can afford the t-shirt. So we have another idea. Let us sort the data we have:

  5 6 6 6 8 9 9 9 9 9 9 9 10 10 10 10 12 12 12 15 15 15 15 20 20 20 20 25 25 30

Now we want to find the value that leaves half of the data on each side. The values in positions 15 and 16 leave 14 values on each side; as both of them are equal to 10, we can consider that 10 is the value that leaves half of the data on each side. This number is called the median. Just as we have proposed a value that leaves 50% of the data on each side, we can look for a price that 75% of the class can afford, that is, the value that leaves 25% of the data on its left (only 25% of the data are lower than that value), or do the same for any other percentage. These numbers are called quantiles.
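The three candidate prices found above can be checked with Python's standard `statistics` module. This is a minimal sketch; the helper `can_afford` is hypothetical and assumes, only for illustration, that a student can afford the t-shirt when the weekly pay is at least the price.

```python
import statistics

weekly_pay = [6, 8, 10, 5, 15, 20, 9, 10, 9, 9, 20, 15, 12, 6, 15,
              12, 10, 25, 20, 30, 15, 12, 9, 20, 6, 9, 10, 25, 9, 9]

mean = statistics.mean(weekly_pay)       # sum of the data divided by 30
mode = statistics.mode(weekly_pay)       # most frequent value
median = statistics.median(weekly_pay)   # average of the 15th and 16th sorted values

def can_afford(price, pays):
    """Hypothetical helper: number of students whose weekly pay is >= price."""
    return sum(1 for pay in pays if pay >= price)

for name, price in [("mean", mean), ("mode", mode), ("median", median)]:
    print(name, price, can_afford(price, weekly_pay))
```

Under that assumption, the count printed beside each price shows what the mean and the mode alone do not tell us: how many students could actually pay it.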
We can now choose any of those three values (mean, mode or median), depending on what we intend in each case or on which value best represents the whole data set. These three values are not always valid in every case, but they help us to see where the center of the distribution is. These are the main measures of central tendency.

We are now going to define formally the concepts that we have presented. From now on we speak about variables. Let us suppose that we have observed a variable in $n$ data points and obtained $k$ different values, $x_1, x_2, \dots, x_k$, with frequencies $n_1, n_2, \dots, n_k$, where $n_i$ is the absolute frequency of the value $x_i$. We denote by $N_i = \sum_{j \le i} n_j$ the cumulative absolute frequency of the value $x_i$ and by $f_i = \frac{n_i}{n}$ its relative frequency. If the values of the variable are grouped, we suppose that we have $h$ intervals, denoted by $(L_0, L_1], (L_1, L_2], \dots, (L_{h-1}, L_h]$, whose class marks are $c_1, c_2, \dots, c_h$. In this case, the absolute frequencies are denoted by $n_1, n_2, \dots, n_h$, the cumulative absolute frequencies by $N_1, N_2, \dots, N_h = n$ and the relative frequencies by $f_1, f_2, \dots, f_h$.

Then the mean is defined as follows:

\[ \bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{n} \]

for variables that are not grouped. If we have a grouped variable, we use the class marks $c_i$ instead of the values $x_i$.

The main characteristics of the mean are the following:

• It is the center of gravity of the distribution and it is unique.
• When we have extreme or scarcely representative values (too big or too small), the mean may not be representative.
• It makes no sense to calculate the mean for a qualitative variable, or for grouped data when any of the intervals is not bounded.
• For grouped data, we use the class mark of each interval to calculate the mean.

Moreover, the mean has the following properties:

• If a constant is added to each value, the mean increases by that same constant.
• If we multiply all the values by a constant, the mean is also multiplied by the same constant.

The mode is usually defined as the most frequent value. For a variable that is not grouped, it is the value that appears the most times. For variables grouped in intervals of the same width, we look for the interval with the highest frequency (modal class or interval) and approximate the mode through the formula:

\[ Mo = L_{i-1} + \frac{n_i - n_{i-1}}{(n_i - n_{i-1}) + (n_i - n_{i+1})} \cdot c_i \]

where:
$L_{i-1}$ is the lower limit of the modal interval.
$n_i$ is the absolute frequency of the modal interval.
$n_{i-1}$ is the absolute frequency of the interval preceding the modal interval.
$n_{i+1}$ is the absolute frequency of the interval following the modal interval.
$c_i$ is the width of the interval (in this formula, not its class mark).

The mode verifies that:

• A distribution can have more than one mode. In that case, we say that we have a bimodal, trimodal, ... distribution, depending on the number of values presenting the highest absolute frequency.
• The mode is usually a less representative measure than the mean, except in the case of qualitative data.
• If we have intervals with different widths, we have to look for the interval with the highest frequency density (the result of dividing the absolute frequency by the width of the interval, $\frac{n_i}{c_i}$) and then use the preceding formula.

The median is, for a variable that is not grouped and once we have sorted our data, the central value if there is an odd number of observations, and the mean of the two central values if there is an even number of data. If we have a grouped variable, we have to look for the central interval (the one in which we can find the central value), that is to say, the one in which $N_i$ is bigger than $\frac{n}{2}$ for the first time, and then we can apply the formula:

\[ Me = L_{i-1} + \frac{\frac{n}{2} - N_{i-1}}{n_i} \cdot c_i \]

where:
$L_{i-1}$ is the lower limit of the central interval.
$n_i$ is the absolute frequency of the central interval.
$N_{i-1}$ is the cumulative absolute frequency of the interval preceding the central interval.
$n$ is the number of data.
$c_i$ is the width of the interval.

Moreover, the quantiles are position measures that generalize the concept of median. We are now going to define the centiles or percentiles, the quartiles and the deciles. We suppose that we have sorted our data.

The centiles or percentiles are the values of the variable that leave a given percentage of the data on their left. We denote them by $P_h$ or $C_h$, where $h$ is the percentage, $h = 1, 2, \dots, 99$. If we have a grouped variable, once we have found the interval that contains the centile, we apply the following formula:

\[ P_h = C_h = L_{i-1} + \frac{\frac{h \cdot n}{100} - N_{i-1}}{n_i} \cdot c_i \]

where the different elements have the same meaning as in the case of the median.

The quartiles are the values that, once we have sorted the data, divide them into 4 equal groups, so that between two consecutive quartiles there is 25% of the data points. We denote them by $Q_1$, $Q_2$ and $Q_3$, and they verify that $Q_1 = C_{25}$, $Q_2 = C_{50} = Me$, $Q_3 = C_{75}$.

The deciles are the values that, once we have sorted our data, divide them into 10 equal groups, in such a way that between two consecutive deciles there is 10% of the data points. We denote them by $D_1, D_2, D_3, \dots, D_9$. They verify that $D_1 = C_{10}$, $D_2 = C_{20}$, $D_3 = C_{30}$, ..., $D_9 = C_{90}$.

Exercise 1.7.1 For the data of number of brothers/sisters and weight, calculate the mean, mode, median and the quantiles $Q_1$, $Q_3$, $C_{30}$, $C_{74}$, $D_4$, $D_9$.

1.8 Measures of variability: Range, variance, standard deviation

Imagine that we have 3 different data sets about the weights of certain people, and we know that in the 3 cases the mean of the variable weight is 55. Does this mean that the 3 sets are equal or similar? We get the data and we find that the observations are:

  Set 1: 55 55 55 55 55 55 55
  Set 2: 47 51 54 55 56 59 63
  Set 3: 39 47 53 55 57 63 71

We can see that, though they have the same mean, the data sets are very different.
Look at their stem and leaf plots:

  Set 1          Set 2          Set 3
  3 |            3 |            3 | 9
  4 |            4 | 7          4 | 7
  5 | 5555555    5 | 14569      5 | 357
  6 |            6 | 3          6 | 3
  7 |            7 |            7 | 1

Then, how can we detect those differences among the data sets? It seems that the measures of central tendency do not give us enough information in every situation, so we have to look for other measures that tell us how far the data are from the mean. This means that we need the concept of variability of the data.

The first thing we notice is that in the first case all the data are equal, in the second one there is some difference between the biggest and the smallest values, and in the third case this difference is even more obvious. Precisely, we have that

  55 − 55 = 0
  63 − 47 = 16
  71 − 39 = 32

These numbers are called the range of the data. Nevertheless, though it is a very easy measure to calculate, it is not used very much, because a single very small or very big value in our data changes the range a lot, so it is not a useful measure in every situation.

How can we find a number that gives us an approximation to the distance between the data and the mean? We can calculate the distance from every data point to the mean (in absolute value) and then calculate the mean of those distances. This is what we call the mean deviation. Let us calculate the mean deviation for the second group of data:

\[ \frac{|47 - 55| + |51 - 55| + |54 - 55| + |55 - 55| + |56 - 55| + |59 - 55| + |63 - 55|}{7} = \frac{8 + 4 + 1 + 0 + 1 + 4 + 8}{7} = \frac{26}{7} \approx 3.714 \]

Nevertheless, we usually use a different measure of variability, the mean of the squared deviations of the data from the mean; in this measure the biggest deviations have a greater influence. But let us present the formal definition of all these concepts.

The range is the difference between the biggest and the smallest value of the variable, if it is not grouped.
If we have a grouped variable, we calculate the difference between the upper limit of the last interval and the lower limit of the first interval. The range only depends on the biggest and the smallest elements, and not on the rest of the data. For instance, we could have the following two data sets with the same range:

Figure 1.9: range

It is easy to see that the difference between $x_k$ and $x_1$ is the same in both situations, but the two sets are very different.

The interquartile range is the difference between the third and the first quartiles, and it gives us a zone where we can find the central 50% of the distribution.

The mean deviation is the mean of the deviations of the data from the mean. We call deviation from the mean the absolute value of the difference between a value of the variable and the mean, $|x_i - \bar{x}|$, so the definition of the mean deviation is

$$D_M = \frac{\sum_{i=1}^{k} |x_i - \bar{x}| \cdot n_i}{n} .$$

This measure is not used very often because the absolute value function makes it awkward to work with. In any case, a small mean deviation means that the data are highly concentrated around the mean. We can also define the median deviation, though it is even less usual. The definition is:

$$D = \frac{\sum_{i=1}^{k} |x_i - Me| \cdot n_i}{n} .$$

The variance is the mean of the squared deviations of the data from the mean. We denote it by $S^2$ and its expression is

$$S^2 = \frac{\sum_{i=1}^{k} (x_i - \bar{x})^2 \cdot n_i}{n} = \frac{\sum_{i=1}^{k} x_i^2 \cdot n_i}{n} - \bar{x}^2 .$$

The variance verifies that:
• As we are taking the square of the deviations, the bigger ones have more influence on the result.
• The units of measurement of $S^2$ are not the same as those of the sample, because we have the square of the deviations.
• The variance is never negative. It is 0 when all the values coincide with the mean.

We define the quasivariance as

$$s^2 = \frac{\sum_{i=1}^{k} (x_i - \bar{x})^2 \cdot n_i}{n-1} ;$$

its relation to the variance is $S^2 = \frac{n-1}{n}\, s^2$. This is a very useful measure when we work with inference. Sometimes it is also denoted by $S_c^2$.
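The definitions above can be tried out on the three weight data sets from the beginning of this section. The following Python sketch is only an illustration (the function name `variability` and the dictionary keys are our own choices, not notation from the text); it computes the range, the mean deviation, the variance with divisor n and the quasivariance with divisor n − 1 for ungrouped data.

```python
def variability(values):
    """Range, mean deviation, variance (divisor n) and quasivariance (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return {
        "range": max(values) - min(values),
        "mean_deviation": sum(abs(x - mean) for x in values) / n,
        "variance": squared_deviations / n,
        "quasivariance": squared_deviations / (n - 1),
    }

# The three weight data sets from the beginning of Section 1.8.
data_sets = {
    "Set 1": [55, 55, 55, 55, 55, 55, 55],
    "Set 2": [47, 51, 54, 55, 56, 59, 63],
    "Set 3": [39, 47, 53, 55, 57, 63, 71],
}

results = {name: variability(values) for name, values in data_sets.items()}

for name, measures in results.items():
    print(name, {key: round(value, 3) for key, value in measures.items()})
```

For Set 2 the sketch gives the range 16 and the mean deviation 26/7 ≈ 3.714 obtained by hand above, and in every set the variance equals (n − 1)/n times the quasivariance.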
The standard deviation is the positive square root of the variance. We denote it by $S$ and its expression is

$$S = +\sqrt{\frac{\sum_{i=1}^{k} (x_i - \bar{x})^2 \cdot n_i}{n}} = +\sqrt{\frac{\sum_{i=1}^{k} x_i^2 \cdot n_i}{n} - \bar{x}^2} = +\sqrt{\overline{x^2} - \bar{x}^2} .$$

Its main properties are:
• It is the most usual measure of variability.
• It has the same units of measurement as the sample.
• The standard deviation is always positive or 0.

Moreover, the variance and the standard deviation verify:
• If we add a constant to all the data, the variance and the standard deviation stay the same.
• If we multiply all the values by a positive constant, the variance is multiplied by the square of the constant, and the standard deviation is multiplied by the constant.

1.9 Joint use of the mean and the standard deviation: Tchebicheff's theorem, Pearson's coefficient of variation, z-scores

1.9.1 Tchebicheff's theorem

We have already found measures that give us the center of the data and their variability, but we still need more information. Let us recall the data about the number of brothers/sisters:

| Num. brothers/sisters | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Absolute fr. | 6 | 13 | 7 | 3 | 1 |

so we have that $\bar{x} = 1.3333$, $S^2 = 1.022$, $S = 1.011$.

How many people are there around the mean? Are there many students that have 1 or 2 brothers/sisters? Let us take an interval centered on the mean, that is, $(\bar{x} - a, \bar{x} + a)$. We know that the variance and the standard deviation measure variability, so we will try to use them now. Which one would you use? We should reject the variance, because we cannot add it to the mean: they have different units of measurement. Let us then take the standard deviation, $a = S$. We get the interval

(1.3333 − 1.011, 1.3333 + 1.011) = (0.3223, 2.3443).

Inside this interval we find the students having 1 or 2 brothers/sisters. These are 20 of the 30 students, i.e., 66.7% of them. What happens if we use $2S$ instead of $S$? We get the interval

(1.3333 − 2.022, 1.3333 + 2.022) = (−0.6887, 3.3553).

Inside this interval we have 29 of the 30 students, i.e., 96.7% of them.
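The counts for these intervals can be reproduced with a short Python sketch. It is only an illustration under our own naming (`siblings`, `count_within`); it recomputes the mean and the standard deviation from the frequency table and counts the students whose value lies strictly inside an interval of the form (mean − a·S, mean + a·S).

```python
from math import sqrt

# Frequency table from the text: number of brothers/sisters -> absolute frequency.
siblings = {0: 6, 1: 13, 2: 7, 3: 3, 4: 1}

n = sum(siblings.values())
mean = sum(x * f for x, f in siblings.items()) / n
variance = sum(x ** 2 * f for x, f in siblings.items()) / n - mean ** 2
std_dev = sqrt(variance)

def count_within(a):
    """Number of students strictly inside (mean - a*S, mean + a*S)."""
    low, high = mean - a * std_dev, mean + a * std_dev
    return sum(f for x, f in siblings.items() if low < x < high)

counts = {a: count_within(a) for a in (1, 2, 3)}

for a, count in counts.items():
    print(f"a = {a}: {count} of {n} students ({100 * count / n:.1f}%)")
```

Changing the multiplier `a` shows how quickly the interval grows to cover almost all of the data.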
Obviously, if we calculate the interval for $3S$ we find that all the data are inside it. But the next question is: does this always happen? Are these concentrations of data always the same? Let us see another example using the weekly pay. We have that $\bar{x} = 13$, $S^2 = 39.2$, $S = 6.26$. Then,

(13 − 6.26, 13 + 6.26) = (6.74, 19.26) → contains 19 data (63.3%)
(13 − 12.52, 13 + 12.52) = (0.48, 25.52) → contains 29 data (96.7%)
(13 − 18.78, 13 + 18.78) = (−5.78, 31.78) → contains 30 data (100%)

As you can see, we get very similar results. This is because there is a theorem that guarantees that these intervals contain a certain percentage of the data. To be precise, the theorem states that an interval of the form $(\bar{x} - aS, \bar{x} + aS)$ contains at least $100\left(1 - \frac{1}{a^2}\right)\%$ of the data. For example, for $a = 2$ the interval contains at least 75% of the data and for $a = 3$ at least 88.9%. This statement is known as Tchebicheff's theorem.

1.9.2 Pearson's coefficient of variation

We are going to work now with the height and weight data. For the weight we have that

$\bar{x} = 60.8$, $S^2 = 99.56$, $S = 9.97$,

while for the heights we have

$\bar{x} = 1.7133$, $S^2 = 0.0128$, $S = 0.1132$.

In which case do we have more variability? We could think that it is the weight data, because the variance and the standard deviation are bigger, but look at what happens if we calculate the same measures for the heights measured in centimeters:

$\bar{x} = 171.33$, $S^2 = 128.35$, $S = 11.32$.

If we repeat the question now, what would you answer? In fact, we can compare neither standard deviations nor variances, because they depend on the units, just like the mean. We should find a dimensionless measure. So far, we only know that the mean and the standard deviation have the same units of measurement, so how can we get a dimensionless measure from them? We can divide one by the other, and then we get Pearson's coefficient of variation:

$$CV = \frac{S}{\bar{x}} .$$

We can calculate it for our examples. For the weight we have that

$$CV = \frac{9.97}{60.8} = 0.164 ,$$

and for the height

$$CV = \frac{11.32}{171.33} = \frac{0.1132}{1.7133} = 0.066 ,$$

so we find more variability in the weights than in the heights.
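The fact that the coefficient of variation does not depend on the units can be checked numerically. The sketch below is an illustration only (the function name `coefficient_of_variation` is our own); it uses the means and standard deviations quoted above for the weights and for the heights in meters and in centimeters.

```python
from math import isclose

def coefficient_of_variation(std_dev, mean):
    """Pearson's coefficient of variation: standard deviation divided by the mean."""
    return std_dev / mean

# Summary values quoted in the text.
cv_weight = coefficient_of_variation(9.97, 60.8)
cv_height_m = coefficient_of_variation(0.1132, 1.7133)    # heights in meters
cv_height_cm = coefficient_of_variation(11.32, 171.33)    # heights in centimeters

print("CV weight:", round(cv_weight, 3))
print("CV height (m):", round(cv_height_m, 3))
print("CV height (cm):", round(cv_height_cm, 3))
print("Same CV in both units:", isclose(cv_height_m, cv_height_cm))
```

Changing the unit multiplies the mean and the standard deviation by the same factor, so the quotient stays the same.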
1.9.3 Z-scores

We can still find more information in our data. Imagine that your height is 1.74 and you have a friend in another class whose height is the same. Within your own classes, which of you is taller? How can we compare these two values if we only know that in your class the mean is 1.7133 and the standard deviation is 0.1132, while in your friend's class the mean is 1.708 and the standard deviation is 0.1253? There is a way to change these two values into "comparable" values. These are what we call z-scores, and each one is calculated by subtracting the mean from the value and dividing the difference by the standard deviation. In this way, the new values belong to distributions with mean 0 and standard deviation 1, and so we can compare them. In our example we have the following z-scores:

$$z_1 = \frac{1.74 - 1.7133}{0.1132} = 0.236 , \qquad z_2 = \frac{1.74 - 1.708}{0.1253} = 0.255 .$$

We conclude that your friend is taller than you (each one within their own class) because the z-score is bigger. The formula for the z-score related to the value $x_i$ is

$$z_i = \frac{x_i - \bar{x}}{S} .$$

Chapter 2
Analysis of the opinion poll

We are now going to make a deeper analysis of some of the questions in the opinion poll. We have chosen 3 questions:

2.1 You smoke
2.3 You read books other than school books
3.1 You practice some sport outside high school

The data we have from question 2.1 are

135555511513315155555515154435

from question 2.3 we have

111222344413241213211121111224

and from 3.1

313534213335512123512532415543

The first thing we are going to do is to calculate the frequencies in all cases in order to have the frequency tables for all of them.
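The counting step can be sketched in Python as follows. This is an illustration, not the procedure used by the authors: the function name `frequency_table` and the tuple layout of each row are our own assumptions, and each answer is assumed to be a single digit from 1 to 5, as in the strings above.

```python
from collections import Counter

# Raw answers from the text, one digit per student.
answers = {
    "2.1": "135555511513315155555515154435",
    "2.3": "111222344413241213211121111224",
    "3.1": "313534213335512123512532415543",
}

def frequency_table(raw, categories=(1, 2, 3, 4, 5)):
    """Rows of (value, abs fr, rel fr, perc fr, cum abs fr, cum rel fr)."""
    counts = Counter(int(ch) for ch in raw)
    n = len(raw)
    rows = []
    cumulative = 0
    for value in categories:
        absolute = counts[value]          # 0 for answers nobody chose
        cumulative += absolute
        rows.append((value, absolute, absolute / n, 100 * absolute / n,
                     cumulative, cumulative / n))
    return rows

for question, raw in answers.items():
    print("Question", question)
    for value, ab, rel, perc, cum, cum_rel in frequency_table(raw):
        print(f"  {value}: {ab:2d}  {rel:.3f}  {perc:5.1f}%  {cum:2d}  {cum_rel:.3f}")
```

The cumulative columns are obtained by keeping a running total while going through the answers in increasing order.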
For question 2.1 we have that

| Answer (2.1) | abs fr | rel fr | perc fr | cum abs fr | cum rel fr |
|---|---|---|---|---|---|
| 1 | 8 | $0.2\overline{6}$ | $26.\overline{6}$% | 8 | $0.2\overline{6}$ |
| 2 | 0 | 0 | 0% | 8 | $0.2\overline{6}$ |
| 3 | 4 | $0.1\overline{3}$ | $13.\overline{3}$% | 12 | 0.4 |
| 4 | 2 | $0.0\overline{6}$ | $6.\overline{6}$% | 14 | $0.4\overline{6}$ |
| 5 | 16 | $0.5\overline{3}$ | $53.\overline{3}$% | 30 | 1 |

For question 2.3 we have the following frequency table

| Answer (2.3) | abs fr | rel fr | perc fr | cum abs fr | cum rel fr |
|---|---|---|---|---|---|
| 1 | 13 | $0.4\overline{3}$ | $43.\overline{3}$% | 13 | $0.4\overline{3}$ |
| 2 | 9 | 0.3 | 30% | 22 | $0.7\overline{3}$ |
| 3 | 3 | 0.1 | 10% | 25 | $0.8\overline{3}$ |
| 4 | 5 | $0.1\overline{6}$ | $16.\overline{6}$% | 30 | 1 |
| 5 | 0 | 0 | 0% | 30 | 1 |

and finally, the frequency table for question 3.1 is

| Answer (3.1) | abs fr | rel fr | perc fr | cum abs fr | cum rel fr |
|---|---|---|---|---|---|
| 1 | 6 | 0.2 | 20% | 6 | 0.2 |
| 2 | 5 | $0.1\overline{6}$ | $16.\overline{6}$% | 11 | $0.3\overline{6}$ |
| 3 | 9 | 0.3 | 30% | 20 | $0.\overline{6}$ |
| 4 | 3 | 0.1 | 10% | 23 | $0.7\overline{6}$ |
| 5 | 7 | $0.2\overline{3}$ | $23.\overline{3}$% | 30 | 1 |

Just by looking at the data in the tables, we can notice that the three variables are very different. We will now try to see graphically how these variables are distributed, and then we will talk about the first conclusions. As you can notice, we have three discrete variables, so we are going to use the bar graph and the pie chart. These are the graphs for question 2.1:

Figure 2.1: answers to question 2.1

Let us represent now the graphs for question 2.3:

Figure 2.2: answers to question 2.3

and now here we have the ones for question 3.1:

Figure 2.3: answers to question 3.1

We can now talk about the first conclusions. It is quite obvious that for question 2.1 the most frequent values are the extreme ones, 1 and 5. This is because there is a tendency to relate number 1 with the people that don't smoke and number 5 with the people that do smoke. In any case, most of the data are placed in the bigger values (3, 4 and 5). On the contrary, in question 2.3 we can see that the most frequent values are the smaller ones, so we can say that reading is not a very "popular" hobby. The answers to the third question are a little more "spread" over all the values. It is also interesting in this example to represent a bar graph with the cumulative absolute frequencies.
We show you the three graphs, in which you can see that the frequencies are more gradually distributed in the third case:

Figure 2.4: cumulative bar graphs

We are now going to confirm what we see by calculating the main measures of central tendency. We present them in a table, in order to make it easier to compare them:

| | Mean | Median | Mode |
|---|---|---|---|
| Q. 2.1 | 3.6 | 5 | 5 |
| Q. 2.3 | 2 | 2 | 1 |
| Q. 3.1 | 3 | 3 | 3 |

This table gives us some interesting information. It is quite simple to see that, though the mean for question 2.1 is 3.6, most of the data are bigger than the mean, because both the median and the mode are 5. For question 2.3 the situation is very different: we can see that most of the data are around the smallest values, and even the mode is the smallest one. In question 3.1 we can notice that the 3 values coincide, so number 3 is the best one to represent our data.

Let us calculate now the main measures of variability, and then we will try to see which variable is more spread.

| | Range | Variance | Standard deviation |
|---|---|---|---|
| Q. 2.1 | 4 | 3 | 1.73 |
| Q. 2.3 | 3 | 1.24 | 1.11 |
| Q. 3.1 | 4 | 2.06 | 1.43 |

The variances and standard deviations in this table are computed with divisor $n - 1$, that is, they are the quasivariance defined in Section 1.8 and its square root; with divisor $n$ the variances are 2.91, 1.2 and 2.

In our example, the range is not very relevant, because all the answers range between 1 and 5. The only thing we can notice from the fact that in question 2.3 the range is 3 (smaller than the others) is that one of the extreme values has frequency 0 (in this case, value 5). Notice, however, that for question 2.1 the frequency of value 2 is also 0, and the range does not reveal it. From the standard deviation we can conclude that the answers to question 2.1 are very spread. This is true because, if you take a look at the data, you can find that most of them are extreme values, 1 or 5. The other two variables are a bit more concentrated around the mean, especially the answers to question 2.3.

Let us check now whether the mean is representative in our variables. We shall calculate the coefficient of variation in each case. We have that

| | Coefficient of variation |
|---|---|
| Q. 2.1 | 0.48 |
| Q. 2.3 | 0.55 |
| Q. 3.1 | 0.47 |

So the mean is representative for the three cases we are studying.

2.1 Conclusions

In this last section of the analysis, it is important to stress the meaning of the data we are studying. Until now, we have been talking about the statistical characteristics of the data, but we cannot forget that all those data have their own meaning.

We can notice that smoking is very popular among young people. More than half of this class says that they smoke every day, and only 8 people say that they never smoke. If we add the frequencies of the students that smoke at least sometimes, we get 22 of you, almost three quarters of the total.

On the contrary, there is very little interest in reading. 22 of you say that you never or rarely read a book other than the ones you need for school. This is maybe one of the biggest contrasts we can get from the poll. None of you says that they read every day, though there are 5 people that say they read often.

Sports are the middle ground. This is maybe because many of you can practice some sport at the weekends or when the weather is good, while the ones that practice sports very often balance the ones that almost never practice any sport.

Chapter 3
Two-dimensional Descriptive Statistics

In the previous chapter, we worked with the data we got from a poll and we obtained the first conclusions. But we want to know more, because we can extract more information from those data with certain methods that we are going to study from now on. Before going on, we will state our objectives in this chapter.

3.1 Objectives

• To represent and analyze data on two variables through a scatterplot.
• To identify as a two-dimensional distribution a data set on two variables given in a table or by a scatterplot.
• To analyze the relationship between two variables through their scatterplot, establishing by intuition whether this relationship is positive or negative, whether it is functional or not and, if it is not, whether it is close to a line.
• To compare overall features of several distributions through their scatterplots.
• To assign given scatterplots to different situations.
• To determine the relationship between the different means through the scatterplot.
• To find, in a graphical way, a line that fits the scatterplot.
• To estimate the correlation coefficient from a scatterplot.
• To analyze the degree of the relationship between two variables when the correlation coefficient is known.
• To calculate the correlation coefficient in two-dimensional distributions and the regression lines.
• To make predictions from the regression line.

3.2 The example: an opinion poll

In this chapter we will go deeper into the analysis of the opinion poll we have been working with. From the information that we already have, we will try to answer questions like
• Is there any relationship between the pay you receive and the number of brothers/sisters you have?
• Does the sport you practice have any influence on how much you smoke or how much alcohol you drink?
• Can we measure these relationships precisely?

Throughout this chapter we will try to answer these questions and many more. From now on we present the concepts that will be necessary to get these answers.

3.3 Introduction and simple tables

We can think about many variables that can have influence over many others. For instance, we can think that the older you are, the bigger the pay you get. We are going to see if that is really true. So, as you already know from the previous chapter, the first thing we have to do is to organize our data.
We recall that the data about ages and pays that we had are the following:

Age: 16 16 16 16 17 18 16 17 17 17 19 16 17 16 17
Pay:  6  8 10  5 15 20  9 10  9  9 20 15 12  6 15

Age: 17 16 18 18 18 19 17 16 19 16 16 16 17 16 16
Pay: 12 10 25 20 30 15 12  9 20  6  9 10 25  9  9

These are the pairs of data that we have. Let us start by grouping the pairs that are equal. We get the following table:

| Age | Pay | Number |
|---|---|---|
| 16 | 5 | 1 |
| 16 | 6 | 3 |
| 16 | 8 | 1 |
| 16 | 9 | 5 |
| 16 | 10 | 3 |
| 16 | 15 | 1 |
| 17 | 9 | 2 |
| 17 | 10 | 1 |
| 17 | 12 | 3 |
| 17 | 15 | 2 |
| 17 | 25 | 1 |
| 18 | 20 | 2 |
| 18 | 25 | 1 |
| 18 | 30 | 1 |
| 19 | 15 | 1 |
| 19 | 20 | 2 |

The table we have just built is called a simple table, and it will be the starting point for our analysis.

3.4 Frequency tables, marginal distributions and conditional distributions

Is it easy for you to draw conclusions from the previous table? Can we find any other way to represent our data? The idea is to avoid the repeated values that we can see in the column of ages and also in the column of pays. We can group our data in the following way (rows are pays, columns are ages, empty cells are zero frequencies):

| Pay \ Age | 16 | 17 | 18 | 19 |
|---|---|---|---|---|
| 5 | 1 | | | |
| 6 | 3 | | | |
| 8 | 1 | | | |
| 9 | 5 | 2 | | |
| 10 | 3 | 1 | | |
| 12 | | 3 | | |
| 15 | 1 | 2 | | 1 |
| 20 | | | 2 | 2 |
| 25 | | 1 | 1 | |
| 30 | | | 1 | |

This table gives us a more global view of the distribution of the frequencies, and the more different values we have, the more useful the table is. We call it a table on two variables when we are representing two quantitative variables, and a contingency table when we have two qualitative variables.

But can we obtain from these tables the total number of people whose pay is 12 euros? And the total number of people whose age is 17? Obviously, the answer is yes. Notice that you can add all the frequencies appearing in the row related to value 12 of the pay, and so get the number of people whose pay is 12. In the same way, we can add all the frequencies in the column related to value 17 of the age, and we will have the total number of people that are 17.
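The grouping of equal pairs and the row and column sums described above can be sketched in Python. The names used here (`joint`, `age_marginal`, `pay_marginal`) are our own and only serve the illustration; the data are the 30 (age, pay) pairs listed in the text.

```python
from collections import Counter

# The 30 (age, pay) pairs from the text, in the order in which they are listed.
ages = [16, 16, 16, 16, 17, 18, 16, 17, 17, 17, 19, 16, 17, 16, 17,
        17, 16, 18, 18, 18, 19, 17, 16, 19, 16, 16, 16, 17, 16, 16]
pays = [6, 8, 10, 5, 15, 20, 9, 10, 9, 9, 20, 15, 12, 6, 15,
        12, 10, 25, 20, 30, 15, 12, 9, 20, 6, 9, 10, 25, 9, 9]

# Simple table: how many times each (age, pay) pair appears.
joint = Counter(zip(ages, pays))

# Row and column sums of the two-way table.
age_marginal = Counter()
pay_marginal = Counter()
for (age, pay), count in joint.items():
    age_marginal[age] += count
    pay_marginal[pay] += count

# Print the two-way table with pays as rows and ages as columns.
age_values = sorted(age_marginal)
print("Pay", *age_values, "Tot")
for pay in sorted(pay_marginal):
    row = [joint[(age, pay)] for age in age_values]   # missing pairs count as 0
    print(pay, *row, pay_marginal[pay])
print("Tot", *[age_marginal[age] for age in age_values], sum(joint.values()))
```

Looking up `pay_marginal[12]` and `age_marginal[17]` answers the two questions asked above.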
We add these numbers to our table and we have

| Pay \ Age | 16 | 17 | 18 | 19 | Tot |
|---|---|---|---|---|---|
| 5 | 1 | | | | 1 |
| 6 | 3 | | | | 3 |
| 8 | 1 | | | | 1 |
| 9 | 5 | 2 | | | 7 |
| 10 | 3 | 1 | | | 4 |
| 12 | | 3 | | | 3 |
| 15 | 1 | 2 | | 1 | 4 |
| 20 | | | 2 | 2 | 4 |
| 25 | | 1 | 1 | | 2 |
| 30 | | | 1 | | 1 |
| Tot | 14 | 9 | 4 | 3 | 30 |

In fact, what you have just obtained are the values of the two single variables, each one independently of the other. These values are called the marginal distributions of the variables. To obtain the whole marginal distribution of the variable age we take the first and the last row:

| Age | 16 | 17 | 18 | 19 |
|---|---|---|---|---|
| Frequency | 14 | 9 | 4 | 3 |

We can do this also for the variable pay, taking the first and the last column.

Exercise 3.4.1 Can you build the corresponding table for the variable pay?

In general, a table on two variables is defined as follows:

| X \ Y | $y_1$ | $y_2$ | ... | $y_p$ | ... | $y_m$ | Tot |
|---|---|---|---|---|---|---|---|
| $x_1$ | $n_{11}$ | $n_{12}$ | ... | $n_{1p}$ | ... | $n_{1m}$ | $n_{1*}$ |
| $x_2$ | $n_{21}$ | $n_{22}$ | ... | $n_{2p}$ | ... | $n_{2m}$ | $n_{2*}$ |
| ... | ... | ... | ... | ... | ... | ... | ... |
| $x_s$ | $n_{s1}$ | $n_{s2}$ | ... | $n_{sp}$ | ... | $n_{sm}$ | $n_{s*}$ |
| ... | ... | ... | ... | ... | ... | ... | ... |
| $x_k$ | $n_{k1}$ | $n_{k2}$ | ... | $n_{kp}$ | ... | $n_{km}$ | $n_{k*}$ |
| Tot | $n_{*1}$ | $n_{*2}$ | ... | $n_{*p}$ | ... | $n_{*m}$ | $n$ |

where the values or characteristics of $X$ are $x_1, x_2, \ldots, x_k$ and the ones of $Y$ are $y_1, y_2, \ldots, y_m$; $n_{ij}$ is the number of data points presenting characteristic $x_i$ for the variable $X$ and $y_j$ for the variable $Y$. Moreover, $n_{i*}$ denotes the number of data points presenting the characteristic $x_i$ and $n_{*j}$ the number of data points presenting the characteristic $y_j$. $n$ is the total number of elements of the population or the sample.

Once we know the marginal distributions, we can calculate the mean and the standard deviation of each of them as if they were one-dimensional variables. Their expressions are:

$$\bar{x} = \frac{\sum_{i=1}^{k} x_i\, n_{i*}}{n} , \qquad S_x = \sqrt{\frac{\sum_{i=1}^{k} (x_i - \bar{x})^2\, n_{i*}}{n}} ,$$

$$\bar{y} = \frac{\sum_{j=1}^{m} y_j\, n_{*j}}{n} , \qquad S_y = \sqrt{\frac{\sum_{j=1}^{m} (y_j - \bar{y})^2\, n_{*j}}{n}} .$$

Exercise 3.4.2 What are the mean and the standard deviation of the pay and of the age?

One of your classmates has a question. He is 17 and he wants to know whether his pay is among the higher or the lower ones, so that he can ask for a raise if it is too low.
To find out, he wants to compare himself with all the other students of his age, so he takes out the data of the students that are 17:

| Pay | 5 | 6 | 8 | 9 | 10 | 12 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|
| Age = 17 | 0 | 0 | 0 | 2 | 1 | 3 | 2 | 0 | 1 | 0 |

As this boy has a pay of 10 euros, he sees that most of his classmates have a higher pay than him, so he is going to ask for a raise.

What we have just calculated is the conditional distribution of the variable pay for a fixed value of the age, in this case 17. We have again a one-dimensional variable, for which we can calculate the measures of central tendency and of variability that we already know.

Exercise 3.4.3 Calculate the frequency table for the variable age for pay = 15 euros.

Exercise 3.4.4 Calculate the frequency table, with the marginal frequencies, for the weight and the answer to question 3.1.

3.5 Scatterplots

As usually happens with one-dimensional variables, data are more easily analyzed if we represent them in a graph. However, the situation now is different, because we need to represent two variables together with their frequencies. To do that we use a graph called a scatterplot. We now explain how to draw it: we represent the variable pay on the OX axis and the variable age on the OY axis. Each pair is drawn either as a single point whose size is proportional to its frequency, or as many points as its frequency indicates.

Figure 3.1: scatterplot

The shape of the cloud of points in the scatterplot can give us an idea of the possible dependence between the variables, as we will see in what follows.

Exercise 3.5.1 Draw the scatterplot of the variables weight and answer to question 3.1.

3.6 Functional dependence and statistical dependence

Suppose that you are studying the following pairs of variables:
• The height and the size of the foot of a person.
• The weekly pay and the height.
• The number of members of a family and the number of rooms of their house.
• The height from which we drop something and the time until it reaches the floor.
• The weight and the number of brothers/sisters.

For each of these situations, we would like to know whether there is any relationship between the variables that we study, that is, whether the value of one of them has influence over the other. Case 4 is, for instance, very clear. We have learnt in physics that there is a functional relationship between those variables, an equation relating both. In other cases we can think that there is no relationship, as in cases 2 and 5, while in cases 1 and 3 there may be a relationship, but we cannot be sure of it.

Scatterplots can have very different shapes and can help us to see how the variables behave. We will use them as a first approach, though later we will use more rigorous methods to decide whether two variables are related.

As we have just seen, there are several levels in the relationship between the variables. We say that there is functional dependence if we are in a situation similar to case 4 above, that is, $Y$ depends functionally on $X$ when we can assign to each value $x_i$ a unique value $y_j$ in such a way that $y_j = f(x_i)$. This means that the value of one variable determines exactly the value of the other one. The functional dependence is linear when all the pairs lie on a line; it is curvilinear when they lie on a curve defined by the function $y = f(x)$.

Two variables $X$ and $Y$ are said to be independent if the value of one of them has no influence over the other one. This means that the relative conditional distributions coincide.

In the rest of the situations we talk about statistical dependence or relation. This dependence can be stronger or weaker depending on the situation. We can get an idea of how strong (or weak) it is through the scatterplot, taking into account that it is stronger the closer the data are to the graph of a function.
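The criterion for independence, namely that the relative conditional distributions coincide, can be illustrated with the age and pay data of Section 3.4. The following Python sketch is only an illustration; the name `conditional_relative` is our own, and the dictionary `joint` simply copies the simple table (age, pay, number) from Section 3.3. With real data, exact coincidence is rare even for unrelated variables, so the result should be read as a rough indication.

```python
# Simple table from Section 3.3: (age, pay) -> number of students.
joint = {
    (16, 5): 1, (16, 6): 3, (16, 8): 1, (16, 9): 5, (16, 10): 3, (16, 15): 1,
    (17, 9): 2, (17, 10): 1, (17, 12): 3, (17, 15): 2, (17, 25): 1,
    (18, 20): 2, (18, 25): 1, (18, 30): 1,
    (19, 15): 1, (19, 20): 2,
}

def conditional_relative(table, age):
    """Relative frequencies of the pay among the students of the given age."""
    counts = {pay: count for (a, pay), count in table.items() if a == age}
    total = sum(counts.values())
    return {pay: count / total for pay, count in sorted(counts.items())}

ages = sorted({age for age, _ in joint})
distributions = {age: conditional_relative(joint, age) for age in ages}

for age, distribution in distributions.items():
    print(age, {pay: round(freq, 2) for pay, freq in distribution.items()})

# Under independence, every conditional distribution would be the same.
all_coincide = all(d == distributions[ages[0]] for d in distributions.values())
print("Conditional distributions coincide:", all_coincide)
```

Here the conditional distributions of the pay clearly change with the age, so the two variables are not independent in these data.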
Scatterplots in which we can see linear or curvilinear dependence are:

Figure 3.2: linear dependence

Figure 3.3: curvilinear dependence

Exercise 3.6.1 Can you draw any conclusion about the possible dependence between the weight and the answer to question 3.1 from the scatterplot you drew in the previous section?

3.7 Covariance

Recall the scatterplot of the two variables we are studying. It is not easy to conclude what kind of relationship there is between them. But, for instance, do you think that the pay grows when the age grows? Or do you think it happens the other way round? We now look for a number that allows us to decide whether the relationship is direct or inverse. For that we will use the covariance, which is defined as follows:

$$S_{xy} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{m} (x_i - \bar{x})(y_j - \bar{y})\, n_{ij}}{n} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{m} x_i\, y_j\, n_{ij}}{n} - \bar{x}\,\bar{y} .$$

The covariance is also known as the joint variance of the two variables. If the covariance is positive, the relationship is direct, and if the covariance is negative, the relationship is inverse. As we know that the average age is $16.8\overline{6}$ and the average pay is 13, we obtain that $S_{xy} = 4.5\overline{3}$, and so the relationship is direct; as we will see in the next section, it is also quite strong.

You can notice that, in the expression of the covariance, the sign of each term depends on the differences $(x_i - \bar{x})$ and $(y_j - \bar{y})$. Let us see what happens with the covariance in certain situations. We represent 3 scatterplots, in which we mark the point $(\bar{x}, \bar{y})$, which is the center of gravity of the distribution (see Figure 3.4).

Figure 3.4: covariance

We can see that in graph number 2 we have a big covariance, because the differences $(x_i - \bar{x})$ and $(y_j - \bar{y})$ always have the same sign (the points are always in the first and third quadrants defined by the axes centered on $(\bar{x}, \bar{y})$). As the products of these differences are positive, they contribute in a positive way to the sum.
In the other 2 cases there is no linear relationship, so the sum contains positive and negative terms, because we have data points in the four quadrants; some of them balance the others, and the result can be close to 0.

You can notice that the covariance is a measure that depends on the units of measurement, as happened with the variance and the standard deviation, so we shall look for another, dimensionless measure that allows us to compare distributions.

3.8 Linear correlation

We are now looking for a measure that tells us the degree of relationship between two variables (in a direct or inverse way). We also want to use it to measure the linear relationship between them. We start from the covariance that we have just presented, which depends on the product of the units of measurement of the two variables, because $(x_i - \bar{x})$ depends on the units of $X$ and $(y_j - \bar{y})$ depends on the units of $Y$, while $n_{ij}$ and $n$ are dimensionless. We should divide $S_{xy}$ by a quantity in such a way that those two units disappear. If you remember, the variance depends on the square of the units of the variable, so we cannot use it, but the standard deviation depends on the units of the variable. This means that the product $S_x S_y$ depends on the product of the units of $X$ and $Y$, and this is what we were looking for. So, we define the linear correlation coefficient as follows:

$$r = \frac{S_{xy}}{S_x S_y} .$$

Let us calculate it in our example. We know that $S_{xy} = 4.5\overline{3}$, $S_x = 1.008$ and $S_y = 6.368$, so $r = 0.706$. (These two standard deviations are obtained with divisor $n - 1$; with the divisor $n$ used in the formulas of this chapter, $S_x = 0.991$, $S_y = 6.261$ and $r = 0.731$, which leads to the same conclusion.) But what does this mean? The value of $r$ is always between −1 and 1. If the value of $r$ is near −1 or 1, then the linear dependence between the variables is strong, being direct if it is near 1 and inverse if it is near −1. If the value of $r$ is near 0, the linear dependence is weak, if it exists at all. If the value of $r$ is exactly 1 or −1, the dependence is linear and all the points belong to a line.
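The calculation of the covariance and of the linear correlation coefficient for the age and pay data can be sketched in Python as follows. This is an illustration under our own naming; the dictionary `joint` copies the simple table of Section 3.3, with the age as X and the pay as Y. The sketch computes r with divisor n throughout, and also the value obtained when the standard deviations use divisor n − 1 while the covariance uses n, which appears to be how the figure 0.706 arises.

```python
from math import sqrt

# Simple table from Section 3.3: (age, pay) -> number of students.
joint = {
    (16, 5): 1, (16, 6): 3, (16, 8): 1, (16, 9): 5, (16, 10): 3, (16, 15): 1,
    (17, 9): 2, (17, 10): 1, (17, 12): 3, (17, 15): 2, (17, 25): 1,
    (18, 20): 2, (18, 25): 1, (18, 30): 1,
    (19, 15): 1, (19, 20): 2,
}

n = sum(joint.values())
mean_x = sum(x * c for (x, _), c in joint.items()) / n   # mean age
mean_y = sum(y * c for (_, y), c in joint.items()) / n   # mean pay

# Covariance and variances with divisor n, as in the formulas of this chapter.
s_xy = sum((x - mean_x) * (y - mean_y) * c for (x, y), c in joint.items()) / n
var_x = sum((x - mean_x) ** 2 * c for (x, _), c in joint.items()) / n
var_y = sum((y - mean_y) ** 2 * c for (_, y), c in joint.items()) / n

r = s_xy / (sqrt(var_x) * sqrt(var_y))

# Standard deviations with divisor n - 1 (quasivariance), covariance still with n.
quasi_sx = sqrt(var_x * n / (n - 1))
quasi_sy = sqrt(var_y * n / (n - 1))
r_mixed = s_xy / (quasi_sx * quasi_sy)

print("S_xy =", round(s_xy, 4))
print("r (divisor n) =", round(r, 3))
print("r (S_x, S_y with n - 1) =", round(r_mixed, 3))
```

Both values are positive and fairly close to 1, so either way the relationship between age and pay is direct and rather strong.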
Then, in our example, we confirm that the relationship is direct and quite strong.

Exercise 3.8.1 Calculate the linear correlation coefficient of the variables weight and answer to question 3.1. What can we say about the relationship between them?

3.9 Regression lines

Let us suppose that you know that a boy from the high school has a pay of 18 euros, but you don't know his age. We could think about predicting the value that the variable age should have for this boy. How could we do this? Throughout this chapter we have been discussing the possible relationship between the variables, so this is the moment to use it. If we were able to write the equation that relates the age and the pay, we would only have to substitute and we would have the value that we want. But, unfortunately, this is not so simple. As we know that the linear correlation between the two variables is quite high, we can try to find the line that best fits the points, and then we can substitute the value of the pay in order to get the value of the age. This line is called the regression line. Let us define it, and later we will calculate the one for our example.

Let $X$, $Y$ be two variables. We define the regression line as the line that minimizes the sum of the squares of the distances between the data points and the estimated points.

For the regression line of $Y$ over $X$, which has the form $y = ax + b$, we have to minimize the sum of the squares of the distances between the values $y_j$ and the expected values for them, $a x_i + b$. The equation of this line is:

$$Y - \bar{y} = \frac{S_{xy}}{S_x^2} (X - \bar{x}) .$$

We will use this line when we want to estimate the value of $Y$ from the value of $X$.

In the case of the regression line of $X$ over $Y$, which has the form $x = cy + d$, we minimize the sum of the squares of the distances between the values $x_i$ and the predictions for those values, $c y_j + d$.
The equation of this line is:

$$X - \bar{x} = \frac{S_{xy}}{S_y^2} (Y - \bar{y}) .$$

We will use this line when we want to predict the value of $X$ from the value of $Y$.

Let us calculate the regression line for our example. Our variables are the pay ($X$) and the age ($Y$), and we want to predict the age from the pay, so we have to calculate the line of $Y$ over $X$. (Notice that the roles of $X$ and $Y$ are swapped with respect to the previous section.) We have that:

$\bar{x} = 13$, $\bar{y} = 16.8\overline{6}$, $S_{xy} = 4.5\overline{3}$, $S_x = 6.368$, $S_x^2 = 40.551$,

so the line we are looking for is

$$Y - 16.8\overline{6} = \frac{4.5\overline{3}}{40.551} (X - 13)$$

or equivalently

$$Y - 16.8\overline{6} = 0.112\,(X - 13) \;\Rightarrow\; Y = 0.112\,X + 15.413 ,$$

so, if the pay of this boy is $x = 18$ euros, his age should be

$$Y = 0.112 \cdot 18 + 15.413 = 17.43 ,$$

i.e., this boy should be 17 years old.

We have to make some remarks about the regression lines. The first one is that the two regression lines ($X$ over $Y$ and $Y$ over $X$) intersect at the point $(\bar{x}, \bar{y})$, except when the linear correlation is 1 or −1, in which case the two lines coincide.

If we want to make predictions using the regression line, we should make sure that we are in one of the following situations:
• We can conclude from the scatterplot that there is a possible linear relationship between the variables.
• The linear correlation coefficient is near 1 or −1.
• Common sense tells us that there is a possible relationship between the variables.

An alternative way of expressing the regression lines is the following:
• The regression line of $Y$ over $X$ is $y = ax + b$, where

$$a = \frac{S_{xy}}{S_x^2} , \qquad b = \bar{y} - \frac{S_{xy}}{S_x^2}\, \bar{x} .$$

• The regression line of $X$ over $Y$ is $x = cy + d$, where

$$c = \frac{S_{xy}}{S_y^2} , \qquad d = \bar{x} - \frac{S_{xy}}{S_y^2}\, \bar{y} .$$

Exercise 3.9.1 Calculate the regression lines for the variables weight and answer to question 3.1. If a student weighs 67 kg, can you predict what the answer to question 3.1 will be?
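The slope and intercept formulas for the line of Y over X can be tried on the pay and age example with a small Python sketch. It is only an illustration: the names `slope`, `intercept` and `predict_age` are our own, and the summary values are the ones quoted in the text (the mean age 16.866... is written as 506/30 and the covariance 4.533... as 68/15). Since the slope is not rounded here, the prediction may differ in the second decimal from a hand calculation with a rounded slope.

```python
# Summary values quoted in the text, with the pay as X and the age as Y.
mean_pay = 13.0
mean_age = 506 / 30     # 16.8666..., the mean age
s_xy = 68 / 15          # 4.5333..., the covariance of pay and age
var_pay = 40.551        # S_x^2 as reported in the text

# Regression line of Y over X: y = a*x + b.
slope = s_xy / var_pay
intercept = mean_age - slope * mean_pay

def predict_age(pay):
    """Age predicted by the regression line of age over pay."""
    return slope * pay + intercept

print("slope =", round(slope, 4))
print("intercept =", round(intercept, 3))
print("predicted age for a pay of 18 euros =", round(predict_age(18), 2))
```

As a sanity check, the line passes through the point of means: `predict_age(mean_pay)` returns the mean age.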
