MaMaEuSch
Management Mathematics for
European Schools
http://www.mathematik.unikl.de/~mamaeusch
An approach to Descriptive Statistics through real situations
Paula Lagares Barreiro (1)
Federico Perea Rojas-Marcos (1)
Justo Puerto Albandoz (1)
MaMaEuSch (2)
Management Mathematics for European Schools
94342 - CP - 1 - 2001 - 1 - DE - COMENIUS - C21
(1) University of Seville
(2) This project has been carried out with the partial support of the European Community in the
framework of the Sokrates programme. The content does not necessarily reflect the position of the
European Community, nor does it involve any responsibility on the part of the European Community.
Contents
1 One-dimensional Descriptive Statistics
1.1 Objectives
1.2 The example: an opinion poll
1.3 Population and samples
1.4 Types of statistical variables: quantitative (discrete and continuous) and qualitative
1.5 Frequency tables: absolute, relative and percentage frequencies
1.6 Graphical methods
1.6.1 Bar graph
1.6.2 Histogram
1.6.3 Frequency polygon
1.6.4 Pie Chart
1.6.5 Pictogram
1.6.6 Stem and leaf plot
1.6.7 Some remarks
1.7 Measures of central tendency: mean, median, mode, quantiles
1.8 Measures of variability: Range, variance, standard deviation
1.9 Joint use of the mean and the standard deviation: Tchebicheff’s theorem, Pearson’s coefficient of variation, z-scores
1.9.1 Tchebicheff’s theorem
1.9.2 Pearson’s coefficient of variation
1.9.3 Z-scores
2 Analysis of the opinion poll
2.1 Conclusions
3 Two-dimensional Descriptive Statistics
3.1 Objectives
3.2 The example: an opinion poll
3.3 Introduction and simple tables
3.4 Frequency tables, marginal distributions and conditional distributions
3.5 Scatterplots
3.6 Functional dependence and statistical dependence
3.7 Covariance
3.8 Linear correlation
3.9 Regression lines
Chapter 1
One-dimensional Descriptive Statistics
We are going to study an opinion poll. You will fill in a questionnaire so that we can see what you
think about many topics, and we will study some characteristics such as height, number of
brothers/sisters, etc. We will check whether your opinions coincide with those of your friends and
whether there are many people in your classroom with characteristics similar to yours. For instance,
how many of your classmates are taller than you? And how many of them have the same number of
brothers/sisters as you? Before continuing, we state the main objectives that we want to achieve in
this chapter.
1.1 Objectives
• To distinguish the different types of statistics.
• To determine which type of statistical procedure we should use, depending on the type of data
that we are studying.
• To get to know the concepts of central tendency and variability of a set of data.
• To determine the parameters of a statistical distribution.
• To study the coefficient of variation.
• To motivate through information given in examples and exercises about social, ecological and
economic topics, etc.
1.2 The example: an opinion poll
From now on, we will work with an opinion poll. We want to know some things about the students
in your class. We will ask you for some personal data, and then you will give us some information
and opinions about many topics, such as sports, food, etc. Our poll will be anonymous, so each of
you can feel free to answer without worrying about who reads those opinions later. With these data
we will pose some interesting questions about ourselves as a group, which we can perhaps use as a
guide to answer other questions about a wider group of people. For instance:
• Which is the most frequent height in your class?
• Can you consider your weekly pay normal compared with that of your classmates?
• How many of you practise sports often? How many have breakfast before coming to school?
• What kind of food do you eat most: fruit, milk, coffee, fish . . . ?
We will see that, by analyzing the answers we get in the poll, you will be able to answer all the
questions we have posed. By the end of this chapter we will have all the answers. But first of all,
we are going to present the concepts that you will need.
1.3 Population and samples
Before answering all those questions, we have to clarify some things. Who do we want to get
information about? We have already said that we want to know things about the students of your
level, so our population will be not only the students of this class but all the students of your
level. But it would take too long to ask all those students, so we have decided to take a
representative group from all the classrooms of your level: your class, in this case. So you are the
sample. Furthermore, each member of the population is called a data point.
Let us make some comments about what we have just said. First of all, we may want to study some
characteristic of animals, plants or things, for instance the battery life of mobile phones; in this
case the population is not "human", but consists of the different types of mobile phones. Moreover,
there are situations in which sampling is even more justified than in our case, for different
reasons. If we want to know how all Spanish people will vote, we cannot ask all the inhabitants
older than 18, because they are millions of people and that would take a lot of time and money. To
study, for example, the average life of light bulbs, we cannot test all of them, because each test
means that a bulb burns out; this is an example of a situation in which sampling means destroying a
data point. Therefore, sampling is justified in many situations for reasons of time, money or
destruction of the data points.
Exercise 1.3.1 The University studies demand poll in Andalusia was carried out in 2001 to find out
what the 65356 high school students wanted to study and why. To do so, data from 8500 students from
all over Andalusia were collected. Which are the sample and the population in this example? What are
the reasons for choosing a sample in this example?
1.4 Types of statistical variables: quantitative (discrete and continuous) and qualitative
In order to answer many of our questions correctly, the first thing we must do is decide which kind
of method to apply to our data. Notice that not all the data we collect are of the same kind. For
instance, think about the answers to three questions of our poll:
1. The answer to the question sex (male or female).
2. The answer to the question number of brothers/sisters.
3. The answer to the question height.
The first thing we notice is that the answer to the first question is not numerical, whereas the
answers to questions two and three are numerical. The characteristic corresponding to the first
question is called qualitative, whereas the ones related to questions two and three are called
quantitative. It is easy to see that quantitative variables allow operations that we cannot perform
with qualitative characteristics. We call the different possibilities of a qualitative variable
categories, and those of a quantitative variable values. Let us now see the differences between
variables 2 and 3, because this is a little more complicated. The variable number of
brothers/sisters takes numerical values that we can call "isolated" (0, 1, 2, 3, . . . ), but it
cannot take any value between two of those; for instance, it cannot have the value 3.5. This does
not happen with the variable height. In fact, height can have any value between certain limits, and
we can measure it as precisely as we want. We say that height can take any value in an interval. So
the variable in case 2 is called discrete and the variable in case 3 is called continuous.
Exercise 1.4.1 Decide whether these variables are qualitative or quantitative and, if they are quantitative, whether they are discrete or continuous:
1. Number of babies born in a day.
2. Blood group of a person.
3. Time needed to solve a problem.
4. Number of questions in an exam.
5. Temperature of a person.
6. Political party voted for in the last elections.
7. Number of goals scored by a player in a season.
1.5 Frequency tables: absolute, relative and percentage frequencies
It is now time to start processing the data we have collected with our poll. The data that we
have about number of brothers/sisters are
0 1 3 2 0 1 0 1 1 2 2 3 1 2 1 1 1 1 0 0 4 2 3 1 2 1 2 1 1 0
Meanwhile, for the weights (in kg) we have
52 66 54 70 46 62 59 68 49 50 77 57 63 67 58 54 52 47 74 72 80 82 60 75 53 55 69 67 50 52
We can pose a lot of questions: how many of my classmates have the same number of brothers/sisters
as I have? How many of them have more than me? And fewer than me? How many of my classmates weigh
more than me? And less than me? To answer these questions, we have to count how many times each
answer appears. Let us start by counting the ones related to the number of brothers/sisters. This is
what we have:
0   ||||| | → 6
1   ||||| ||||| ||| → 13
2   ||||| || → 7
3   ||| → 3
4   | → 1
So we now know that there are 13 people who have 1 brother/sister. This number is called the
absolute frequency and we denote it by $n_i$. And how many people have at most 1 brother/sister? In
our case, these are the people who have 0 or 1 brother/sister, that is, 6 + 13 = 19. This number is
called the cumulative absolute frequency and we denote it by $N_i$. We can now write the table of
absolute and cumulative absolute frequencies:
N. bro/sis   absolute fr.   cum. absolute fr.
0            6              6
1            13             13 + 6 = 19
2            7              13 + 6 + 7 = 26
3            3              13 + 6 + 7 + 3 = 29
4            1              13 + 6 + 7 + 3 + 1 = 30
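The counting above can also be reproduced with a few lines of code. The following is only a minimal sketch in Python (a language the original text does not use); the variable names are our own choice.

```python
from collections import Counter
from itertools import accumulate

# Number of brothers/sisters reported by the 30 students (data from the text).
siblings = [0, 1, 3, 2, 0, 1, 0, 1, 1, 2, 2, 3, 1, 2, 1,
            1, 1, 1, 0, 0, 4, 2, 3, 1, 2, 1, 2, 1, 1, 0]

counts = Counter(siblings)                 # how many times each answer appears
values = sorted(counts)                    # values in increasing order
absolute = [counts[v] for v in values]     # absolute frequencies n_i
cumulative = list(accumulate(absolute))    # cumulative absolute frequencies N_i

for v, n_i, N_i in zip(values, absolute, cumulative):
    print(v, n_i, N_i)
```

Sorting the values before accumulating is what makes the last cumulative frequency equal to the total number of data.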
It is important to put the values of the characteristic in order from the smallest to the biggest
if we want to calculate the cumulative frequencies correctly. We are now going to define other
kinds of frequencies, because it is interesting to know the proportion of the total that a
particular value represents; that is how we can compare it with other populations. For instance,
in our case there are 6 students who have 0 brothers/sisters. Suppose we have also asked a group of
50 people and found that 9 of them have 0 brothers/sisters. In which of the two groups is there a
bigger proportion of people with no brothers/sisters? It is easy to see that the proportions are
$$\frac{6}{30} = 0.2 \quad \text{and} \quad \frac{9}{50} = 0.18$$
So the proportion is bigger in our group of 30 people. This proportion is called the relative
frequency and we denote it by $f_i$. If we express it as a percentage (multiplying by 100) we get the
percentage frequency, which in our case is 20% and 18% respectively. We denote these frequencies by
$p_i$. We now add all these frequencies to our table and we get
Bro/sis   absolute fr.   relative fr.         percentage fr.   cum. abs. fr.              cum. rel. fr.
0         6              6/30 = 0.2           20%              6                          0.2
1         13             13/30 = 0.4333...    43.33...%        13 + 6 = 19                0.6333...
2         7              7/30 = 0.2333...     23.33...%        13 + 6 + 7 = 26            0.8666...
3         3              3/30 = 0.1           10%              13 + 6 + 7 + 3 = 29        0.9666...
4         1              1/30 = 0.0333...     3.33...%         13 + 6 + 7 + 3 + 1 = 30    1
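As a hedged illustration, the relative, percentage and cumulative relative frequencies of the table can be derived from the absolute frequencies alone. This Python sketch uses exact fractions to avoid rounding the recurring decimals; the dictionary names are our own.

```python
from fractions import Fraction
from itertools import accumulate

# Absolute frequencies n_i of the number of brothers/sisters (from the table).
absolute = {0: 6, 1: 13, 2: 7, 3: 3, 4: 1}
n = sum(absolute.values())                                      # 30 data points

relative = {v: Fraction(n_i, n) for v, n_i in absolute.items()}  # f_i = n_i / n
percentage = {v: float(100 * f) for v, f in relative.items()}    # p_i = 100 * f_i
cum_relative = dict(zip(relative, accumulate(relative.values())))

for v in absolute:
    print(f"{v}  {absolute[v]:2d}  {float(relative[v]):.4f}  "
          f"{percentage[v]:6.2f}%  {float(cum_relative[v]):.4f}")
```

The last cumulative relative frequency must always be 1, which is a quick way to check a table built by hand.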
Let us now analyze the weight data. We count the different values:
46   | → 1
47   | → 1
49   | → 1
50   || → 2
52   ||| → 3
53   | → 1
54   || → 2
55   | → 1
57   | → 1
58   | → 1
59   | → 1
60   | → 1
62   | → 1
63   | → 1
66   | → 1
67   || → 2
68   | → 1
69   | → 1
70   | → 1
72   | → 1
74   | → 1
75   | → 1
77   | → 1
80   | → 1
82   | → 1
As you can see, most of the values have frequency 1 and our variable takes 25 different values.
Those are too many different values to represent in a table (even more so when we only have 30 data
points). How can we get a table that represents the distribution of the data better? It seems
logical to group similar data in intervals. There is a complete theory about how to group data
properly. These are the main points we want to stress:
• The number of classes should be neither too high (around 6 to 8 is the maximum number we usually
work with) nor too low (it makes no sense to group in 2 or 3 classes, because we would lose a lot
of information).
• Except perhaps for the extreme classes, all the intervals should have the same width; otherwise
the information can be misinterpreted.
Can you imagine which intervals we are looking for? You can start by thinking about the number of
classes you want to have. Note that between the highest value (82) and the lowest value (46) there
is a difference of 36 kg. If we want to group the data in 6 classes, the width of each interval
should be 36/6 = 6. So we obtain the following intervals: [46,52], (52,58], (58,64], (64,70],
(70,76], (76,82]. This is one possible classification though, of course, there are many others. In
some analyses you may find that the first interval is of the kind "smaller than 52" and the last
interval "greater than 76". Such an interval is treated as having the same width as the others when
making calculations. Once we have decided how to group the data, we can calculate the frequencies:
Weight    absolute fr.   relative fr.   percentage fr.   cum. abs. fr.   cum. rel. fr.
[46,52]   8              0.2666...      26.66...%        8               0.2666...
(52,58]   6              0.2            20%              14              0.4666...
(58,64]   4              0.1333...      13.33...%        18              0.6
(64,70]   6              0.2            20%              24              0.8
(70,76]   3              0.1            10%              27              0.9
(76,82]   3              0.1            10%              30              1
Moreover, when we work with grouped data we need to choose a representative of each interval. We
call it the class mark, and it is the midpoint of the interval (the lower limit of the interval
plus the upper limit, divided by 2).
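The grouping into 6 intervals of equal width and the class marks can be sketched as follows. This is only an illustrative Python sketch under the convention used above (first interval closed on both sides, the others open on the left); the helper names are assumptions of ours.

```python
# Weights in kg of the 30 students (data from the text).
weights = [52, 66, 54, 70, 46, 62, 59, 68, 49, 50, 77, 57, 63, 67, 58,
           54, 52, 47, 74, 72, 80, 82, 60, 75, 53, 55, 69, 67, 50, 52]

k = 6                                              # chosen number of classes
low, high = min(weights), max(weights)
width = (high - low) / k                           # (82 - 46) / 6 = 6
bounds = [low + i * width for i in range(k + 1)]   # 46, 52, 58, ..., 82

freq = [0] * k
for x in weights:
    for i in range(k):
        # The first interval is [46, 52]; the others are (lower, upper].
        above_lower = x >= bounds[i] if i == 0 else x > bounds[i]
        if above_lower and x <= bounds[i + 1]:
            freq[i] += 1
            break

# Class mark: midpoint of each interval.
class_marks = [(bounds[i] + bounds[i + 1]) / 2 for i in range(k)]

print(freq)          # absolute frequency of each interval
print(class_marks)   # representative value of each interval
```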
Exercise 1.5.1 Make the frequency tables for the variable "answers to question 1.3" and for the
answers to the question "height", deciding beforehand whether it is necessary to group the data in
intervals or not.
1.6 Graphical methods
Once we have the frequency tables, imagine that your teacher asks you to present the conclusions
you have obtained to the rest of the students. You can present your frequency tables and talk
about the main conclusions, but is there any way of presenting the data so that the main
conclusions can be seen more easily? As you can guess, the answer to this question is yes. You may
have seen in books, and especially in the media, that data are usually presented graphically, which
makes them more attractive to people and also easier to analyze. In this section we want to show
the different types of graphs, and we are going to stress how important it is to choose the right
type of graph for the data we are working with. Since we already have the frequency tables for the
variables weight and number of brothers/sisters, we are going to use them to introduce the
different graphs.
1.6.1 Bar graph
The first kind of graph we are going to study is the bar graph. This graph is used for qualitative
variables and for discrete variables not grouped in intervals. We already know that the number of
brothers/sisters is a discrete variable, so let us see how to build a bar graph using those data.
On the OX axis we place the categories if we have a qualitative variable, or the values if we have
a discrete variable; in our example, those values are 0, 1, 2, 3 and 4. Over each of these values
we place a rectangle or bar of equal base, with a height proportional to the corresponding
frequency. In our case, we get a graph like this:
Figure 1.1: brothers/sisters (vertical bars)
Sometimes this graph is also presented with horizontal bars, like this:
Figure 1.2: brothers/sisters (horizontal bars)
1.6.2 Histogram
A histogram is a graph very similar to the bar graph, but it is used for variables grouped in
intervals. We are going to build a histogram for the variable weight. As before, we represent the
intervals on the OX axis and, over each of them, we place a rectangle whose base has the same width
as the interval and whose height is such that the area of the rectangle is proportional to the
frequency of the interval. In this kind of graph the areas of the rectangles are very important,
because a bar does not correspond to a single point: its width represents the whole interval. So,
if all our intervals have the same width, the height can be the frequency; if not, we must modify
the height in order to keep the area proportional to the frequency. Our histogram for the variable
weight, which we have already grouped, is:
Figure 1.3: weight (histogram)
We can represent it also with horizontal rectangles:
Figure 1.4: weight (histogram)
Surely you have seen a population pyramid in the media at some time. Notice that a population
pyramid is in fact two horizontal histograms (one for women and another for men) in which we
represent the number of inhabitants grouped by age.
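The rule that, in a histogram, the area and not the height must be proportional to the frequency can be illustrated with a short Python sketch. Note the assumption: to obtain intervals of different width we merge the last two weight intervals into a hypothetical interval (70, 82] with frequency 3 + 3 = 6; this merged interval does not appear in the text.

```python
# (lower limit, upper limit, absolute frequency) for each interval.
# The last interval (70, 82] is a hypothetical merge of (70,76] and (76,82].
intervals = [(46, 52, 8), (52, 58, 6), (58, 64, 4), (64, 70, 6), (70, 82, 6)]

# Height of each rectangle = frequency / width, so that area = frequency.
heights = [n_i / (upper - lower) for lower, upper, n_i in intervals]
areas = [h * (upper - lower) for h, (lower, upper, _) in zip(heights, intervals)]

for (lower, upper, n_i), h in zip(intervals, heights):
    print(f"({lower}, {upper}]  frequency {n_i}  height {h:.3f}")
```

The intervals (64, 70] and (70, 82] have the same frequency, but the wider one gets half the height, so both rectangles cover the same area.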
1.6.3 Frequency polygon
The next type of graph that we are going to define is the frequency polygon. This graph is used
for quantitative variables, discrete or continuous. In order to draw it, we start from the bar
graph or the histogram, depending on whether the variable is ungrouped or grouped. We join with a
line the midpoints of the upper sides of the bars in the bar graph or of the rectangles in the
histogram. In our two examples, for the number of brothers/sisters we get the following graph:
Figure 1.5: brothers/sisters (frequency polygon)
The case of the weight is a little bit different. In this situation, the area under the line represents
the data we have, as in the histogram, because we are talking about the whole width of the interval.
The graph looks like this:
Figure 1.6: weight (frequency polygon)
All the graphs that we have seen before can be drawn also for relative frequencies and for
cumulative frequencies.
1.6.4 Pie Chart
The next type of graph that we are going to present is a well-known one, the pie chart. In a pie
chart, we assign to each category or value a sector of a circle whose area is proportional to the
frequency. This graph is usually used for qualitative variables and ungrouped discrete variables.
Figure 1.7: brothers/sisters (pie chart)
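Since the area of a sector is proportional to its central angle, the sectors of the pie chart can be obtained by sharing the 360 degrees of the circle according to the relative frequencies. A minimal Python sketch, with names of our own choosing:

```python
# Absolute frequencies of the number of brothers/sisters (from the table).
absolute = {0: 6, 1: 13, 2: 7, 3: 3, 4: 1}
n = sum(absolute.values())

# Central angle of each sector, in degrees: 360 times the relative frequency.
angles = {v: 360 * n_i / n for v, n_i in absolute.items()}

for v, angle in angles.items():
    print(f"{v} brothers/sisters: {angle:.0f} degrees")
```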
1.6.5 Pictogram
Pictograms are a kind of graph that is very frequent in the media. They are graphs in which a
picture related to the variable is used to represent the frequencies. But we have to stress
something again: the size of the picture (and not only its height) has to be proportional to the
frequency that we want to represent. It is usual to write the frequency beside the picture as
well, to avoid mistakes.
1.6.6 Stem and leaf plot
There is a representation that lies between a graph and a tally of the data: the stem and leaf
plot. We are going to see how to make it using the weight example. Recall that the data we had are:
52 66 54 70 46 62 59 68 49 50 77 57 63 67 58 54 52 47 74 72 80 82 60 75 53 55 69 67 50 52
In a stem and leaf plot, the first thing we do is write in a column the different tens digits that
appear in the data. In our example, as our values range between 46 and 82, we write 4, 5, 6, 7 and
8 in the following way:
4
5
6
7
8
Next, we take the first observation, 52, and we place its units digit beside the corresponding tens
digit, that is:
4
5 2
6
7
8
We keep placing the units digits beside the tens digits for the rest of the data. What we get is
something like this:
4 697
5 249078423502
6 62837097
7 07425
8 02
Notice that we have something similar (but not identical) to a bar graph or a histogram. Obviously
we could have made it vertically, and we would have something like this:
  2
  0
  5
  3
  2 7
  4 9
  8 0
  7 7 5
  0 3 2
7 9 8 4
9 4 2 7 2
6 2 6 0 0
4 5 6 7 8
That looks like a histogram or a bar graph, though it is not one. But the stem and leaf plot can be
taken as an approximation to the distribution of the data. In fact, we have only divided the data
in tens (from 40 to 49, from 50 to 59, . . . ), but we could divide them in groups of 5 (from 40 to
44, from 45 to 49, from 50 to 54, . . . ) just by writing each tens digit twice: beside the first
copy we place the units digits between 0 and 4, and beside the second one the units digits between
5 and 9. In our example, and for the horizontal case, we would have:
4
4 697
5 24042302
5 9785
6 230
6 68797
7 042
7 75
8 02
8
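Both horizontal plots above can be reproduced mechanically by splitting each value into its tens digit (the stem) and its units digit (the leaf). The following Python sketch is only an illustration for two-digit data like ours; the function names are hypothetical.

```python
from collections import defaultdict

weights = [52, 66, 54, 70, 46, 62, 59, 68, 49, 50, 77, 57, 63, 67, 58,
           54, 52, 47, 74, 72, 80, 82, 60, 75, 53, 55, 69, 67, 50, 52]

def stem_and_leaf(data):
    """Map each tens digit to the string of its units digits, in data order."""
    leaves = defaultdict(list)
    for x in data:
        stem, leaf = divmod(x, 10)       # 52 -> stem 5, leaf 2
        leaves[stem].append(leaf)
    return {stem: "".join(str(d) for d in leaves[stem])
            for stem in range(min(leaves), max(leaves) + 1)}

def split_stem_and_leaf(data):
    """Each stem appears twice: first with leaves 0-4, then with leaves 5-9."""
    rows = []
    for stem, digits in stem_and_leaf(data).items():
        rows.append((stem, "".join(d for d in digits if d <= "4")))
        rows.append((stem, "".join(d for d in digits if d >= "5")))
    return rows

for stem, digits in stem_and_leaf(weights).items():
    print(stem, digits)
for stem, digits in split_stem_and_leaf(weights):
    print(stem, digits)
```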
1.6.7 Some remarks
Imagine that you see the following two graphs showing the profits of a company. Which one would you
choose to be your company?
Figure 1.8: profits (company 1 and company 2)
Most of you would choose company 2, because it surely looks better than company 1, but in fact the
data in the two graphs are the same. We have only changed the scale of the OY axis. We will make
some remarks before starting the next section. Graphs are a very useful tool and they make it
easier to draw conclusions from our data, but it is necessary to draw them correctly in order to
avoid mistakes. It is very important to keep the proportions among the pictures we represent and to
make sure that the axis scales remain proportional, because small changes in scale make big
differences in appearance and the graph can be misunderstood.
1.7 Measures of central tendency: mean, median, mode, quantiles
Let us suppose now that we are planning a trip with the whole class and we want to earn some money,
so we have decided to sell t-shirts, but we do not know what the appropriate price is. The only
thing we know is that we pay 4 euros for each of them. We would like to make a profit, but we
cannot set a high price because we want everybody to buy our t-shirts. We think that the weekly pay
is a good reference for what the students can afford. So we are going to use the weekly pay data
(in euros) that we have:
6 8 10 5 15 20 9 10 9 9 20 15 12 6 15 12 10 25 20 30 15 12 9 20 6 9 10 25 9 9
We have 30 values, but we need a single value to represent them all. Which value can we choose? A
first solution might be to choose an intermediate value among all the data we have. To get it, we
add up all the numbers and divide the sum by the total number of data, so we have:
$$\bar{x} = \frac{6 + 8 + 10 + 5 + 15 + 20 + 9 + 10 + 9 + 9 + 20 + 15 + 12 + 6 + 15 + 12 + 10 + 25 + 20 + 30 + 15 + 12 + 9 + 20 + 6 + 9 + 10 + 25 + 9 + 9}{30} = \frac{390}{30} = 13$$
Now we have the first possible price, 13 euros. The number we have just calculated is called the
mean. But there are more possibilities; for instance, we can choose the most frequent value to
represent our data. In our example, the most frequent value is 9, which can also be a good choice
for the price. We call the most frequent value the mode. But neither of those two numbers says
anything about the number of people who can afford the t-shirt. So we have another idea. Let us
sort the data we have:
5 6 6 6 8 9 9 9 9 9 9 9 10 10 10 10 12 12 12 15 15 15 15 20 20 20 20 25 25 30
Now we want to find the value that leaves half of the data on each side. The values in positions 15
and 16 leave 14 values on each side and, as both of them are equal to 10, we can consider that 10
is the value that leaves half of the data on each side. This number is called the median. Just as
we have proposed a value that leaves 50% of the data on each side, we can look for a price that 75%
of the class can afford, that is, the value that leaves 25% of the data on its left (meaning that
only 25% of the data are lower than that value), or for any other percentage. These numbers are
called quantiles.
We can now choose any of those three values, depending on what we intend in each case or on which
value best represents the whole data set. Those three values are not always valid for every case,
but they can help us to see where the center of the distribution is. These are the main measures of
central tendency. We are now going to define formally the concepts that we have presented. From now
on we speak about variables.
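The three candidate prices discussed above can be checked with Python's standard `statistics` module. The sketch below also counts, for each candidate price, the proportion of students whose weekly pay is at least that price; the helper `share_who_can_afford` is a hypothetical name of ours, and "can afford" is read here simply as "weekly pay greater than or equal to the price".

```python
import statistics

# Weekly pay in euros of the 30 students (data from the text).
pay = [6, 8, 10, 5, 15, 20, 9, 10, 9, 9, 20, 15, 12, 6, 15,
       12, 10, 25, 20, 30, 15, 12, 9, 20, 6, 9, 10, 25, 9, 9]

mean = statistics.mean(pay)      # sum of the data divided by their number
mode = statistics.mode(pay)      # most frequent value
median = statistics.median(pay)  # central value of the sorted data

def share_who_can_afford(price, data):
    """Proportion of students whose weekly pay is at least `price`."""
    return sum(1 for x in data if x >= price) / len(data)

for name, price in (("mean", mean), ("median", median), ("mode", mode)):
    print(name, price, round(share_who_can_afford(price, pay), 3))
```

With this reading, a lower price such as the mode is affordable for a larger share of the class than the mean, which is the trade-off the text describes.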
Let us suppose that we have observed a variable in $n$ data points and we have obtained $k$
different values, $x_1, x_2, \ldots, x_k$, with frequencies $n_1, n_2, \ldots, n_k$, where $n_i$ is
the absolute frequency of the value $x_i$. We denote by $N_i = \sum_{j \le i} n_j$ the cumulative
absolute frequency of the value $x_i$ and by $f_i = \frac{n_i}{n}$ the relative frequency. If the
values of the variable are grouped, we suppose we have $h$ intervals that we denote by
$$(L_0, L_1], (L_1, L_2], \ldots, (L_{h-1}, L_h]$$
whose class marks are $c_1, c_2, \ldots, c_h$. In this case, the absolute frequencies are denoted by
$n_1, n_2, \ldots, n_h$, the cumulative absolute frequencies by $N_1, N_2, \ldots, N_h = n$ and the
relative frequencies by $f_1, f_2, \ldots, f_h$.
Then the mean is defined as follows:
$$\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{n}$$
This formula applies to ungrouped variables. If we have a grouped variable, we use the class marks
$c_i$ instead of the values $x_i$. The main characteristics of the mean are the following:
• It is the center of gravity of the distribution and it is unique.
• When there are extreme or scarcely representative values (too big or too small), the mean may
not be representative.
• It makes no sense to calculate the mean for a qualitative variable, or for grouped data when any
of the intervals is unbounded.
• For grouped data, we use the class mark of each interval to calculate the mean.
Moreover, the mean has the following properties:
• If a constant is added to each value, the mean increases by that constant as well.
• If we multiply all the values by a constant, the mean is also multiplied by the same constant.
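The definition of the mean from a frequency table, and its two properties, can be sketched in Python as follows. The function name `weighted_mean` and the constants 10 and 2 are our own illustrative choices; the frequency tables are the ones built earlier for the number of brothers/sisters and for the grouped weights.

```python
def weighted_mean(values, freqs):
    """Mean of a frequency table: sum of x_i * n_i divided by n."""
    n = sum(freqs)
    return sum(x * n_i for x, n_i in zip(values, freqs)) / n

# Ungrouped variable: number of brothers/sisters.
sib_values, sib_freqs = [0, 1, 2, 3, 4], [6, 13, 7, 3, 1]
siblings_mean = weighted_mean(sib_values, sib_freqs)

# Grouped variable: weights, using the class marks instead of the values.
class_marks, weight_freqs = [49, 55, 61, 67, 73, 79], [8, 6, 4, 6, 3, 3]
weights_mean = weighted_mean(class_marks, weight_freqs)

# Properties: adding a constant shifts the mean, multiplying scales it.
shifted = weighted_mean([c + 10 for c in class_marks], weight_freqs)
scaled = weighted_mean([2 * c for c in class_marks], weight_freqs)

print(siblings_mean, weights_mean, shifted, scaled)
```

The grouped mean is only an approximation of the mean of the raw weights, because every value in an interval is replaced by its class mark.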
The mode is usually defined as the most frequent value. For an ungrouped variable, it is the value
that appears the most times. For a variable grouped in intervals of the same width, we look for the
interval with the highest frequency (modal class or modal interval) and approximate the mode
through the formula
$$Mo = L_{i-1} + \frac{n_i - n_{i-1}}{(n_i - n_{i-1}) + (n_i - n_{i+1})} \cdot c_i$$
where:
$L_{i-1}$ is the lower limit of the modal interval.
$n_i$ is the absolute frequency of the modal interval.
$n_{i-1}$ is the absolute frequency of the interval preceding the modal interval.
$n_{i+1}$ is the absolute frequency of the interval following the modal interval.
$c_i$ is the width of the interval.
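As a hedged sketch, the formula for the mode of a grouped variable can be applied to the weight table as follows. One assumption goes beyond the text: when the modal interval is the first or the last one, the missing neighbouring frequency is taken as 0. The function name is ours.

```python
def grouped_mode(lower_limits, width, freqs):
    """Approximate mode for intervals of equal width."""
    i = max(range(len(freqs)), key=freqs.__getitem__)    # modal interval
    n_i = freqs[i]
    # Assumption: a missing neighbour (first or last interval) counts as 0.
    n_prev = freqs[i - 1] if i > 0 else 0
    n_next = freqs[i + 1] if i < len(freqs) - 1 else 0
    return lower_limits[i] + (n_i - n_prev) / ((n_i - n_prev) + (n_i - n_next)) * width

lower_limits = [46, 52, 58, 64, 70, 76]    # lower limits of the weight intervals
freqs = [8, 6, 4, 6, 3, 3]                 # absolute frequencies
mode_weight = grouped_mode(lower_limits, 6, freqs)
print(round(mode_weight, 2))
```

Here the modal interval is [46,52], so the approximation is 46 + 8/(8 + 2) · 6 = 50.8 under the stated assumption.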
The mode has the following characteristics:
• A distribution can have more than one mode. In that case, we say that we have a bimodal,
trimodal, . . . distribution, depending on the number of values with the highest absolute
frequency.
• The mode is usually a worse representative than the mean, except in the case of qualitative
data.
• If the intervals have different widths, we have to look for the interval with the highest
frequency density (the result of dividing the absolute frequency by the width of the interval,
$\frac{n_i}{c_i}$) and then we use the preceding formula.
The median is, in the case of an ungrouped variable and once we have sorted our data, the central
value if there is an odd number of observations, and the mean of the two central values if there is
an even number of data. If we have a grouped variable, we look for the central interval (the one
that contains the central value), that is to say, the first one in which $N_i$ is bigger than
$\frac{n}{2}$, and then we apply the formula
$$Me = L_{i-1} + \frac{\frac{n}{2} - N_{i-1}}{n_i} \cdot c_i$$
where
$L_{i-1}$ is the lower limit of the central interval.
$n_i$ is the absolute frequency of the central interval.
$N_{i-1}$ is the cumulative absolute frequency of the interval preceding the central interval.
$n$ is the number of data.
$c_i$ is the width of the interval.
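The median formula for grouped data can be tried on the weight table with the following Python sketch. The function name is hypothetical, and the cumulative frequency before the first interval is taken as 0.

```python
from itertools import accumulate

def grouped_median(lower_limits, width, freqs):
    """Median of a grouped variable with intervals of equal width."""
    n = sum(freqs)
    cumulative = list(accumulate(freqs))                  # N_1, ..., N_h
    # Central interval: the first one whose N_i is bigger than n / 2.
    i = next(j for j, N_j in enumerate(cumulative) if N_j > n / 2)
    N_prev = cumulative[i - 1] if i > 0 else 0
    return lower_limits[i] + (n / 2 - N_prev) / freqs[i] * width

lower_limits = [46, 52, 58, 64, 70, 76]
freqs = [8, 6, 4, 6, 3, 3]
median_weight = grouped_median(lower_limits, 6, freqs)
print(median_weight)
```

For the weights, n/2 = 15 and the first cumulative frequency above 15 is 18, so the central interval is (58,64] and the result is 58 + (15 − 14)/4 · 6 = 59.5.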
Moreover, the quantiles are position measures that generalize the concept of median. We are now
going to define the centiles or percentiles, the quartiles and the deciles. We suppose that we have
sorted our data. The centiles or percentiles are the values of the variable that leave a given
percentage of the data on their left. We denote them by $P_h$ or $C_h$, where $h$ is the
percentage, $h = 1, 2, \ldots, 99$. If we have a grouped variable, once we have found the interval
that contains the centile, we apply the following formula:
$$P_h = C_h = L_{i-1} + \frac{h \cdot \frac{n}{100} - N_{i-1}}{n_i} \cdot c_i$$
Here the different elements have the same meaning as in the case of the median. The quartiles are
the values that, once we have sorted the data, divide them into 4 equal groups, each containing 25%
of the data points. We denote them by $Q_1$, $Q_2$ and $Q_3$, and they satisfy
$Q_1 = C_{25}$, $Q_2 = C_{50} = Me$, $Q_3 = C_{75}$.
The deciles are the values that, once we have sorted our data, divide them into 10 equal groups, in
such a way that between any two consecutive deciles there are 10% of the data points. We denote
them by $D_1, D_2, D_3, \ldots, D_9$. They satisfy $D_1 = C_{10}$, $D_2 = C_{20}$, $D_3 = C_{30}$,
. . . , $D_9 = C_{90}$.
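The percentile formula generalizes the median formula, so one function can cover centiles, quartiles and deciles. In this Python sketch (function name ours) the interval containing the centile is located, by analogy with the median, as the first one whose cumulative frequency is bigger than h·n/100; that choice is our reading of the text.

```python
from itertools import accumulate

def grouped_percentile(h, lower_limits, width, freqs):
    """Centile C_h of a grouped variable with intervals of equal width."""
    n = sum(freqs)
    target = h * n / 100
    cumulative = list(accumulate(freqs))
    # Assumption: the centile lies in the first interval with N_i > h * n / 100.
    i = next(j for j, N_j in enumerate(cumulative) if N_j > target)
    N_prev = cumulative[i - 1] if i > 0 else 0
    return lower_limits[i] + (target - N_prev) / freqs[i] * width

lower_limits = [46, 52, 58, 64, 70, 76]
freqs = [8, 6, 4, 6, 3, 3]

c50 = grouped_percentile(50, lower_limits, 6, freqs)   # Q2 = C50 = Me
d1 = grouped_percentile(10, lower_limits, 6, freqs)    # D1 = C10
print(c50, d1)
```

The value of C50 coincides with the grouped median, as the relation Q2 = C50 = Me requires.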
Exercise 1.7.1 For the data on number of brothers/sisters and weight, calculate the mean, mode, median and the quantiles $Q_1$, $Q_3$, $C_{30}$, $C_{74}$, $D_4$, $D_9$.
1.8 Measures of variability: Range, variance, standard deviation
Imagine that we have 3 different data sets about the weights of certain people and we know that in
all 3 cases the mean of the variable weight is 55. Does this mean that the 3 sets are equal or
similar? We get the data and we find that the observations are:
Set 1: 55 55 55 55 55 55 55
Set 2: 47 51 54 55 56 59 63
Set 3: 39 47 53 55 57 63 71
We can see that, though they have the same mean, the data sets are very different. Look at their
stem and leaf plots:
Set 1:
3
4
5 5555555
6
7
Set 2:
3
4 7
5 14569
6 3
7
Set 3:
3 9
4 7
5 357
6 3
7 1
Then, how can we detect those differences among the data sets? It seems that the measures of
central tendency do not give us enough information in all situations, so we have to look for other
measures that tell us how far the data are from the mean. This means that we need the concept of
variability of the data. The first thing we notice is that in the first case all the data are
equal, in the second one there is some difference between the biggest and the smallest values, and
in the third case this difference is even more obvious. To be precise, we have
55 − 55 = 0
63 − 47 = 16
71 − 39 = 32
These numbers are called the range of the data. Although the range is very easy to calculate, it is
not used very much, because a single very small or very big value in our data changes it a lot, so
it is not a useful measure for every situation. How can we find a number that gives us an idea of
the distance between the data and the mean? We can calculate the distance from every data point to
the mean (in absolute value) and then calculate the mean of those distances. This is what we call
the mean deviation. Let us calculate the mean deviation for the second data set:
\[
\frac{|47-55| + |51-55| + |54-55| + |55-55| + |56-55| + |59-55| + |63-55|}{7}
= \frac{8+4+1+0+1+4+8}{7} = \frac{26}{7} = 3.714 .
\]
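The same calculation can be repeated for the three data sets with a short Python sketch. The helper names `data_range` and `mean_deviation` are chosen here for illustration only.

```python
data_sets = {
    "Set 1": [55, 55, 55, 55, 55, 55, 55],
    "Set 2": [47, 51, 54, 55, 56, 59, 63],
    "Set 3": [39, 47, 53, 55, 57, 63, 71],
}

def data_range(values):
    """Difference between the biggest and the smallest value."""
    return max(values) - min(values)

def mean_deviation(values):
    """Mean of the absolute deviations from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)

summary = {name: (data_range(v), mean_deviation(v)) for name, v in data_sets.items()}
for name, (rng, md) in summary.items():
    print(f"{name}: range = {rng}, mean deviation = {md:.3f}")
```

Both measures grow from the first set to the third, which matches what the stem and leaf plots suggest.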
Nevertheless, we usually use a different measure of variability, the mean of the squared
deviations of the data from the mean; because the deviations are squared, the biggest ones have a bigger
influence. Let us now present the formal definitions of all these concepts. The range is the
difference between the biggest and the smallest value of the variable, if it is not grouped. If we have
a grouped variable, we calculate the difference between the upper limit of the last interval and the
lower limit of the first interval.
The range only depends on the biggest and the smallest elements, and not on the rest of the
data. For instance, we could have the following two data sets with the same range:
Figure 1.9: range
It is easy to see that the difference between xk and x1 is the same in both situations, but the two
sets are very different. The interquartile range is the difference between the third and the first
quartiles, and it gives us a zone where we can find the central 50% of the distribution. The mean deviation
is the mean of the deviations of the data from the mean. We call deviation from the mean the
absolute value of the difference between a value of the variable and the mean, $|x_i - \bar{x}|$, so the
definition of the mean deviation is
\[
D_M = \frac{\sum_{i=1}^{k} |x_i - \bar{x}| \cdot n_i}{n} .
\]
This measure is not used very often because the absolute value function makes it awkward to
work with. In any case, a small mean deviation means that the data are highly concentrated
around the mean. We can also define the median deviation, though it is even less usual. Its
definition is:
\[
D = \frac{\sum_{i=1}^{k} |x_i - Me| \cdot n_i}{n} .
\]
The variance is the mean of the squared deviations of the data from the mean. We denote it by
$S^2$ and its expression is
\[
S^2 = \frac{\sum_{i=1}^{k} (x_i - \bar{x})^2 \cdot n_i}{n} = \frac{\sum_{i=1}^{k} x_i^2 \cdot n_i}{n} - \bar{x}^2 .
\]
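The second equality in the expression of the variance follows from expanding the square. A short derivation, using only that $\sum n_i = n$ and $\sum x_i n_i = n\bar{x}$:

```latex
\[
\frac{1}{n}\sum_{i=1}^{k} (x_i - \bar{x})^2 n_i
= \frac{1}{n}\sum_{i=1}^{k} x_i^2 n_i - \frac{2\bar{x}}{n}\sum_{i=1}^{k} x_i n_i + \frac{\bar{x}^2}{n}\sum_{i=1}^{k} n_i
= \frac{1}{n}\sum_{i=1}^{k} x_i^2 n_i - 2\bar{x}^2 + \bar{x}^2
= \frac{1}{n}\sum_{i=1}^{k} x_i^2 n_i - \bar{x}^2 .
\]
```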
The variance verifies that:
• As we are taking the square of the deviations, the bigger ones have more influence on the
result.
• The units of measure of $S^2$ are not the same as those of the sample but their square, because we have
squared the deviations.
• The variance is never negative. It is 0 only when all the values coincide with the mean.
We define the quasivariance as
\[
s^2 = \frac{\sum_{i=1}^{k} (x_i - \bar{x})^2 \cdot n_i}{n-1} ;
\]
its relation to the variance is $S^2 = \frac{n-1}{n} s^2$. This is a very useful measure when we work with
inference. Sometimes it is also denoted by $S_c^2$. The standard deviation is the square root of the
variance. We denote it by S and its expression is
\[
S = +\sqrt{\frac{\sum_{i=1}^{k} (x_i - \bar{x})^2 \cdot n_i}{n}}
= +\sqrt{\frac{\sum_{i=1}^{k} x_i^2 \cdot n_i}{n} - \bar{x}^2}
= +\sqrt{\overline{x^2} - \bar{x}^2} .
\]
Its main properties are
• It is the most usual measure of variability.
• It has the same units of measure as the sample.
• The standard deviation is always positive or 0.
Moreover, variance and standard deviation verify:
• If we add a constant to all the data, the variance and the standard deviation stay the same.
• If we multiply all the values by a positive constant, the variance is multiplied by the square
of the constant, and the standard deviation is multiplied by the constant.
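These definitions and properties can be checked numerically. The sketch below uses Set 2 from the beginning of this section; the function names and the constants 10 and 2 are arbitrary choices made for the illustration.

```python
import math

def variance(values):
    """Variance S^2 (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def quasivariance(values):
    """Quasivariance s^2 (divides by n - 1), using S^2 = (n - 1) / n * s^2."""
    n = len(values)
    return variance(values) * n / (n - 1)

set2 = [47, 51, 54, 55, 56, 59, 63]
s2 = variance(set2)
sd = math.sqrt(s2)

# Shortcut formula: mean of the squares minus the square of the mean.
mean = sum(set2) / len(set2)
shortcut = sum(v * v for v in set2) / len(set2) - mean ** 2

shifted = [v + 10 for v in set2]   # add a constant to every value
scaled = [2 * v for v in set2]     # multiply every value by a positive constant

print(f"S^2 = {s2:.3f}, S = {sd:.3f}, s^2 = {quasivariance(set2):.3f}")
print("shortcut formula agrees:", math.isclose(s2, shortcut))
print("shift leaves S^2 unchanged:", math.isclose(variance(shifted), s2))
print("scaling by 2 multiplies S^2 by 4:", math.isclose(variance(scaled), 4 * s2))
```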
1.9 Joint use of the mean and the standard deviation: Tchebicheff’s theorem, Pearson’s coefficient of variation, z-scores
1.9.1 Tchebicheff’s theorem
We have already found measures that can give us the center of the data and their variability, but
we still need more information. Let us recall the data about number of brothers/sisters:
Num brothers    0    1    2    3    4
absolute fr.    6   13    7    3    1
so we have that
\[
\bar{x} = 1.3333, \qquad S^2 = 1.022, \qquad S = 1.011 .
\]
How many people are there around the mean? Are there many students who have 1 or 2 brothers/sisters? Let us take an interval centered on the mean, that is, $(\bar{x} - a, \bar{x} + a)$. We know that
variance and standard deviation measure variability, so we will try to use them now. Which one
would you use? We should reject the variance, because we cannot add it to the mean: they have
different units of measure. Let us take then the standard deviation, a = S. Then we get the interval
(1.3333 − 1.011, 1.3333 + 1.011) = (0.3223, 2.3443). Inside this interval we find the students having 1 or 2 brothers/sisters. These are 20 of the 30 students, i.e., about 67% of them. What happens
if we use 2S instead of S? We get the interval (1.3333 − 2.022, 1.3333 + 2.022) = (−0.6887, 3.3553).
Inside this interval we have 29 of the 30 students, i.e., about 97% of them. Obviously, if we calculate
the interval for 3S we find that all the data are inside it. But the next question is: does this always
happen? Are these concentrations of data always the same? Let us see another example using the
weekly pay. We have that
\[
\bar{x} = 13, \qquad S^2 = 39.2, \qquad S = 6.26 .
\]
Then,
(13 − 6.26, 13 + 6.26) = (6.74, 19.26)       →  contains 19 data (63%)
(13 − 12.52, 13 + 12.52) = (0.48, 25.52)     →  contains 29 data (97%)
(13 − 18.78, 13 + 18.78) = (−5.78, 31.78)    →  contains 30 data (100%)
As you can see, we get very similar results. This is not a coincidence: there is a theorem that guarantees that
in these intervals we can find at least a certain percentage of the data. Precisely, the theorem states that,
for any a > 1, an interval such as $(\bar{x} - aS, \bar{x} + aS)$ contains at least $100\left(1 - \frac{1}{a^2}\right)\%$ of the data. This statement is
known as Tchebicheff’s theorem.
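As a check, the following Python sketch compares the observed share of data inside each interval with the bound given by the theorem, using the frequency table for the number of brothers/sisters. The helper `share_within` is a name introduced here for illustration.

```python
import math

# Frequency table for the number of brothers/sisters (value -> absolute frequency).
siblings = {0: 6, 1: 13, 2: 7, 3: 3, 4: 1}

n = sum(siblings.values())
mean = sum(x * f for x, f in siblings.items()) / n
sd = math.sqrt(sum((x - mean) ** 2 * f for x, f in siblings.items()) / n)

def share_within(a):
    """Fraction of the data strictly inside (mean - a*S, mean + a*S)."""
    inside = sum(f for x, f in siblings.items() if abs(x - mean) < a * sd)
    return inside / n

results = {}
for a in (1, 2, 3):
    bound = 1 - 1 / a ** 2        # minimum share guaranteed by the theorem
    results[a] = (share_within(a), bound)
    print(f"a = {a}: observed {results[a][0]:.1%}, guaranteed at least {bound:.1%}")
```

For a = 1 the theorem guarantees nothing (the bound is 0%), which is why it is only informative for a > 1. The observed shares are well above the guaranteed minimum, since the bound has to hold for every possible data set.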
1.9.2 Pearson’s coefficient of variation
We are going to work now with height and weight data. For the weight we have that
\[
\bar{x} = 60.8, \qquad S^2 = 99.56, \qquad S = 9.97 ,
\]
while for the heights we have
\[
\bar{x} = 1.7133, \qquad S^2 = 0.0128, \qquad S = 0.1132 .
\]
In which case do we have more variability? We could think that it is the weight data, because the
variance and the standard deviation are bigger, but look what happens if we calculate the same for the
heights measured in centimeters:
\[
\bar{x} = 171.33, \qquad S^2 = 128.35, \qquad S = 11.32 .
\]
If we repeat the question now, what would you answer? In fact, we can compare neither
standard deviations nor variances, because they depend on the units, just like the mean. We should
find a dimensionless measure. So far we only know that the mean and the standard deviation
have the same units of measure, so how can we get a dimensionless measure from them? We can
divide one by the other, and then we get Pearson’s coefficient of variation:
\[
CV = \frac{S}{\bar{x}} .
\]
We can calculate it for our examples. For the weight we have that
\[
CV = \frac{9.97}{60.8} = 0.164 ,
\]
and for the height
\[
CV = \frac{11.32}{171.33} = \frac{0.1132}{1.7133} = 0.066 ,
\]
so there is more variability in the weights than in the heights.
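The fact that the coefficient of variation does not depend on the units can be verified directly from the summary values given in the text. The function name below is introduced only for this sketch.

```python
def coefficient_of_variation(mean, sd):
    """Pearson's coefficient of variation CV = S / mean (the mean must not be 0)."""
    return sd / mean

cv_weight = coefficient_of_variation(60.8, 9.97)         # weight in kg
cv_height_m = coefficient_of_variation(1.7133, 0.1132)   # height in metres
cv_height_cm = coefficient_of_variation(171.33, 11.32)   # height in centimetres

print(f"weight: {cv_weight:.3f}")
print(f"height (m): {cv_height_m:.3f}, height (cm): {cv_height_cm:.3f}")
```

Changing the units multiplies the mean and the standard deviation by the same factor, so the quotient stays the same.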
1.9.3 Z-scores
We can still find more information in our data. Imagine that your height is 1.74 and you have
a friend in another class whose height is the same. Within your own classes, which of you is taller?
How can we compare these two data if we only know that the mean in your friend’s class is 1.708
and the standard deviation is 0.1253? There is a way to change these two data to "comparable" values.
These are what we call z-scores, and each one is calculated by taking the difference between the value
and its mean and dividing it by the standard deviation. With this, the new values belong
to distributions with mean 0 and standard deviation 1, and so we can compare them.
In our example we have the following z-scores:
\[
z_1 = \frac{1.74 - 1.7133}{0.1132} = 0.236 , \qquad z_2 = \frac{1.74 - 1.708}{0.1253} = 0.255 .
\]
And we conclude that your friend is taller than you (each one relative to his or her own class) because the
z-score is bigger. The formula for the z-score of the data point $x_i$ is
\[
z_i = \frac{x_i - \bar{x}}{S} .
\]
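The two z-scores can be reproduced with a one-line function. The name `z_score` is chosen here for the sketch, and the values are the ones quoted in the example.

```python
def z_score(value, mean, sd):
    """Number of standard deviations between a value and the mean of its group."""
    return (value - mean) / sd

z_you = z_score(1.74, 1.7133, 0.1132)     # your class
z_friend = z_score(1.74, 1.708, 0.1253)   # your friend's class
print(f"you: {z_you:.3f}, friend: {z_friend:.3f}")
```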
Chapter 2
Analysis of the opinion poll
We are now going to make a deeper analysis of some of the questions in the opinion poll. We have
chosen 3 of them:
2.1 You smoke
2.3 You read books other than school books
3.1 You practice some sport outside high school
The data we have from question 2.1 are
135555511513315155555515154435
from question 2.3 we have
111222344413241213211121111224
and from 3.1
313534213335512123512532415543
The first thing we are going to do is to calculate the frequencies in all cases in order to have the
frequency tables for all of them. For question 2.1 we have that
Answer (2.1)   abs fr   rel fr   perc fr   cum abs fr   cum rel fr
1                   8   0.2667    26.67%            8       0.2667
2                   0   0             0%            8       0.2667
3                   4   0.1333    13.33%           12       0.4
4                   2   0.0667     6.67%           14       0.4667
5                  16   0.5333    53.33%           30       1
For question 2.3 we have the following frequency table
Answer (2.3)   abs fr   rel fr   perc fr   cum abs fr   cum rel fr
1                  13   0.4333    43.33%           13       0.4333
2                   9   0.3          30%           22       0.7333
3                   3   0.1          10%           25       0.8333
4                   5   0.1667    16.67%           30       1
5                   0   0             0%           30       1
and finally, the frequency table for question 3.1 is
Answer (3.1)   abs fr   rel fr   perc fr   cum abs fr   cum rel fr
1                   6   0.2          20%            6       0.2
2                   5   0.1667    16.67%           11       0.3667
3                   9   0.3          30%           20       0.6667
4                   3   0.1          10%           23       0.7667
5                   7   0.2333    23.33%           30       1
Just looking at the data in the tables, we can notice that the three variables are very different.
We will now try to see graphically how these variables are distributed, and then we will talk about
the first conclusions.
As you can notice, we have three discrete variables, so we are going to use the bar graph and the
pie chart. These are the graphs for question 2.1:
Figure 2.1: answers to question 2.1
Let us represent now the graphs for question 2.3:
Figure 2.2: answers to question 2.3
and now here we have the ones for question 3.1:
Figure 2.3: answers to question 3.1
We can talk now about the first conclusions. It is quite obvious that for question 2.1 the most
frequent values are the extreme ones, 1 and 5; this is because there is a tendency to relate number
1 with the people who don’t smoke and number 5 with the people who do smoke. In any case, most
of the data are placed in the bigger values (3, 4 and 5). On the contrary, in question 2.3 we can
see that the most frequent values are the smaller ones, so we can say that reading is not a very
"popular" hobby. The answers to the third question are a little more "spread" over all the values.
It is also interesting in this example to draw a bar graph with the cumulative absolute
frequencies. We show you the three graphs, in which you can see that the frequencies are more
gradually distributed in the third case:
Figure 2.4: cumulative bar graphs
Now we are going to confirm what we see by calculating the main measures of central
tendency. We present them in a table, to make them easier to compare:
          Mean   Median   Mode
Q. 2.1     3.6        5      5
Q. 2.3       2        2      1
Q. 3.1       3        3      3
This table gives us some interesting information. It is quite simple to see that, though the mean
for question 2.1 is 3.6, most of the data are bigger than the mean, because both the median and
the mode are 5. For question 2.3 the situation is very different: we can see that most of the data
are around the smallest values, and the mode is even the smallest one. For question 3.1 we can
notice that the 3 values coincide, so number 3 is the best one to represent our
data.
Let us calculate now the main measures of variability, and then we will try to see which
variable is more spread.
          Range   Variance   Standard deviation
Q. 2.1        4          3                 1.73
Q. 2.3        3       1.24                 1.11
Q. 3.1        4       2.06                 1.43
In our example, the range is not very relevant, because all the answers range between 1 and 5. The
only thing we can learn from the fact that in question 2.3 the range is 3 (smaller than the others)
is that one of the extreme values has frequency 0 (in this case, value 5); but notice, for example,
that for question 2.1 the frequency of value 2 is also 0, and the range does not reflect it. From the standard deviation we can
conclude that the answers to question 2.1 are very spread. This is true because, if you take a look
at the data, you can find that most of them are extreme values, 1 or 5. The other two variables are
a bit more concentrated around the mean, especially the answers to question 2.3.
Let us check now whether the mean is representative in our variables. We shall calculate the
coefficient of variation in each case. We have that
          Coefficient of variation
Q. 2.1                        0.48
Q. 2.3                        0.55
Q. 3.1                        0.47
So the mean is representative for the three cases we are studying.
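The whole analysis of this chapter can be reproduced from the three strings of answers with Python's standard `statistics` module. One observation, which is an inference from the numbers and not something the text states: the variances and standard deviations in the tables above appear to correspond to the quasivariance (division by n − 1), which is what `statistics.variance` and `statistics.stdev` compute.

```python
import statistics
from collections import Counter

answers = {
    "Q. 2.1": "135555511513315155555515154435",
    "Q. 2.3": "111222344413241213211121111224",
    "Q. 3.1": "313534213335512123512532415543",
}

summary = {}
for question, digits in answers.items():
    values = [int(d) for d in digits]
    mean = statistics.mean(values)
    sd = statistics.stdev(values)                    # divides by n - 1
    summary[question] = {
        "frequencies": dict(sorted(Counter(values).items())),
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "variance": statistics.variance(values),     # divides by n - 1
        "sd": sd,
        "cv": sd / mean,
    }

for question, stats in summary.items():
    print(question, stats["frequencies"])
    print(f"  mean={stats['mean']}, median={stats['median']}, mode={stats['mode']}, "
          f"range={stats['range']}, variance={stats['variance']:.2f}, "
          f"sd={stats['sd']:.2f}, cv={stats['cv']:.2f}")
```

Small differences in the last decimal with respect to the tables come from rounding versus truncation.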
2.1 Conclusions
In this last section of the analysis, it is important to stress the meaning of the data we are
studying. Until now, we have been talking about the statistical characteristics of the data, but we
cannot forget that all those data have their own meaning.
We can notice that smoking is very popular among young people. More than half of
this class say that they smoke every day, and only 8 people say that they never smoke. If we
add the frequencies of the students who smoke at least sometimes, we get 22 of you,
almost three quarters of the total.
On the contrary, there is very little interest in reading. 22 of you say that you never or rarely
read a book other than the ones you need for school. This is maybe one of the biggest contrasts
we can get from the poll. None of you says that you read every day, though there are 5 people
who say that they read usually.
Sports are the middle ground. This is maybe because many of you practice some sport at the
weekends or when the weather is good, while the ones who practice sports very often balance the
ones who almost never practice any sport.
Chapter 3
Two-dimensional Descriptive Statistics
In the previous chapter, we were working with the data we got from a poll and we obtained the
first conclusions. But we want to know more than what we already do, because from those data we
can have more information with certain methods that we are going to study from now on. Before
going on, we will state our objectives in this chapter.
3.1 Objectives
• To represent and analyze data on two variables through a scatterplot.
• To identify as a two-dimensional distribution a data set on two variables given in a table or
by a scatterplot.
• To analyze the relationship between two variables through their scatterplot, establishing by
intuition whether this relationship is positive or negative, whether it is functional or not and, in that case,
whether it is close to a line.
• To compare global features of several distributions through their scatterplots.
• To assign given scatterplots to different situations.
• To determine the relationship between the different means through the scatterplot.
• To find, in a graphical way, a line that fits the scatterplot.
• To estimate the correlation coefficient from a scatterplot.
• To analyze the degree of the relationship between two variables when the correlation coefficient
is known.
• To calculate the correlation coefficient in two-dimensional distributions and the regression
lines.
• To make predictions from the regression line.
3.2 The example: an opinion poll
In this chapter we will go deeper into the analysis of the opinion poll we have been
working with. From the information that we already have, we will try to answer questions like
• Is there any relationship between the pay you receive and the number of brothers/sisters you
have?
• Does the sport you practice have any influence on how much you smoke or how much alcohol
you drink?
• Can we measure these relationships precisely?
Throughout this chapter we will try to answer these questions and many more. From now on we
present the concepts that will be necessary to get these answers.
3.3 Introduction and simple tables
We can think of many variables that may have an influence on many others. For instance, we
may think that the older you are, the bigger the pay you get. We are going to see whether that is really true.
So, as you already know from the previous chapter, the first thing we have to do is to organize our
data. Recall that the data about ages and pays that we had are the following:
Age   Pay      Age   Pay
16      6      17     12
16      8      16     10
16     10      18     25
16      5      18     20
17     15      18     30
18     20      19     15
16      9      17     12
17     10      16      9
17      9      19     20
17      9      16      6
19     20      16      9
16     15      16     10
17     12      17     25
16      6      16      9
17     15      16      9
These are the pairs of data that we have. Let us start grouping the pairs that are equal. We
get the following table
Age   Pay   Number
16      5        1
16      6        3
16      8        1
16      9        5
16     10        3
16     15        1
17      9        2
17     10        1
17     12        3
17     15        2
17     25        1
18     20        2
18     25        1
18     30        1
19     15        1
19     20        2
The table we have just built is called a simple table, and it will be the starting point for our
analysis.
3.4 Frequency tables, marginal distributions and conditional distributions
Is it simple for you to draw conclusions from the previous table? Can we find another way to
represent our data? The idea is to avoid the repeated values that we can see in the column of
ages and also in the column of pays. We can group our data in the following way:
             Age
Pay     16   17   18   19
5        1    0    0    0
6        3    0    0    0
8        1    0    0    0
9        5    2    0    0
10       3    1    0    0
12       0    3    0    0
15       1    2    0    1
20       0    0    2    2
25       0    1    1    0
30       0    0    1    0
This table gives us a more global view of the distribution of the frequencies, and the
more different values we have, the more useful the table is. We call it a table on two variables when
we are representing two quantitative variables, and a contingency table when we have two qualitative
variables. But from this table, can we obtain the total number of people whose pay is 12 euros?
And the total number of people whose age is 17? Obviously, the answer is yes. Notice that you can
add all the frequencies appearing in the row for value 12 of the pay, and so get the
number of people whose pay is 12. In the same way, we can add all the frequencies in the column
for value 17 of the age, and we will have the total number of people who are 17. We add these
totals to our table and we have
             Age
Pay     16   17   18   19   Tot
5        1    0    0    0     1
6        3    0    0    0     3
8        1    0    0    0     1
9        5    2    0    0     7
10       3    1    0    0     4
12       0    3    0    0     3
15       1    2    0    1     4
20       0    0    2    2     4
25       0    1    1    0     2
30       0    0    1    0     1
Tot     14    9    4    3    30
In fact, what you have just obtained are the distributions of each of the two variables independently
of the other. These are called the marginal distributions of the variables. To obtain the whole
marginal distribution of the variable age we take the first and the last rows:
Age         16   17   18   19
frequency   14    9    4    3
We can also do this for the variable pay, taking the first and the last columns.
Exercise 3.4.1 Can you build the analogous table for the variable pay?
In general, a table on two variables is defined as follows:
X \ Y    y1     y2     ...    yp     ...    ym     Tot
x1       n11    n12    ...    n1p    ...    n1m    n1∗
x2       n21    n22    ...    n2p    ...    n2m    n2∗
...      ...    ...    ...    ...    ...    ...    ...
xs       ns1    ns2    ...    nsp    ...    nsm    ns∗
...      ...    ...    ...    ...    ...    ...    ...
xk       nk1    nk2    ...    nkp    ...    nkm    nk∗
Tot      n∗1    n∗2    ...    n∗p    ...    n∗m    n
where the values or characteristics of X are x1 , x2 , . . . , xk and the ones of Y are y1 , y2 , . . . , ym ; nij
is the number of data points presenting characteristic xi for the variable X and yj for the variable
Y . Moreover, ni∗ denotes the number of data points presenting the characteristic xi and n∗j the
number of data points presenting the characteristic yj . n is the total number of elements of the
population or the sample.
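Once we know the marginal distributions, we can calculate the mean and the standard deviation
of each of them as if they were one-dimensional variables. Their expressions are:
\[
\bar{x} = \frac{\sum_{i=1}^{k} x_i \, n_{i*}}{n}, \qquad
S_x = \sqrt{\frac{\sum_{i=1}^{k} (x_i - \bar{x})^2 \, n_{i*}}{n}}
\]
\[
\bar{y} = \frac{\sum_{j=1}^{m} y_j \, n_{*j}}{n}, \qquad
S_y = \sqrt{\frac{\sum_{j=1}^{m} (y_j - \bar{y})^2 \, n_{*j}}{n}}
\]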
Exercise 3.4.2 Which are the mean and the standard deviation of the pay and the age?
One of your classmates has a question. He is 17 and he wants to know whether his pay is among the
higher or the lower ones, so that he can ask for a raise if it is too low. To find out, he wants to
compare himself with all the other students of his age, so he takes out the data of the students
of his age:
Pay         5    6    8    9   10   12   15   20   25   30
Age = 17    0    0    0    2    1    3    2    0    1    0
As this boy has a pay of 10 euros, he sees that most of his classmates have a higher pay than
him, so he is going to ask for a raise.
What we have just calculated is the conditional distribution of the variable pay for a fixed value
of the age, in this case 17. We have again a one-dimensional variable, for which we can calculate the
measures of central tendency and of variability that we already know.
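Marginal and conditional distributions are easy to obtain by program from the simple table of section 3.3. The following Python sketch is one possible way to do it; the names `simple_table` and `pay_given_age` are introduced here for illustration.

```python
from collections import Counter

# (age, pay) -> number of students, copied from the simple table of section 3.3.
simple_table = {
    (16, 5): 1, (16, 6): 3, (16, 8): 1, (16, 9): 5, (16, 10): 3, (16, 15): 1,
    (17, 9): 2, (17, 10): 1, (17, 12): 3, (17, 15): 2, (17, 25): 1,
    (18, 20): 2, (18, 25): 1, (18, 30): 1,
    (19, 15): 1, (19, 20): 2,
}

# Marginal distributions: add the frequencies over the other variable.
age_marginal = Counter()
pay_marginal = Counter()
for (age, pay), count in simple_table.items():
    age_marginal[age] += count
    pay_marginal[pay] += count

def pay_given_age(age):
    """Conditional distribution of the pay for a fixed value of the age."""
    return {pay: simple_table.get((age, pay), 0) for pay in sorted(pay_marginal)}

pay_at_17 = pay_given_age(17)
higher_than_10 = sum(count for pay, count in pay_at_17.items() if pay > 10)

print(dict(sorted(age_marginal.items())))
print(dict(sorted(pay_marginal.items())))
print(pay_at_17)
print(f"{higher_than_10} of {age_marginal[17]} students aged 17 are paid more than 10 euros")
```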
Exercise 3.4.3 Calculate the frequency table for the variable age for pay=15 euros.
Exercise 3.4.4 Calculate the frequency table, with the marginal frequencies, for the weight and the
answer to the question 3.1
3.5 Scatterplots
As usually happens with one-dimensional variables, data are more easily analyzed if we represent
them in a graph. The situation now is different, though, because we need to represent two variables
together with their frequencies. To do that we use a graph called a scatterplot. We are going to explain
now how to draw it: we represent the variable pay on the OX axis and the variable
age on the OY axis. For each pair of values we either draw a point whose size is proportional to its
frequency, or we draw as many points as the frequency indicates.
Figure 3.1: scatterplot
The shape of the cloud of points in the scatterplot can give us an idea of the possible dependence
between the variables, as we will see in what follows.
Exercise 3.5.1 Draw the scatterplot of the variables weight and the answer to the question 3.1
3.6 Functional dependence and statistical dependence
Suppose that you are studying the following pairs of variables:
• The height and the size of the foot of a person.
• The weekly pay and the height.
• The number of members of a family and the number of rooms of their house.
• The height from which we drop something and the time until it reaches the floor.
• The weight and the number of brothers/sisters.
For each of these situations, we would like to know whether there is any relationship between the variables
that we study, that is, whether the value of one of them has an influence on the other. The fourth case is, for instance, very
clear. We have learnt in physics that there is a functional relationship between those variables, an
equation relating both. In other cases we can think that there is no relation, as in the second and fifth cases,
but in the first and third cases there is a possible relation that we cannot be sure of.
Scatterplots can have very different shapes and can help us to see how the variables
behave. We will use them as a first approach, though later we will use more rigorous methods to decide
whether two variables are related.
As we have just seen, there are several levels in the relationship between variables. We say that
there is a functional dependence if we are in a situation similar to the fourth case that we have just
presented, that is, Y depends functionally on X when we can assign to each value xi a unique value
yj in such a way that yj = f(xi). This means that the value of one variable determines exactly the
value of the other one. The functional dependence is linear when all the pairs lie on a line; it is
curvilinear when they lie on a curve defined by the function y = f(x).
Two variables X and Y are said to be independent if the value of one of them has no influence
over the other one. This means that the relative conditional distributions coincide.
In the rest of the situations we talk about statistical dependence or relation. This dependence
can be stronger or weaker depending on the situation. We can get an idea of how strong (or weak)
it is from the scatterplot, taking into account that it will be stronger the closer the data are to the
graph of a function.
Scatterplots in which we can see linear or curvilinear dependence are:
Figure 3.2: linear dependence
Figure 3.3: curvilinear dependence
Exercise 3.6.1 Can you see any conclusion about the possible dependence between the weight and
the answer to the question 3.1 from the scatterplot you drew in the previous section?
3.7 Covariance
Recall the scatterplot of the two variables we are studying. It is not easy to conclude which kind
of relationship there is between them. But, for instance, do you think that the pay grows when the
age grows? Do you think it happens the other way round? We are now looking for a number
that lets us decide whether the relationship is direct or inverse.
For that we will use the covariance, which is defined as follows:
\[
S_{xy} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{m} (x_i - \bar{x})(y_j - \bar{y})\, n_{ij}}{n}
= \frac{\sum_{i=1}^{k}\sum_{j=1}^{m} x_i \, y_j \, n_{ij}}{n} - \bar{x}\,\bar{y}
\]
The covariance is also known as the joint variance of the two variables. If the covariance is
positive, the relationship is direct, and if the covariance is negative, the relationship is inverse. As
we know that the average age is 16.867 and the average pay is 13, we obtain that $S_{xy} = 4.533$, and
so the relationship is direct.
You can notice that, in the expression of the covariance, the sign of each term depends on the signs of the differences $(x_i - \bar{x})$
and $(y_j - \bar{y})$. Let us see what happens with the covariance in certain situations. We represent 3
scatterplots, in which we mark the point $(\bar{x}, \bar{y})$, which is the center of gravity of the distribution (see
figure 3.4).
Figure 3.4: covariance
We can see that in graph number 2 we have a big covariance, because the differences $(x_i - \bar{x})$ and
$(y_j - \bar{y})$ always have the same sign (the points always lie in the first and third quadrants defined
by the axes centered on $(\bar{x}, \bar{y})$). As the products of these differences are positive, they all contribute in a positive way
to the sum.
In the other 2 cases there is no linear relationship, so we have positive and negative
terms in the sum, because there are data points in the four quadrants; some terms cancel out others and
the result can be close to 0.
You can notice that the covariance is a measure that depends on the units of measurement, as
happened with the variance and the standard deviation, so we shall look for another, dimensionless measure
that allows us to compare distributions.
3.8 Linear correlation
We are now looking for a measure that tells us the degree of relationship existing between two
variables (in a direct or inverse way). We also want to use it to measure the linear relationship
between them.
We start from the covariance that we have just presented, which depends on the product of the
units of measurement of the two variables, because $(x_i - \bar{x})$ depends on the units of
X and $(y_j - \bar{y})$ depends on the units of Y, while $n_{ij}$ and n are dimensionless. We
should divide $S_{xy}$ by a quantity chosen so that those two units cancel. If you
remember, the variance depends on the square of the units of the variable, so we
cannot use it, but the standard deviation depends on the units of the variable themselves. This
means that the product $S_x S_y$ depends on the product of the units of X and Y, and
this is what we were looking for. So, we define the linear correlation coefficient as follows:
\[
r = \frac{S_{xy}}{S_x S_y}
\]
Let us calculate it in our example. We know that $S_{xy} = 4.533$, $S_x = 1.008$ (age) and $S_y = 6.368$ (pay),
so r = 0.706, but what does this mean?
The value of r is always between −1 and 1. If the value of r is near −1 or 1, then the linear
dependence between the variables is strong, being direct if it is near 1 and inverse if it is near −1.
If the value of r is near 0, the linear dependence is weak, if it exists at all. If the value of r is exactly
1 or −1, the dependence is linear and all the points lie on a line.
So in our example we confirm that the relationship is direct, and we see that it is quite strong.
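The covariance and the correlation coefficient of the example can be recomputed from the simple table with the sketch below. One observation that comes from the numbers and is not stated in the text: the quoted values 1.008 and 6.368 are reproduced when the standard deviations divide by n − 1, while the covariance 4.533 divides by n. Using the same denominator n everywhere gives a slightly larger coefficient, and the conclusion does not change.

```python
import math

# (age, pay) -> number of students, as in the simple table of section 3.3.
simple_table = {
    (16, 5): 1, (16, 6): 3, (16, 8): 1, (16, 9): 5, (16, 10): 3, (16, 15): 1,
    (17, 9): 2, (17, 10): 1, (17, 12): 3, (17, 15): 2, (17, 25): 1,
    (18, 20): 2, (18, 25): 1, (18, 30): 1,
    (19, 15): 1, (19, 20): 2,
}

n = sum(simple_table.values())
mean_age = sum(age * c for (age, _), c in simple_table.items()) / n
mean_pay = sum(pay * c for (_, pay), c in simple_table.items()) / n

# Covariance and variances, all dividing by n.
cov = sum((age - mean_age) * (pay - mean_pay) * c
          for (age, pay), c in simple_table.items()) / n
var_age = sum((age - mean_age) ** 2 * c for (age, _), c in simple_table.items()) / n
var_pay = sum((pay - mean_pay) ** 2 * c for (_, pay), c in simple_table.items()) / n

# Correlation with a consistent denominator (n everywhere).
r = cov / math.sqrt(var_age * var_pay)

# Standard deviations dividing by n - 1, which match the values quoted in the text.
sd_age_q = math.sqrt(var_age * n / (n - 1))
sd_pay_q = math.sqrt(var_pay * n / (n - 1))
r_mixed = cov / (sd_age_q * sd_pay_q)

print(f"S_xy = {cov:.3f}")
print(f"r (n everywhere) = {r:.3f}")
print(f"r (values quoted in the text) = {r_mixed:.3f}")
```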
Exercise 3.8.1 Calculate the linear correlation coefficient of the variables weight and answer to
the question 3.1. What can we say about the relationship between them?
3.9 Regression lines
Let us suppose that you know that a boy from the high school has a pay of 18 euros, but you
don’t know his age. We could think about predicting the value that the variable age should have
for this boy. How could we do this? Throughout this chapter we have been discussing the possible
relationship between the variables, so this is the moment to use it. If we were
able to write the equation that relates the age and the pay, we would only have to substitute and
we would have the value that we want.
But, unfortunately, this is not so simple. As we know that the linear correlation between the
two variables is quite big, we can try to find the line that best fits the points, and then
substitute the value of the pay in order to get the value of the age. This line is called the regression
line. Let us define it, and later we will calculate the one for our example.
Let X, Y be two variables. We define the regression line as the line that minimizes the
sum of the squares of the distances between the data points and the estimated points.
For the regression line of Y over X, which has the form y = ax + b, we minimize the
sum of the squares of the distances between the values $y_j$ and the values predicted for them, $a x_i + b$.
The equation of this line is:
\[
Y - \bar{y} = \frac{S_{xy}}{S_x^2} (X - \bar{x})
\]
We will use this line when we want to estimate the value of Y from the value of X.
In the case of the regression line of X over Y, which has the form x = cy + d, we minimize the
sum of the squares of the distances between the values $x_i$ and the values predicted for them, $c y_j + d$.
The equation of this line is:
\[
X - \bar{x} = \frac{S_{xy}}{S_y^2} (Y - \bar{y})
\]
We will use this line when we want to predict the value of X from the value of Y.
Let us calculate the regression line for our example. Our variables are the pay (X) and the age
(Y), and we want to predict the age from the pay, so we have to calculate the line of Y over X. We have that:
\[
\bar{x} = 13, \qquad \bar{y} = 16.867, \qquad S_{xy} = 4.533, \qquad S_x = 6.368, \qquad S_x^2 = 40.551
\]
so the line we are looking for is
\[
Y - 16.867 = \frac{4.533}{40.551} (X - 13)
\]
or equivalently
\[
Y - 16.867 = 0.1118 (X - 13) \;\Rightarrow\; Y = 0.1118 X + 15.413
\]
so, if the pay of this boy is x = 18 euros, his age should be
\[
Y = 0.1118 \cdot 18 + 15.413 = 17.43
\]
i.e., this boy should be about 17 years old.
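The regression line of the example can be computed from the simple table as follows. This is a sketch under one assumption that is inferred from the numbers rather than stated in the text: the value 40.551 corresponds to the variance of the pay with n − 1 in the denominator, while the covariance divides by n. The code shows the line for both choices; the helper name `line_of_age_on_pay` is introduced here for illustration.

```python
# (age, pay) -> number of students, as in the simple table of section 3.3.
simple_table = {
    (16, 5): 1, (16, 6): 3, (16, 8): 1, (16, 9): 5, (16, 10): 3, (16, 15): 1,
    (17, 9): 2, (17, 10): 1, (17, 12): 3, (17, 15): 2, (17, 25): 1,
    (18, 20): 2, (18, 25): 1, (18, 30): 1,
    (19, 15): 1, (19, 20): 2,
}

n = sum(simple_table.values())
mean_age = sum(age * c for (age, _), c in simple_table.items()) / n
mean_pay = sum(pay * c for (_, pay), c in simple_table.items()) / n
cov = sum(age * pay * c for (age, pay), c in simple_table.items()) / n - mean_age * mean_pay
ss_pay = sum((pay - mean_pay) ** 2 * c for (_, pay), c in simple_table.items())

def line_of_age_on_pay(var_pay):
    """Slope a and intercept b of age = a * pay + b for a given variance of the pay."""
    a = cov / var_pay
    b = mean_age - a * mean_pay
    return a, b

# Variance of the pay dividing by n - 1 (the 40.551 used in the text).
a_text, b_text = line_of_age_on_pay(ss_pay / (n - 1))
# Variance of the pay dividing by n, the same denominator as the covariance.
a_n, b_n = line_of_age_on_pay(ss_pay / n)

for label, (a, b) in {"n - 1": (a_text, b_text), "n": (a_n, b_n)}.items():
    print(f"{label}: age = {a:.4f} * pay + {b:.3f}, "
          f"predicted age for 18 euros = {a * 18 + b:.2f}")
```

With either choice the predicted age for a pay of 18 euros rounds to 17 years.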
We have to make some remarks about the regression line. The first one is that the two
regression lines (X over Y and Y over X) intersect at the point $(\bar{x}, \bar{y})$, except when the linear
correlation is 1 or −1, in which case the two lines coincide.
If we want to make predictions using the regression line, we should be in one
of the following situations:
• We can conclude from the scatterplot that there is a possible linear relationship between the
variables.
• The linear correlation coefficient is near 1 or −1.
• Common sense tells us that there is a possible relationship between the variables.
An alternative way of expressing the regression lines is the following:
• For the regression line of Y over X, which has the form y = ax + b,
\[
a = \frac{S_{xy}}{S_x^2}, \qquad b = \bar{y} - \frac{S_{xy}}{S_x^2}\, \bar{x}
\]
• For the regression line of X over Y, which has the form x = cy + d,
\[
c = \frac{S_{xy}}{S_y^2}, \qquad d = \bar{x} - \frac{S_{xy}}{S_y^2}\, \bar{y}
\]
Exercise 3.9.1 Calculate the regression lines for the variables weight and answer to question 3.1.
If a student weighs 67 kg, can you predict which one can be the answer to question 3.1?