STATISTICS WITH MATLAB FOR ENGINEERS:
DESCRIPTIVE STATISTICS
Paul Razafimandimby
Montanuniversität Leoben
October 28, 2015
Contents
• Introduction
• Organizing and visualization of the data
• Visualization of correlation
• Measures of Central Tendency/Location
• Measures of variation or dispersion
• Appendix: Calculation of some parameters for grouped data.
Introduction
Roughly speaking, Statistics is the science of gaining knowledge from numerical and categorical data. It deals with the collection, analysis and interpretation of data, and with drawing conclusions from the collected data. A population is the collection or set of all individuals under consideration in a statistical study. A sample is a part, or subset, of the population from which information is collected.
One can distinguish two branches of Statistics.
1. Descriptive Statistics is the methodology of organizing and summarizing information. This branch of statistics deals with the construction of
the distribution of the sample/population (calculation of frequency), the
visualization of data (graphs, charts, histograms), and the calculation of
various descriptive measures (averages, standard deviation, percentiles).
2. Inferential Statistics is the science of drawing conclusions about a population from information collected on a sample of that population, and of measuring the reliability of these conclusions. Inferential statistics deals with point estimation, interval estimation and hypothesis testing, which rely very much on probability theory.
Descriptive and inferential statistics are interrelated in that, before inferring conclusions from a statistical investigation, it is necessary to organize and summarize the information collected from a sample. Moreover, the knowledge gained from descriptive statistics usually suggests the appropriate method or approach to be used for inferential statistics.
In a statistical study, whether descriptive or inferential, the properties of a population are usually described by numerical parameters. In many cases these parameters are unknown, and a statistical study is very often oriented to the investigation/estimation of these parameters. For this purpose, one usually uses statistical samples to make inferences about these unknown parameters. A numerical value calculated from, and characterizing, a statistical sample is called a statistic; statistics are used to make inferences about the unknown parameters of the whole population. Statistics finds applications in numerous applied sciences, among others economics, political science and medicine. Of course, Statistics plays an important role in many branches of the Engineering sciences. For instance, assuming that a factory producing light bulbs keeps using the same equipment, raw materials and methods of production, then using statistics we can infer the quality of the light bulbs produced in the future.
Usually a statistical study has the following steps:
1. Describe the research problem. For instance, we want to know the average age of MUL students.
2. Define the population and the sample on which we will conduct the study. In very simple terms, a population is the collection or set of all individuals under consideration in a statistical study. In our example, the population is the set of all MUL students (from 1st-year to PhD students). A sample is a part or subset of the population from which information is collected. The sample could be a set of 100 students randomly interviewed by 10 volunteers at 5 building entrances of the university tomorrow from 7:00 to 9:00 am.
3. Collect the data. We send 10 volunteers to interview 100 students at 5 building entrances of the university tomorrow from 7:00 to 9:00 am.
4. Conduct a descriptive data analysis. After collecting the data we need to organize it. For instance,
• We could form a table containing the (relative) frequency and cumulative (relative) frequency of each class of the sample.
• We could plot the data to visualize some of its properties.
• We could study the central tendency of the population/sample by calculating measures of location such as the mean, median, mode, ...
• We could also study the dispersion of the population/sample through the calculation of the range, variance, standard deviation, coefficient of skewness, kurtosis, interquartile range, ... All of these terms will be or have been defined appropriately.
Organizing and visualization of the data
As defined above, this branch of statistics deals with the organization and the summary of information from the collected data. But before we organize our data we need to specify our variate or (random) variable.
Variate/Variable: a characteristic that varies from one individual of the population to another. In our example, our variable is the age of each MUL student.
One can distinguish three types of variables or data.
1. Qualitative data/variable: This type of variable is also known as a categorical or nominal data/variable, and it can only be described by a word, letter or phrase. For example, the sex, marital status or blood type of the MUL students.
2. Quantitative or numerical data/variable: a variable that can be quantified or numerically described. For instance, the height, weight and age of MUL students.
3. Ordinal data/variable: a variable that cannot be numerically described, i.e., is not a quantitative variable, but can be ordered. For instance, quality of moral behavior (bad manners, good manners), performance of a football team (winner, runner-up, semi-finalist, ...).
After properly defining the variable, one can organize the observed values into classes $C_i$, $i = 1, \dots, n$, and form a table containing the count $N_i$ of individuals belonging to each class (the class frequency). One can also insert in the table the relative frequency of each class. The relative frequency of a class $C_i$ is defined by
\[ RF(C_i) = \frac{N_i}{\sum_{j=1}^{n} N_j}. \]
The relative frequencies of all classes sum to 1, or 100%. The cumulative (relative) frequency of a class $C_i$ is the sum of the relative frequencies of all classes up to and including $C_i$:
\[ CF(C_i) = \sum_{j=1}^{i} RF(C_j). \]
Note that the cumulative frequency makes sense only for quantitative and ordinal variables.
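The two definitions above can be sketched in a few lines of Matlab. This is only an illustration: the vector counts holds made-up class counts, and the names counts, RF and CF are not taken from the notes.

```matlab
% Hypothetical class counts N_i for five classes (illustrative values only)
counts = [5 4 3 7 2];

% Relative frequency of each class: N_i divided by the total count
RF = counts / sum(counts);

% Cumulative relative frequency: running sum of the relative frequencies
CF = cumsum(RF);

% One row per class: class index, count, RF, CF
disp([(1:numel(counts))' counts' RF' CF']);
```

The last entry of CF is always 1 (up to rounding), which is a quick sanity check on a hand-made frequency table.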
Example: To simulate our statistical study of the students' ages, we generate 100 random integers between 17 and 40 (I did it for you in this note, but you should learn how to do it). First, we load the data into Matlab
Age=load('Age.txt');
and create a frequency table from it
tabulate(Age);
  Value    Count    Percent
      1        0      0.00%
      2        0      0.00%
      3        0      0.00%
      4        0      0.00%
      5        0      0.00%
      6        0      0.00%
      7        0      0.00%
      8        0      0.00%
      9        0      0.00%
     10        0      0.00%
     11        0      0.00%
     12        0      0.00%
     13        0      0.00%
     14        0      0.00%
     15        0      0.00%
     16        0      0.00%
     17        5      5.00%
     18        4      4.00%
     19        3      3.00%
     20        7      7.00%
     21        2      2.00%
     22        5      5.00%
     23        9      9.00%
     24        5      5.00%
     25        7      7.00%
     26        3      3.00%
     27        3      3.00%
     28        2      2.00%
     29        1      1.00%
     30        1      1.00%
     31        3      3.00%
     32        4      4.00%
     33        6      6.00%
     34        4      4.00%
     35        4      4.00%
     36        5      5.00%
     37        2      2.00%
     38        6      6.00%
     39        5      5.00%
     40        4      4.00%
The table printed by tabulate also lists the values 1-16, which do not occur in our data and which we do not want. To get the right table we have to remove these rows. For this purpose, let us store the table in a 40 × 3 matrix called T
T=tabulate(Age);
and remove the rows T(i,:) for i = 1, ..., 16.
T(1:16,:)=[];
Now we recreate the frequency table
Freq_Table=table(T(:,1),T(:,2),T(:,3),'VariableNames',{'Age','Count','Percent'})
Freq_Table =
    Age    Count    Percent
    ___    _____    _______
    17     5        5
    18     4        4
    19     3        3
    20     7        7
    21     2        2
    22     5        5
    23     9        9
    24     5        5
    25     7        7
    26     3        3
    27     3        3
    28     2        2
    29     1        1
    30     1        1
    31     3        3
    32     4        4
    33     6        6
    34     4        4
    35     4        4
    36     5        5
    37     2        2
    38     6        6
    39     5        5
    40     4        4
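The file Age.txt loaded at the start of this example is not reproduced in these notes. A minimal sketch of how a comparable sample could be generated and saved is given below, assuming the ages are drawn uniformly from the integers 17, ..., 40. The resulting counts will differ from the table above, and from run to run unless the random generator is seeded first (for instance with rng).

```matlab
% Draw 100 integers uniformly from 17, ..., 40 (assumed sampling scheme)
Age = randi([17 40], 100, 1);

% Write the sample to a plain-text file, one value per line
save('Age.txt', 'Age', '-ascii');

% Reading the file back gives the same column vector
Age2 = load('Age.txt');
disp(isequal(Age, Age2));
```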
We can also export our table into a txt, xls, ... file.
writetable(Freq_Table,'Freq_Table_Age.txt','Delimiter',' ');
To visualize our data we can plot the frequencies versus the classes. For a qualitative or ordinal variable we usually use a pie chart or a bar graph. Note that in a bar graph the bars do not touch each other. A bar graph is also used to visualize discrete quantitative data, i.e., data for which each class is described by a single number. For the visualization of continuous quantitative data, i.e., data for which each class is an interval, we usually draw a histogram. The bars of a histogram do touch each other.
The usual method to form the frequency table of continuous quantitative data is as follows.
1. Find the min and max values of the observed data.
2. Form disjoint intervals of the same length covering the range between the min and the max values. In general, 5 to 15 intervals are satisfactory.
3. Count the number of individuals falling in each interval. This is the frequency distribution.
4. Form the relative frequency of each class.
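The four steps above can be sketched as follows. The data vector x and the number of intervals k are made up for illustration. The function histcounts treats every interval as closed on the left and open on the right, except the last one, which also contains its right end point, so the maximum is counted.

```matlab
% Hypothetical continuous measurements (illustrative values only)
x = [1.2 3.4 2.2 5.0 4.1 2.9 3.3 1.8 4.7 3.9];

k = 5;                                     % number of intervals (assumed)
edges = linspace(min(x), max(x), k + 1);   % steps 1-2: equal-length intervals from min to max
counts = histcounts(x, edges);             % step 3: frequency of each interval
relfreq = counts / numel(x);               % step 4: relative frequency of each class

% One row per class: left edge, right edge, count, relative frequency
disp([edges(1:end-1)' edges(2:end)' counts' relfreq']);
```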
For our example, a lazy way of visualizing the frequencies of our data is just to bar plot the second column of the matrix T
bar(T(:,2))
But in this case the x-axis shows the row indices 1-24 rather than the ages 17-40 of our variable classes. To remedy this we can specify the locations of the bars along the x-axis as follows
bar(17:1:40,T(:,2))
Which is equivalent to
bar(T(:,1),T(:,2));
We can also draw a histogram of our data. For instance, we cover the range between the min and max values of our observations by disjoint intervals of the same length, say, [17,22), [22,27), ..., [37,42]. Here is how we do it in Matlab
histage=histogram(Age,[17:5:45]);
We could also draw a pie chart
pie(Age);
Well! This looks awful, because each of the 100 observations gets its own slice. Let us just draw the pie chart of the first 5 students and label them
pie(Age(1:5), {'Stud1', 'Stud2', 'Stud3', 'Stud4', 'Stud5'});
Visualization of correlation
Graphs are also very useful to give an intuition of the correlation between variables. For example, we want to know whether smoking is one of the factors of cancer and which type of cancer is mostly caused by smoking. For this, let us download a data set from
http://lib.stat.cmu.edu/DASL/Stories/cigcancer.html
I named the data file smoke_cancer.txt and loaded it into Matlab by using the dataset command.
smokeds=dataset('File', 'smoke_cancer.txt');
We can now visualize the correlation between smoking and, let us say, bladder cancer and lung cancer
subplot(2,1,1)
scatter(smokeds.CIG,smokeds.BLAD)
title('CIG vs BLAD')
subplot(2,1,2)
scatter(smokeds.CIG,smokeds.LUNG)
title('CIG vs LUNG');
It seems that CIG and LUNG have a positive linear correlation. Let us see whether we can draw something from a bar graph of the two cancer variables
bar(smokeds.LUNG, 'c')
hold on
bar(smokeds.BLAD, 'r')
hold off
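The visual impression of a linear relationship given by a scatter plot can be backed up by a number, the sample correlation coefficient. The sketch below uses the built-in function corrcoef on made-up vectors x and y, not on the cigarette data, since that file is not reproduced here.

```matlab
% Hypothetical paired observations (illustrative values, not the cigarette data)
x = [18 21 25 28 30 34 40];
y = [12 15 16 20 21 24 27];

R = corrcoef(x, y);   % 2-by-2 matrix of correlation coefficients
r = R(1, 2);          % correlation between x and y, a number between -1 and 1
fprintf('r = %.3f\n', r);
```

For the data set used above, the analogous call would be corrcoef(smokeds.CIG, smokeds.LUNG), assuming both columns are numeric and contain no missing values. A value close to 1 supports a positive linear relationship.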
Measures of Central Tendency/Location
A measure of location is a typical or central value which describes well the location of the data. We mainly have three measures of location.
Mean Let $X_i$, $i = 1, \dots, N$, be our observed values; then the mean is defined by
\[ \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i. \]
Note that when the data is grouped in classes $C_i$, $i = 1, \dots, n$, the mean is defined by
\[ \bar{X} = \frac{1}{N} \sum_{i=1}^{n} f_i X_i, \]
where $X_i$ is the midpoint of the class $C_i$ (of course $X_i = C_i$ if the variable is discrete), $f_i$ is the count of the class $C_i$ (or $X_i$), and $N = \sum_{i=1}^{n} f_i$ is the total number of observations.
Median or Middle is the middle value, which divides the sorted observations into two equal parts. If the data is ungrouped and $X_1 \le X_2 \le \dots \le X_N$ are the observations sorted in increasing order, then the median is defined by
\[ \mathrm{Med} = X_{(N+1)/2} \]
if $N$ is odd, and
\[ \mathrm{Med} = \frac{X_{N/2} + X_{N/2+1}}{2} \]
if $N$ is even.
Example: This is the list of ages of 7 MUL students
age7=[23,24,16,19,30,28,33];
age7s=sort(age7);
Medage7=age7s((length(age7)+1)/2);
Example again! Now let us look at ungrouped data with an even number of observations. For this, take 8 MUL students
age8=[23,24,16,19,30,28,33,40];
age8s=sort(age8);
Medage8=(age8s((length(age8))/2)+age8s(length(age8)/2+1))/2;
Warning The above formula/procedure for the median does not work well for grouped data (especially when the observed values are grouped into intervals). For grouped data, the formula/procedure for finding the median is more complicated and gives only an estimate of the median; we will give the method in the appendix. Nevertheless, it is relatively simple to find the median class, which is the first class whose cumulative frequency is at least N/2. However, we can apply the above procedure in our example of 100 MUL students, because there each class consists of a single value.
Class mode is the most frequently occurring class, i.e., the class which has the highest count. In our example, the mode or modal class is the age with the highest frequency (which is 9), i.e., 23. For data grouped into intervals we only have a more complicated formula/procedure, which will be given in the appendix. Fortunately, with Matlab we do not need to worry about these formulas; the software will do it for us (but you should read books and understand the procedure).
Example: The average and median ages in our example are computed as follows.
Mean_age2=mean(Age);
Mean_age2 is equivalent to the second definition of the mean, i.e.,
\[ \text{Mean\_age2} = \frac{1}{100} \sum_{i=17}^{40} f_i \, i, \]
which rounds to 28.
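As a check, the grouped formula can be evaluated directly from the frequency table of this example. In the sketch below the counts are copied from that table; the variable names ages, f and mean_grouped are only illustrative.

```matlab
ages = 17:40;   % class values from the frequency table
f = [5 4 3 7 2 5 9 5 7 3 3 2 1 1 3 4 6 4 4 5 2 6 5 4];   % counts from the table

N = sum(f);                           % total number of observations
mean_grouped = sum(f .* ages) / N;    % (1/N) * sum of f_i * X_i

fprintf('N = %d, grouped mean = %.2f\n', N, mean_grouped);   % prints N = 100, grouped mean = 28.14
```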
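Let us calculate the median.
Med_age=median(Age);
which returns 26.5 for the frequency table above, since the 50th and 51st sorted ages are 26 and 27. Note that median(T(:,1)) would instead return 28.5, the median of the class values 17, ..., 40, which ignores the frequencies.
Now let us calculate the mode
Mode=mode(Age);
which gives us 23. This is also the modal class, as we grouped our data in a discrete way.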
Measures of variation or dispersion
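The measures of dispersion given in the first lecture note are valid for ungrouped data, but their meaning is the same for grouped data. For grouped data we give them below. With the notation of the grouped mean ($n$ classes with values or midpoints $X_i$ and counts $f_i$, and $N = \sum_{i=1}^{n} f_i$ observations), the variance and the standard deviation of the sample are respectively defined by
\[ S^2 = \frac{1}{N-1} \sum_{i=1}^{n} f_i (X_i - \bar{X})^2, \qquad S = \sqrt{S^2}. \]
Sometimes we use the shortcut formula
\[ S^2 = \frac{1}{N-1} \left[ \sum_{i=1}^{n} f_i X_i^2 - \frac{1}{N} \left( \sum_{i=1}^{n} f_i X_i \right)^2 \right], \qquad S = \sqrt{S^2}. \]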
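Both variance formulas can be compared numerically on the frequency table of the age example. The sketch below assumes that N denotes the total number of observations and that the sum runs over the classes; the counts are copied from the table, and the variable names are illustrative.

```matlab
ages = 17:40;   % class values from the frequency table
f = [5 4 3 7 2 5 9 5 7 3 3 2 1 1 3 4 6 4 4 5 2 6 5 4];   % counts from the table

N = sum(f);                    % total number of observations
Xbar = sum(f .* ages) / N;     % grouped mean

% Definition: weighted sum of squared deviations divided by N - 1
S2_def = sum(f .* (ages - Xbar).^2) / (N - 1);

% Shortcut formula: no deviations from the mean are needed
S2_short = (sum(f .* ages.^2) - sum(f .* ages)^2 / N) / (N - 1);

S = sqrt(S2_def);              % standard deviation
fprintf('S^2 = %.4f (definition), %.4f (shortcut), S = %.4f\n', S2_def, S2_short, S);
```

The two values of the variance agree up to rounding errors. Because every class here is a single age, they should also agree with var(Age) computed on the raw data.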
As for ungrouped data, we can also define the r-th moment and the r-th central moment. They are respectively defined by
\[ M_r' = \frac{1}{N} \sum_{i=1}^{n} f_i X_i^r, \qquad M_r = \frac{1}{N} \sum_{i=1}^{n} f_i (X_i - \bar{X})^r. \]
These parameters can now be used to define the coefficients of skewness and kurtosis, whose definitions are exactly the same as for ungrouped data.
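The first lecture note is not reproduced here. Assuming the usual moment-based definitions, which are also the ones implemented by the variables Skewf and Kurtf in the code that follows, the two coefficients read

```latex
\[
  \text{Skewness} = \frac{M_3}{M_2^{3/2}},
  \qquad
  \text{Kurtosis} = \frac{M_4}{M_2^{2}},
\]
```

where $M_2$, $M_3$ and $M_4$ are the second, third and fourth central moments.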
Let us calculate the kurtosis and the skewness of our data.
First we try our formula.
Kurtf=mean((Age-Mean_age2).^4)/(mean((Age-Mean_age2).^2))^2;
Skewf=mean((Age-Mean_age2).^3)/(mean((Age-Mean_age2).^2))^(3/2);
We compare them with values returned by the Matlab functions kurtosis and
skewness
Kurtf-kurtosis(Age)
Skewf-skewness(Age)
Both differences should be zero up to rounding errors.
Interquartile The k-th percentile is the value of the observed variable which
has a cumulative frequency equal to k/100.
The first quartile, the second quartile and the third quartile correspond to the
values with cumulative frequencies 25%, 50% and 75%, respectively.
The interquartile range is the difference between the third quartile and the first quartile. It is the length of the range within which the middle half of the data lie.
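The definition can be tried on the frequency table of the age example. The sketch below assumes one particular convention: the k-th percentile is taken to be the first class value whose cumulative frequency reaches k/100. The counts are copied from the table and the variable names are illustrative.

```matlab
ages = 17:40;   % class values from the frequency table
f = [5 4 3 7 2 5 9 5 7 3 3 2 1 1 3 4 6 4 4 5 2 6 5 4];   % counts from the table

N = sum(f);              % total number of observations
F = cumsum(f);           % cumulative counts
p = [0.25 0.50 0.75];    % cumulative frequencies of the three quartiles

Q = zeros(size(p));
for k = 1:numel(p)
    idx = find(F >= p(k) * N, 1);   % first class whose cumulative count reaches p(k)*N
    Q(k) = ages(idx);
end
IQR = Q(3) - Q(1);       % interquartile range

fprintf('Q1 = %d, Q2 = %d, Q3 = %d, IQR = %d\n', Q(1), Q(2), Q(3), IQR);
% prints Q1 = 22, Q2 = 26, Q3 = 35, IQR = 13
```

Other conventions interpolate between neighbouring values. This is why median(Age), which averages the 50th and 51st sorted values, and toolbox functions such as prctile can return slightly different numbers.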
Appendix: Calculation of some parameters for grouped data.
This appendix explains how to calculate some parameters of grouped data, in particular when the range of the observed values of the variable is covered by disjoint intervals. Many of the calculations in this section require the knowledge of the lower/upper boundaries of the classes, which in turn requires the knowledge of the gap between classes.
The gap between classes is the difference between the lower limit of one class and the upper limit of the class before it. For example, assume that our classes are the intervals $(a_i, b_i)$, $i = 1, \dots, n$. The gap, assumed to be the same for every $i$, is
\[ \text{gap} = a_{i+1} - b_i. \]
Having the gap at hand, we can form the class boundaries. The lower class boundaries are
\[ \tilde{a}_i = a_i - \text{gap}/2 \]
and the upper class boundaries are
\[ \tilde{b}_i = b_i + \text{gap}/2. \]
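As a small worked example, take the first two classes, 1-10 and 11-20, of the exercise at the end of this appendix. The gap is the lower limit of the second class minus the upper limit of the first, and the boundaries follow from it:

```latex
\[
  \text{gap} = 11 - 10 = 1, \qquad
  \tilde{a}_1 = 1 - \tfrac{1}{2} = 0.5, \quad
  \tilde{b}_1 = 10 + \tfrac{1}{2} = 10.5, \qquad
  \tilde{a}_2 = 11 - \tfrac{1}{2} = 10.5, \quad
  \tilde{b}_2 = 20 + \tfrac{1}{2} = 20.5 .
\]
```

The upper boundary of one class coincides with the lower boundary of the next, so the class boundaries cover the range of the data without holes.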
Now we are ready to estimate the median, quartiles and interquartile range of grouped data. Follow the steps below to calculate the median:
1. Form the cumulative frequency table and insert in it the class boundaries. Call N the total frequency, which is also the total number of observations or individuals in the sample.
2. Locate the median class, i.e., find the class which contains the N/2-th individual. Call it $C_m = (a_m, b_m)$, and let $\tilde{C}_m = (\tilde{a}_m, \tilde{b}_m)$ be the interval between its lower and upper class boundaries.
3. Apply the following formula to find the median:
\[ \text{Median} = \tilde{a}_m + \frac{N/2 - F_b}{f_m} \, (\tilde{b}_m - \tilde{a}_m), \]
where
$f_m$ is the frequency of the median class,
$F_b$ is the cumulative frequency before the median class.
A similar argument can be used to compute the first quartile (∼ N/4) and the third quartile (∼ 3N/4). Let α ∈ {1, 3}; then
\[ Q_\alpha = \tilde{a}_{Q_\alpha} + \frac{\alpha N/4 - F_b}{f_{Q_\alpha}} \, (\tilde{b}_{Q_\alpha} - \tilde{a}_{Q_\alpha}), \]
where the subscript $Q_\alpha$ refers to the class containing the $\alpha N/4$-th individual and $F_b$ is now the cumulative frequency before that class.
For the mode we can use the following formula
\[ \text{Mode} = \tilde{a}_{mo} + \frac{f_{mo} - f_b}{2 f_{mo} - (f_a + f_b)} \, (\tilde{b}_{mo} - \tilde{a}_{mo}), \]
where
$f_{mo}$ is the frequency of the modal class,
$f_b$ and $f_a$ are respectively the frequencies of the classes before and after the modal class.
Exercise: Find the median, interquartile and the mode of the following
grouped data.
Time to travel to work    Frequency
1-10                      8
11-20                     14
21-30                     12
31-40                     9
41-50                     7
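A hand calculation for this exercise can be checked with the short sketch below. It rests on two assumptions that should be compared with the formulas of this appendix: the class width is measured between the class boundaries, and the numerator of the mode formula uses the frequency of the class before the modal class. All variable names are illustrative.

```matlab
% Class limits and frequencies from the exercise table
lower = [1 11 21 31 41];
upper = [10 20 30 40 50];
f     = [8 14 12 9 7];

gap = lower(2) - upper(1);   % gap between consecutive classes
lo  = lower - gap/2;         % lower class boundaries
hi  = upper + gap/2;         % upper class boundaries

N = sum(f);                  % total number of observations
F = cumsum(f);               % cumulative frequencies

% Grouped estimates of Q1, median and Q3 by interpolation inside the class
targets = [N/4, N/2, 3*N/4];
q = zeros(1, 3);
for k = 1:3
    m  = find(F >= targets(k), 1);   % class containing the target individual
    Fb = F(m) - f(m);                % cumulative frequency before that class
    q(k) = lo(m) + (targets(k) - Fb) / f(m) * (hi(m) - lo(m));
end
IQR = q(3) - q(1);                   % interquartile range

% Grouped estimate of the mode
[fmo, mo] = max(f);                  % frequency and index of the modal class
fprev = 0; fnext = 0;                % a missing neighbour counts as frequency 0
if mo > 1,        fprev = f(mo - 1); end
if mo < numel(f), fnext = f(mo + 1); end
Mode = lo(mo) + (fmo - fprev) / (2*fmo - fprev - fnext) * (hi(mo) - lo(mo));

fprintf('Q1 = %.4f, median = %.4f, Q3 = %.4f\n', q(1), q(2), q(3));
fprintf('IQR = %.4f, mode = %.4f\n', IQR, Mode);
```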