Download lecture notes - Montanuniversität Leoben

STATISTICS WITH MATLAB FOR ENGINEERS: DESCRIPTIVE STATISTICS

Paul Razafimandimby
Montanuniversität Leoben
October 28, 2015

Contents
• Introduction
• Organizing and visualization of the data
• Visualization of correlation
• Measures of Central Tendency/Location
• Measures of variation or dispersion
• Appendix: Calculation of some parameters for grouped data.

Introduction

Roughly speaking, Statistics is the science of gaining knowledge from numerical and categorical data. It deals with the collection, analysis and interpretation of data, and with drawing conclusions from the collected data.

A population is the collection or set of all individuals under consideration in a statistical study. A sample is a part or subset of the population from which information is collected.

One can distinguish two branches of Statistics.

1. Descriptive Statistics is the methodology of organizing and summarizing information. This branch of statistics deals with the construction of the distribution of the sample/population (calculation of frequencies), the visualization of data (graphs, charts, histograms), and the calculation of various descriptive measures (averages, standard deviation, percentiles).
2. Inferential Statistics is the science of drawing conclusions about a population from information collected from a sample of that population, and of measuring the reliability of those conclusions. Inferential statistics deals with point estimation, interval estimation and hypothesis testing, which rely very much on probability theory.

Descriptive and inferential statistics are interrelated: before inferring conclusions from a statistical investigation, it is necessary to organize and summarize the information collected from a sample. Moreover, the knowledge gained from descriptive statistics usually suggests the appropriate method or approach to be used for the inferential statistics.
In a statistical study, whether descriptive or inferential, the properties of a population are usually described by numerical parameters. In many cases these parameters are unknown, and a statistical study is very often oriented to the investigation/estimation of these parameters. For this purpose, one usually uses statistical samples. A numerical value calculated from a statistical sample and characterizing it is called a statistic; statistics are used to make inferences about the unknown parameters of the whole population.

Statistics finds its applications in numerous applied sciences, among others economics, political science and medicine. Of course, Statistics also plays an important role in many branches of the Engineering sciences. For instance, assuming that a factory producing light bulbs keeps using the same equipment, raw materials and methods of production, we can use statistics to infer the quality of the light bulbs produced in the future.

Usually a statistical study has the following steps:

1. Describe the research problem.
   For instance, we want to know the average age of MUL students.
2. Define the population and the sample on which we will conduct the study.
   In very simple terms, a population is the collection or set of all individuals under consideration in a statistical study. In our example, the population is the set of all MUL students (from 1st year to PhD students). A sample is a part or subset of the population from which information is collected. The sample could be a set of 100 students randomly interviewed by 10 volunteers at 5 building entrances of the university tomorrow from 7:00 to 9:00 am.
3. Collect the data.
   We send 10 volunteers to interview 100 students at 5 building entrances of the university tomorrow from 7:00 to 9:00 am.
4. Conduct a descriptive data analysis.
   After collecting the data we need to organize it.
For instance,
• We could form a table containing the (relative) frequency and cumulative (relative) frequency of each class of the sample.
• We could plot the data to visualize some of its properties.
• We could study the tendency of the population/sample by calculating its measures of location such as mean, median, mode, ...
• We could also study the dispersion of the population/sample through the calculation of range, variance, standard deviation, coefficient of skewness, kurtosis, interquartile range, ...
All of these terms will be or have been defined appropriately.

Organizing and visualization of the data

As defined above, this branch of statistics deals with the organization and the summary of information from the collected data. But before we organize our data we need to specify our variate or (random) variable.

Variate/Variable: a characteristic that varies from one individual of the population to the other. In our example, the variable is the age of each MUL student.

One can distinguish three types of variables or data.

1. Qualitative data/variable: this type of variable is also known as categorical or nominal data/variable, and it can only be described by a word, letter or phrase. For example, the sex, marital status or blood type of the MUL students.
2. Quantitative or numerical data/variable: a variable that can be quantified or numerically described. For instance, the height, weight and age of MUL students.
3. Ordinal data/variable: a variable that cannot be numerically described, i.e., is not a quantitative variable, but whose values can be ordered. For instance, quality of moral behavior (bad manner, good manner), performance of a football team (winner, runner-up, semi-finalist, ...).

After properly defining the variable, one can organize the observed values into classes $C_i$, $i = 1, \dots, n$, and form a table containing the count $N_i$ of individuals belonging to each class (the class frequency). One can also insert in the table the relative frequency of each class.
The relative frequency of a class $C_i$ is defined by

$$RF(C_i) = \frac{N_i}{\sum_{j=1}^{n} N_j}.$$

The relative frequencies of all classes sum to 1, or 100%.

The cumulative (relative) frequency of a class $C_i$ is the sum of the relative frequencies of all classes up to and including the class $C_i$:

$$CF(C_i) = \sum_{j=1}^{i} RF(C_j).$$

Note that the cumulative frequency makes sense only for quantitative and ordinal variables.

Example: To simulate our statistical study on the students' age, we generate 100 random integers between 17 and 40 (I did it for you in this note, but you should learn how to do it). First, we load the data into Matlab

    Age=load('Age.txt');

and create a frequency table from it

    tabulate(Age);

    Value    Count    Percent
        1        0      0.00%
        2        0      0.00%
        3        0      0.00%
        4        0      0.00%
        5        0      0.00%
        6        0      0.00%
        7        0      0.00%
        8        0      0.00%
        9        0      0.00%
       10        0      0.00%
       11        0      0.00%
       12        0      0.00%
       13        0      0.00%
       14        0      0.00%
       15        0      0.00%
       16        0      0.00%
       17        5      5.00%
       18        4      4.00%
       19        3      3.00%
       20        7      7.00%
       21        2      2.00%
       22        5      5.00%
       23        9      9.00%
       24        5      5.00%
       25        7      7.00%
       26        3      3.00%
       27        3      3.00%
       28        2      2.00%
       29        1      1.00%
       30        1      1.00%
       31        3      3.00%
       32        4      4.00%
       33        6      6.00%
       34        4      4.00%
       35        4      4.00%
       36        5      5.00%
       37        2      2.00%
       38        6      6.00%
       39        5      5.00%
       40        4      4.00%

But the table also lists the values 1 to 16, which do not occur in our data and which we do not want. To get the right table we have to remove these rows. For this purpose, let us store the table in a 40 × 3 matrix called T

    T=tabulate(Age);

and remove the rows T(i, :) for i = 1, ..., 16.

    T(1:16,:)=[];

Now we recreate the frequency table

    Freq_Table=table(T(:,1),T(:,2),T(:,3),'VariableNames',{'Age','Count','Percent'})

    Freq_Table =

        Age    Count    Percent
        ___    _____    _______
        17     5        5
        18     4        4
        19     3        3
        20     7        7
        21     2        2
        22     5        5
        23     9        9
        24     5        5
        25     7        7
        26     3        3
        27     3        3
        28     2        2
        29     1        1
        30     1        1
        31     3        3
        32     4        4
        33     6        6
        34     4        4
        35     4        4
        36     5        5
        37     2        2
        38     6        6
        39     5        5
        40     4        4

We can also export our table into a txt, xls, ... file.

    writetable(Freq_Table,'Freq_Table_Age.txt','Delimiter',' ');

To visualize our data we can plot the frequencies versus the classes. For qualitative or ordinal variables we usually use a pie chart or a bar graph.
Note that in a bar graph, the bars do not touch each other. A bar graph is also used to visualize discrete quantitative data, i.e., data where each class is described by a single number. To visualize continuous quantitative data, i.e., data where each class is an interval, we usually draw a histogram. The bars of a histogram do touch each other.

The usual method to form the frequency table of continuous quantitative data is as follows.

1. Find the min and max values of the observed data.
2. Form disjoint intervals of the same length covering the range between the min and the max values. In general 5 to 15 intervals are satisfactory.
3. Count the number of individuals falling in each interval. This is the frequency distribution.
4. Form the relative frequency of each class.

For our example, a lazy way of visualizing the frequencies of our data is just to bar plot the second column of the matrix T

    bar(T(:,2))

But in this case, the bars are placed at the row indices 1, 2, ..., 24 along the x-axis instead of at the ages 17, ..., 40. To remedy this we can specify the location of the bars along the x-axis as follows

    bar(17:1:40,T(:,2))

which is equivalent to

    bar(T(:,1),T(:,2));

We can also draw a histogram for our data. For instance, we cover the range between the min and max values of our observations by disjoint intervals of the same length, say, [17,22), [22,27), ..., [37,42]. Here is how we do it in Matlab

    histage=histogram(Age,[17:5:45]);

We could also draw a pie chart

    pie(Age);

Well! This looks awful, because pie draws one slice per observation. Let us just do the pie chart of the first 5 students and label them

    pie(Age(1:5), {'Stud1', 'Stud2', 'Stud3', 'Stud4', 'Stud5'});

Visualization of correlation

Graphs are also very useful to give an intuition of the correlation between variables. For example, we want to know whether smoking is one of the factors causing cancer, and which type of cancer is most strongly associated with smoking. For this, let us download the data from http://lib.stat.cmu.edu/DASL/Stories/cigcancer.html.
I named the data file smoke_cancer.txt and loaded it into Matlab using the dataset command.

    smokeds=dataset('File', 'smoke_cancer.txt');

We can now visualize the correlation between smoking and, let us say, bladder cancer and lung cancer

    subplot(2,1,1)
    scatter(smokeds.CIG,smokeds.BLAD),
    title('CIG vs BLAD')
    subplot(2,1,2)
    scatter(smokeds.CIG,smokeds.LUNG)
    title('CIG vs LUNG');

It seems that CIG and LUNG have a positive linear correlation. Let us see whether we can learn something from a bar plot of the two variables

    bar(smokeds.LUNG, 'c')
    hold on
    bar(smokeds.BLAD, 'r')
    hold off

Measures of Central Tendency/Location

A measure of location is a typical or central value which describes well the location of the data. We mainly have three measures of location.

Mean
Let $X_i$, $i = 1, \dots, N$, be our observed values. Then the mean is defined by

$$\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i.$$

Note that when the data is grouped in classes $C_i$, $i = 1, \dots, n$, the mean is defined by

$$\bar{X} = \frac{1}{N} \sum_{i=1}^{n} f_i X_i,$$

where $X_i$ is the midpoint of the class $C_i$ (of course $X_i = C_i$ if the variable is discrete), $f_i$ is the count of the class $C_i$ (or $X_i$), and $N = \sum_{i=1}^{n} f_i$ is the total number of observations.

Median or Middle
The median is the middle value, which divides the observations into two equal parts. If the data is ungrouped and the $n$ observations are sorted in increasing order, $X_1 \le X_2 \le \dots \le X_n$, then the median is defined by

$$\mathrm{Med} = X_{\frac{n+1}{2}} \quad \text{if } n \text{ is odd}, \qquad \mathrm{Med} = \frac{X_{\frac{n}{2}} + X_{\frac{n}{2}+1}}{2} \quad \text{if } n \text{ is even}.$$

Example: This is the list of ages of 7 MUL students

    age7=[23,24,16,19,30,28,33];
    age7s=sort(age7);
    Medage7=age7s((length(age7)+1)/2);

Example again! Now let us look at ungrouped data with an even number of observations.
For this, take 8 MUL students

    age8=[23,24,16,19,30,28,33,40];
    age8s=sort(age8);
    Medage8=(age8s((length(age8))/2)+age8s(length(age8)/2+1))/2;

Warning
The above formula/procedure for the median does not work well for grouped data (especially when the observed values are grouped into intervals). For grouped data, the formula/procedure for finding the median is more complicated and it gives only an estimate of the median; the method is given in the appendix. Nevertheless, it is relatively simple to find the median class, which is the first class whose cumulative frequency is bigger than N/2. However, we can apply the above procedure in our example of 100 MUL students.

Class mode
The class mode is the most frequently occurring class, i.e., the class which has the highest count. In our example, the mode or modal class is the value with the highest frequency (which is 9), i.e., 23. For grouped data we only have a complicated formula/procedure, which will be given in the appendix.

Fortunately, with Matlab we do not need to worry about these formulas; the software will do it for us (but you should read books and understand the procedure).

Example: The average age in our example is given by

    Mean_age2=mean(Age);

Mean_age2 is equivalent to the second definition of the mean, i.e.,

$$\text{Mean\_age2} = \frac{1}{100} \sum_{i=17}^{40} f_i \, i,$$

which rounds to 28. Let us calculate the median.

    Med_age=median(Age);

which returns 26.5. (Note that median(T(:,1)) would instead return 28.5, which is the median of the 24 distinct age values 17, ..., 40 and not of the 100 observations.) Now let us calculate the mode

    Mode=mode(Age);

which gives us 23. This is also the modal class, as we grouped our data in a discrete way.

Measures of variation or dispersion

The measures of dispersion given in the first lecture note are valid for ungrouped data, but their meaning is the same for grouped data. For grouped data we give them below.

Consider a sample of size $N$ grouped into $n$ classes with values (or midpoints) $X_i$ and counts $f_i$, so that $N = \sum_{i=1}^{n} f_i$. The variance and the standard deviation of the sample are respectively defined by

$$S^2 = \frac{1}{N-1} \sum_{i=1}^{n} f_i (X_i - \bar{X})^2, \qquad S = \sqrt{S^2}.$$
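As a sanity check of the grouped variance, the following minimal sketch evaluates the sum over classes, weighted by the class counts and divided by the total count minus one, and compares it with Matlab's var applied to the expanded raw data. The small frequency table (X, f) is made up for illustration and is not the Age data; N here denotes the total number of observations.

```matlab
% Hypothetical discrete frequency table (values and counts are made up)
X = [17 18 19 20];      % class values X_i
f = [2 3 1 4];          % class counts f_i
N = sum(f);             % total number of observations

Xbar = sum(f .* X) / N;                     % grouped mean
S2   = sum(f .* (X - Xbar).^2) / (N - 1);   % grouped sample variance
S    = sqrt(S2);                            % grouped standard deviation

% Expand the table back into raw observations and compare with var
raw = repelem(X, f);                        % each X_i repeated f_i times
fprintf('grouped variance: %.4f, var(raw): %.4f\n', S2, var(raw));
```

For discrete classes the two values agree, because the grouped formula only collects equal observations into one term. When classes are intervals and X holds midpoints, the grouped value is only an approximation of the variance of the raw data.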
Sometimes, we use the shortcut formula

$$S^2 = \frac{1}{N-1} \left[ \sum_{i=1}^{n} f_i X_i^2 - \frac{1}{N} \left( \sum_{i=1}^{n} f_i X_i \right)^2 \right], \qquad S = \sqrt{S^2},$$

where $N = \sum_{i=1}^{n} f_i$ is the sample size and $n$ is the number of classes.

As for ungrouped data, we can also define the r-th moment and the r-th central moment. They are respectively defined by

$$M'_r = \frac{1}{N} \sum_{i=1}^{n} f_i X_i^r, \qquad M_r = \frac{1}{N} \sum_{i=1}^{n} f_i (X_i - \bar{X})^r.$$

These parameters can be used to define the coefficient of skewness and the kurtosis, whose definitions are exactly the same as for ungrouped data.

Let us calculate the kurtosis and the skewness of our data. First we try our formulas.

    Kurtf=mean((Age-Mean_age2).^4)/(mean((Age-Mean_age2).^2))^2;
    Skewf=mean((Age-Mean_age2).^3)/(mean((Age-Mean_age2).^2))^(3/2);

We compare them with the values returned by the Matlab functions kurtosis and skewness. Both differences should be zero up to rounding error.

    Kurtf-kurtosis(Age)
    Skewf-skewness(Age)

Interquartile
The k-th percentile is the value of the observed variable which has a cumulative frequency equal to k/100. The first quartile, the second quartile and the third quartile correspond to the values with cumulative frequencies 25%, 50% and 75%, respectively. The interquartile range is the difference between the third quartile and the first quartile. It is a range within which the middle half of the data lie.

Appendix: Calculation of some parameters for grouped data.

This appendix explains how to calculate some parameters of grouped data, in particular when the range of the observed values of the variable is covered by disjoint intervals.

Many of the calculations in this section require the knowledge of the lower/upper boundaries of the classes, which in turn require the knowledge of the gap between classes. The gap between classes is the difference between the lower limit of one class and the upper limit of the preceding class. For example, assume that our classes are the intervals $(a_i, b_i)$, $i = 1, \dots, n$. The gap is

$$\mathrm{gap} = a_{i+1} - b_i.$$

Having the gap at hand, we can form the class boundaries. The lower class boundaries are $\tilde{a}_i = a_i - \mathrm{gap}/2$ and the upper class boundaries are $\tilde{b}_i = b_i + \mathrm{gap}/2$.
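The construction of the class boundaries can be sketched in Matlab as follows. This is only an illustration under stated assumptions: the class limits a and b are made up, the gap is taken as the positive distance between the upper limit of one class and the lower limit of the next, and it is assumed to be the same between all neighbouring classes.

```matlab
% Hypothetical class limits (made up for illustration)
a = [0 5 10 15];         % lower class limits a_i
b = [4 9 14 19];         % upper class limits b_i

gap = a(2) - b(1);       % distance between neighbouring classes (assumed constant)
aT  = a - gap/2;         % lower class boundaries
bT  = b + gap/2;         % upper class boundaries

disp([aT' bT'])          % one row per class: [lower boundary, upper boundary]
```

With these boundaries the classes touch each other (the upper boundary of one class equals the lower boundary of the next), which is what the interpolation formulas for grouped data rely on.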
Now we are ready to estimate the median, the quartiles and the interquartile range of grouped data. Follow the steps below to calculate the median:

1. Form the cumulative frequency table and insert in it the class boundaries.
2. Call $N$ the total frequency, which is also the total number of observations or individuals in the sample.
3. Locate the median class, i.e., find the class which contains the $N/2$-th individual. Call it $C_m = (a_m, b_m)$ and let $\tilde{C}_m = (\tilde{a}_m, \tilde{b}_m)$ be the interval formed by its lower and upper class boundaries.
4. Apply the following formula to find the median

$$\mathrm{Median} = \tilde{a}_m + \frac{N/2 - F_b}{f_m} \, (\tilde{b}_m - \tilde{a}_m),$$

where $f_m$ is the frequency of the median class and $F_b$ is the cumulative frequency before the median class.

A similar argument can be used to compute the first quartile (position $N/4$) and the third quartile (position $3N/4$). Let $\alpha \in \{1, 3\}$; then

$$Q_\alpha = \tilde{a}_{Q_\alpha} + \frac{\alpha N/4 - F_b}{f_{Q_\alpha}} \, (\tilde{b}_{Q_\alpha} - \tilde{a}_{Q_\alpha}),$$

where the subscript $Q_\alpha$ refers to the class containing the $\alpha N/4$-th individual and $F_b$ is the cumulative frequency before that class.

For the mode we can use the following formula

$$\mathrm{Mode} = \tilde{a}_{mo} + \frac{f_{mo} - f_b}{2 f_{mo} - (f_a + f_b)} \, (\tilde{b}_{mo} - \tilde{a}_{mo}),$$

where $f_{mo}$ is the frequency of the class mode, and $f_b$ and $f_a$ are respectively the frequencies of the classes before and after the class mode.

Exercise: Find the median, the interquartile range and the mode of the following grouped data.

    Time to travel to work    Frequency
    1-10                      8
    11-20                     14
    21-30                     12
    31-40                     9
    41-50                     7
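The interpolation steps for the median and the quartiles can be sketched in Matlab as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the grouped data (a, b, f) is made up and is deliberately not the exercise table, the class width is taken as the distance between the class boundaries, and the class containing the target position is taken as the first class whose cumulative frequency reaches that position.

```matlab
% Hypothetical grouped data (made up; not the exercise table)
a = [0 5 10 15];          % lower class limits
b = [4 9 14 19];          % upper class limits
f = [3 8 6 3];            % class frequencies

gap = a(2) - b(1);        % gap between neighbouring classes (assumed constant)
aT  = a - gap/2;          % lower class boundaries
bT  = b + gap/2;          % upper class boundaries
N   = sum(f);             % total number of observations
CF  = cumsum(f);          % cumulative frequencies

targets = [N/4, N/2, 3*N/4];   % positions of Q1, median and Q3
est = zeros(size(targets));
for k = 1:numel(targets)
    c  = find(CF >= targets(k), 1);    % class containing the target position
    Fb = CF(c) - f(c);                 % cumulative frequency before that class
    est(k) = aT(c) + (targets(k) - Fb) / f(c) * (bT(c) - aT(c));
end

Q1 = est(1);  Med = est(2);  Q3 = est(3);
IQR_est = Q3 - Q1;                     % estimated interquartile range
fprintf('Q1 = %.4f, median = %.4f, Q3 = %.4f\n', Q1, Med, Q3);
```

Using one loop for all three positions reflects that the median and the quartiles share the same formula and differ only in the target position. The same code can be applied to the exercise by replacing a, b and f with the values from the table.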
