Descriptive statistics
Lecture 1

Lecture aim(s)
To equip students with the knowledge, skills and techniques for summarizing data using various statistical and documentary tools

Learning Objectives
By the end, students should be able to
1. Summarize any given data set to a manageable level so that its picture (features) can be seen clearly
2. Use tables and / or diagrams to describe data
3. Interpret the tables or diagrams used

Supplementary reading notes
• Modern Approaches to the analysis of experimental data ~ By Statistical Service Centre, University of Reading, 2001 or 2006 versions
• Analyzing the data, from GEAR 4.2
• Modern Methods of Data Analysis ~ By Statistical Service Centre, University of Reading, 2002 version
• Case study 3 from Biometrics RM ILRI CD

Self Learning software
• Open folder Statistics Made Simple
• Launch sms – topic 1
• Follow the teacher

Practical statistical packages
• Genstat
• SPSS

Steps in the Analysis
1. Defining the objectives of analysis
• Decide on the objectives of the analysis before starting it – do an analysis plan.
2. Preparing the data
• You may need to construct variables needed for analysis (e.g. find plant N content from concentrations and biomass data) or summarise variables to the correct ‘level’
3. Descriptive analysis
• Calculation of summary tables and graphs, as defined when setting objectives.
• Exploratory analysis to identify any unexpected patterns or results
4. Confirmatory analysis
• Adding measures of precision (e.g. standard errors and results of significance tests) to the results found in the descriptive analysis.
• Improving the estimates of various critical quantities.
5. Interpretation
• Integrating the new knowledge with the existing body of knowledge on the problem
• Comparing results with those from other studies, building predictive models and formulating new hypotheses
6. Reporting
• Reporting the analysis and presenting the final tables and graphs
Source: SSC Data analysis workshop

Data Exploration and Descriptive statistics
Follows after data entry

Descriptive statistics
• Aims at
– reducing the data to manageable proportions,
– summarising trends and tendencies within the data
– so that one can see the results clearly
• As the amount of data grows, it becomes difficult to have a clear picture of what is happening
• This leads to a process of data exploration
• And of reducing data into tables, diagrams and numerical measures

Review on data structure
• Please review data structure for analyses
• In CAST, refer to Introduction: About data

Standard data structure
Lecturer to explain and illustrate univariate and multivariate data structures, arrangement of variables ~ factors and records; integer and real variables, etc.

Aims of descriptive statistics
• That is
– In data there is information of interest
– This information becomes unclear as the amount of data increases
– One way to get this information from data is through descriptive statistics

Go through anthropometric data
See the picture in the data

Practice on understanding data
• Using data collected from students
• Define variables using SPSS
• Enter data
• Merge data from all students
• Do some computations on this and other data
• Understand data as you increase data set size
• See the need for descriptive statistics

Computation practice
• On class data
• Compute departments from knowledge of questionnaire coded numbers
• Generate parameters of food security

Descriptive statistics / data summarising depends on the type of variables
• Qualitative variables
– Are summarised into frequencies
– Are presented as tables, bar charts, pie charts
• Quantitative variables
– Are summarised using numerical measures
– Are presented as tables, histograms, stem and leaf diagrams, box and whisker plots, scatter plots

More on types of data / variables
[Diagram: Data types. Qualitative (categorical, discrete data): nominal variable, ordinal variable. Quantitative (ratio / interval scale): discrete data, continuous data.]

Descriptive Statistics of qualitative data
A frequency distribution
• Shows the frequencies of occurrence of the observations in a data set
• According to the class or category of the qualitative variable
• Results can be displayed in a
– Table or
– in a diagram such as a bar chart or pie chart
• In which each class is represented

An example of frequency distribution
From SAVE Baseline data file

Relative frequency distribution
• When two or more frequency distributions are compared and their total numbers differ,
• It is difficult to compare the raw frequencies
• So calculate the proportion or percentage of observations in each class or category
• Hence relative frequencies
• They sum to unity or 100 %

Frequency distribution example
From SAVE Baseline data file

house type by roof
                  Frequency   Percent   Valid Percent   Cumulative Percent
grass thatched          793      79.1            79.1                 79.1
iron sheet              205      20.4            20.4                 99.5
tiles                     5       0.5             0.5                100.0
Total                  1003     100.0           100.0

house type by roof
                  Frequency   Percent   Valid Percent   Cumulative Percent
grass thatched          430      86.0            86.0                 86.0
iron sheet               70      14.0            14.0                100.0
Total                   500     100.0           100.0

Cumulative relative frequency distribution
Sometimes computed for certain purposes, e.g.
• To determine the percentage of observations below a certain cutoff point
• Helps in developing some indices such as measures of the distribution of assets among individuals – Lorenz curves and Ginis

An example of cumulative frequency distribution, distribution of livestock among households
[Figure: cumulative relative frequency (%) plotted against number of livestock, with the 5th percentile, median and 95th percentile marked]
See also Fig 2.1 page 13 of Statistics for Vet and AS

Percentile
• Values of a variable which divide the total frequency into 100 equal parts
• e.g.
the 50th percentile (median) is the value of the variable that divides the distribution into two halves
• That is, 50 % of individuals have observations less than the median and 50 % have observations greater than the median
• The 25th and 75th percentiles are often quoted as the lower and upper quartiles, respectively
• That is, 25 % of observations lie below the lower quartile and 25 % lie above the upper quartile

Percentile output from SPSS
household size (data from income data)
Percentiles                          5      10     25     50     75     90     95
Weighted Average (Definition 1)   2.00   2.00   3.00   4.00   6.00   8.00   9.00
Tukey's Hinges                                  3.00   4.00   6.00

Cumulative relative frequency polygon of distribution of length of eggs
[Figure: cumulative frequency (%) plotted against length of egg (mm), from 19.7 to 25.2 mm, with the median marked]

An example of cumulative frequency distribution, Lorenz curves
[Figure: percent of livestock asset plotted against percent of households, showing the ideal line and the livestock curve]

Lorenz curves
• Measure the level of asset distribution
• By comparing the actual distribution to a line of perfect distribution
• Based on the cumulative frequency distribution of assets vs the cumulative frequency distribution of households
• That is, cumulative asset is plotted against cumulative household

Lorenz curve for income drawn from Genstat
[Figure: Lorenz curve for income, both axes from 0.0 to 1.0. Gini coefficient 0.7027; 95% bootstrap confidence interval (0.560, 0.807)]

Data is arranged in order as follows (details in Excel)
Livestock   Farmers   No. livestock   Percent livestock   Cumulative   Percent farmer   Cumulative
0           1285      0               0.00                0.00         57.16            57.16
1           193       193             5.30                5.30         8.59             65.75
2           214       428             11.75               17.05        9.52             75.27
3           162       486             13.34               30.39        7.21             82.47
4           115       460             12.63               43.01        5.12             87.59
5           78        390             10.71               53.72        3.47             91.06
6           74        444             12.19               65.91        3.29             94.35
7           39        273             7.49                73.40        1.73             96.09
8           22        176             4.83                78.23        0.98             97.06
9           16        144             3.95                82.19        0.71             97.78
10          11        110             3.02                85.20        0.49             98.27
11          14        154             4.23                89.43        0.62             98.89
12          10        120             3.29                92.73        0.44             99.33
13          2         26              0.71                93.44        0.09             99.42
14          3         42              1.15                94.59        0.13             99.56
15          4         60              1.65                96.24        0.18             99.73
16          1         16              0.44                96.68        0.04             99.78
17          1         17              0.47                97.15        0.04             99.82
19          1         19              0.52                97.67        0.04             99.87
20          1         20              0.55                98.22        0.04             99.91
31          1         31              0.85                99.07        0.04             99.96
34          1         34              0.93                100.00       0.04             100.00
Total       2248      3643            100                              100

Cross tabulations
• Indicate the association between the frequency distributions of two or more variables
– An example of house type by roof and household food security status

Practical
• From your data, isolate qualitative data and analyse for frequencies and cross tabulations. Report in tables and graphically. Interpret the results

Frequency distribution for quantitative variable
• Is calculated when quantitative data is split into class intervals
• Each class encompasses a range of values of the variable
• An observation falls into only one class
• Then determine the number of observations belonging to each class
• A complete set of class frequencies is a frequency distribution

Frequency distribution for quantitative variable ~ Practical
• From Anthropometric data
• On variables Z-scores
– Split into three status groups: Normal, Undernourished, Overweight
– Based on own set criteria
• Then run frequency distribution

Frequency distribution for quantitative variable ~ Practical
• From class generated data
• Merge assumed provided energy food data
• Compute and analyse for
– Food secure households
– Food insecure households
• Then run frequency distribution and cross tabulations

Practical in Genstat
• Open Income data in spss into Genstat
• Run frequency
• Stats- summary statistics – tarry
• Then take rthouse
• Same can be analyzed from survey analysis

EDA / Descriptive analysis of quantitative data
• In summarizing quantitative variables the most interesting things are
– Location (What is a typical value?)
– Spread (How much variation is there?)
– Odd values (What is their source and interpretation?)
• Location is measured by the mean or median
• Spread is measured by the standard deviation or the distance between quartiles.
• Use Histograms and Box plots.
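As a rough illustration of the location, spread and odd values questions above, the following Python sketch summarises the 20 liveweight values that the lecture uses as its example data set. The rule for flagging odd values (more than 1.5 times the interquartile range beyond the quartiles) is an assumption made for this sketch and is not taken from the lecture. The function name `summarise` and the extra 120 kg value are also hypothetical.

```python
import statistics

# Liveweight data (kg) from the lecture's example data set.
liveweight_kg = [30, 24, 20, 25, 25, 19, 35, 37, 39, 43,
                 38, 20, 28, 22, 28, 25, 20, 35, 43, 36]


def summarise(values):
    """Return location, spread and odd values for a list of numbers."""
    # Quartile cut points (lower quartile, median, upper quartile).
    q1, q2, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    # Assumption for this sketch: values further than 1.5 * IQR
    # outside the quartiles are flagged as "odd".
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    odd = [v for v in values if v < low_fence or v > high_fence]
    return {
        "median": q2,
        "lower_quartile": q1,
        "upper_quartile": q3,
        "interquartile_range": iqr,
        "odd_values": odd,
    }


summary = summarise(liveweight_kg)
for name, value in summary.items():
    print(f"{name}: {value}")

# A hypothetical recording error (120 kg) shows up as an odd value.
with_error = summarise(liveweight_kg + [120])
print("odd values after adding 120:", with_error["odd_values"])
```

Different packages use slightly different quartile definitions (SPSS reports both a weighted average and Tukey's hinges), so quartiles from this sketch may not match SPSS or Genstat output exactly.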
Descriptive analysis of quantitative data ~ check also data exploration & description (Case Study 11)
• This involves various statistics that measure
– location (Centre)
– and dispersion about the location (Spread)

Measures of location
• These usually refer to measures of central tendency of a data set
• The arithmetic mean
• The geometric mean
• The median
• The mode

The Arithmetic Mean
• The average value of observations in a data set: $\bar{x} = \frac{\sum x_i}{n}$
• True population mean ($\mu$)
• Sample estimate of the population mean ($\bar{x}$)
Mean is the most commonly used measure of central tendency

Example of data set
No.:                  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Liveweight (y) kg:   30 24 20 25 25 19 35 37 39 43 38 20 28 22 28 25 20 35 43 36
$\bar{X} = \frac{\sum X}{n} = \frac{592}{20} = 29.6$

Problems with the mean
• Mean value is influenced by outliers
– An outlier is an observation whose value is highly inconsistent with the main body of the data
– It can be excessively large or small
– And it pulls the mean in the same direction
• The mean requires a symmetrical distribution of data to be an appropriate measure of central tendency
• Mean will be pulled to the right if the distribution is skewed to the right and pulled to the left if the distribution is skewed to the left

Median
• Also commonly used
• The value that 50 % of observations exceed or fall below
• The observations need to be arranged in rank order
• Median is the 50th percentile
• If n is even, the median lies midway between the central two observations
• If n is odd, the median is found by counting until the ((n+1)/2)th observation is reached

Example of data set
No.:                  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Live weight (y) kg:  19 20 20 20 22 24 25 25 25 28 28 30 35 35 36 37 38 39 43 43
Median = 28

Demo on Excel data by changing one figure

Attributes of median
• Median is not affected by outliers in the data
• Median is not affected by a skewed distribution
• Preferred under these conditions
• Median will be
– less than the mean if data is skewed to the right
– greater than the mean if data is skewed to the left
– close or equal to the mean if the distribution is symmetrical
• Median, however, does not incorporate all observations in its calculation

Geometric mean
• Distribution of biological data is mostly symmetrical
• If not, data is mostly skewed to the right
• To make such data symmetrical, we take the log of each value in the data set
– Log transformation
• Means are therefore computed on the transformed data
• We then convert back to the original scale by taking the antilog
• This mean is called the geometric mean

Geometric mean ~ Practice
• Using income data merged
• Transform into ln (natural log) in SPSS and Genstat
• And run histogram with normality plot in SPSS
• Details at end of this lecture

Relationship between Geometric mean and median
• Geometric mean is
– always less than the arithmetic mean if data are skewed to the right
– usually close to the median if data are skewed to the right
• In such data, the preference is to transform the data and report geometric means rather than the median

Other measures of central tendency
• Mode
– The most frequent value

Genstat practice on
• Calling data from Excel
• Putting protocol and value definitions prior to dataset in Excel
• Use anthropdata
• Illustrate using on farm gliricidia and sesbania excel file

Measures of dispersion (variation)
• When data are collected, all values are rarely the same.
• A major role of statistics is to describe and analyze this variation.
• That is, an important role of statistics is to display and describe this variation in ways that highlight the information in it.
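The measures of location covered above (arithmetic mean, median, geometric mean) can be sketched in Python on the lecture's liveweight data. This is a minimal illustration and not a substitute for the SPSS or Genstat practicals. The helper name `geometric_mean` and the changed value 430 are choices made for this sketch, and the last part mimics the "changing one figure" demo by turning one 43 into 430.

```python
import math
import statistics

# Liveweight data (kg) from the lecture's example data set.
liveweight_kg = [30, 24, 20, 25, 25, 19, 35, 37, 39, 43,
                 38, 20, 28, 22, 28, 25, 20, 35, 43, 36]


def geometric_mean(values):
    """Mean of the natural logs, converted back with the antilog (exp)."""
    logs = [math.log(v) for v in values]  # requires all values > 0
    return math.exp(statistics.mean(logs))


mean_kg = statistics.mean(liveweight_kg)
median_kg = statistics.median(liveweight_kg)
gmean_kg = geometric_mean(liveweight_kg)

print(f"arithmetic mean: {mean_kg:.1f} kg")
print(f"median:          {median_kg:.1f} kg")
print(f"geometric mean:  {gmean_kg:.1f} kg")

# Change one figure: a hypothetical typing error turns one 43 into 430.
changed = list(liveweight_kg)
changed[changed.index(43)] = 430
changed_mean = statistics.mean(changed)
changed_median = statistics.median(changed)

print(f"mean after the change:   {changed_mean:.2f} kg")
print(f"median after the change: {changed_median:.1f} kg")
```

The mean moves noticeably after the single change while the median stays where it was, which is the behaviour described under "Problems with the mean" and "Attributes of median".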
Growth of yams (cm) for 7 days
10.1  9.2 11.9  6.3  7.4  5.4  9.3 11.1  7.2  6.8
 9.1 10.9 10.1  7.4  9.2  9.5  6.0  5.3  8.9 10.4
There is clearly variability between yam plants and a quick scan shows that all values are between 5 and 12 cm
Source: CAST ~ Displaying variables

Measures of dispersion
• Mean on its own is not adequate
• Hence the need to measure the distribution in a population
• e.g.
– Xa: 4, 10, 7, 3, mean Xa = 6
– Xb: 5, 7, 7, 5, mean Xb = 6
• They have the same mean but the distribution is different
• Hence the need for a measure of distribution to see how the population is dispersed about the mean

Range
• The difference between the extreme ends of the observations
Range = Max value - Min value
• Again, the range does not show how the population is dispersed about the mean

Variance
• Measure of the amount of variation in a population
• It is the sum of all squared deviations from the mean divided by the number of degrees of freedom
$\sigma^2 = \frac{\sum (Y_i - \mu)^2}{n-1} = \frac{\sum Y_i^2 - \frac{(\sum Y_i)^2}{n}}{n-1}$

An example (example in Excel)
No.    Liveweight (y) kg   $y - \bar{y}$   $(y - \bar{y})^2$
1      30                  0.4             0.16
2      24                  -5.6            31.36
3      20                  -9.6            92.16
4      25                  -4.6            21.16
5      25                  -4.6            21.16
6      19                  -10.6           112.36
7      35                  5.4             29.16
8      37                  7.4             54.76
9      39                  9.4             88.36
10     43                  13.4            179.56
11     38                  8.4             70.56
12     20                  -9.6            92.16
13     28                  -1.6            2.56
14     22                  -7.6            57.76
15     28                  -1.6            2.56
16     25                  -4.6            21.16
17     20                  -9.6            92.16
18     35                  5.4             29.16
19     43                  13.4            179.56
20     36                  6.4             40.96
Sum    592                 0.00            1218.8
Mean   29.6

$SS_y = \sum (y - \bar{y})^2 = 1218.8$
$V_y = \frac{SS_y}{n-1} = 64.1$ kg²
$\sigma = \sqrt{V_y} = 8.0$ kg

Standard deviation ($\sigma$)
• Describes the variation of a population or sample about the mean, in the same units as the data
• Best describes the population when used with the mean
[Figure: normal curve showing about 68 %, 95 % and 99.7 % of observations within 1, 2 and 3 standard deviations of the mean]
Usually reported as $\bar{x} \pm \sigma$
So populations can have the same mean but a different distribution
The standard deviation is a 'typical' distance of values from the centre of the distribution.
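The variance and standard deviation calculation above can be checked with a short Python sketch. It computes the variance both from the squared deviations and from the shortcut form (sum of squares minus the squared total over n), using n - 1 degrees of freedom as in the lecture. The function names are hypothetical, chosen for this sketch, and the Xa and Xb groups are the two small data sets from the "Measures of dispersion" slide.

```python
import math

# Liveweight data (kg) from the lecture's worked example.
liveweight_kg = [30, 24, 20, 25, 25, 19, 35, 37, 39, 43,
                 38, 20, 28, 22, 28, 25, 20, 35, 43, 36]


def sum_of_squares(values):
    """Sum of squared deviations from the mean (SS)."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


def variance_from_deviations(values):
    """SS divided by the degrees of freedom (n - 1)."""
    return sum_of_squares(values) / (len(values) - 1)


def variance_shortcut(values):
    """Same quantity from the sum of squares and the squared total."""
    n = len(values)
    return (sum(y * y for y in values) - sum(values) ** 2 / n) / (n - 1)


ss = sum_of_squares(liveweight_kg)
variance = variance_from_deviations(liveweight_kg)
sd = math.sqrt(variance)
print(f"SS = {ss:.1f}, variance = {variance:.1f} kg^2, SD = {sd:.1f} kg")

# Two groups with the same mean (6) but different spread.
group_a = [4, 10, 7, 3]
group_b = [5, 7, 7, 5]
sd_a = math.sqrt(variance_from_deviations(group_a))
sd_b = math.sqrt(variance_from_deviations(group_b))
print(f"group a: SD = {sd_a:.2f}; group b: SD = {sd_b:.2f}")
```

Both variance functions give the same value, which is the point of the two forms of the formula; the shortcut form is the one that is convenient on a calculator or in a spreadsheet.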
Coefficient of Variation (CV)
• The standardised standard deviation, so that it can be compared with those of other populations, other traits, different ages or classes
• It therefore measures variation in relative terms
• It is expressed in percent
$CV = \frac{S}{\bar{X}} \times 100$

Standard error (SE)
• It is the standard deviation of the means of samples drawn repeatedly from the same population
• It measures the precision of your estimates
• The smaller the SE, the more precise the estimate
• Reported as lsmean ± se
$S_{\bar{x}} = \frac{S}{\sqrt{n}}$
STD is a useful measure of variation of an individual observation
SE is a useful measure of variation of the mean

Confidence Interval (CI)
• CI is a range between upper and lower limits that is expected to include the true mean of the population at a given probability
• The true mean is the value for which a sample provides an unbiased estimate
• SE is used to calculate CI
• Usually talk about 95 % CI
• This is the interval expected to contain the true mean with a 95 % chance of being correct, or
• If sampling is repeated 20 times, about 19 of the intervals are expected to contain the true mean
• Approximate 95 % CI can be estimated as sample mean ± 2 * SE
• Approximate 99 % CI can be estimated as sample mean ± 2.6 * SE
• Usually the t value is used in calculating CI

Practical
• From data on one of the files, do statistical analyses
• Understand each concept

Testing for normality and displaying variation
• We have seen that most measures of quantitative data require the data distribution to be symmetrical
– Normally distributed
• In case of outliers, they need data management
• In case of skewed data, there is need to transform data
• There is therefore a need to check whether data are normally distributed before proceeding with analyses

Tools to test normality
• Use of diagrams
– Histograms, Tukey’s Box-and-whisker plots, Stem and Leaf, normal plot
• Statistical
– Shapiro-Wilk, Kolmogorov-Smirnov tests
• Consistency between mean and median
• Examples in SPSS
– File chickwt
– Anthropometric data

Histogram for weight at week 10
[Figure: histogram of WT10, classes from 400 to 1500. Std. Dev = 226.83, Mean = 808.0, N = 67]
Height of a class rectangle = frequency of the class
Check for symmetry – shape to the right of the central value is a mirror image of that to the left
Note rectangles are contiguous because quantitative variables are continuous

Box-and-whisker plot
[Figure: box-and-whisker plot of WT10, N = 67]
Horizontal lines of the box define the upper and lower quartiles, which enclose 50 % of observations
Median is marked by a line within the box
Whiskers are vertical lines extending from the box to the 2.5th and 97.5th percentiles
The rest are extreme values
The box-and-whisker plot therefore displays the range, median and quartiles

Box-and-whisker plot
[Figure: box-and-whisker plots of WT10 by SEX; f: N = 45, m: N = 22]
Box plots are also useful to compare a number of data sets, e.g. by sex, i.e. to check variation per group or category
We see that observations in each sex are approximately normally distributed, though range is higher in females than males, distribution higher in males than females

Boxplot of height by species
[Figure: boxplots of height for species 1 to 5]

Normal plot of weight at week 10
[Figure: Normal Q-Q Plot of wt10, Expected Normal against Observed Value]
Horizontal axis shows ordered values of the variable
Vertical axis represents the corresponding standardised normal deviates
Normally distributed data shows a straight line. If data is not normally distributed, the plot deviates from a straight line and a curve is produced

Descriptive of normality analyses
Descriptives, WT10
                                                 Statistic    Std. Error
Mean                                             808.03       27.711
95% Confidence Interval for Mean, Lower Bound    752.70
95% Confidence Interval for Mean, Upper Bound    863.36
5% Trimmed Mean                                  800.96
Median                                           796.00
Variance                                         51450.848
Std. Deviation                                   226.828
Minimum                                          350
Maximum                                          1498
Range                                            1148
Interquartile Range                              292.00
Skewness                                         .532         .293
Kurtosis                                         .653         .578

Extreme values of normality analysis
Extreme Values, WT10
             Case Number   Value
Highest  1   60            1498
         2   30            1332
         3   28            1276
         4   17            1214
         5   37            1208
Lowest   1   61            350
         2   44            366
         3   57            430
         4   12            466
         5   43            500

Test of normality
Tests of Normality
        Kolmogorov-Smirnov (a)         Shapiro-Wilk
        Statistic   df   Sig.          Statistic   df   Sig.
WT10    .074        67   .200*         .977        67   .259
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction

Graphical tests are subjective though commonly used
Objective tests of normality include SW (for < 2000 observations) and KS (for > 2000 observations)
These test the hypothesis that the data are not significantly different from normal
SW should be > .90 (p>0.05); KS should be small (and p>0.05)

Exercise on descriptive analyses and test of normality
From Anthropometric dataset available
• Run normality tests for all quantitative data
• That is
– Weight, height, MUAC, haz, whz and waz

Data transformation
• In case data fails to be normal
• There are two options
– Either go ahead with analyses and use non-parametric tests
– Or transform data and analyse the transformed values
• Data transformation is recommended

Basis for transformation
• Reduces skewness of the data and makes the residual variance less dependent on the mean
– By so doing, you normalise the distribution of data
• To linearize a relationship
– It is easier to analyse data and investigate a relationship when that relationship can be described by a straight line
• To stabilise the variance
– Equal variance is assumed in statistical analyses that often assume normal distribution

Types of transformations
• Can be
• Log or natural log
– Normalises data skewed to the right
– Add constant to 0 values (k+x)
• Square transformation
– Normalises data skewed to the left
• Logit transformation
– Mainly for proportions (percentages)
• Arcsine transformation
Analysis is done on transformed data but
• Report geometric mean by taking antilog of results
– Mean and CI
– Not their SE
• Constant must be taken off the antilog when reporting the results in order to calculate the correct geometric mean and CI
Most health, survival and hatchability animal data need transformation
Refer to literature for the type of transformation

Practical on data transformation
• Based on datasets provided
• And after running normality tests
• Transform non-normal data
• And run normality test again

Practical
• On livestock data
• Compute livestock units
• Test for normality
• Transform the data
• Compare between gender of households using location and distribution

Conversion table for livestock units
Class of livestock               Livestock units (per head)
Sheep                            0.15
Goats                            0.15
Cattle (24 months and over)      1.00
Cattle (6-23 months)             0.6
Chicken/duck broiler             0.005
Chicken/duck caged layers        0.008
Chicken pullets                  0.002
Pigs (sows)                      0.2
Pigs (weaners)                   0.05
Pigs (feeder hogs)               0.25
Pigeon                           0.0002
Rabbits                          0.008
Guinea fowl                      0.005
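For the "Compute livestock units" step of the practical, one possible approach is sketched below in Python. The units per head are copied from the conversion table above. The dictionary keys, the function name `total_livestock_units` and the example household are hypothetical and only serve to illustrate the arithmetic; in the practical itself the same calculation would be done with a compute or transform step in SPSS or Genstat.

```python
# Livestock units per head, copied from the conversion table.
# The dictionary keys are labels chosen for this sketch.
LIVESTOCK_UNITS = {
    "sheep": 0.15,
    "goats": 0.15,
    "cattle_24_months_and_over": 1.00,
    "cattle_6_to_23_months": 0.6,
    "chicken_duck_broiler": 0.005,
    "chicken_duck_caged_layers": 0.008,
    "chicken_pullets": 0.002,
    "pigs_sows": 0.2,
    "pigs_weaners": 0.05,
    "pigs_feeder_hogs": 0.25,
    "pigeon": 0.0002,
    "rabbits": 0.008,
    "guinea_fowl": 0.005,
}


def total_livestock_units(herd):
    """Sum head counts weighted by livestock units per head.

    `herd` maps a livestock class to the number of animals kept.
    An unknown class raises KeyError so that typos are not ignored.
    """
    return sum(LIVESTOCK_UNITS[cls] * count for cls, count in herd.items())


# Hypothetical household: 2 adult cattle, 5 goats and 10 pullets.
example_household = {
    "cattle_24_months_and_over": 2,
    "goats": 5,
    "chicken_pullets": 10,
}
household_lu = total_livestock_units(example_household)
print(f"Livestock units: {household_lu:.2f}")  # prints Livestock units: 2.77
```

The resulting livestock units per household form a single quantitative variable, which can then be tested for normality, transformed if needed, and compared between household groups as the practical asks.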