Types of data and descriptive statistics
Intermediate Training in Quantitative Analysis
Bangkok, 19-23 November 2007
LEARNING PROGRAMME

Topics to be covered in this presentation
  Data collection and variable types
  Basic descriptive analysis
  Review of mean, median, mode, range, frequencies, crosstabs
  Multiple response

Learning objectives
By the end of this session, the participant should be able to:
  Define variable types
  Conduct basic descriptive statistics for continuous and categorical variables
  Conduct multiple response tests

In the social sciences we are usually interested in discovering something about a phenomenon. Whatever the phenomenon we desire to explain, we seek to explain it by collecting data from the real world and then using these data to draw conclusions about what is being studied.

When collecting data…
We rely on two types of variables:
  1. Continuous
  2. Categorical
Variables provide us with information on individuals, households, administrative areas, meteorological stations, etc.

Continuous variables
They assume numeric values, expressed in a given unit of measurement.
Examples: income, mm of rainfall, amount of agricultural production, percent of food-insecure households (HHs), weight-for-height z-score for children, etc.
Most importantly, each number (within a variable) has a meaning in relation to the other numbers, allowing arithmetic comparisons to be drawn.

Types of variables
  CATEGORICAL: Nominal, Ordinal
  CONTINUOUS:  Interval, Ratio

Types of variables: different levels of measurement
Interval: An interval scale differs from an ordinal one in that the differences between adjacent categories are equal. Examples include the Fahrenheit and Celsius temperature scales.
Ratio: A ratio scale differs from an interval one in that there is a true zero point (for example, % of HHs food insecure).
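The difference between interval and ratio scales can be illustrated with a short Python sketch. It is not part of the original slides; the temperatures and percentages below are made-up values chosen only for illustration. On an interval scale such as temperature, ratios of values depend on the unit, while comparisons of differences do not. On a ratio scale the true zero makes "twice as much" meaningful.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32


# Illustrative (made-up) temperatures on an interval scale.
cool_c, warm_c, hot_c = 10.0, 20.0, 30.0
cool_f, warm_f, hot_f = (celsius_to_fahrenheit(t) for t in (cool_c, warm_c, hot_c))

# Ratios of values change with the unit, so "twice as warm" is not meaningful.
ratio_c = warm_c / cool_c  # 2.0 in Celsius
ratio_f = warm_f / cool_f  # 68 / 50 = 1.36 in Fahrenheit

# Comparisons of differences are the same in both units.
diff_ratio_c = (hot_c - warm_c) / (warm_c - cool_c)
diff_ratio_f = (hot_f - warm_f) / (warm_f - cool_f)

# Ratio scale: a percentage has a true zero, so 40% really is twice 20%.
pct_area_a, pct_area_b = 20.0, 40.0  # made-up % of food-insecure households
ratio_pct = pct_area_b / pct_area_a

print(ratio_c, ratio_f, diff_ratio_c, diff_ratio_f, ratio_pct)
```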

Distribution of continuous variables
Continuous variables can be either symmetrically distributed (for example, normally distributed) or asymmetrically distributed (otherwise known as skewed).
Note: To assess a variable's distribution in SPSS, use the “Histogram” option under Graphs (SPSS Help can provide more details).
A normal distribution looks like this…
[Figure: normal distribution]

Distribution of continuous variables…
A skewed distribution looks like this… As this shows, distributions can be both positively and negatively skewed.
[Figure: skewed distributions]

Let's take a look at two examples…
  Income
  Weight-for-height z-scores

Categorical variables
Values are categories, taking a limited set of values.
Examples: child age groups (0-11, 12-23, 24-35, 36-47, 48-59 months); sex of respondent (male/female).
Categories can be denoted by numbers or letters.
Categorical variables take two forms: nominal and ordinal.

Types of variables
  CATEGORICAL: Nominal, Ordinal
  CONTINUOUS:  Interval, Ratio

Nominal variables
A nominal measurement scale is a set of mutually exclusive categories that vary qualitatively but not quantitatively, for example gender, provinces, income sources, etc.
Codes are labels representing different behaviours/characteristics, and they do not imply any underlying order.
Examples: variables with a “yes/no” or “male/female” answer.

Ordinal categorical variables
An ordinal measurement scale differs from a nominal one in that the order among the original categories is preserved in the analysis. However, differences between adjacent categories are not necessarily equal. Examples: social class and perception (bad – medium – good).
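As a minimal sketch of why order matters for ordinal variables, the Python snippet below ranks the perception categories from the slide (bad, medium, good) and finds the median category. The list of responses is hypothetical and is not taken from any survey. The median only makes sense because the categories have an order; for a nominal variable such as province, no such ranking exists.

```python
from statistics import median_low

# Ordered categories from the slide: bad < medium < good.
ORDER = ["bad", "medium", "good"]
RANK = {category: position for position, category in enumerate(ORDER)}

# Hypothetical responses, for illustration only.
responses = ["good", "bad", "medium", "medium", "good", "bad", "medium"]

# Convert each response to its rank and take the middle rank.
ranks = sorted(RANK[answer] for answer in responses)
median_category = ORDER[median_low(ranks)]

print(median_category)
```

The ranks 0, 1, 2 carry only the order. The gap between "bad" and "medium" is not assumed to equal the gap between "medium" and "good", which is why a mean of these codes would not be meaningful.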

Descriptive statistics
Descriptive statistics are the most basic form of statistics. They include:
  Summaries of one variable
  Comparisons of two or more variables
These tests are the foundation for more advanced statistical techniques.

Descriptives
  Continuous variables: range, mean, median, mode
  Categorical variables: frequencies, crosstabs

First, let's discuss descriptives for continuous variables…

Range
The range is the spread between the smallest and the largest values in a distribution.

Mean
The mean is a measure of the variable's central tendency.
The (arithmetic) MEAN is the sum of all the values divided by the number of cases.
Statistics such as the mean work best when the distribution is approximately normal; in a skewed distribution, a few extreme values pull the mean away from the typical case.

Median
The MEDIAN is the value above and below which half the cases fall (the 50th percentile), i.e. the middle value of a set of observations ranked in order.
The median is a measure of central tendency that is not sensitive to outlying values, unlike the mean, which can be affected by a few extremely high or low values.
The median does not assume a normal distribution.

Mode
The MODE of a distribution is the value of the observation occurring most frequently. It can be used with all measurement scales. If several values share the greatest frequency of occurrence, each of them is a mode.

To illustrate these concepts…
Looking at age data from 10 individuals…

  Individual   Age
  1            12
  2            19
  3            23
  4            26
  5            28
  6            28
  7            28
  8            34
  9            36
  10           38

What is the range? 12 to 38
What is the mean? 27.2
What is the median? 28
What is the mode? 28

Other basic concepts that must be understood…
  Variance
  Standard deviation

Standard deviation and variance
The standard deviation is, roughly, the average distance between the mean and the observations made (and so is a measure of how well the mean describes the actual data).
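The age example above can be checked with Python's standard `statistics` module. This is a minimal sketch, not the course's SPSS procedure. One assumption to note: the standard deviation and variance here use the sample formulas (dividing by n - 1); the slides do not say which formula is intended, and dividing by n instead gives slightly smaller values.

```python
import statistics

# Ages of the 10 individuals from the slide.
ages = [12, 19, 23, 26, 28, 28, 28, 34, 36, 38]

age_range = (min(ages), max(ages))     # smallest and largest values
age_mean = statistics.mean(ages)       # sum of values / number of cases
age_median = statistics.median(ages)   # middle value of the ranked data
age_mode = statistics.mode(ages)       # most frequent value

# Assumption: sample formulas (n - 1 in the denominator).
age_sd = statistics.stdev(ages)
age_variance = statistics.variance(ages)

print("range:", age_range)
print("mean:", age_mean)
print("median:", age_median)
print("mode:", age_mode)
print("standard deviation:", round(age_sd, 2))
print("variance:", round(age_variance, 2))
```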
The variance is the square of the standard deviation.

Variance and standard deviation

  Individual   Age
  1            12
  2            19
  3            23
  4            26
  5            28
  6            28
  7            28
  8            34
  9            36
  10           38

What is the standard deviation of age?
What is the variance?

Standard deviation
In a normal distribution, 68.27% of cases fall within ± one standard deviation of the mean, 95.45% of cases fall within ± two standard deviations and 99.73% fall within ± three standard deviations.
For example, if the mean age is 45, with a standard deviation of 10, about 95% of the cases would be between 25 and 65 in a normal distribution.

Standard deviation
[Figure: standard deviation]

Now let's discuss descriptives for categorical variables…

Analysing categorical data
If you want to look at the relationship between two categorical variables:
  You cannot use the mean and median.
  The mean of a categorical variable is completely meaningless because the numeric values we attach to different categories are arbitrary.

Descriptives for categorical data
The most basic descriptives for categorical variables are frequencies, which show the number (or percent) of cases in each category.

WAZPREV
                               Frequency   Percent   Valid Percent   Cumulative Percent
  Valid     Not malnourished        4290      11.2            80.2                 80.2
            Malnourished            1058       2.8            19.8                100.0
            Total                   5348      14.0           100.0
  Missing   System                 32840      86.0
  Total                            38188     100.0

Descriptives for categorical data
Second, we can also cross-tabulate categories from one variable with categories from a second variable. This is known as a contingency table.

Contingency Tables
Child Gender * underweight Crosstabulation
                                         underweight
  Child Gender                           no       yes      Total
  Male      Count                        793      280      1073
            % within Child Gender        74%      26%      100%
            % within underweight         51%      53%      51%
            % of Total                   38%      13%      51%
  Female    Count                        772      253      1025
            % within Child Gender        75%      25%      100%
            % within underweight         49%      47%      49%
            % of Total                   37%      12%      49%
  Total     Count                        1565     533      2098
            % within Child Gender        75%      25%      100%
            % within underweight         100%     100%     100%
            % of Total                   75%      25%      100%

Contingency Tables
                                              Orphan
  Residence                                   No        Yes       Total
  Capital, large city   Count                 1,071     125       1,196
                        % within residence    89.5%     10.5%     100.0%
                        % within orphan       7.4%      6.8%      7.3%
  Small city            Count                 395       37        432
                        % within residence    91.4%     8.6%      100.0%
                        % within orphan       2.7%      2.0%      2.6%
  Town                  Count                 824       131       955
                        % within residence    86.3%     13.7%     100.0%
                        % within orphan       5.7%      7.1%      5.8%
  Countryside           Count                 12,252    1,558     13,810
                        % within residence    88.7%     11.3%     100.0%
                        % within orphan       84.3%     84.2%     84.2%
  Total                 Count                 14,542    1,851     16,393
                        % within residence    88.7%     11.3%     100.0%
                        % within orphan       100.0%    100.0%    100.0%

Example…
WAZPREV * ORPHAN Crosstabulation
                                           ORPHAN
  WAZPREV                                  Non orphan   Orphan    Total
  Not malnourished   Count                 3995         207       4202
                     % within WAZPREV      95.1%        4.9%      100.0%
                     % within ORPHAN       80.2%        80.9%     80.2%
  Malnourished       Count                 987          49        1036
                     % within WAZPREV      95.3%        4.7%      100.0%
                     % within ORPHAN       19.8%        19.1%     19.8%
  Total              Count                 4982         256       5238
                     % within WAZPREV      95.1%        4.9%      100.0%
                     % within ORPHAN       100.0%       100.0%    100.0%

What percentage of orphans are malnourished?
What percentage of non-orphans are malnourished?
What percentage of malnourished children are orphans?
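The three questions on the WAZPREV * ORPHAN table come down to choosing the right denominator: the column total for "% within ORPHAN" and the row total for "% within WAZPREV". The Python sketch below recomputes them from the four cell counts in the table. The function names are illustrative and are not SPSS terminology.

```python
# Cell counts from the WAZPREV * ORPHAN crosstabulation: (row, column) -> count.
counts = {
    ("Not malnourished", "Non orphan"): 3995,
    ("Not malnourished", "Orphan"): 207,
    ("Malnourished", "Non orphan"): 987,
    ("Malnourished", "Orphan"): 49,
}


def percent_within_column(row: str, column: str) -> float:
    """Share of a column's cases that fall in the given row (% within ORPHAN)."""
    column_total = sum(n for (_, c), n in counts.items() if c == column)
    return 100 * counts[(row, column)] / column_total


def percent_within_row(row: str, column: str) -> float:
    """Share of a row's cases that fall in the given column (% within WAZPREV)."""
    row_total = sum(n for (r, _), n in counts.items() if r == row)
    return 100 * counts[(row, column)] / row_total


orphans_malnourished = percent_within_column("Malnourished", "Orphan")
non_orphans_malnourished = percent_within_column("Malnourished", "Non orphan")
malnourished_orphans = percent_within_row("Malnourished", "Orphan")

print(f"Orphans who are malnourished: {orphans_malnourished:.1f}%")
print(f"Non-orphans who are malnourished: {non_orphans_malnourished:.1f}%")
print(f"Malnourished children who are orphans: {malnourished_orphans:.1f}%")
```

The results match the 19.1%, 19.8% and 4.7% cells in the table.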

Multiple response analysis
Sometimes we have to analyze categorical data where households are able to give more than one response to a question (e.g. livelihoods, coping strategies, etc.).
Analyzing such data requires a multiple response analysis.

Example questionnaire
After completing question SCM1, complete one incident at a time (line by line), each time repeating the questions above.

  SCM 2. By order of importance, what incidents did your household experience in the last 1 YEAR?
  SCM 3. What is the main action your household took to compensate for the effect of that incident?
  SCM 4. How often did you do this in the last 1 YEAR?
  SCM 5. Did your household recover from that incident?

             SCM 2        SCM 3        SCM 4        SCM 5
  MAIN:      (Code) ___   (Code) ___   ___ times    1. Yes   2. No
  SECOND:    (Code) ___   (Code) ___   ___ times    1. Yes   2. No
  THIRD:     (Code) ___   (Code) ___   ___ times    1. Yes   2. No
  FOURTH:    (Code) ___   (Code) ___   ___ times    1. Yes   2. No
  FIFTH:     (Code) ___   (Code) ___   ___ times    1. Yes   2. No

Incidents code:
  1 = Insecurity, violence
  2 = Increased price for food
  4 = Drop in farm gate price
  5 = Floods
  6 = Drought/dry spell
  7 = Crop pest and disease
  8 = Livestock disease
  9 = Sickness of household member
  10 = Death of household member
  11 = Increased household size (IDPs)
  12 = Loss/lack of employment

Coping code:
  0 = Nothing
  1 = Eat less preferred foods
  2 = Eat fewer or smaller meals per day
  3 = Go one entire day without meals
  4 = Collect wild foods, hunt or harvest immature crops
  5 = Distress sale/slaughter of livestock
  6 = Distress sale of other assets
  7 = Purchase food on credit
  8 = Borrow food from families and friends, kinship support
  9 = Worked for money
  10 = Worked for food only
  11 = Reduced expenditures on health or education
  12 = Spent savings
  13 = Some household members migrated
  14 = Other (specify)_____________

Multiple response frequencies
$shocks Frequencies
                                        Responses             Percent of
  $shocks (a)                           N         Percent     Cases
  Insecurity, violence                  2418      12.5%       36.0%
  Higher prices                         2115      11.0%       31.5%
  Drop in farmgate price                881       4.6%        13.1%
  Floods                                1410      7.3%        21.0%
  Drought                               2695      14.0%       40.1%
  Crop pest/disease                     2031      10.5%       30.3%
  Livestock disease                     1479      7.7%        22.0%
  Sickness in HH                        2950      15.3%       43.9%
  Death in HH                           1229      6.4%        18.3%
  Increased HH size (IDPs)              689       3.6%        10.3%
  Loss/lack of employment               1398      7.2%        20.8%
  Total                                 19295     100.0%      287.4%
  a. Group

N is the number of households that reported each shock.
The Percent column reports the percentage of total responses represented by each shock. This is not easily available from individual frequency tables.
The Percent of Cases column is the percentage of valid cases represented by each shock.

Multiple response crosstabs
$shocks*fcgbivariate Crosstabulation
                                                       fcgbivariate
                                                       poor/borderline   acceptable
  $shocks (a)                                          food cons         food         Total
  Insecurity, violence        Count                    726               1516         2242
                              % within $shocks         32.4%             67.6%
  Higher prices               Count                    710               1269         1979
                              % within $shocks         35.9%             64.1%
  Drop in farmgate price      Count                    275               536          811
                              % within $shocks         33.9%             66.1%
  Floods                      Count                    494               787          1281
                              % within $shocks         38.6%             61.4%
  Drought                     Count                    911               1607         2518
                              % within $shocks         36.2%             63.8%
  Crop pest/disease           Count                    646               1227         1873
                              % within $shocks         34.5%             65.5%
  Livestock disease           Count                    414               946          1360
                              % within $shocks         30.4%             69.6%
  Sickness in HH              Count                    764               2050         2814
                              % within $shocks         27.1%             72.9%
  Death in HH                 Count                    341               822          1163
                              % within $shocks         29.3%             70.7%
  Increased HH size (IDPs)    Count                    187               451          638
                              % within $shocks         29.3%             70.7%
  Loss/lack of employment     Count                    431               869          1300
                              % within $shocks         33.2%             66.8%
  Total                       Count                    1887              4467         6354
  Percentages and totals are based on respondents.
  a. Group

Multiple response…
To set up a multiple response in SPSS:
  1. Click on “Analyze”
  2. Click on “Multiple response”
  3. Click on “Define sets…”
  4. Move the variables of interest into the box on the right
  5. Define the range of the variable
  6. Name the variable
  7. Click on “Add”
  8. Click on “Close”

Multiple response…
To run a multiple response in SPSS:
  1. Click on “Analyze”
  2. Click on “Multiple response”
  3. Click on “Frequencies…” or “Crosstabs…” (whichever descriptive test you would like to conduct)
  4. Move the variables into the proper boxes
  5. Click “OK”

Now, practical exercises…
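To make the difference between "Percent" (of responses) and "Percent of Cases" concrete, here is a minimal Python sketch of a multiple response frequency table. It is not the SPSS procedure, and the five households below are hypothetical data invented for illustration. One assumption to note: a household with no reported shock is treated as not a valid case.

```python
from collections import Counter

# Hypothetical data: the shocks each household reported (up to several each).
households = [
    ["Drought", "Floods"],
    ["Drought"],
    ["Sickness in HH", "Drought", "Floods"],
    ["Sickness in HH"],
    [],  # no shock reported: assumed not to count as a valid case
]

# Valid cases are households that gave at least one response.
valid_cases = [shocks for shocks in households if shocks]
n_cases = len(valid_cases)

# N: number of households reporting each shock.
n_by_shock = Counter(shock for shocks in valid_cases for shock in shocks)
n_responses = sum(n_by_shock.values())

# Percent: share of all responses. Percent of Cases: share of valid households.
pct_of_responses = {s: 100 * n / n_responses for s, n in n_by_shock.items()}
pct_of_cases = {s: 100 * n / n_cases for s, n in n_by_shock.items()}

for shock, n in n_by_shock.most_common():
    print(f"{shock:<16} N={n}  "
          f"{pct_of_responses[shock]:5.1f}% of responses  "
          f"{pct_of_cases[shock]:5.1f}% of cases")
print(f"Total percent of cases: {sum(pct_of_cases.values()):.1f}%")
```

The Percent column always sums to 100%, while Percent of Cases can exceed 100% because one household can report several shocks. This is why the total in the $shocks table is 287.4%.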