Data and central tendency
Integrated Disease Surveillance Programme (IDSP) district surveillance officers (DSO) course

Outline of the session
1. Type of data
2. Central tendency

Epidemiological process
• We collect data
  - We use criteria and definitions
• We analyze data into information
  - "Data reduction / condensation"
• We interpret the information for decision making
  - What does the information mean to us?

Surveillance: A role of the public health system
The systematic process of collection, transmission, analysis and feedback of public health data for decision making
Data -> (Analysis) -> Information -> (Interpretation) -> Action
Today we will focus on DATA: The starting point

Surveillance data: A definition
• Set of related numbers
• Raw material for statistics
• Example:
  - Temperature of a patient over time
  - Date of onset of illness among patients

Types of data
• Qualitative data
  - No magnitude / size
  - Classified by counting the units that have the same attribute
  - Types: binary, nominal, ordinal
• Quantitative data

Qualitative, binary data
• The variable can only take two values
  - 1, 0 often used (or 1, 2)
  - Yes, No
• Example:
  - Sex: Male, Female
  - Female sex: Yes, No

Raw data, SEX (records 1 to 30, in order):
M, M, M, F, M, F, F, M, M, M, F, M, M, M, F, F, F, M, M, M, F, M, M, F, M, M, M, F, M, M

Frequency distribution for a qualitative binary variable
Sex        Frequency   Proportion
Female     10          33.3%
Male       20          66.7%
Total      30          100.0%

Using a pie chart to display a qualitative binary variable
[Figure: pie chart, "Distribution of cases by sex" (Female, Male)]

Qualitative, nominal data
• The variable can take more than two values
  - Any value
• The information fits into one of the categories
• The categories cannot be ranked
• Example:
  - Nationality
  - Language spoken
  - Blood group

Raw data, State (records 1 to 30, in order):
Punjab, Bihar, Rajasthan, Punjab, Bihar, Punjab, Bihar, Bihar, UP, Rajasthan, Bihar, Rajasthan, Punjab, UP, Rajasthan, UP, Punjab, UP, Rajasthan, Bihar, UP, Bihar, UP, Rajasthan, Bihar, Bihar, Bihar, UP, Bihar, UP

Frequency distribution for a qualitative nominal variable
State      Frequency   Proportion
Bihar      11          36.7%
UP         8           26.7%
Rajasthan  6           20.0%
Punjab     5           16.7%
Total      30          100.0%

Using a horizontal bar chart to display a qualitative nominal variable
[Figure: horizontal bar chart, "Distribution of cases by state" (bars: Bihar, UP, Rajasthan, Punjab; x-axis: frequency)]

Qualitative, ordinal data
• The variable can only take a limited number of values that can be ranked along some gradient
• Example:
  - Birth order: first, second, third …
  - Severity: mild, moderate, severe
  - Vaccination status: unvaccinated, partially vaccinated, fully vaccinated

Raw data, Status (records 1 to 30, in order):
1, 1, 2, 2, 1, 2, 1, 2, 3, 2, 1, 3, 1, 3, 1, 3, 1, 1, 3, 1, 1, 2, 1, 2, 2, 1, 2, 3, 2, 2
Clinical status: 1: Mild; 2: Moderate; 3: Severe

Frequency distribution for a qualitative ordinal variable
Severity   Frequency   Proportion
Mild       13          43.3%
Moderate   11          36.7%
Severe     6           20.0%
Total      30          100.0%

Using a vertical bar chart to display a qualitative ordinal variable
[Figure: vertical bar chart, "Distribution of cases by severity" (bars: Mild, Moderate, Severe; y-axis: frequency)]

Key issues
• Qualitative data
• Quantitative data
  - We are not simply counting
  - We are also measuring
  - Types: discrete, continuous

Quantitative, discrete data
• Values are distinct and separated
• Normally, values have no decimals
• Example:
  - Number of sexual partners
  - Parity
  - Number of persons who died from measles

Raw data, CHILDREN (records 1 to 30, in order):
1, 2, 5, 6, 3, 4, 1, 1, 2, 3, 1, 2, 7, 3, 4, 2, 1, 1, 1, 1, 2, 3, 1, 4, 2, 1, 6, 4, 3, 1

Frequency distribution for a quantitative discrete variable
Children   Frequency   Proportion
1          11          36.7%
2          6           20.0%
3          5           16.7%
4          4           13.3%
5          1           3.3%
6          2           6.7%
7          1           3.3%
Total      30          100.0%

Using a histogram to display a discrete quantitative variable
[Figure: histogram, "Distribution of households by number of children" (x-axis: number of children; y-axis: frequency)]

Quantitative, continuous data
Continuous variable
• Can assume a continuous, uninterrupted range of values
• Values may have decimals
• Example:
  - Weight
  - Height
  - Hb level
  - What about temperature?

Raw data, WEIGHT (records 1 to 30, in order):
10.5, 23.7, 21.8, 33.1, 38.0, 34.5, 38.5, 38.4, 30.1, 34.7, 37.9, 38.0, 39.2, 30.1, 43.2, 45.7, 40.4, 56.4, 55.1, 55.4, 66.7, 82.9, 109.7, 120.2, 10.4, 10.8, 25.5, 20.2, 27.3, 38.7

Frequency distribution for a continuous quantitative variable: The tally mark
Weight     Tally mark        Frequency
10-19      III               3
20-29      IIIII             5
30-39      IIIII IIIII II    12
40-49      III               3
50-59      III               3
60-69      I                 1
70-79      -                 0
80-89      I                 1
90-99      -                 0
100-109    I                 1
110-119    -                 0
120-129    I                 1

Frequency distribution for a continuous quantitative variable, after aggregation
Weight     Frequency   Proportion
10-19      3           10.0%
20-29      5           16.7%
30-39      12          40.0%
40-49      3           10.0%
50-59      3           10.0%
60-69      1           3.3%
70-79      0           0.0%
80-89      1           3.3%
90-99      0           0.0%
100-109    1           3.3%
110-119    0           0.0%
120-129    1           3.3%
Total      30          100.0%

Using a histogram to display a frequency distribution for a continuous quantitative variable, after aggregation
[Figure: histogram, "Distribution of cases by weight" (x-axis: weight categories; y-axis: frequency)]

Summary statistics
• A single value that summarizes the observed values of a variable
  - Part of the data reduction process
• Two types:
  - Measures of location / central tendency / average
  - Measures of dispersion / variability / spread
• Describe the shape of the distribution of a set of observations
• Necessary for precise and efficient comparisons of different sets of data
  - The location (average) and shape (variability) of different distributions may be different

Describing a distribution
[Figure: histogram annotated with "Position" and "Dispersion"]
Same location, different variability
[Figure: distribution curves for Population A and Population B with the same location but different variability (x-axis: Factor X; y-axis: No. of people)]

Different location, same variability
[Figure: distribution curves for Population A and Population B with different locations but the same variability (x-axis: Factor Y; y-axis: No. of people)]

Measures of central tendency
• Mode
• Median
• Arithmetic mean

The mode
• Definition
  - The mode of a distribution is the value that is observed most frequently in a given set of data
• How to obtain it?
  - Arrange the data in sequence from low to high
  - Count the number of times each value occurs
  - The most frequently occurring value is the mode

[Figure: "The mode", a distribution with the mode marked at its peak]

Examples of mode
• Annual salary (in 10,000 rupees): 4, 3, 3, 2, 3, 8, 4, 3, 7, 2
• Arranging the values in order: 2, 2, 3, 3, 3, 3, 4, 4, 7, 8
• The mode is 3 (observed four times)

Specific features of the mode
• There may be no mode
  - When each value is unique
• There may be more than one mode
  - When more than one peak occurs
  - Bimodal distribution
• The mode is not amenable to statistical tests
• The mode is not based on all the observations

The median
• The median is, literally, the middle value of the data
• It is defined as the value above and below which half (50%) of the observations fall

Computing the median
• Arrange the observations in order from smallest to largest (ascending order) or vice versa
• Count the number of observations "n"
  - If "n" is an odd number: Median = value of the ((n+1)/2)th observation (middle value)
  - If "n" is an even number: Median = the average of the (n/2)th and ((n/2)+1)th observations (average of the two middle numbers)

Example of median calculation
• What is the median of the following values: 10, 20, 12, 3, 18, 16, 14, 25, 2
  - Arrange the numbers in increasing order: 2, 3, 10, 12, 14, 16, 18, 20, 25
  - Median = 14
• Suppose there is one more observation (8)
  - 2, 3, 8, 10, 12, 14, 16, 18, 20, 25
  - Median = mean of 12 and 14 = 13

Advantages and disadvantages of the median
• Advantages
  - The median is unaffected by extreme values
• Disadvantages
  - The median does not contain information on the other values of the distribution
    - It is selected only by its rank
    - You can change 50% of the values without affecting the median
  - The median is less amenable to statistical tests

The median is not sensitive to extreme values
[Figure: two histograms with the same median (x-axis: class of the variable)]

Mean (Arithmetic mean / Average)
• Most commonly used measure of location
• Definition
  - Calculated by adding all observed values and dividing by the total number of observations
• Notations
  - Each observation is denoted as x1, x2, … xn
  - The total number of observations: n
  - Summation process: Sigma ($\Sigma$)
  - The mean: $\bar{X} = \frac{\sum x_i}{n}$

Computation of the mean
• Duration of stay in days in a hospital: 8, 25, 7, 5, 8, 3, 10, 12, 9
  - 9 observations (n = 9)
  - Sum of all observations = 87
  - Mean duration of stay = 87 / 9 = 9.67
• Incubation period in days of a disease: 8, 45, 7, 5, 8, 3, 10, 12, 9
  - 9 observations (n = 9)
  - Sum of all observations = 107
  - Mean incubation period = 107 / 9 = 11.89

Advantages and disadvantages of the mean
• Advantages
  - Has a lot of good theoretical properties
  - Used as the basis of many statistical tests
  - Good summary statistic for a symmetrical distribution
• Disadvantages
  - Less useful for an asymmetric distribution
    - Can be distorted by outliers, therefore giving a less "typical" value

[Figure: asymmetric distribution with the mode (13.5), median (10) and mean (10.8) marked]

Ideal characteristics of a measure of central tendency
• Easy to understand
• Simple to compute
• Not unduly affected by extreme values
• Rigidly defined
  - Clear guidelines for calculation
• Capable of further mathematical treatment
• Sample stability
  - Different samples generate the same measure

What measure of location to use?
• Consider the duration (days) of absence from work of 21 labourers owing to sickness
  1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 7, 8, 9, 10, 10, 59, 80
• Mean = 11 days
  - Not typical of the series, as 19 of the 21 labourers were absent for less than 11 days
  - Distorted by extreme values
• Median = 5 days
  - Better measure

Type of data: Summary
           Qualitative                      Quantitative
Binary     Nominal      Ordinal      Discrete    Continuous
Sex        State        Status       Children    Weight
M          Bihar        Mild         1           56.4
M          Punjab       Moderate     1           47.8
F          Bihar        Severe       2           59.9
M          Punjab       Mild         3           13.1
F          UP           Moderate     1           25.7
F          Bihar        Mild         1           23.0
M          UP           Moderate     2           30.0
M          Rajasthan    Severe       3           13.7
F          Punjab       Severe       2           15.4
M          Rajasthan    Mild         2           52.5
F          Bihar        Moderate     1           26.6
F          UP           Moderate     1           38.2
M          Rajasthan    Mild         1           59.0
M          Bihar        Severe       2           57.9
M          Punjab       Severe       2           19.6
F          Punjab       Moderate     3           31.7
M          Rajasthan    Mild         2           15.1
F          UP           Mild         3           33.9
M          Bihar        Mild         1           45.6

Definitions of measures of central tendency
• Mode
  - The most frequently occurring observation
• Median
  - The mid-point of a set of ordered observations
• Arithmetic mean
  - Aggregate / sum of the given observations divided by the number of observations
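
The three definitions can be checked against the worked examples in this session with a short script. This is only a minimal sketch, assuming Python 3.8 or later and its standard `statistics` and `collections` modules; the helper `frequency_table` and the variable names are illustrative and are not part of the course material.

```python
from collections import Counter
from statistics import mean, median, multimode


def frequency_table(values):
    """Return (value, count, percent) rows, most frequent value first."""
    counts = Counter(values)
    total = len(values)
    return [(v, c, round(100 * c / total, 1)) for v, c in counts.most_common()]


# Data copied from the examples in this session.
salaries = [4, 3, 3, 2, 3, 8, 4, 3, 7, 2]       # annual salary, in 10,000 rupees
stay_days = [8, 25, 7, 5, 8, 3, 10, 12, 9]      # duration of hospital stay
absence_days = [1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5,
                6, 6, 6, 7, 8, 9, 10, 10, 59, 80]  # absence of 21 labourers

# Mode: multimode returns a list because a data set may have several modes.
salary_modes = multimode(salaries)

# Mean: sum of the observations divided by their number.
stay_mean = round(mean(stay_days), 2)
absence_mean = round(mean(absence_days), 1)

# Median: middle value of the ordered observations (sorting is done internally).
absence_median = median(absence_days)

print("Salary frequency table:", frequency_table(salaries))
print("Mode of salaries:", salary_modes)
print("Mean stay (days):", stay_mean)
print("Absence: mean =", absence_mean, "days; median =", absence_median, "days")
```

On the absence data the two extreme values (59 and 80 days) pull the mean above 11 days while the median stays at 5 days, which is the contrast the last example draws.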