Data and central tendency
Integrated Disease Surveillance Programme (IDSP) district surveillance officers (DSO) course

Outline of the session
1. Type of data
2. Central tendency

Epidemiological process
• We collect data
  - We use criteria and definitions
• We analyze data into information
  - “Data reduction / condensation”
• We interpret the information for decision making
  - What does the information mean to us?

Surveillance: A role of the public health system
The systematic process of collection, transmission, analysis and feedback of public health data for decision making
Data -> (analysis) -> Information -> (interpretation) -> Action
Today we will focus on DATA: The starting point

Surveillance Data: A definition
• Set of related numbers
• Raw material for statistics
• Example:
  - Temperature of a patient over time
  - Dates of onset of illness in patients

Types of data
• Qualitative data
  - No magnitude / size
  - Classified by counting the units that have the same attribute
  - Types
    • Binary
    • Nominal
    • Ordinal
• Quantitative data

Qualitative, binary data
• The variable can only take two values
  - 1,0 often used (or 1,2)
  - Yes, No
• Example:
  - Sex
    • Male, Female
  - Female sex
    • Yes, No

REC 1 to 30, SEX:
M M M F M F F M M M F M M M F F F M M M F M M F M M M F M M

Frequency distribution for a qualitative binary variable
Sex      Frequency   Proportion
Female   10          33.3%
Male     20          66.7%
Total    30          100.0%

Using a pie chart to display a qualitative binary variable
[Figure: pie chart, "Distribution of cases by sex"; slices: Female, Male]

Qualitative, nominal data
• The variable can take more than two values
  - Any value
• The information fits into one of the categories
• The categories cannot be ranked
• Example:
  - Nationality
  - Language spoken
  - Blood group

Rec 1 to 30, State:
Punjab Bihar Rajasthan Punjab Bihar Punjab Bihar Bihar UP Rajasthan Bihar Rajasthan Punjab UP Rajasthan UP Punjab UP Rajasthan
Bihar UP Bihar UP Rajasthan Bihar Bihar Bihar UP Bihar UP

Frequency distribution for a qualitative nominal variable
State       Frequency   Proportion
Bihar       11          36.7%
UP          8           26.7%
Rajasthan   6           20.0%
Punjab      5           16.7%
Total       30          100.0%

Using a horizontal bar chart to display a qualitative nominal variable
[Figure: horizontal bar chart, "Distribution of cases by state"; bars: Bihar, UP, Rajasthan, Punjab; x-axis: Frequency (0 to 15)]

Qualitative, ordinal data
• The variable can take a number of values that can be ranked along some gradient
• Example:
  - Birth order
    • First, second, third …
  - Severity
    • Mild, moderate, severe
  - Vaccination status
    • Unvaccinated, partially vaccinated, fully vaccinated

REC 1 to 30, Status (clinical status: 1: Mild; 2: Moderate; 3: Severe):
1 1 2 2 1 2 1 2 3 2 1 3 1 3 1 3 1 1 3 1 1 2 1 2 2 1 2 3 2 2

Frequency distribution for a qualitative ordinal variable
Severity   Frequency   Proportion
Mild       13          43.3%
Moderate   11          36.7%
Severe     6           20.0%
Total      30          100.0%

Using a vertical bar chart to display a qualitative ordinal variable
[Figure: vertical bar chart, "Distribution of cases by severity"; bars: Mild, Moderate, Severe; y-axis: Frequency (0 to 15)]

Key issues
• Qualitative data
• Quantitative data
  - We are not simply counting
  - We are also measuring
    • Discrete
    • Continuous

Quantitative, discrete data
• Values are distinct and separated
• Normally, values have no decimals
• Example:
  - Number of sexual partners
  - Parity
  - Number of persons who died from measles

REC 1 to 30, CHILDREN:
1 2 5 6 3 4 1 1 2 3 1 2 7 3 4 2 1 1 1 1 2 3 1 4 2 1 6 4 3 1

Frequency distribution for quantitative, discrete data
Children   Frequency   Proportion
1          11          36.7%
2          6           20.0%
3          5           16.7%
4          4           13.3%
5          1           3.3%
6          2           6.7%
7          1           3.3%
Total      30          100.0%

Using a histogram to display a discrete quantitative variable
[Figure: histogram, "Distribution of households by number of children"; x-axis: Number of children (1 to 7); y-axis: Frequency (0 to 12)]

Quantitative, continuous data
• Continuous variable
• Can assume a continuous, uninterrupted range of values
• Values may have decimals
• Example:
  - Weight
  - Height
  - Hb level
  - What about temperature?

REC 1 to 30, WEIGHT:
10.5 23.7 21.8 33.1 38.0 34.5 38.5 38.4 30.1 34.7 37.9 38.0 39.2 30.1 43.2 45.7 40.4 56.4 55.1 55.4 66.7 82.9 109.7 120.2 10.4 10.8 25.5 20.2 27.3 38.7

Frequency distribution for a continuous quantitative variable: The tally mark
Weight    Tally mark       Frequency
10-19     III              3
20-29     IIIII            5
30-39     IIIII IIIII II   12
40-49     III              3
50-59     III              3
60-69     I                1
70-79     -                0
80-89     I                1
90-99     -                0
100-109   I                1
110-119   I                1

Frequency distribution for a continuous quantitative variable, after aggregation
Weight    Frequency   Proportion
10-19     3           10.0%
20-29     5           16.7%
30-39     12          40.0%
40-49     3           10.0%
50-59     3           10.0%
60-69     1           3.3%
70-79     0           0.0%
80-89     1           3.3%
90-99     0           0.0%
100-109   1           3.3%
110-119   1           3.3%
Total     30          100.0%

Using a histogram to display a frequency distribution for a continuous quantitative variable, after aggregation
[Figure: histogram, "Distribution of cases by weight"; x-axis: Weight categories (0-9 to 110-119); y-axis: Frequency (0 to 14)]

Summary statistics
• A single value that summarizes the observed values of a variable
  - Part of the data reduction process
• Two types:
  - Measures of location/central tendency/average
  - Measures of dispersion/variability/spread
• Describe the shape of the distribution of a set of observations
• Necessary for precise and efficient comparisons of different sets of data
  - The location (average) and shape (variability) of different distributions may be different

Describing a distribution
[Figure: histogram annotated with "Position" and "Dispersion"]

Same location, different variability
[Figure: distributions of Population A and Population B along Factor X; y-axis: No. of People; same location, different variability]

Different location, same variability
[Figure: distributions of Population A and Population B along Factor Y; y-axis: No. of People; different locations, same variability]

Measures of central tendency
• Mode
• Median
• Arithmetic mean

The mode
• Definition
  - The mode of a distribution is the value that is observed most frequently in a given set of data
• How to obtain it?
  - Arrange the data in sequence from low to high
  - Count the number of times each value occurs
  - The most frequently occurring value is the mode

The mode
[Figure: frequency distribution with the peak labelled "Mode"; y-axis: 0 to 20]

Example of mode: annual salary (in 10,000 rupees)
• 4, 3, 3, 2, 3, 8, 4, 3, 7, 2
• Arranging the values in order:
  - 2, 2, 3, 3, 3, 3, 4, 4, 7, 8
  - The mode is 3 (it occurs four times)

Specific features of the mode
• There may be no mode
  - When each value is unique
• There may be more than one mode
  - When more than 1 peak occurs
  - Bimodal distribution
• The mode is not amenable to statistical tests
• The mode is not based on all the observations

The median
• The median describes literally the middle value of the data
• It is defined as the value above and below which half (50%) of the observations fall

Computing the median
• Arrange the observations in order from smallest to largest (ascending order) or vice versa
• Count the number of observations “n”
  - If “n” is an odd number
    • Median = value of the ((n+1)/2)th observation (middle value)
  - If “n” is an even number
    • Median = the average of the (n/2)th and ((n/2)+1)th observations (average of the two middle numbers)

Example of median calculation
• What is the median of the following values:
  - 10, 20, 12, 3, 18, 16, 14, 25, 2
  - Arrange the numbers in increasing order
    • 2, 3, 10, 12, 14, 16, 18, 20, 25
    • Median = 14
• Suppose there is one more observation (8)
  - 2, 3, 8, 10, 12, 14, 16, 18, 20, 25
  - Median = Mean of 12 & 14 = 13

Advantages and disadvantages of the median
• Advantages
  - The median is unaffected by extreme values
• Disadvantages
  - The median does not contain information on the other values of the distribution
    • Only selected by its rank
    • You can change 50% of the values without affecting the median
  - The median is less amenable to statistical tests

The median is not sensitive to extreme values
[Figure: two histograms by class of the variable, with and without extreme values, showing the same median; y-axis: 0 to 14]

Mean (Arithmetic mean / Average)
• Most commonly used measure of location
• Definition
  - Calculated by adding all observed values and dividing by the total number of observations
• Notations
  - Each observation is denoted as $x_1, x_2, \ldots, x_n$
  - The total number of observations: $n$
  - Summation process = Sigma: $\sum$
  - The mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Computation of the mean
• Duration of stay in days in a hospital
  - 8, 25, 7, 5, 8, 3, 10, 12, 9
    • 9 observations (n=9)
    • Sum of all observations = 87
    • Mean duration of stay = 87 / 9 = 9.67
• Incubation period in days of a disease
  - 8, 45, 7, 5, 8, 3, 10, 12, 9
    • 9 observations (n=9)
    • Sum of all observations = 107
    • Mean incubation period = 107 / 9 = 11.89

Advantages and disadvantages of the mean
• Advantages
  - Has a lot of good theoretical properties
  - Used as the basis of many statistical tests
  - Good summary statistic for a symmetrical distribution
• Disadvantages
  - Less useful for an asymmetric distribution
    • Can be distorted by outliers, therefore giving a less “typical” value

[Figure: asymmetric distribution over values 0 to 19, annotated with Mode = 13.5, Mean = 10.8, Median = 10]

Ideal characteristics of a measure of central tendency
• Easy to understand
• Simple to compute
• Not unduly affected by extreme values
• Rigidly defined
  - Clear guidelines for calculation
• Capable of further mathematical treatment
• Sample stability
  - Different samples generate the same measure

What measure of location to use?
• Consider the duration (days) of absence from work of 21 labourers owing to sickness
  - 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 7, 8, 9, 10, 10, 59, 80
• Mean = 11 days
  - Not typical of the series as 19 of the 21 labourers were absent for less than 11 days
  - Distorted by extreme values
• Median = 5 days
  - Better measure

Type of data: Summary
Qualitative                      Quantitative
Binary   Nominal     Ordinal     Discrete   Continuous
Sex      State       Status      Children   Weight
M        Bihar       Mild        1          56.4
M        Punjab      Moderate    1          47.8
F        Bihar       Severe      2          59.9
M        Punjab      Mild        3          13.1
F        UP          Moderate    1          25.7
F        Bihar       Mild        1          23.0
M        UP          Moderate    2          30.0
M        Rajasthan   Severe      3          13.7
F        Punjab      Severe      2          15.4
M        Rajasthan   Mild        2          52.5
F        Bihar       Moderate    1          26.6
F        UP          Moderate    1          38.2
M        Rajasthan   Mild        1          59.0
M        Bihar       Severe      2          57.9
M        Punjab      Severe      2          19.6
F        Punjab      Moderate    3          31.7
M        Rajasthan   Mild        2          15.1
F        UP          Mild        3          33.9
M        Bihar       Mild        1          45.6

Definitions of measures of central tendency
• Mode
  - The most frequently occurring observation
• Median
  - The mid-point of a set of ordered observations
• Arithmetic mean
  - Aggregate / sum of the given observations divided by the number of observations
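
The worked examples in this session can be checked with a few lines of Python. The sketch below is only an illustration and is not part of the course material: it assumes Python 3.8 or later with the standard library, and the helper name `frequency_table` and all variable names are made up for this example. The data values are the ones used on the slides.

```python
from collections import Counter
from statistics import mean, median, multimode


def frequency_table(values):
    """Return (value, count, percent) rows, most frequent value first.

    Hypothetical helper for illustration; percent is rounded to 1 decimal.
    """
    counts = Counter(values)
    total = len(values)
    return [(v, c, round(100 * c / total, 1)) for v, c in counts.most_common()]


# Qualitative binary variable: sex of the 30 records.
sex = "M M M F M F F M M M F M M M F F F M M M F M M F M M M F M M".split()

# Mode example: annual salary (in 10,000 rupees).
salaries = [4, 3, 3, 2, 3, 8, 4, 3, 7, 2]

# Median example: 9 values, then the same values plus one more observation (8).
odd_sample = [10, 20, 12, 3, 18, 16, 14, 25, 2]
even_sample = odd_sample + [8]

# Mean example: duration of stay in days in a hospital.
stay_days = [8, 25, 7, 5, 8, 3, 10, 12, 9]

# Choosing a measure: days of absence from work of 21 labourers.
absence_days = [1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5,
                6, 6, 6, 7, 8, 9, 10, 10, 59, 80]

sex_table = frequency_table(sex)
salary_modes = multimode(salaries)      # list of all most frequent values
median_odd = median(odd_sample)         # middle value when n is odd
median_even = median(even_sample)       # average of the two middle values
mean_stay = round(mean(stay_days), 2)   # sum of values divided by n
mean_absence = mean(absence_days)       # pulled upward by 59 and 80
median_absence = median(absence_days)   # unaffected by the extreme values

print("Sex:", sex_table)
print("Mode of salaries:", salary_modes)
print("Medians:", median_odd, median_even)
print("Mean stay:", mean_stay)
print("Absence mean and median:", round(mean_absence, 1), median_absence)
```

Run as written, this reproduces the figures from the slides: 20 males (66.7%) and 10 females (33.3%), a mode of 3, medians of 14 and 13, a mean stay of 9.67 days, and for the absence data a mean of about 11 days against a median of 5 days. `multimode` is used rather than `mode` because it returns every most frequent value, which matches the point that a distribution may have more than one mode.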