Measures of central tendency and dispersion
Tunis, 28th October 2014
Dr Ghada Abou Mrad
Ministry of Public Health, Lebanon
[email protected]

Learning objectives
• Define the different types of variables and data within a population or a sample
• Describe data using the common measures of central tendency (mode, median, arithmetic mean)
• Describe data in terms of their measures of dispersion (range, standard deviation/variance, standard error)

Variable
• A population is any complete group of units (such as persons or businesses) with at least one characteristic in common. It needs to be clearly identified at the beginning of a study.
• A sample is a subset of the units in a population, selected to represent all units in the population of interest.
• A variable is any characteristic, number, or quantity that can be measured or counted. It is called a variable because its value may vary in the population and over time; it is represented by “X” in a population and “x” in a sample.

Data
• Data are the measurements, observations, or values that are collected for a specific variable in a population or a sample; an observation can be represented by “Xi” in a population and “xi” in a sample.
– A data unit (or unit record, or record) is one entity (such as a person or business) in the population being studied, for which data are collected.
– A data item (or variable) is a characteristic (or attribute) of a data unit which is measured or counted, such as height.
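The distinction between a data unit and a data item can be sketched in a few lines of Python. This is only an illustration: the class name DataUnit is hypothetical, and the three records are the example values from the slide's table (age, sex, height; the slides do not state the unit of height).

```python
from dataclasses import dataclass


@dataclass
class DataUnit:
    """One record (data unit); each field is a data item (variable)."""
    age: int
    sex: str
    height: int  # unit not stated in the slides


# The three example records from the slide's table.
records = [
    DataUnit(age=20, sex="M", height=175),
    DataUnit(age=16, sex="F", height=163),
    DataUnit(age=23, sex="F", height=170),
]

# The observations x_i for the variable "age", and the sample size n.
ages = [record.age for record in records]
n = len(ages)

print("observations for age:", ages)
print("n =", n)
```

Each element of records is a data unit, and pulling one field out of every record gives the observations for a single variable.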
Example: each row is a data unit, each column is a data item, and each cell is an observation (Obs)

#    Age   Sex   Height
1    20    M     175
2    16    F     163
3    23    F     170

Dataset
• A dataset is a complete collection of all observations for a specific variable in a population or a sample; it is called a raw dataset if the data have not been organized. The total number of observations in a dataset can be represented by “N” for a population and “n” for a sample.
• Example: Ages of students in a class (years)
  27 30 28 31 28 36 29 37 29 34 30 30 27 30 28 31 32 30 29 29

Types of variables
Variable
– Qualitative: nominal, ordinal
– Quantitative: discrete, continuous

• Qualitative variable: has values that describe a 'quality'; it is also called a categorical variable
– Nominal: Observations can take a value that cannot be organized in a logical sequence, like sex or eye color
– Ordinal: Observations can take a value that can be logically ordered from lowest to highest, like clothing size (e.g. small, medium, large)
• The data collected for a qualitative variable are qualitative data

• Quantitative variable: has values that describe a measurable quantity; it is also called a numeric variable; its values can be ordered from lowest to highest
– Discrete: Observations can take a value based on a count from a set of values. They cannot take the value of a fraction between one value and the next closest value. Example: number of children in a family
– Continuous: Observations can take any value between a certain set of real numbers.
Example: height
• The data collected for a quantitative variable are quantitative data

Descriptive statistics
Statistics describe or summarize data
• Most data can be ordered from lowest to highest
• The frequency is the number of times an observation occurs for a variable; the frequency distribution can be shown in a table or in a graph such as a histogram
• Quantitative data can be described using the common measures of central tendency (mode, median, mean) and the measures of dispersion (range, standard deviation/variance, standard error)

Frequency distribution
Ordered observations (ages of the 20 students):
27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 30, 31, 31, 32, 34, 36, 37

Age     Frequency
27      2
28      3
29      4
30      5
31      2
32      1
33      0
34      1
35      0
36      1
37      1
Total   20

Histogram
[Figure: histogram of the age frequency distribution, ages 27 to 37]

Histogram - Outliers
Outliers are extreme or atypical data values that are notably different from the rest of the data.
[Figure: histogram of number of patients by nights of stay, axis 0 to 50]

Epidemic curve
[Figure: number of people by age group (0-9 to 90-99), annotated "Central Location?" and "Spread?"]

Measures of central tendency and spread
• Central Location / Position / Tendency: a single value that is a good summary of an entire distribution of data
• Spread / Dispersion / Variability: how much the distribution is spread or dispersed from its central location

Measure of Central Tendency
• Also known as measure of central position or location
• It is a single value that summarizes an entire distribution of data
• Common measures
– Mode
– Median
– Arithmetic mean

Mode
Mode is the value that occurs most frequently
Method for identification
1. Arrange data into a frequency distribution or histogram, showing the values of the variable and the frequency with which each value occurs
2. Identify the value that occurs most often

In the age example, 30 occurs most often (5 times out of 20 observations): Mode = 30
[Figure: histogram of the age frequency distribution with the mode (30) marked]
[Figure: two population histograms, titled "Unimodal Distribution" and "Bimodal Distribution"]

Mode – Properties / Uses
• Easiest measure to understand, explain, identify
• Always equals an original value
• Does not use all the data
• Insensitive to extreme values (outliers)
• May be more than one mode
• May be no mode

Median
Median is the middle value; it splits the distribution into two equal parts
– 50% of observations are below the median
– 50% of observations are above the median
Method for identification
1. Arrange observations in order
2. Find middle position as (n + 1) / 2
3. Identify the value in the middle

Median: odd number of values
Observations 1 to 19 of the ordered age data (27, 27, 28, ..., 34, 36)
n = 19
Middle position = (n + 1) / 2 = (19 + 1) / 2 = 20 / 2 = 10
Median age = value of the 10th observation = 30 years

Median: even number of values
All 20 observations of the ordered age data (27, 27, 28, ..., 36, 37)
n = 20
Middle position = (n + 1) / 2 = (20 + 1) / 2 = 21 / 2 = 10.5
Median age = average of the 10th and 11th observations = (30 + 30) / 2 = 30 years

Median – Properties / Uses
• Does not use all the data available
• Insensitive to extreme values (outliers)
• Measure of choice for skewed data

Arithmetic Mean
Arithmetic mean = “average” value = m
Method for identification
1. Sum up (Σ) all of the values (xi)
2. Divide the sum by the number of observations (n)

$m = \frac{\sum x_i}{n}$

For the 20 ordered age observations: n = 20, Σxi = 605, m = 605 / 20 = 30.25

Since the mean uses all data, it is sensitive to outliers
[Figure: two histograms of number of patients by nights of stay. Panel 1 (axis 0 to 50): Mean = 12.0. Panel 2 (axis 0 to 150): Mean = 15.3.]

When to use the arithmetic mean?
• Centered distribution
• Approximately symmetrical
• Few extreme values (outliers)

When to use the arithmetic mean? (ii)
[Figure: four example distributions numbered 1 to 4, with an "OK!" label]

Arithmetic Mean – Properties / Uses
• Uses all of the data
• Affected by extreme values (outliers)
• Best for normally distributed data
• Not usually equal to one of the original values

How does the shape of a distribution influence the Measures of Central Tendency?
• Symmetrical: Mode = Median = Mean
• Skewed right: Mode < Median < Mean
• Skewed left: Mean < Median < Mode

Epidemic curve
[Figure: number of people by age group (0-9 to 90-99), annotated "Central Location?" and "Spread?"]
Same center but … different dispersions

Measures of Spread
Measures that quantify the variation or dispersion of a set of data from its central location
• Also known as “Measure of dispersion / variation”
• Common measures
– Range
– Variance / standard deviation
– Standard error

Range
Range = Difference between largest and smallest values in a dataset
Properties / Uses:
– Greatly affected by outliers
– Usually used with median

Finding the Range of Length of Stay Data
0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49
[Figure: histogram of number of patients by nights of stay, axis 0 to 50]

Range – Sensitive to Outliers?
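One way to check how strongly a single extreme value moves the range and the mean, while leaving the median untouched, is the short Python sketch below. It uses the length-of-stay data listed above. The second dataset is an assumption: it replaces the largest stay (49 nights) with 149 nights, which is consistent with the Mean = 15.3 and Range = 149 shown on the slides' second histograms. The helper name value_range is hypothetical.

```python
from statistics import mean, median

# Length-of-stay data (nights) as listed on the slide.
stays = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11,
         12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# Assumption: same data with the largest value replaced by 149.
stays_outlier = stays[:-1] + [149]


def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


for label, data in [("original", stays), ("with outlier", stays_outlier)]:
    print(f"{label}: range = {value_range(data)}, "
          f"mean = {mean(data):.1f}, median = {median(data)}")
```

The range and the mean both change when the single extreme value changes, whereas the median stays at 10 nights in both datasets.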
[Figure: two histograms of number of patients by nights of stay. Panel 1 (axis 0 to 50): Range = 49 - 0 = 49. Panel 2 (axis 0 to 150): Range = 149 - 0 = 149.]

Variance and Standard Deviation
Measures of variation that quantify how closely clustered the observed values are around the mean; measures of the spread of the data around the mean
• Variance = average of squared deviations from the mean = Sum (each value – mean)² / (n – 1)
• Standard deviation = square root of variance

Variance and Standard Deviation (ii)

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

$\bar{x}$: mean; $x_i$: value; n: number of observations; s²: variance; s: standard deviation

Steps to Calculate Variance and Standard Deviation
1. Calculate the arithmetic mean: $\bar{x}$
2. Subtract the mean from each observation: $x_i - \bar{x}$
3. Square the difference: $(x_i - \bar{x})^2$
4. Sum the squared differences: $\sum (x_i - \bar{x})^2$
5. Divide the sum of the squared differences by n – 1: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
6. Take the square root of the variance: $s = \sqrt{s^2}$

Length of Stay Data (mean = 12, n = 30)
(0 – 12)² = 144     (2 – 12)² = 100     (3 – 12)² = 81      (4 – 12)² = 64      (5 – 12)² = 49
(5 – 12)² = 49      (6 – 12)² = 36      (7 – 12)² = 25      (8 – 12)² = 16      (9 – 12)² = 9
(9 – 12)² = 9       (9 – 12)² = 9       (10 – 12)² = 4      (10 – 12)² = 4      (10 – 12)² = 4
(10 – 12)² = 4      (10 – 12)² = 4      (11 – 12)² = 1      (12 – 12)² = 0      (12 – 12)² = 0
(12 – 12)² = 0      (13 – 12)² = 1      (14 – 12)² = 4      (16 – 12)² = 16     (18 – 12)² = 36
(18 – 12)² = 36     (19 – 12)² = 49     (22 – 12)² = 100    (27 – 12)² = 225    (49 – 12)² = 1369

Sum = 2448; Var = 2448 / 29 = 84.4; SD = √84.4 = 9.2

Standard Deviation
Standard deviation is usually calculated only when data are more or less normally distributed (bell-shaped curve)
For normally distributed data,
• 68.3% of the data fall within plus/minus 1 SD of the mean
• 95.4% of the data fall within plus/minus 2 SD of the mean
• 95.0% of the data fall within plus/minus 1.96 SD of the mean
• 99.7% of the data fall within plus/minus 3 SD of the mean
The standard deviation of a normal distribution enables the calculation of confidence intervals

Normal Distribution
[Figure: normal distribution curve centered on the mean, with the standard deviation marked; 68% of the area within 1 SD of the mean, 95% in the central area and 2.5% in each tail]

Properties of Measures of Central Location and Spread
• For quantitative / continuous variables
• Mode – simple, descriptive, not always useful
• Median – best for skewed data
• Arithmetic mean – best for normally distributed data
• Range – use with median
• Standard deviation – use with mean
• Standard error – used to construct confidence intervals

Name the appropriate measures of Central Location and Spread

Distribution                    Central Location   Spread
Single peak, symmetrical        Mean*              Standard deviation
Skewed or data with outliers    Median             Range or interquartile range

* Median and mode will be similar

Any questions?
[Figure: population by age, annotated with the median, mode, minimum, 1st quartile, 3rd quartile, interquartile interval, range, and maximum]

Thank you!
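As a recap, the measures covered in these slides can be reproduced for the two example datasets with Python's standard library. This is a sketch, not part of the original presentation: the variable names and the helper normal_coverage are hypothetical, and the normal-distribution shares are computed from the error function rather than taken from the slides.

```python
import math
from statistics import mean, median, multimode, variance

# Ages of the 20 students (raw dataset from the slides).
ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]

# Length-of-stay data in nights (from the slides).
stays = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11,
         12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# Measures of central tendency for the age data.
age_modes = multimode(ages)   # list of all most frequent values
age_median = median(ages)     # average of the two middle values when n is even
age_mean = mean(ages)         # sum of values divided by n

# Variance and standard deviation for the length-of-stay data,
# following the six steps listed on the slides.
x_bar = sum(stays) / len(stays)                    # step 1: mean
squared_devs = [(x - x_bar) ** 2 for x in stays]   # steps 2 and 3
sum_sq = sum(squared_devs)                         # step 4
var = sum_sq / (len(stays) - 1)                    # step 5: divide by n - 1
sd = math.sqrt(var)                                # step 6


def normal_coverage(k):
    """Share of a normal distribution within +/- k SD of the mean."""
    return math.erf(k / math.sqrt(2))


print("mode(s):", age_modes, "median:", age_median, "mean:", age_mean)
print(f"sum of squares = {sum_sq:.0f}, variance = {var:.1f}, SD = {sd:.1f}")
print("library variance agrees:", math.isclose(var, variance(stays)))
for k in (1, 1.96, 2, 3):
    print(f"within +/- {k} SD: {100 * normal_coverage(k):.2f}%")
```

The hand-computed variance divides by n - 1, which is also what statistics.variance does, so the two values agree.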