Descriptive Statistics
F. Farrokhyar, MPhil, PhD, PDoc

Learning objectives
• To recognize different types of variables
• To learn how to appropriately explore your data
  - How to display data using graphs
  - How to display data with numbers and tables
• To learn about measures of central tendency
• To learn about the measures of variation

Descriptive and inferential statistics
• Descriptive statistics help us with the presentation, organization, and summarization of data.
• Inferential statistics allow us to make inferences from a sample of individuals to a larger population.

What is data?
• Data is a set of information or observations about a group of individuals or subjects.
• This information is organized in the form of variables.
• A variable is any characteristic of a person or a subject that can be measured or categorized.
• Its value varies from individual to individual.

Type of variables
• Qualitative or attribute variable: nonnumeric
  - gender (male, female), type of injury (blunt, fall, burn, etc.)
• Quantitative variable: numeric
  - Discrete variable: can assume only whole numbers
    (no. of accidents, no. of injuries, no. of positive nodes)
  - Continuous variable: may take any value within a defined range
    (weight, age, blood pressure, level of cholesterol)

Level of measurement
• There are four levels of measurement:
  - Nominal
  - Ordinal
  - Interval
  - Ratio

Level of measurement … cont'd
• Nominal variable: consists of named categories with no order among the categories.
  - binomial: gender, mortality
  - multinomial: type of injury, blood type
• Ordinal variable: consists of ordered categories, where the differences between categories cannot be considered to be equal.
  - Tumour stage: 1, 2, 3, 4
• Interval variable: has equal distances between values with no meaningful 'zero' value.
  - IQ test
  - Temperature (0 °C does not represent absence of temperature)
• Ratio variable: has equal intervals between values and a meaningful zero point.
  The ratio between two values makes sense.
  - height, weight, laboratory test values
• Likert scale (a further ordinal example): excellent, very good, good, fair, poor

Level of measurement …
  Variable type    Assumptions
  Nominal          Named categories
  Ordinal          Same as nominal plus ordered categories
  Interval         Same as ordinal plus equal intervals
  Ratio            Same as interval plus meaningful zero

Type of variables
• Dependent variable
  - Is the outcome of interest, which changes in response to some intervention or exposure.
    (mortality, survival, post-op pain, quality of life)
• Independent variable
  - Is the explanatory variable that explains the changes in the dependent variable.
    (demographics: age, gender, height; risk factors: diabetes, BP)
  - Is the intervention or exposure variable that causes the changes in the dependent variable.
    (drug, surgery, radiation, smoking …)

[Figure: example data set with its variables labelled. Independent (explanatory) variables: Age, Sex, Pre-op pain, Severity. Independent (comparison) variable. Dependent/outcome variables: Changes in pain, Complication.]

Describing categorical data
• Graphs
  - Bar charts
  - Pie charts

Bar charts
• Used to display nominal or ordinal data.
• A bar chart is a series of separated bars.
• Bars represent the frequency (counts) or relative frequency (percent or proportion) of each category.
• Can be used to display data for more than one group.

[Figure: bar chart examples]

Pie charts
• Used for nominal and ordinal data.
• Used to display a relative frequency distribution.
• The circle is divided proportionally using the relative frequency of each category.
• A pie chart is useful for showing data for one group, but it is of no use for illustrating two or more groups.
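The frequencies and relative frequencies that bar and pie charts display can be tabulated in a few lines. The following is a minimal sketch, not part of the original slides: the `injuries` list is hypothetical illustrative data (the category names are borrowed from the slides' injury-type example), and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical injury types for 10 patients (illustrative data only).
injuries = ["blunt", "fall", "burn", "fall", "blunt",
            "fall", "burn", "fall", "blunt", "fall"]


def frequency_table(values):
    """Map each category to (count, percent of all observations)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: (n, 100 * n / total) for category, n in counts.items()}


table = frequency_table(injuries)
for category, (n, percent) in table.items():
    # One row per category: the heights of the bars or the sizes of the slices.
    print(f"{category:<6} {n:>3} {percent:5.1f}%")
```

The counts would set the bar heights of a bar chart, and the percentages would set the slice sizes of a pie chart.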
[Figure: pie chart examples]

Describing categorical data
• Numerically
  - Frequencies (counts)
  - Relative frequencies (%)

Cross-tabulation of categorical data

                 Type of surgery
                 Open        Laparoscopic   Total
  Severity
    mild         4 (27%)     3 (20%)        7 (23%)
    moderate     6 (40%)     7 (47%)        13 (43%)
    severe       5 (33%)     5 (33%)        10 (33%)
  Sex
    male         7 (47%)     4 (27%)        11 (37%)
    female       8 (53%)     11 (73%)       19 (63%)

Describing quantitative data
• Graphs
  - Histograms
  - The five-number summary, shown as a box plot

Histograms …
• Used for interval and ratio data.
• A histogram is a graph in which each bar spans a range of values on the horizontal axis (the interval width), and the vertical axis represents the frequency of each interval.
• There are no spaces between bars.
• The frequencies are represented by the height and area of each bar.
• A histogram is useful for graphic illustration of one group.

[Figure: histogram]

Box plot: 5-number summary
• Used for interval and ratio data.

[Figure: box plot labelled, from top to bottom: Maximum (100th), Q3, Median (Q2), Q1, Minimum (1st)]

Box plots …
• Uses the five-number summary measures: median, Q1, Q3, minimum and maximum.
• It is useful in detecting outliers.
• It is useful to illustrate the distribution of more than one group.

[Figure: box plot of change in pain score]

Scatter plot
• Used to display the relationship between two continuous variables.

Describing quantitative data
• Numbers
  - Measures of central tendency: mode, median, mean
  - Measures of spread: range, interquartile range, variance, standard deviation

Mode (measures of central tendency)
• The mode is the most frequent value, the highest peak of the distribution.
• Used for nominal, ordinal, interval and ratio data.
• There could be more than one mode.
• Example: pain scores 1, 4, 6, 8, 5, 6, 3, 2, 15
  Sorted: 1, 2, 3, 4, 5, 6, 6, 8, 15, so the mode is 6.

Median (measures of central tendency)
• The median is the midpoint of the values after arranging the observations in order of size, from smallest to largest.
• There is a unique median for each dataset.
• Used for ordinal, interval and ratio data.
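The mode and median of the pain-score example can be checked with Python's standard `statistics` module. This is a minimal sketch rather than anything prescribed by the slides; the variable names are illustrative, and the second list is a made-up example showing that a dataset can have more than one mode.

```python
import statistics

# Pain scores from the slides' example, in their original (unsorted) order.
scores = [1, 4, 6, 8, 5, 6, 3, 2, 15]

# Arrange the observations from smallest to largest.
ordered = sorted(scores)

# multimode returns every most-frequent value, so ties are not hidden.
modes = statistics.multimode(scores)

# With 9 observations the median is the 5th value of the ordered list.
median = statistics.median(scores)

print("ordered:", ordered)
print("mode(s):", modes)
print("median:", median)

# Made-up example with two modes (1 and 2 each occur twice).
print("two modes:", statistics.multimode([1, 1, 2, 2, 3]))
```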
Median: properties
• It is resistant (insensitive) toward extreme values.
• It is useful for summarising skewed data.
• Example: 1, 2, 3, 4, 5, 6, 6, 8, 15, so the median is 5.

Mean (measures of central tendency)
• The mean is the sum of the sample values divided by the number of sample values, n.
• It is useful for interval and ratio data.
• It may not necessarily be equal to one of the sample values.
• Example: 1, 2, 3, 4, 5, 6, 6, 8, 15

  $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1 + 2 + 3 + 4 + 5 + 6 + 6 + 8 + 15}{9} = \frac{50}{9} \approx 5.6$

Measures of central tendency …

[Figure: a normal curve and a skewed curve, each marked with the positions of the mean, median and mode]

Properties of mean …
• There is a unique mean for each dataset.
• All values are included in the computation.
• It is the only measure of central tendency where the sum of the deviations of each value from the mean will always be zero:

  $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$

• The mean is sensitive toward extreme values.

Measures of spread
• Range
• Interquartile range
• Variance
• Standard deviation

Range
• Used mainly for interval or ratio data.
• The range is the difference between the largest and smallest values in a dataset.
• Properties
  - It uses only two values in its calculation.
  - It is affected by extreme values.
  - It is easy to understand.
• Example: 1, 2, 3, 4, 5, 6, 6, 8, 15, so the range is 15 - 1 = 14.

Interquartile range
• Used mainly for interval and ratio data.
• It is the distance between the third quartile (Q3) and the first quartile (Q1).
• Interquartile range = Q3 - Q1
• It is resistant (insensitive) to extreme values.
• It is useful for summarising skewed interval and ratio data.
• To compute it:
  - Arrange the observations from smallest to largest.
  - Divide them into 4 equal parts.
• Example: 1, 2, 3, 4, 5, 6, 6, 8, 15
  - 1st quartile (Q1) = (2 + 3) / 2 = 2.5
  - Median (Q2) = 5
  - 3rd quartile (Q3) = (6 + 8) / 2 = 7
  - Interquartile range = 7 - 2.5 = 4.5

1.5 × IQR criterion for outliers
• The interquartile range (IQR) is the distance between the first and third quartiles: IQR = Q3 - Q1.
• It is used to locate outliers.
• From the data: Q1 = 59 yrs, Q3 = 70 yrs

What are outliers?
• Outliers are extreme data values that fall outside the distribution of the data set.
• IQR = 70 - 59 = 11
• 1.5 × IQR = 1.5 × 11 = 16.5
• Lower limit: Q1 - 1.5 × IQR = 59 - 16.5 = 42.5
• Upper limit: Q3 + 1.5 × IQR = 70 + 16.5 = 86.5
• Outliers are values < 42.5 or > 86.5.
• From the data: Min = 44 and Max = 82, so no observation falls outside these limits.

[Figure: box plot (5-number summary) of the data, with the minimum at 44 and the maximum at 82]

Variance
• Population variance:

  $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

• Sample variance:

  $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

• Here, the df is n - 1 rather than n because we lose 1 df by estimating the sample mean.
• Degrees of freedom measure the amount of information available in the data that can be used to estimate $\sigma^2$.
• Properties
  - All values are used in the calculation.
  - Used for interval or ratio data.
  - It is the average of the squared deviations from the mean.
  - The units are not the same as those of the data; they are the square of the original units.
• Example: 1, 2, 3, 4, 5, 6, 6, 8, 15, with $\bar{x} = 50/9 \approx 5.6$

  $s^2 = \frac{(1 - 5.6)^2 + (2 - 5.6)^2 + (3 - 5.6)^2 + \dots + (15 - 5.6)^2}{9 - 1} \approx 17.3$

Standard deviation
• It is the square root of the variance:

  $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \approx 4.2$

• It measures the typical deviation from the mean, in the same units as the data.

Uses of standard deviation …
• It is used for the Empirical Rule.
• For a symmetrical, bell-shaped (normal) distribution:
  - About 68% of the observations will lie within 1 s.d. of the mean.
  - About 95% of the observations will lie within 2 s.d. of the mean.
  - About 99.7% of the observations will lie within 3 s.d. of the mean.

[Figure: standard normal curve]

Summary of what we have learned …

  Data type            Graph                               Numerically
  Ratio and interval   Histogram, box plot, scatter plot   Mean with standard deviation; median with IQR, range; mode
  Ordinal              Bar chart, pie chart                Count and %; median; IQR, range; mode
  Nominal              Bar chart, pie chart                Count and %; mode

• We report:
  - Mean with standard deviation
  - Median with first and third quartiles
  - Median with minimum and maximum
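The numeric summaries listed above (mean with standard deviation, median with quartiles, minimum and maximum) and the 1.5 × IQR limits can be reproduced with Python's standard `statistics` module. This is a minimal sketch, not part of the original slides. The helper name `outlier_limits` is hypothetical, and using `statistics.quantiles` with its default "exclusive" method is an assumption; it happens to match the hand-computed quartiles for this dataset, but other software may use a different quartile convention.

```python
import statistics

# Pain scores from the worked example in the slides.
scores = [1, 4, 6, 8, 5, 6, 3, 2, 15]

mean = statistics.mean(scores)          # 50 / 9
variance = statistics.variance(scores)  # sample variance, divisor n - 1
sd = statistics.stdev(scores)           # square root of the sample variance

# Quartiles (assumption: default "exclusive" method, see the note above).
q1, median, q3 = statistics.quantiles(scores, n=4)


def outlier_limits(first_quartile, third_quartile, k=1.5):
    """Return the (lower, upper) limits of the k x IQR criterion."""
    step = k * (third_quartile - first_quartile)
    return first_quartile - step, third_quartile + step


print(f"mean (SD): {mean:.1f} ({sd:.1f})")          # mean (SD): 5.6 (4.2)
print(f"variance: {variance:.1f}")                  # variance: 17.3
print(f"median (Q1, Q3): {median} ({q1}, {q3})")
print(f"min, max: {min(scores)}, {max(scores)}")

# Age example from the slides: Q1 = 59 yrs, Q3 = 70 yrs, Min = 44, Max = 82.
lower, upper = outlier_limits(59, 70)
print(f"limits: {lower}, {upper}")
print("outliers present:", 44 < lower or 82 > upper)
```

For a skewed variable the median with quartiles is usually the more informative line to report, since the single value 15 pulls the mean and the standard deviation upward here.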