Descriptive Statistics
F. Farrokhyar, MPhil, PhD, PDoc

Learning objectives
- To recognize different types of variables
- To learn how to appropriately explore your data
  - How to display data using graphs
  - How to display data with numbers and tables
- To learn about measures of central tendency
- To learn about the measures of variation

Descriptive and inferential statistics
- Descriptive statistics help us with the presentation, organization, and summarization of data.
- Inferential statistics allow us to make inferences from a sample of individuals to a larger population.

What is data?
- Data is a set of information or observations about a group of individuals or subjects.
- This information is organized in the form of variables.
- A variable is any characteristic of a person or a subject that can be measured or categorized.
- Its value varies from individual to individual.

Type of variables
- Qualitative or attribute variable: nonnumeric
  - gender (male, female), type of injury (blunt, fall, burn, etc.)
- Quantitative variable: numeric
  - Discrete variable: can assume only whole numbers
    - no. of accidents, no. of injuries, no. of positive nodes
  - Continuous variable: may take any value within a defined range
    - weight, age, blood pressure, level of cholesterol

Level of measurement
- There are four levels of measurement:
  - Nominal
  - Ordinal
  - Interval
  - Ratio

Level of measurement … cont'd
- Nominal variable: consists of named categories with no order among the categories.
  - binomial: gender, mortality
  - multinomial: type of injury, blood type
- Interval variable: has equal distances between values with no meaningful 'zero' value.
  - IQ test
  - Temperature in °C (0 °C does not represent the absence of temperature)
- Ordinal variable: consists of ordered categories, where the differences between categories cannot be considered to be equal.
  - Tumour stage: 1, 2, 3, 4
  - Likert scale: excellent, very good, good, fair, poor
- Ratio variable: has equal intervals between values and a meaningful zero point, so the ratio between values makes sense.
  - height, weight, laboratory test values

Level of measurement …

Variable type   Assumptions
Nominal         Named categories
Ordinal         Same as nominal plus ordered categories
Interval        Same as ordinal plus equal intervals
Ratio           Same as interval plus meaningful zero

Type of variables
- Dependent variable: the outcome of interest, which changes in response to some intervention or exposure.
  - mortality, survival, post-op pain, quality of life
- Independent variable:
  - the explanatory variable that explains the changes in the dependent variable
    - demographics (age, gender, height), risk factors (diabetes, BP)
  - the intervention or exposure variable that causes the changes in the dependent variable
    - drug, surgery, radiation, smoking …

- Independent (explanatory) variables: age, sex, pre-op pain severity
- Independent (comparison) variable
- Dependent/outcome variables: changes in pain, complication

Describing categorical data
- Graphs
  - Bar charts
  - Pie charts

Bar charts
- Used to display nominal or ordinal data.
- A bar chart is a series of separated bars.
- Bars represent the frequency (count) or relative frequency (percent or proportion) of each category.
- Can be used to display data for more than one group.
[Figure: bar charts]

Pie charts
- Used for nominal and ordinal data.
- Used to display a relative frequency distribution.
- The circle is divided proportionally using the relative frequency of each category.
- A pie chart is useful for showing data for one group but is of little use for illustrating two or more groups.
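The counts and relative frequencies that bar and pie charts display can be tabulated in a few lines. The following is a minimal sketch in Python; the injury-type observations and the helper name frequency_table are illustrative assumptions, not data or code from the slides.

```python
from collections import Counter

# Hypothetical injury-type observations (illustrative, not from the slides).
injuries = ["blunt", "fall", "burn", "fall", "blunt", "fall", "burn", "fall"]


def frequency_table(values):
    """Return {category: (count, percent)} for a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}


table = frequency_table(injuries)

# Print one row per category: name, frequency, relative frequency.
for category, (count, percent) in sorted(table.items()):
    print(f"{category:<6} {count:>2} {percent:5.1f}%")
```

The counts give the bar heights of a frequency bar chart; the percentages give the slice sizes of a pie chart.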
[Figure: pie charts]

Describing categorical data
- Numerically
  - Frequencies (counts)
  - Relative frequencies (%)

Cross-tabulation of categorical data

                 Type of surgery
                 Open        Laparoscopic   Total
Severity
  mild           4 (27%)     3 (20%)        7 (23%)
  moderate       6 (40%)     7 (47%)        13 (43%)
  severe         5 (33%)     5 (33%)        10 (33%)
Sex
  male           7 (47%)     4 (27%)        11 (37%)
  female         8 (53%)     11 (73%)       19 (63%)

Describing quantitative data
- Graphs
  - Histograms
  - The five-number summary -> Boxplot

Histogram
- Used for interval and ratio data.
- A histogram is a graph in which each bar spans a range of values (the interval width) on the horizontal axis, and the vertical axis represents the frequency of each interval.
- There are no spaces between bars.
- The frequencies are represented by the height and area of each bar.
- A histogram is useful for graphic illustration of one group.
[Figure: histograms]

Box plot: 5-number summary
- Used for interval and ratio data.
- Uses the five-number summary measures: minimum, Q1, median (Q2), Q3 and maximum.
- It is useful in detecting outliers.
- It is useful to illustrate the distribution of more than one group.
[Figure: box plot marking the minimum (1st), Q1, median (Q2), Q3 and maximum (100th)]
[Figure: box plot of change in pain score]

Scatter plot
- Used to display the relationship between two continuous variables.

Describing quantitative data
- Numbers
  - Measures of central tendency: mode, median, mean
  - Measures of spread: range, interquartile range, variance, standard deviation

Mode – Measures of central tendency
- Mode is the most frequent value (the highest peak).
- Used for nominal, ordinal, interval and ratio data.
- There can be more than one mode.
- Example: pain scores 1, 4, 6, 8, 5, 6, 3, 2, 15; in order: 1, 2, 3, 4, 5, 6, 6, 8, 15, so the mode is 6.

Median – Measures of central tendency
- Median is the midpoint of the values after arranging the observations in order of size, from smallest to largest.
- There is a unique median for each dataset.
- Used for ordinal, interval and ratio data.
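The mode and median of the pain-score example can be checked with Python's standard statistics module. This is a minimal sketch; the variable names are illustrative.

```python
from statistics import median, multimode

# Pain scores from the mode example in the slides.
pain_scores = [1, 4, 6, 8, 5, 6, 3, 2, 15]

# Arrange the observations from smallest to largest.
ordered = sorted(pain_scores)

# multimode returns every most-frequent value, since a dataset
# can have more than one mode.
modes = multimode(pain_scores)

# With nine values, the median is the fifth value of the ordered list.
mid = median(pain_scores)

print("ordered:", ordered)
print("mode(s):", modes)
print("median:", mid)
```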
Median – properties
- It is resistant (insensitive) toward extreme values.
- It is useful for summarising skewed data.
- Example: 1, 2, 3, 4, 5, 6, 6, 8, 15; the median is 5.

Mean – Measures of central tendency
- Mean is the sum of the sample values divided by the number of sample values, n.
- It is useful for interval and ratio data.
- It may not necessarily be equal to one of the sample values.

  \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1 + 2 + 3 + 4 + 5 + 6 + 6 + 8 + 15}{9} = \frac{50}{9} \approx 5.56

Properties of mean …
- There is a unique mean for each dataset.
- All values are included in the computation.
- It is the only measure of central tendency where the sum of deviations of each value from the mean will always be zero:

  \sum_{i=1}^{n} (X_i - \bar{X}) = 0

- The mean is sensitive toward extreme values.

Measures of central tendency …
[Figure: positions of the mean, median and mode on a normal curve and on a skewed curve]

Measures of spread
- Range
- Interquartile range
- Variance
- Standard deviation

Range
- Used mainly for interval or ratio data.
- Range is the difference between the largest and smallest values in a dataset.
- Properties
  - It uses only two values in its calculation.
  - It is affected by extreme values.
  - It is easy to understand.
- Example: 1, 2, 3, 4, 5, 6, 6, 8, 15; range = 15 – 1 = 14

Interquartile range
- Used mainly for interval and ratio data.
- It is the distance between the third quartile (Q3) and the first quartile (Q1).
- Interquartile range = Q3 – Q1
- It is resistant (insensitive) to extreme values.
- It is useful for summarising skewed interval and ratio data.

Interquartile range: calculation
- Arrange the observations from smallest to largest.
- Divide into 4 equal parts.
- Example: 1, 2, 3, 4, 5, 6, 6, 8, 15
  - 1st quartile (Q1) = (2 + 3) / 2 = 2.5
  - Median (Q2) = 5
  - 3rd quartile (Q3) = (6 + 8) / 2 = 7
  - Interquartile range = 7 – 2.5 = 4.5

1.5 × IQR criterion for outliers
- Interquartile range (IQR) is the distance between the first and third quartiles: IQR = Q3 – Q1.
- It is used to locate outliers.
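The mean, range and quartile calculations for the example data can be reproduced as follows. This is a minimal sketch that assumes quartiles are taken as the medians of the lower and upper halves of the ordered data, leaving out the middle value when n is odd; that convention reproduces Q1 = 2.5 and Q3 = 7 from the worked example, but other quartile conventions exist and can give slightly different values.

```python
from statistics import mean, median

# Example data from the slides, arranged from smallest to largest.
data = sorted([1, 2, 3, 4, 5, 6, 6, 8, 15])
n = len(data)

# Mean: sum of the values divided by n (50 / 9 here).
x_bar = mean(data)

# The deviations from the mean sum to zero (up to floating-point rounding).
deviation_sum = sum(x - x_bar for x in data)

# Range: largest value minus smallest value.
data_range = data[-1] - data[0]

# Quartiles as medians of the lower and upper halves
# (assumed convention: drop the middle value when n is odd).
half = n // 2
lower_half = data[:half]
upper_half = data[half + 1:] if n % 2 else data[half:]
q1 = median(lower_half)
q3 = median(upper_half)
iqr = q3 - q1

print(f"mean = {x_bar:.2f}, range = {data_range}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```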
What are outliers?
- Outliers are extreme data values that fall outside the distribution of the rest of the data set.

1.5 × IQR criterion: example
- From data: Q1 = 59 yrs, Q3 = 70 yrs
- IQR = 70 – 59 = 11
- 1.5 × IQR = 1.5 × 11 = 16.5
- Q1 – 1.5 × IQR = 59 – 16.5 = 42.5
- Q3 + 1.5 × IQR = 70 + 16.5 = 86.5
- Outliers: values < 42.5 or > 86.5
- From data: min = 44 and max = 82, so neither extreme is an outlier.
[Figure: box plot with minimum 44, Q1, median (Q2), Q3 and maximum 82]

Variance
- Population variance (\mu is the population mean, N the population size):

  \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}

- Sample variance, s^2:

  s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}

- Here, the df is n – 1 rather than n because we lose 1 df by estimating the sample mean.
- Degrees of freedom measure the amount of information available in the data that can be used to estimate \sigma^2.
- Properties
  - All values are used in the calculation.
  - Used for interval or ratio data.
  - It is the average of the squared deviations from the mean.
  - The units are not the same as the data; they are the square of the original units.
- Example: 1, 2, 3, 4, 5, 6, 6, 8, 15 (mean = 50/9 \approx 5.56)

  s^2 = \frac{(1 - 5.56)^2 + (2 - 5.56)^2 + (3 - 5.56)^2 + \dots + (15 - 5.56)^2}{9 - 1} \approx 17.28

Standard deviation
- It is the square root of the variance.
- It measures the typical deviation from the mean, in the same units as the data.

  sd = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{17.28} \approx 4.16

Uses of standard deviation …
- It is used for the Empirical Rule.
- For any symmetrical, bell-shaped (approximately normal) distribution:
  - About 68% of the observations will lie within 1 s.d. of the mean.
  - About 95% of the observations will lie within 2 s.d. of the mean.
  - About 99.7% of the observations will lie within 3 s.d. of the mean.
[Figure: standard normal curve]

Summary of what we have learned …

Data type            Graph                               Numerically
Ratio and interval   Histogram, box plot, scatter plot   Mean with standard deviation; median with IQR, range; mode
Ordinal              Bar chart, pie chart                Count and %; median; IQR, range; mode
Nominal              Bar chart, pie chart                Count and %; mode

- We report:
  - Mean with standard deviation
  - Median with first and third quartiles
  - Median with minimum and maximum
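The sample variance, the standard deviation and the 1.5 × IQR outlier limits from this transcript can be checked numerically. This is a minimal sketch using Python's standard statistics module; the helper name outlier_fences is a hypothetical choice for illustration, and the quartiles 59 and 70 with extremes 44 and 82 are the values quoted in the outlier example.

```python
from statistics import stdev, variance

# Example data used for the variance and standard deviation slides.
data = [1, 2, 3, 4, 5, 6, 6, 8, 15]

# Sample variance divides the squared deviations by n - 1.
s2 = variance(data)

# Standard deviation is the square root of the sample variance.
sd = stdev(data)


def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles and extremes quoted in the outlier example (ages in years).
lower, upper = outlier_fences(59, 70)
minimum, maximum = 44, 82
has_outliers = minimum < lower or maximum > upper

print(f"variance = {s2:.2f}, sd = {sd:.2f}")
print(f"limits = ({lower}, {upper}), outliers present: {has_outliers}")
```

Using statistics.pvariance instead of variance would divide by N, which corresponds to the population formula.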