Descriptive Statistics
F. Farrokhyar, MPhil, PhD, PDoc
Department of Surgery
Department of Clinical Epidemiology and Biostatistics
March 18, 2009

Objectives
To understand and recognize different types of variables
To learn how to explore your data
◙ How to display data with numbers and tables
◙ How to display data using graphs
To understand the fundamental concept of variability
To learn the notion of the distribution of a variable

Why and how are statistics relevant to medicine?
Prevention – What causes a disease?
Diagnosis – What symptoms and signs do patients with a given disease present with?
Treatment – What treatments are effective for a given disease and for which patients?
Prognosis – How will specific patients with a given disease fare in the long term?

Statistics – Why do we need it?
[Figure: slide graphic showing a jumble of letters]

Descriptive and inferential statistics
Descriptive statistics are concerned with the presentation, organization, and summarization of data.
Inferential statistics allow us to generalize from a sample to a larger group of subjects.

What is data?
Data are collected for some purpose, and each piece of information collected has a meaning in some context.
Data are a set of information or observations about a group of individuals or subjects.
This information is organized in the form of variables.
A variable is any characteristic of a person or subject that can be measured or categorized and whose value varies from individual to individual.

Dependent and independent variables
Dependent variable: the outcome of interest, which changes in response to some intervention or exposure.
- mortality, survival, post-op pain, quality of life, post-op complications
Independent variable:
◙ The explanatory variable that explains the changes in the dependent variable
- demographics (age, gender, height), risk factors (diabetes, CAD)
◙ The intervention or exposure that causes the changes in the dependent variable
- drug, surgery, radiation, smoking …

Types of variables
Qualitative or attribute variable: categorical, non-numeric
- gender, severity of injury, type of injury, tumour grade
Quantitative variable: numeric
◙ A discrete variable can assume only whole numbers: number of accidents, number of injuries, pain score
◙ A continuous variable may take any value within a defined range: weight, height, age, blood pressure, level of cholesterol, pain score

Levels of measurement
There are four levels of measurement:
◙ Nominal (qualitative/categorical)
◙ Ordinal (qualitative/categorical)
◙ Interval (quantitative/numeric)
◙ Ratio (quantitative/numeric)

Levels of measurement, cont’d
Variable type    Assumptions
Nominal          Named categories
Ordinal          Same as nominal, plus ordered categories
Interval         Same as ordinal, plus equal intervals
Ratio            Same as interval, plus a meaningful zero

Levels of measurement, cont’d
A nominal variable consists of named categories, with no implied order among the categories.
- gender, mortality (dichotomous or binary)
- type of injury, type of fracture, blood type
An ordinal variable consists of ordered categories, where the differences between categories cannot be considered equal.
- tumour stage (I, II, III, IV), tumour grade (I, II, III, IV)
- Likert scale (excellent, very good, good, fair, poor)

Levels of measurement, cont’d
An interval variable has equal distances between values but no meaningful zero value.
- IQ test (the differences between numbers are meaningful but the ratios between them are not)
A ratio variable has equal intervals between values and a meaningful zero point, so the ratio between two values makes sense.
- height, weight, laboratory test values, age

For example
Primary objective: To compare post-operative pain between laparoscopic and open surgery in patients with colorectal cancer
Secondary objective: To compare post-operative complications between laparoscopic and open surgery in patients with colorectal cancer
Independent (explanatory) variables: age, sex, pre-op pain severity
Independent (comparison) variable: type of surgery (laparoscopic or open)
Dependent/outcome variables: change in pain, complications

Data Editing
Validity edits ensure that:
◘ essential fields have been completed and no information is missing
◘ specified units of measure have been properly used and the measurements are within the acceptable range
Duplication edits ensure that each case/patient has been entered into the database only once.
Statistical edits identify and double-check all extreme values, suspicious data and outliers.

Descriptive Statistics
… are a means of organizing and summarizing observations.
We examine variables in order to describe their main features.
The following basic strategies help us organize our exploration of a set of data:
◙ Begin by examining each variable.
◙ Examine the distribution of each variable by creating frequency tables, numerical summaries and graphs.
◙ Study the relationships between the variables.
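
The validity and duplication edits described under Data Editing can be sketched in a few lines of Python. This is only an illustration, not the procedure used for the lecture's data: the records, the field names (`patient_id`, `age`, `surgery`, `pain_score`) and the acceptable ranges are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical records: field names and values are illustrative only.
records = [
    {"patient_id": 1, "age": 64, "surgery": "open", "pain_score": 3},
    {"patient_id": 2, "age": None, "surgery": "lap", "pain_score": 5},
    {"patient_id": 3, "age": 158, "surgery": "lap", "pain_score": 2},
    {"patient_id": 1, "age": 64, "surgery": "open", "pain_score": 3},
]

REQUIRED_FIELDS = ("patient_id", "age", "surgery", "pain_score")
# Assumed acceptable ranges; a real study would define these in its protocol.
ACCEPTABLE_RANGES = {"age": (18, 110), "pain_score": (0, 10)}


def validity_problems(record):
    """Validity edit: list missing essential fields and out-of-range values."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            problems.append(f"missing {field}")
    for field, (low, high) in ACCEPTABLE_RANGES.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            problems.append(f"{field}={value} outside [{low}, {high}]")
    return problems


def duplicated_ids(rows):
    """Duplication edit: patient IDs that were entered more than once."""
    counts = Counter(row["patient_id"] for row in rows)
    return sorted(pid for pid, count in counts.items() if count > 1)


for row in records:
    for problem in validity_problems(row):
        print(f"patient {row['patient_id']}: {problem}")
print("duplicated patient IDs:", duplicated_ids(records))
```

Flagged records are only reported, not changed: whether a value is an entry error or a genuine extreme observation has to be checked against the source documents.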

Examining Distributions: Categorical
Numbers
◙ Frequencies (counts), cumulative frequencies
◙ Relative frequencies (%), cumulative relative frequencies (%)
Graphs
◙ Bar charts
◙ Pie charts

Cross-tabulation of categorical data

Frequency table: Severity of disease
Value    Frequency    Percent    Valid Percent    Cumulative Percent
0                7       23.3             23.3                  23.3
1               13       43.3             43.3                  66.7
2               10       33.3             33.3                 100.0
Total           30      100.0            100.0

Cross-tabulation: Complications by type of surgery
                 Open                   Lap
Complications    Count   Column N %     Count   Column N %
No                  13        86.7%        11        73.3%
Yes                  2        13.3%         4        26.7%
Total               15       100.0%        15       100.0%

Bar charts
A bar chart can be used to depict any level of measurement (nominal, ordinal, interval, or ratio).
It is a series of separated bars (vertical or horizontal), one per category.
Bars represent the frequency (count) or relative frequency (percent or proportion) of each category.
A bar chart is also useful for showing data for more than one group.

Pie charts
Used primarily for nominal and ordinal data.
Used to display a relative frequency distribution.
The circle is divided proportionally using the relative frequency of each category.
A pie chart is useful for showing data for one group, but it is not suitable for comparing two or more groups.

Examining Distributions: Quantitative
Numbers
◙ Measures of central tendency – mean, median, mode
◙ Measures of variation around the mean – variance, standard deviation, standard error of the mean
◙ Measures of variation around the median – percentiles, quintiles, quartiles
Graphs
◙ Histograms
◙ The five-number summary
◙ Box plots

Measures of central tendency
Mean: the sum of the observations divided by the number of observations
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Median: the midpoint of a distribution after arranging all observations in order of size, from smallest to largest.
Mode: the most frequent value (the highest peak of the distribution).

Properties of the mean
It is used for interval or ratio data.
A set of data has only one mean.
All values are included in the computation.
It is the only measure of central tendency for which the sum of the deviations of each value from the mean is always zero:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$
The mean is a useful measure for comparing two or more sets of data.
The mean is sensitive to extreme values.

Properties of the median
It is used for interval or ratio data.
There is a unique median for each data set.
The median is not necessarily equal to one of the sample values.
It is resistant (insensitive) to extreme values.
It is useful for summarising skewed data.

Measures of variation around the mean
Variance: the average of the squared deviations of the data from their mean (sample formula, with divisor n – 1)
$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Standard deviation: the square root of the variance
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Standard error of the mean:
$$\text{s.e.} = \frac{\sigma}{\sqrt{n}}$$

Properties of variance
All values are used in the calculation.
The units are not the same as those of the data; they are the square of the original units.

Properties of standard deviation
The units are the same as those of the data.
It is used for the Empirical Rule. For any symmetric, bell-shaped (approximately normal) distribution:
◘ About 68% of the observations will lie within 1 s.d. of the mean.
◘ About 95% of the observations will lie within 2 s.d. of the mean.
◘ About 99.7% of the observations will lie within 3 s.d. of the mean.

The Empirical Rule
[Figure: the Empirical Rule]

Measures of variation around the median
Percentiles: arrange the observations from smallest to largest and divide them into 100 equal parts. For example, the 5th percentile of a distribution is the value below which 5% of the observations fall and above which 95% fall.
Quartiles: 25th, 50th and 75th percentiles
Quintiles: 20th, 40th, 60th and 80th percentiles
Deciles: 10th, 20th, 30th, 40th, 50th, …, 90th percentiles

Statistics: Age
N (valid)                 30
N (missing)                0
Mean                   63.87
Std. error of mean     1.494
Median                 64.00
Mode                      58 (a)
Std. deviation         8.182
Variance              66.947
Range                     38
Minimum                   44
Maximum                   82
Percentiles    25      58.75
               50      64.00
               75      69.50
a. Multiple modes exist. The smallest value is shown.

Histograms
Used for interval and ratio data.
A histogram is a graph in which each bar covers a range of values on the horizontal axis; the width of that range is called the interval width.
The vertical axis represents the frequency of each interval.
There are no spaces between bars.
A histogram is useful for graphic illustration of one group.

Box plot: five-number summary
[Figure: annotated box plot showing the box (Q1, median/Q2, Q3), whiskers, inner fences and outliers, with IQR = Q3 – Q1 and Range = Max – Min]
[Figure: box plot of change in pain score]

Box plots
Used for interval and ratio data.
A box plot uses the five-number summary: median, Q1, Q3, minimum and maximum.
It is useful in detecting outliers.
It is useful for illustrating the distribution of more than one group.

What are outliers?
Outliers are extreme data values that fall outside the overall distribution of the data set.

1.5 × IQR criterion for outliers
The interquartile range (IQR) is the distance between the first and third quartiles: IQR = Q3 – Q1
From the data: Q1 = 59 yrs, Q3 = 70 yrs, IQR = 70 – 59 = 11
1.5 × IQR = 1.5 × 11 = 16.5
Lower inner fence: Q1 – 1.5 × IQR = 59 – 16.5 = 42.5
Upper inner fence: Q3 + 1.5 × IQR = 70 + 16.5 = 86.5
From the data: Min = 44 and Max = 82. Both lie inside the fences, so no observation is flagged as an outlier.

Properties of quartiles, quintiles…
They are used for interval or ratio data.
They are resistant (insensitive) to extreme values.
They are useful for summarising skewed data.
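
The standard error and the 1.5 × IQR criterion can be checked with a short Python sketch. The first part uses only the summary values reported for age (s.d. 8.182, n = 30, Q1 = 59, Q3 = 70, min 44, max 82). The `ages` list in the second part is a made-up sample for illustration, and the helper names are hypothetical.

```python
import math
import statistics


def standard_error(sd, n):
    """Standard error of the mean: s.d. divided by the square root of n."""
    return sd / math.sqrt(n)


def iqr_fences(q1, q3, k=1.5):
    """Inner fences: Q1 - k * IQR and Q3 + k * IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, q1, q3, k=1.5):
    """Return the values that fall outside the inner fences."""
    low, high = iqr_fences(q1, q3, k)
    return [v for v in values if v < low or v > high]


# Summary values taken from the age table and the worked example.
print(round(standard_error(8.182, 30), 3))  # 1.494
print(iqr_fences(59, 70))                   # (42.5, 86.5)
print(flag_outliers([44, 82], 59, 70))      # [] : min and max are inside

# Hypothetical sample (not the lecture data) with one extreme value.
ages = [52, 58, 58, 61, 64, 66, 70, 75, 95]
q1, _median, q3 = statistics.quantiles(ages, n=4)
print(flag_outliers(ages, q1, q3))
```

Statistical packages use slightly different rules for interpolating quartiles, so the fences computed from raw data may differ a little between programs; `statistics.quantiles` is used here with its default method.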

How to deal with skewed data
Transform the data:
◙ Square root (√x) – count (Poisson) data
◙ Square (x²) – data skewed toward the left
◙ Log(x) or ln(x) – data skewed toward the right
◙ Reciprocal (1/x) – data strongly skewed toward the right
Transformation can:
◙ Make skewed data more symmetric
◙ Make the distribution more normal
◙ Stabilize variability
◙ Linearize a relationship between two or more variables
Show summary statistics in the original units, but analyse the transformed data.

Summary of what we have learned
Always plot your data: make a graph, e.g. a histogram or box plot.
Look for the overall pattern (shape, centre and spread) and for striking deviations such as outliers.
Check whether the overall pattern of the distribution can be described by a normal distribution. If it cannot, transform the data to make skewed data more symmetric.
Calculate an appropriate numerical summary to describe centre and spread.
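
As a rough illustration of how a log transformation makes right-skewed data more symmetric, the sketch below compares the skewness of a small sample before and after taking natural logs. The sample is invented for the example, and the moment-based skewness coefficient is one common choice among several; neither comes from the lecture.

```python
import math


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed sample (all values must be > 0 to take logs).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]
logged = [math.log(v) for v in raw]

print("skewness, original scale:", round(skewness(raw), 2))
print("skewness, log scale:     ", round(skewness(logged), 2))

# Analyse on the log scale, then report in the original units:
# exponentiating the mean of the logs gives the geometric mean.
geometric_mean = math.exp(sum(logged) / len(logged))
print("geometric mean:", round(geometric_mean, 2))
```

A skewness closer to zero on the log scale indicates a more symmetric distribution; a histogram or box plot of the transformed values remains the more direct check.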