Descriptive Statistics
F. Farrokhyar, MPhil, PhD, PDoc
Department of Surgery
Department of Clinical Epidemiology and Biostatistics
March 18, 2009
Objectives
 To understand and recognize different types of variables
 To learn how to explore your data
◙ How to display data with numbers and tables
◙ How to display data using graphs
 To understand the fundamental concept of variability
 To learn the notion of the distribution of a variable
Why and how are statistics relevant to medicine?
 Prevention – What causes a disease?
 Diagnosis – What symptoms and signs do patients with a given
disease present with?
 Treatment – What treatments are effective for a given disease and
for which patients?
 Prognosis – How will specific patients with a given disease fare in
the long term?
Statistics – Why do we need it?
[Figure: pyramid of jumbled letters]
Descriptive and Inferential statistics?
 Descriptive statistics are concerned with the
presentation, organization, and
summarization of data
 Inferential statistics allow us to generalize
from a sample to a larger group of subjects.
What is data?
 Data is collected for some purpose, and each
piece of collected information has a meaning in some
context.
 Data is a set of information or observations about a
group of individuals or subjects.
 This information is organized in the form of variables.
 A variable is any characteristic of a person or a
subject that can be measured or categorized and whose
value varies from individual to individual.
Dependent and Independent Variables?
Dependent variable
 Is the outcome of interest, which changes in response to some
intervention or exposure.
mortality, survival, post-op pain, quality of life, post-op
complications
Independent variable
 Is the explanatory variable that explains the changes in the
dependent variable
demographics (age, gender, height), risk factors (diabetes, CAD)
 Is the intervention or exposure that causes the changes in the
dependent variable.
drug, surgery, radiation, smoking …
Type of variables …?
Qualitative or attribute variable
Categorical variables…
 Nonnumeric
gender, severity of injury, type of injury, tumour grade
Quantitative variable
 Numeric
Discrete variable can assume only whole numbers: number of
accidents, number of injuries, pain score
Continuous variable may take any value, within a defined range:
weight, height, age, blood pressure, level of cholesterol, pain
score
Level of measurement …
 There are four levels of measurement:
◙ Nominal (qualitative/categorical)
◙ Ordinal (qualitative/categorical)
◙ Interval (quantitative/numeric)
◙ Ratio (quantitative/numeric)
Level of measurement … cont’d
 Variable type: assumptions
◙ Nominal: named categories
◙ Ordinal: same as nominal plus ordered categories
◙ Interval: same as ordinal plus equal intervals
◙ Ratio: same as interval plus meaningful zero
Level of measurement … cont’d
 A nominal variable: consists of named categories,
with no implied order among the categories.
- gender, mortality ---- dichotomous or binary
- type of injury, type of fracture, blood type
 An ordinal variable: consists of ordered categories,
where the differences between categories cannot be
considered to be equal.
- Tumour stage – I, II, III, IV; tumour grade – I, II, III, IV
- Likert scale – excellent, very good, good, fair, poor
Level of measurement … cont’d
 An interval variable: has equal distances between
values with no meaningful ‘zero’ value.
- IQ test (the differences between numbers are meaningful
but the ratios between them are not)
 A ratio variable: has equal intervals between values
and a meaningful zero point, so the ratio between two
values makes sense.
- height, weight, laboratory test values, age
For example
Primary objective: To compare the post-operative pain
between laparoscopic and open surgery in
patients with colorectal cancer
Secondary objective: To compare the post-operative
complications between laparoscopic and
open surgery in patients with colorectal
cancer
Independent (explanatory) variables: age, sex, pre-op pain severity
Independent (comparison) variable: type of surgery (laparoscopic vs open), as stated in the objectives
Dependent/outcome variables: changes in pain, complications
Data Editing
 Validity edits: Ensure that:
◘ essential fields have been completed and there is no
missing information
◘ specified units of measure have been properly used and
the measurements are within the acceptable range.
 Duplication edits: Ensure that each case/patient has been
entered into the database only once.
 Statistical edits: Identify and double-check all extreme
values, suspicious data and outliers.
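The validity and duplication edits can be sketched in a few lines of code. This is a minimal illustration only: the records, the field names (`patient_id`, `age`, `pain_score`) and the acceptable ranges are all hypothetical and would need to match the actual study database.

```python
from collections import Counter

# Hypothetical records; field names and acceptable ranges are assumptions.
records = [
    {"patient_id": 1, "age": 63, "pain_score": 4},
    {"patient_id": 2, "age": None, "pain_score": 7},   # missing age
    {"patient_id": 3, "age": 630, "pain_score": 2},    # age outside range
    {"patient_id": 1, "age": 63, "pain_score": 4},     # entered twice
]
REQUIRED_FIELDS = ("patient_id", "age", "pain_score")
ACCEPTABLE_RANGES = {"age": (18, 110), "pain_score": (0, 10)}


def validity_problems(record):
    """Validity edit: list missing fields and out-of-range values."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            problems.append(f"missing {field}")
    for field, (low, high) in ACCEPTABLE_RANGES.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            problems.append(f"{field} out of range")
    return problems


def duplicated_ids(all_records):
    """Duplication edit: patient IDs that appear more than once."""
    counts = Counter(r["patient_id"] for r in all_records)
    return sorted(pid for pid, count in counts.items() if count > 1)


validity_report = {i: validity_problems(r) for i, r in enumerate(records)}
duplicates = duplicated_ids(records)
print(validity_report)
print(duplicates)
```

Flagged records still need to be checked against the source documents; the code only points to where to look.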
Descriptive Statistics
 … are a means of organizing and summarizing observations.
 We examine variables in order to describe their main features.
 Some basic strategies help us organize our exploration of
a set of data:
◙ Begin by examining each variable.
◙ Examine the distribution of each variable by creating
frequency tables, numerical summaries and graphs.
◙ Study the relationships between the variables.
Examining Distributions: Categorical …
 Numbers
Frequencies (counts), cumulative frequencies
Relative frequencies (%), cumulative relative
frequencies (%)
 Graphs
Bar charts
Pie charts
Cross-tabulation of categorical data
Severity of disease
Value    Frequency    Percent    Valid Percent    Cumulative Percent
0        7            23.3       23.3             23.3
1        13           43.3       43.3             66.7
2        10           33.3       33.3             100.0
Total    30           100.0      100.0
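As a rough sketch of how such a frequency table is produced, the code below rebuilds the counts, relative frequencies and cumulative relative frequencies. The raw values are reconstructed from the counts in the table (7, 13 and 10 patients at severity 0, 1 and 2); the function name `frequency_table` is made up for this illustration.

```python
from collections import Counter

# Raw values reconstructed from the counts shown in the table.
severity = [0] * 7 + [1] * 13 + [2] * 10


def frequency_table(values):
    """Return (category, count, percent, cumulative percent) rows."""
    counts = Counter(values)
    total = len(values)
    rows = []
    cumulative = 0
    for category in sorted(counts):
        cumulative += counts[category]
        rows.append((
            category,
            counts[category],
            round(100 * counts[category] / total, 1),
            round(100 * cumulative / total, 1),
        ))
    return rows


table = frequency_table(severity)
for category, count, percent, cumulative_percent in table:
    print(category, count, percent, cumulative_percent)
```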
Cross-tabulation of categorical data
Complications by type of surgery
                 Open                     Lap
Complications    Count    Column N %      Count    Column N %
No               13       86.7%           11       73.3%
Yes              2        13.3%           4        26.7%
Total            15       100.0%          15       100.0%
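A cross-tabulation with column percentages can be sketched as follows. The patient-level pairs are reconstructed from the cell counts in the table, and `crosstab_column_percent` is a hypothetical helper name, not a function from any statistics package.

```python
from collections import Counter

# (type of surgery, complication) pairs rebuilt from the cell counts.
pairs = (
    [("Open", "No")] * 13 + [("Open", "Yes")] * 2
    + [("Lap", "No")] * 11 + [("Lap", "Yes")] * 4
)


def crosstab_column_percent(observations):
    """Map (column, row) to (count, percent of that column's total)."""
    cell_counts = Counter(observations)
    column_totals = Counter(column for column, _ in observations)
    return {
        (column, row): (count, round(100 * count / column_totals[column], 1))
        for (column, row), count in cell_counts.items()
    }


crosstab = crosstab_column_percent(pairs)
for (surgery, complication), (count, percent) in sorted(crosstab.items()):
    print(surgery, complication, count, f"{percent}%")
```

Percentages are taken within each type of surgery, so each surgery column adds up to 100%.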
Bar Charts
Bar charts …
 A bar chart can be used to depict any level of
measurement (nominal, ordinal, interval, or ratio).
 It consists of a series of separated bars (vertical or
horizontal), one per category.
 Bars represent frequency (counts) or relative frequency
(percent or proportion) of each category.
 A bar chart is also useful for showing data for more than
one group.
Pie Charts
Pie charts …
 Used primarily for nominal and ordinal data.
 Used to display relative frequency distribution.
 The circle is divided proportionally using relative frequency
of each category.
 A pie chart is useful for showing data for one group but it is
useless for graphic illustration of two or more groups.
Examining Distributions: Quantitative …
 Numbers
Measures of central tendency – mean, median, mode
Measures of variation around mean – variance, standard
deviation, standard error of mean
Measures of variation around median – percentiles, quintiles,
quartiles
 Graphs
Histograms
The five-number summary → Box plots
Measures of central tendency
 Mean: sum of observations divided by number of
observations
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
 Median: is a midpoint of a distribution after
arranging all observations in order of size, from
smallest to largest.
 Mode: most frequent value – the highest peak
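The three measures can be computed with Python's standard `statistics` module. The five values below are invented purely for illustration.

```python
import statistics

values = [2, 3, 3, 5, 7]  # hypothetical observations

mean_value = sum(values) / len(values)      # sum divided by n
median_value = statistics.median(values)    # midpoint of the sorted values
modes = statistics.multimode(values)        # most frequent value(s)

print(mean_value, median_value, modes)
```

`multimode` returns a list because a data set can have more than one mode.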
Properties of mean …
 It is used for interval or ratio data.
 A set of data has only one mean.
 All values are included in the computation.
 It is the only measure of central tendency where the sum of
deviations of each value from the mean is always zero:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$
 The mean is a useful measure for comparing two or more sets of
data.
 The mean is sensitive to extreme values.
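Two of these properties are easy to check numerically. In the sketch below the data are hypothetical; the second list replaces the largest value with an extreme one to show how the mean moves while the median does not.

```python
import statistics

values = [2, 3, 3, 5, 7]          # hypothetical observations
with_outlier = [2, 3, 3, 5, 70]   # same data, one extreme value

mean_value = statistics.mean(values)
deviation_sum = sum(x - mean_value for x in values)   # deviations cancel out

mean_with_outlier = statistics.mean(with_outlier)
median_plain = statistics.median(values)
median_with_outlier = statistics.median(with_outlier)

print(deviation_sum)
print(mean_value, mean_with_outlier)
print(median_plain, median_with_outlier)
```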
Properties of median …
 It is used for interval or ratio data.
 There is a unique median for each data set.
 The median is not necessarily equal to one of the sample
values.
 It is resistant (insensitive) toward extreme values.
 It is useful for summarising skewed data.
Measures of variation around mean
 Variance: the average of the squares of the deviations of
the data from their mean
$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
 Standard deviation: square root of variance
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
 Standard error:
$$\text{s.e.} = \frac{\sigma}{\sqrt{n}}$$
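The three formulas translate directly into code. The sketch below uses a small hypothetical data set and the n - 1 denominator shown on the slide; the result is cross-checked against `statistics.variance`, which uses the same denominator.

```python
import math
import statistics

values = [2, 3, 3, 5, 7]  # hypothetical observations
n = len(values)
mean_value = sum(values) / n

# Sum of squared deviations from the mean, divided by n - 1.
variance = sum((x - mean_value) ** 2 for x in values) / (n - 1)
standard_deviation = math.sqrt(variance)
standard_error = standard_deviation / math.sqrt(n)

print(variance, standard_deviation, round(standard_error, 3))
```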
Properties of variance …
 All values are used in the calculation.
 The units are not the same as the data; they are the square of
the original units.
Properties of standard deviation …
 The units are the same as the data.
 It is used for the Empirical Rule.
 For a symmetrical, bell-shaped (approximately normal) distribution:
◘ About 68% of the observations will lie within 1 s.d. of the mean.
◘ About 95% of the observations will lie within 2 s.d. of the mean.
◘ About 99.7% of the observations will lie within 3 s.d. of the
mean.
The Empirical Rule
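The percentages in the Empirical Rule can be recovered from the normal distribution itself. This sketch assumes an exactly normal variable and uses `statistics.NormalDist` from the Python standard library; `share_within` is a helper name introduced here.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k s.d. of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))
```

Real data are only approximately normal, so observed proportions will differ somewhat from these values.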
Measures of variation around median
 Percentiles:
Arrange the observations from smallest to largest and
divide them into 100 equal parts.
For example, the 5th percentile of a distribution is the value
below which 5% of the observations fall and above which 95% fall.
 Quartiles: 25th, 50th and 75th percentiles
 Quintiles: 20th, 40th, 60th and 80th percentiles
 Deciles: 10th, 20th, 30th, 40th, 50th, …, 90th percentiles
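Quartiles, quintiles and deciles are all cut points of the same kind, differing only in the number of equal parts. A minimal sketch using `statistics.quantiles` on the hypothetical values 1 to 99 is shown below. Statistical packages use slightly different interpolation rules, so results on real data may differ a little between programs.

```python
import statistics

values = list(range(1, 100))  # hypothetical data: 1, 2, ..., 99

quartiles = statistics.quantiles(values, n=4)    # 25th, 50th, 75th
quintiles = statistics.quantiles(values, n=5)    # 20th, 40th, 60th, 80th
deciles = statistics.quantiles(values, n=10)     # 10th, ..., 90th

print(quartiles)
print(quintiles)
print(deciles)
```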
Statistics: Age
N Valid                 30
N Missing               0
Mean                    63.87
Std. Error of Mean      1.494
Median                  64.00
Mode                    58 (a)
Std. Deviation          8.182
Variance                66.947
Range                   38
Minimum                 44
Maximum                 82
Percentiles 25          58.75
Percentiles 50          64.00
Percentiles 75          69.50
a. Multiple modes exist. The smallest value is shown.
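The entries in the Age summary are linked by the formulas given earlier, which gives a quick way to sanity-check a printed table. The sketch below only re-derives values from numbers reported in the table; the variable names are chosen for this illustration.

```python
import math

# Values copied from the Age summary table.
n = 30
standard_deviation = 8.182
variance = 66.947
minimum, maximum = 44, 82
q1, q3 = 58.75, 69.50

sd_from_variance = round(math.sqrt(variance), 3)        # should match Std. Deviation
standard_error = round(standard_deviation / math.sqrt(n), 3)
value_range = maximum - minimum
iqr = q3 - q1

print(sd_from_variance, standard_error, value_range, iqr)
```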
Histogram
Histograms …
 Used for interval and ratio data.
 A histogram is a graph in which each bar covers a range of
values on the horizontal axis (the interval width). The
vertical axis represents the frequency of each interval.
 There are no spaces between bars.
 A histogram is useful for graphic illustration of one group.
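The counting step behind a histogram can be sketched without any plotting library. The ages below are hypothetical, and the starting point (40) and interval width (10 years) are arbitrary choices made for this example.

```python
from collections import Counter

ages = [44, 52, 58, 58, 61, 64, 64, 67, 70, 75, 82]  # hypothetical ages


def histogram_counts(values, start, width):
    """Count values per interval [lower, lower + width), keyed by lower bound."""
    counts = Counter((value - start) // width for value in values)
    first, last = min(counts), max(counts)
    # Include empty intervals so the bars stay adjacent.
    return {start + i * width: counts.get(i, 0) for i in range(first, last + 1)}


bins = histogram_counts(ages, start=40, width=10)
for lower, frequency in bins.items():
    print(f"{lower}-{lower + 9}: {'#' * frequency}")
```

Changing the interval width changes the shape of the picture, so it is worth trying more than one width.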
Box plot: 5 – number summary
[Figure: annotated box plot. Labels from top to bottom: outliers, inner fence, whisker (100th), Q3, median/Q2, Q1, whisker (1st), inner fence. Range = Max - Min; IQR = Q3 – Q1.]
Box plot of change in pain score
Box Plots …
 Used for interval and ratio data.
 Uses the five-number summary:
minimum, Q1, median, Q3 and maximum.
 It is useful in detecting outliers.
 It is useful for illustrating the distribution of more than
one group.
What are outliers … ?
Outliers are extreme data values that fall outside
the overall pattern of the distribution of the data set.
1.5 × IQR Criterion for Outliers
 Interquartile range (IQR) is the distance between the
first and third quartiles. IQR = Q3 – Q1
 From data
Q1 = 59 yrs, Q3 = 70 yrs,
IQR = 70 – 59 = 11
1.5 × IQR = 1.5 × 11 = 16.5
Lower fence: Q1 – 1.5 × IQR = 59 – 16.5 = 42.5
Upper fence: Q3 + 1.5 × IQR = 70 + 16.5 = 86.5
 From data: Min = 44 and Max = 82. Both lie within the fences
(42.5 to 86.5), so no observation is flagged as an outlier.
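The criterion can be written as a short function. The quartiles (59 and 70) and the minimum and maximum (44 and 82) come from the slide; the function names and the second list of ages are hypothetical and only show what a flagged value looks like.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k x IQR criterion."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3, k=1.5):
    """Values lying strictly outside the fences."""
    lower, upper = iqr_fences(q1, q3, k)
    return [value for value in values if value < lower or value > upper]


lower_fence, upper_fence = iqr_fences(59, 70)
flagged_in_slide_data = find_outliers([44, 82], 59, 70)        # min and max
flagged_in_example = find_outliers([40, 44, 82, 90], 59, 70)   # hypothetical ages

print(lower_fence, upper_fence)
print(flagged_in_slide_data, flagged_in_example)
```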
Properties of quartiles, quintiles…
 They are used for interval or ratio data.
 They are resistant (insensitive) to extreme values.
 They are useful for summarising skewed data.
How to deal with skewed data
 Transform the data:
Square/square root – (Poisson) count data
Log(x) or ln(x) – data is skewed toward right
Reciprocal (1/X) - data is skewed toward left
 Transformation can:
Make skewed data more symmetric
Make the distribution more normal
Stabilize variability
Linearize a relationship between two or more variables
 Report summary statistics in the original units, but analyse the transformed data
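The effect of a log transformation on right-skewed data can be checked numerically. The data below are hypothetical, and `skewness` is a simple moment-based helper written for this sketch rather than a library function.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over s.d. cubed."""
    n = len(values)
    mean = sum(values) / n
    second_moment = sum((x - mean) ** 2 for x in values) / n
    third_moment = sum((x - mean) ** 3 for x in values) / n
    return third_moment / second_moment ** 1.5


raw = [1, 2, 2, 3, 4, 5, 8, 15, 40, 120]   # hypothetical right-skewed data
logged = [math.log(x) for x in raw]        # natural log transform

skew_raw = skewness(raw)
skew_logged = skewness(logged)
print(round(skew_raw, 2), round(skew_logged, 2))

# Back-transform the mean of the logs to report in the original units.
geometric_mean = math.exp(sum(logged) / len(logged))
print(round(geometric_mean, 1))
```

The transformed data are still somewhat skewed here, but much less so than the raw values.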
Summary of what we have learned ….
 Always plot your data: make a graph, e.g. a histogram or box plot
 Look for the overall pattern (shape, centre and spread) and for striking
deviations such as outliers
 Check to see if the overall pattern of the distribution can be described by
a normal distribution.
 If not, transform the data to make skewed data more symmetric
 Calculate an appropriate numerical summary to describe centre and
spread