Descriptive Statistics
F. Farrokhyar, MPhil, PhD, PDoc

Learning objectives
• To recognize different types of variables
• To learn how to appropriately explore your data
◙ How to display data using graphs
◙ How to display data with numbers and tables
• To learn about measures of central tendency
• To learn about the measures of variation

Descriptive and Inferential statistics?
• Descriptive statistics help us with the presentation, organization, and summarization of data.
• Inferential statistics allow us to make inferences from a sample of individuals to a larger population.

What is data?
• Data is a set of information or observations about a group of individuals or subjects.
• This information is organized in the form of variables.
• A variable is any characteristic of a person or a subject that can be measured or categorized.
• Its value varies from individual to individual.

Type of variables
Qualitative or attribute variable
• Nonnumeric
- gender (male, female), type of injury (blunt, fall, burn, etc.)
Quantitative variable
• Numeric
- Discrete variable can assume only whole numbers: no. of accidents, no. of injuries, no. of positive nodes
- Continuous variable may take any value within a defined range: weight, age, blood pressure, level of cholesterol

Level of measurement
• There are four levels of measurement:
◙ Nominal
◙ Ordinal
◙ Interval
◙ Ratio

Level of measurement … cont’d
• Nominal variable: consists of named categories with no order among the categories.
- binary (two categories): gender, mortality
- multinomial (more than two categories): type of injury, blood type
• Ordinal variable: consists of ordered categories, where the differences between categories cannot be considered to be equal.
- Tumour stage – 1, 2, 3, 4
- Likert scale – excellent, very good, good, fair, poor

Level of measurement … cont’d
• Interval variable: has equal distances between values with no meaningful ‘zero’ value.
- IQ test
- Temperature (0 °C does not represent absence of temperature)
• Ratio variable: has equal intervals between values and a meaningful zero point. The ratio between two values makes sense.
- height, weight, laboratory test values

Level of measurement …
Variable type   Assumptions
Nominal         Named categories
Ordinal         Same as nominal plus ordered categories
Interval        Same as ordinal plus equal intervals
Ratio           Same as interval plus meaningful zero

Type of variables
Dependent variable
• Is the outcome of interest, which changes in response to some intervention or exposure.
- mortality, survival, post-op pain, quality of life
Independent variable
• Is the explanatory variable that explains the changes in the dependent variable.
- demographics (age, gender, height), risk factors (diabetes, BP)
• Is the intervention or exposure variable that causes the changes in the dependent variable.
- drug, surgery, radiation, smoking …
[Figure: diagram labelling the independent (explanatory) variables (age, sex, pre-op pain severity), the independent (comparison) variable, and the dependent/outcome variables (changes in pain, complication)]

Describing Categorical data
• Graphs
- Bar charts
- Pie charts

Bar charts
• Used to display nominal or ordinal data.
• A bar chart is a series of separated bars.
• Bars represent the frequency (count) or relative frequency (percent or proportion) of each category.
• Can be used to display data for more than one group.
[Figure: bar charts]

Pie charts
• Used for nominal and ordinal data.
• Used to display a relative frequency distribution.
• The circle is divided proportionally using the relative frequency of each category.
• A pie chart is useful for showing data for one group, but it is not suitable for illustrating two or more groups.
[Figure: pie charts]

Describing Categorical data
• Numerically
- Frequencies (counts)
- Relative frequencies (%)
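
As a minimal sketch of the numerical summary above, the following Python snippet tallies frequencies (counts) and relative frequencies (%) for one categorical variable. The severity labels are illustrative values made up for this example, and `frequency_table` is a hypothetical helper name, not something from the lecture.

```python
from collections import Counter


def frequency_table(values):
    """Return {category: (count, percent)} for a list of categorical values."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}


# Illustrative data (an assumption for this sketch): severity for 15 patients.
severity = ["mild"] * 4 + ["moderate"] * 6 + ["severe"] * 5

table = frequency_table(severity)
for category, (count, percent) in table.items():
    # Report each category as "count (percent%)".
    print(f"{category:<10}{count:>3} ({percent:.0f}%)")
```

Reporting the count next to the percentage keeps the denominator visible, which matters when groups are small.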

Cross-tabulation of categorical data
              Type of surgery
              Open       Laparoscopic   Total
Severity
  mild        4 (27%)    3 (20%)        7 (23%)
  moderate    6 (40%)    7 (47%)        13 (43%)
  severe      5 (33%)    5 (33%)        10 (33%)
Sex
  male        7 (47%)    4 (27%)        11 (37%)
  female      8 (53%)    11 (73%)       19 (63%)

Describing quantitative data
• Graphs
- Histograms
- The five-number summary → Boxplot
[Figure: histogram]

Histograms …
• Used for interval and ratio data.
• A histogram is a graph in which each bar spans a range of values on the horizontal axis, called the interval width. The vertical axis represents the frequency of each interval.
• There are no spaces between bars.
• The frequencies are represented by the height and area of each bar.
• A histogram is useful for graphic illustration of one group.

Box Plots …
• Used for interval and ratio data.
• Uses the five-number summary measures: median, Q1, Q3, minimum and maximum.
• It is useful in detecting outliers.
• It is useful to illustrate the distribution of more than one group.
[Figure: Box plot: 5-number summary, labelled from top to bottom: Maximum (100th), Q3, Median (Q2), Q1, Minimum (1st)]
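
To make the histogram description above concrete, here is a small Python sketch that counts how many observations fall in each interval and prints a text histogram. The ages, the bin start, and the interval width are all assumptions chosen for illustration, and `histogram_counts` is a hypothetical helper name.

```python
def histogram_counts(values, start, width, n_bins):
    """Count the values in each interval [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_bins:  # values outside the covered range are ignored
            counts[k] += 1
    return counts


# Illustrative ages in years (made up for this sketch).
ages = [44, 52, 57, 59, 61, 63, 64, 66, 68, 70, 71, 75, 82]
counts = histogram_counts(ages, start=40, width=10, n_bins=5)

for k, c in enumerate(counts):
    low = 40 + 10 * k
    # One '#' per observation, so bar length is the interval frequency.
    print(f"{low}-{low + 9}: {'#' * c}")
```

Because every interval has the same width here, bar length and bar area carry the same information.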

[Figure: Box plot of change in pain score]

Scatter plot
• Used to display the relationship between two continuous variables.

Describing quantitative data
• Numbers
- Measures of central tendency: mode, median, mean
- Measures of spread: range, interquartile range, variance, standard deviation

Mode – Measures of central tendency
• Mode is the most frequent value – the highest peak.
• Used for nominal, ordinal, interval and ratio data.
• There can be more than one mode.
Example: pain score 1, 4, 6, 8, 5, 6, 3, 2, 15
Sorted: 1, 2, 3, 4, 5, 6, 6, 8, 15
Mode = 6

Median – Measures of central tendency
• Median is the midpoint of the values after arranging the observations in order of size, from smallest to largest.
• There is a unique median for each dataset.
• Used for ordinal, interval and ratio data.
Properties:
• It is resistant (insensitive) toward extreme values.
• It is useful for summarising skewed data.
Example: 1, 2, 3, 4, 5, 6, 6, 8, 15
Median = 5

Mean – Measures of central tendency
• Mean is the sum of sample values divided by the number of sample values, n.
• It is useful for interval and ratio data.
• It is not necessarily equal to any one of the sample values.
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1 + 2 + 3 + 4 + 5 + 6 + 6 + 8 + 15}{9} = \frac{50}{9} \approx 5.56$$
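
The three measures of central tendency above can be checked with Python's standard `statistics` module. This is only a sketch using the pain-score example from the slides; the variable names are chosen for illustration.

```python
import statistics

pain_scores = [1, 4, 6, 8, 5, 6, 3, 2, 15]

# multimode returns a list, because a dataset can have more than one mode.
modes = statistics.multimode(pain_scores)
# median sorts the values internally and takes the middle one.
median = statistics.median(pain_scores)
# mean is the sum of the values divided by n.
mean = statistics.mean(pain_scores)

print("mode(s):", modes)
print("median:", median)
print("mean:", round(mean, 2))
```

`statistics.multimode` needs Python 3.8 or later; on older versions `statistics.mode` raises an error when there is a tie.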

Measures of central tendency …
[Figure: normal curve and skewed curve, with the mean, median and mode marked on each]

Properties of mean …
• There is a unique mean for each dataset.
• All values are included in the computation.
• It is the only measure of central tendency where the sum of deviations of each value from the mean will always be zero:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$
• The mean is sensitive toward extreme values.
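
Two of the properties above, the zero sum of deviations and the sensitivity to extreme values, can be illustrated with a short Python sketch. It reuses the pain-score example from this lecture; dropping the value 15 is an arbitrary choice made here to show the effect of one extreme observation.

```python
import statistics

scores = [1, 2, 3, 4, 5, 6, 6, 8, 15]
mean = statistics.mean(scores)

# Property: deviations from the mean sum to zero (up to floating-point rounding).
sum_of_deviations = sum(x - mean for x in scores)

# Property: the mean is sensitive to extreme values; the median moves less.
without_extreme = [x for x in scores if x != 15]
mean_shift = mean - statistics.mean(without_extreme)
median_shift = statistics.median(scores) - statistics.median(without_extreme)

print(f"|sum of deviations| < 1e-9: {abs(sum_of_deviations) < 1e-9}")
print(f"mean shift when 15 is dropped: {mean_shift:.2f}")
print(f"median shift when 15 is dropped: {median_shift:.2f}")
```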

Measures of Spread
• Range
• Interquartile range
• Variance
• Standard deviation

Range
• Used mainly for interval or ratio data.
• Range is the difference between the largest and smallest values in a dataset.
• Properties
- It uses only two values in its calculation.
- It is affected by extreme values.
- It is easy to understand.
Example: 1, 2, 3, 4, 5, 6, 6, 8, 15, so range = 15 – 1 = 14
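
A brief Python sketch of the range, using the example data above, also shows how strongly a single extreme value affects it. `value_range` is a hypothetical helper name, and removing 15 is just an illustration.

```python
scores = [1, 2, 3, 4, 5, 6, 6, 8, 15]


def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


full_range = value_range(scores)  # uses only the two extremes, 15 and 1
range_without_15 = value_range([x for x in scores if x != 15])  # 8 - 1

print(full_range, range_without_15)
```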

Interquartile range
• Used mainly for interval and ratio data.
• It is the distance between the third quartile (Q3) and the first quartile (Q1).
• Interquartile range = Q3 – Q1
• It is resistant (insensitive) to extreme values.
• It is useful for summarising skewed interval and ratio data.
• Arrange the observations from smallest to largest.
• Divide into 4 equal parts.
Example: 1, 2, 3, 4, 5, 6, 6, 8, 15
1st quartile (Q1) = (2 + 3) / 2 = 2.5
Median (Q2) = 5
3rd quartile (Q3) = (6 + 8) / 2 = 7
Interquartile range = 7 – 2.5 = 4.5
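
The quartile calculation above can be sketched in Python as follows. The helper `quartiles_by_halves` is a hypothetical name; it assumes the convention the worked example appears to use, where Q1 and Q3 are the medians of the lower and upper halves and the overall median is left out when n is odd. Other software may use different quartile conventions and return slightly different values.

```python
import statistics


def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using medians of the lower and upper halves.

    For an odd number of observations the overall median is excluded
    from both halves (an assumed convention, see the lead-in).
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )


q1, q2, q3 = quartiles_by_halves([1, 2, 3, 4, 5, 6, 6, 8, 15])
iqr = q3 - q1
print(q1, q2, q3, iqr)
```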

Interquartile range
• Interquartile range (IQR) is the distance between the first and third quartiles. IQR = Q3 – Q1
• Used to locate the outliers.

What are outliers?
• Outliers are extreme data values that fall well outside the rest of the distribution of the data set.

1.5 × IQR Criterion for Outliers
• From data: Q1 = 59 yrs, Q3 = 70 yrs
IQR = 70 – 59 = 11
1.5 × IQR = 1.5 × 11 = 16.5
Q1 – 1.5 × IQR = 59 – 16.5 = 42.5
Q3 + 1.5 × IQR = 70 + 16.5 = 86.5
• Outliers: values < 42.5 or > 86.5
• From data: Min = 44 and Max = 82, so no value is flagged as an outlier.
[Figure: Box plot: 5-number summary, labelled from top to bottom: Maximum 82 (100th), Q3, Median (Q2), Q1, Minimum 44 (1st)]
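
The 1.5 × IQR criterion above can be sketched in a few lines of Python. The function names `iqr_fences` and `find_outliers` are hypothetical, and the multiplier is exposed as a parameter `k` only for illustration; the slides use 1.5.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3, k=1.5):
    """Return the values lying strictly outside the fences."""
    lower, upper = iqr_fences(q1, q3, k)
    return [v for v in values if v < lower or v > upper]


lower, upper = iqr_fences(59, 70)
# Checking the minimum and maximum is enough to see whether any outlier exists.
outliers = find_outliers([44, 82], q1=59, q3=70)
print(lower, upper, outliers)
```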

Variance
• Population variance
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}$$
• Sample variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
• Here, the df is n – 1 rather than n because we lose 1 df by estimating the sample mean.
• Degrees of freedom measure the amount of information available in the data that can be used to estimate σ².
• The units are not the same as the data; they are the square of the original units.
• Properties
- All values are used in the calculation.
- Used for interval or ratio data.
- Is the average of the squared deviations from the mean.
Example: 1, 2, 3, 4, 5, 6, 6, 8, 15
$$s^2 = \frac{(1 - 5.56)^2 + (2 - 5.56)^2 + (3 - 5.56)^2 + \dots + (15 - 5.56)^2}{9 - 1} \approx 17.3$$

Standard deviation
• Is the square root of the variance.
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{17.3} \approx 4.2$$
• It is, roughly, the average deviation from the mean, in the same units as the data.
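
As a sketch of the formulas above, the following Python code computes the sample variance (dividing by n – 1), the population variance (dividing by N), and the standard deviation for the example data. The function names are hypothetical, chosen for this illustration.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def population_variance(values):
    """Same sum of squared deviations, divided by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


scores = [1, 2, 3, 4, 5, 6, 6, 8, 15]
s2 = sample_variance(scores)
sd = math.sqrt(s2)  # back in the original units of the data

print(round(s2, 1), round(sd, 1))
```

Dividing by n – 1 instead of n always gives the slightly larger value, which compensates for the mean having been estimated from the same sample.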

Uses of standard deviation …
• It is used for the Empirical Rule.
• For any symmetrical, bell-shaped (approximately normal) distribution:
◙ About 68% of the observations will lie within 1 s.d. of the mean.
◙ About 95% of the observations will lie within 2 s.d. of the mean.
◙ About 99.7% of the observations will lie within 3 s.d. of the mean.
[Figure: standard normal curve]
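
The Empirical Rule percentages can be checked against an exactly normal distribution with a short Python sketch. It relies on the standard identity that the probability of lying within k standard deviations of the mean equals erf(k / √2); `normal_coverage` is a hypothetical helper name.

```python
import math


def normal_coverage(k):
    """Probability that a normal variable lies within k s.d. of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    # Print the coverage as a percentage with one decimal place.
    print(f"within {k} s.d.: {100 * normal_coverage(k):.1f}%")
```

Real data are only approximately normal, so observed percentages will differ somewhat from these theoretical values.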

Summary of what we have learned ….
Data type            Graph                               Numerically
Ratio and interval   Histogram, box plot, scatter plot   Mean with standard deviation; median with IQR, range; mode
Ordinal              Bar chart, pie chart                Count and %; median with IQR, range; mode
Nominal              Bar chart, pie chart                Count and %; mode

• We report
- Mean with standard deviation
- Median with first and third quartiles
- Median with minimum and maximum