ISRU - Newcastle University
Basic Statistics
Content

Data Types

Descriptive Statistics

Graphical Summaries

Distributions

Sampling and Estimation

Confidence Intervals

Hypothesis Testing (Statistical tests)

Errors in Hypothesis Testing

Sample Size
Data Types
Motivation
Defining your data type is always a
sensible first consideration
 You then know what you can ‘do’ with it

Variables

Quantitative Variable
 A variable that is counted or measured on a
numerical scale
 Can be continuous or discrete (discrete values are always whole numbers)

Qualitative Variable
 A non-numerical variable that can be classified into
categories, but can’t be measured on a numerical
scale.
 Can be nominal or ordinal
Continuous Data

Continuous data is measured on a scale.

The data can have almost any numeric value
and can be recorded at many different points.

For example:
 Temperature (39.25 °C)
 Time (2.468 seconds)
 Height (1.25m)
 Weight (66.34kg)
Discrete Data

Discrete data is based on counts, for example:
 The number of cars parked in a car park
 The number of patients seen by a dentist each day.

Only certain (whole-number) values are possible, e.g.
a dentist could see 10, 11 or 12 people but not
12.3 people
Nominal Data

A Nominal scale is the most basic level of measurement.
The variable is divided into categories and objects are
‘measured’ by assigning them to a category.

For example,
 Colours of objects (red, yellow, blue, green)
 Types of transport (plane, car, boat)

There is no order of magnitude to the categories i.e.
blue is no more or less of a colour than red.
Ordinal Data

Ordinal data is categorical data, where the categories can be
placed in a logical order of ascendance e.g.;
 1 – 5 scoring scale, where 1 = poor and 5 = excellent
 Strength of a curry (mild, medium, hot)

There is some measure of magnitude, a score of ‘5 – excellent’ is
better than a score of ‘4 – good’.

But this says nothing about the degree of difference between the
categories i.e. we cannot assume a customer who thinks a
service is excellent is twice as happy as one who thinks the
same service is good.
Descriptive Statistics
Motivation

Why important?
–
extremely useful for summarising data in a meaningful
way
–
‘gain a feel’ for what constitutes a representative
value and how the observations are scattered around
that value
–
statistical measures such as the mean and standard
deviation are used in statistical hypothesis testing
Session Content

Measures of Location

Measures of Dispersion
Measures of Location

Measures of location
•
Mean
•
Median
•
Mode

The average is a general term for a measure
of location; it describes a typical measurement
Mean

The mean (arithmetic mean) is commonly called the
average

In formulas the mean is usually represented by x̄, read as 'x-bar'

The formula for calculating the mean from 'n' individual data-points is:

x̄ = Σx / n

x-bar equals the sum of the data divided by the
number of data-points
Median

Median means middle

The median is the middle of a set of data that has been
put into rank order

Specifically, it is the value that divides a set of data into
two halves, with one half of the observations being
larger than the median value, and one half smaller
E.g. for the ordered data 18, 24, 29, 30, 32 the median is 29:
half the data < 29 and half the data > 29
Mode

The mode represents the most commonly occurring
value within a dataset

Rarely used as a summary statistic

Find the mode by creating a frequency distribution
and tallying how often each value occurs
 If we find that every value occurs only once, the distribution
has no mode.
 If we find that two or more values are tied as the most
common, the distribution has more than one mode
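The three measures of location above can be illustrated with Python's built-in statistics module (a minimal sketch using made-up data):

```python
import statistics

# Hypothetical dataset: lengths of stay in hospital (days)
data = [18, 24, 29, 30, 32, 29]

mean = statistics.mean(data)      # sum of the data divided by the number of data-points
median = statistics.median(data)  # middle value of the data in rank order
mode = statistics.mode(data)      # most commonly occurring value

print(mean, median, mode)
```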
Measures of Dispersion

Range

Interquartile range

Variance

Standard deviation
Measures of spread
The spread/dispersion in a set of data is the variation
among the set of data values

They measure whether values are close together, or
more scattered

[Figure: two dot plots of length of stay in hospital (days), one with values spread from 2 to 16 days and one more tightly clustered between 2 and 12 days]
Range

Difference between the largest and smallest value in a
data set

The actual max and min values may be stated rather
than the difference

The range of a list is 0 if all the data-points in the list
are equal
[Figure: range shown as the distance between the minimum (4 days) and maximum (16 days)]
Interquartile range

Measures of spread not influenced by outliers can be
obtained by excluding the extreme values in the data
set and determining the range of the remaining values

Interquartile range = Upper quartile – Lower quartile
Interquartile Range
[Figure: data running from 4 to 20 days, with lower quartile Q1 = 9 and upper quartile Q3 = 12; interquartile range = 12 − 9 = 3 days]
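Both measures can be sketched in Python; note that quartile conventions differ between packages, so the exact IQR can vary slightly (here the quartiles are taken as the medians of the lower and upper halves, and the data is made up):

```python
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def iqr(values):
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]    # lower half (excludes the middle value if n is odd)
    upper = s[-half:]   # upper half
    return median(upper) - median(lower)  # upper quartile - lower quartile

data = [4, 6, 9, 10, 12, 14, 20]    # hypothetical lengths of stay (days)
data_range = max(data) - min(data)  # largest minus smallest value
print(data_range, iqr(data))  # → 16 8
```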
Variance

Spread can be measured by determining the extent to
which each observation deviates from the arithmetic
mean

The larger the deviations, the larger the variability

Cannot use the mean of the deviations otherwise the
positive differences cancel out the negative differences

Overcome the problem by squaring each deviation and
finding the mean of the squared deviations = Variance

Units are the square of the units of the original
observations e.g. kg2
Standard Deviation

The square root of the variance

It can be regarded as a form of average of the
deviations of the observations from the mean

Stated in the same units as the raw data
Standard Deviation (SD)
Smaller SD = values clustered closer to the mean
Larger SD = values are more scattered
[Figure: two distributions of days, each centred on a mean of 10, with bands marking 1 SD either side of the mean; one has a larger SD (values from 4 to 16) and one a smaller SD (values from 8 to 12)]
Variance & Standard Deviation

The following formulae define these measures
Population:
Variance σ² = Σ(x − μ)² / N
Standard Deviation σ = √σ²

Sample:
Variance s² = Σ(x − x̄)² / (n − 1)
Standard Deviation s = √s²
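These formulae can be checked numerically (a sketch in Python with made-up data; the assertions confirm the explicit sample formulae agree with the standard library):

```python
import math
import statistics

data = [4, 6, 8, 10, 12]  # hypothetical observations

n = len(data)
mean = sum(data) / n

# Sample variance: mean of squared deviations, but divided by (n - 1)
s2 = sum((x - mean) ** 2 for x in data) / (n - 1)
s = math.sqrt(s2)  # sample standard deviation, in the original units

# The standard library agrees with the formulae above
assert math.isclose(s2, statistics.variance(data))
assert math.isclose(s, statistics.stdev(data))
print(s2, s)
```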
Variation within-subjects

If repeated measures of a variable are taken on an individual
then some variation will be observed

Within-subject variation may occur because:
–
the individual does not always respond in the same way (e.g.
blood pressure)
–
of measurement error

E.g. readings of systolic blood pressure on a man may range
between 135-145 mm Hg when repeated 10 times

Usually less variation than between-subjects
Variation between-subjects

Variation obtained when a single measurement is
taken on every individual in a group

Between-subject variation

E.g. single measurements of systolic blood
pressure on 10 men may range between 125-175
mm Hg

Much greater variation than the 10 readings on
one man

Usually more variation than within-subject
variation
Session Summary

Measures of Location

Measures of Dispersion
Graphical Summaries
Motivation

Why important?
–
extremely useful for providing simple
summary pictures, ‘getting a feel’ for the data
and presenting results to others
–
used to identify outliers
Session Content

Bar Chart

Pie Chart

Box Plot

Histogram

Scatter Plot
Displaying frequency distributions

Qualitative or Discrete numerical data can be
displayed visually in a:
–
Bar Chart
–
Pie Chart

Continuous numerical data can be displayed visually
in a:
–
Box Plot
–
Histogram
Bar Chart

Horizontal or vertical bar drawn for each
category

Length proportional to frequency

Bars are separated by small gaps to indicate
that the data is qualitative or discrete
Example: Bar Chart
Pie Chart

A circular ‘pie’ that is split into sections

Each section represents a category

The area of each section is proportional to the
frequency in the category
Example: Pie Chart
What could improve
this chart?
Box Plot

Sometimes called a ‘Box and Whisker Plot’

A vertical or horizontal rectangle

Ends of the rectangle correspond to the upper and
lower quartiles of the data values

A line drawn in the rectangle corresponds to the
median value

Whiskers indicate minimum and maximum values but
sometimes relate to percentiles (e.g. the 5th and 95th
percentile)

 Outliers are often marked with an asterisk
Example: Box Plot
Histogram

Similar to a bar chart, but no gaps between
the bars (the data is continuous)

The width of each bar relates to a range of
values for the variable

Area of the bar proportional to the frequency
in that range

 Usually between 5 and 20 groups are chosen
Example: Histogram
Displaying two variables

If one variable is categorical, separate
diagrams showing the distribution of the
second variable can be drawn for each of the
categories

Clustered or segmented bar charts are also
an option

If variables are numerical or ordinal then a
scatter plot can be used to display the
relationship between the two
Example: Scatter Plot
[Figure: scatter plot titled 'Scatterplot of Weight Loss vs Time on Diet'; Weight Loss (0-80) on the vertical axis against Time on Diet (0-25) on the horizontal axis]
Fitting the Line

If the scatter plot of y versus x looks
approximately linear, how do we decide where
to put the line of best fit?

By eye?

A standard procedure for placing the line of
best fit is necessary, otherwise the line fitted
to the data would change depending on who
was examining the data
Regression

The least-squares regression method is used
to achieve this

This method minimises the sum of the
squared vertical differences between the
observed y values and the line i.e. the least-squares
regression line minimises the error
between the predicted values of y and the
actual y values

The total prediction error is less for the least-squares
regression line than for any other
possible prediction line
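The least-squares fit can be sketched directly from its closed-form solution: the slope is the covariance of x and y divided by the variance of x, and the line always passes through (x̄, ȳ). The data below is hypothetical:

```python
def least_squares(x, y):
    """Fit y = a + b*x by minimising the sum of squared vertical differences."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)   # slope
    a = ybar - b * xbar                     # intercept
    return a, b

# Hypothetical data lying exactly on y = 2 + 3x
x = [0, 1, 2, 3, 4]
y = [2, 5, 8, 11, 14]
a, b = least_squares(x, y)
print(a, b)  # → 2.0 3.0
```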
Example: Scatter Plot with
Regression Line
[Figure: the same scatter plot of Weight Loss vs Time on Diet with the fitted least-squares line: Weight Loss = 1.69 + 3.47 × Time on Diet]
Session Summary

Bar Chart

Pie Chart

Box Plot

Histogram

Scatter Plot
Distributions
Motivation

Why important?
–
if the empirical data approximates to a particular
probability distribution, theoretical knowledge can be
used to answer questions about the data
–
Note: Empirical distribution is the observed
distribution (observed data) of a variable
–
the properties of distributions provide the underlying
theory in some statistical tests (parametric tests)
–
the Normal Distribution is extremely important
Important point

It is not necessary to completely understand
the theory behind probability distributions!

It is important to know when and how to use
the distributions

Concentrate on familiarity with the basic
ideas, terminology and perhaps how to use
statistical tables (although statistical software
packages have made the latter point less
essential)
Normal Distribution

Used as the underlying assumption in many statistical
tests

Bell-shaped

Symmetrical about the mean

Flattened as the variance increases (fixed mean)

Peaked as the variance decreases (fixed mean)

Shifted to the right if mean increases

Shifted to the left if mean decreases

Mean and Median of a Normal Distribution are equal
Intervals of the Normal
Distribution
Approximately 68% of values lie within 1 standard deviation of the mean,
95% within about 2 standard deviations (more precisely 1.96),
and 99.7% within 3 standard deviations (3σ)
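These intervals can be verified with Python's statistics.NormalDist (a quick sketch):

```python
from statistics import NormalDist

z = NormalDist()  # standard Normal distribution: mean 0, SD 1

def within(k):
    """Probability of falling within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

print(round(within(1), 3))  # ≈ 0.683 (the "68%")
print(round(within(2), 3))  # ≈ 0.954 (roughly the "95%"; exact at 1.96 SDs)
print(round(within(3), 3))  # ≈ 0.997 (the "99.7%")
```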
Other distributions

t-distribution

χ² distribution

F- distribution
Sampling and Estimation
Motivation

Why important?
–
studying the entire population in the majority
of cases is impractical, time consuming and/or
resource intensive
–
samples are used in studies to estimate
characteristics and draw conclusions about
the population
Populations and Samples

Population – the entire group of individuals in whom
we are interested

E.g.
–
All season ticket holders at Newcastle United
–
All students at the University of Newcastle upon Tyne
–
The entire population of the UK
–
All patients with a certain medical condition

Sample – any subset of a population
Sampling

Samples should be ‘representative’ of the population

Some degree of sampling error will exist when the
whole population is not used

Asking people to choose a ‘representative’ sample is
subjective as people will choose differently.

An objective method for selecting the samples is
desirable – a sampling strategy

The advantage of sampling strategies is that they
avoid subjectivity and bias
Sampling Strategies

Include:
 Simple Random Sampling (SRS)
 Systematic Sampling
 Cluster Sampling
 Stratified Random Sampling
Simple Random Sampling

Sample chosen so that every member of a
population has the same chance (probability) of
being included in the sample

 To carry out Simple Random Sampling a list of all
the sample units in the population is required (a
sampling frame)

Each unit is assigned a number and ‘n’ units are
selected from the population
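A minimal sketch of SRS in Python (the population size, sample size and seed are made up for illustration):

```python
import random

# Hypothetical sampling frame: every unit in the population, numbered
sampling_frame = list(range(1, 501))  # population of 500 units

random.seed(42)  # fixed seed so the example is reproducible
sample = random.sample(sampling_frame, 25)  # SRS of n = 25, without replacement

# Every unit had the same chance of inclusion; no unit appears twice
assert len(set(sample)) == 25
print(sample[:5])
```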
Simple Random Sampling

Advantage
 SRS is a fairly simple and effective method of
obtaining a random sample from a population

Disadvantages
 It can theoretically result in an unbalanced sample
that does not truly represent some sector of the
population.
 It can be an expensive way to sample from a
population which is spread out over a large
geographic area
Point Estimates

It is often required to estimate the value of a
parameter of a population e.g. the mean

Can estimate the value of the population
parameter using the data collected in the
sample

The estimate is referred to as the point
estimate of the parameter as opposed to an
interval estimate which takes a range of
values
Sampling variation

If repeated samples were taken from a population it is unlikely
that the estimates of the population (e.g. estimates of the mean)
would be identical in each sample

However, the estimates should all be close to the true value of
the population parameter and similar to one another

By quantifying the variability of these estimates, information can
be obtained on the precision of the estimate and sampling error
can be assessed

In medical studies, usually only one sample is taken from a
population, as opposed to many

Have to make use of the knowledge of the theoretical distribution
of sample estimates to draw inferences about the population
parameter
Sampling distribution of the
mean

Many repeated samples of size n from a population
can be drawn

If the mean of each sample was calculated a
histogram of the means could be drawn; this would
show the sampling distribution of the mean

It can be shown that:
–
the mean estimates follow a Normal Distribution when the sample size is
large, whatever the distribution of the original data (Central Limit Theorem)
–
if the sample size is small, the estimates of the mean follow a Normal
Distribution provided the data in the population follow a Normal
Distribution
–
the mean of the estimates equals the true population mean
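The Central Limit Theorem can be illustrated by simulation (a sketch; the Exponential population, sample sizes and seed are all made up):

```python
import random
import statistics

random.seed(1)

# Hypothetical, clearly non-Normal population: Exponential with mean 10
population_mean = 10

def sample_mean(n):
    """Draw one sample of size n and return its mean."""
    return statistics.mean(random.expovariate(1 / population_mean)
                           for _ in range(n))

# Many repeated samples of size n = 50; record the mean of each sample
means = [sample_mean(50) for _ in range(2000)]

# The mean of the estimates is close to the true population mean, and a
# histogram of `means` would look approximately Normal despite the skewed
# population (Central Limit Theorem)
print(round(statistics.mean(means), 1))
```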
Sampling distribution of the
mean
–
The variability of the distribution is measured by the
standard error of the mean (SEM)
–
The standard error of the mean is given by:
SEM = σ / √n

where σ is the population standard deviation and
n is the sample size
Best estimates in reality

When we have only one sample (as is the usual reality), the
best estimate of the population mean is the sample mean and
the standard error of the mean is given by:
SEM = s / √n

where s is the standard deviation of the observations in the
sample and n is the sample size
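A sketch of the SEM calculation in Python (the blood-pressure readings are made up):

```python
import math
import statistics

# Hypothetical single sample of systolic blood pressure (mm Hg)
sample = [125, 132, 140, 141, 145, 150, 158, 162, 170, 175]

n = len(sample)
s = statistics.stdev(sample)  # sample standard deviation
sem = s / math.sqrt(n)        # standard error of the mean: s / sqrt(n)

print(round(sem, 2))
```

Note that the SEM is always smaller than the SD for n > 1, and shrinks as the sample size grows.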
Interpreting standard errors

A large standard error means that the
estimate of the population mean is imprecise

A small standard error means that the
estimate of the population mean is precise

A more precise estimate of the population
mean can be obtained if:
–
the size of the sample is increased
–
the data is less variable
Using SD or SEM

SD, the standard deviation, is used to
describe the variation in the data values

SEM, the standard error of the mean, is used
to describe the precision of the sample mean
–
should be used if you are interested in the
mean of data values
Confidence Intervals
Motivation

Why important?
–
used to provide a measure of precision for a
population parameter such as the mean
–
can be used in statistical tests as a method of
testing whether the results are clinically
important
Confidence Intervals

The standard error is not by itself particularly
useful

It is more useful to incorporate the measure of
precision into an interval estimate for the
population parameter – this is known as a
confidence interval

The confidence interval extends either side of
the point estimate by some multiple of the
standard error
A 95% Confidence Interval

A 95% confidence interval for the population mean is
given by:
x̄ − 1.96s/√n ≤ μ ≤ x̄ + 1.96s/√n

If the study were to be repeated many times, this interval would
contain the true population mean on 95% of occasions

Usual interpretation: the range of values within which we are
95% confident that the true population mean lies – although this is not
strictly correct
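A sketch of the interval in Python (made-up data; the multiplier 1.96 assumes a Normal sampling distribution, and for small samples a t-distribution multiplier would give a slightly wider interval):

```python
import math
import statistics

# Hypothetical sample
sample = [4, 6, 8, 10, 12, 14, 16, 9, 11, 10]

n = len(sample)
xbar = statistics.mean(sample)                 # point estimate of the mean
sem = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

# 95% confidence interval: point estimate +/- 1.96 standard errors
lower = xbar - 1.96 * sem
upper = xbar + 1.96 * sem
print(round(lower, 2), round(upper, 2))
```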
Interpretation of CI intervals

A wide interval indicates that the estimate for the
population parameter is imprecise, a narrow one
indicates that the estimate is precise

The upper and lower limits provide a means of
assessing whether the results of a test are clinically
important

Can check whether a hypothesised value for the
population parameter falls within the confidence
interval
Hypothesis Testing
Motivation

Why important?
–
used to quantify a belief against a particular
hypothesis (a statistical test is performed)
e.g. the hypothesis is that the rates of cardiovascular
disease are the same in men and women in the
population
–
a statistical test could be conducted to determine the
likelihood that this is correct, making a decision based
on statistical evidence as to whether the hypothesis
should be rejected or not rejected
Hypothesis Testing

Once data is collected a process called
Hypothesis Testing is used to analyse it

There are specific types of hypothesis tests

Five general stages for hypothesis testing can
be defined:
Stages of Hypothesis Testing
1.
Define the Null & Alternative Hypotheses
under study
2.
Collect data
3.
Calculate the value of the test statistic
4.
Compare the value of the test statistic to
values from a known probability distribution
5.
Interpret the P-value and results
The Null Hypothesis

The Null Hypothesis is tested which assumes
no effect (e.g. the difference in means equals
zero) in the population

E.g. Comparing the rates of cardiovascular
disease in men and women in the population

Null Hypothesis H0: rates of cardiovascular
disease are the same in men and women in
the population
The Alternative Hypothesis

The Alternative Hypothesis is then defined,
this holds if the Null Hypothesis is not true

E.g. Alternative Hypothesis H1: rates of
cardiovascular disease are different in men
and women in the population
Two-tail testing

In the previous example no direction for the difference
in rates was specified

i.e. it was not stated whether men have higher or
lower rates than women

A two-tailed test is often recommended because the
direction is rarely certain in advance, if one does exist

There are circumstances in which a one-tailed test is
relevant
The test statistic

After data collection, the sample values are
substituted into a formula, specific to the type of
hypothesis test

A test statistic is calculated

The test statistic is effectively the amount of evidence
in the data against H0

The larger the value (irrespective of sign), the greater
the evidence

Test statistics follow known theoretical probability
distributions
The P-value

The test statistic is compared to values from a known
probability distribution to obtain the P-value

The P-value is the area in both tails (occasionally
one) of the probability distribution

The P-value is the probability of obtaining our
results, or something more extreme, if the Null
Hypothesis is true

The Null Hypothesis relates to the population rather
than the sample
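For a test statistic that follows a standard Normal distribution, the two-tailed P-value can be computed directly (a sketch):

```python
from statistics import NormalDist

def two_tailed_p(z):
    """P-value: probability of a result at least this extreme if H0 is
    true, i.e. the area in both tails of the standard Normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(round(two_tailed_p(1.96), 3))  # a statistic of 1.96 gives p ≈ 0.05
print(round(two_tailed_p(3.0), 4))   # larger statistics give smaller p
```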
Use of the P-value

A decision must be made as to how much
evidence is required to reject H0 in favour of
H1

The smaller the P-value, the greater the
evidence against H0
Conventional use of the P-value
– rejecting H0

Conventionally, if the P-value < 0.05, there is
sufficient evidence to reject H0

There is only a small chance of the results
occurring if H0 is true
–
H0 is rejected, the results are significant at the
5% level
Conventional use of the P-value
– not rejecting H0

If the P-value > 0.05, there is insufficient
evidence to reject H0
–
H0 is not rejected, the results are not
significant at the 5% level

NB: This does not mean that the null
hypothesis is true, simply that we do not have
enough evidence to reject it!
Using 5%

The choice of 5% is arbitrary; on 5% of occasions H0
will be incorrectly rejected when it is true (Type I error)

In some clinical situations stronger evidence may be
required before rejecting H0
–
e.g. rejecting H0 if the P-value is less than 1% or 0.1%

The chosen cut-off for the P-value is called the
significance level of the test; it must be chosen before
the data is collected
Parametric vs. Non-Parametric
Tests

Hypothesis Tests which are based on knowledge of
the probability distribution that the data follow are
known as parametric tests

Often data does not conform to the assumptions that
underlie these methods

In these cases non-parametric tests are used

Non-Parametric Tests make no assumption about the
probability distribution and generally replace the data
with their ranks
Non-parametric tests

Useful when:
•
sample size is small
•
data is measured on a categorical scale (though can
be used on numerical data as well)

However:
•
they have less power to detect a real difference
than the equivalent parametric tests if all the
assumptions underlying the parametric test are true
•
they lead to decisions rather than generating a true
understanding of the data
Statistical tests

Quantitative data, Parametric tests
–
One-sample t-test
–
Two-sample t-test
–
Paired t-test
–
One-way ANOVA
Statistical tests

Quantitative data, Non-parametric tests
–
Sign test
–
Wilcoxon signed ranks test
–
Mann-Whitney U test
–
Kruskal-Wallis test
Statistical tests

Qualitative data, Non-parametric tests
–
z-test for a proportion
–
McNemar’s test
–
Chi-squared test
–
Fisher’s exact test
Choosing a statistical test

Useful medical statistical books will contain a
flowchart to help decide on the correct
statistical test

Considerations include:
–
Is the data quantitative or qualitative?
–
How many groups of data are there?
–
Can a probability distribution be assumed?
Examples
Paired t-test
Two sample t-test (paired)

Two samples related to each other and one
numerical or ordinal variable of interest

E.g. in a cross-over trial, each patient has two
measurements on the variable, one while
taking treatment, one while taking a placebo

E.g. the individuals in each sample may be
different but linked to each other in some way
Assumptions

The individual differences are Normally
distributed with a given variance

A reasonable sample size has been taken so
that the assumption of Normality can be
checked
Assumptions not satisfied

If the differences do not follow a Normal
distribution, the assumption underlying the t-test is not satisfied

Options:
–
Transform the data
–
Use a non-parametric test such as the Sign
Test or Wilcoxon signed ranks test
Example

A peak expiratory flow rate (PEFR) was taken from a
random sample of 9 asthmatics before and after a
walk on a cold day

The mean of the differences before and after the walk
= 56.11

The standard deviation of the differences = 34.17

Does the walk significantly influence the PEFR?
Example: Stages of a paired t-test
1)
Define the Null and Alternative hypotheses
under study:
Ho: the mean difference = 0
H1: the mean difference ≠ 0
Example: Stages of a paired t-test
2) Collect data before and after the walk
3) Calculate the value of the test statistic, t

t = (56.11 − 0) / (34.17 / √9) = 4.926
4) Compare the value of the t statistic to values from the known
probability distribution
5) The p-value = 0.001
A 95% confidence interval for the true difference is (29.8, 82.4)
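Stage 3 can be reproduced from the summary statistics alone (a sketch in Python):

```python
import math

# Summary statistics from the PEFR example above
mean_diff = 56.11  # mean of the before-after differences
sd_diff = 34.17    # standard deviation of the differences
n = 9              # number of asthmatics

# Paired t statistic: (mean difference - 0) / (SD of differences / sqrt(n))
t = (mean_diff - 0) / (sd_diff / math.sqrt(n))
print(round(t, 3))  # → 4.926
```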
Paired t-test results
Paired Samples Statistics

Pair 1        Mean       N   Std. Deviation   Std. Error Mean
Before Walk   323.8889   9   59.82567         19.94189
After walk    267.7778   9   50.00694         16.66898

Paired Samples Test

Paired Differences (Before Walk − After walk):
Mean = 56.11111, Std. Deviation = 34.17398, Std. Error Mean = 11.39133
95% Confidence Interval of the Difference: (29.84266, 82.37956)
t = 4.926, df = 8, Sig. (2-tailed) = .001

– there is strong evidence to reject the Null Hypothesis in favour of the
Alternative Hypothesis
– there is strong evidence that the walk significantly affects PEFR, the
difference ≠ 0
Mann-Whitney test
Mann-Whitney U test

The Mann-Whitney U test – two independent
samples test

It is equivalent to the Kruskal-Wallis test for
two groups

Mann-Whitney tests that two sampled
populations are equivalent in location
Methodology

The observations from both groups are combined and
ranked, with the average rank assigned in the case of
ties

If the populations are identical in location, the ranks
should be randomly mixed between the two samples

The test calculates the number of times that a score
from group 1 precedes a score from group 2 and the
number of times that a score from group 2 precedes a
score from group 1
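The methodology above can be sketched as a small Python function; the rank-sum form of U used here is equivalent to counting how often scores from one group precede scores from the other (the demonstration data is made up):

```python
def mann_whitney_u(group1, group2):
    """Minimal sketch of the Mann-Whitney U statistic: pool and rank the
    observations (average ranks for ties), then subtract the minimum
    possible rank sum from group 1's rank sum."""
    pooled = sorted(group1 + group2)
    # average rank for each distinct value (handles ties)
    ranks = {}
    for v in set(pooled):
        positions = [i + 1 for i, x in enumerate(pooled) if x == v]
        ranks[v] = sum(positions) / len(positions)
    r1 = sum(ranks[x] for x in group1)        # rank sum for group 1
    n1, n2 = len(group1), len(group2)
    u1 = r1 - n1 * (n1 + 1) / 2
    return min(u1, n1 * n2 - u1)              # report the smaller of U1, U2

# Hypothetical diastolic blood pressure readings for two groups
print(mann_whitney_u([70, 72, 75, 80], [78, 82, 85, 90]))  # → 1.0
```

A small U indicates that the ranks are not well mixed between the groups, i.e. evidence of a difference in location.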
Example

Two samples of diastolic blood pressure were
taken

Is there a difference in the population
locations without assuming a parametric
model for the distributions?

The equality of the population locations (medians) is tested
through the use of a Mann-Whitney test

Are the two populations significantly different?
Example - Mann-Whitney U test
Ranks

Diastolic Blood Pressure 1
Group   N    Mean Rank   Sum of Ranks
1.00    8    7.50        60.00
2.00    9    10.33       93.00
Total   17

Test Statistics (Grouping Variable: Group)

Diastolic Blood Pressure 1:
Mann-Whitney U = 24.000
Wilcoxon W = 60.000
Z = −1.156
Asymp. Sig. (2-tailed) = .248
Exact Sig. [2*(1-tailed Sig.)] = .277 (not corrected for ties)

– there is no evidence to reject the Null Hypothesis in favour of the Alternative
Hypothesis, p-value = 0.277 > 0.05
– there is no evidence of a difference in blood pressure medians
Errors in Hypothesis Testing
Motivation

Why important?
–
when interpreting the results of a statistical
test, there is always a probability of making an
erroneous conclusion (however minimal)
–
it is important to ensure that these
probabilities are minimised
–
possible mistakes are called Type I and Type
II errors
Type I error

Rejecting the Null Hypothesis when it is true

Concluding that there is an effect when in
reality there is none

The maximum chance of making a Type I
error is denoted by alpha α

α is the significance level of the test, we reject
the null hypothesis if the p-value is less than
the significance level
Type II error

Not rejecting the Null Hypothesis when it is
false

Concluding that there is no effect when one
really exists

The chance of making a Type II error is
denoted by beta β

Its complement, 1 − β, is the power of the test
Power of the test

The Power is the probability of rejecting the
Null Hypothesis when it is false

i.e. the probability of making a correct decision

The ideal power of the test is 100%

However there is always a possibility of
making a Type II error
Sample Size
Motivation

Why important?
–
if the sample size is too small, there may be
inadequate test power to detect an important existing
effect/difference and resources will be wasted
–
if the sample size is too large, the study may be
unnecessarily time consuming, expensive and
unethical
–
have to determine a sample size which strikes a
balance between making a Type I or Type II error
–
an optimal sample size can be difficult to establish as
an estimate of the results expected in the study is
required
Calculating an optimal sample
size for a test

The following quantities need to be specified
at the design stage of the investigation in
order to calculate an optimal sample size:
–
The Power
–
Significance level
–
Variability
–
Smallest effect of interest
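Under the common Normal-approximation formula for a two-sided test on a single mean (an assumed, simplified formula — real studies should use a dedicated package or a statistician), these four quantities translate into a sample size as follows:

```python
import math
from statistics import NormalDist

def sample_size(power, alpha, sd, delta):
    """Minimal sketch of an optimal sample-size calculation for detecting
    a mean difference `delta`, using the assumed simplified formula
    n = ((z_{1-alpha/2} + z_{power}) * sd / delta) ** 2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance level
    z_beta = NormalDist().inv_cdf(power)           # power
    return math.ceil(((z_alpha + z_beta) * sd / delta) ** 2)

# Power 90%, 5% significance, SD 1, smallest effect of interest 0.5
print(sample_size(power=0.90, alpha=0.05, sd=1.0, delta=0.5))  # → 43
```

Note how the required n grows as the smallest effect of interest shrinks or the variability increases.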
Summary

Data Types

Descriptive Statistics

Graphical Summaries

Distributions

Sampling and Estimation

Confidence Intervals

Hypothesis Testing (Statistical tests)

Errors in Hypothesis Testing

Sample Size
Book Reference
Medical Statistics at a Glance, 3rd Edition
(Aviva Petrie & Caroline Sabin)
ISBN: 978-1-4051-8051-1