Download The 5 per cent trimmed mean - United Nations Office on Drugs and

GAP Toolkit 5
Training in basic drug abuse data management
and analysis
Training session 9
Data analysis: Explore
Objectives
• To define a standard set of descriptive statistics
used to analyse continuous variables
• To examine the Explore facility in SPSS
• To introduce the analysis of a continuous variable
according to values of a categorical variable, an
example of bivariate analysis
• To introduce further SPSS Help options
• To reinforce the use of SPSS syntax
SPSS Descriptive Statistics
• Analyse/Descriptive Statistics/Frequencies
• Analyse/Descriptive Statistics/Explore
• Analyse/Descriptive Statistics/Descriptives
Exercise: continuous variable
• Generate a set of standard summary statistics for the
continuous variable Age
Explore: Age
Explore: Descriptive Statistics
Descriptives: AGE
Statistic                                       Value      Std. Error
Mean                                            31.78      .315
95% Confidence Interval for Mean: Lower Bound   31.16
95% Confidence Interval for Mean: Upper Bound   32.40
5% Trimmed Mean                                 31.31
Median                                          31.00
Variance                                        154.614
Std. Deviation                                  12.434
Minimum                                         1
Maximum                                         77
Range                                           76
Interquartile Range                             20.00
Skewness                                        .427       .062
Kurtosis                                        -.503      .124
Exercise: Help
• What’s This?
• Results Coach
• Case Studies
Measures of central tendency
• Most commonly:
– Mode
– Median
– Mean
• 5 per cent trimmed mean
The mode
• The mode is the most frequently occurring value in a
dataset
• Suitable for nominal data and above
• Example:
– For the variable recording the first most frequently used drug, the mode is
Alcohol, with 717 cases, approximately 46 per cent of valid responses
Bimodal
• Describes a distribution
• Two categories have a large number of cases
• Example:
– The distribution of Employment is bimodal, employment and
unemployment having a similar number of cases and more
cases than the other categories
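As a minimal sketch of how a mode can be found by counting, the Python below tallies categories and picks the most frequent one. Both lists are hypothetical, invented for illustration; they are not the survey data behind the slides.

```python
from collections import Counter

# Hypothetical responses, invented for illustration (not the survey data).
first_drug = ["Alcohol", "Cannabis", "Alcohol", "Cocaine", "Alcohol", "Cannabis"]

# Count each category and take the most frequent one: the mode.
counts = Counter(first_drug)
mode_value, mode_count = counts.most_common(1)[0]
share = 100 * mode_count / len(first_drug)
print(mode_value, mode_count, f"{share:.0f} per cent")

# Hypothetical employment data with two similarly large categories,
# the situation the slides describe as bimodal.
employment = (["Employed"] * 40 + ["Unemployed"] * 38
              + ["Student"] * 12 + ["Other"] * 10)
top_two = Counter(employment).most_common(2)
print(top_two)
```

Counting works for nominal data because it needs no ordering or arithmetic on the values.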
The median
• The middle value when the data are ordered from low to
high is the median
• Half the data values lie below the median and half
above
• The data have to be ordered, so the median is not
suitable for nominal data, but it is suitable for ordinal
levels of measurement and above
Example: median
• Seizures of opium in Germany, 1994-1998
(Kilograms)
Year       1994   1995   1996   1997   1998
Seizure      36     15     45     42    286
Source: United Nations (2000). World Drug Report 2000 (United Nations publication,
Sales No. GV.E.00.0.10).
• Sort the seizure data in ascending order
Year       1995   1994   1997   1996   1998
Seizure      15     36     42     45    286
Rank          1      2      3      4      5
• The middle value is the median; the median annual
seizures of opium for Germany between 1994 and 1998
was 42 kilograms
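The same steps (sort, then take the middle value) can be sketched in Python using the seizure figures from the slide. The variable names are illustrative only.

```python
import statistics

# Seizures of opium in Germany, 1994-1998 (kilograms), from the slide.
seizures_kg = {1994: 36, 1995: 15, 1996: 45, 1997: 42, 1998: 286}

# Step 1: sort the values in ascending order.
ordered = sorted(seizures_kg.values())

# Step 2: with an odd number of values, the median is the middle one.
middle_index = len(ordered) // 2
median_kg = ordered[middle_index]

# Cross-check against the standard library.
library_median = statistics.median(ordered)
print(ordered, median_kg, library_median)
```

With an even number of values there is no single middle value; `statistics.median` then averages the two middle values.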
The mean
• Add the values in the data set and divide by the number
of values
• The mean is only truly applicable to interval and ratio
data, as it involves adding the values
• It is sometimes applied to ordinal data or ordinal scales
constructed from a number of Likert scales, but this
requires the assumption that the difference between the
values in the scale is the same, e.g. between 1 and 2 is
the same as between 5 and 6
Example: mean
• Seizures of opium in Germany, 1994-1998
Year       1994   1995   1996   1997   1998
Seizure      36     15     45     42    286
• Sample size = 5
• 36 + 15 + 45 + 42 + 286 = 424
• 424/5 = 84.8
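The arithmetic above can be reproduced in a few lines of Python; this is only a sketch of the calculation shown on the slide.

```python
# Seizures of opium in Germany, 1994-1998 (kilograms), from the slide.
seizures_kg = [36, 15, 45, 42, 286]

total = sum(seizures_kg)   # add the values
n = len(seizures_kg)       # sample size
mean_kg = total / n        # divide by the number of values
print(total, n, mean_kg)
```

Note how the single large value for 1998 pulls the mean (84.8) well above the median (42) for the same data.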
The 5 per cent trimmed mean
• The 5 per cent trimmed mean is the mean calculated on
the data set with the top 5 per cent and bottom 5 per
cent of values removed
• An estimator that is more resistant to outliers than the
mean
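A simplified sketch of a trimmed mean in Python follows. It drops a whole number of values from each end of the sorted data; statistical packages such as SPSS may treat the fractional part of the 5 per cent differently, so results can differ slightly. The function name and the data are hypothetical, chosen so that one extreme value is removed from each end.

```python
def trimmed_mean(values, percent=5):
    """Mean after dropping the lowest and highest `percent` of values.

    Simplification: drops only whole values, (percent * n) // 100 per end.
    """
    ordered = sorted(values)
    k = (percent * len(ordered)) // 100
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical data: the values 1 to 19 plus one outlier of 100 (n = 20).
data = list(range(1, 20)) + [100]

plain = sum(data) / len(data)   # ordinary mean, pulled up by the outlier
trimmed = trimmed_mean(data)    # one value removed from each end
print(plain, trimmed)
```

Here the ordinary mean is 14.5 while the trimmed mean is 10.5, which illustrates the resistance to outliers.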
95 per cent confidence interval for the mean
• An indication of the expected error (precision) when
estimating the population mean with the sample mean
• In repeated sampling, confidence intervals calculated in
this way around each sample mean will contain
the population mean 95 times out of 100
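As a rough check, the interval reported in the Explore output for Age can be reproduced from the mean (31.78) and its standard error (.315). The sketch below assumes a normal approximation, which is reasonable only for large samples; SPSS itself may use the t distribution.

```python
from statistics import NormalDist

# Values taken from the Explore output for Age.
mean_age = 31.78
std_error = 0.315

# Two-sided 95 per cent interval: mean +/- z * standard error.
# Assumption: normal approximation (large sample), so z is about 1.96.
z = NormalDist().inv_cdf(0.975)
half_width = z * std_error
lower = mean_age - half_width
upper = mean_age + half_width
print(f"{lower:.2f} to {upper:.2f}")
```

This gives 31.16 to 32.40, matching the lower and upper bounds in the output.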
Measures of dispersion
• The range
• The inter-quartile range
• The variance
• The standard deviation
The range
• A measure of the spread of the data
• Range = maximum – minimum
Quartiles
• 1st quartile: 25 per cent of the values lie below the value
of the 1st quartile and 75 per cent above
• 2nd quartile: the median: 50 per cent of values below
and 50 per cent of values above
• 3rd quartile: 75 per cent of values below and 25 per
cent of the values above
Inter-quartile range
• IQR = 3rd Quartile – 1st Quartile
• The inter-quartile range measures the spread or range
of the mid 50 per cent of the data
• Ordinal level of measurement or above
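The range and inter-quartile range can be sketched in Python with the opium seizure data used in the earlier examples. Several conventions exist for locating quartiles; the "inclusive" method of the standard library is an assumption made here, and SPSS may report different quartile values for the same data.

```python
import statistics

# Seizures of opium in Germany, 1994-1998 (kilograms).
seizures_kg = [36, 15, 45, 42, 286]

# Range = maximum - minimum.
data_range = max(seizures_kg) - min(seizures_kg)

# Quartiles: the three cut points that split the ordered data into quarters.
# Assumption: "inclusive" interpolation; other methods give other values.
q1, q2, q3 = statistics.quantiles(seizures_kg, n=4, method="inclusive")

# IQR = 3rd quartile - 1st quartile.
iqr = q3 - q1
print(data_range, q1, q2, q3, iqr)
```

The range (271) is dominated by the 1998 value, whereas the inter-quartile range (9) describes only the middle half of the data.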
Variance
• The average squared difference from the mean
• Measured in units squared
• Requires interval or ratio levels of measurement
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
Standard deviation
• The square root of the variance
• Returns the units to those of the original variable
$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
Example: standard deviation and variance
Seizures of opium in Germany, 1994-1998
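Year     Seizure   Deviation   Squared deviation
1994          36       -48.8             2381.44
1995          15       -69.8             4872.04
1996          45       -39.8             1584.04
1997          42       -42.8             1831.84
1998         286       201.2            40481.44
Total        424           0            51150.8

Count: 5
Mean: 424/5 = 84.8
Variance: 51150.8/5 = 10230
Standard deviation: 101
• Here the sum of squared deviations is divided by the count (n = 5);
dividing by n – 1 = 4 instead gives a variance of 12787.7 and a
standard deviation of about 113.1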
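The worked example can be reproduced step by step in Python. The sketch computes the variance with both divisors, n and n – 1, since the two are easily confused; the variable names are illustrative.

```python
import math

# Seizures of opium in Germany, 1994-1998 (kilograms).
seizures_kg = [36, 15, 45, 42, 286]

n = len(seizures_kg)
mean_kg = sum(seizures_kg) / n

# Deviation of each value from the mean, squared, then summed.
squared_deviations = [(x - mean_kg) ** 2 for x in seizures_kg]
sum_sq = sum(squared_deviations)

# Dividing by n (as in the worked example) and by n - 1 (sample formula).
variance_n = sum_sq / n
variance_n_minus_1 = sum_sq / (n - 1)

# The standard deviation is the square root of the variance.
sd_n = math.sqrt(variance_n)
sd_n_minus_1 = math.sqrt(variance_n_minus_1)
print(round(sum_sq, 2), round(variance_n), round(sd_n))
print(round(variance_n_minus_1, 1), round(sd_n_minus_1, 1))
```

With only five values the choice of divisor matters noticeably; for large samples the two results are nearly identical.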
Distribution or shape of the data
• The normal distribution
• Skewness:
– Positive or right-hand skewed
– Negative or left-hand skewed
• Kurtosis:
– Platykurtic
– Mesokurtic
– Leptokurtic
The normal distribution
[Figure: symmetrical bell-shaped curve of f(X) against X, with the mean, median and mode at the same central point]
• Symmetrical data: the mean, the median and the mode
coincide
Right-hand skew (+)
[Figure: curve of f(X) against X with a long right-hand tail; from left to right: mode, median, mean]
• Right-hand skew: the extreme large values drag the
mean towards them
Left-hand skew (-)
[Figure: curve of f(X) against X with a long left-hand tail; from left to right: mean, median, mode]
• Left-hand skew: the extreme small values drag the
mean towards them
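A quick, informal check of skew direction is to compare the mean with the median. The sketch below applies this rule of thumb to the opium seizure data; it is not the skewness statistic that SPSS reports, only an illustration of the mean being dragged towards extreme values.

```python
import statistics

# Seizures of opium in Germany, 1994-1998 (kilograms).
seizures_kg = [36, 15, 45, 42, 286]

mean_kg = statistics.mean(seizures_kg)
median_kg = statistics.median(seizures_kg)

# Rule of thumb only: extreme values pull the mean towards them.
if mean_kg > median_kg:
    direction = "right-hand (positive) skew"
elif mean_kg < median_kg:
    direction = "left-hand (negative) skew"
else:
    direction = "roughly symmetrical"
print(mean_kg, median_kg, direction)
```

Here the 1998 value of 286 pulls the mean (84.8) far above the median (42), suggesting right-hand skew.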
Bivariate analysis
• Continuous Dependent Variable
• Categorical Independent Variable
Explore
Explore: Options button
Explore: Plots button
Explore: Statistics button
Descriptives: AGE by Gender
                                                Male                   Female
Statistic                                       Value     Std. Error   Value     Std. Error
Mean                                            31.43     .340         33.39     .789
95% Confidence Interval for Mean: Lower Bound   30.76                  31.84
95% Confidence Interval for Mean: Upper Bound   32.09                  34.94
5% Trimmed Mean                                 31.03                  32.77
Median                                          30.00                  33.00
Variance                                        144.286                193.593
Std. Deviation                                  12.012                 13.914
Minimum                                         1                      14
Maximum                                         70                     77
Range                                           69                     63
Interquartile Range                             19.00                  23.00
Skewness                                        .370      .069         .472      .138
Kurtosis                                        -.573     .138         -.602     .376
[Figure: histogram of Age (horizontal axis) against Frequency (vertical axis) for Male: Std. Dev = 12.01, Mean = 31.4, N = 1247]
[Figure: histogram of Age (horizontal axis) against Frequency (vertical axis) for Female: Std. Dev = 13.91, Mean = 33.4, N = 311]
Boxplot of Age vs Gender
[Figure: boxplots of Age (vertical axis) by Gender (horizontal axis): Male, N = 1247; Female, N = 311. Annotations mark the median, the inter-quartile range and an outlier labelled 183]
Syntax: Explore
* Explore age separately for each gender, labelling cases by id.
EXAMINE
  VARIABLES=age BY gender /ID=id
  /PLOT BOXPLOT HISTOGRAM
  /COMPARE GROUP
  /STATISTICS DESCRIPTIVES
  /CINTERVAL 95
  /MISSING LISTWISE
  /NOTOTAL.
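For readers without SPSS, the idea behind this command (summarise a continuous variable separately for each value of a categorical variable) can be sketched in Python. The records below are hypothetical, invented for illustration, and the sketch covers only a few of the statistics that Explore produces.

```python
import statistics
from collections import defaultdict

# Hypothetical (gender, age) records, invented for illustration.
records = [
    ("Male", 22), ("Male", 30), ("Male", 41), ("Male", 27),
    ("Female", 19), ("Female", 33), ("Female", 47),
]

# Group the continuous variable (age) by the categorical variable (gender).
ages_by_gender = defaultdict(list)
for gender, age in records:
    ages_by_gender[gender].append(age)

# Compute a small set of descriptive statistics for each group.
summary = {}
for gender, ages in ages_by_gender.items():
    summary[gender] = {
        "n": len(ages),
        "mean": statistics.mean(ages),
        "median": statistics.median(ages),
        "std_dev": statistics.stdev(ages),  # sample formula, n - 1 divisor
        "minimum": min(ages),
        "maximum": max(ages),
    }

for gender, stats in summary.items():
    print(gender, stats)
```

Comparing the per-group rows side by side is the tabular counterpart of the grouped histograms and boxplots.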
Summary
• Measures of central tendency
• Measures of variation
• Quantiles
• Measures of shape
• Bivariate analysis for a categorical independent variable and continuous dependent variable
• Histograms
• Boxplots