Statistical Analysis
SC504/HS927
Spring Term 2008
Week 17 (25th January 2008):
Analysing data
Recap
• Why
• How
• What
Questions to consider
• What is my question / interest?
• Can the sample tell me about it?
• What are the relevant variables, what are their
characteristics – are they in the form I want
them? (If not, change them into that form).
• What is the most appropriate method to use?
• If it works what will / might it show me? NB
negative results are not necessarily uninformative
• If it doesn’t work – why might that be?
Levels of measurement
• Nominal, e.g., colours: numbers are not meaningful
• Ordinal, e.g., order in which you finished a race:
numbers don’t indicate how far ahead the winner of the
race was
• Interval, e.g., temperature: equal intervals between
each number on the scale but no absolute zero
• Ratio, e.g., time: equal intervals between each number
with an absolute zero.
Univariate analysis
• Measures of central tendency
– Mean = $\bar{x} = \frac{x_1 + x_2 + x_3 + x_4 + \dots + x_n}{n}$
– Median = midpoint of the distribution
– Mode = most common value
Mode – value or category that has the
highest frequency (count)
age      frequency (count)
16-25    12
26-35    20
36-45    32
46-55    27
56+      18

sex      frequency (count)
male     435
female   654
Median – value that is halfway in the
distribution (50th percentile)
age: 12 14 18 21 36 41 42
median = 21
age: 12 14 18 21 36 41
median = (18+21)/2 = 19.5
Mean – the sum of all scores divided by the
number of scores
• What most people call the average
• Mean: ∑x / N
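A minimal Python sketch of the three measures, using only the standard `statistics` module. The two age lists are as read from the median slide and the band counts are those from the mode slide; the variable names are illustrative assumptions, not part of the course material.

```python
import statistics

# Ages as read from the median example: seven values, then the same list without 42.
ages_odd = [12, 14, 18, 21, 36, 41, 42]
ages_even = [12, 14, 18, 21, 36, 41]

mean_odd = statistics.mean(ages_odd)        # sum of all scores / number of scores
median_odd = statistics.median(ages_odd)    # middle value of the sorted scores
median_even = statistics.median(ages_even)  # average of the two middle values

# Age-band counts from the mode example: the mode is the band with the highest count.
age_band_counts = {"16-25": 12, "26-35": 20, "36-45": 32, "46-55": 27, "56+": 18}
modal_band = max(age_band_counts, key=age_band_counts.get)

print(mean_odd, median_odd, median_even, modal_band)
```

Note that the mean of the seven ages is pulled above the median by the older respondents, which is one reason to report both.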
Which One To Use?
           Mode   Median   Mean
Nominal    ✓
Ordinal    ✓      ✓
Interval   ✓      ✓        ✓
Measures of dispersion
– Range = highest value - lowest value
– Variance, $s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1} = \frac{(Y_1 - \bar{Y})^2 + (Y_2 - \bar{Y})^2 + \dots + (Y_n - \bar{Y})^2}{n-1}$
– Standard deviation, s (or SD) = $\sqrt{s^2}$
The standard error of the mean and confidence
intervals
– SE = $\frac{s}{\sqrt{n}}$
Definitions: Measures of Dispersion
• Variance: indicates how far the scores lie
from the mean. So that + and – differences
from the mean don’t just cancel each other
out, we square each difference and add the
squares together (Sum of Squares). This
indicates the total error within the sample,
but the larger the sample the larger the
total error, so we divide by N-1 to get
the average error.
Definitions: Measures of Dispersion
• Standard deviation: because we squared
the error of each score, the variance
actually tells us the average error². To get
the SD we take the square root of the
variance. The SD is a measure of how
representative the mean is. The smaller the
SD, the more representative of your sample
the mean is.
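The steps in the variance and standard deviation definitions can be sketched in a few lines of Python. The scores reuse the ages from the median example purely for illustration, and the hand calculation is checked against the standard library's `statistics.variance` and `statistics.stdev`, which also divide by N-1.

```python
import math
import statistics

scores = [12, 14, 18, 21, 36, 41, 42]  # ages reused for illustration only
n = len(scores)
mean = sum(scores) / n

# Raw differences from the mean cancel out, which is why they are squared.
sum_of_differences = sum(y - mean for y in scores)

# Sum of Squares: squared differences added together (total error).
sum_of_squares = sum((y - mean) ** 2 for y in scores)

variance = sum_of_squares / (n - 1)  # average squared error
sd = math.sqrt(variance)             # back in the original units

print(round(variance, 3), round(sd, 3))
```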
Definitions: Measures of Dispersion
• Standard error: the standard error is the
standard deviation of sample means. If
you take a lot of separate samples and
work out their means, the standard
deviation of these means indicates
the variability between the means of
different samples. The smaller the
standard error, the more representative
your sample mean is likely to be of the
population mean.
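A short Python sketch of both readings of the standard error. The first line applies SE = s/√n to the age figures reported in the SPSS output in these slides (s = 6.20229, N = 3880). The simulation that follows uses a hypothetical normal population (mean 21, SD 6) chosen only for illustration; it draws many samples and compares the standard deviation of their means with σ/√n.

```python
import math
import random
import statistics

# SE from the figures in the descriptive statistics output (s and N for age).
se_from_output = 6.20229 / math.sqrt(3880)

# Hypothetical population, assumed only for this illustration.
POP_MEAN, POP_SD, SAMPLE_SIZE, N_SAMPLES = 21.0, 6.0, 100, 2000
rng = random.Random(17)  # fixed seed so the run is repeatable

sample_means = []
for _ in range(N_SAMPLES):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.mean(sample))

sd_of_means = statistics.stdev(sample_means)  # empirical standard error
theoretical_se = POP_SD / math.sqrt(SAMPLE_SIZE)

print(round(se_from_output, 5), round(sd_of_means, 3), theoretical_se)
```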
Definitions: Measures of Dispersion
• Confidence Intervals: A 95% confidence
interval means that if we collected 100
samples and calculated the mean and
confidence interval for each, we would
expect about 95 of those confidence
intervals to contain the population mean.
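The repeated-sampling idea behind the confidence interval can be illustrated with a small simulation. This is a sketch under stated assumptions: the population (normal, mean 21, SD 6), the sample size and the number of samples are all hypothetical, and the interval uses the 1.96 multiplier from the t-test slide rather than an exact t value.

```python
import math
import random
import statistics

# Hypothetical population and design, assumed only for this illustration.
POP_MEAN, POP_SD = 21.0, 6.0
SAMPLE_SIZE, N_SAMPLES = 50, 1000
rng = random.Random(2008)  # fixed seed so the run is repeatable

covered = 0
for _ in range(N_SAMPLES):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(SAMPLE_SIZE)
    lower, upper = mean - 1.96 * se, mean + 1.96 * se
    if lower <= POP_MEAN <= upper:  # does this interval contain the true mean?
        covered += 1

coverage = covered / N_SAMPLES
print(f"{coverage:.1%} of the intervals contain the population mean")
```

The proportion should come out close to, though rarely exactly, 95%.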
Describing data
• Numbers / tables
– Analyze – Descriptive Statistics- Frequencies
/ Descriptives
• Charts / graphs
– Graphs – Pie / Histogram / Bar
– Using Excel for charts
E.g.
Descriptive Statistics
                            N      Range   Minimum   Maximum   Mean      Std. Deviation
@/what is your exact age?   3880   18.00   12.00     30.00     21.2317   6.20229
Valid N (listwise)          3880

Highest Qualification
                          Frequency   Percent   Valid Percent   Cumulative Percent
Valid   NONE              387         38.7      38.7            38.7
        VOC & O LEVEL     346         34.6      34.6            73.3
        A LEVEL & ABOVE   267         26.7      26.7            100.0
        Total             1000        100.0     100.0

Statistics
Highest Qualification
N        Valid     1000
         Missing   0
Median             1.0000
Mode               .00
Descriptive Statistics
                            N           Minimum     Maximum     Mean                     Std. Deviation   Variance
                            Statistic   Statistic   Statistic   Statistic   Std. Error   Statistic        Statistic
@/what is your exact age?   3880        12.00       30.00       21.2317     .09957       6.20229          38.468
Valid N (listwise)          3880
Bivariate relationships
• Asking research questions involving two
variables:
– Categorical and interval
– Interval and interval
– Categorical and Categorical
• Describing relationships
• Testing relationships
Interval and interval
• Correlation
– To be covered next week with OLS
Categorical (dichotomous) and
interval
• T-tests
– Analyze – compare means – independent
samples t-test – check for equality of
variances
– t value= observed difference between the
means for the two groups divided by the
standard error of the difference
– Significance of t statistic, upper and lower
confidence limits based on the standard error
E.g. (with stats sceli.sav)
• Average age in sample = 37.34
• Average age of single = 31.55
• Average age of partnered = 39.45
• t = -7.9/.74
• Upper bound = -7.9 + (1.96*.74)
• Lower bound = -7.9 - (1.96*.74)
Independent Samples Test (age)

Levene's Test for Equality of Variances: F = .026, Sig. = .872

t-test for Equality of Means
                                                                                                  95% Confidence Interval
                                                                                                  of the Difference
                              t         df        Sig. (2-tailed)   Mean Difference   Std. Error Difference   Lower    Upper
Equal variances assumed       -10.721   998       .000              -7.905            .737                    -9.352   -6.458
Equal variances not assumed   -10.661   470.095   .000              -7.905            .741                    -9.362   -6.448
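The arithmetic behind the t value and the confidence bounds can be retraced in Python from the figures in the output (equal variances assumed). This is a rough check rather than a re-run of the test: it uses the 1.96 multiplier from the slide, so the bounds differ slightly from SPSS, which uses the exact t value for 998 degrees of freedom.

```python
# Group means from the slide (single vs partnered respondents).
mean_single = 31.55
mean_partnered = 39.45
difference_from_means = mean_single - mean_partnered  # about -7.9

# Figures from the Independent Samples Test output, equal variances assumed.
mean_difference = -7.905
se_difference = 0.737

# t = observed difference between the means / standard error of the difference
t_stat = mean_difference / se_difference

# Approximate 95% confidence limits for the difference.
lower_bound = mean_difference - 1.96 * se_difference
upper_bound = mean_difference + 1.96 * se_difference

print(round(t_stat, 2), round(lower_bound, 2), round(upper_bound, 2))
```

Because the whole interval lies below zero, the single respondents are reliably younger on average than the partnered ones.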
Categorical and Categorical
• Chi Square Test
– Tabulation of two variables
– What is the observed variation compared to what
would be expected if the distributions were equal?
– What is the size of that observed variation compared
to the number of cells across which variation could
occur? (the chi-square statistic)
$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$
where $f_o$ is the observed count and $f_e$ the expected count in each cell
– What is its significance? (the chi square distribution
and degrees of freedom)
E.g.
• Are the proportions within employment status similar
across the sexes?
• Could also think about it the other way round

sex * EMPLOYMENT STATUS Crosstabulation
                                       EMPLOYMENT STATUS
                                       EMP FULLTIME   EMP PARTTIME   UNEMP    Total
male     Count                         383            2              100      485
         % within EMPLOYMENT STATUS    62.5%          1.1%           48.5%    48.5%
female   Count                         230            179            106      515
         % within EMPLOYMENT STATUS    37.5%          98.9%          51.5%    51.5%
Total    Count                         613            181            206      1000
         % within EMPLOYMENT STATUS    100.0%         100.0%         100.0%   100.0%
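The two ways of reading the crosstabulation correspond to two ways of percentaging the same counts. A minimal Python sketch, using the counts from the table; the dictionary layout and names are illustrative assumptions:

```python
# Observed counts from the sex * EMPLOYMENT STATUS crosstabulation.
counts = {
    "male":   {"EMP FULLTIME": 383, "EMP PARTTIME": 2,   "UNEMP": 100},
    "female": {"EMP FULLTIME": 230, "EMP PARTTIME": 179, "UNEMP": 106},
}
statuses = ["EMP FULLTIME", "EMP PARTTIME", "UNEMP"]

row_totals = {sex: sum(row.values()) for sex, row in counts.items()}
col_totals = {s: sum(counts[sex][s] for sex in counts) for s in statuses}

# % within EMPLOYMENT STATUS: each cell as a share of its column total.
pct_within_status = {
    sex: {s: 100 * counts[sex][s] / col_totals[s] for s in statuses}
    for sex in counts
}

# % within sex (the other way round): each cell as a share of its row total.
pct_within_sex = {
    sex: {s: 100 * counts[sex][s] / row_totals[sex] for s in statuses}
    for sex in counts
}

for sex in counts:
    print(sex, [round(pct_within_status[sex][s], 1) for s in statuses],
          [round(pct_within_sex[sex][s], 1) for s in statuses])
```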
sex * EMPLOYMENT STATUS Crosstabulation
                           EMPLOYMENT STATUS
                           EMP FULLTIME   EMP PARTTIME   UNEMP    Total
male     Count             383            2              100      485
         Expected Count    297.3          87.8           99.9     485.0
         % within sex      79.0%          .4%            20.6%    100.0%
female   Count             230            179            106      515
         Expected Count    315.7          93.2           106.1    515.0
         % within sex      44.7%          34.8%          20.6%    100.0%
Total    Count             613            181            206      1000
         Expected Count    613.0          181.0          206.0    1000.0
         % within sex      61.3%          18.1%          20.6%    100.0%

Chi-Square Tests
                               Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square             209.840a   2    .000
Likelihood Ratio               265.507    2    .000
Linear-by-Linear Association   48.211     1    .000
N of Valid Cases               987
a. 0 cells (.0%) have expected count less than 5. The
minimum expected count is 87.17.
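The expected counts and the Pearson chi-square statistic can be retraced in Python from the observed counts in the crosstabulation. This is a sketch under stated assumptions: it uses the 1000 cases shown in the crosstabulation, so the statistic comes out close to, but not exactly, the 209.840 in the Chi-Square Tests table, which lists 987 valid cases. The p-value line relies on the fact that, for 2 degrees of freedom only, the upper tail of the chi-square distribution is exp(-x/2).

```python
import math

# Observed counts: rows are male, female; columns are full-time, part-time, unemployed.
observed = [
    [383, 2, 100],
    [230, 179, 106],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for a cell = row total * column total / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square = sum over cells of (fo - fe)^2 / fe.
chi_square = sum(
    (fo - fe) ** 2 / fe
    for obs_row, exp_row in zip(observed, expected)
    for fo, fe in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (rows - 1) * (columns - 1)
p_value = math.exp(-chi_square / 2)  # valid for df = 2 only

print([[round(fe, 1) for fe in row] for row in expected])
print(round(chi_square, 1), df, p_value)
```

Almost all of the statistic comes from the part-time and full-time columns; the unemployed column is very close to its expected counts.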
Now look through your
background notes before starting
the exercises…