Action Research: Data Manipulation
INFO 515, Lecture #4
Glenn Booker

Inferential Statistics Introduction

• We often want to estimate some statistic for a large population (e.g. average age, percent customer satisfaction, average weight, etc.)
• A common way to do so is by taking one or more samples (such as surveys of people, or measuring a handful of nails from a production batch) and measuring that statistic for each sample

Inferential Statistics Introduction

• We distinguish between the characteristics of a sample and the characteristics of the entire population
  - If you survey 30 Drexel students and ask their ages, your average (the sample statistic) probably won't equal the average of all Drexel students (the population statistic)
  - If you repeat that survey many times, you will get a range of sample statistics (the average ages from each sample)

Sampling Terms

• "Statistic" means a value derived from a sample
• "Parameter" means a value derived from the entire population (often not known)
• The "values" may include: means, standard deviations, percentages, correlation coefficients, regression coefficients, etc.

Types of Distributions

• Sample distribution: the distribution of data in our sample for some variable of interest
• Population distribution: the distribution of the whole population for some variable of interest
• Sampling distribution: the theoretical distribution of a statistic across an infinite number of samples; its mean is the population mean

Sampling Distribution of a Statistic

• It is a theoretical distribution comprising the outcomes of drawing infinitely repeated samples
  - It is NOT the distribution of one sample for some variable (e.g. one set of scores)
• We don't actually draw an infinite number of samples!
  - We draw one sample, and it gives us a point estimate for the whole population

Point Estimate vs Interval Estimate

• From a sample statistic (such as a mean or percentage), we try to estimate a population parameter that we don't (maybe even can't) know
  - This is called a point estimate of that parameter
• But we know that samples can sometimes be unrepresentative through random error
  - So we find a confidence interval, to make a more realistic interval estimate

Point Estimate vs Interval Estimate

• A point estimate is more likely to be reported to the public (e.g. average income in a township) because people tend to understand averages and individual numbers
• An interval estimate (such as a confidence interval) is useful in decision making, and is often used with a test of hypothesis (discussed shortly)

Sampling Distribution

• Could the point estimate ever be the true value? Sure! Nature is not usually perverse, but we seldom know the true value
• Do we expect to get the true value always? No! Erroneous estimates would theoretically occur as often as estimates which are right on target
  - Example: the average age of Philly residents. A census would produce the true mean age; samples would produce varying estimates of the true mean age

Central Limit Theorem

• The Central Limit Theorem helps validate sample statistics
  - Many (e.g. 30+) random samples from a given population will have means which are close to the true mean of the entire population
  - The means of those samples will be normally distributed about the true mean of the population

Central Limit Theorem

• When many samples of sufficient size are randomly drawn from a population, the values of the sample means will tend to cluster about the true value of the population mean, and this clustering will be normally distributed
• The mean of a sampling distribution will be a very good estimate of the true population mean (simulated below)

Hypothesis Testing

Hypothesis Testing

• There are Research and Null Hypotheses
• The Research Hypothesis, a.k.a. just the Hypothesis, is the relationship you think exists, or the statement you wish to prove
  - "Computer aided instruction will affect math scores compared to standard instruction"
  - "Library patrons trust electronic sources more or less than paper sources"

Null Hypothesis

• Null (statistical) Hypothesis: the Null Hypothesis is the opposite of the Research Hypothesis; it is generally the assumption that no significant differences exist
  - "There is no difference between computer aided instruction and traditional teaching in their effect on math scores"
  - "Library patrons trust electronic and paper sources equally"

Hypothesis Testing is Weird

• In statistics, you can never prove anything
  - You can only try to reject the null hypothesis
• Because statistics are based on probability, which implies uncertainty, we can never absolutely disprove the null hypothesis
• We can only provide evidence that differences are not due to chance

Risks

• A risk in rejecting, or failing to reject, the Null Hypothesis is that we could be making one of two types of errors:
  - Type I Error: we reject the null when it is actually true, saying something is significantly different when it is not
  - Type II Error: we accept the null when it is really false, saying something is not significantly different when it is
• Honest, this is what they're called. I'm not making it up!

Types of Error

  Your result \ Reality             Null really is True    Null really is False
  You reject the null hypothesis    Type I error           You are correct
  You do not reject the null hyp.   You are correct        Type II error

From Norusis, Guide to Data Analysis, p. 256

Choosing a Significance Level

• We select a significance level (Sig.) as a criterion for rejecting a null hypothesis
  - A significance level of 0.05 or less is generally suitable for the social sciences, and corresponds to a z value of ±1.96
  - A significance level of 0.01 or less is used in clinical trials, drug studies, etc., and corresponds to a z value of ±2.57 (both values are checked below)

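Those z values can be verified with a short sketch (assuming SciPy is available): the two-tailed critical z for a significance level is the 1 - sig/2 quantile of the standard normal distribution.

    from scipy.stats import norm

    for sig in (0.05, 0.01):
        # Two-tailed test: the significance level is split across both tails
        z_crit = norm.ppf(1 - sig / 2)
        print(sig, round(z_crit, 2))
    # 0.05 -> 1.96, and 0.01 -> 2.58 (the slide rounds this to 2.57)
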
Significance versus Confidence

• The significance level is the complement of the confidence level (level of confidence)
  - significance level + confidence level = 1
• A significance level of 0.05 corresponds to a confidence level of 95%
• A significance level of 0.01 corresponds to a confidence level of 99%

Sampling Error

• Sampling errors are deviations of a sample statistic from the true population value, produced by chance
• If deviations aren't caused by chance, they are called systematic errors, or bias
• Bias is introduced through failure to randomize fully, prejudices of interviewers, lies, evasions, memory lapses, defective measuring instruments, and many other causes

Standard Errors

• Standard Error is "A measure of how much the value of a test statistic may vary from sample to sample. It is the standard deviation of the sampling distribution for a statistic. For example, the standard error of the mean is the standard deviation of the sample means." (SPSS)

Standard Errors

• How do you know if the sample you picked is weirder than normal?
• Standard errors are estimates of sampling error

Standard Error of the Mean

• If we are trying to estimate the population mean from a sample, the standard error of the mean is computed thus:

  SEx = s / √n

  - SEx (not a typo) means "Standard Error of x"
  - s is the standard deviation of your sample, and n is your sample size
• It can also be written with the variance, rather than the standard deviation (see the sketch below):

  SEx = √(s² / n)   [s² is the variance]

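A minimal sketch of the formula in Python (assuming NumPy; the sample values are made up for illustration):

    import numpy as np

    sample = np.array([22.0, 25.0, 19.0, 31.0, 27.0, 24.0])
    s = sample.std(ddof=1)      # sample standard deviation (n - 1 denominator)
    n = sample.size

    se_mean = s / np.sqrt(n)    # SEx = s / sqrt(n)
    print(se_mean)              # same as np.sqrt(sample.var(ddof=1) / n)
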
Standard Error of the Mean

• As n increases in the formula s/√n, the standard error gets smaller
  - The larger we make our sample, the smaller our margin of error in estimating the true value
  - If a sample n grew to equal the whole population N, we'd be gathering data from the whole population, and so make no error of estimate at all! (SEx = 0 as n → ∞)

Confidence Interval

• A CI is an estimated range of values with a given probability of including the true population value
  - These are limits within which we would expect repeated sample means to fall
  - The usual probability level is 95%, but we can use 99%
• At 95% confidence, we are saying that: if we took repeated samples at random and constructed confidence intervals around the sample means, the true population mean would be captured in 95 out of 100 of those intervals

Confidence Interval Example

• The average blah for high school teachers 10 years ago was 30
  - I don't believe this is true any more
  - Take a random sample to measure blah, and infer to the population

CI Example

• Given a sample with n = 25, mean = 11, s = 20
• Hypothesis: Is this sample different from a population with a mean of 30?
• Null hypothesis: There is no difference between the population mean and our sample mean (i.e. the average blah could still be 30)

CI Example

• Calculate the standard error of the mean:

  SEx = s/√n = 20/5 = 4

• Calculate the 95% confidence interval:

  CI = mean statistic ± (critical value) * (standard error of the mean statistic)
  CI95 = 11 ± (1.96)*(4)
  CI95 = (3.16, 18.84)

CI Example

• We can find the exact location of our sample mean with this formula for z:

  z = (sample mean - population mean) / (sample standard error)
  z = (11 - 30)/4 = -4.75

• This z is further from zero than -1.96 (the critical z value for the .05 level of significance), so reject the null hypothesis (the arithmetic is sketched below)

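The arithmetic from the last two slides, as a sketch in plain Python (nothing beyond the standard math module is needed):

    import math

    n, sample_mean, s = 25, 11.0, 20.0
    pop_mean = 30.0                  # the hypothesized population mean

    se = s / math.sqrt(n)            # 20 / 5 = 4
    z_crit = 1.96                    # two-tailed critical z at the .05 level

    # 95% confidence interval around the sample mean
    ci_low = sample_mean - z_crit * se
    ci_high = sample_mean + z_crit * se
    print(ci_low, ci_high)           # 3.16 and 18.84

    # Location of the sample mean relative to the hypothesized mean
    z = (sample_mean - pop_mean) / se
    print(z)                         # -4.75, beyond -1.96: reject the null
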
CI Example

• Conclusion: we can reject the null hypothesis, and state that it is unlikely we would get a sample mean as low as 11 if the sample truly came from a population with a mean of 30
  - There is a statistically significant difference between the population mean of 30 and our sample mean
  - At the 5% significance level, there is a five percent chance that we were wrong to reject the null hypothesis

Hypothesis Testing to Compare a Sample to the Population Mean

1. State hypotheses (Action Research, p. 22/23):
   - Research hypothesis
   - Null hypothesis
2. Choose a confidence or significance level
3. Take a sample from the population
4. Find the mean (x) and standard deviation (s) of that sample
5. Find the standard error of the mean, SEx = s/√n

Hypothesis Testing

6. For n > 30, use 'z'; for n <= 30, use 't', where either z or t = (x - m)/SEx  [m is the population mean]
7. Compare z or t to the critical value, based on the significance level (e.g. z = ±1.96)
   - For t, find the critical value with df = n - 1
8. If the actual z or t is further from zero than the critical value, reject the null hypothesis (see the sketch below)

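Steps 6 through 8 as a minimal sketch (assuming SciPy is available; the function name and the exact return values are mine, not from the course materials):

    import numpy as np
    from scipy.stats import norm, t

    def one_sample_test(sample, m, sig=0.05):
        """Two-tailed one-sample test of the null hypothesis: mean == m."""
        n = len(sample)
        x = np.mean(sample)
        se = np.std(sample, ddof=1) / np.sqrt(n)   # SEx = s / sqrt(n)
        stat = (x - m) / se                        # same formula for z and t

        if n > 30:
            crit = norm.ppf(1 - sig / 2)           # critical z
        else:
            crit = t.ppf(1 - sig / 2, df=n - 1)    # critical t, df = n - 1

        return stat, crit, abs(stat) > crit        # True means reject the null
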
Evaluating Critical z or t Value

[Diagram, applies to 2-tail testing: a number line centered at 0. Between the negative and positive critical values of z or t, accept the null hypothesis; beyond either critical value, reject it. The actual z or t value (marked X) is compared against those limits.]

Student's T

• With fairly large samples (25 and up) we can use z values and the two critical values
  - The critical z value depends only on the confidence level
• When n is less than 25 (and some say less than 30) we should use the t distribution instead of z values
  - The critical t value depends on the confidence level and df

Student's T

• Good news: t is calculated the same as z (same formula)
• BUT the critical value of t varies as a function of the degrees of freedom (df)
  - Degrees of freedom df = n - 1
  - Essentially, the z distribution assumes df = ∞ (or at least that df is very large)
  - So the Student's t test corrects for small sample sizes (see the values below)

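A quick look at that correction (assuming SciPy): as df grows, the two-tailed .05 critical t value falls toward the z value of 1.96.

    from scipy.stats import t

    for df in (5, 15, 30, 100, 10000):
        # 97.5th percentile = two-tailed critical value at the .05 level
        print(df, round(t.ppf(0.975, df), 3))
    # 5 -> 2.571, 15 -> 2.131, 30 -> 2.042, 100 -> 1.984, 10000 -> 1.960
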
T Example

• Set the significance level at .05
• Last year's average DVD price: $20.00
• A random sample of n = 16 titles had a mean of $24.00 and a standard deviation of $3.00; is this different?
  - Calculate the standard error = s/√n = 3/4 = 0.75
  - Calculate t = (sample mean - population mean)/standard error
  - t = (24 - 20)/0.75 = 5.33

T Example

• Calculate df = 16 - 1 = 15 degrees of freedom
• Find the critical t value (Action Research handout, p. 40/41)
• The critical value here is 2.131 (df = 15, two-tail significance of .05)
• Our actual t value of 5.33 is greater than our critical value, so reject the null hypothesis (the sample cost is different than last year's average); the whole test is sketched below

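The whole DVD-price test in plain Python, following the steps above:

    import math

    n, sample_mean, s = 16, 24.0, 3.0
    pop_mean = 20.0                    # last year's average price

    se = s / math.sqrt(n)              # 3 / 4 = 0.75
    t_stat = (sample_mean - pop_mean) / se
    print(t_stat)                      # 5.33

    t_crit = 2.131                     # two-tailed .05 critical t, df = 15
    print(abs(t_stat) > t_crit)        # True: reject the null hypothesis
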
1-tail versus 2-tail?

• A one-tail test is used when a specific direction of difference is to be tested
  - "The sample DVD price is not greater than last year's"
• A two-tail test is used when no particular direction of difference is to be tested
  - "The sample DVD price is not different than last year's" (could be greater or less than)
• Use a two-tail test most of the time to find the critical t value

Plotting Descriptive Statistics

More Descriptive Statistics

• Now we look at how one statistic may vary, in more detail
• Boxplots and stem-and-leaf diagrams help tell more about the distribution of a variable, especially for cases where the distribution isn't very normal

Boxplots

• Boxplots, also known as box-and-whisker displays, help show odd, lopsided distributions of data
• They show:
  - The median (not mean) value, inside a box, where
  - The box covers the 25th through 75th percentiles, and its top and bottom edges are called "hinges"

Boxplots

• They also show:
  - The largest and smallest values which are not outliers (these are the whiskers on the box)
  - Individual values which are more than 1.5 box-lengths ("Outliers") or more than 3.0 box-lengths ("Extremes") from the 25th and 75th percentiles (these get separate symbols: "0" for outliers and "*" for extremes)

Boxplot Example from MedLibs

[Figure: side-by-side boxplots of five z-scored variables from the MedLibs data (Annual circ, Annual inho, Student pop, Annual expe, Staff size), N = 26 each, on a z-score scale from about -2 to 5. Annotations point out an Extreme (> 3 box-lengths), an Outlier (> 1.5 box-lengths), the largest value, the 75th percentile, the median, the 25th percentile, and the smallest value; individual flagged cases are labeled with letters such as D, E, Q, and Z.]

Boxplot of Exam Scores

[Figure: boxplot of EXAM scores, N = 40, on a scale from about 30 to 110; two low values (cases 10 and 33) are flagged below the lower whisker.]

Generating Boxplots

• Use the menu Graphs / Boxplot…
• Select a Simple example for "Summaries of separate variables", then click Define
• Select the variables to be plotted (move them into "Boxes Represent")
• Click "OK" (a rough non-SPSS equivalent is sketched below)

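The slide above walks through SPSS; for comparison, here is a rough equivalent in Python with matplotlib (an assumption on my part, not part of the course tooling; the scores are made up). Matplotlib's default whiskers likewise extend 1.5 box-lengths, with points beyond them drawn individually:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(seed=1)
    scores = rng.normal(loc=75, scale=12, size=40)   # made-up exam scores

    plt.boxplot(scores)            # median line, hinges, whiskers, outliers
    plt.xticks([1], ["EXAM"])
    plt.ylabel("Score")
    plt.show()
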
Stem-and-Leaf Diagram

• The stem-and-leaf diagram is a sideways histogram which provides additional information about specific data values, and flags extreme values
• This only works easily for up to a few dozen data points; maybe a couple hundred
• Imagine taking your data and rounding it off to two significant digits
  - 3.567 becomes 3.6; 1.213 becomes 1.2; etc.

Stem-and-Leaf Diagram

• Then move the decimal point in between the two significant digits, remembering how far you had to move it
• The first significant digit is called the stem
• The second significant digit of each data point is the leaf, and the leaves are grouped together when they share a stem

Stem-and-Leaf Diagram

• For example, for ages ranging from the 60's to the 80's, make the first digit the stem and the second digit the leaf (86 becomes a stem of 8 and a leaf of 6)
• Then group the stems and provide a single row of numbers to list every data value with that stem, so the data 82, 85, 89 become:

  8 . 259   (stem 8, & leaves 2, 5, and 9)

  as sketched in code below

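A minimal sketch of that grouping in Python (my own helper function, not an SPSS feature):

    from collections import defaultdict

    def stem_and_leaf(values, stem_width=10):
        stems = defaultdict(list)
        for v in sorted(values):
            # tens digit is the stem, units digit is the leaf
            stems[int(v) // stem_width].append(int(v) % stem_width)
        for stem in sorted(stems):
            leaves = "".join(str(leaf) for leaf in stems[stem])
            print(f"{stem} . {leaves}")

    stem_and_leaf([61, 64, 73, 77, 78, 82, 85, 89])
    # 6 . 14
    # 7 . 378
    # 8 . 259
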
Stem-and-Leaf Diagram

• To show that a stem of '8' means '80', define the "stem width" as a multiplier for the stem to get its true value
  - So if the stem and leaf diagram shows a value of 8.2, with a stem width of 10, then the actual data value was 8.2 * 10 = 82
  - Extreme values are reported separately as "Extremes" with their approximate range

EXAMPLE Stem-and-Leaf Plot

  Frequency    Stem &  Leaf
     2.00   Extremes    (=<46)
     1.00        5 .  2
      .00        5 .
     3.00        6 .  013
     3.00        6 .  678
     4.00        7 .  0004
    11.00        7 .  56677778899
     8.00        8 .  11112234
     4.00        8 .  6778
     3.00        9 .  224
     1.00        9 .  8

  Stem width:  10
  Each leaf:   1 case(s)

(The "8 . 11112234" row represents eight data points: four are 81, two are 82, and one each is 83 and 84.)

Stem-and-Leaf Diagram

• If a very large number of data points are close together, additional codes can be used in place of the period to break up the bars:
  - "*" means leaves of 0 or 1
  - "t" means 2 or 3
  - "f" means 4 or 5
  - "s" means 6 or 7
  - "." (period) means 8 or 9

Stem-and-Leaf Diagram

• So the stem and leaf of:

  1 t  23
  1 f  445
  1 .  899

  would mean data values of (for width = 10) 12, 13, 14, 14, 15, 18, 19, and 19, instead of stuffing them all on one row:

  1 .  23445899

Frequency Polygon

• The polygon is constructed much the same as a histogram, including the "¾ high rule," except a dot instead of a bar is placed at the frequency with which each value occurs, and a line is drawn between the points (a code sketch follows below)
• You can see this polygon (next slide) shows the actual value of the score, its frequency, and the shape of the distribution, but the intervals are unequal between data points
  - If you look at the horizontal (X) axis, you can see, for example, the numbers go from 101 to 110, and then from 120 to 123 (unequal intervals!)

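A sketch of such a polygon in Python with matplotlib (again an assumption, not course tooling; the I.Q. values are invented to mimic the next slide):

    from collections import Counter
    import matplotlib.pyplot as plt

    iqs = [101, 110, 120, 120, 123, 125, 125, 126,
           128, 128, 130, 132, 135, 140]        # made-up class I.Q. scores

    counts = Counter(iqs)
    xs = sorted(counts)
    ys = [counts[x] for x in xs]

    plt.plot(xs, ys, marker="o")   # a dot per score, lines between the dots
    plt.xlabel("Class I.Q.")
    plt.ylabel("Frequency")
    plt.show()
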
Frequency Polygon for Class IQ

[Figure: "Frequency Polygon of Class I.Q." Frequency (0 to 4.5) is plotted against Class I.Q. scores of 101, 110, 120, 123, 125, 126, 128, 130, 132, 135, and 140, with dots joined by lines. Note the unequal spacing of values along the X axis.]

Histogram vs Frequency Polygon

• How do you decide whether to use a histogram or a frequency polygon?
  - A histogram is used for discrete data (counting units, like the bookmobile stops)
  - A frequency polygon is used for continuous data (measurement, like the length or I.Q. distribution)
• This is a case where the type of tool used depends on the type of data: discrete (ordinal or nominal) or continuous (ratio or interval)