Module 7: Comparing Datasets and Comparing a Dataset with a Standard
How different is enough?
Concepts
• Independence of each data point
• Test statistics
• Central Limit Theorem
• Standard error of the mean
• Confidence interval for a mean
• Significance levels
• How to apply in Excel

module 7
Independent Measurements
• Each measurement must be independent (shake up basket of tickets)
• Examples of non-independent measurements
– Public responses to questions (one result affects next person’s answer)
– Samplers too close together, so air flows affected
Test Statistics
• A test statistic is a number calculated from the data
• In Student’s t test, for example, the test statistic is t
• If t is >= 1.96 (the cutoff for a large sample) and
– the population is normally distributed,
– you’re in the right tail of the curve,
– outside the inner portion that holds 95% of the data, which lies symmetrically between right and left (t = 1.96 on right, -1.96 on left)
Test statistics correspond to significance levels
• “P” stands for percentile
• Pth percentile is where p of data falls below, and 1-p falls above

Two Major Types of Questions
• Comparing mean against a standard
– Does air quality here meet NAAQS?
• Comparing two datasets
– Is air quality different in 2006 than 2005?
– Better?
– Worse?
Comparing Mean to a Standard
Did air quality meet CARB annual standard of 12 µg/m3?

year   Ft Smith avg   Ft Smith Min   Ft Smith Max   N_Fort Smith
‘05    14.78          0.1            37.9           77
Central Limit Theorem (magic!)
• Even if underlying population is not normally distributed
• If we repeatedly take datasets
• These different datasets have means that cluster around true mean
• Distribution of these means is approximately normally distributed (for large enough datasets)!
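The Central Limit Theorem can be checked with a small simulation. The sketch below is only an illustration: the skewed (exponential) population, the seeds, the sample size of 77, and the helper name `sample_means` are all assumptions chosen for the example, not part of the module.

```python
import random
import statistics

def sample_means(population, sample_size, num_samples, seed=0):
    """Draw repeated random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(num_samples)
    ]

# Hypothetical, strongly skewed population (exponential) with a mean near 10.
pop_rng = random.Random(42)
population = [pop_rng.expovariate(1 / 10) for _ in range(100_000)]

means = sample_means(population, sample_size=77, num_samples=2_000)
true_mean = statistics.fmean(population)

print(f"population mean:      {true_mean:.2f}")
print(f"mean of sample means: {statistics.fmean(means):.2f}")
print(f"spread of the means:  {statistics.stdev(means):.2f}")
```

Even though the individual values are far from normal, the sample means should cluster tightly around the population mean, and a histogram of `means` would look roughly bell shaped.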

Magic Concept #2: Standard Error of the Mean
• Represents uncertainty around mean
• As sample size N gets bigger, error gets smaller!
• The bigger the N, the more tightly you can estimate mean
• LIKE a standard deviation, but it describes the uncertainty in the mean of YOUR sample rather than the spread of individual values
$SE = \sigma / \sqrt{N}$
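To see how the standard error shrinks as N grows, here is a minimal sketch. It assumes the standard error is computed as sigma divided by the square root of N, and it borrows the sigma of 8.66 quoted for the Ft Smith data in this module; the list of N values is arbitrary.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

sd = 8.66  # sigma quoted for the Ft Smith data in this module
for n in (10, 77, 300, 1000):
    print(f"N = {n:>4}: standard error = {standard_error(sd, n):.2f}")
```

Because N sits under a square root, quadrupling the number of measurements only halves the standard error.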
For a “large” sample (N > 60), or when very
close to a normal distribution…
Confidence interval for population mean is:
$\bar{x} \pm Z \frac{s}{\sqrt{n}}$
Choice of z determines 90%, 95%, etc.
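The effect of the choice of z can be sketched in a few lines. The Z values are the ones tabulated in this module; the sample statistics (mean 20.0, s 6.0, n 100) and the function name `z_confidence_interval` are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math

# Z values as listed in this module's t-versus-Z comparison.
Z_VALUES = {"90%": 1.65, "95%": 1.96, "99%": 2.58}

def z_confidence_interval(mean, s, n, z):
    """Return (low, high) for mean +/- z * s / sqrt(n)."""
    margin = z * s / math.sqrt(n)
    return mean - margin, mean + margin

# Hypothetical sample: mean 20.0, standard deviation 6.0, n = 100.
for level, z in Z_VALUES.items():
    low, high = z_confidence_interval(20.0, 6.0, 100, z)
    print(f"{level} CI: {low:.2f} to {high:.2f}")
```

A higher confidence level uses a larger z, so the interval gets wider.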
For a “Small” Sample
Replace Z value with a t value to get…
$\bar{x} \pm t \frac{s}{\sqrt{n}}$
…where “t” comes from Student’s t
distribution, and depends on sample size
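A small-sample interval might be sketched as follows. The six measurements are made up for illustration, and the t values are the 5 d.f. entries tabulated in this module; the sketch assumes the usual convention that degrees of freedom equal the sample size minus one, so six points give 5 d.f.

```python
import math
import statistics

# t values for 5 degrees of freedom, as tabulated in this module.
T_VALUES_5_DF = {"90%": 2.015, "95%": 2.571, "99%": 4.032}

def t_confidence_interval(data, t_value):
    """Return (low, high) for mean +/- t * s / sqrt(n)."""
    n = len(data)
    mean = statistics.fmean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    margin = t_value * s / math.sqrt(n)
    return mean - margin, mean + margin

# Hypothetical sample of six measurements, so degrees of freedom = 6 - 1 = 5.
sample = [11.2, 14.5, 9.8, 16.1, 13.4, 12.0]
low, high = t_confidence_interval(sample, T_VALUES_5_DF["95%"])
print(f"95% CI: {low:.2f} to {high:.2f}")
```

For a different sample size the t value would have to be looked up again, since it changes with the degrees of freedom.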
Student’s t Distribution vs. Normal Z Distribution
[Figure: T-distribution and Standard Normal Z distribution. Density (0.0 to 0.4) plotted against value (-5 to 5) for the Z distribution and for T with 5 d.f.]
Compare t and Z Values

Confidence level   t value with 5 d.f.   Z value
90%                2.015                 1.65
95%                2.571                 1.96
99%                4.032                 2.58
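The Z column can be recovered from the standard normal curve itself. This sketch uses Python's standard-library `NormalDist` and assumes two-sided intervals, so half of the leftover probability sits in each tail; the helper name `z_for_confidence` is hypothetical. The t values are copied from the table for comparison, because the standard library has no t distribution.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided Z value: leaves (1 - level) / 2 in each tail."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)

# t values for 5 d.f. copied from the table, for comparison.
t_5_df = {0.90: 2.015, 0.95: 2.571, 0.99: 4.032}

for level, t in t_5_df.items():
    z = z_for_confidence(level)
    print(f"{level:.0%}: Z = {z:.3f}, t (5 d.f.) = {t:.3f}")
```

At every confidence level the t value for 5 d.f. is larger than Z, which is why small samples give wider intervals.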
What happens as sample gets larger?
[Figure: T-distribution and Standard Normal Z distribution. Density (0.0 to 0.4) plotted against value (-5 to 5) for the Z distribution and for T with 60 d.f.]
What happens to CI as sample gets larger?
For large samples, Z and t values become almost identical, so CIs are almost identical:
$\bar{x} \pm Z \frac{s}{\sqrt{n}}$
$\bar{x} \pm t \frac{s}{\sqrt{n}}$
First, graph and review data
• Use box plot add-in
• Evaluate spread
• Evaluate how far apart mean and median are
• (assume sampling design and QC are good)
Excel Summary Stats
1. Use the box-plot add-in
[Figure: box plot of Ft Smith data, vertical axis from 0 to 40]
2. Calculate summary stats

Ft Smith, N=77
Min   25th   Median   75th   Max    Mean   SD
0.1   7.5    13.7     18.1   37.9   14.8   8.718
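The same summary fields can be computed outside Excel. This is a sketch on made-up data, since the 77 Ft Smith measurements are not reproduced in this module; the function name `summary_stats` and the choice of the "inclusive" quartile method are assumptions, and other tools may interpolate quartiles slightly differently.

```python
import statistics

def summary_stats(data):
    """Return the summary fields listed on the slide for one dataset."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "N": len(data),
        "Min": min(data),
        "25th": q1,
        "Median": median,
        "75th": q3,
        "Max": max(data),
        "Mean": statistics.fmean(data),
        "SD": statistics.stdev(data),  # sample SD (n - 1 divisor)
    }

# Hypothetical measurements, not the actual Ft Smith values.
data = [0.1, 5.2, 7.5, 9.9, 13.7, 15.0, 18.1, 22.4, 37.9]
for name, value in summary_stats(data).items():
    print(f"{name:>6}: {value:.2f}")
```

Comparing the Mean and Median rows gives a quick check on how skewed the data are before choosing between z and t.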
Our Question
• Can we be 95%, 90%, or how confident that this mean of 14.78 is really greater than standard of 12?
• We saw that N = 77, and mean and median not too different
• Use z (normal) rather than t

The mean is 14.8 +- what?
• We know equation for CI is
$\bar{x} \pm Z \frac{s}{\sqrt{n}}$
• Width of confidence interval represents how sure we want to be that this CI includes true mean
• Now, decide how confident we want to be
CI Calculation
• For 95%, z = 1.96 (often rounded to 2)
• Stnd error (sigma / square root of N) = (8.66 / square root of 77) = 0.987
• CI around mean = 2 x 0.987, or about 2
• We can be 95% sure that mean is included in (mean +- 2), or 14.8 - 2 = 12.8 at low end, to 14.8 + 2 = 16.8 at high end
• This does NOT include 12!

Excel can also calculate a
confidence interval around the mean
Mean, plus and minus 1.93, is a 95%
confidence interval that does NOT
include 12!
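The margin of 1.93 can be reproduced from the summary numbers used in this module (mean 14.8, sigma 8.66, N = 77), taking z = 1.96 without rounding it to 2. This is a sketch of the arithmetic only, not of Excel's own routine.

```python
import math

mean = 14.8      # Ft Smith mean from the slides
sigma = 8.66     # standard deviation used in the slides' calculation
n = 77           # number of measurements
standard = 12.0  # annual standard being compared against

z = 1.96  # 95% confidence
standard_error = sigma / math.sqrt(n)
margin = z * standard_error
low, high = mean - margin, mean + margin

print(f"standard error: {standard_error:.3f}")
print(f"95% CI: {mean} +/- {margin:.2f} = {low:.2f} to {high:.2f}")
print("CI includes the standard:", low <= standard <= high)
```

The lower end of the interval stays above 12, which matches the conclusion on the slide.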
We know we are more than 95% confident, but how confident can we be that Ft Smith mean > 12?
• Calculate where on curve our mean of 14.8 is, in terms of z (normal) score…
• …or if N small, use t score

To find where we are on the curve, calc the test statistic…
• Ft Smith mean = 14.8, sigma = 8.66, N = 77
• Calculate test statistic, in this case the z factor (we decided we can use the z rather than the t distribution)
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}}$
where $\bar{x}$ is the data’s mean and $\mu$ is the standard of 12
• If N was < 60, test stat is t, but calculated the same way
Calculate z Easily
• Our mean 14.8 minus standard of 12 (treat real mean µ (mu) as standard) is numerator (= 2.8)
• Standard error is sigma / square root of N = 0.987 (same as for CI)
• So z = 2.8 / 0.987 = 2.84
• So where is this z on the curve?
• Remember, at z = 3 we are to the right of ~ 99%
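The z calculation can be written as a short function. The name `z_statistic` and its parameter names are hypothetical; the inputs are the Ft Smith numbers used in this module.

```python
import math

def z_statistic(sample_mean, standard, sigma, n):
    """Distance of the sample mean from the standard, in standard errors."""
    return (sample_mean - standard) / (sigma / math.sqrt(n))

z = z_statistic(sample_mean=14.8, standard=12.0, sigma=8.66, n=77)
print(f"z = {z:.2f}")
```

With a small sample the same arithmetic gives the t statistic; only the lookup distribution changes.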

Where on the curve?
[Figure: curve with Z=2 and Z=3 marked]
So it is between 95% and 99% probable that the true mean is greater than 12
You can calculate exactly where on the curve, using Excel
• Use NORMSDIST function, with z
• If z (or t, for a large sample) = 2.84, Excel yields 99.8% probability that the true mean is greater than 12
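For readers working outside Excel, the same lookup can be sketched with Python's standard library. `NormalDist().cdf` returns the area under the standard normal curve to the left of z, which is the quantity NORMSDIST reports; the z of 2.84 is the value worked out in this module.

```python
from statistics import NormalDist

z = 2.84  # test statistic from the slides
probability = NormalDist().cdf(z)  # area under the standard normal curve left of z
print(f"P(Z < {z}) = {probability:.1%}")
```

This applies to the z case only; for a small sample the t distribution would be needed, and the standard library does not provide one.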