Statistics: Data Analysis and Presentation
Fr Clinic II
Overview
Tables and Graphs
Populations and Samples
Mean, Median, and Standard Deviation
Standard Error & 95% Confidence Interval (CI)
Error Bars
Comparing Means of Two Data Sets
Linear Regression (LR)
Warning

Statistics is a huge field; I've simplified considerably here. For example:
– Mean, Median, and Standard Deviation
    There are alternative formulas
– Standard Error and the 95% Confidence Interval
    There are other ways to calculate CIs (e.g., z statistic instead of t; difference between two means rather than a single mean)
– Error Bars
    Don't go beyond the interpretations I give here!
– Comparing Means of Two Data Sets
    We cover only the t test for two means when the variances are unknown but equal; there are other tests
– Linear Regression
    We look only at simple LR and calculate only the intercept, slope, and R². There is much more to LR!
Tables
Table 1: Average Turbidity and Color of Water Treated by Portable Water Filters

Water         Turbidity (NTU)   True Color (Pt-Co)   Apparent Color (Pt-Co)
(1)           (2)               (3)                  (4)
Pond Water    10                13                   30
Sweetwater    4                 55                   12 12
Hiker         3                 8                    11
MiniWorks     2                 3                    5
Standard      5 (a)             15                   15

(a) Level at which humans can visually detect turbidity
Consistent Format, Title, Units, Big Fonts
Differentiate Headings, Number Columns
Figures
Consistent Format, Title, Units
Good Axis Titles, Big Fonts
[Figure: bar chart of Turbidity (NTU) by Filter: Pond Water, Sweetwater, Miniworks, Hiker, Pioneer, Voyager]
Figure 1: Turbidity of Pond Water, Treated and Untreated
Populations and Samples

Population
– All of the possible outcomes of an experiment or observation
    US population
    Particular type of steel beam
Sample
– A finite number of outcomes measured or observations made
    1000 US citizens
    5 beams
We use samples to estimate population properties
– Mean, Variability (e.g., standard deviation), Distribution
    Height of 1000 US citizens used to estimate mean height of the US population
Mean and Median

Turbidity of Treated Water (NTU): 1, 3, 3, 6, 8, 10
Mean = Sum of values divided by number of samples
= (1+3+3+6+8+10)/6
= 5.2 NTU
Median = The middle number
Rank 1 2 3 4 5 6
Number 1 3 3 6 8 10
For even number of sample points, average middle two
= (3+6)/2 = 4.5
Excel: Mean – AVERAGE; Median - MEDIAN
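The same two calculations can be sketched outside Excel. This is a minimal Python example using only the standard library; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean, median

# Turbidity of treated water (NTU), from the slide above
turbidity_ntu = [1, 3, 3, 6, 8, 10]

sample_mean = mean(turbidity_ntu)      # sum of values / number of values
sample_median = median(turbidity_ntu)  # averages the two middle values when n is even

print(f"Mean   = {sample_mean:.1f} NTU")
print(f"Median = {sample_median:.1f} NTU")
```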
Variance

Measure of variability
– Sum of the squares of the deviations about the mean, divided by the degrees of freedom

$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$

\bar{x} = mean of the data points
n = number of data points
Excel: variance – VAR
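As a rough check of the variance definition, the sketch below applies it to the turbidity values from the Mean and Median slide and compares the result with Python's built-in sample variance, which also divides by n - 1. Reusing that data set here is an assumption made for illustration.

```python
from statistics import mean, variance

turbidity_ntu = [1, 3, 3, 6, 8, 10]  # NTU, reused from the Mean and Median slide
n = len(turbidity_ntu)
x_bar = mean(turbidity_ntu)

# Sum of squared deviations about the mean, divided by degrees of freedom (n - 1)
s_squared = sum((x - x_bar) ** 2 for x in turbidity_ntu) / (n - 1)

# statistics.variance uses the same n - 1 denominator
assert abs(s_squared - variance(turbidity_ntu)) < 1e-9
print(f"s^2 = {s_squared:.2f} NTU^2")
```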
Standard Deviation, s
Square root of the variance

$$ s = \sqrt{s^2} $$

For phenomena following a Normal Distribution (bell curve), 95% of population values lie within 1.96 standard deviations of the mean
Area under curve is probability of getting value within specified range
Excel: standard deviation – STDEV

[Figure: Normal Distribution curve; x-axis is Standard Deviations from Mean (-4 to 4), with the central 95% of the area lying between -1.96 and 1.96]
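Both statements on this slide can be checked numerically. The sketch below computes s for the turbidity values from the Mean and Median slide (an illustrative choice of data) and uses the standard library's `NormalDist` to confirm the area within 1.96 standard deviations of the mean.

```python
from math import sqrt
from statistics import NormalDist, stdev, variance

turbidity_ntu = [1, 3, 3, 6, 8, 10]  # NTU, reused from the Mean and Median slide

# Standard deviation is the square root of the (n - 1) sample variance
s = stdev(turbidity_ntu)
assert abs(s - sqrt(variance(turbidity_ntu))) < 1e-9
print(f"s = {s:.2f} NTU")

# Area under the standard normal curve between -1.96 and +1.96
standard_normal = NormalDist(mu=0.0, sigma=1.0)
fraction_within = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)
print(f"Fraction within 1.96 standard deviations: {fraction_within:.3f}")
```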
Standard Error of Mean

Standard deviation of mean
– Of sample of size n
– Taken from population with standard deviation s

$$ s_{\bar{X}} = \frac{s}{\sqrt{n}} $$

– Estimate of mean depends on sample selected
– As n increases, variance of mean estimate goes down, i.e., estimate of population mean improves
– As n increases, mean estimate distribution approaches normal, regardless of population distribution
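A small simulation can make the s/√n relationship concrete. This is only a sketch: the skewed (exponential) population, the sample sizes, and the number of trials are all assumptions chosen for illustration, not values from the slides.

```python
import random
from math import sqrt
from statistics import mean, pstdev, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical, deliberately non-normal population
population = [random.expovariate(1.0) for _ in range(20_000)]
sigma = pstdev(population)  # population standard deviation

def spread_of_sample_means(n, trials=1_000):
    """Standard deviation of the means of many random samples of size n."""
    means = [mean(random.sample(population, n)) for _ in range(trials)]
    return stdev(means)

results = {}
for n in (5, 20, 80):
    observed = spread_of_sample_means(n)
    predicted = sigma / sqrt(n)  # standard error formula from the slide
    results[n] = (observed, predicted)
    print(f"n = {n:2d}: observed {observed:.3f}, predicted {predicted:.3f}")
```

The observed spread of the sample means should track the predicted value and shrink as n grows, even though the population itself is far from normal.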
95% Confidence Interval (CI) for Mean

Interval within which we are 95% confident the true mean lies

$$ \bar{X} \pm t_{95\%,\,n-1}\, s_{\bar{X}} $$

t_{95%,n-1} is the t-statistic for the 95% CI if sample size = n
– If n ≥ 30, let t_{95%,n-1} = 1.96 (Normal Distribution)
– Otherwise, use Excel formula: TINV(0.05,n-1)
n = number of data points
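The interval formula can be sketched in a few lines of Python. The data are the turbidity values from the Mean and Median slide, reused here for illustration. The t value of 2.571 for 5 degrees of freedom is an assumption taken from a standard t table (the value TINV(0.05,5) is expected to return); it is hard-coded so that the example needs only the standard library.

```python
from math import sqrt
from statistics import mean, stdev

def confidence_interval_95(values, t_value):
    """Return (low, high) of the 95% CI for the mean, given the t statistic."""
    n = len(values)
    x_bar = mean(values)
    std_error = stdev(values) / sqrt(n)  # standard error of the mean
    half_width = t_value * std_error
    return x_bar - half_width, x_bar + half_width

turbidity_ntu = [1, 3, 3, 6, 8, 10]  # NTU
T_95_DF5 = 2.571  # assumed table value for n - 1 = 5 degrees of freedom

low, high = confidence_interval_95(turbidity_ntu, T_95_DF5)
print(f"95% CI for the mean: {low:.2f} to {high:.2f} NTU")
```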
Error Bars
Show data variability on plot of mean values
Types of error bars include:
– ± Standard Deviation, ± Standard Error, ± 95% CI
– Maximum and minimum value

[Figure: bar chart of Turbidity (NTU) by Filter Type (Filter 1, Filter 2, Filter 3) with error bars]
Using Error Bars to compare data

Standard Deviation
– Demonstrates data variability, but no comparison possible

Standard Error
– If bars overlap, any difference in means is not statistically
significant
– If bars do not overlap, indicates nothing!

95% Confidence Interval
– If bars overlap, indicates nothing!
– If bars do not overlap, difference is statistically significant

We’ll use 95 % CI
Example 1
Turbidity Data

Filter     1 (NTU)  2 (NTU)  3 (NTU)  Mean (NTU)  St Dev (NTU)  n  St Error (NTU)  t95%,2  +/- 95% CI
Filter 1   2.1      2.1      2.2      2.1         0.06          3  0.03            4.30    0.14
Filter 2   3.2      4.4      5        4.2         0.92          3  0.53            4.30    2.28
Filter 3   4.3      4.2      4.5      4.3         0.15          3  0.09            4.30    0.38

Create Bar Chart of Name vs Mean. Right click on data. Select “Format Data Series”.

[Figure: bar chart of mean Turbidity (NTU) by Portable Water Filter; data labels 2.1 (Filter 1), 4.2 (Filter 2), 4.3 (Filter 3)]
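The numbers in this example can be reproduced, and the 95% CI overlap rule from the previous slide applied, with a short script. This is a sketch rather than the method used to build the slide: the t value 4.30 is copied from the table, and the helper names `ci_95` and `intervals_overlap` are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

T_95_DF2 = 4.30  # t statistic for n = 3, as tabulated in the example

measurements_ntu = {
    "Filter 1": [2.1, 2.1, 2.2],
    "Filter 2": [3.2, 4.4, 5.0],
    "Filter 3": [4.3, 4.2, 4.5],
}

def ci_95(values, t_value=T_95_DF2):
    """Return (low, high) of the 95% CI for the mean."""
    half_width = t_value * stdev(values) / sqrt(len(values))
    return mean(values) - half_width, mean(values) + half_width

def intervals_overlap(a, b):
    """True if the two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

intervals = {name: ci_95(values) for name, values in measurements_ntu.items()}

pairs = [("Filter 1", "Filter 2"), ("Filter 1", "Filter 3"), ("Filter 2", "Filter 3")]
for first, second in pairs:
    if intervals_overlap(intervals[first], intervals[second]):
        verdict = "bars overlap: indicates nothing"
    else:
        verdict = "bars do not overlap: difference is statistically significant"
    print(f"{first} vs {second}: {verdict}")
```

With these data, only the Filter 1 and Filter 3 intervals fail to overlap. The wide interval for Filter 2 overlaps both of the others.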
Example 2
Turbidity Measurements

Time (min)  1 (NTU)  2 (NTU)  3 (NTU)  Mean (NTU)  St Dev (NTU)  n  St Error (NTU)  t95%,2  +/- 95% CI
1           4.3      4.5      4.6      4.5         0.15          3  0.09            4.30    0.38
2           4.4      4.4      4.5      4.4         0.06          3  0.03            4.30    0.14
3           4.3      4.2      4.2      4.2         0.06          3  0.03            4.30    0.14

[Figure: plot of mean Turbidity (NTU) vs. Time (min)]
What can we do?
Plot mean water quality data for various filters with error bars
Plot mean water quality over time with error bars
Comparing Filter Performance

Use the t test to determine whether the means of two
populations are different.
– Based on two data sets

E.g., turbidity produced by two different filters
Comparing Two Data Sets using the t test

Example - You pump 20 gallons of water
through each of Filters 1 and 2. After every gallon,
you measure the turbidity.
– Filter 1: Mean = 2 NTU, s = 0.5 NTU, n = 20
– Filter 2: Mean = 3 NTU, s = 0.6 NTU, n = 20

You ask the question - Do the Filters make
water with a different mean turbidity?
Do the Filters make different water?
Use TTEST (Excel)
Fractional probability of being wrong if you answer yes
– We want probability to be small
– 0.01 to 0.10 (1 to 10%). Use 0.01

Filter 1   Filter 2
1.5        3
2          2.4
2.2        2.2
1.8        2.6
3          3.4
1.6        3.6
1.2        3.8
2.1        3.5
1.9        2.7
2.2        2.4
2.6        3.5
1.7        3.8
1.8        2.1
1.5        2.5
2.4        3.4
2.5        3.3
2.7        2.4
1.4        3.6
1.5        2.3
2.6        3.7
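For readers without Excel, the t statistic behind this comparison can be sketched from the summary values on the earlier slide (Mean = 2 NTU, s = 0.5 NTU and Mean = 3 NTU, s = 0.6 NTU, n = 20 each). This is the pooled, equal-variance form of the test, not a reimplementation of TTEST. The critical value 2.71 (two-tailed, probability 0.01, 38 degrees of freedom) is an assumption read from a standard t table, and the function name is hypothetical.

```python
from math import sqrt

def pooled_t_statistic(mean1, s1, n1, mean2, s2, n2):
    """Two-sample t statistic assuming unknown but equal variances."""
    df = n1 + n2 - 2
    pooled_variance = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    std_error_of_difference = sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error_of_difference, df

# Summary statistics from the example slide
t_stat, df = pooled_t_statistic(2.0, 0.5, 20, 3.0, 0.6, 20)

T_CRITICAL_DF38 = 2.71  # assumed table value: two-tailed, probability 0.01, df = 38
means_differ = abs(t_stat) > T_CRITICAL_DF38

print(f"t = {t_stat:.2f} with {df} degrees of freedom")
print("Means differ at the 0.01 level" if means_differ else "No significant difference")
```

Comparing |t| with a table value answers the same yes/no question as checking whether the TTEST probability is below 0.01, but it does not return the probability itself.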
“t test” Questions

Do two filters make different water?
– Take multiple measurements of a particular water quality
parameter for 2 filters

Do two filters treat different amounts of water
between cleanings?
– Measure amount of water filtered between cleanings for
two filters

Does the amount of water a filter treats between
cleaning differ after a certain amount of water is
treated?
– For a single filter, measure the amount of water treated
between cleanings before and after a certain total
amount of water is treated
Linear Regression

Fit the best straight line to a data set
[Figure: scatter plot with fitted trendline; y-axis is Grade Point Average, x-axis is Height (m); y = 1.897x + 0.8667, R² = 0.9762]
Right-click on data point and use “trendline” option. Use “options”
tab to get equation and R².
R² - Coefficient of Multiple Determination

$$ R^2 = 1 - \frac{\sum_i (\hat{y}_i - y_i)^2}{\sum_i (y_i - \bar{y})^2} = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} $$

\hat{y}_i = Predicted y values, from regression equation
y_i = Observed y values
\bar{y} = Mean of observed y values
R² = fraction of variance explained by regression
(variance = standard deviation squared)
= 1 if data lies along a straight line
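The intercept, slope, and R² that the trendline option reports can also be computed directly. The sketch below uses the ordinary least-squares formulas for a single predictor. The x and y values are hypothetical (the data behind the chart are not given in the slides), and the function name is illustrative.

```python
from statistics import mean

def simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (slope, intercept, R^2)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    predicted = [intercept + slope * xi for xi in x]
    ss_residual = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))  # sum of (y_i - y_hat_i)^2
    ss_total = sum((yi - y_bar) ** 2 for yi in y)                      # sum of (y_i - y_bar)^2
    r_squared = 1 - ss_residual / ss_total
    return slope, intercept, r_squared

# Hypothetical data, for illustration only
x = [0, 2, 4, 6, 8, 10]
y = [1.0, 5.1, 8.2, 12.9, 15.8, 20.1]

slope, intercept, r_squared = simple_linear_regression(x, y)
print(f"y = {slope:.3f}x + {intercept:.3f}, R^2 = {r_squared:.4f}")
```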