Data Analysis
Freshman Clinic II
Overview
– Populations and Samples
– Presentation
– Tables and Figures
– Central Tendency
– Variability
– Confidence Intervals
– Error Bars
– Student t test
– Linear Regression
– Applications
Populations and Samples
Population
– All possible data points
   – Entire US population
   – Every rainfall event in Glassboro (past, present, and future)
Sample
– Subset of population
We use samples to estimate population parameters
Presentation
– Present clearly, objectively
– Properly communicate uncertainty
– Compare using valid statistics
Tables
Table 1: Water Quality (average of 3 to 5 values)

Water        Turbidity   True Color   Apparent Color
             (NTU)       (Pt-Co)      (Pt-Co)
(1)          (2)         (3)          (4)
Pond Water   10          13           30
Sweetwater   4           5            12
Hiker        3           8            11
MiniWorks    2           3            5
Comparison   5 (a)       15 (b)       15 (b)

(a) Visually detectable
(b) Drinking Water Standard
Figures – Bar Chart
[Bar chart: Turbidity (NTU) by filter (Pond Water, Sweetwater, Miniworks, Hiker, Pioneer, Voyager)]
Figure 1: Average Turbidity of Pond Water, Treated and Untreated
Figures – XY Scatter
[XY scatter: Apparent Color (Pt-Co) versus Water Treated (L)]
Figure 2: Change in Water Quality
Central Tendency
Example: Turbidity of Treated Water (NTU)
– Sample is 1, 3, 3, 6, 8, 10 (n = 6)
Mean = sum of values divided by number of data points
– e.g., (1+3+3+6+8+10)/6 = 5.17 NTU
Median = the middle number of the ordered sample
   Rank    1  2  3  4  5  6
   Number  1  3  3  6  8  10  (ordered)
– For an even number of sample points, average the middle two, e.g., (3+6)/2 = 4.5
– For an odd number of sample points, median = middle point
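The mean and median above can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names are illustrative and not part of the slides.

```python
from statistics import mean, median

# Turbidity of treated water (NTU), the sample from the slide
turbidity_ntu = [1, 3, 3, 6, 8, 10]

sample_mean = mean(turbidity_ntu)      # sum of values / number of points
sample_median = median(turbidity_ntu)  # even n: average of the middle two

print(f"mean   = {sample_mean:.2f} NTU")
print(f"median = {sample_median} NTU")
```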
Variability
Standard deviation of a sample
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
x_i = ith data point
\bar{x} = mean of sample
n = number of data points
e.g.,
[{(1-5.2)^2 + (3-5.2)^2 + (3-5.2)^2 + (6-5.2)^2 + (8-5.2)^2 + (10-5.2)^2}/(6-1)]^0.5
= 3.43
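As a check on the arithmetic, the sample standard deviation can be computed directly from the formula. This is a minimal sketch; the variable names are illustrative, and `statistics.stdev` is used only to cross-check the hand-written formula.

```python
import math
from statistics import stdev

turbidity_ntu = [1, 3, 3, 6, 8, 10]

n = len(turbidity_ntu)
x_bar = sum(turbidity_ntu) / n

# Sum of squared deviations from the mean, divided by n - 1
sum_sq = sum((x - x_bar) ** 2 for x in turbidity_ntu)
s = math.sqrt(sum_sq / (n - 1))

print(f"s = {s:.2f} NTU")
print(f"statistics.stdev gives {stdev(turbidity_ntu):.2f} NTU")
```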
Confidence Interval of Mean
Estimated range within which population mean falls
– e.g., 95% confidence interval of mean, based on our sample, is (1.57 ≤ μ ≤ 8.77) where μ = population mean
– We are 95% confident true mean of population (from which our sample was drawn) lies within this range
Confidence interval (CI) calculated from sample:
CI = \bar{x} \pm \frac{t \, s}{\sqrt{n}}
Where \bar{x} = sample mean, t = statistical parameter related to confidence, s = sample standard deviation, and n = sample size
Calculating “t”
In Excel, type “=TINV” into a cell and select the “=” symbol in the formula bar
The Student’s t-distribution inverse formula palette pops up
“Probability” = 1 – confidence level (as a fraction)
– e.g., if confidence level is 95%, “probability” = 1 - 0.95 = 0.05
“Deg_freedom” = degrees of freedom = n - 1
TINV returns “t”, the statistical parameter we need to estimate a confidence interval based on a sample
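For readers without Excel, the value TINV returns can be approximated numerically. The sketch below is an assumption about one way to do it, not Excel's own algorithm: it integrates the Student's t density with Simpson's rule and bisects for the t whose two-tailed probability matches the input. The function names are hypothetical, and the search range of 0 to 100 is an arbitrary limit.

```python
import math

def t_pdf(x, df):
    """Probability density of Student's t distribution."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_prob(t, df, steps=2000):
    """P(|T| > t), using Simpson's rule on the density from 0 to t."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 1 - 2 * total * h / 3

def tinv_sketch(probability, deg_freedom):
    """Bisection for t such that the two-tailed probability equals `probability`."""
    lo, hi = 0.0, 100.0  # assumed search range; not valid for extreme inputs
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_tailed_prob(mid, deg_freedom) > probability:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# 95% confidence, n = 6 data points -> probability = 0.05, deg_freedom = 5
t_value = tinv_sketch(0.05, 5)
print(f"t = {t_value:.2f}")
```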
Calculating a Confidence Interval
For our example:
– “TINV” returned 2.57
– t x s / sqrt(n) = 2.57 x 3.43 / sqrt(6) = 3.60
   – 5.17 – 3.60 = 1.57
   – 5.17 + 3.60 = 8.77
– CI: (1.57 ≤ μ ≤ 8.77) with 95% confidence
   – i.e., we are 95% confident the population mean lies between 1.57 and 8.77
Quite Wide!
– Lower “s” or higher “n” will narrow range
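The same confidence interval can be reproduced in Python. This sketch takes t = 2.57 straight from the slide (the value TINV returned) instead of computing it; the variable names are illustrative.

```python
import math
from statistics import mean, stdev

turbidity_ntu = [1, 3, 3, 6, 8, 10]
t = 2.57  # from the slide: TINV with probability 0.05 and 5 degrees of freedom

n = len(turbidity_ntu)
x_bar = mean(turbidity_ntu)
s = stdev(turbidity_ntu)

half_width = t * s / math.sqrt(n)  # t * s / sqrt(n)
ci_lower = x_bar - half_width
ci_upper = x_bar + half_width

print(f"95% CI: {ci_lower:.2f} <= mu <= {ci_upper:.2f}")
```

Because the half-width is t·s/√n, quadrupling n with the same s would roughly halve it (slightly more, since t also shrinks as n grows).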
Error Bars
Used to show data variability on a graph
– Bar chart, XY,…
[Bar chart with error bars: Turbidity (NTU) for Pond Water, Sweetwater, Miniworks; x-axis: Water (Untreated and Treated)]
Types of Error Bars
– Standard Error of Mean
– Confidence Interval
– Standard Deviation
– Percentage
http://www.graphpad.com/articles/errorbars.htm
Standard Error
SE = \frac{s}{\sqrt{n}}
Adding Error Bars
1. Create chart in Excel
2. Select a data series by selecting a data point or bar
3. From “Format” menu, select “Selected data series…”
4. Select “custom”
5. Select + and – error bar data. This could be standard deviation, standard error, or confidence limits.
             Average      Confidence Interval
             Turbidity    Lower Interval   Upper Interval
Pond Water   20           4                4
Sweetwater   10           2                2
Miniworks    7            3                3
Error Bars and our Example
Standard Error of Mean
– s / sqrt(n) = 3.43 / sqrt(6) = 1.40
– Put 1.40 in + and - cells
– Since the mean = 5.17, the error bars in a bar chart would go from
   – 5.17 – 1.40 = 3.77 to
   – 5.17 + 1.40 = 6.57
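The standard-error bar limits for the example can be checked with a short sketch (variable names are illustrative):

```python
import math
from statistics import mean, stdev

turbidity_ntu = [1, 3, 3, 6, 8, 10]

x_bar = mean(turbidity_ntu)
standard_error = stdev(turbidity_ntu) / math.sqrt(len(turbidity_ntu))

# Error bar limits: mean minus and plus one standard error
bar_low = x_bar - standard_error
bar_high = x_bar + standard_error

print(f"SE = {standard_error:.2f}, bars from {bar_low:.2f} to {bar_high:.2f}")
```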
Interpreting Error Bars
Error bars can be used to compare two sample means
Standard Error (SE)
– SE bars do not overlap: no conclusions can be drawn
– SE bars overlap: samples appear not to be drawn from significantly different populations
Confidence Interval (CI)
– CI bars do not overlap: samples appear to be drawn from significantly different populations, at the confidence level of the confidence interval
– CI bars overlap: no conclusions can be drawn
http://www.graphpad.com/articles/errorbars.htm
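The confidence-interval rule can be written as a small overlap check. The sketch below applies it to the averages and interval widths from the error bar table in the slides (Pond Water 20 ± 4, Sweetwater 10 ± 2, Miniworks 7 ± 3), assuming those widths are confidence limits, as the table header suggests. The function names are hypothetical.

```python
def ci_range(average, lower, upper):
    """Return the (low, high) ends of an error bar."""
    return (average - lower, average + upper)

def bars_overlap(a, b):
    """True if two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

def interpret_ci(a, b):
    """Apply the CI rule: only non-overlapping bars support a conclusion."""
    if bars_overlap(a, b):
        return "no conclusion"
    return "appear significantly different"

pond = ci_range(20, 4, 4)        # (16, 24)
sweetwater = ci_range(10, 2, 2)  # (8, 12)
miniworks = ci_range(7, 3, 3)    # (4, 10)

pond_vs_sweetwater = interpret_ci(pond, sweetwater)
sweetwater_vs_miniworks = interpret_ci(sweetwater, miniworks)
print("Pond Water vs Sweetwater:", pond_vs_sweetwater)
print("Sweetwater vs Miniworks:", sweetwater_vs_miniworks)
```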
Comparing Samples with a t-test
Example - You measure untreated and treated pond water
– Treated: mean = 2 NTU, s = 0.5 NTU, n = 20
– Untreated: mean = 3 NTU, s = 0.6 NTU, n = 20
You ask the question – Is the average turbidity of treated water different from that of untreated water?
– Use a t-test
Is the water different?
Use TTEST (Excel)
– Returns the probability (as a fraction) of being wrong if you claim a statistically significant difference (Type I error)
– Select significance level ahead of time, usually 0.01 - 0.1
– For our example, the probability returned, 0.0000015, is very small
Treated: 1.5, 2, 2.2, 1.8, 3, 1.6, 1.2, 2.1, 1.9, 2.2, 2.6, 1.7, 1.8, 1.5, 2.4, 2.5, 2.7, 1.4, 1.5, 2.6
Untreated: 3, 2.4, 2.2, 2.6, 3.4, 3.6, 3.8, 3.5, 2.7, 2.4, 3.5, 3.8, 2.1, 2.5, 3.4, 3.3, 2.4, 3.6, 2.3, 3.7
T test steps
1. Identify two samples to compare
2. Select α, the significance level of the statistical test
   – We’ll use 0.05 in this class
   – Confidence = 1 - α
3. Use Excel “TTEST” formula to estimate probability of Type I Error
4. If probability returned by TTEST is less than or equal to 0.05, assume the samples come from two different populations
For our example, 0.0000015 < 0.05, so assume the treated water is different from the untreated water
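The steps above can be sketched in Python for the treated and untreated turbidity data. The slides do not say which TTEST options were used, so this sketch assumes a two-tailed, equal-variance (pooled) test. The p-value comes from numerically integrating the Student's t density, which is an illustrative method and not Excel's. Function names are hypothetical.

```python
import math
from statistics import mean, variance

treated = [1.5, 2, 2.2, 1.8, 3, 1.6, 1.2, 2.1, 1.9, 2.2,
           2.6, 1.7, 1.8, 1.5, 2.4, 2.5, 2.7, 1.4, 1.5, 2.6]
untreated = [3, 2.4, 2.2, 2.6, 3.4, 3.6, 3.8, 3.5, 2.7, 2.4,
             3.5, 3.8, 2.1, 2.5, 3.4, 3.3, 2.4, 3.6, 2.3, 3.7]

def t_pdf(x, df):
    """Probability density of Student's t distribution."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t_abs, df, steps=2000):
    """P(|T| > t_abs), using Simpson's rule on the density from 0 to t_abs."""
    h = t_abs / steps
    total = t_pdf(0.0, df) + t_pdf(t_abs, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 1 - 2 * total * h / 3

# Step 2: significance level chosen ahead of time
alpha = 0.05

# Step 3: pooled-variance t statistic and its two-tailed probability
n1, n2 = len(treated), len(untreated)
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * variance(treated) + (n2 - 1) * variance(untreated)) / df
t_stat = (mean(untreated) - mean(treated)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
p_value = two_tailed_p(abs(t_stat), df)

# Step 4: compare with alpha
significant = p_value <= alpha
print(f"t = {t_stat:.2f}, p = {p_value:.2e}, significant: {significant}")
```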
Linear Regression
Fit the best straight line to a data set
[XY scatter with trendline: Grade Point Average versus Height (m); y = 1.897x + 0.8667, R^2 = 0.9762]
Right-click on data point and use “trendline” option. Use “options” tab to show equation and R^2.
R^2 - Coefficient of Multiple Determination
R^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}
\hat{y}_i = Predicted y values, from regression equation
\bar{y} = Average of y
y_i = Observed y values
R^2 = fraction of variance explained by regression (variance = standard deviation squared)
R^2 = 1 if data lies along a straight line
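A least-squares line and its R^2 can also be computed without Excel. The sketch below uses the R^2 definition above (explained variation over total variation). The data points are made up for illustration and are not the points behind the trendline figure; `linear_fit` is a hypothetical helper name.

```python
def linear_fit(xs, ys):
    """Least-squares slope, intercept, and R^2 for y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    predicted = [slope * x + intercept for x in xs]
    ss_regression = sum((y_hat - y_bar) ** 2 for y_hat in predicted)
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return slope, intercept, ss_regression / ss_total

# Hypothetical data, for illustration only
xs = [0, 2, 4, 6, 8, 10]
ys = [1.0, 4.5, 8.9, 12.0, 16.2, 19.8]

slope, intercept, r_squared = linear_fit(xs, ys)
print(f"y = {slope:.3f}x + {intercept:.3f}, R^2 = {r_squared:.4f}")
```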
What might you do in this class?
Flow rate versus stroke rate
– Figure with linear regression over linear range
Ability to improve water quality
– Table and t-test comparison with untreated water (for turbidity and apparent color), or
– Bar chart (for turbidity and apparent color) with confidence interval error bars
Pressure change versus flow rate, Power versus flowrate
– Figure (no statistics possible because we only took one reading of pressure for each flow rate and relationship is non-linear)
Force versus stroke rate
– Figure w/95% confidence interval error bars for each data point
Power versus Flowrate
– Figure
Example – Water Quality
Table 2: Improvement in Water Quality

                        Untreated Water        Treated Water          Statistically
                        Mean    Standard       Mean    Standard       Significant
                                Deviation              Deviation      Difference?
Turbidity, NTU          8       1              3       0.5            Yes
Apparent Color, Pt-Co   100     5              7       0.6            Yes

Note: Statistical significance tested at α = 0.05 using t-test