Statistics: Data Presentation & Analysis
Fr Clinic I
Overview
• Tables & Graphs
• Populations & Samples
• Mean, Median, & Variance
• Error Bars
– Standard Deviation, Standard Error & 95% Confidence Interval (CI)
• Comparing Means of Two Populations
• Linear Regression (LR)
Warning
• Statistics is a huge field, and I’ve simplified considerably here. For example:
– Mean, Median, and Standard Deviation
• There are alternative formulas
– 95% Confidence Interval
• There are other ways to calculate CIs (e.g., z statistic instead of t; difference between two means, rather than single mean…)
– Error Bars
• Don’t go beyond the interpretations I give here!
– Comparing Means of Two Data Sets
• We cover only the t test for two means when the variances are unknown but equal; there are other tests
– Linear Regression
• We only look at simple LR and only calculate the intercept, slope and R². There is much more to LR!
Tables
Table 1: Average Turbidity and Color of Water Treated by Portable Water Filters

Water        Turbidity (NTU)   True Color (Pt-Co)   Apparent Color (Pt-Co)
(1)          (2)               (3)                  (4)
Pond Water   10                13                   30
Sweetwater   4                 5                    12
Hiker        3                 8                    11

Consistent Format, Title, Units, Big Fonts
Differentiate Headings, Number Columns
Figures
Consistent Format, Title, Units
Good Axis Titles, Big Fonts
[Bar chart: Turbidity (NTU), axis from 0 to 25, for each Filter: Pond Water, Sweetwater, Miniworks, Hiker, Pioneer, Voyager]
Figure 1: Turbidity of Pond Water, Treated and Untreated
Populations and Samples
• Population
– All possible outcomes of experiment or observation
• US population
• Particular type of steel beam
• Sample
– Finite number of outcomes measured or observations made
• 1000 US citizens
• 5 beams
• Use samples to estimate population properties
– Mean, Variance
• E.g., Height of 1000 US citizens used to estimate mean of US
population
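As a rough illustration of using a sample to estimate a population property, the sketch below builds a synthetic population of heights and compares the mean of a 1000-member sample with the population mean. The height distribution (mean 170 cm, standard deviation 10 cm), the population size, and the random seed are all assumptions made up for the example, not data from this course.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable (arbitrary choice)

# Hypothetical population: 100,000 heights in cm (assumed, not real data)
population = [random.gauss(170, 10) for _ in range(100_000)]

# Sample: 1000 members drawn without replacement
sample = random.sample(population, 1000)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean = {population_mean:.1f} cm")
print(f"sample mean     = {sample_mean:.1f} cm")
```

In practice the population mean is unknown, which is why the sample mean is reported together with a measure of its uncertainty.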
Central Tendency
• Mean and Median
Data (NTU): 1, 3, 3, 6, 8, 10
Mean = xbar = Sum of values divided by sample size
= (1+3+3+6+8+10)/6
= 5.2 NTU
Median = m = Middle number
Rank 1 2 3 4 5 6
Number 1 3 3 6 8 10
For even number of sample points, average middle two
= (3+6)/2 = 4.5
Excel: Mean – AVERAGE; Median - MEDIAN
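The same mean and median can be checked outside Excel. This is a minimal sketch using Python's standard `statistics` module on the six values from the slide; the variable names are illustrative.

```python
import statistics

turbidity = [1, 3, 3, 6, 8, 10]  # NTU, the six values from the slide

mean = statistics.mean(turbidity)      # sum of values divided by sample size
median = statistics.median(turbidity)  # averages the middle two when n is even

print(f"mean   = {mean:.1f} NTU")
print(f"median = {median} NTU")
```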
Variability
• Variance, s²
– sum of the square of the deviation about the mean divided by degrees of freedom
$s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)$
– Where xi = a data point, xbar = the sample mean, and n = number of data points
• Example (cont.)
– s² = [(1-5.2)² + (3-5.2)² + (3-5.2)² + (6-5.2)² + (8-5.2)² + (10-5.2)²] / (6-1) = 11.8 NTU²
Excel: Variance – VAR
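To make the variance formula concrete, the sketch below computes it step by step for the same six turbidity values and cross-checks the result against `statistics.variance`, which also divides by n - 1. Variable names are illustrative.

```python
import statistics

turbidity = [1, 3, 3, 6, 8, 10]  # NTU

n = len(turbidity)
xbar = sum(turbidity) / n

# Square of each deviation about the mean
squared_deviations = [(x - xbar) ** 2 for x in turbidity]

# Divide by the degrees of freedom, n - 1
variance = sum(squared_deviations) / (n - 1)

print(f"s^2 = {variance:.1f} NTU^2")
print(f"statistics.variance gives {statistics.variance(turbidity):.1f} NTU^2")
```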
Error Bars
• Show data variability on plot of mean values
• Types of error bars include:
• Max/min, ± Standard Deviation, ± Standard Error, ±
95% CI
[Figure: mean Turbidity (NTU), axis from 0 to 10, plotted with error bars for Filter 1, Filter 2, and Filter 3; x-axis: Filter Type]
Standard Deviation, s
• Square root of variance
$s = \sqrt{s^2}$
• If a phenomenon follows the Normal Distribution (bell curve), 95% of the population lies within 1.96 standard deviations of the mean
• Error bar is s above & below mean
[Figure: Normal Distribution curve; x-axis: Standard Deviations from Mean, from -4 to 4; the central 95% of the area lies between -1.96 and 1.96]
Excel: standard deviation – STDEV
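Both statements on this slide can be checked numerically. The sketch below takes the standard deviation of the earlier six-point turbidity sample and uses `statistics.NormalDist` to confirm that about 95% of a normal distribution lies within 1.96 standard deviations of the mean. Reusing that sample here is my choice for illustration.

```python
import math
import statistics

turbidity = [1, 3, 3, 6, 8, 10]  # NTU, sample from the earlier slides

# Sample standard deviation is the square root of the sample variance
s = statistics.stdev(turbidity)
s_from_variance = math.sqrt(statistics.variance(turbidity))
print(f"s = {s:.1f} NTU")

# Fraction of a standard normal distribution between -1.96 and +1.96
standard_normal = statistics.NormalDist(mu=0, sigma=1)
coverage = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)
print(f"fraction within 1.96 standard deviations = {coverage:.3f}")
```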
Standard Error of Mean
• Also called St-Err or sxbar
• For sample of size n taken from population with standard deviation estimated as s
$s_{\bar{X}} = s / \sqrt{n}$
• As n ↑, sxbar ↓, i.e., estimate of population mean improves
• Error bar is St-Err above & below mean
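A short sketch of the standard error calculation, again on the six-point turbidity sample. The loop then holds s fixed and varies n to show the estimate tightening; the larger sample sizes are hypothetical, and `standard_error` is a helper name introduced only for this example.

```python
import math
import statistics

def standard_error(s, n):
    """Standard error of the mean: s divided by the square root of n."""
    return s / math.sqrt(n)

turbidity = [1, 3, 3, 6, 8, 10]  # NTU
s = statistics.stdev(turbidity)
st_err = standard_error(s, len(turbidity))
print(f"St-Err = {st_err:.2f} NTU for n = {len(turbidity)}")

# Hypothetical larger samples with the same s: quadrupling n halves St-Err
for hypothetical_n in (24, 96):
    print(f"n = {hypothetical_n}: St-Err = {standard_error(s, hypothetical_n):.2f} NTU")
```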
95% Confidence Interval (CI) for Mean
• A 95% Confidence Interval is expected to contain the population mean 95% of the time (i.e., of the 95% CIs from 100 samples, about 95 are expected to contain the population mean)
$\bar{X} \pm t_{95\%,n-1}\, s_{\bar{X}}$
• t95%,n-1 is a statistic for 95% CI from sample of size n
– t95%,n-1 = TINV(0.05,n-1)
– If n ≥ 30, t95%,n-1 ≈ 1.96 (Normal Distribution)
• Error bar is t95%,n-1 × sxbar above & below mean
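The interval can be sketched for the six-point turbidity sample used earlier. The Python standard library has no equivalent of TINV, so the t value for n - 1 = 5 degrees of freedom is hardcoded as 2.571. That number is an assumption taken from a standard t table (it should match what TINV(0.05,5) returns), not a value from these slides.

```python
import math
import statistics

turbidity = [1, 3, 3, 6, 8, 10]  # NTU

# Assumed two-tailed 95% t value for 5 degrees of freedom (standard t table)
T_95_DF5 = 2.571

n = len(turbidity)
xbar = statistics.mean(turbidity)
st_err = statistics.stdev(turbidity) / math.sqrt(n)

half_width = T_95_DF5 * st_err  # length of the error bar above and below the mean
ci_low, ci_high = xbar - half_width, xbar + half_width

print(f"mean = {xbar:.1f} NTU, 95% CI half-width = {half_width:.1f} NTU")
print(f"95% CI: {ci_low:.1f} to {ci_high:.1f} NTU")
```

With only six widely spread points the interval is broad, which is the expected behavior for a small sample.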
Using Error Bars to compare data
• Standard Deviation
– Demonstrates data variability, but no comparison possible
• Standard Error
– If bars overlap, any difference in means is not statistically significant
– If bars do not overlap, indicates nothing!
• 95% Confidence Interval
– If bars overlap, indicates nothing!
– If bars do not overlap, difference is statistically significant
• We’ll use 95 % CI in this class
– Any time you have 3 or more data points, determine mean, standard
deviation, standard error, and t95%,n-1, then plot mean with error bars
showing the 95% confidence interval
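The overlap rule for 95% CI error bars can be written as a small check. The means and CI half-widths below are the rounded Filter 1 to Filter 3 values from the Example 1 turbidity table in this transcript; the function names are illustrative.

```python
from itertools import combinations

def ci_bounds(mean, half_width):
    """Lower and upper ends of the error bar."""
    return (mean - half_width, mean + half_width)

def intervals_overlap(a, b):
    """True if two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# (mean, 95% CI half-width) in NTU, rounded values from the Example 1 table
filters = {
    "Filter 1": (2.1, 0.14),
    "Filter 2": (4.2, 2.28),
    "Filter 3": (4.3, 0.38),
}

for (name_a, a), (name_b, b) in combinations(filters.items(), 2):
    if intervals_overlap(ci_bounds(*a), ci_bounds(*b)):
        verdict = "bars overlap: indicates nothing"
    else:
        verdict = "bars do not overlap: difference is statistically significant"
    print(f"{name_a} vs {name_b}: {verdict}")
```

Only Filter 1 and Filter 3 have non-overlapping bars; the wide Filter 2 interval overlaps both, so nothing can be concluded about Filter 2 from the error bars alone.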
Adding Error Bars to an Excel Graph
• Create Graph
– Column, scatter,…
• Select Data Series
• In Layout Tab-Analysis Group, select Error Bars
• Select More Error Bar Options
• Select Custom and Specify Values and select cells containing the t95%,n-1 × sxbar values
Example 1: 95% CI
Turbidity Data

           1      2      3      mean   St Dev  n   St-Err  t95%,2  +/- 95% CI
           (NTU)  (NTU)  (NTU)  (NTU)  (NTU)       (NTU)           (t95%,2 × St-Err)
Filter 1   2.1    2.1    2.2    2.1    0.06    3   0.03    4.30    0.14
Filter 2   3.2    4.4    5      4.2    0.92    3   0.53    4.30    2.28
Filter 3   4.3    4.2    4.5    4.3    0.15    3   0.09    4.30    0.38

[Bar chart: mean Turbidity (NTU), axis from 0.0 to 7.0, by Portable Water Filter: Filter 1 = 2.1, Filter 2 = 4.2, Filter 3 = 4.3]
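The columns of this example can be reproduced from the three raw readings per filter. The sketch below uses t95%,2 = 4.30 exactly as listed in the table rather than computing it; the dictionary layout and variable names are illustrative.

```python
import math
import statistics

T_95_DF2 = 4.30  # t95%,2 as listed in the table (n - 1 = 2 degrees of freedom)

# Three turbidity readings (NTU) per filter, from the table
readings = {
    "Filter 1": [2.1, 2.1, 2.2],
    "Filter 2": [3.2, 4.4, 5.0],
    "Filter 3": [4.3, 4.2, 4.5],
}

results = {}
for name, values in readings.items():
    n = len(values)
    mean = statistics.mean(values)
    st_dev = statistics.stdev(values)  # sample standard deviation
    st_err = st_dev / math.sqrt(n)     # standard error of the mean
    half_width = T_95_DF2 * st_err     # +/- 95% CI
    results[name] = (mean, st_dev, st_err, half_width)
    print(f"{name}: mean={mean:.1f}  St Dev={st_dev:.2f}  "
          f"St-Err={st_err:.2f}  +/- 95% CI={half_width:.2f}")
```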
What can we do?
• Lift weight multiple times using different solar panel combinations (or hydroturbines, or gear boxes) and plot mean and 95% Confidence Interval error bars.
– If error bars overlap between two different test conditions, indicates nothing!
– If error bars do not overlap, difference is statistically significant
T Test
• A more sophisticated way to compare means
• Use t test to determine if means of two
populations are different
• E.g., lift times with different solar panel combinations
or turbines or…
Comparing Two Data Sets using the t test
• Example - You lift weight with two panels in
series and two in parallel.
– Series: Mean = 2 min, s = 0.5 min, n = 20
– Parallel: Mean = 3 min, s = 0.6 min, n = 20
• You ask the question - Do the different panel
combinations result in different lift times?
– Different in a statistically significant way
Are the Lift Times Different?
• Use TTEST (Excel)
• Fractional probability of being wrong if you claim the two populations are different
– We’ll say they are significantly different if probability is ≤ 0.05

Lift times (min)
Series   Parallel
1.5      3
2        2.4
2.2      2.2
1.8      2.6
3        3.4
1.6      3.6
1.2      3.8
2.1      3.5
1.9      2.7
2.2      2.4
2.6      3.5
1.7      3.8
1.8      2.1
1.5      2.5
2.4      3.4
2.5      3.3
2.7      2.4
1.4      3.6
1.5      2.3
2.6      3.7
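For readers without Excel, here is a hedged sketch of the equal-variance two-sample t test on the series and parallel lift times listed above. It computes the pooled t statistic, then estimates the two-tailed probability by numerically integrating the t distribution with Simpson's rule, since the standard library has no t distribution. I believe this corresponds to Excel's TTEST with tails = 2 and type = 2, but treat that mapping as an assumption. The helper names are mine.

```python
import math
import statistics

series = [1.5, 2, 2.2, 1.8, 3, 1.6, 1.2, 2.1, 1.9, 2.2,
          2.6, 1.7, 1.8, 1.5, 2.4, 2.5, 2.7, 1.4, 1.5, 2.6]
parallel = [3, 2.4, 2.2, 2.6, 3.4, 3.6, 3.8, 3.5, 2.7, 2.4,
            3.5, 3.8, 2.1, 2.5, 3.4, 3.3, 2.4, 3.6, 2.3, 3.7]

def pooled_t(a, b):
    """t statistic and degrees of freedom, assuming equal variances."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / df
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / std_err, df

def t_pdf(x, df):
    """Probability density of Student's t distribution."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """P(|T| > |t|), integrating the density from 0 to |t| by Simpson's rule."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3  # area between -|t| and +|t|
    return max(0.0, 1.0 - central_area)

t_stat, df = pooled_t(series, parallel)
p_value = two_tailed_p(t_stat, df)
print(f"t = {t_stat:.2f}, df = {df}, p = {p_value:.2g}")
print("significantly different" if p_value <= 0.05 else "not significantly different")
```

The probability comes out far below 0.05, so by the rule on this slide the two panel arrangements give significantly different lift times.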
Marbles
Linear Regression
• Fit the best straight line to a data set
[Figure: scatter plot with fitted trendline, labeled “Grade Point Average”; vertical axis from 0 to 25; x-axis: Height (m), from 0 to 12; trendline y = 1.897x + 0.8667, R² = 0.9762]
Right-click on data point and select “trendline”. Select options to show equation and R².
R² - Coefficient of multiple Determination
$R^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 / \sum_{i=1}^{n} (y_i - \bar{y})^2$
– ŷi = Predicted y values, from regression equation
– yi = Observed y values
– ybar = mean of y
• R² = fraction of variance explained by regression
– R² = 1 if data lies along a straight line
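To show how the slope, intercept, and R² fit together, the sketch below fits a least-squares line and then evaluates R² using the ratio defined above. The five (x, y) points are hypothetical values invented for the example, not the data behind the trendline figure, and `fit_line` is an illustrative helper name.

```python
def fit_line(xs, ys):
    """Least-squares slope, intercept, and R^2 for simple linear regression."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n

    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = ybar - slope * xbar

    # R^2 = sum (predicted - ybar)^2 / sum (observed - ybar)^2
    predicted = [intercept + slope * x for x in xs]
    ss_regression = sum((yhat - ybar) ** 2 for yhat in predicted)
    ss_total = sum((y - ybar) ** 2 for y in ys)
    return slope, intercept, ss_regression / ss_total

# Hypothetical data, made up for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept, r2 = fit_line(xs, ys)
print(f"y = {slope:.2f}x + {intercept:.2f}, R^2 = {r2:.2f}")
```

Points that lie exactly on a straight line, such as (0, 1), (1, 3), (2, 5), give R² = 1 with this function, matching the last bullet.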