Understanding Your Data Set
• Statistics are used to describe data sets
• Gives us a metric in place of a graph
• What are some types of statistics used
to describe data sets?
– Average, range, variance, standard
deviation, coefficient of variation, standard
error
Table 1. Total length (cm) and average length of spotted gar collected
from a local farm pond and from a local lake.
Number     Length, Pond   Length, Lake
1          34             38
2          78             82
3          48             58
4          24             76
5          64             60
6          58             70
7          34             99
8          66             40
9          22             68
10         44             91
Average =  47.2           68.2
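The averages in Table 1 can be reproduced with a few lines of Python. This is only a minimal sketch: the variable names `pond` and `lake` are illustrative and do not come from the slides.

```python
from statistics import mean

# Spotted gar lengths (cm) copied from Table 1
pond = [34, 78, 48, 24, 64, 58, 34, 66, 22, 44]
lake = [38, 82, 58, 76, 60, 70, 99, 40, 68, 91]

# The average is the sum of the lengths divided by the number of fish
pond_mean = mean(pond)
lake_mean = mean(lake)

print(f"Pond average = {pond_mean:.1f} cm")
print(f"Lake average = {lake_mean:.1f} cm")
```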
• Are the two samples
equal?
– What about 47.2 and
47.3?
• If we sampled all of the
gar in each water body,
would the average be
different?
– How different?
• Would the lake fish
average still be larger?
Range
• Simply the distance between the smallest and
largest value
[Figure: Lake, Overlap, and Pond ranges along a Length (cm) axis from 0 to 100.]
Figure 1. Range of spotted gar length collected from a pond and a lake. The dashed line represents the overlap in range.
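The range and the overlap between the two samples could be computed as follows. This is a sketch under the assumption that "overlap" means the stretch of lengths covered by both samples; the helper name `value_range` is made up for illustration.

```python
# Spotted gar lengths (cm) copied from Table 1
pond = [34, 78, 48, 24, 64, 58, 34, 66, 22, 44]
lake = [38, 82, 58, 76, 60, 70, 99, 40, 68, 91]


def value_range(values):
    """Return (smallest, largest, distance between them)."""
    smallest, largest = min(values), max(values)
    return smallest, largest, largest - smallest


pond_lo, pond_hi, pond_range = value_range(pond)
lake_lo, lake_hi, lake_range = value_range(lake)

# The ranges overlap between the larger minimum and the smaller maximum
overlap_lo = max(pond_lo, lake_lo)
overlap_hi = min(pond_hi, lake_hi)

print(f"Pond: {pond_lo} to {pond_hi} cm (range {pond_range})")
print(f"Lake: {lake_lo} to {lake_hi} cm (range {lake_range})")
print(f"Overlap: {overlap_lo} to {overlap_hi} cm")
```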
• Does the difference in average length (47.2 vs. 68.2) seem to be as large as before?
Variance
• An index of variability used to describe
the dispersion among the measures of
a population sample.
• Need the distance between each
sample point and the sample mean.
[Figure: Length plotted against Number, showing the distance from each point to the sample mean.]
Figure 2. Mean length (cm) of each spotted gar collected from the pond. The horizontal solid line represents the sample mean length.
• We can easily put this new data set into a spreadsheet table.
• By adding up all of the differences, we can get a number that is a reflection of how scattered the data points are.

#      Length   Mean   Difference
1      34       47.2   -13.2
2      78       47.2   30.8
3      48       47.2   0.8
4      24       47.2   -23.2
5      64       47.2   16.8
6      58       47.2   10.8
7      34       47.2   -13.2
8      66       47.2   18.8
9      22       47.2   -25.2
10     44       47.2   -3.2
Sum =                  0

• After adding up all of the differences, we get zero.
– This is true of all calculations like this: the negative and positive differences cancel out
– Closer to the mean each number is, the smaller the total difference.
• What can we do to get rid of the negative values?
Sum of Squares
#      Length   Mean   Difference   Difference²
1      34       47.2   -13.2        174.24
2      78       47.2   30.8         948.64
3      48       47.2   0.8          0.64
4      24       47.2   -23.2        538.24
5      64       47.2   16.8         282.24
6      58       47.2   10.8         116.64
7      34       47.2   -13.2        174.24
8      66       47.2   18.8         353.44
9      22       47.2   -25.2        635.04
10     44       47.2   -3.2         10.24
Sum =                  0            3233.6
Now 3233.6 is a number we can use! This value is called the SUM
OF SQUARES.
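The two columns of the table can be checked in Python. This is a sketch with illustrative variable names; it shows that the plain differences cancel to zero while the squared differences add up to the sum of squares.

```python
# Pond lengths (cm) from Table 1
pond = [34, 78, 48, 24, 64, 58, 34, 66, 22, 44]

mean_length = sum(pond) / len(pond)

# Difference between each length and the sample mean
differences = [length - mean_length for length in pond]

# Negative and positive differences cancel (zero up to floating-point rounding)
sum_of_differences = sum(differences)

# Squaring removes the signs, so the total no longer cancels
sum_of_squares = sum(d ** 2 for d in differences)

print(f"Sum of differences is zero: {abs(sum_of_differences) < 1e-9}")
print(f"Sum of squares = {sum_of_squares:.1f}")
```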
Back to Variance
• Sum of Squares (SOS) will continue to increase as we increase our sample size.
– A sample of 100 replicates that are not highly variable could have a higher SOS than a sample of 10 replicates that are highly variable.
• To account for sample size, we need to
divide SOS by the number of samples minus
one (n-1).
– We’ll get to the reason (n-1) instead of n later
Calculate Variance (σ²)
σ² = S² = Σ(Xi – Xm)² / (n – 1)
– Numerator, Σ(Xi – Xm)² = SOS
– Denominator, (n – 1) = Degrees of Freedom
Variance for Pond = S² = 3233.6 / 9 = 359.29
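A short sketch of the variance calculation, written out step by step and then checked against Python's built-in `statistics.variance`, which also divides by n – 1. The function name `sample_variance` is illustrative.

```python
from statistics import variance

# Spotted gar lengths (cm) from Table 1
pond = [34, 78, 48, 24, 64, 58, 34, 66, 22, 44]
lake = [38, 82, 58, 76, 60, 70, 99, 40, 68, 91]


def sample_variance(values):
    """Sum of squares divided by the degrees of freedom (n - 1)."""
    m = sum(values) / len(values)
    sos = sum((x - m) ** 2 for x in values)
    return sos / (len(values) - 1)


pond_var = sample_variance(pond)  # 3233.6 / 9
lake_var = sample_variance(lake)  # 3601.6 / 9

print(f"Pond variance = {pond_var:.2f}")
print(f"Lake variance = {lake_var:.2f}")

# The standard library gives the same answer
print(abs(variance(pond) - pond_var) < 1e-9)
```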
More on Variance
• Variance tends to increase as the
sample mean increases
– For our sample, the largest difference
between any point and the mean was 30.8
cm. Imagine measuring a plot of cypress
trees. How large of a difference would you
expect (if measured in cm)?
• The variance for the lake sample =
400.18.
Standard Deviation
• Calculated as the square root of the
variance.
– Variance is not a linear distance (we had to
square it). Think about the difference in
shape of a meter stick versus a square
meter.
• By taking the square root of the
variance, we return our index of
variability to something that can be
placed on a number line.
Calculate SD
• For our gar sample, the Variance was 359.29.
The square root of 359.29 = 18.95.
– Reported with the mean as: 47.2 ± 18.95 (mean ± SD).
• Standard Deviation is often abbreviated as σ
(sigma) or as SD.
• SD is a unit of measurement that describes the
scatter of our data set.
– Also increases with the mean
Standard Error
• Calculated as: SE = σ / √(n)
– Indicates how close we are to estimating the true
population mean
– For our pond ex: SE = 18.95 / √10 = 5.993
– Reported with the mean as 47.2 ± 5.993 (mean ± SE).
– Based on the formula, the SE decreases as sample
size increases.
• Why is this not a mathematical artifact, but a true
reflection of the population we are studying?
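SD and SE for the pond sample can be computed together. This is a minimal sketch using the standard library; variable names are illustrative.

```python
import math
from statistics import mean, stdev

# Pond lengths (cm) from Table 1
pond = [34, 78, 48, 24, 64, 58, 34, 66, 22, 44]

n = len(pond)
pond_mean = mean(pond)

# Standard deviation: square root of the (n - 1) variance
sd = stdev(pond)

# Standard error: SD divided by the square root of the sample size
se = sd / math.sqrt(n)

print(f"mean ± SD: {pond_mean:.1f} ± {sd:.2f}")
print(f"mean ± SE: {pond_mean:.1f} ± {se:.3f}")
```

Carrying full precision gives SE ≈ 5.994; the 5.993 shown above comes from dividing the already rounded SD (18.95) by √10.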
Sample Size
• The number of individuals within a
population you measure/observe.
– Usually impossible to measure the entire
population
• As sample size increases, we get closer
to the true population mean.
– Remember, when we take a sample we
assume it is representative of the
population.
Effect of Increasing Sample Size
• I measured the length of 100 gar
• Calculated SD and SE for the first 10,
then included the next additional 10,
and so on until all 100 individuals were
included.
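The procedure above can be imitated with simulated data. Note the loud assumption: the 100 measured gar lengths are not given in the slides, so this sketch draws 100 made-up lengths from a normal distribution (mean 60 cm, SD 20 cm, both chosen arbitrarily) purely to show the mechanics.

```python
import math
import random
from statistics import stdev

random.seed(1)  # fixed seed so the sketch is repeatable

# ASSUMPTION: simulated lengths (cm); not the original data set
lengths = [random.gauss(60, 20) for _ in range(100)]

results = []  # one (sample size, SD, SE) tuple per step
for n in range(10, 101, 10):
    subset = lengths[:n]      # first 10, then first 20, and so on
    sd = stdev(subset)        # scatter of the data
    se = sd / math.sqrt(n)    # precision of the estimated mean
    results.append((n, sd, se))
    print(f"n = {n:3d}   SD = {sd:6.2f}   SE = {se:5.2f}")
```

In a run like this, SD should settle around a stable value while SE keeps shrinking as n grows.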
[Figure: Raw Data plotted against Sample Size.]
SD = Square root of the variance (Var = Σ(Xi – Xm)² / (n – 1))
[Figure: SD plotted against Sample Size.]
SE = SD / √(n)
[Figure: SE plotted against Sample Size.]
Population: a data set representing the entire entity of interest
– What is a population?
Sample: a data set representing a portion of a population
[Figure: Population and Sample.]
Population mean: the true mean for that population
– a single number
Sample mean: the estimated population mean
– a range of values (estimate ± 95% confidence interval)
As our sample size increases, we sample more and more of the population. Eventually, we will have sampled the entire population and our sample distribution will be the population distribution.
[Figure: Population and Sample with increasing sample size.]
Mean = x̄ = Σxi / N
Variance = Σ(x – x̄)² / (N – 1)
Standard Deviation = √[Σ(x – x̄)² / (N – 1)]
Standard Error = SD / √N

Individual   Weight   Mean    (Weight – Mean)²
1            26       28.17   4.7089
2            32       28.17   14.6689
3            25       28.17   10.0489
4            26       28.17   4.7089
5            30       28.17   3.3489
6            30       28.17   3.3489
Sum =        169      SOS =   40.8334
N = 6, N – 1 = 5

Mean = 169/6 = 28.17
Range = 25 – 32
SOS = 40.83
Variance = 40.83 / 5 = 8.17
Std. Dev. = √(40.83 / 5) = 2.86
Std. Err. = 2.86 / √6 = 1.17
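Before going to a spreadsheet, the same six-weight example can be worked in Python. This is a sketch with illustrative names; it follows the formulas for mean, SOS, variance, SD, and SE one line at a time.

```python
import math

# Weights of the six individuals in the worked example
weights = [26, 32, 25, 26, 30, 30]

n = len(weights)
mean_w = sum(weights) / n                      # 169 / 6
sos = sum((w - mean_w) ** 2 for w in weights)  # sum of squares
var = sos / (n - 1)                            # divide by degrees of freedom
sd = math.sqrt(var)                            # square root of the variance
se = sd / math.sqrt(n)                         # SD over root of sample size

print(f"Mean      = {mean_w:.2f}")
print(f"Range     = {min(weights)} to {max(weights)}")
print(f"SOS       = {sos:.2f}")
print(f"Variance  = {var:.2f}")
print(f"Std. Dev. = {sd:.2f}")
print(f"Std. Err. = {se:.2f}")
```

With the unrounded mean the SOS is 40.8333; the 40.8334 in the table comes from using the rounded mean of 28.17.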
Go to Excel
MEAN ± CONFIDENCE INTERVAL
When a population is sampled, a mean value is
determined and serves as the point-estimate for that
population.
However, we cannot expect our estimate to be the
exact mean value for the population.
Instead of relying on a single point-estimate, we
estimate a range of values, centered around the
point-estimate, that probably includes the true
population mean.
That range of values is called the confidence interval.
Confidence Interval
Confidence Interval: consists of two numbers (high
and low) computed from a sample that identifies the
range for an interval estimate of a parameter.
There is a 5% chance (95% confidence interval) that
our interval does not include the true population
mean.
ȳ ± t(0.05) × [σ / √(n)]
28.17 ± 2.29
25.88 to 30.45
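The interval for the six-weight example can be sketched as below. One point is an inference rather than something the slides state: the ± 2.29 appears to correspond to a multiplier of about 1.96, the large-sample (normal) value. For comparison, the sketch also uses the two-tailed 0.05 critical t for n – 1 = 5 degrees of freedom (2.571, hard-coded here from a standard t table), which gives a wider interval.

```python
import math
from statistics import NormalDist, mean, stdev

weights = [26, 32, 25, 26, 30, 30]

n = len(weights)
mean_w = mean(weights)
se = stdev(weights) / math.sqrt(n)

# Large-sample multiplier for a 95% interval (about 1.96)
z = NormalDist().inv_cdf(0.975)
half_width_z = z * se
ci_low, ci_high = mean_w - half_width_z, mean_w + half_width_z

# ASSUMPTION: critical value taken from a standard t table (df = 5)
T_CRIT_DF5 = 2.571
half_width_t = T_CRIT_DF5 * se

print(f"{mean_w:.2f} ± {half_width_z:.2f}  ->  {ci_low:.2f} to {ci_high:.2f}")
print(f"{mean_w:.2f} ± {half_width_t:.2f}  (using t with 5 df)")
```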
• Hypothesis Testing
– Null versus Alternative Hypothesis
• Briefly:
– Null Hypothesis: Two means are not different
– Alternative Hypothesis: Two means are different
• The P-value of a test statistic is compared with a predetermined probability (usually 0.05) to reject or fail to reject the null hypothesis
– If P < 0.05 then there is a significant difference
– If P > 0.05 then there is NO significant difference
Are Two Populations The Same?
• Boudreaux: ‘My pond is better than
your lake, cher’!
• Alphonse: ‘Mais non! I’ve got much
bigger fish in my lake’!
• How can the truth be determined?
Two Sample t-test
• Simple comparison of a specific attribute
between two populations
• If the attributes between the two populations
are equal, then the difference between the
two should be zero
• This is the underlying principle of a t-test
• If P-value > 0.05 the means are not significantly
different; If P < 0.05 the means are significantly
different
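To settle the pond versus lake argument, a two-sample t statistic can be computed by hand. This sketch makes assumptions the slides do not spell out: the samples are independent and the variances are pooled (the classic equal-variance form). Instead of computing a P-value, it compares |t| with the two-tailed 0.05 critical value for 18 degrees of freedom (2.101, hard-coded from a standard t table), which is equivalent to checking whether P < 0.05.

```python
import math
from statistics import mean, variance

# Spotted gar lengths (cm) from Table 1
pond = [34, 78, 48, 24, 64, 58, 34, 66, 22, 44]
lake = [38, 82, 58, 76, 60, 70, 99, 40, 68, 91]

n1, n2 = len(pond), len(lake)
df = n1 + n2 - 2

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * variance(pond) + (n2 - 1) * variance(lake)) / df

# Standard error of the difference between the two means
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# If the populations were equal, this difference should be near zero
t_stat = (mean(pond) - mean(lake)) / se_diff

# ASSUMPTION: critical value taken from a standard t table (df = 18)
T_CRIT_DF18 = 2.101
significant = abs(t_stat) > T_CRIT_DF18

print(f"t = {t_stat:.3f} with {df} degrees of freedom")
print(f"Means significantly different at 0.05: {significant}")
```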
Analysis of Variance
• Can compare two or more means
• Compares means to determine if the population
distributions are not similar
• Uses means and confidence intervals much like a
t-test
• Test statistic used is called an F statistic (F-test),
which is used to get the P value
• If P-value > 0.05 the means are not significantly
different; If P< 0.05 the means are significantly
different
• Post-hoc test separates the non-similar ones
Normal Distribution
• Many characteristics follow an approximately normal distribution
– For example: height, length, speed, etc.
• One of the assumptions of the ANOVA
test is that the sample data is ‘normally
distributed.’
Sample Distribution Approaches Normal Distribution With Sample Size
[Figure sequence (three slides): Frequency distributions of the Population and the Sample.]
ANOVA – Analysis of Variance
Calculate a SOS based on an overall mean (total SOS)
[Figure: Pond and Lake lengths.]

Trtmnt   Replicate   Length   Overall Mean   SOSTotal
Pond     1           34       57.7           561.69
Pond     2           78       57.7           412.09
Pond     3           48       57.7           94.09
Pond     4           24       57.7           1135.69
Pond     5           64       57.7           39.69
Pond     6           58       57.7           0.09
Pond     7           34       57.7           561.69
Pond     8           66       57.7           68.89
Pond     9           22       57.7           1274.49
Pond     10          44       57.7           187.69
Lake     1           38       57.7           388.09
Lake     2           82       57.7           590.49
Lake     3           58       57.7           0.09
Lake     4           76       57.7           334.89
Lake     5           60       57.7           5.29
Lake     6           70       57.7           151.29
Lake     7           99       57.7           1705.69
Lake     8           40       57.7           313.29
Lake     9           68       57.7           106.09
Lake     10          91       57.7           1108.89
Sum =                                        9040.2

This provides a measure of the overall variance (Total SOS).
Calculate a SOS for each treatment (Treatment or Error SOS).
[Figure: Pond and Lake lengths.]

Trtmnt   Replicate   Length   Trtmnt Mean   SOSError
Pond     1           34       47.2          174.24
Pond     2           78       47.2          948.64
Pond     3           48       47.2          0.64
Pond     4           24       47.2          538.24
Pond     5           64       47.2          282.24
Pond     6           58       47.2          116.64
Pond     7           34       47.2          174.24
Pond     8           66       47.2          353.44
Pond     9           22       47.2          635.04
Pond     10          44       47.2          10.24
Lake     1           38       68.2          912.04
Lake     2           82       68.2          190.44
Lake     3           58       68.2          104.04
Lake     4           76       68.2          60.84
Lake     5           60       68.2          67.24
Lake     6           70       68.2          3.24
Lake     7           99       68.2          948.64
Lake     8           40       68.2          795.24
Lake     9           68       68.2          0.04
Lake     10          91       68.2          519.84
Sum =                                       6835.2

This provides a measure of the reduction of variance by measuring each treatment separately (Treatment or Error SOS).
What happens to Error SOS when the variability w/in each treatment decreases?
Calculate a SOS for each predicted value vs. the overall mean
(Model SOS)
[Figure: Predicted_Pond, Predicted_Lake, and Overall_Avg.]

Trtmnt   Replicate   Length   Trtmnt Mean   Overall Mean   SOSModel
Pond     1           34       47.2          57.7           110.25
Pond     2           78       47.2          57.7           110.25
Pond     3           48       47.2          57.7           110.25
Pond     4           24       47.2          57.7           110.25
Pond     5           64       47.2          57.7           110.25
Pond     6           58       47.2          57.7           110.25
Pond     7           34       47.2          57.7           110.25
Pond     8           66       47.2          57.7           110.25
Pond     9           22       47.2          57.7           110.25
Pond     10          44       47.2          57.7           110.25
Lake     1           38       68.2          57.7           110.25
Lake     2           82       68.2          57.7           110.25
Lake     3           58       68.2          57.7           110.25
Lake     4           76       68.2          57.7           110.25
Lake     5           60       68.2          57.7           110.25
Lake     6           70       68.2          57.7           110.25
Lake     7           99       68.2          57.7           110.25
Lake     8           40       68.2          57.7           110.25
Lake     9           68       68.2          57.7           110.25
Lake     10          91       68.2          57.7           110.25
Sum =                                                      2205
This provides a
measure of the
distance between
the mean values
(Model SOS).
What happens to
Model SOS when
the two means are
close together?
What if the means
are equal?
Detecting a Difference Between Treatments
• Model SOS gives us an index on how far
apart the two means are from each other.
– Bigger Model SOS = farther apart
• Error SOS gives us an index of how
scattered the data is for each treatment.
– More variability = larger Error SOS = more
possible overlap between treatments
Magic of the F-test
• The ratio of Mean Model SOS to Mean Error SOS (each SOS divided by its degrees of freedom) gives us an overall index (the F statistic) used to indicate the relative ‘distance’ and ‘overlap’ between two means.
– A large Model SOS and small Error SOS = a large F statistic. Why does this indicate a significant difference?
– A small Model SOS and a large Error SOS = a small F statistic. Why does this indicate no significant difference?
• Based on sample size (degrees of freedom), each F statistic has an associated P-value, which is compared with the alpha level (usually 0.05).
– P < 0.05 (Large F statistic) there is a significant difference between the means
– P ≥ 0.05 (Small F statistic) there is NO significant difference
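The three sums of squares and the F statistic for the pond and lake data can be sketched as follows. Assumptions to note: this is a one-way ANOVA with two treatments, each SOS is divided by its degrees of freedom before taking the ratio, and the 0.05 critical value for F with 1 and 18 degrees of freedom (4.41) is hard-coded from a standard F table rather than computed.

```python
# Spotted gar lengths (cm) from Table 1
groups = {
    "Pond": [34, 78, 48, 24, 64, 58, 34, 66, 22, 44],
    "Lake": [38, 82, 58, 76, 60, 70, 99, 40, 68, 91],
}

all_fish = [x for g in groups.values() for x in g]
overall_mean = sum(all_fish) / len(all_fish)

# Total SOS: every fish versus the overall mean
total_sos = sum((x - overall_mean) ** 2 for x in all_fish)

# Error SOS: every fish versus its own treatment mean
error_sos = sum(
    (x - sum(g) / len(g)) ** 2 for g in groups.values() for x in g
)

# Model SOS: each treatment mean versus the overall mean, once per fish
model_sos = sum(
    len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values()
)

df_model = len(groups) - 1             # treatments minus one
df_error = len(all_fish) - len(groups)  # fish minus treatments

f_stat = (model_sos / df_model) / (error_sos / df_error)

# ASSUMPTION: critical value taken from a standard F table (df = 1, 18)
F_CRIT_1_18 = 4.41

print(f"Total SOS = {total_sos:.1f}")
print(f"Error SOS = {error_sos:.1f}")
print(f"Model SOS = {model_sos:.1f}")
print(f"F = {f_stat:.2f}, significant at 0.05: {f_stat > F_CRIT_1_18}")
```

Model SOS plus Error SOS adds up to Total SOS, which is a useful check on the arithmetic.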
Showing Results
[Figure: Chart of treatments 1, 2, and 3, labeled with the letters A, A, and B.]
Regression
• For the purposes of this class:
– Does Y depend on X?
– Does a change in X cause a change in Y?
– Can Y be predicted from X?
• Y= mX + b
[Figure: Dependent Value plotted against Independent Value, showing the actual values, the predicted values, and the overall mean.]
When analyzing a regression-type data set, the first step
is to plot the data:
X      Y
35     114
45     120
55     150
65     140
75     166
Avg:
55     138

[Figure: Scatter plot of Dependent Value (Y) against Independent Value (X).]
The next step is to determine the line that ‘best fits’
these points. It appears this line would be sloped
upward and linear (straight).
The line of best fit is the sample regression of Y on X,
and its position is fixed by two results:
Y = 1.24(X) + 69.8
– slope = 1.24 (Rise/Run)
– Y-intercept = 69.8
[Figure: Regression line of Dependent Value on Independent Value, passing through the point (55, 138).]
1) The regression line passes through the point (Xavg, Yavg).
2) Its slope is at the rate of “m” units of Y per unit of X, where m
= regression coefficient (slope; y=mx+b)
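The slope and intercept can be recovered from the five data points with the usual least-squares formulas. This is a sketch; the names `sxy` and `sxx` are illustrative shorthand for the sum of cross-products and the sum of squares of X.

```python
# Regression example data
x = [35, 45, 55, 65, 75]
y = [114, 120, 150, 140, 166]

x_avg = sum(x) / len(x)
y_avg = sum(y) / len(y)

# Sum of cross-products of X and Y deviations, and sum of squared X deviations
sxy = sum((xi - x_avg) * (yi - y_avg) for xi, yi in zip(x, y))
sxx = sum((xi - x_avg) ** 2 for xi in x)

slope = sxy / sxx                   # m: units of Y per unit of X
intercept = y_avg - slope * x_avg   # b: forces the line through (Xavg, Yavg)

print(f"Y = {slope:.2f}(X) + {intercept:.1f}")
print(f"Line passes through ({x_avg:.0f}, {y_avg:.0f})")
```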
Testing the Regression Line for
Significance
• An F-test is used based on Model, Error, and Total
SOS.
– Very similar to ANOVA
• Basically, we are testing if the regression line has a
significantly different slope than a line formed by
using just Y_avg.
– If there is no difference, then that means that Y
does not change as X changes (stays around the
average value)
• To begin, we must first find the regression line that
has the smallest Error SOS.
Error SOS
The regression line should pass through the overall average with
a slope that has the smallest Error SOS (Error SOS = the distance
between each point and predicted line: gives an index of the
variability of the data points around the predicted line).
[Figure: Dependent Value plotted against Independent Value. The overall average (55, 138) is the pivot point.]
For each X, we can predict Y: Y = 1.24(X) + 69.8
Error SOS is calculated as the sum of (Y_Actual – Y_Predicted)²
This gives us an index of how scattered the actual observations
are around the predicted line. The more scattered the points,
the larger the Error SOS will be. This is like analysis of
variance, except we are using the predicted line instead of the
mean value.
X      Y_Actual   Y_Pred   SOSError
35     114        113.2    0.64
45     120        125.6    31.36
55     150        138      144
65     140        150.4    108.16
75     166        162.8    10.24
Sum =                      294.4
Total SOS
• Calculated as the sum of (Y – Y_avg)²
• Gives us an index of how scattered our data
set is around the overall Y average.
[Figure: Dependent Value plotted against Independent Value around the overall Y average. Regression line not shown.]
Total SOS gives us an index of how scattered the data points are around the overall average. This is calculated the same way as for a single treatment in ANOVA.
X      Y_Actual   Y Average   SOSTotal
35     114        138         576
45     120        138         324
55     150        138         144
65     140        138         4
75     166        138         784
Sum =                         1832
What happens to Total SOS when all of the points are
close to the overall average? What happens when the
points form a non-horizontal linear trend?
Model SOS
• Calculated as the sum of (Y_Predicted – Y_avg)²
• Gives us an index of how far all of the
predicted values are from the overall
average.
[Figure: Dependent Value plotted against Independent Value, showing the distance between predicted Y and the overall mean.]
Model SOS
• Gives us an index of how far away the predicted
values are from the overall average value
X      Y_Pred   Y Average   SOSModel
35     113.2    138         615.04
45     125.6    138         153.76
55     138      138         0
65     150.4    138         153.76
75     162.8    138         615.04
Sum =                       1537.6

• What happens to Model SOS when all of the predicted values are close to the average value?
All Together Now!!
X      Y_Actual   Y_Pred   SOSError   Y_Avg   SOSTotal   SOSModel
35     114        113.2    0.64       138     576        615.04
45     120        125.6    31.36      138     324        153.76
55     150        138      144        138     144        0
65     140        150.4    108.16     138     4          153.76
75     166        162.8    10.24      138     784        615.04
Sum =                      294.4              1832       1537.6

SOSError = Σ (Y_Actual – Y_Pred)²
SOSTotal = Σ (Y_Actual – Y_Avg)²
SOSModel = Σ (Y_Pred – Y_Avg)²
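All three regression sums of squares can be computed in one pass. This sketch takes the fitted line Y = 1.24(X) + 69.8 as given; the function name `predict` is illustrative.

```python
# Regression example data
x = [35, 45, 55, 65, 75]
y_actual = [114, 120, 150, 140, 166]


def predict(xi):
    """Predicted Y from the fitted line Y = 1.24(X) + 69.8."""
    return 1.24 * xi + 69.8


y_pred = [predict(xi) for xi in x]
y_avg = sum(y_actual) / len(y_actual)

# Scatter of the actual points around the predicted line
sos_error = sum((ya - yp) ** 2 for ya, yp in zip(y_actual, y_pred))

# Scatter of the actual points around the overall Y average
sos_total = sum((ya - y_avg) ** 2 for ya in y_actual)

# Distance of the predicted values from the overall Y average
sos_model = sum((yp - y_avg) ** 2 for yp in y_pred)

print(f"SOSError = {sos_error:.1f}")
print(f"SOSTotal = {sos_total:.1f}")
print(f"SOSModel = {sos_model:.1f}")
```

As in ANOVA, SOSModel plus SOSError equals SOSTotal.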
Using SOS to Assess Regression Line
• Model SOS gives us an index on how ‘different’ the
predicted values are from the average values.
– Bigger Model SOS = more different
– Tells us how different a sloped line is from a line made
up only of Y_avg.
– Remember, the regression line will pass through the
overall average point.
• Error SOS gives us an index of how different the
predicted values are from the actual values
– More variability = larger Error SOS = large distance
between predicted and actual values
Magic of the F-test
• The ratio of Mean Model SOS to Mean Error SOS (each SOS divided by its degrees of freedom) gives us an overall index (the F statistic) used to indicate the relative ‘difference’ between the regression line and a line with slope of zero (all values = Y_avg).
– A large Model SOS and small Error SOS = a large F statistic. Why does this indicate a significant difference?
– A small Model SOS and a large Error SOS = a small F statistic. Why does this indicate no significant difference?
• Based on sample size (degrees of freedom), each F statistic has an associated P-value, which is compared with the alpha level (usually 0.05).
– P < 0.05 (Large F statistic) there is a significant difference between the regression line and the Y_avg line.
– P ≥ 0.05 (Small F statistic) there is NO significant difference between the regression line and the Y_avg line.
F = Mean Model SOS / Mean Error SOS
Basically, this is an index that tells us how different the regression line is from Y_avg, and the scatter of the data around the predicted values.
[Figures: Two plots of Dependent Value against Independent Value.]
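For the regression example, the F statistic can be worked out from the sums of squares found earlier (Model SOS = 1537.6, Error SOS = 294.4, five data points). This sketch uses the standard degrees of freedom for a line with one X variable (1 for the model, n – 2 for error), which the slides do not state explicitly, and a critical value for F with 1 and 3 degrees of freedom (10.13) hard-coded from a standard F table.

```python
# Sums of squares from the regression example
model_sos = 1537.6
error_sos = 294.4
n = 5  # number of data points

df_model = 1      # one slope estimated
df_error = n - 2  # points minus the two fitted values (slope, intercept)

mean_model_sos = model_sos / df_model
mean_error_sos = error_sos / df_error

f_stat = mean_model_sos / mean_error_sos

# ASSUMPTION: critical value taken from a standard F table (df = 1, 3)
F_CRIT_1_3 = 10.13

print(f"Mean Model SOS = {mean_model_sos:.1f}")
print(f"Mean Error SOS = {mean_error_sos:.2f}")
print(f"F = {f_stat:.2f}, significant at 0.05: {f_stat > F_CRIT_1_3}")
```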
Correlation (r):
Another measure of the mutual linear
relationship between two variables.
• ‘r’ is a pure number without units or dimensions
• ‘r’ is always between –1 and 1
• Positive values indicate that y increases when x does and
negative values indicate that y decreases when x
increases.
– What does r = 0 mean?
• ‘r’ is a measure of intensity of association observed
between x and y.
– ‘r’ does not predict – only describes associations
between variables
[Figures: Three scatter plots of Dependent Variable against Independent Variable, showing r > 0, r < 0, and r = 0.]
r is also called Pearson’s correlation coefficient.
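As an illustration, Pearson's r for the regression example data can be computed from the deviations of X and Y around their averages. This is a sketch with illustrative variable names.

```python
import math

# Regression example data
x = [35, 45, 55, 65, 75]
y = [114, 120, 150, 140, 166]

x_avg = sum(x) / len(x)
y_avg = sum(y) / len(y)

# Cross-products and squared deviations around the averages
sxy = sum((xi - x_avg) * (yi - y_avg) for xi, yi in zip(x, y))
sxx = sum((xi - x_avg) ** 2 for xi in x)
syy = sum((yi - y_avg) ** 2 for yi in y)

# Pearson's r: unitless, always between -1 and 1
r = sxy / math.sqrt(sxx * syy)

print(f"r = {r:.4f}")
```

Here r is positive, matching the upward slope of the regression line.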
R-square
• If we square r, we get rid of the negative sign (if r is negative) and we get an index of how close the data points are to the regression line.
• Allows us to decide how much
confidence we have in making a
prediction based on our model.
• Is calculated as Model SOS / Total SOS
r² = Model SOS / Total SOS
[Figure: Regression plot of Dependent Value against Independent Value showing Model SOS and Total SOS, R² = 0.8393.]
[Figure: Scatter plot with R² = 0.0144: small numerator, big denominator.]
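The R² of 0.8393 follows directly from the regression example's sums of squares (Model SOS = 1537.6, Total SOS = 1832). A minimal sketch:

```python
import math

# Sums of squares from the regression example
model_sos = 1537.6
total_sos = 1832.0

# Share of the total scatter that the regression line accounts for
r_squared = model_sos / total_sos

# Taking the positive root because the fitted slope (1.24) is positive
r = math.sqrt(r_squared)

print(f"R-square = {r_squared:.4f}")
print(f"r = {r:.4f}")
```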
R-square and Prediction Confidence
[Figures: Four scatter plots with R² = 0.0144, R² = 0.5537, R² = 0.7605, and R² = 0.9683.]
Finally……..
• If we have a significant relationship
(based on the p-value), we can use the
r-square value to judge how sure we are
in making a prediction.