Download Word - UC Davis Plant Sciences

HOMEWORK TOPIC 1&2.
Due Thursday January 15 at the beginning of the lab class. Indicate clearly the
procedures used in each exercise.
Question 1 ANSWER
[15 points]
1.1. The statistics for the data set are as follows:

Sample    Observations                     Mean   St Dev   CV (%)   SE      s²
1         18 29 10 14 16 12 19 21 19 15    17.3   5.33     30.8     1.687   28.46
2         17 12 27 22 20 11 20 24 16 17    18.6   5.04     27.1     1.593   25.38
3         18 23 19 17 34 14 23 20 21 14    20.3   5.77     28.4     1.826   33.34
4         16 23 14 20 14 11 23 15 17 17    17.0   3.94     23.2     1.247   15.56
Average                                    18.3   5.02     27.4     1.588   25.683
Plus 20

Sample    Observations                     Mean   St Dev   CV (%)   SE
1         38 49 30 34 36 32 39 41 39 35    37.3   5.33     14.3     1.687
2         37 32 47 42 40 31 40 44 36 37    38.6   5.04     13.1     1.593
3         38 43 39 37 54 34 43 40 41 34    40.3   5.77     14.3     1.826
4         36 43 34 40 34 31 43 35 37 37    37.0   3.94     10.7     1.247
Average                                    38.3   5.02     13.1     1.588
Theoretical values: mean μ = 18, σ = 5, σ² = 25, SE = 1.581
1.1. Overall average of the 40 numbers: 18.30
Average of the four standard deviations: 5.02
Average of the four standard errors of the mean: 1.588, similar to the theoretical SE of the mean, 1.581 (5/SQRT(10)).
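The summary statistics in 1.1 can be cross-checked with a short Python sketch. It uses only the standard library; the function name `describe` and the dictionary layout are illustrative choices, not part of the original answer.

```python
import math
import statistics

# The four samples of n = 10 from the table in 1.1.
samples = {
    1: [18, 29, 10, 14, 16, 12, 19, 21, 19, 15],
    2: [17, 12, 27, 22, 20, 11, 20, 24, 16, 17],
    3: [18, 23, 19, 17, 34, 14, 23, 20, 21, 14],
    4: [16, 23, 14, 20, 14, 11, 23, 15, 17, 17],
}

def describe(values):
    """Return mean, sample SD, CV (%), SE of the mean and sample variance."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # uses the n - 1 denominator
    return {"mean": mean, "sd": sd, "cv": 100 * sd / mean,
            "se": sd / math.sqrt(n), "var": sd ** 2}

stats = {k: describe(v) for k, v in samples.items()}
for k, s in stats.items():
    print(k, {name: round(x, 3) for name, x in s.items()})

theoretical_se = 5 / math.sqrt(10)  # sigma / sqrt(n) for sigma = 5, n = 10
mean_se = statistics.mean(s["se"] for s in stats.values())
print(round(mean_se, 3), round(theoretical_se, 3))
```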
1.2. The addition of 20 to each value results in an increase of 20 in the means but no change in
the standard deviations or standard error of the means. The CV's decrease due to the increase in
the means (recall CV = Standard Dev./Mean)
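As a quick illustration of 1.2, the sketch below shifts Sample 1 by 20 and compares the standard deviation and CV before and after. The variable names are illustrative.

```python
import statistics

sample1 = [18, 29, 10, 14, 16, 12, 19, 21, 19, 15]
shifted = [y + 20 for y in sample1]  # add the constant 20 to every value

sd_original = statistics.stdev(sample1)
sd_shifted = statistics.stdev(shifted)

# CV = 100 * SD / mean: the SD is unchanged, the mean grows, so the CV shrinks.
cv_original = 100 * sd_original / statistics.mean(sample1)
cv_shifted = 100 * sd_shifted / statistics.mean(shifted)

print(round(sd_original, 2), round(sd_shifted, 2))
print(round(cv_original, 1), round(cv_shifted, 1))
```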
1.3. Average of the four sample means: 18.30, the same as the average of all 40 observations.
Average of the four standard errors: 1.588.
Standard deviation of the four means: 1.503.
Both are close to the theoretical SE = 5/SQRT(10) = 1.581.
1.4. By taking random samples of size 10 and finding the sample means, one offsets the
variability of the observations against one another. The effect of exceptionally large or small
values is “diluted”. Therefore, a set of sample means deviates less from μ (has less dispersion
about μ) than a set of individual variates. We see this here, where the standard deviation of the
four means (1.503) is much smaller than the average of the standard deviations of the four samples
(5.02). In fact, the two quantities are related through the formula SE = s/SQRT(n):
5.02/SQRT(10) = 1.588, which approximates the observed standard deviation of the means.
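The comparison in 1.4 can be reproduced from the rounded table values, as in this minimal sketch. Because it starts from rounded means and standard deviations, the last digit may differ slightly from the figures quoted above.

```python
import math
import statistics

sample_means = [17.3, 18.6, 20.3, 17.0]  # means of the four samples
sample_sds = [5.33, 5.04, 5.77, 3.94]    # standard deviations of the four samples
n = 10                                   # observations per sample

# Empirical spread of the sample means.
sd_of_means = statistics.stdev(sample_means)

# Spread predicted from the within-sample SDs: s / sqrt(n).
se_from_sds = statistics.mean(sample_sds) / math.sqrt(n)

print(round(sd_of_means, 3), round(se_from_sds, 3))
```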
1.5. Using n = 40 and all the samples combined, the distribution of sample means for sample size n = 40 per mean has standard error

\[ \sigma_{\bar{Y}(n=40)} = \frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{40}} = 0.79057 \]

\[ Z_i = \frac{\bar{Y}_i - \mu}{\sigma_{\bar{Y}(n=40)}} = \frac{18.3 - 18.00}{0.79057} = 0.379473 \]

P(Z ≥ 0.38) = 0.3520. Conclusion: this is not an unusual sample for a population with a mean of 18
and a variance of 25; similar or larger values occur about 35% of the time.
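The calculation in 1.5 can be checked numerically with the standard library's `statistics.NormalDist`, used here in place of the Z table. The result differs from the table value only in the last digit, because the table rounds Z to 0.38.

```python
import math
from statistics import NormalDist

mu, sigma, n = 18.0, 5.0, 40  # population mean, population SD, combined sample size
grand_mean = 18.3             # mean of all 40 observations

se = sigma / math.sqrt(n)            # standard error of a mean of 40 observations
z = (grand_mean - mu) / se           # standardized deviation of the observed mean
p_upper = 1 - NormalDist().cdf(z)    # P(Z >= z), upper-tail probability

print(round(se, 5), round(z, 6), round(p_upper, 3))
```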
1.6. The average difference is -1, close to the expected value of 0.
The average standard deviation is 7.58 (and the average σ² = 58.57).
This is close to the expected value of σ² = 50. (Remember that the variance of A - B = σ²A + σ²B;
even though you are subtracting the samples, the errors accumulate.)
          S2-S1    S4-S3    Average
          -1       -2
          -17      0
          17       -5
          8        3
          4        -20
          -1       -3
          1        0
          3        -5
          -3       -4
          2        3
Mean      1.3      -3.3     -1.0
St Dev    8.60     6.57     7.58
SE        2.720    2.077    2.399
Var       74.01    43.12    58.57
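The difference table can be rebuilt from the original samples, which also shows the variances of the differences scattering around the expected σ²A + σ²B = 50. This is a sketch; the variable names are illustrative.

```python
import statistics

s1 = [18, 29, 10, 14, 16, 12, 19, 21, 19, 15]
s2 = [17, 12, 27, 22, 20, 11, 20, 24, 16, 17]
s3 = [18, 23, 19, 17, 34, 14, 23, 20, 21, 14]
s4 = [16, 23, 14, 20, 14, 11, 23, 15, 17, 17]

# Element-wise differences between independent samples.
d21 = [b - a for a, b in zip(s1, s2)]
d43 = [b - a for a, b in zip(s3, s4)]

var21 = statistics.variance(d21)
var43 = statistics.variance(d43)
mean_var = (var21 + var43) / 2

expected_var = 25 + 25  # variances add for a difference of independent variables

print(round(var21, 2), round(var43, 2), round(mean_var, 2), expected_var)
```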
Question 2 ANSWER
[15 points]
2.1. Given a Normal distribution Y with mean μ = 1.00 and σ² = 4.00 (σ = 2). Find
2.1.1. P(Y ≤ 3.44) = P(Z ≤ (3.44 - 1.00)/2) = P(Z ≤ 1.22) = 1 - 0.1112 = 0.8888
2.1.2. P(0.0 ≤ Y ≤ 2.66) = P(-0.5 ≤ Z ≤ 0.83) = P(Z ≤ 0.83) - P(Z ≤ -0.5)
P(Z ≤ 0.83) = 0.7967
P(Z ≤ -0.5) = P(Z ≥ 0.5) = 0.3085
P(0.0 ≤ Y ≤ 2.66) = 0.7967 - 0.3085 = 0.4882
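The two table lookups above can be checked with `statistics.NormalDist`, which evaluates the normal CDF directly so no standardization step is needed. This is a sketch of a cross-check, not the table-based procedure the exercise asks for.

```python
from statistics import NormalDist

# Y ~ Normal(mean = 1, variance = 4), so sigma = 2.
Y = NormalDist(mu=1.0, sigma=2.0)

p_211 = Y.cdf(3.44)               # P(Y <= 3.44)
p_212 = Y.cdf(2.66) - Y.cdf(0.0)  # P(0 <= Y <= 2.66)

print(round(p_211, 4), round(p_212, 4))
```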
2.1.3. P(Y ≤ Yo) = 0.6026 → P(Y ≥ Yo) = 0.3974
In the Z table: P(Z ≥ 0.26) = 0.3974
If Zo = (Yo - μ)/σ → Yo = Zo*σ + μ
P(Y ≥ (0.26*σ + μ)) = P(Y ≥ (0.26*2 + 1)) = 0.3974
P(Y ≥ 1.52) = 0.3974 → Yo = 1.52
Checking
P(Y ≤ 1.52) = 1 - P(Y ≥ 1.52) = 1 - P(Z ≥ (1.52 - 1)/2) = 1 - P(Z ≥ 0.26) = 1 - 0.3974 = 0.6026
2.1.4. Remember that
P(|Z| ≤ Zo): probability that a random Z will be numerically less than Zo, that is, lie within the interval (-Zo, Zo)
P(|Z| ≥ Zo): probability that a random Z will numerically exceed Zo, that is, lie outside the interval (-Zo, Zo)
Here the interval is taken symmetric about the mean of Y:
P(|Y| ≤ Yo) = 0.975 → P(|Y| ≥ Yo) = 1 - 0.975 = 0.025 → 2P(Y ≥ Yo) = 0.025 →
P(Y ≥ Yo) = 0.0125 → P(Z ≥ 2.2414) = 0.0125 →
Right border: P(Y ≥ 1 + 2.2414*2) = 0.0125 → P(Y ≥ 5.48) = 0.0125 → Yo = 5.48
Left border: P(Y ≤ 1 - 2.2414*2) = 0.0125 → P(Y ≤ -3.48) = 0.0125 → Yo = -3.48
The two borders are not numerically identical because the mean is now 1 instead of 0. Both lie at
the same distance from the mean: |5.48 - 1| = |-3.48 - 1| = 4.48
If you shift every value 1 unit to the left by subtracting 1, then
Left border: -3.48 - 1 = -4.48, mean = 1 - 1 = 0, right border: 5.48 - 1 = 4.48
When the mean is 0 the left and right borders have the same absolute value.
Checking
P(-3.48 ≤ Y ≤ 5.48) = 1 - 2P(Y ≥ 5.48) = 1 - 2P(Z ≥ 2.24) = 1 - 2*0.0125 = 0.975
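The inverse lookups in 2.1.3 and 2.1.4 can be cross-checked with `NormalDist.inv_cdf`, which returns the value below which a given probability lies. This is a sketch; the variable names are illustrative.

```python
from statistics import NormalDist

Y = NormalDist(mu=1.0, sigma=2.0)

# 2.1.3: the value Yo with P(Y <= Yo) = 0.6026.
yo_213 = Y.inv_cdf(0.6026)

# 2.1.4: borders that leave 0.0125 in each tail (0.975 in the middle).
lower = Y.inv_cdf(0.0125)
upper = Y.inv_cdf(1 - 0.0125)

# Both borders are equally far from the mean of 1.
half_width = upper - Y.mean

print(round(yo_213, 2), round(lower, 2), round(upper, 2), round(half_width, 2))
```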
2.2. Given that Y is normally distributed with mean μ = 10 and variance σ² = 25 (σ = 5) and that
a sample of 25 observations is drawn, then SE of the mean = 5/SQRT(25) = 1
2.2.1. P(Ȳ ≥ 13) = P(Z ≥ (13 - 10)/1) = P(Z ≥ 3) = 0.0013
2.2.2. P(7 ≤ Ȳ ≤ 13) = P(-3 ≤ Z ≤ 3) = P(|Z| ≤ 3) = 1 - 2P(Z ≥ 3) = 1 - 2*0.0013 = 0.9974
2.2.3. μ = 24 and σ² = 12; N = ? if P(Ȳ ≥ 26) = 0.1587
P(Ȳ ≥ 26) = 0.1587, then P(Z ≥ (26 - 24)/SQRT(12/N)) = 0.1587. Since P(Z ≥ 1) = 0.1587,
2/SQRT(12/N) = 1, then 2 = SQRT(12/N), then 4 = 12/N, then N = 12/4 = 3.
Checking
SE of the mean = SQRT(12/3) = 2
P(Ȳ ≥ 26) = P(Z ≥ (26 - 24)/2) = P(Z ≥ 1) = 0.1587
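The steps of 2.2 can be sketched in Python as follows. The sample size in 2.2.3 is obtained by solving (cutoff - μ)/SQRT(σ²/N) = z for N; the result is close to 3 rather than exactly 3 because 0.1587 is a rounded tail probability.

```python
import math
from statistics import NormalDist

# 2.2.1: distribution of the mean of 25 observations from N(10, 25).
mean_dist = NormalDist(mu=10, sigma=5 / math.sqrt(25))
p_221 = 1 - mean_dist.cdf(13)                    # P(Ybar >= 13)
p_222 = mean_dist.cdf(13) - mean_dist.cdf(7)     # P(7 <= Ybar <= 13)

# 2.2.3: find N such that P(Ybar >= 26) = 0.1587 when mu = 24, variance = 12.
mu, var, cutoff, tail = 24, 12, 26, 0.1587
z = NormalDist().inv_cdf(1 - tail)               # about 1
n_required = var * (z / (cutoff - mu)) ** 2      # from (cutoff - mu)/sqrt(var/N) = z

print(round(p_221, 4), round(p_222, 4), round(n_required, 2))
```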
2.3. Given a χ² distribution with 12 degrees of freedom. Find
2.3.1. P(χ² ≤ 21.0) = 1 - P(χ² ≥ 21.0) = 1 - 0.05 = 0.95
2.3.2. Find χo² such that P(χ² ≥ χo²) = 0.10: χo² = 18.5
2.4. Given a t distribution. Find
2.4.1. Ȳ = 10 g, s² = 4 (population variance unknown). What is the approximate
probability that the mean weight of a bag of 16 oysters is less than 8 g?
P(Ȳ ≤ 8) = P(t ≤ (8 - 10)/SQRT(4/16)) = P(t ≤ -2/0.5) = P(t ≤ -4) = P(t ≥ 4) = 0.00058, or
approximately 0.0006 (15 d.f.)
2.4.2. Find to such that 80% of the values are within the interval from -to to +to for 22 d.f.
P(|t| ≤ to) = 0.8, then P(|t| ≥ to) = 0.20, then to = 1.321
Question 3 ANSWER
[20 points]
data PROBLEM3;
input GPC1 $ PROTEIN;
cards;
No 11.6
No 9.3
No 11.7
No 12.7
No 13.4
No 9.2
No 7.9
No 14.1
No 12.0
No 10.6
No 12.0
No 10.1
No 12.1
Yes 18.1
Yes 15.2
Yes 17.2
Yes 11.9
Yes 14.7
Yes 12.0
Yes 12.9
Yes 12.7
Yes 10.9
Yes 14.5
Yes 12.8
Yes 12.9
Yes 14.9
;
* Sort by group so that BY-group processing works;
PROC SORT;
By GPC1;
* 3.1: normality tests and plots for each group;
PROC UNIVARIATE normal plot;
by GPC1;
* 3.2: two-sample t test;
PROC TTEST;
Class GPC1;
Var PROTEIN;
run;
* 3.3: power for 13 to 17 replications per group;
proc power;
twosamplemeans
meandiff = 2.6154
stddev = 1.9521
npergroup = 13 14 15 16 17
power = .;
run;
* 3.4: replications needed for a power of 0.85 at alpha = 0.01;
proc power;
twosamplemeans
meandiff = 2.6154
stddev = 1.9521
alpha= 0.01
npergroup = .
power = 0.85;
run; quit;
3.1. Both groups are consistent with a normal distribution.

Tests for Normality, No GPC
Test            Statistic       p Value
Shapiro-Wilk    W 0.965582      Pr < W 0.8366

Tests for Normality, Yes GPC
Test            Statistic       p Value
Shapiro-Wilk    W 0.932847      Pr < W 0.3711

Q-Q plot
The good fit of the Q-Q plots to the expected normal line agrees with the high W values and the
non-significant Shapiro-Wilk tests. We therefore do not reject the assumption that the data are
normally distributed.
3.2.
GPC1          N     Mean       Std Dev   Std Err   Minimum   Maximum
No            13    11.2846    1.7790    0.4934    7.9000    14.1000
Yes           13    13.9000    2.1111    0.5855    10.9000   18.1000
Diff (1-2)          -2.6154    1.9521    0.7657

Method          Variances   DF      t Value   Pr > |t|
Pooled          Equal       24      -3.42     0.0023
Satterthwaite   Unequal     23.33   -3.42     0.0023

Equality of Variances
Method     Num DF   Den DF   F Value   Pr > F
Folded F   12       12       1.41      0.5624
We reject the null hypothesis that the two group means are equal. The means are significantly
different (P = 0.0023). The GPC gene increases the protein content of the grain.
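The pooled t statistic reported by PROC TTEST can be reproduced from the raw data with the standard library. This is a minimal sketch of the equal-variance (pooled) test only; it does not compute the p-value, which would need a t distribution function.

```python
import math
import statistics

# Protein values copied from the PROBLEM3 data step.
no_gpc = [11.6, 9.3, 11.7, 12.7, 13.4, 9.2, 7.9, 14.1, 12.0, 10.6, 12.0, 10.1, 12.1]
gpc = [18.1, 15.2, 17.2, 11.9, 14.7, 12.0, 12.9, 12.7, 10.9, 14.5, 12.8, 12.9, 14.9]

n1, n2 = len(no_gpc), len(gpc)
df = n1 + n2 - 2

# Pooled variance: df-weighted average of the two sample variances.
pooled_var = ((n1 - 1) * statistics.variance(no_gpc)
              + (n2 - 1) * statistics.variance(gpc)) / df

# Standard error of the difference between the two means.
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (statistics.mean(no_gpc) - statistics.mean(gpc)) / se_diff

print(df, round(se_diff, 4), round(t_stat, 2))
```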
3.3. Power Analysis
Using SAS
Computed Power
Index   N Per Group   Power
1       13            0.906
2       14            0.927
3       15            0.943
4       16            0.956
5       17            0.966
Hand calculation of power
Using Section 2.3.2: n = 13, 2n - 2 = 24
t(α/2, 2(n-1)) = t(0.025, 24 df) = 2.064 (α/2 = 0.025)

          Mean      s      s²
No Gpc    11.2846   1.78   3.165
Gpc       13.9000   2.11   4.457
Average s² = 3.811

|μ1 - μ2| = 2.6154
SQRT((2*s²)/n) = SQRT((2*3.811)/13) = 0.765707
P(t > 2.064 - (2.6154/0.765707)) = P(t > -1.3517) = 1 - P(t > 1.3517) ≈ 1 - 0.10 = 0.90
Or using Section 2.4.4:
\[ r = 2\,\frac{s_{pooled}^2}{\delta^2}\,\left(t_{\alpha/2,\,n_1+n_2-2} + t_{\beta,\,n_1+n_2-2}\right)^2 \]
With r = 13:
13 = 2 (3.811/6.8402) (2.064 + t(β, n1+n2-2))²
13 = 1.1143 (2.064 + t(β, n1+n2-2))²
SQRT(13/1.1143) - 2.064 = t(β, n1+n2-2)
t(β, n1+n2-2) = 1.3517
Then β < 0.10 and the power = 1 - β > 0.90
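The two hand calculations above lead to the same value of t(β), which the following sketch makes explicit. The critical value 2.064 is taken from the text (t table, 24 d.f.), not computed, because the standard library has no t distribution.

```python
import math

n = 13                # replications per group
pooled_var = 3.811    # average of the two sample variances
delta = 2.6154        # absolute difference between the two means
t_alpha_half = 2.064  # t(0.025, 24 df), copied from the text

# Route 1 (Section 2.3.2): shift the critical value by delta / SE of the difference.
se_diff = math.sqrt(2 * pooled_var / n)
t_beta_route1 = delta / se_diff - t_alpha_half

# Route 2 (Section 2.4.4): solve r = ratio * (t_alpha_half + t_beta)^2 for t_beta.
ratio = 2 * pooled_var / delta ** 2
t_beta_route2 = math.sqrt(n / ratio) - t_alpha_half

print(round(se_diff, 6), round(ratio, 4), round(t_beta_route1, 4), round(t_beta_route2, 4))
```

Both routes are algebraically identical, so they agree up to rounding; looking up 1.3517 in the t table at 24 d.f. then gives β just under 0.10.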
3.4. Power of 85% and alpha = 0.01. Average s² = 3.811
\[ r = 2\,\frac{s_{pooled}^2}{\delta^2}\,\left(t_{\alpha/2,\,n_1+n_2-2} + t_{\beta,\,n_1+n_2-2}\right)^2 \]
Approximate with Normal:
r = 2 (3.811/(2.6154)²) (2.575 + 1.035)² = 14.5
Using t with n = 16:
r = 2 (3.811/(2.6154)²) (2.75 + 1.055)² = 16.13
Then at least 17 are needed to have at least 0.85 power. Actual power with 17 based on SAS = 0.87

guesstimate n    16      17
df = 2(n - 1)    30      32
t0.005           2.75    2.7385
t0.15            1.055   1.0535
estimated n      16.1    16.0

It is not possible to have 16.014 replications to achieve a power of at least 0.85; therefore
we must round up to 17 replications. The iterations suggest that 17 replications would
achieve a power of at least 0.85.
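The iteration in 3.4 can be sketched as a small function. The critical values are copied from the text and the table above (they are inputs here, not computed), and the helper name `reps_needed` is illustrative.

```python
import math

def reps_needed(pooled_var, delta, crit_alpha, crit_beta):
    """r = 2 * (s^2 / delta^2) * (crit_alpha + crit_beta)^2."""
    return 2 * pooled_var / delta ** 2 * (crit_alpha + crit_beta) ** 2

pooled_var, delta = 3.811, 2.6154

# First pass with normal quantiles, then refinement with t quantiles.
r_normal = reps_needed(pooled_var, delta, 2.575, 1.035)   # z(0.005), z(0.15)
r_t16 = reps_needed(pooled_var, delta, 2.75, 1.055)       # t values for 30 df
r_t17 = reps_needed(pooled_var, delta, 2.7385, 1.0535)    # t values for 32 df

# Replications must be whole numbers, so round up.
final_n = math.ceil(r_t17)

print(round(r_normal, 1), round(r_t16, 2), round(r_t17, 1), final_n)
```

Starting from the normal quantiles gives a quick lower bound; the t-based passes raise the estimate slightly because t critical values exceed the corresponding z values.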
Using SAS
Computed N Per Group
Actual Power   N Per Group
0.870          17

Question 4 ANSWER
[10 points]
4.1. Since the experiment was aimed at detecting an increase in weight, the test is one-tailed,
with α = 0.05 (Type I error).
Using the power formula:
\[ \text{Power} = P\left(Z > z_{\alpha} - \frac{|\mu_1 - \mu_2|}{\sigma_{\bar{Y}_1 - \bar{Y}_2}}\right) = P\left(Z > z_{\alpha} - \frac{|\mu_1 - \mu_2|}{\sqrt{2\sigma^2/n}}\right) = P\left(Z > 1.645 - \frac{70}{\sqrt{2 \cdot 2500/4}}\right) \]
= P(Z > -0.3349) = 1 - 0.36885 = 0.63
This could also be done using the equation from Section 2.4.4 in the lecture notes.
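The power in 4.1 can be checked numerically. The inputs (difference 70, variance 2500, n = 4 per group) are read from the formula above; the variable names are illustrative.

```python
import math
from statistics import NormalDist

delta = 70      # difference to detect, |mu1 - mu2|
sigma2 = 2500   # population variance
n = 4           # observations per group
alpha = 0.05    # one-tailed Type I error rate

z_alpha = NormalDist().inv_cdf(1 - alpha)   # one-tailed critical value, about 1.645
se_diff = math.sqrt(2 * sigma2 / n)         # SE of the difference between two means
shift = z_alpha - delta / se_diff           # about -0.335
power = 1 - NormalDist().cdf(shift)         # P(Z > shift)

print(round(z_alpha, 3), round(se_diff, 3), round(shift, 4), round(power, 2))
```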
4.2. Table
Z(α/2 = 0.005) = 2.575   Z(β = 0.2) = 0.8416
(Zα/2 + Zβ)² = 11.673
\[ r = 2\,(\sigma/\delta)^2\,(Z_{\alpha/2} + Z_{\beta})^2 \]

Distance (in σ)   N
¼                 374
½                 94
¾                 42
1                 24
1¼                15
1½                11
1¾                8
2                 6

Question 5 ANSWER
[15 points]
Data Prob5;
Input C T @@; *paired samples;
DIFF = C-T;
Cards;
15.675 14.135
17.160 15.510
18.480 15.015
19.745 17.050
19.470 16.720
19.855 19.525
18.590 17.160
18.150 16.610
18.975 17.820
15.620 16.390
;
Proc TTEST;
paired C*T; * assumes paired samples;
proc power;
onesamplemeans
mean = 1.579
ntotal = 10
stddev = 1.223
nullmean= 0
alpha= 0.05
power = .;
run; quit;
Two sample paired test
Mean Diff   95% CL Mean        Std Dev   95% CL Std Dev
1.5785      0.7036   2.4534    1.223     0.8412   2.2327

DF   t Value   Pr > |t|
9    4.08      0.0028

One sample DIFF
Fixed Scenario Elements
Distribution         Normal
Method               Exact
Null Mean            0
Alpha                0.05
Mean                 1.579
Standard Deviation   1.223
Total Sample Size    10
Number of Sides      2

Computed Power
Power
0.951
5.1. There is a significant difference between treatment and control (P = 0.0028).
5.2. The power of the test is 0.95.
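The paired t statistic can be recomputed from the data lines of the Prob5 step. This sketch assumes the first value of each pair is C (control) and the second is T (treatment), which matches the positive mean difference in the SAS output.

```python
import math
import statistics

# Pairs copied from the Prob5 data step (assumed order: C then T).
control = [15.675, 17.160, 18.480, 19.745, 19.470, 19.855, 18.590, 18.150, 18.975, 15.620]
treated = [14.135, 15.510, 15.015, 17.050, 16.720, 19.525, 17.160, 16.610, 17.820, 16.390]

# A paired test is a one-sample test on the within-pair differences.
diffs = [c - t for c, t in zip(control, treated)]
n = len(diffs)

mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)
se_diff = sd_diff / math.sqrt(n)
t_stat = mean_diff / se_diff   # tests H0: mean difference = 0, with n - 1 d.f.

print(round(mean_diff, 4), round(sd_diff, 3), n - 1, round(t_stat, 2))
```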
5.3. Graphical representation of DIFF
Question 6 ANSWER
[15 points]
Four locations. Average yield = 7,000 lb/ac, standard deviation = 450 lb/ac. How many
locations do we need to estimate the true mean yield with a 95% confidence interval of
total length less than 800 lb/acre? d = 400 lb/ac
First estimate using
\[ r = \frac{z_{\alpha/2}^2\,\sigma^2}{d^2} \]
r = 1.96² * 202500/160000 = 4.86
Using r = t²(α/2, r-1) s²/d², the sample size is estimated iteratively:

initial n   t(2.5%, n-1)   n
5           2.776          (2.776)² (450)² / 400² = 9.75
10          2.262          (2.262)² (450)² / 400² = 6.48
7           2.447          (2.447)² (450)² / 400² = 7.57
8           2.365          (2.365)² (450)² / 400² = 7.07

With an initial n of 8 the estimate (7.07) rounds up to 8, so the iteration has converged.
Answer: He should examine at least 8 locations (replications).
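The iterative estimate above can be sketched as follows. The t values are copied from the table (they are inputs, not computed), and `required_n` is an illustrative helper name.

```python
import math

s, d = 450, 400  # standard deviation and half-length of the interval, lb/ac

def required_n(crit, s, d):
    """Sample size from r = crit^2 * s^2 / d^2."""
    return crit ** 2 * s ** 2 / d ** 2

# First estimate with the normal quantile z(0.025) = 1.96.
z_estimate = required_n(1.96, s, d)

# Refinement: t(2.5%, n - 1) for each trial n, copied from the table above.
t_table = {5: 2.776, 10: 2.262, 7: 2.447, 8: 2.365}
estimates = {n: required_n(t, s, d) for n, t in t_table.items()}

# A trial n is self-consistent when the estimate, rounded up, equals the trial n.
consistent = [n for n, est in estimates.items() if math.ceil(est) == n]

print(round(z_estimate, 2), {n: round(e, 2) for n, e in estimates.items()}, consistent)
```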
Question 7 ANSWER
[10 points]
No standard deviation is given.
The CV is not greater than 10%. Estimate the number of replications required in order
to have the total length of a 95% confidence interval about the true mean yield
be less than the standard deviation.
Using the equation:
\[ r = \frac{z_{\alpha/2}^2\,CV^2}{(d/\mu)^2} \]
CV = (s/Ȳ) < 0.1 and 2d < s (then s > 2d)
Ȳ > s/0.10
d/μ = (s/2)/(s/0.1) = 0.1/2 = 0.05
Normal approximation: r = 1.96² * 0.10²/0.05² = 15.4
Or
CV = s/Ȳ, so s = CV*Ȳ = 0.10 * 7500 = 750, and s < 750
2d < 750, thus d < 375 and r = 1.96² * 750²/375² = 15.4
Using r = t²(α/2, r-1) s²/d², the sample size is estimated iteratively:

initial n   t(2.5%, df)   n
16          2.131         (2.131)² (750)²/375² = 18.2
19          2.101         (2.101)² (750)²/375² = 17.66
18          2.110         (2.110)² (750)²/375² = 17.8
18 replications are necessary for the 95% confidence interval for the mean to have a total length of 1 s (one standard deviation).
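The Question 7 calculation depends only on the ratio s/d = 2, so it can be sketched without any yield value. The t values are copied from the iteration table (inputs, not computed); `required_n` is an illustrative helper name.

```python
import math

cv = 0.10            # coefficient of variation, s / mean
d_over_mu = cv / 2   # half-length d = s/2, expressed relative to the mean

def required_n(crit, cv, d_over_mu):
    """Sample size from r = crit^2 * CV^2 / (d/mu)^2."""
    return crit ** 2 * cv ** 2 / d_over_mu ** 2

# Normal approximation with z(0.025) = 1.96.
r_normal = required_n(1.96, cv, d_over_mu)

# Refinement with t(2.5%, n - 1) for each trial n, copied from the table.
t_table = {16: 2.131, 19: 2.101, 18: 2.110}
estimates = {n: required_n(t, cv, d_over_mu) for n, t in t_table.items()}

# A trial n is self-consistent when the estimate, rounded up, equals the trial n.
consistent = [n for n, est in estimates.items() if math.ceil(est) == n]

print(round(r_normal, 1), {n: round(e, 2) for n, e in estimates.items()}, consistent)
```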