Download Solutions - u.arizona.edu

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Foundations of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

History of statistics wikipedia , lookup

Resampling (statistics) wikipedia , lookup

Time series wikipedia , lookup

Student's t-test wikipedia , lookup

Misuse of statistics wikipedia , lookup

Transcript
SOLUTIONS
MAT 167: Statistics
Test I: Chapters 1-4
Instructor: Anthony Tanbakuchi
Spring 2008
Name:
Computer / Seat Number:
No books, notes, or friends. Show your work. You may use the attached
equation sheet, R, and a calculator. No other materials. Write your work in the
provided space for each problem (including any R work if appropriate). You may
not use personal computers, only use the classroom computer at your desk. Using
any other program or having any other documents open on the computer will
constitute cheating.
You have until the end of class to finish the exam, manage your time wisely.
If something is unclear quietly come up and ask me.
If the question is legitimate I will inform the whole class.
Express all final answers to 3 significant digits. Probabilities should be given as a
decimal number unless a percent is requested. Circle final answers, ambiguous or
multiple answers will not be accepted. Show steps where appropriate.
The exam consists of 10 questions for a total of 46 points on 10 pages.
This Exam is being given under the guidelines of our institution’s
Code of Academic Ethics. You are expected to respect those guidelines.
Points Earned:
Exam Score:
out of 46 total points
MAT 167: Statistics, Test I: Chapters 1-4
SOLUTIONS p. 1 of 10
Solution:
Exam Results. Below are the summary statistics.
> summary ( s c o r e )
Min . 1 s t Qu . Median
Mean 3 rd Qu .
Max .
20.00
60.00
75.56
71.53
87.78 100.00
> par ( mfrow = c ( 1 , 2 ) )
> b o x p l o t ( s c o r e , main = ”Boxplot o f exam s c o r e s ”)
> h i s t ( s c o r e , main = ”Histogram o f exam s c o r e s ”)
Histogram of exam scores
3
1
0
2
40
20
60
Frequency
80
4
5
100
Boxplot of exam scores
20
40
60
80
100
score
1. Provide short succinct written answers to the following conceptual questions.
(a) (1 point) Would mass in grams be classified as a nominal, ordinal, interval, or ratio level
of measurement?
Solution: Mass is a ratio level of measurement since 0 mass means there is no mass
at all.
(b) (1 point) Which of the following measures of variation is most effected by outliers:
IQR, standard deviation, range
Solution: The range is most effected by outliers
(c) (1 point) What percent of data is greater than Q3 ?
Solution: 25%
(d) (1 point) If the mean, median, and mode for a data set are all the same, what can you
conclude about the data’s distribution?
Instructor: Anthony Tanbakuchi
Points earned:
/ 4 points
MAT 167: Statistics, Test I: Chapters 1-4
SOLUTIONS p. 2 of 10
Solution: If all three measures of center are the same, the distribution is symmetrical.
(Not necessarily a normal distribution, all we know is that it is symmetrical.)
(e) (1 point) If the median is greater than the mode for a data set, what can you conclude
about the data’s distribution?
Solution: The data set is positively skewed.
(f) (1 point) What does the standard deviation represent conceptually in words? (Be concise
but don’t simply state the equation in words verbatim.)
Solution: The standard deviation represents the average variation of the data from
the mean.
(g) (1 point) Give an example of sampling error?
Solution: Use a simple random sample of 5 students from a class to estimate the
mean weight of students in the class. Each time you repeat the process, you don’t
necessarily get the same sample mean because you don’t always have the same people
in the sample.
(h) (2 points) A box plot is a useful tool that can quickly communicate many traits about a
set of data. List 4 useful pieces of information that an observer can easily assess using a
box plot.
Solution: A box plot can be used to get an approximation of:
1. central tendency
2. variation in the data
3. IQR, Q1 , Q2 , median
4. shape of the data
5. assess if outliers exist
6. min
7. max
(i) (1 point) You scored in the 78th percentile on the GRE. If 8,000 people took the GRE,
how many people had a score at least as high as your score?
Solution: If you were in the 78th , then 78% of the people scored lower than you and
22% did at least as well as you. Therefore, 8000 · 0.22 = 1760 people who had a score
at least as high as yours.
2. A survey conducted in our class asked 18 students how many hours they work per week. Use
the R output below to answer the following questions.
There are 18 data points stored in the variable x, below is the sorted data:
Instructor: Anthony Tanbakuchi
Points earned:
/ 6 points
MAT 167: Statistics, Test I: Chapters 1-4
> sort (x)
[1] 0 0
0
0
8
SOLUTIONS p. 3 of 10
8 10 12 13 20 21 30 30 30 30 30 35 50
The basic descriptive statistical analysis is as follows:
> summary ( x )
Min . 1 s t Qu .
0.00
8.00
> var ( x )
[ 1 ] 215.6765
> sd ( x )
[ 1 ] 14.68593
Median
16.50
Mean 3 rd Qu .
18.17
30.00
Max .
50.00
> h i s t ( x , x l a b = ”h o u r s ” , main = ”Student Enr o llm en t Data ”)
5
4
3
2
0
1
Frequency
6
7
Student Enrollment Data
0
10
20
30
40
50
hours
(a) (1 point) Use the range rule of thumb to estimate the standard deviation. Is it close to
the actual standard deviation?
Solution:
> s . e s t = (max( x ) − min ( x ) ) / 4
> s i g n i f ( s . est , 3)
[ 1 ] 12.5
The range rule is a rough estimate of σ, it is relatively close to the actual standard
deviation shown in the output.
(b) (1 point) What is P50 equal to?
Solution: You can find the value in the summary output for the 1st quarter, it is
equal to the median 16.5.
(c) (1 point) What is the IQR (inter quartile range) equal to?
Instructor: Anthony Tanbakuchi
Points earned:
/ 3 points
MAT 167: Statistics, Test I: Chapters 1-4
SOLUTIONS p. 4 of 10
Solution: IQR = Q3 − Q1 , using the summary output, the IQR is 22.
(d) (1 point) What percent of the data falls within the IQR?
Solution: The IQR contains the data between Q1 and Q3 or equivalently P25 and
P75 , thus 50% of the data falls inside the IQR.
(e) (1 point) What is the mode for the data?
Solution: The mode is 30 hours per week (it occurs 5 times).
(f) (1 point) What is the percentile for a student who works 8 hours per week?
Solution: The percentile is the percent of data points less than the given data point.
A student working 8 hours per week could be position 5 or 6, use i = 5.5, thus:
Pk = xi , (sorted x)
i − 0.5
· 100%
k=
n
5.5 − 0.5
=
· 100%
18
= 27.8%
(g) (1 point) What is the z-score for the student who is working 50 hours per week?
Solution:
z=
x − x̄
s
> x . bar
[ 1 ] 18.16667
> s
[ 1 ] 14.68593
> z = ( 5 0 − x . bar ) / s
> s i g n i f ( z , 3)
[ 1 ] 2.17
(h) (1 point) Is 35 hours an unusual (outlier) value based on its z score? (Why)
Solution:
> x . bar
[ 1 ] 18.16667
> s
Instructor: Anthony Tanbakuchi
Points earned:
/ 5 points
MAT 167: Statistics, Test I: Chapters 1-4
SOLUTIONS p. 5 of 10
[ 1 ] 14.68593
> z = ( 3 5 − x . bar ) / s
> s i g n i f ( z , 3)
[ 1 ] 1.15
No, since |z| <
6 2.
(i) (2 points) If someone asked what the typical number of hours worked per week for this
class was, why would just reporting the mean be misleading? What would you report
in addition to the mean?
Solution: Since the mean can be biased by outliers, reporting the mean might mislead
someone to think that most students work quite a few hours per week. You also need
to give people an idea how much the number of hours per week varies. You should
also report the standard deviation or you could use the Empirical Rule to report an
interval that contains approximately 95% of the class work hours.
(j) (1 point) Is the data positively skewed, negatively skewed, or symmetrical?
Solution: Since the histogram has a longer right tail, it is positively skewed.
(k) (1 point) Construct an interval using the Empirical Rule which you would expect 95% of
the data to fall within.
Solution: 95% of the data falls within µ ± 2σ which for this data set is: 18.2 ± 29.4 =
(−11.2, 47.5).
(l) (1 point) Would the Empirical Rule be accurate when used for this data set? Why?
Solution: No. The Empirical Rule describes bell shaped symmetrical data (normally
distributed). Since this data is quite skewed, the Empirical Rule would not provide
an accurate estimate of an interval that contains 95% of the data.
3. (1 point) When plotting frequencies tabulated from categorical data, why would a bar graph
be preferable to a pie chart?
Solution: Bar graphs present the frequencies in terms of bar heights, pie charts present
the frequencies in wedges of area. Humans can easily compare lengths (the bar height),
but they cannot compare areas as easily. A bar graph makes interpreting the data straight
forward.
4. Use the below box plot to answer the following questions.
Instructor: Anthony Tanbakuchi
Points earned:
/ 6 points
MAT 167: Statistics, Test I: Chapters 1-4
SOLUTIONS p. 6 of 10
10
5
credits
15
Student credits verses gender
●
FEMALE
MALE
(a) (1 point) Which gender has a higher median number of credits?
Solution: Females
(b) (1 point) What is the approximate median number of credits for the females?
Solution: 14
(c) (1 point) What is the approximate IQR of credits for the males?
Solution: Q1 ≈ 10, Q3 ≈ 10, ∴ IQR = 3
(d) (1 point) Which gender has a larger range of credits?
Solution: Females (min≈ 4, max≈ 16, ∴ range≈ 16). Many students forgot to take
into account the outlier and claimed that the males had a larger range. Don’t ignore
the outliers, they are part of the range.
(e) (1 point) How many outliers are there in this data set as indicated by the box plots?
Solution: The circles indicate outliers, there is 1.
5. Using the below table for our class to answer the following questions.
CAR
PUBLIC
TRUCK
BLACK
0
2
1
BLOND
2
0
0
BROWN
11
0
2
(a) (1 point) Find the probability of selecting a person with brown hair.
Instructor: Anthony Tanbakuchi
Points earned:
/ 6 points
MAT 167: Statistics, Test I: Chapters 1-4
Solution:
P (brown) =
13
18
SOLUTIONS p. 7 of 10
= 0.722
(b) (1 point) Would it be unusual to randomly select a person with brown hair?
Solution: No since P 0.05.
(c) (1 point) Find the probability of randomly selecting three car drivers without replacement.
Solution:
P (car & car & car) =
13
18
·
12
17
·
11
16
= 0.35
(d) (1 point) If you randomly select 4 people with replacement, what is the probability that
at least one uses public transportation?
Solution:
P (at least one public) = 1 − P (none public) = 1 − P (not public)4 = 1 −
16 4
18
= 0.376
(e) (1 point) Find the probability of selecting a student who drives a truck or a student with
brown hair.
Solution:
P (brown or truck) =
14
18
= 0.778
(f) (1 point) Find the probability of selecting a person with blond hair given that they are a
truck driver.
Solution:
P (blond|truck) =
0
3
=0
Instructor: Anthony Tanbakuchi
Points earned:
/ 5 points
MAT 167: Statistics, Test I: Chapters 1-4
SOLUTIONS p. 8 of 10
6. A researcher needs to estimate the mean height of corn in a field containing 10,000 corn plants.
The corn plants are planted 100 to a row, and there are 100 rows. The researcher wants to take
a sample of 100 corn plant heights.
(a) (1 point) Describe a method for sampling this population that would be classified as a
convenience sample.
Solution: The researcher could sample 100 plants in the first row.
(b) (1 point) Describe a method for sampling this population that would be classified as a
simple random sample.
Solution: The researcher could assign each corn plant a different number and write
each number on a card. Then 100 cards could be picked after shuffling the cards
sufficiently to ensure they are in a random order. Preferably, a computer could be
used to select 100 of the numbers randomly using a sufficiently randomized number
generator.
7. The weather report for this work week (Monday through Friday) states that the probability of
rain is 5% for each day.
(a) (1 point) What is the probability that it will rain at least once this week?
Solution: The events are independent.
P (at least a day of rain) = 1−P (none) = 1−P (no rain on a day)5 = 1−0.955 = 0.226
(b) (1 point) What is the probability that it won’t rain this week?
Solution: The events are independent.
P (no rain) = P (no rain on 1 day)5 = 0.955 = 0.774
This is just the compliment of the previous question.
(c) (1 point) If you decided not to use an umbrella this week, would it be unusual for you to
get wet?
Solution: No, since P (at least 1 day of rain) 0.05
Instructor: Anthony Tanbakuchi
Points earned:
/ 5 points
MAT 167: Statistics, Test I: Chapters 1-4
SOLUTIONS p. 9 of 10
8. (2 points) Car tires must not deform or explode when inflated up to their maximum pressure
rating. Before distributing the tires, they must be tested. To test the safety of tires, an
inspector randomly samples 50 tires (without replacement) from a batch of 5,000 that have
been manufactured. The inspector inflates each of the fifty tires until they explode or deform
to make sure they meet the minimum safety requirements. If none of the sampled tires fails
the test, the tires will be distributed to dealers. If the batch contains 15 defective tires that
will explode if selected, what is the probability that the batch will be rejected?
Solution: We are randomly selecting n = 50 tires from n = 5000. Since we are sampling
without replacement these are dependent trials, but n/N ≤ 0.05 so we can simplify the
problem by approximating it as independent.
P (batch rejected) = P (at least one tire defective)
= 1 − P (None defective out of 50)
= 1 − P (not defective)50
= 1 − (1 − 15/5000)
approx. as indep
50
= 0.139
9. (2 points) The quadratic mean is defined as
rP
quadratic mean =
x2i
n
find the quadratic mean of x. If x = {2, 8, 7, 4, 9}
Solution:
> x = c (2 , 8 , 7 , 4 , 9)
> n = length (x)
> n
[1] 5
> quad . mean = s q r t ( sum ( x ˆ 2 ) / n )
> s i g n i f ( quad . mean , 3 )
[ 1 ] 6.54
10. (2 points) Given y = {a, −2a, 4a}, completely simplify the following expression. Assume a is
an unknown constant.
P
( (yi − 2a))2
9a
Instructor: Anthony Tanbakuchi
Points earned:
/ 6 points
MAT 167: Statistics, Test I: Chapters 1-4
SOLUTIONS p. 10 of 10
Solution:
P
( (yi − 2a))2
(a − 2a + −2a − 2a + 4a − 2a)2
=
9a
9a
(−3a)2
=
9a
=a
Instructor: Anthony Tanbakuchi
Points earned:
/ ?? points
Statistics Quick Reference
Card & R Commands
by Anthony Tanbakuchi. Version 1.8.2
http://www.tanbakuchi.com
ANTHONY @ TANBAKUCHI · COM
Get R at: http://www.r-project.org
R commands: bold typewriter text
1
Misc R
To make a vector / store data: x=c(x1, x2, ...)
Help: general RSiteSearch("Search Phrase")
Help: function ?functionName
Get column of data from table:
tableName$columnName
List all variables: ls()
Delete all variables: rm(list=ls())
√
x = sqrt(x)
n
∧
x =x n
(1)
(2)
n = length(x)
(3)
T = table(x)
(4)
2 Descriptive Statistics
2.1 N UMERICAL
2.3 V ISUAL
All plots have optional arguments:
• main="" sets title
• xlab="", ylab="" sets x/y-axis label
• type="p" for point plot
• type="l" for line plot
• type="b" for both points and lines
Ex: plot(x, y, type="b", main="My Plot")
Plot Types:
hist(x) histogram
stem(x) stem & leaf
boxplot(x) box plot
plot(T) bar plot, T=table(x)
plot(x,y) scatter plot, x, y are ordered vectors
plot(t,y) time series plot, t, y are ordered vectors
curve(expr, xmin,xmax) plot expr involving x
3
Probability
(5)
min = min(x)
(6)
max = max(x)
(7)
i=1
six number summary : summary(x)
∑ xi
= mean(x)
N
∑ xi
x̄ =
= mean(x)
n
x̃ = P50 = median(x)
s
2
∑ (xi − µ)
σ=
N
s
2
(x
−
x̄)
∑ i
s=
= sd(x)
n−1
σ
s
CV = =
µ
x̄
µ=
2.2
z=
Percentiles:
x−µ
x − x̄
=
σ
s
Pk = xi ,
k=
P(A or B) = P(A) + P(B) − P(A and B)
P(A or B) = P(A) + P(B) if A, B mut. excl.
(8)
n!
Perm. no elem. alike (24)
n Pk =
(n − k)!
n!
=
Perm. n1 alike, . . . (25)
n1 !n2 ! · · · nk !
n!
= choose(n,k)
(26)
nCk =
(n − k)!k!
4
Discrete Random Variables
(13)
P(xi ) : probability distribution
E = µ = ∑ xi · P(xi )
q
σ = ∑ (xi − µ)2 · P(xi )
(14)
4.1
µ = n· p
√
n· p·q
To find xi given Pk , i is:
1. L = (k/100%)n
2. if L is an integer: i = L + 0.5;
otherwise i=L and round up.
x (n−x)
P(x) = nCx p q
4.2
5.2
(29)
(39)
p = P(x < x0 ) = F(x0 )
(40)
x0 = F −1 (p)
(41)
p = P(x > a) = 1 − F(a)
(42)
p = P(a < x < b) = F(b) − F(a)
(43)
(44)
N ORMAL DISTRIBUTION
0
(47)
(48)
p = P(x < x ) = F(x )
= pnorm(x’, mean=µ, sd=σ)
(49)
x0 = F −1 (p)
(50)
p = P(t < t 0 ) = F(t 0 ) = pt(t’, df)
(51)
t 0 = F −1 (p) = qt(p, df)
(52)
χ2 - DISTRIBUTION
0
0
= pchisq(X 2 ’, df)
(32)
0
χ2 = F −1 (p) = qchisq(p, df)
5.5
(53)
(54)
F - DISTRIBUTION
p = P(F < F 0 ) = F(F 0 )
= pf(F’, df1, df2)
(33)
proportion: p̂ ± E,
F 0 = F −1 (p) = qf(p, df1, df2)
E = zα/2 · σ p̂
mean (σ known): x̄ ± E,
(59)
E = zα/2 · σx̄
mean (σ unknown, use s): x̄ ± E,
(60)
E = tα/2 · σx̄ , (61)
d f = n−1
variance:
(n − 1)s2
(n − 1)s2
< σ2 <
,
χ2R
χ2L
d f = n−1
(62)
r
pˆ1 qˆ1 pˆ2 qˆ2
+
n1
n2
s
s21
s2
2 means (indep): ∆x̄ ± tα/2 ·
+ 2,
n1 n2
(63)
(64)
d f ≈ min (n1 − 1, n2 − 1)
sd
matched pairs: d¯ ± tα/2 · √ ,
n
7.2
di = xi − yi ,
(65)
(55)
(56)
CI C RITICAL VALUES ( TWO SIDED )
zα/2 = Fz−1 (1 − α/2) = qnorm(1-alpha/2) (66)
tα/2 = Ft−1 (1 − α/2) = qt(1-alpha/2, df) (67)
χ2L = Fχ−1
2 (α/2) = qchisq(alpha/2, df) (68)
χ2R = Fχ−1
2 (1 − α/2) = qchisq(1-alpha/2, df)
(69)
7.3
5.3 t- DISTRIBUTION
5.4
(57)
(58)
d f = n−1
(46)
0
= qnorm(p, mean=µ, sd=σ)
σ p̂ =
7 Estimation
7.1 C ONFIDENCE I NTERVALS
2
z0 = F −1 (p) = qnorm(p)
σ
σx̄ = √
n
r
pq
n
µx̄ = µ
µ p̂ = p
2 proportions: ∆ p̂ ± zα/2 ·
U NIFORM DISTRIBUTION
p = P(χ2 < χ2 ) = F(χ2 )
(31)
µx · e−µ
= dpois(x, µ)
x!
(38)
−∞
(x−µ)
1
−1
f (x) = √
· e 2 σ2
2πσ2
p = P(z < z0 ) = F(z0 ) = pnorm(z’)
(27)
(28)
P OISSON DISTRIBUTION
P(x) =
(36)
(37)
u0 = F −1 (p) = qunif(p, min=0, max=1) (45)
(30)
= dbinom(x, n, p)
(x − µ)2 · f (x) dx
f (x0 ) dx0
Z x
Sampling distributions
(35)
= punif(u’, min=0, max=1)
B INOMIAL DISTRIBUTION
σ=
(16)
x · f (x) dx
F −1 (x) : inv. cumulative prob. density
F(x) =
6
(34)
F(x) : cumulative prob. density (CDF)
n! = n(n − 1) · · · 1 = factorial(n) (23)
(12)
−∞
∞
−∞
(20)
(22)
(sorted x)
i − 0.5
· 100%
n
(19)
P(A and B) = P(A) · P(B) if A, B independent
(10)
(15)
(17)
(18)
(21)
(11)
Z ∞
p = P(u < u0 ) = F(u0 )
P(A and B) = P(A) · P(B|A)
(9)
R ELATIVE STANDING
E =µ=
rZ
σ=
5.1
Number of successes x with n possible outcomes.
(Don’t double count!)
xA
P(A) =
n
P(Ā) = 1 − P(A)
total = ∑ xi = sum(x)
f (x) : probability density
2.4 A SSESSING N ORMALITY
Q-Q plot: qqnorm(x); qqline(x)
Let x=c(x1, x2, x3, ...)
n
5 Continuous random variables
CDF F(x) gives area to the left of x, F −1 (p) expects p
is area to the left.
R EQUIRED SAMPLE SIZE
z 2
α/2
,
E
( p̂ = q̂ = 0.5 if unknown)
zα/2 · σ̂ 2
mean: n =
E
proportion: n = p̂q̂
(70)
(71)
8
Hypothesis Tests
Test statistic and R function (when available) are listed for each.
Optional arguments for hypothesis tests:
alternative="two.sided" can be:
"two.sided", "less", "greater"
conf.level=0.95 constructs a 95% confidence interval. Standard CI
only when alternative="two.sided".
Optional arguments for power calculations & Type II error:
alternative="two.sided" can be:
"two.sided" or "one.sided"
sig.level=0.05 sets the significance level α.
8.1 1- SAMPLE PROPORTION
H0 : p = p0
prop.test(x, n, p=p0 , alternative="two.sided")
p̂ − p0
z= p
p0 q0 /n
8.2
(72)
1- SAMPLE MEAN (σ KNOWN )
8.6 2- SAMPLE MATCHED PAIRS TEST
H0 : µd = 0
t.test(x, y, paired=TRUE, alternative="two.sided")
where: x and y are ordered vectors of sample 1 and sample 2 data.
d¯ − µd0
√ , di = xi − yi , d f = n − 1
(78)
t=
sd / n
Required Sample size:
power.t.test(delta=h, sd =σ, sig.level=α, power=1 −
β, type ="paired", alternative="two.sided")
8.7
H0 : p1 = p2 = · · · = pn (homogeneity)
H0 : X and Y are independent (independence)
chisq.test(D)
Enter table: D=data.frame(c1, c2, ...), where c1, c2, ... are
column data vectors.
Or generate table: D=table(x1, x2), where x1, x2 are ordered vectors
of raw categorical data.
x̄ − µ0
√
σ/ n
8.3 1- SAMPLE MEAN (σ UNKNOWN )
H0 : µ = µ0
t.test(x, mu=µ0 , alternative="two.sided")
Where x is a vector of sample data.
x̄ − µ0
(74)
t = √ , d f = n−1
s/ n
Required Sample size:
power.t.test(delta=h, sd =σ, sig.level=α, power=1 −
β, type ="one.sample", alternative="two.sided")
8.4
Ei =
(73)
1
p̄ =
x1 + x2
,
n1 + n2
q̄ = 1 − p̄
Required Sample size:
power.prop.test(p1=p1 , p2=p2 , power=1 − β,
sig.level=α, alternative="two.sided")
8.5 2- SAMPLE MEAN TEST
H0 : µ1 = µ2 or equivalently H0 : ∆µ = 0
t.test(x1, x2, alternative="two.sided")
where: x1 and x2 are vectors of sample 1 and sample 2 data.
∆x̄ − ∆µ0
t=r
d f ≈ min (n1 − 1, n2 − 1), ∆x̄ = x̄1 − x̄2
s21
n1
s2
∑ (xi − x̄)(yi − ȳ)
,
(n − 1)sx sy
linear 1 indep var
. . . 0 intercept
linear 2 indep vars
. . . inteaction
polynomial
9.3
(77)
+ n2
2
r−0
t=q
,
1−r2
n−2
MS(treatment)
, d f1 = k − 1, d f2 = N − k
(85)
F=
MS(error)
To find required sample size and power see power.anova.test(...)
11 Loading and using external data and tables
11.1 L OADING EXCEL DATA
11.3
d f = n−2
EQUATION
y = b0 + b1 x1
y = 0 + b1 x1
y = b0 + b1 x1 + b2 x2
y = b0 + b1 x1 + b2 x2 + b12 x1 x2
y = b0 + b1 x1 + b2 x22
(81)
R MODEL
y∼x1
y∼0+x1
y∼x1+x2
y∼x1+x2+x1*x2
y∼x1+I(x2∧ 2)
R EGRESSION
y = b0 + b1 x1
b1 =
1. results=aov(depVarColName∼indepVarColName,
data=tableName) Run ANOVA with data in TableName,
factor data in indepVarColName column, and response data in
depVarColName column.
2. summary(results) Summarize results
3. boxplot(depVarColName∼indepVarColName,
data=tableName) Boxplot of levels for factor
1. Export your table as a CSV file (comma seperated file) from Excel.
2. Import your table into MyTable in R using:
MyTable=read.csv(file.choose())
Simple linear regression steps:
1. Make sure there is a significant linear correlation.
2. results=lm(y∼x) Linear regression of y on x vectors
3. results View the results
4. plot(x, y); abline(results) Plot regression line on data
5. plot(x, results$residuals) Plot residuals
Required Sample size:
power.t.test(delta=h, sd =σ, sig.level=α, power=1 −
β, type ="two.sample", alternative="two.sided")
10 ANOVA
10.1 O NE WAY ANOVA
11.2 L OADING AN .RDATA FILE
You can either double click on the .RData file or use the menu:
• Windows: File→Load Workspace. . .
• Mac: Workspace→Load Workspace File. . .
M ODELS IN R
MODEL TYPE
(76)
(79)
(80)
H0 : ρ = 0
cor.test(x, y)
where: x and y are ordered vectors.
9.2
2
i
9 Linear Regression
9.1 L INEAR CORRELATION
r=
(75)
i
For 2 × 2 contingency tables, you can use the Fisher Exact Test:
fisher.test(D, alternative="greater")
(must specify alternative as greater)
2- SAMPLE PROPORTION TEST
H0 : p1 = p2 or equivalently H0 : ∆p = 0
prop.test(x, n, alternative="two.sided")
where: x=c(x1 , x2 ) and n=c(n1 , n2 )
∆ p̂ − ∆p0
z= q
, ∆ p̂ = p̂1 − p̂2
p̄q̄
p̄q̄
n + n
)2
(O − E
, d f = (num rows - 1)(num cols - 1)
Ei
(row total)(column total)
= npi
(grand total)
χ2 = ∑
H0 : µ = µ0
z=
T EST OF HOMOGENEITY, TEST OF INDEPENDENCE
9.4 P REDICTION I NTERVALS
To predict y when x = 5 and show the 95% prediction interval with regression
model in results:
predict(results, newdata=data.frame(x=5),
int="pred")
∑ (xi − x̄)(yi − ȳ)
∑(xi − x̄)2
b0 = ȳ − b1 x̄
(82)
(83)
(84)
U SING TABLES OF DATA
1. To see all the available variables type: ls()
2. To see what’s inside a variable, type its name.
3. If the variable tableName is a table, you can also type
names(tableName) to see the column names or type
head(tableName) to see the first few rows of data.
4. To access a column of data type tableName$columnName
An example demonstrating how to get the women’s height data and find the
mean:
> ls() # See what variables are defined
[1] "women" "x"
> head(women) #Look at the first few entries
height weight
1
58
115
2
59
117
3
60
120
> names(women) # Just get the column names
[1] "height" "weight"
> women$height # Display the height data
[1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
> mean(women$height) # Find the mean of the heights
[1] 65