AMS 5
REVIEW
VARIABLES
QUANTITATIVE (numerical scale)
   Discrete (e.g. size)
   Continuous (e.g. age)
QUALITATIVE (e.g. marital status)
Average and Standard Deviation
The median of a histogram is the value with half the area to the
left and half to the right. In a symmetric histogram the median
and the average coincide.
Design of Experiments
To eliminate bias, subjects are assigned to each group at random
and the experiment is run double blind.
This is called a Controlled Experiment and allows us to establish a
causal effect of the treatment on the response.
Sample Surveys
A population is a class of individuals that an investigator is
interested in. A full examination of a population requires a census.
If only one part of the population is examined, then we are
looking at a sample. There are usually some numerical
characteristics of the population that we are interested in. These
are called parameters. Parameters are unknown quantities which
are estimated using statistics, which are numbers that can be
computed from the sample.
When considering the quality of a survey, keep in mind two
possible sources of bias: selection bias and non-response bias.
Probability
How do we quantify chance?
Notation
Consider an event A, then the probability of A is denoted as
P(A)
Consider two events, A and B, then the conditional probability of
A given B is denoted as
P(A|B)
The multiplication rule can be written as
P(A and B) = P(A|B)P(B) = P(B|A)P(A)
A and B are independent if
P(A|B) = P(A) and P(B|A) = P(B)
When two events are independent the multiplication rule is
P(A and B) = P(A)P(B)
Addition Rule
The mathematical notation for this is
P(A or B) = P(A) + P(B) - P(A and B)
Expected Value and Standard Error
• 68% of the draws will be within one standard unit of the
expected value.
• 95% of the draws will be within two standard units of the
expected value.
• 99% of the draws will be within 2.5 standard units of the
expected value.
Problem 1: The figure below is a histogram for the scores on the
final of a certain class.
1. What percentage of the students scored between 20 and 40
points?
The boxes in the histogram correspond, from left to right, to
10%, 10%, 10%, 20%, 25%, 12.5% and 12.5% of the scores.
From 20 to 40 there are 30% of the scores.
2. What is the median score of the class?
The median is 40
Problem 2: Which of the following statements is true and which is
false?
1. If two events are independent then they are mutually exclusive.
F
2. If A and B are two events, then according to the multiplication
rule, the probability that both A and B happen equals the
probability of A times the probability of B.
F
3. The ages of 10 freshmen and two professors are recorded,
then the average age of the group is larger than the median
age.
T
4. One kilogram is approximately equal to 2.2 pounds, this implies
that the standard deviation of the amount of fish consumed in
a restaurant per day is larger if measured in kilos than if
measured in pounds.
Amount of fish in Pounds = 2.2 × (amount of fish in Kilos).
Then SD(Pounds) = 2.2 × SD(Kilos). So the SD of the amount
in Kilos is 1/2.2 times the SD of the amount consumed in
Pounds. The answer is FALSE. The important part of this
question is that the SD changes when the units are changed.
5. 100 tickets are drawn at random with replacement from the
box containing
and you win the sum of the tickets. The same game is repeated
using the box
You expect to win the same amount in both cases.
T
6. A high non-response rate is a serious problem for survey
organizations because the investigators have to spend more
time and money getting additional people to bring the sample
back to its planned size.
F
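Statement 4 can be checked numerically. This is a minimal sketch with made-up daily amounts; the `kilos` list is hypothetical and not data from the course. It uses `statistics.pstdev`, which divides by the number of measurements, as the SD does in these notes.

```python
import statistics

# Hypothetical daily fish consumption in kilograms (illustration only).
kilos = [12.0, 15.5, 9.0, 20.0, 14.5]

# Change of units: pounds = 2.2 x kilos.
pounds = [2.2 * k for k in kilos]

sd_kilos = statistics.pstdev(kilos)
sd_pounds = statistics.pstdev(pounds)

# Multiplying every value by 2.2 multiplies the SD by 2.2,
# so the SD is larger in pounds, not in kilos.
print(sd_kilos, sd_pounds, sd_pounds / sd_kilos)
```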
Problem 3: A large number of people get together. Each person
rolls a die 180 times, and counts the number of 1's. About what
percentage of these people should get counts in the range 20 to
40?
The expected number of 1's is 30 and the standard error is 5, so
the interval 20 to 40 corresponds to two standard errors and thus
contains approximately 95% of the counts. So we expect that
approximately 95% of the people will get counts in that range.
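The arithmetic behind Problem 3 can be sketched as follows. The box model (one ticket marked 1 and five marked 0) is the usual one for counting 1's on a fair die; the helper name `normal_area_between` is made up for this illustration, and the normal area is computed with `math.erf` instead of a table.

```python
import math

draws = 180
p_one = 1 / 6  # chance of a 1 on a fair die

expected = draws * p_one                  # expected number of 1's
sd_box = math.sqrt(p_one * (1 - p_one))   # SD of a 0-1 box
se = math.sqrt(draws) * sd_box            # SE for the count of 1's

def normal_area_between(lo, hi):
    """Area under the standard normal curve between lo and hi."""
    def area_below(z):
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return area_below(hi) - area_below(lo)

# Convert the range 20 to 40 to standard units and take the area.
z_lo = (20 - expected) / se
z_hi = (40 - expected) / se
share = normal_area_between(z_lo, z_hi)
print(expected, se, round(100 * share, 1))
```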
Problem 4: Two students A and B are both registered for a certain
course. Assume that student A attends class 80% of the time and
student B 60% of the time, and the absences of the two students
are independent.
1. What is the probability that both students will be in class on a
given day?
Since the events are independent the multiplication rule can be
used with the unconditional probabilities. This yields 48%.
2. What is the probability that at least one of the two students
will be in class on a given day?
We can either use the addition rule subtracting the probability
calculated in the previous question or think of the opposite
event and use the multiplication rule. The answer is 92%.
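Both routes to the answers of Problem 4 can be written out in a few lines. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
p_a = 0.80  # student A attends
p_b = 0.60  # student B attends

# Independent events: multiply the unconditional probabilities.
p_both = p_a * p_b

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B).
p_at_least_one = p_a + p_b - p_both

# Opposite event: 1 - P(both absent), again using independence.
p_at_least_one_alt = 1 - (1 - p_a) * (1 - p_b)

print(p_both, p_at_least_one, p_at_least_one_alt)
```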
Problem 5: Suppose the election of the president of the student
union is conducted using the following system: students vote in
the college they belong to and the candidate that wins the highest
number of colleges is elected. A candidate wins a college if he or
she obtains the majority of the votes in the college. Is it possible
that the winner will not have a majority of the total vote? If yes,
explain, if no, give an example.
This is related to Simpson's paradox. It is possible that the
winning candidate will not have the majority of the student vote.
One situation where this may happen is the following: the
candidate that wins more than half of the colleges does so by
winning with a small margin in colleges where the number of
voting students is relatively small. At the same time he or she
loses by a large margin in the colleges where the number of voting
students is relatively large.
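A concrete instance of this situation, with invented numbers: three colleges and two candidates are assumed, and the vote counts below are hypothetical.

```python
# Hypothetical election: votes for candidate X in each college.
colleges = [
    {"voters": 100, "votes_for_x": 55},    # small college, narrow win
    {"voters": 100, "votes_for_x": 55},    # small college, narrow win
    {"voters": 1000, "votes_for_x": 300},  # large college, heavy loss
]

# X wins a college with more than half of that college's votes.
colleges_won = sum(1 for c in colleges if c["votes_for_x"] > c["voters"] / 2)

total_votes = sum(c["voters"] for c in colleges)
votes_for_x = sum(c["votes_for_x"] for c in colleges)
share = votes_for_x / total_votes

# X wins 2 of 3 colleges but has well under half of the total vote.
print(colleges_won, round(100 * share, 1))
```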
Normal Density
The Gaussian or normal curve corresponds to the following
formula
$y = \frac{1}{\sqrt{2\pi}} \, e^{-x^2/2}$, where $e \approx 2.71828$
The area below the curve is
equal to one. We observe that
the curve is symmetric around
zero and that most of the area
is concentrated between -4 and
4. The probability of an interval
is the corresponding area under
the curve.
Doing calculations with the normal curve requires the use of a
table. Tables are available for the standard normal curve and they
require that observations be transformed to standard units.
With the help of the tables of the book we can find P((-z, z)).
From it, for positive z and x with z < x:
• P((0, z)) = P((-z, 0)) = 1/2 × P((-z, z))
• P((-z, x)) = P((-z, 0)) + P((0, x))
• P(< -z) + P(> z) = 1 - P((-z, z))
• P(> z) = 1/2 × [P(< -z) + P(> z)] = 1/2 × [1 - P((-z, z))]
• P(< z) = P(< 0) + P((0, z)) = 1/2 + P((0, z))
• P((z, x)) = 1/2 × [P((-x, x)) - P((-z, z))]
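These table manipulations can be checked numerically. The sketch below replaces the printed table with `math.erf`; the function names and the values z = 1, x = 2 are chosen only for illustration.

```python
import math

def area_below(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def area_between(a, b):
    """Area under the standard normal curve between a and b."""
    return area_below(b) - area_below(a)

z, x = 1.0, 2.0
middle_z = area_between(-z, z)  # what the table gives for z
middle_x = area_between(-x, x)  # what the table gives for x

# Right tail from the central area: P(> z) = 1/2 x [1 - P((-z, z))].
right_tail = 0.5 * (1 - middle_z)

# Area between two positive values: 1/2 x [P((-x, x)) - P((-z, z))].
between_z_and_x = 0.5 * (middle_x - middle_z)

print(round(middle_z, 4), round(right_tail, 4), round(between_z_and_x, 4))
```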
Percentages
A box contains tickets with 0s and 1s.
The SD of the box is given by
$\text{SD} = \sqrt{(\text{fraction of 1s}) \times (\text{fraction of 0s})}$
The SE for the sum of 1s is
$\sqrt{\text{number of draws}} \times \text{SD}$
The SE for the percentage of 1s is
$\text{SD} / \sqrt{\text{number of draws}}$
• Sample percentage ± 1 SE is a 68% confidence interval of the
percentage.
• Sample percentage ± 2 SE is a 95% confidence interval of the
percentage.
• Sample percentage ± 3 SE is a 99.7% confidence interval of
the percentage.
Estimating Averages
The normal approximation can be used to create confidence
intervals for the average. Remember that a confidence interval of,
say, 95%, means that if the experiment is repeated 100 times,
about 95 of the resulting intervals will contain the true value of
the average.
Standard Errors
Test of Significance
• set up the null hypothesis
• pick a test statistic to measure the difference between the
data and what is expected under the null hypothesis
• compute the test statistic and the corresponding observed
significance level.
In general we are calculating a test statistic given by
$z = \frac{\text{observed} - \text{expected}}{\text{SE}}$
which is referred to as the z-test.
The smaller the P-value, the stronger the evidence against
the null, but
The t-test
Step 1: Consider a different estimate of the SD
$\text{SD}^+ = \sqrt{\frac{\text{number of measurements}}{\text{number of measurements} - 1}} \times \text{SD}$
Notice that SD+ > SD.
Step 2: $t = \frac{\text{observed} - \text{expected}}{\text{SE}^+}$, where SE+ corresponds to SD+.
Step 3: To find the observed significance level we cannot use the
normal curve any more. We need to use a Student's t curve. This
curve depends on the degrees of freedom (DF). These are
calculated as
degrees of freedom = number of measurements - 1
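The three steps can be sketched on a small data set. The five measurements and the null value of 10.0 are hypothetical, and the SE for the average is taken to be SD+ divided by the square root of the number of measurements. In Python's standard library, `statistics.pstdev` matches the SD and `statistics.stdev` matches SD+.

```python
import math
import statistics

# Hypothetical measurements and null-hypothesis value (illustration only).
measurements = [10.1, 9.8, 10.4, 10.3, 9.9]
expected = 10.0

n = len(measurements)
average = sum(measurements) / n

# Step 1: SD and the adjusted estimate SD+.
sd = statistics.pstdev(measurements)
sd_plus = math.sqrt(n / (n - 1)) * sd

# Step 2: t statistic, using the SE based on SD+.
se_plus = sd_plus / math.sqrt(n)
t = (average - expected) / se_plus

# Step 3: degrees of freedom for the Student's t curve.
degrees_of_freedom = n - 1

print(sd, sd_plus, t, degrees_of_freedom)
```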
Test of Difference
H0 : the difference is 0
H1 : the average of one group is bigger than that of the other
Two-Tailed Test
H0 : the difference is 0
H1 : the averages of the two groups are different
When a two-tailed test is used the P-value is calculated by adding the
areas that correspond to both tails of the normal curve.
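The difference between the one-tailed and two-tailed P-values for a z-test can be sketched as below. The function names and the value z = 2 are illustrative, and `math.erfc` stands in for the normal table.

```python
import math

def one_tailed_p(z):
    """Area under the standard normal curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def two_tailed_p(z):
    """Area in both tails: to the left of -|z| plus to the right of |z|."""
    return math.erfc(abs(z) / math.sqrt(2))

z = 2.0  # hypothetical value of the test statistic
print(round(100 * one_tailed_p(z), 2), round(100 * two_tailed_p(z), 2))
```

By symmetry of the normal curve, the two-tailed P-value is twice the one-tailed value.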
Correlation
• The correlation is not affected when the two variables are
interchanged.
• The correlation is not changed if the same number is added to
all the values of one of the variables.
• The correlation is not changed if all the values of one of the
variables are multiplied by the same positive number. It will
change sign if the number is negative.
• The correlation coefficient is 1 if the variables have perfect
positive linear association and -1 if they have perfect negative
linear association.
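These properties can be verified on a toy data set. The sketch below computes r as the average of the products of the values in standard units; the data and the function name `correlation` are made up for illustration.

```python
import math

def correlation(xs, ys):
    """Average of the products of x and y in standard units."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return sum(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
               for x, y in zip(xs, ys)) / n

# Hypothetical data (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 6]

r = correlation(xs, ys)
r_swapped = correlation(ys, xs)                     # variables interchanged
r_shifted = correlation([x + 10 for x in xs], ys)   # same number added
r_scaled = correlation([3 * x for x in xs], ys)     # positive multiple
r_flipped = correlation([-3 * x for x in xs], ys)   # negative multiple
r_perfect = correlation(xs, [2 * x + 1 for x in xs])  # points on a line

print(r, r_swapped, r_shifted, r_scaled, r_flipped, r_perfect)
```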
Regression
Associated with an increase of one SD in x there is an increase of
r SDs in y, on average.
error = actual value of y - predicted value of y
RMS error = $\sqrt{1 - r^2}$ × SD of y
Problem 6: The speed of light is measured 25 times by a new
procedure. The 25 measurements are recorded and show no
trend or pattern. The average of the measurements is 299,789.2
kilometers per second and the SD is 12 kilometers per second.
Find an approximate 95% confidence interval for the speed of
light.
1. Calculate the SE of the average.
The SE is given by $12 / \sqrt{25} = 2.4$.
2. Find an approximate 95% confidence interval for the speed of
light.
Two SEs correspond to 4.8 km per second. Thus a 95%
confidence interval is given by (299,784.4, 299,794.0).
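The calculation for Problem 6 in code form, as a minimal sketch of the arithmetic (variable names are illustrative):

```python
import math

n = 25               # number of measurements
average = 299_789.2  # km per second
sd = 12.0            # km per second

# SE of the average: SD divided by the square root of n.
se = sd / math.sqrt(n)

# Approximate 95% confidence interval: average plus or minus 2 SE.
low = average - 2 * se
high = average + 2 * se

print(se, round(low, 1), round(high, 1))
```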
Problem 7: A simple random sample of size 400 was taken from
the population of all manufacturing establishments in a certain
state. The results are that 16 establishments had 250 employees
or more.
1. Estimate the percentage of manufacturing establishments with
250 employees or more.
4%
2. Attach a standard error to the estimate.
$\sqrt{0.04 \times 0.96} / \sqrt{400} \approx 0.01$, that is, about 1%.
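The estimate and its standard error in Problem 7 can be reproduced as follows. The 95% interval at the end is an extra step not asked for in the problem; it applies the "sample percentage ± 2 SE" rule from the Percentages section.

```python
import math

n = 400      # sample size
count = 16   # establishments with 250 employees or more

fraction = count / n                           # sample fraction
sd_box = math.sqrt(fraction * (1 - fraction))  # SD of the 0-1 box, estimated from the sample
se_fraction = sd_box / math.sqrt(n)            # SE as a fraction
se_percent = 100 * se_fraction                 # SE in percentage points

# Approximate 95% confidence interval for the population percentage.
low = 100 * fraction - 2 * se_percent
high = 100 * fraction + 2 * se_percent

print(100 * fraction, round(se_percent, 2), round(low, 1), round(high, 1))
```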
Problem 8: Find the area under a Student's t curve with 3 degrees
of freedom in the following cases:
1. To the right of 2.35.
5%
2. To the left of -2.35.
5%
3. Between -2.35 and 2.35.
90%
4. Are these values higher or lower than the ones that correspond
to the standard normal curve?
The areas in 1 and 2 are smaller for the normal curve; as a
consequence, the area in 3 is larger.
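The tail areas in Problem 8 can be checked without a table. For 3 degrees of freedom the area under Student's t curve has a closed form, used in the sketch below; the function names are made up for this illustration, and the formula holds for 3 degrees of freedom only.

```python
import math

def t3_area_below(t):
    """Area to the left of t under Student's t curve with 3 degrees of freedom."""
    u = t / math.sqrt(3)
    return 0.5 + (u / (1 + u * u) + math.atan(u)) / math.pi

def normal_area_below(z):
    """Area to the left of z under the standard normal curve."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

right_t = 1 - t3_area_below(2.35)
left_t = t3_area_below(-2.35)
middle_t = 1 - right_t - left_t

right_normal = 1 - normal_area_below(2.35)
middle_normal = 1 - 2 * right_normal

# The t curve has fatter tails, so its tail areas are larger
# and its central area is smaller than for the normal curve.
print(round(100 * right_t, 1), round(100 * middle_t, 1),
      round(100 * right_normal, 1), round(100 * middle_normal, 1))
```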
Problem 9: Looking at data and making sense of them is the first
step of a statistical analysis.
The scatter diagram below shows the ages of 1,000 husbands and
wives in a town in California. Explore the plot. Is there anything
wrong with the data?
The range of x does not correspond to the usual range of ages of
married men. In particular, there is a 5-year-old man married to a
20-year-old woman.
Problem 10: True or false:
1. To make a t test with 4 measurements use a Student's t curve
with 4 degrees of freedom.
F
2. For a given experiment the null hypothesis is that the average
is equal to 231 units. The alternative hypothesis is that the
average is above 231 units. You compute a z-test and the
corresponding P-value is 2.5%. The conclusion is that
the probability that the average is equal to 231 units is 2.5%.
F
3. The R.M.S. error for a regression line of y on x is less than or
equal to the SD of y.
T
4. The correlation between the daily minimum temperatures of
L.A. and San Francisco is higher when measured in Fahrenheit
than when it is measured in Celsius.
F
5. The correlation between two variables is -.92, this implies that
there is a strong negative linear association between the
variables.
T
Problem 11: Freshmen at public universities work 12.2 hours a
week for pay, on average, and the SD is 10.5; at private
universities the average is 9.2 hours and the SD is 9.9 hours.
Assume the data are based on two independent samples, each of
size 1,000. Is the difference due to chance?
1. Formulate the null and the alternative hypothesis.
H0 : There is no difference between public and private
universities.
H1 : Students at public universities work longer hours than
those at private universities.
2. Calculate the SE for the difference of the averages.
SE public = $10.5 / \sqrt{1000} \approx 0.33$
SE private = $9.9 / \sqrt{1000} \approx 0.31$
then the SE of the difference is
$\sqrt{0.33^2 + 0.31^2} \approx 0.45$
3. Calculate the appropriate test statistic.
$z = (12.2 - 9.2) / 0.45 \approx 6.7$
4. What is your conclusion?
The null hypothesis is rejected since the P-value is VERY close
to 0.
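The whole of Problem 11 can be sketched in code. Without the intermediate rounding used above, the test statistic comes out near 6.6 rather than 6.7; the conclusion is the same. The one-tailed P-value is computed with `math.erfc` in place of the normal table.

```python
import math

n = 1000  # size of each sample
avg_public, sd_public = 12.2, 10.5
avg_private, sd_private = 9.2, 9.9

# SE of each sample average.
se_public = sd_public / math.sqrt(n)
se_private = sd_private / math.sqrt(n)

# SE of the difference of two independent averages.
se_diff = math.sqrt(se_public ** 2 + se_private ** 2)

# z statistic: observed difference minus expected difference (0), over the SE.
z = (avg_public - avg_private - 0) / se_diff

# One-tailed P-value: area to the right of z under the normal curve.
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print(round(se_public, 2), round(se_private, 2), round(se_diff, 2), round(z, 1))
print(p_value)
```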
Problem 12: A statistical analysis is made of the midterm and final
scores in a large class. The results are
average midterm score ≈ 60, SD ≈ 15
average final score ≈ 65, SD ≈ 20, r ≈ 0.50
1. Using the normal approximation, about what percentage of the
students scored over 80 on the midterm?
80 points on the midterm corresponds to
$(80 - 60) / 15 = 1.33$
standard units. Using the normal we obtain that approximately
9% of the students scored over 80 on the midterm.
2. What is the R.M.S. error?
$\sqrt{1 - 0.5^2} \times 20 = 17.32$
3. What is the slope of the regression line?
$\frac{0.5 \times 20}{15} = 0.67$
4. What is the predicted final score for a student who scored 80 in
the midterm?
80 points on the midterm is 1.33 SD units above average. This
corresponds to 1.33 × 0.5 = 0.67 SD above average on the final.
That corresponds to 13.4 points over average on the final, so
students who scored 80 on the midterm scored, on average,
65 + 13.4 = 78.4 on the final.
5. Of the students who scored 80 on the midterm, about what
percentage scored over 80 on the final?
In standard units we have
$(80 - 78.4) / 17.32 = 0.09$
and there is an area of about 46% to the right of this value
under the normal curve.
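The five parts of Problem 12 can be chained together in a short script. This is a sketch of the arithmetic only; the helper `area_above` replaces the normal table, and small differences from the figures above (for example 78.3 instead of 78.4 for the predicted score) come from not rounding intermediate values.

```python
import math

avg_mid, sd_mid = 60, 15
avg_final, sd_final = 65, 20
r = 0.5

def area_above(z):
    """Area under the standard normal curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# 1. Percentage scoring over 80 on the midterm.
z_mid = (80 - avg_mid) / sd_mid
pct_over_80_mid = 100 * area_above(z_mid)

# 2. R.M.S. error of the regression line of final on midterm.
rms_error = math.sqrt(1 - r ** 2) * sd_final

# 3. Slope of the regression line.
slope = r * sd_final / sd_mid

# 4. Predicted final score for a midterm score of 80.
predicted = avg_final + slope * (80 - avg_mid)

# 5. Among those with 80 on the midterm, percentage over 80 on the final.
z_final = (80 - predicted) / rms_error
pct_over_80_final = 100 * area_above(z_final)

print(round(pct_over_80_mid, 1), round(rms_error, 2), round(slope, 2),
      round(predicted, 1), round(pct_over_80_final, 1))
```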