Focus Questions Chapters 1-5
Directions: Answer all questions asked and show work and formulas used.
1. Suppose you have the following sequence of numbers:
2, 7, 18, 11, 8, 9, 15, 21, 25, & 32
a. Calculate the mean of the following data set.
b. Calculate Q1 Q2 & Q3 and create a 5-number summary. Make sure to include IQR
and identify if there are any outliers.
c. Calculate the variance and standard deviation.
d. Create a stem and leaf display of the following data
(a) Recall that to calculate the mean we sum up the values and divide by N. In this case
we have 10 data points, so N=10.
μ = ∑xi / N = (2+7+18+11+8+9+15+21+25+32) / 10 = 14.8
(b) Now we want to calculate the three quartiles. To do this the first thing we need to do
is put the data in order from lowest to highest.
2, 7, 8, 9, 11, 15, 18, 21, 25, 32
To calculate the 25th percentile (or any other percentile) we use the index formula i = (p/100)n.
Q1: i = (25/100)10 = 2.5; since we have a decimal we round up to 3. So the 3rd data point is the 25th percentile, or 8.
Q2: i = (50/100)10 = 5; since we have an integer value we average the ith and (i+1)th values. In this case these are the 5th and 6th values, so Q2 = (11 + 15) / 2 = 13.
Q3: i = (75/100)10 = 7.5; since we have a decimal we round up to 8. So the 8th data point is the 75th percentile, or 21.
IQR = Q3 – Q1 = 21 – 8 = 13
5-number summary: minimum = 2, Q1 = 8, Q2 = 13, Q3 = 21, maximum = 32
Outliers are values that lie more than 1.5*IQR below Q1 or above Q3.
So the upper and lower bounds for outliers are as follows:
Upperbound = Q3 + 1.5*IQR = 21 + 1.5(13) = 40.5
Lowerbound = Q1 - 1.5*IQR = 8 - 1.5(13) = -11.5
So we check to see if any of the values are lower than -11.5 or higher than 40.5. Since no values fall outside these bounds, we conclude there are NO outliers.
Graph 5-Number Summary:
-note that Q1 (the bottom of the box) is 8, Q2 (the line that goes through the box) is 13, and Q3 (the top of the box) is 21.
[Figure: Boxplot of Data Set; vertical axis "Range of Values for Data", 0 to 35]
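The percentile index rule used in part (b) can be checked with a short script. This is only a sketch of that textbook rule (round a decimal index up; average the ith and (i+1)th values for an integer index); the function name `percentile` is an illustrative choice, and the function assumes 0 < p < 100.

```python
import math

data = [2, 7, 18, 11, 8, 9, 15, 21, 25, 32]

def percentile(values, p):
    """Percentile by the index rule i = (p/100) * n, for 0 < p < 100."""
    ordered = sorted(values)
    n = len(ordered)
    i = p * n / 100
    if i != int(i):
        # Decimal index: round up and take that (1-based) position.
        return ordered[math.ceil(i) - 1]
    # Integer index: average the i-th and (i+1)-th values (1-based).
    i = int(i)
    return (ordered[i - 1] + ordered[i]) / 2

q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
five_number = (min(data), q1, q2, q3, max(data))

print(five_number)          # (2, 8, 13.0, 21, 32)
print(iqr, lower, upper)    # 13 -11.5 40.5
print(outliers)             # []
```

Statistical packages often use other percentile conventions (for example, interpolation), so their quartiles may differ slightly from this rule.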
(c) Variance (dividing by n – 1 = 9, i.e. treating the data as a sample): s² = ∑(xi – x̄)² / (n – 1) = [ (2 – 14.8)² + (7 – 14.8)² + … + (25 – 14.8)² + (32 – 14.8)² ] / 9 = 767.6 / 9 = 85.29
Standard Deviation: s = √85.29 = 9.24
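As a check on the arithmetic in parts (a) and (c), the mean, variance, and standard deviation can be recomputed directly. This is a minimal sketch that divides by n – 1, as the worked answer does; the variable names are illustrative.

```python
import math

data = [2, 7, 18, 11, 8, 9, 15, 21, 25, 32]

n = len(data)
mean = sum(data) / n

# Squared deviation of every data point from the mean.
squared_devs = [(x - mean) ** 2 for x in data]

# Divide by n - 1 (sample variance), matching the divisor of 9 above.
sample_var = sum(squared_devs) / (n - 1)
sample_sd = math.sqrt(sample_var)

print(round(mean, 2), round(sample_var, 2), round(sample_sd, 2))
```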
(d) To make a stem and leaf we use the ordered data and display as follows
0| 2789
1| 158
2| 15
3| 2
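A stem and leaf display like the one in part (d) can also be built programmatically. The sketch below assumes non-negative whole numbers, with the tens digit as the stem and the ones digit as the leaf; `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

data = [2, 7, 18, 11, 8, 9, 15, 21, 25, 32]

def stem_and_leaf(values):
    """Return display lines such as '1| 158' (stem = tens, leaf = ones)."""
    stems = defaultdict(list)
    for x in sorted(values):
        stems[x // 10].append(x % 10)
    lines = []
    # Include every stem in range so that empty stems still get a row.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem}| {leaves}")
    return lines

for line in stem_and_leaf(data):
    print(line)
```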
2. Suppose that we are looking at a data set that we identify as a normal distribution. If
we are given that the mean is 16 and the standard deviation is 3, use this information to
calculate the following information.
a. P ( x < 17)
b. P ( 13 < x < 21 )
c. What are the two points that will give you the middle 80% of the distribution?
In each of the above situations make sure to draw the appropriate normal distribution to
accompany each question.
(a) In order to find the probability that x is less than 17 we need to convert the normal distribution to a standard normal distribution so we can use our standard normal table.
To do this we convert our value to a z-score.
Recall that z = (x - μ) / σ, so z = (17 – 16) / 3 = .33
So we want P(z < .33) = .6293
Graphically:
[Figure: standard normal curve with the area to the left of z = 0.33 shaded. We include all the area to the left of 0.33.]
(b) We once again need to convert our data to z-scores in order to use our standard normal table. Using z = (x - μ) / σ, the upper value gives z = (21 – 16) / 3 = 1.67 and the lower value gives z = (13 – 16) / 3 = -1.
This corresponds to P(-1 < z < 1.67) = P(z < 1.67) – P(z < -1) = .9525 - .1587 = 0.7938.
Graphically:
[Figure: standard normal curve with the area between z = -1.0 and z = 1.67 shaded. We don't want to include the areas in the tails.]
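The table lookups in parts (a) and (b) can be cross-checked with Python's standard library. This sketch uses `statistics.NormalDist` (available in Python 3.8 and later); because it does not round z to two decimals, its answers differ slightly from the table values.

```python
from statistics import NormalDist

# Normal distribution from the problem: mean 16, standard deviation 3.
dist = NormalDist(mu=16, sigma=3)

# (a) P(x < 17): area to the left of 17.
p_a = dist.cdf(17)

# (b) P(13 < x < 21): area left of 21 minus area left of 13.
p_b = dist.cdf(21) - dist.cdf(13)

# Table-style lookup: round z to two decimals first, as in the worked answer.
z = round((17 - 16) / 3, 2)
p_a_table = NormalDist().cdf(z)

print(round(p_a, 4), round(p_b, 4), round(p_a_table, 4))
```

The unrounded results agree with the table-based answers to about two decimal places; the gap comes from rounding z to .33 and 1.67 before the lookup.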
(c) Now we want to find the points that will give us the middle 80%. Graphically we want:
[Figure: normal curve divided into regions A and B on either side of the mean, with a tail labeled C. 80% of the area falls between +/- some z, which we can convert to x values, so the areas in A and B are 40% each.]
So first we find the z-value to the right of the mean that has 90% of the area to its left. We do this because our table gives the area to the left of a z-value, so the tail is not excluded. So we find the z-value that includes A, B and C (the tail of the graph). This value is z = 1.28 [note: 1.28 corresponds to a value of .8997, which is close enough to .90 for our purposes]. Since z = 1.28 gives us our desired value with 40% in B, we know by symmetry that our z-value to the left of the mean should be z = -1.28.
To find the x values that correspond to these z values we use the formula:
Upper x = μ + zσ = 16 +1.28 ( 3 ) = 19.84
Lower x = μ - zσ = 16 -1.28 ( 3 ) = 12.16
Graphically:
[Figure: normal curve with the x-axis marked at 12.16, 16, and 19.84.]
So we have found the values that give us the middle 80% of the distribution. We expect that 80% of all the data points will fall between these values.
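The reverse lookup in part (c) corresponds to an inverse CDF. As a sketch, `statistics.NormalDist.inv_cdf` (Python 3.8 and later) returns the point with a given area to its left, so the middle 80% lies between the 10th and 90th percentiles.

```python
from statistics import NormalDist

dist = NormalDist(mu=16, sigma=3)

# z-value with 90% of the standard normal area to its left (table: 1.28).
z_upper = NormalDist().inv_cdf(0.90)

# The middle 80% leaves 10% in each tail.
lower_x = dist.inv_cdf(0.10)
upper_x = dist.inv_cdf(0.90)

print(round(z_upper, 2), round(lower_x, 2), round(upper_x, 2))
```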
3. Suppose that you are given the following data pairs between GPA and ACT.
**You may want to use excel to check your calculations
ACT   GPA   ACT   GPA
23    3.1   25    3.1
27    3.6   26    3.9
17    3.9   17    3.7
31    3.3   29    3.5
33    2.9   35    4.0
a. Calculate the mean, variance, and standard deviation for both GPA and ACT
b. Given the following data calculate the correlation coefficient for the data.
c. What does your answer in part (b) imply about the relationship between ACT score and
GPA? What does it tell you about simply looking at ACT score in order to predict GPA?
(a)
GPA:
Mean: μ = ∑xi / N = (3.1 + 3.6 + … + 3.5 + 4.0) / 10 = 35 / 10 = 3.5
Variance: σ² = ∑(xi – μ)² / N = [ (3.1 – 3.5)² + (3.6 – 3.5)² + … + (3.5 – 3.5)² + (4.0 – 3.5)² ] / 10 = .134
Standard Deviation of GPA = σGPA = √.134 = .366
ACT:
Mean: μ = ∑xi / N = (23 + 27 + … + 29 + 35) / 10 = 263 / 10 = 26.3
Variance: σ² = ∑(xi – μ)² / N = [ (23 – 26.3)² + (27 – 26.3)² + … + (29 – 26.3)² + (35 – 26.3)² ] / 10 = 33.61
Standard Deviation of ACT = σACT = √33.61 = 5.80
Note: when we use the … it simply means that we would carry out the same operation on
all the other data points in between or beyond. I am just putting this to save time.
(b) The general formula for correlation is ρxy = σxy / (σx σy). Applied to this study, the formula becomes ρACT,GPA = σACT,GPA / (σGPA σACT). We have both standard deviations, but we need the covariance, which we calculate as follows:
Population Covariance = σxy = (1/N) ∑(xi – x̄)(yi – ȳ)
σACT,GPA = (1/N) ∑(GPAi – mean GPA)(ACTi – mean ACT) = (1/10) * [ (3.1 – 3.5)(23 – 26.3) + (3.6 – 3.5)(27 – 26.3) + … + (3.5 – 3.5)(29 – 26.3) + (4.0 – 3.5)(35 – 26.3) ] = -0.44
Now plugging all numbers into the correlation coefficient formula we get:
ρACT,GPA = σACT,GPA / (σGPA σACT) = -0.44 / ((.366)(5.80)) = -0.20727
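Since the question suggests checking the calculations by computer, here is a minimal sketch in Python that recomputes the population statistics for question 3. The helper names `mean` and `population_covariance` are illustrative, and the pairs are entered in the order used in the worked sums.

```python
import math

# Data pairs from the table in question 3.
act = [23, 27, 17, 31, 33, 25, 26, 17, 29, 35]
gpa = [3.1, 3.6, 3.9, 3.3, 2.9, 3.1, 3.9, 3.7, 3.5, 4.0]

def mean(values):
    return sum(values) / len(values)

def population_covariance(xs, ys):
    """(1/N) * sum of (x - mean_x) * (y - mean_y)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# The variance of a variable is its covariance with itself.
var_gpa = population_covariance(gpa, gpa)
var_act = population_covariance(act, act)
sd_gpa, sd_act = math.sqrt(var_gpa), math.sqrt(var_act)

cov = population_covariance(gpa, act)
rho = cov / (sd_gpa * sd_act)

print(round(var_gpa, 3), round(sd_gpa, 3))
print(round(var_act, 2), round(sd_act, 2))
print(round(cov, 2), round(rho, 4))
```

Without rounding the intermediate values, the correlation comes out at about -0.2073; the last digits of the hand calculation reflect the rounded standard deviations.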
(c) This implies a weak negative relationship between GPA and ACT score. So we figure, based on these results, that although ACT score may help to predict GPA, it is not the only component that determines how well a student performs inside the classroom. So, based on this idea, we could not simply say that ACT score would validly tell us how well a person would perform in school. On top of that, we get a relationship that is negative, which implies that the higher the ACT score, the lower the GPA. This is the exact opposite of the result someone would expect.
Alternate answer to question 3
***Now we assume that we are dealing with sample data and not population statistics. So in this case we have just a portion of the overall data and want to calculate an estimate of what we think is going on overall. For example, we might have an entire graduating class with both GPA and ACT score, but look at only 10 data points instead of the entire class.
GPA:
Mean: x̄ = ∑xi / n = (3.1 + 3.6 + … + 3.5 + 4.0) / 10 = 35 / 10 = 3.5
Variance: s² = ∑(xi – x̄)² / (n – 1) = [ (3.1 – 3.5)² + (3.6 – 3.5)² + … + (3.5 – 3.5)² + (4.0 – 3.5)² ] / 9 = .149
Standard Deviation of GPA = sGPA = √.149 = .386
ACT:
Mean: x̄ = ∑xi / n = (23 + 27 + … + 29 + 35) / 10 = 263 / 10 = 26.3
Variance: s² = ∑(xi – x̄)² / (n – 1) = [ (23 – 26.3)² + (27 – 26.3)² + … + (29 – 26.3)² + (35 – 26.3)² ] / 9 = 37.3
Standard Deviation of ACT = sACT = √37.3 = 6.11
Note: when we use the … it simply means that we would carry out the same operation on
all the other data points in between or beyond. I am just putting this to save time.
(b) The general formula for correlation is rxy = sxy / (sx sy). Applied to this study, the formula becomes rACT,GPA = sACT,GPA / (sGPA sACT). We have both standard deviations, but we need the covariance, which we calculate as follows:
Sample Covariance = sxy = (1/(n – 1)) ∑(xi – x̄)(yi – ȳ)
sACT,GPA = (1/(n – 1)) ∑(GPAi – mean GPA)(ACTi – mean ACT) = (1/9) * [ (3.1 – 3.5)(23 – 26.3) + (3.6 – 3.5)(27 – 26.3) + … + (3.5 – 3.5)(29 – 26.3) + (4.0 – 3.5)(35 – 26.3) ] = -0.489
Now plugging all numbers into the correlation coefficient formula we get:
rACT,GPA = sACT,GPA / (sGPA sACT) = -0.489 / ((.386)(6.11)) = -0.20734
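We can see that the difference in the answer is very small. We have to go out to the ten thousandths place to find a difference. In fact, the N and n – 1 divisors cancel in the correlation coefficient, so the population and sample formulas give exactly the same value; the small difference here comes only from rounding the intermediate results. The variances, standard deviations, and covariance themselves do differ between the two approaches, and that difference shrinks as the number of data points grows.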
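To compare the two versions of the correlation without rounding along the way, one option is to compute both with the same routine. In this sketch, `ddof` is an illustrative parameter name: the divisor is n - ddof, so 0 gives the population formulas and 1 gives the sample formulas.

```python
import math

act = [23, 27, 17, 31, 33, 25, 26, 17, 29, 35]
gpa = [3.1, 3.6, 3.9, 3.3, 2.9, 3.1, 3.9, 3.7, 3.5, 4.0]

def correlation(xs, ys, ddof):
    """Correlation with divisor (n - ddof): 0 = population, 1 = sample."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    divisor = n - ddof
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / divisor
    sd_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / divisor)
    sd_y = math.sqrt(sum((y - my) ** 2 for y in ys) / divisor)
    return cov / (sd_x * sd_y)

r_population = correlation(act, gpa, ddof=0)
r_sample = correlation(act, gpa, ddof=1)

print(round(r_population, 4), round(r_sample, 4))
print(math.isclose(r_population, r_sample))
```

The divisor appears once in the covariance and once (under two square roots) in the product of the standard deviations, so it cancels; the two results agree up to floating-point error.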
5. Suppose you were told that there was a regression line that was fit to data between age
and salary at OCCC given all the people that worked in the Business Department at the
college and the results were:
Income = 20000 + 455 * Age
with r² = .45
a. Use this information to graph the regression line. Make sure to display the slope and
intercept in your graph.
b. According to the equation what is the predicted salary of someone with an age of 34?
c. What might be some reasons that this regression equation is not a good predictor of
salary? (i.e. can you think of any shortcomings of using solely age and income?)
d. Given the square of the correlation, what type of relationship does age have with
income and how well does age do at predicting income?
(a) Given the equation we can graph the equation just as if we were graphing a typical
mathematical expression. Keep in mind however this is not a typical mathematical
expression. This expression comes from data and it is an estimation of how both age and
income are related to each other in a simple linear regression setting.
Table 1
Age   Income
0     20,000
10    24,550
20    29,100
30    33,650
40    38,200
50    42,750
Graph 1:
[Figure: plot of the line Income = 20000 + 455 * Age against Age (10 to 50); labeled income values include 10,000 and 33,650.]
(b) If we want to know the predicted salary for someone 34 we simply plug in the value
to our least squares regression line as follows:
Income = 20000 + 455 * Age = 20000 + 455 * 34 = $35,470
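The same substitution can be written as a small function, which also reproduces the values in Table 1. This is only a sketch; `predicted_income` is an illustrative name, and the intercept and slope are the ones given in the problem.

```python
def predicted_income(age, intercept=20000, slope=455):
    """Predicted income from the fitted line Income = 20000 + 455 * Age."""
    return intercept + slope * age

# Values for the regression line at ages 0, 10, ..., 50 (Table 1).
table = [(age, predicted_income(age)) for age in range(0, 51, 10)]

for age, income in table:
    print(age, income)

print(predicted_income(34))  # 35470
```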
(c) There are a few reasons the equation might not be such a good predictor of salary.
i. Age might not be the best variable with which to predict salary, so modeling the relationship between age and income might not be appropriate. Maybe education should be used.
ii. Also, it makes no sense for us to talk about ages between 0 and 18. So although we find the least squares regression line, it might not be appropriate to predict salary for certain age values.
iii. Lastly, we have not assessed how close the relationship between the two variables is. If the relationship is not strong, age would explain little of the variation in salary. See part d.
(d) Recall that we are given that r² = .45, so we know that the correlation coefficient is r = √.45 = (+) 0.67; it is positive because the slope of the regression line is positive. An r² of .45 means that age accounts for about 45% of the variation in income. So we know that there is a positive correlation between the variables and that it is fairly strong, so age can probably be used to reasonably predict income for people, but we must note that the age range should be confined to the ages of people that actually work at OCCC, so we should not go below 18.
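The step from r² to r in part (d) can be sketched as follows. Taking the sign of r from the slope of the fitted line is the assumption stated in the comments; the variable names are illustrative.

```python
import math

r_squared = 0.45
slope = 455  # slope of the fitted line Income = 20000 + 455 * Age

# In simple linear regression r has the same sign as the slope,
# so take the square root of r^2 and attach the slope's sign.
r = math.copysign(math.sqrt(r_squared), slope)

print(round(r, 2))  # 0.67
```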