```
Statistics 515
Statistical Methods I
February 21, 2002
E. A. Pena's Class
NAME _____________________________________
SCORE ______________
Part A [25 points] (Numerical Summary Measures): Iron status in athletes is important
because of the central role of this mineral in the synthesis of hemoglobin and enzymes
fundamental to energy production. The following seven observations are the hemoglobin level
(g/dl) for female alpine skiers.
Table 1: Unarranged data set for the hemoglobin level of seven female alpine skiers.
14.6   14.3   15.1   12.7   11.8   13.4   13.8
For this sample data set,
1. Compute the sample mean.
Answer: Sample Mean = 95.7/7 = 13.67
2. Compute the sample median.
Answer: First arrange the data: 11.8, 12.7, 13.4, 13.8, 14.3, 14.6, 15.1
Median= 13.8
3. Compute the first quartile.
Answer: First Quartile = either the average of 12.7 and 13.4, which is 13.05, or simply 12.7,
depending on the convention used.
4. Compute the sample variance.
Answer: Sample Variance = [1316.19 - (95.7)^2/7]/(7-1) = 1.306
5. Compute the sample standard deviation.
Answer: Sample Standard Deviation = Square Root of 1.306 = 1.14
Part B [30 points] (Data Organization and Interpretations): In the August 29, 2001 issue of
The State, Columbia's daily newspaper, SAT scores for South Carolina's 86 school districts for
the years 1998-2001 were reported. The variables in this data set are:
SAT98 = school district SAT score for 1998.
SAT99 = school district SAT score for 1999.
SAT00 = school district SAT score for 2000.
SAT01 = school district SAT score for 2001.
Using Minitab, the following numerical summary measures and graphical displays were obtained
for this data set.
Figure 1. The following frequency histogram is that associated with the SAT scores for Year
2001 for the 86 school districts.
[Figure 1 image not reproduced: "Frequency Histogram of 2001 SAT Scores of 86 South Carolina
School Districts"; horizontal axis SAT01 (ticks 720 to 1120 in steps of 40), vertical axis
Frequency (ticks 0, 10, 20).]
Figure 2: Comparative boxplots of the SAT scores for the 86 school districts for years 1998-2001.
[Figure 2 image not reproduced: "Comparative BoxPlots of the SAT Scores for 86 of South Carolina's
School Districts for 1998-2001"; one boxplot each for SAT98, SAT99, SAT00, and SAT01; vertical
axis SAT Score (ticks 700 to 1100).]
Table 2: The following are numerical summary measures for the SAT Scores for the 86 districts
for each of the years 1998-2001.
Variable    N     Mean     Median   StDev
SAT98       86    918.43   926.50   68.83
SAT99       86    911.38   921.50   75.85
SAT00       86    922.91   938.00   77.19
SAT01       86    934.49   944.50   74.43

Variable    Minimum   Maximum    Q1       Q3
SAT98       741.00    1051.00    879.25   969.50
SAT99       731.00    1049.00    857.25   969.00
SAT00       730.00    1056.00    882.50   979.25
SAT01       753.00    1063.00    903.00   988.25
On the basis of the information in Table 2, Figure 1, and Figure 2, answer the following questions
pertaining to the SAT scores of South Carolina's school districts.
1. By examining Figure 1, describe the shape of the distribution of the SAT scores for
the 86 school districts for Year 2001. (That is, would you describe the shape as
symmetric, left-skewed, or right-skewed?) Is this shape consistent with the
relationship between the mean and median [which you could obtain from Table 2] for
SAT01?
Answer: The distribution is left-skewed. This is consistent with the observation that the
sample mean is smaller than the median as a consequence of the effect of extreme values in
the left on the mean.
2. From Figure 1, how many out of the 86 school districts got SAT scores of at most
800 points?
Answer: From the histogram, the number is 1 + 5 = 6.
3. Using information in Table 2, what value will "balance" or serve as the "center of
gravity" of the distribution of the SAT01 scores?
Answer: The center of gravity coincides with the sample mean, so this is 934.49. The median
need not balance the distribution; it only divides the data into two equal parts.
4. Using Table 2, which value divides the SAT01 scores into a 25:75 split?
Answer: The quantity that splits the data set into a 25:75 split is Q1 = 903.
5. From Table 2, the mean and standard deviation for the SAT01 scores are 934.49 and
74.43, respectively. If you are to use the empirical rule, what percentage of the 86
school districts would you expect to have scores between 934.49 - 2(74.43) = 785.63
and 934.49 + 2(74.43) = 1083.35?
Answer: Since this interval extends 2 standard deviations on either side of the mean, the
empirical rule says that approximately 95% of all observations will fall in it. If one instead
uses Chebyshev's rule, then we can claim that at least 75% of all observations fall in
this interval.
6. By referring to Figure 2 (Comparative Boxplots) and Table 2 (Numerical Summary
Measures), make a comparison of the SAT scores of South Carolina school districts
for the years 1998 to 2001. In particular, could you conclude that the SAT scores
have improved from 1998 to 2001 for the 86 South Carolina school districts? Provide
a brief discussion.
Answer: Looking at the boxplots and the values of the means and medians, there seems to be a
slight increase in the SAT scores over the 4-year period, despite a dip in 1999. Whether the
increase is statistically significant remains to be seen.
Part C [30 points] (Basic Probability): Below is a two-way table of 31510 suicides committed
in 1993, categorized by the sex of the victim and the method used. ("Hanging" also includes
suffocation.)
Table 3: A two-way table of suicides classified according to sex of victim and the method used.
Method \ Sex of Victim    Male     Female   TOTAL
Firearms                  16381    2559     18940
Poison                    3569     2110     5679
Hanging                   3824     803      4627
Other                     1641     623      2264
TOTAL                     25415    6095     31510
Consider the experiment of choosing one suicide victim among the 31510 suicides committed in
1993 as depicted in Table 3. For this experiment, the method of suicide used and the sex of the
victim will be observed.
Let A be the event that the victim used firearms to commit suicide, and B be the event that the
victim is female.
1. What is P(A)?
2. What is P(B)?
3. Find P(A or B).
4. Find P(A and B).
Answer: 2559/31510 = .0812. Note that you cannot multiply P(A) and P(B) since we do NOT
know that they are independent.
5. Find P(B|A).
6. Are events A and B independent events?
Answer: Since P(B|A) = 2559/18940 = .1351 is not equal to P(B) = 6095/31510 = .1934, A and B
are dependent.
Part D [10 points] (Probability Updating): In a genetic setting, either a parent is a carrier or is
not a carrier of some trait (for example, the trait of "being smart"). If the parent is a carrier, then
the conditional probability that an offspring will have the trait is 0.75; while if the parent is not a
carrier, then the conditional probability that an offspring will have the trait is 0.25. Assume that
the prior probability that the parent is a carrier of the trait is 0.30. Suppose that this parent has
one offspring. [HINT: It would help to draw a tree diagram!]
1. What is the probability that the offspring will have the trait?
Answer: P(trait) = P(carrier and trait) + P(not a carrier and trait) = (.3)(.75) + (.7)(.25) =
.225 + .175 = .40
2. Given that the offspring possesses the trait, what is the conditional probability that
the parent is a carrier of the trait?
Answer: P(carrier|trait) = P(carrier and trait)/P(trait) = (.3)(.75)/(.4) = .5625
Part E [10 points]. A random variable X takes values 1, 4, 5 according to the following
probability function:
x                  1     4     5
p(x) = P(X = x)    .5    .3    .2
1. Compute the (population) mean, μ, of X.
Answer: Mean = (1)(.5) + (4)(.3) + (5)(.2) = 2.7
2. Compute the (population) variance, σ^2, of X.
Answer: Variance = (1-2.7)^2(.5) + (4-2.7)^2(.3) + (5-2.7)^2(.2) = 3.01
Part F [15 points]. On the basis of past examinations, the probability that a student will pass the
First Examination in a Stat 515 class is 0.90. Furthermore, the performance of each of the students in
the class can be considered to be independent of each other. Suppose that there are 20 students in
a Stat 515 class who will take the First Examination. Denote by X the number of students out of
these 20 students who will pass the First Examination.
1. Explain why it is reasonable to assume that the distribution of X is binomial with
parameters n = 20 and p = .90.
Answer: The binomial distribution is appropriate since there are 20 trials, each with two
possible outcomes, the trials are independent, the probability of "pass" per trial remains
the same at .90, and X denotes the number of "passes" in the 20 trials.
2. What are the mean and standard deviation of X?
Answer: Mean = np = (20)(.9) = 18
Variance = n(p)(1-p) = (20)(.9)(1-.9) = 1.8
Standard Deviation = Square Root of 1.8 = 1.34
3. Using the binomial table that is provided, determine P{15 ≤ X ≤ 18}.
Answer: P(15 ≤ X ≤ 18) = P(X ≤ 18) - P(X ≤ 14) = .608 - .011 = .597.
Some Formulas That May Be Useful
Xbar = (1/n) Σ_{i=1}^{n} X_i
S^2 = [1/(n-1)] Σ_{i=1}^{n} (X_i - Xbar)^2
    = [1/(n-1)] [ Σ_{i=1}^{n} X_i^2 - (Σ_{i=1}^{n} X_i)^2 / n ]
M = value that divides arranged data into two equal parts
Q1 = Divides arranged data into 25:75 split
Q3 = Divides arranged data into 75:25 split
P(A or B) = P(A) + P(B) - P(A and B)
P(B|A) = P(A and B)/P(A)
P(B) = P(A)P(B|A) + P(A^c)P(B|A^c), where A^c is the complement of A
P(A|B) = P(A)P(B|A)/P(B)
P(A and B) = P(A)P(B) if A and B are independent
n! = (n)(n-1)(n-2)...(2)(1) with 0! = 1
nCr = (n choose r) = n! / [r! (n - r)!]
μ = Σ x p(x) ;  σ^2 = Σ (x - μ)^2 p(x) ;  σ = sqrt(σ^2)
p(x) = (n choose x) p^x (1 - p)^(n - x), x = 0, 1, 2, …, n
μ = np and σ^2 = np(1-p)
```
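The arithmetic in Parts A, C, D, E, and F can be double-checked with a short script. The following is a minimal sketch, not part of the original exam: it assumes Python 3.8 or later with only the standard library, and all variable names and the helper `binom_cdf` are illustrative choices. It also computes P(A), P(B), P(A or B), and P(B|A) from the counts in Table 3, and it reads the Part F probability as P(15 ≤ X ≤ 18), since that is what P(X ≤ 18) - P(X ≤ 14) evaluates.

```python
from math import comb, sqrt
from statistics import mean, median, variance

# Part A: hemoglobin levels (g/dl) from Table 1.
hemoglobin = [14.6, 14.3, 15.1, 12.7, 11.8, 13.4, 13.8]
sample_mean = mean(hemoglobin)
sample_median = median(hemoglobin)
sample_var = variance(hemoglobin)   # sample variance, divides by n - 1
sample_sd = sqrt(sample_var)

# Part C: counts taken from Table 3.
total = 31510
n_firearms = 18940        # event A: firearms
n_female = 6095           # event B: female victim
n_both = 2559             # A and B
p_a = n_firearms / total
p_b = n_female / total
p_a_and_b = n_both / total
p_a_or_b = p_a + p_b - p_a_and_b    # addition rule
p_b_given_a = p_a_and_b / p_a       # conditional probability

# Part D: total probability, then Bayes' rule.
prior_carrier = 0.30
p_trait = prior_carrier * 0.75 + (1 - prior_carrier) * 0.25
p_carrier_given_trait = prior_carrier * 0.75 / p_trait

# Part E: mean and variance of a discrete random variable.
pmf = {1: 0.5, 4: 0.3, 5: 0.2}
mu = sum(x * p for x, p in pmf.items())
sigma2 = sum((x - mu) ** 2 * p for x, p in pmf.items())

# Part F: binomial with n = 20 and p = 0.90.
def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed directly from the pmf."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))

n, p = 20, 0.90
binom_mean = n * p
binom_sd = sqrt(n * p * (1 - p))
p_15_to_18 = binom_cdf(18, n, p) - binom_cdf(14, n, p)

print(f"Part A: mean={sample_mean:.2f}, median={sample_median}, "
      f"var={sample_var:.3f}, sd={sample_sd:.2f}")
print(f"Part C: P(A)={p_a:.4f}, P(B)={p_b:.4f}, P(A or B)={p_a_or_b:.4f}, "
      f"P(A and B)={p_a_and_b:.4f}, P(B|A)={p_b_given_a:.4f}")
print(f"Part D: P(trait)={p_trait:.2f}, P(carrier|trait)={p_carrier_given_trait:.4f}")
print(f"Part E: mean={mu:.1f}, variance={sigma2:.2f}")
print(f"Part F: mean={binom_mean:.0f}, sd={binom_sd:.2f}, "
      f"P(15 <= X <= 18)={p_15_to_18:.3f}")
```

Because P(B|A) and P(B) are computed from the same table, comparing the two printed values is a direct check of the dependence conclusion in Part C.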