Exercises 1C Solutions
Question 1C.1
Consider the following data set:
78, 41, 100, 47, 71, 51, 22, 60, 1, 41, 24, 45, 50, 76, 42, 23, 21, 10, 46
(i) Is this data set left-skewed, right-skewed or symmetric?
To answer this, we sort the data into ascending order and informally cluster the
data. There appears to be a central cluster in the range 40-60, a group on either
side of it (the 20s and the 70s) and some outliers. The data appears to be
reasonably symmetric:
1, 10
21, 22, 23, 24
41, 41, 42, 45, 46, 47, 50, 51, 60
71, 76, 78
100
(ii) Construct a 4-bar histogram for the data set. Would you describe the data as left-skewed, right-skewed or symmetric?
The data values lie in the range 1-100 and so dividing this into four gives the bins
1-25, 26-50, 51-75 and 76-100. The corresponding frequencies are 6, 7, 3, 3. The
histogram is right-skewed.
[Figure: 4-bar histogram; horizontal axis bins 1-25, 26-50, 51-75, 76-100; vertical axis frequency from 0 to 8]
(iii) Construct a 5-bar histogram for the data set. Would you describe the data as left-skewed, right-skewed or symmetric?
The data values lie in the range 1-100 and so dividing this into five gives the bins
1-20, 21-40, 41-60, 61-80 and 81-100. The corresponding frequencies are 2, 4, 9, 3,
1. The histogram is symmetric.
[Figure: 5-bar histogram; horizontal axis bins 1-20, 21-40, 41-60, 61-80, 81-100; vertical axis frequency from 0 to 10]
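The bin frequencies quoted in parts (ii) and (iii) can be checked with a short script. This is a minimal sketch, assuming equal-width integer bins covering 1-100; the function name bin_counts and its parameters are illustrative and not part of the original solution.

```python
# Data set from Question 1C.1.
data = [78, 41, 100, 47, 71, 51, 22, 60, 1, 41,
        24, 45, 50, 76, 42, 23, 21, 10, 46]


def bin_counts(values, n_bins, low=1, high=100):
    """Count values in n_bins equal-width integer bins covering low..high."""
    width = (high - low + 1) // n_bins  # 25 for four bins, 20 for five
    counts = [0] * n_bins
    for v in values:
        counts[(v - low) // width] += 1  # index of the bin containing v
    return counts


print(bin_counts(data, 4))  # [6, 7, 3, 3]
print(bin_counts(data, 5))  # [2, 4, 9, 3, 1]
```

The same data looks right-skewed with four bins and symmetric with five, which shows how much the impression of shape can depend on the choice of bins.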
Question 1C.2
Which of the following situations give a distribution that is skewed-to-the-right,
skewed-to-the-left, unimodal symmetric, bimodal symmetric?
a) Distances completed by people competing in a marathon.
It is reasonable to assume that most people entering will finish the course, but
some will have overestimated their abilities or fallen ill on the day. It is also
reasonable to assume that more will get close to the end than will give up early.
The values therefore pile up at the full distance, with a tail towards shorter
distances. This gives a distribution that is left-skewed.
b) Scores on an easy quiz given the day before spring break when only half the class
shows.
Half the students in the class are absent and get zeros. Because the quiz is easy
those that show are likely to do well, so we expect a bimodal distribution.
c) Money taken at the box office by movies released in a given year.
Most movies are flops; only a few are very successful. The values pile up at the
low end, with a long tail towards high takings. This gives a right-skewed
distribution.
d) Heights of 8 year old girls.
These are likely to be unimodal and symmetric about the average height, with just
as many really tall girls as really short girls.
Question 1C.3
Find the mean and median of the data presented in Question 1C.1, that is, of the
data set:
78, 41, 100, 47, 71, 51, 22, 60, 1, 41, 24, 45, 50, 76, 42, 23, 21, 10, 46
(i) For the mean we sum these values to get 849. There are 19 data values and so the
mean is 849/19 = 44.68.
(ii) The data was ordered in the solution to Question 1C.1:
1, 10, 21, 22, 23, 24, 41, 41, 42, 45, 46, 47, 50, 51, 60, 71, 76, 78, 100
The middle (10th) number, 45, is the median.
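The two averages can be checked with Python's standard statistics module. This is a small sketch for verification only; the variable names are illustrative.

```python
import statistics

# Data set from Question 1C.1.
data = [78, 41, 100, 47, 71, 51, 22, 60, 1, 41,
        24, 45, 50, 76, 42, 23, 21, 10, 46]

mean = statistics.mean(data)      # sum of the values divided by their count
median = statistics.median(data)  # middle value of the sorted data (n is odd)

print(sum(data), len(data))       # 849 19
print(round(mean, 2), median)     # 44.68 45
```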
Question 1C.4
Of the two averages computed in Question 1C.3, which average is better?
The mean of 44.68 and the median of 45 are very similar. Neither choice is better.
Question 1C.5
Carry out a 5-number summary for the data provided in Question 1C.1. Use the
information to construct a boxplot.
The data set was ordered in the solution to Question 1C.3. The quartiles and the
median are marked in brackets:
1, 10, 21, 22, [23], 24, 41, 41, 42, [45], 46, 47, 50, 51, [60], 71, 76, 78, 100
Since there are 19 data values, 9 lie below the median and 9 above the median. To
find the two quartiles we need the middle value of each of these two sets, namely
23 and 60. The 5-number summary is therefore given by:
Low =1
LQ=23
Median=45
UQ=60
High=100
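The quartile rule used here (take the median of each half, leaving the overall median out when the number of values is odd) can be sketched as follows. The function name five_number_summary is illustrative, and other quartile conventions exist that can give slightly different values on other data sets.

```python
import statistics


def five_number_summary(values):
    """Return (low, LQ, median, UQ, high).

    The quartiles are the medians of the lower and upper halves of the
    sorted data; the overall median is excluded from both halves when
    the number of values is odd.
    """
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]            # values below the median position
    upper = s[len(s) - half:]   # values above the median position
    return (s[0], statistics.median(lower), statistics.median(s),
            statistics.median(upper), s[-1])


data = [78, 41, 100, 47, 71, 51, 22, 60, 1, 41,
        24, 45, 50, 76, 42, 23, 21, 10, 46]
print(five_number_summary(data))  # (1, 23, 45, 60, 100)
```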
This information looks quite symmetric as a boxplot.
[Figure: boxplot of the 5-number summary]
Question 1C.6
Find the mean and standard deviation for the following sets of data:
a)
0, 1, 3, 5, 7, 7, 9, 11, 13, 14
The mean is 70/10=7. For the standard deviation we set up the table
Value   Deviation from Mean   Squared Deviation
0       7                     49
1       6                     36
3       4                     16
5       2                     4
7       0                     0
7       0                     0
9       2                     4
11      4                     16
13      6                     36
14      7                     49
The sum of the squared deviations is 210. Divide this by 9 (one less than the
number of values) to get a variance of 210/9 = 23.33 and take the square root to
get a standard deviation of 4.83.
b)
0, 0, 0, 0, 7, 7, 14, 14, 14, 14
The mean is 70/10=7. For the standard deviation we set up the table
Value   Deviation from Mean   Squared Deviation
0       7                     49
0       7                     49
0       7                     49
0       7                     49
7       0                     0
7       0                     0
14      7                     49
14      7                     49
14      7                     49
14      7                     49
The sum of the squared deviations is 392. Divide this by 9 (one less than the
number of values) to get a variance of 392/9 = 43.56 and take the square root to
get a standard deviation of 6.60.
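The table method used in parts a) and b) can be written out as a short script. This is a minimal sketch that follows the solutions in dividing by n - 1; the function name sample_std is illustrative.

```python
import math


def sample_std(values):
    """Return (sum of squared deviations, sample standard deviation).

    The variance divides the sum of squared deviations by n - 1,
    matching the divisor of 9 used for the ten values above.
    """
    mean = sum(values) / len(values)
    squared = [(v - mean) ** 2 for v in values]  # the "Squared Deviation" column
    total = sum(squared)
    return total, math.sqrt(total / (len(values) - 1))


set_a = [0, 1, 3, 5, 7, 7, 9, 11, 13, 14]
set_b = [0, 0, 0, 0, 7, 7, 14, 14, 14, 14]

for name, values in (("a", set_a), ("b", set_b)):
    total, sd = sample_std(values)
    print(f"{name}: sum of squares = {total:.0f}, sd = {sd:.2f}")
# a: sum of squares = 210, sd = 4.83
# b: sum of squares = 392, sd = 6.60
```

Both sets have the same mean of 7, but set b) has more of its values far from the mean, which is why its standard deviation is larger.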
Question 1C.7
A data set is such that its mean is 20, and all of its values lie between 0 and 10 or
between 30 and 40. Which is correct? The standard deviation is
a) less than 0
b) between 0 and 5
c) between 5 and 10
d) greater than 10
Since the mean is 20 and every data point lies between 10 and 20 units from it,
every squared deviation is at least 100. The variance is therefore at least 100
and the standard deviation at least 10. So d is correct.
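The bound can be illustrated numerically. The data sets below are hypothetical examples chosen to satisfy the conditions of the question (mean 20, every value in 0-10 or 30-40); they are not from the original exercise. The sketch prints the standard deviation with both divisors, n (pstdev) and n - 1 (stdev).

```python
import statistics

# Hypothetical data sets: each has mean 20 and every value in 0..10 or 30..40.
examples = [
    [10, 10, 30, 30],  # values as close to the mean as the conditions allow
    [5, 8, 32, 35],
    [0, 0, 40, 40],    # values as far from the mean as the conditions allow
]

for values in examples:
    assert statistics.mean(values) == 20
    population_sd = statistics.pstdev(values)  # divides by n
    sample_sd = statistics.stdev(values)       # divides by n - 1
    print(values, round(population_sd, 2), round(sample_sd, 2))
```

In every case the result is at least 10, with the smallest value arising when all points sit exactly 10 units from the mean.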