Download File - Varsity Field

Chapter 3 - Descriptive stats: Numerical measures
3.1 Measures of Location
Mean


Perhaps the most important measure of location is the mean (average).
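Sample mean:
$\bar{x} = \frac{\sum x_i}{n}$
where n = sample size
Example:
The number of students per class is as follows:
46  54  42  46  32
The mean is:
$\bar{x} = \frac{\sum x_i}{n} = \frac{46 + 54 + 42 + 46 + 32}{5} = \frac{220}{5} = 44$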
Median




The median is another measure of location for a variable.
The median is the value in the middle when the data are arranged in ascending order (smallest
to largest value).
Computation:
o Arrange the data in ascending order (smallest to largest value)
o For an odd number of observations, the median is the middle value
o For an even number of observations, the median is the average of the middle 2 values
Example:
The number of students per class is as follows:
46
54
42
46
32
The median is:
Arrange the values from smallest to largest:
32
42
46
46
54
Middle value = Median = 46
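The two calculations above can be checked with a short Python sketch. It uses the standard `statistics` module; the variable names are illustrative and not part of the original notes.

```python
from statistics import mean, median

# Number of students per class (data from the example above)
class_sizes = [46, 54, 42, 46, 32]

# Mean: sum of the observations divided by the sample size n
sample_mean = mean(class_sizes)

# Median: middle value once the data are sorted in ascending order
sample_median = median(class_sizes)

print("mean =", sample_mean)
print("median =", sample_median)
```

For an even number of observations, `median` returns the average of the two middle values, which matches the computation rule given above.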
Copyright Reserved

Example
The yearly income (in R1000s) of 8 workers is as follows:
95  125  102  150  105  220  120  450
Calculate the mean and the median.
Answers:
Mean/average:
$\bar{x} = \frac{\sum x_i}{n} = \frac{95 + 125 + 102 + 150 + 105 + 220 + 120 + 450}{8} = \frac{1367}{8} = 170.875$
Median:
For the median, we arrange the values from smallest to largest:
95  102  105  120  125  150  220  450
Median = $\frac{120 + 125}{2} = 122.5$
• Although the mean is the more commonly used measure of central location, in some situations the median is preferred.
• The mean is influenced by extremely small and large data values, while the median is not influenced by extreme values.
Mode

Definition:
The mode is the value that occurs with greatest frequency.

Example:
The number of students per class is as follows:
46
54
42
46
32
The mode is: 46
Note:
Bi-modal:
If the data have exactly 2 modes.
Example of a bi-modal data set:
46  54  42  46  32  54
(Both 46 and 54 occur twice.)
Multimodal:
If the data have more than 2 modes.
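A minimal Python sketch of finding the mode(s) is given below. The helper `modes` is a hypothetical name introduced here for illustration. It returns every value that shares the highest frequency, so a bi-modal data set gives two values. Note that, as written, a data set in which every value occurs once would return all values.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the greatest frequency, sorted."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

# Class sizes from the example above: one mode
print(modes([46, 54, 42, 46, 32]))

# Bi-modal data set from the note above: two modes
print(modes([46, 54, 42, 46, 32, 54]))
```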

Example:
Give the appropriate measure of location for the following data:
Soft drink      Frequency
Coke Classic    19
Diet Coke        8
Dr. Pepper       5
Pepsi-Cola      13
Sprite           5
The mode is: Coke Classic
The data are categorical (names of soft drinks), so it makes no sense to speak of the mean or median; the mode is the appropriate measure of location.
Using Microsoft Excel 2007 to compute the mean, median and mode
[Screenshots: formula worksheet and value worksheet]
Percentiles
Definition: The pth percentile is a value such that at least p percent of the observations are less than
or equal to this value and at least (100 – p) percent of the observations are greater than or equal to
this value.
Calculating the pth percentile:
• Arrange the data in ascending order (smallest to largest value)
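• Compute an index i:
$i = \left(\frac{p}{100}\right) n$
where
p = percentile of interest
n = sample size
(a) If i is not an integer, round up. The pth percentile is the value in the position given by the next integer greater than i.
(b) If i is an integer, the pth percentile is the average of the values in positions i and (i + 1).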
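The rule above can be sketched in Python as follows. The function name `percentile` is introduced here for illustration only. It implements the index rule described in these notes, which is not necessarily the method used by Excel or by other statistical libraries. Because the salary data set is not reproduced in this transcript, the sketch uses the yearly income data of the 8 workers from Section 3.1.

```python
import math

def percentile(data, p):
    """pth percentile using the index rule i = (p/100) * n from the notes."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    values = sorted(data)          # arrange in ascending order
    n = len(values)
    i = p * n / 100                # index i
    if i != int(i):
        # (a) i is not an integer: round up and take the value in that position
        return values[math.ceil(i) - 1]
    # (b) i is an integer: average the values in positions i and i + 1
    i = int(i)
    return (values[i - 1] + values[i]) / 2

incomes = [95, 125, 102, 150, 105, 220, 120, 450]
print(percentile(incomes, 25))
print(percentile(incomes, 50))
print(percentile(incomes, 85))
```

Positions in the notes are counted from 1, while Python lists are indexed from 0, which is why the code subtracts 1 when looking up a position.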
Example:
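Determine the 85th percentile ($P_{85}$) for the starting salary data:
Step 1: Arrange the data in ascending order
Step 2: $i = \left(\frac{p}{100}\right) n = \left(\frac{85}{100}\right)(12) = 10.2$, which is rounded up to 11
Step 3: In the 11th position (after being arranged in ascending order): $P_{85} = 3730$.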
Interpretation: 85% of the graduates have a starting salary of R3 730 or less.
Determine the 33rd percentile ($P_{33}$) for the starting salary:
Step 1: Arrange the data in ascending order
Step 2: $i = \left(\frac{p}{100}\right) n = \left(\frac{33}{100}\right)(12) = 3.96$, which is rounded up to 4
Step 3: In the 4th position (after being arranged in ascending order): $P_{33} = 3480$.
Interpretation: 33% of the graduates have a starting salary of R3 480 or less.

Determine the median ($P_{50}$) for the starting salary:
Step 1: Arrange the data in ascending order
Step 2: $i = \left(\frac{p}{100}\right) n = \left(\frac{50}{100}\right)(12) = 6$ and $i + 1 = 7$
Step 3: The median is the average of the values in the 6th and 7th positions: $P_{50} = 3505$.
Interpretation: 50% of the graduates have a starting salary of R3 505 or less.
Determine the 25th percentile ($P_{25}$) for the starting salary:
Step 1: Arrange the data in ascending order
Step 2: $i = \left(\frac{p}{100}\right) n = \left(\frac{25}{100}\right)(12) = 3$ and $i + 1 = 4$
Step 3: $P_{25}$ is the average of the values in the 3rd and 4th positions: $P_{25} = 3465$.
Interpretation: 25% of the graduates have a starting salary of R3 465 or less.

Determine the 75th percentile ($P_{75}$) for the starting salary:
Step 1: Arrange the data in ascending order
Step 2: $i = \left(\frac{p}{100}\right) n = \left(\frac{75}{100}\right)(12) = 9$ and $i + 1 = 10$
Step 3: $P_{75}$ is the average of the values in the 9th and 10th positions: $P_{75} = 3600$.
Interpretation: 75% of the graduates have a starting salary of R3 600 or less.
Quartiles
$Q_1$ = First quartile, 25th percentile
$Q_2$ = Second quartile, 50th percentile, median
$Q_3$ = Third quartile, 75th percentile
3.2 Measures of variability
Range
Range = Largest Value – Smallest Value
Example of the salary data.
The range is:
Range = 3 925 – 3 310 = 615

Advantages:
o Easy to calculate

Disadvantages:
o It depends on just 2 data values: the Largest Value and the Smallest Value.
o It is unstable, because it is influenced by extreme values.
Suppose one of the graduates received a starting salary of 10 000 per month. Then the range is equal to:
Range = 10 000 – 3 310 = 6 690.
Interquartile Range - IQR


It is the range for the middle 50% of the data:
$IQR = Q_3 - Q_1$
Example of the salary data.
The interquartile range for the salary data is:
$IQR = Q_3 - Q_1 = 3600 - 3465 = 135$

Advantages:
o Easy to interpret
o Is not influenced by extreme values

Disadvantages:
o It’s only based on the middle 50% of the data.
Variance

The variance is a measure of variability that utilizes all the data
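Example
46  54  42  46  32
Given: $\bar{x} = 44$
The Sample Variance
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard Deviation
Sample Standard Deviation
$s = \sqrt{s^2}$ and therefore $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$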
Example
Calculate the standard deviation of the class sizes.
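Number of students   Mean class size   Deviation about the mean   Squared deviation about the mean
in class ($x_i$)     ($\bar{x}$)       ($x_i - \bar{x}$)          ($(x_i - \bar{x})^2$)
46                   44                  2                          4
54                   44                 10                        100
42                   44                 -2                          4
46                   44                  2                          4
32                   44                -12                        144
Total                                    0                        256

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{256}{4} = 64$ and $s = \sqrt{64} = 8$
OR
$s^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n - 1} = \frac{(46^2 + 54^2 + 42^2 + 46^2 + 32^2) - 5(44)^2}{4} = \frac{9936 - 9680}{4} = 64$ and $s = \sqrt{64} = 8$
Interpretation:
The class sizes typically deviate from the average class size (44) by about 8 students.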
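The standard deviation of the class sizes can be verified with a short Python sketch. It computes the sample variance from the squared deviations about the mean and, as a cross-check, from the algebraically equivalent shortcut form $(\sum x_i^2 - n\bar{x}^2)/(n - 1)$. Variable names are illustrative.

```python
from math import sqrt

class_sizes = [46, 54, 42, 46, 32]
n = len(class_sizes)
x_bar = sum(class_sizes) / n                     # sample mean

# Definition: sum of squared deviations about the mean, divided by n - 1
sum_sq_dev = sum((x - x_bar) ** 2 for x in class_sizes)
variance = sum_sq_dev / (n - 1)
std_dev = sqrt(variance)

# Shortcut form: (sum of x^2 - n * mean^2) / (n - 1)
shortcut_variance = (sum(x ** 2 for x in class_sizes) - n * x_bar ** 2) / (n - 1)

print("variance =", variance)
print("standard deviation =", std_dev)
print("shortcut variance =", shortcut_variance)
```

Dividing by n - 1 rather than n is what makes this the sample variance, in line with the formula used in these notes.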
Coefficient of Variation




It is a relative measure of variability.
It measures the standard deviation relative to the mean.
Coefficient of Variation:
$CV = \left(\frac{s}{\bar{x}} \times 100\right)\%$
The coefficient of variation tells us what percentage the sample standard deviation is of the sample mean.
Example:
The class test mark (out of 10) and the semester test mark (out of 50) of 5 students are investigated.
Class test (out of 10):      4   5   7   6   8
Average of class test marks = 6
Variance of class test marks = 2.5
Semester test (out of 50):  13  20  25  32  40
Average of semester test marks = 26
Variance of semester test marks = 109.5
Which test has the biggest relative variation? Calculate the relevant numerical measures.
Coefficient of variation for the class test marks:
$CV = \left(\frac{\sqrt{2.5}}{6} \times 100\right)\% = 26.35\%$
Coefficient of variation for the semester test marks:
$CV = \left(\frac{\sqrt{109.5}}{26} \times 100\right)\% = 40.25\%$
Therefore, the semester test has the biggest relative variation.
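As a check on the comparison above, the coefficient of variation can be computed in Python from the means and variances given in the example. The function name `coefficient_of_variation` is introduced here for illustration.

```python
from math import sqrt

def coefficient_of_variation(mean, variance):
    """Standard deviation as a percentage of the mean."""
    return sqrt(variance) / mean * 100

# Means and variances as stated in the example above
cv_class = coefficient_of_variation(6, 2.5)
cv_semester = coefficient_of_variation(26, 109.5)

print(round(cv_class, 2), "%")
print(round(cv_semester, 2), "%")
```

Because the two tests are marked out of different totals, comparing the variances directly would be misleading; the coefficient of variation puts both on the same relative scale.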
Using Microsoft Excel’s 2007 Descriptive Statistics Tool
Self-study (see page 115)
3.3 Measures of Distribution Shape, Relative Location and Detecting Outliers
Distribution Shapes
Read through by yourself.
z-Scores

z-scores:
$z_i = \frac{x_i - \bar{x}}{s}$

The z-score is called the standardized value.

It can be interpreted as the number of standard deviations $x_i$ is from the mean $\bar{x}$.
Example:
z-scores of the class sizes dataset.
(We calculated the mean and standard deviation previously: $\bar{x} = 44$ and s = 8).
Number of students   Deviation about the mean   z-score
in class ($x_i$)     ($x_i - \bar{x}$)          ($\frac{x_i - \bar{x}}{s}$)
46                     2                         0.25
54                    10                         1.25
42                    -2                        -0.25
46                     2                         0.25
32                   -12                        -1.50
Interpretation:
54 is 1.25 standard deviations above the mean.
32 is 1.5 standard deviations below the mean.
Example:
The Mathematics marks of 2 students are compared.
Student 1: 75% (in School A)
Student 2: 80% (in School B)
Which student has done better, relative to his school?
School       A     B
$\bar{x}$   55    80
$s^2$       64   144
s            8    12
Student 1: $z = \frac{75 - 55}{8} = 2.5$
Student 1’s mark is 2.5 standard deviations above the mean.
Student 2: $z = \frac{80 - 80}{12} = 0$
Student 2’s mark is exactly the same value as the mean.
Conclusion:
Student 1 has done relatively better in his school than Student 2.
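The comparison of the two students can be reproduced with a small Python sketch. The helper `z_score` is a name chosen for this illustration; the inputs are the marks, means and standard deviations given in the example.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

# Student 1: 75% in School A (mean 55, s = 8)
z_student_1 = z_score(75, 55, 8)

# Student 2: 80% in School B (mean 80, s = 12)
z_student_2 = z_score(80, 80, 12)

print(z_student_1, z_student_2)
```

Although Student 2 has the higher raw mark, the z-scores show that only Student 1 is above the mean of his own school.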
Chebyshev’s Theorem – Not for examination
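Empirical Rule
For data with a bell-shaped distribution:

Approximately 68% of the data values will be within 1 std dev of the mean $\bar{x}$.

Approximately 95% of the data values will be within 2 std dev of the mean $\bar{x}$.

Almost all (approximately 100%) of the data values will be within 3 std dev of the mean $\bar{x}$.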
Example of the application of the empirical rule:
Suppose IQ scores have a bell-shaped distribution with a mean of 100 and a standard deviation of 15.
a) What percentage of people should have an IQ score between 85 and 115? Answer = 68%
b) What percentage of people should have an IQ score between 70 and 130? Answer = 95%
c) What percentage of people should have an IQ score of more than 130? Answer = 2.5%
100% - 95% = 5% and 5% / 2 = 2.5%
d) The 16th percentile ($P_{16}$) is equal to:
100% - 68% = 32% and 32% / 2 = 16%. Therefore, $P_{16} = 85$.
e) The 84th percentile ($P_{84}$) is equal to:
16% + 68% = 84%. $P_{84} = 115$
f) Is a person with an IQ score of 160 seen as an outlier?
Yes, since approximately 100% of the values are between 55 and 145, an IQ score of 160 is seen as an outlier.
OR
$z = \frac{160 - 100}{15} = 4 > 3$ (see the next Section on outliers).
Detecting Outliers

Sometimes a data set will have one or more observations with unusually large or unusually
small values.

Extreme values are called outliers.

Standardized values (z-scores) can be used to identify outliers.
In the case of a bell-shaped distribution, the following rule can be applied:
Since approximately 100% of the data values will be within 3 std dev of the mean, we recommend treating any data value with a z-score < -3 or a z-score > 3 as an outlier.
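The z-score rule above can be sketched in Python as follows. The function `is_outlier_by_z` and its `threshold` parameter are hypothetical names used only for this illustration. The IQ values come from the empirical rule example (mean 100, standard deviation 15).

```python
def is_outlier_by_z(x, mean, std_dev, threshold=3):
    """Flag x as an outlier if its z-score is below -threshold or above +threshold."""
    z = (x - mean) / std_dev
    return z < -threshold or z > threshold

# IQ scores: bell-shaped with mean 100 and standard deviation 15
print(is_outlier_by_z(160, 100, 15))   # z = 4
print(is_outlier_by_z(130, 100, 15))   # z = 2
```

The rule assumes a bell-shaped distribution; for strongly skewed data the box plot limits discussed in Section 3.4 may be a more suitable check.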
3.4 Exploratory Data Analysis
Five-Number Summary
The following 5 numbers are used to summarize the data:
1. Smallest Value
2. First Quartile ($Q_1$)
3. Second Quartile ($Q_2$)
4. Third Quartile ($Q_3$)
5. Largest Value
The five-number summary of the salary data is:
Smallest value = 3310
$Q_1$ = 3465
$Q_2$ = 3505 (Median)
$Q_3$ = 3600
Largest value = 3925
(These values have been calculated previously).
Box Plot


A box plot is a graphical summary of data that is based on a five-number summary.
A box plot provides another way to identify outliers.
Upper limit = Q3 + (1.5)(IQR) = 3600 + (1.5)(135) = 3802.5
Lower limit = Q1 - (1.5)(IQR) = 3465 - (1.5)(135) = 3262.5
If a point falls above the upper limit or below the lower limit, the point is seen as an outlier.
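The limits above can be reproduced with a brief Python sketch, using the quartiles and the smallest and largest values from the five-number summary of the salary data. The function name `box_plot_limits` is an assumption made for this illustration.

```python
def box_plot_limits(q1, q3):
    """Return (lower limit, upper limit) based on 1.5 times the IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

lower, upper = box_plot_limits(3465, 3600)
print("lower limit =", lower)
print("upper limit =", upper)

# Check the two extreme values from the five-number summary
smallest, largest = 3310, 3925
print("smallest is an outlier:", smallest < lower)
print("largest is an outlier:", largest > upper)
```

With these limits the largest salary (3925) lies above the upper limit, while the smallest salary (3310) lies inside the limits.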
Box plots and skewness:
The median is in the middle of the box, indicating symmetry.
The median is not centered in the middle of the box. The median is closer to $Q_1$, indicating that the shape of the distribution is skewed to the right.
The median is not centered in the middle of the box. The median is closer to $Q_3$, indicating that the shape of the distribution is skewed to the left.
Skewness:
Skewed to the left (negative skew): The left tail is longer; the mass of the distribution is concentrated on the right of the figure. It has relatively few low values.
Skewed to the right (positive skew): The right tail is longer; the mass of the distribution is concentrated on the left of the figure. It has relatively few high values.
Symmetric
Note: A normal distribution is symmetric.
3.5 Measures of association between two variables
Covariance
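Sample Covariance: Measure of the linear relationship between x and y.
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
Note:
$s_{xy} > 0$: Positive linear relationship
$s_{xy} < 0$: Negative linear relationship
$s_{xy} \approx 0$: No linear relationship
Note: (Not in the textbook)
$s_{xx} = \frac{\sum (x_i - \bar{x})(x_i - \bar{x})}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1} = s_x^2$
where $s_x^2$ denotes the sample variance of the x observations.
Similarly:
$s_{yy} = \frac{\sum (y_i - \bar{y})^2}{n - 1} = s_y^2$
where $s_y^2$ denotes the sample variance of the y observations.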
Calculations for the variance and standard deviation of x, the variance and standard deviation of y and the covariance between x and y:
x      y      (x - x̄)   (x - x̄)²   (y - ȳ)   (y - ȳ)²   (x - x̄)(y - ȳ)
2      50     -1         1          -1          1          1
5      57      2         4           6         36         12
1      41     -2         4         -10        100         20
3      54      0         0           3          9          0
4      54      1         1           3          9          3
1      38     -2         4         -13        169         26
5      63      2         4          12        144         24
3      48      0         0          -3          9          0
4      59      1         1           8         64          8
2      46     -1         1          -5         25          5
30     510     0        20           0        566         99     (Totals)

$\bar{x} = \frac{\sum x_i}{n} = \frac{30}{10} = 3$ and $\bar{y} = \frac{\sum y_i}{n} = \frac{510}{10} = 51$
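1. Calculate the variance and the standard deviation of x:
$s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{20}{9} = 2.\dot{2}$ and $s_x = \sqrt{2.\dot{2}} = 1.4907$
2. Calculate the variance and the standard deviation of y:
$s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{566}{9} = 62.\dot{8}$ and $s_y = \sqrt{62.\dot{8}} = 7.9303$
3. Calculate and interpret the covariance between x and y:
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{99}{9} = 11$. There is a positive linear relationship between x and y.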
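The covariance calculation can be checked with the following Python sketch, which uses the ten (x, y) pairs from the table above. The function name `sample_covariance` is chosen for this illustration.

```python
x = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
y = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

def sample_covariance(xs, ys):
    """Sum of cross-products of deviations about the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cross = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    return cross / (n - 1)

s_xy = sample_covariance(x, y)
print("sample covariance =", s_xy)
```

Dividing the same sum of cross-products by n instead of n - 1 would give the population covariance, which is smaller in absolute value.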
Interpretation of sample covariance
A positive linear relationship
[Figure: scatter plot of y against x showing a positive linear relationship]
A negative linear relationship
[Figure: scatter plot of y against x showing a negative linear relationship]
Correlation Coefficient
To measure the strength of the linear relationship between x and y.
$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{11}{(1.4907)(7.9303)} = 0.93$
Strong positive linear relationship between x and y.
where
$s_{xy}$ = Sample covariance between x and y.
$s_x$ = Sample standard deviation of x.
$s_y$ = Sample standard deviation of y.
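A minimal Python sketch of the correlation coefficient is given below. It computes the sample covariance and both sample standard deviations from the raw data and then divides, as in the formula above. The function name `correlation` is an assumption for this illustration; the data are the ten (x, y) pairs from the covariance example.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys)) / (n - 1)
    s_x = sqrt(sum((a - x_bar) ** 2 for a in xs) / (n - 1))
    s_y = sqrt(sum((b - y_bar) ** 2 for b in ys) / (n - 1))
    return s_xy / (s_x * s_y)

x = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
y = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

r = correlation(x, y)
print("r =", round(r, 2))
```

Unlike the covariance, the correlation coefficient has no units and always lies between -1 and +1, which is what makes its size interpretable as strength.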
Interpretation of the Correlation Coefficient
Measures the linear relationship between x and y
i. $r > 0$: Positive linear relationship
$r = +1$: Perfect positive linear relationship
ii. $r < 0$: Negative linear relationship
$r = -1$: Perfect negative linear relationship
iii. $r$ close to 0: No linear relationship
On the scale from -1 to +1:
r close to -1: Strong negative linear relationship between x and y
r slightly below 0: Weak negative linear relationship between x and y
r close to 0: No linear relationship between x and y
r slightly above 0: Weak positive linear relationship between x and y
r close to +1: Strong positive linear relationship between x and y
Using Microsoft Excel 2007 to compute the covariance and correlation coefficient
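[Screenshots: formula worksheet and value worksheet]
Note: We have to adjust the Excel result of 9.9 for the covariance, since the COVAR function in Excel calculates the population covariance.
$s_{xy}$ = sample covariance
$\sigma_{xy}$ = population covariance
$s_{xy} = \left(\frac{n}{n - 1}\right)\sigma_{xy} = \left(\frac{10}{9}\right)(9.9) = 11$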
Homework (work through the following example on your own):
The class test mark (out of 10) (x) and the semester test mark (out of 50) (y) of 5 students are
investigated.
Class test (out of 10) (x)   Semester test (out of 50) (y)
4                            13
5                            20
7                            25
6                            32
8                            40
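(a) Calculate the mean mark and the variance for the class test:
$\bar{x} = \frac{\sum x_i}{n} = \frac{30}{5} = 6$ and
$s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(4 - 6)^2 + (5 - 6)^2 + (7 - 6)^2 + (6 - 6)^2 + (8 - 6)^2}{4} = \frac{10}{4} = 2.5$.
(b) Calculate the mean mark and the variance for the semester test:
$\bar{y} = \frac{\sum y_i}{n} = \frac{130}{5} = 26$ and
$s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{(13 - 26)^2 + (20 - 26)^2 + (25 - 26)^2 + (32 - 26)^2 + (40 - 26)^2}{4} = \frac{438}{4} = 109.5$.
(c) Calculate and interpret the standard deviation for the semester test:
$s_y = \sqrt{109.5} = 10.46$.
The semester test marks typically deviate from the average ($\bar{y} = 26$) by about 10.5 marks.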
(d) Calculate and interpret the covariance:
Answer:
x      y      (x - x̄)   (y - ȳ)   (x - x̄)(y - ȳ)
4      13     -2        -13        26
5      20     -1         -6         6
7      25      1         -1        -1
6      32      0          6         0
8      40      2         14        28
                                   59     (Total)
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{59}{4} = 14.75$. There is a positive linear relationship between x and y.
(e) Calculate and interpret the correlation coefficient:
$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{14.75}{\sqrt{2.5}\sqrt{109.5}} = 0.89$. There is a strong positive linear relationship between x and y.
(f) Suppose a student obtained 6/10 for the class test and 30/50 for the semester test. In which test did the student perform the best, relative to the other students?
$z_{\text{class}} = \frac{6 - 6}{\sqrt{2.5}} = 0$ and $z_{\text{semester}} = \frac{30 - 26}{\sqrt{109.5}} = 0.38$. The student performed the best in the semester test, relative to the other students.
3.6 The weighted mean and working with grouped data
Weighted Mean
Example
Consider the following sample of 5 purchases of raw material
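Purchase   Cost per pound ($)   Number of pounds
1          3.00                 1200
2          3.40                  500
3          2.80                 2750
4          2.90                 1000
5          3.25                  800
Question: What is the mean cost per pound for the raw material?

The weighted mean:
$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$
where $x_i$ is the cost per pound for purchase i and the weight $w_i$ is the number of pounds in purchase i.
$\bar{x} = \frac{(1200)(3.00) + (500)(3.40) + (2750)(2.80) + (1000)(2.90) + (800)(3.25)}{1200 + 500 + 2750 + 1000 + 800} = \frac{18500}{6250} = 2.96$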
Example:
The net full supply capacity (FSC) (in millions of cubic metres) in the various regions and catchment
areas in South Africa, and also the percentage content as on 31 August 1992 are given in the table
below.
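Region/catchment area   FSC    % content
Vaaldam                 2529   20
Bloemhofdam             1269   20
Sterkfonteindam         2617   99
Question: Calculate the weighted mean for the % content in the catchment area:
$\bar{x} = \frac{\sum w_i x_i}{\sum w_i} = \frac{(2529)(20) + (1269)(20) + (2617)(99)}{2529 + 1269 + 2617} = \frac{335043}{6415} = 52.23$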
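Both weighted mean examples can be verified with a short Python sketch. The helper `weighted_mean` is a name introduced here for illustration. In the first example the weights are the numbers of pounds purchased; in the second the weights are the full supply capacities (FSC).

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Raw material purchases: cost per pound weighted by number of pounds
cost_per_pound = weighted_mean([3.00, 3.40, 2.80, 2.90, 3.25],
                               [1200, 500, 2750, 1000, 800])

# Dams: % content weighted by full supply capacity (FSC)
pct_content = weighted_mean([20, 20, 99], [2529, 1269, 2617])

print(round(cost_per_pound, 2))
print(round(pct_content, 2))
```

For comparison, the unweighted mean of the five costs is 3.07, which overstates the cost because the largest purchase was made at the lowest price.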
Grouped data
The audit times for 20 clients were as follows:
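Audit times (in days)   Frequency ($f_i$)   Class midpoint ($M_i$)
10-14                   4                   12
15-19                   8                   17
20-24                   5                   22
25-29                   2                   27
30-34                   1                   32
Total                   20
Sample mean for grouped data:
$\bar{x} = \frac{\sum f_i M_i}{n}$
where
$M_i$ = The midpoint for class i
$f_i$ = The frequency for class i
$\bar{x} = \frac{(4)(12) + (8)(17) + (5)(22) + (2)(27) + (1)(32)}{20} = \frac{380}{20} = 19$
Sample variance for grouped data:
$s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1} = \frac{4(12 - 19)^2 + 8(17 - 19)^2 + 5(22 - 19)^2 + 2(27 - 19)^2 + 1(32 - 19)^2}{20 - 1} = \frac{570}{19} = 30$
The standard deviation:
$s = \sqrt{30} = 5.48$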
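The grouped-data calculations for the audit times can be sketched in Python as follows. Each class is represented by its midpoint, weighted by the class frequency. Variable names are illustrative and not part of the original notes.

```python
from math import sqrt

# (lower class limit, upper class limit, frequency) for the audit times
classes = [(10, 14, 4), (15, 19, 8), (20, 24, 5), (25, 29, 2), (30, 34, 1)]

midpoints = [(low + high) / 2 for low, high, _ in classes]
freqs = [f for _, _, f in classes]
n = sum(freqs)

# Grouped mean: sum of frequency * midpoint, divided by n
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Grouped variance: sum of frequency * squared deviation of midpoint, divided by n - 1
grouped_variance = sum(f * (m - grouped_mean) ** 2
                       for f, m in zip(freqs, midpoints)) / (n - 1)
grouped_std = sqrt(grouped_variance)

print(n, grouped_mean, grouped_variance, round(grouped_std, 2))
```

These are approximations of the mean and variance of the underlying observations, because every value in a class is treated as if it were equal to the class midpoint.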
Homework (go through this example on your own)
Automobiles traveling on a road that has a posted speed limit of 55 miles per hour are checked for
speed by a state police radar system. Following is a frequency distribution of speeds.
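Speed (miles per hour)   Frequency ($f_i$)   Midpoint ($M_i$)
45-49                     10                 47
50-54                     40                 52
55-59                    150                 57
60-64                    175                 62
65-69                     75                 67
70-74                     15                 72
75-79                     10                 77
Total                    475
(a) Calculate the average speed of the automobiles.
$\bar{x} = \frac{\sum f_i M_i}{n} = \frac{28825}{475} = 60.68$
(b) Calculate the variance and the standard deviation
$s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1} = \frac{14802.63}{474} = 31.23$ and $s = \sqrt{31.23} = 5.59$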
Typical exam questions:
The annual amounts (in $ millions) spent on research and development for a random sample of
30 electronic component manufacturers are given in the following Excel spreadsheets. By
using the Sort-option in Excel the data set is sorted according to the amount spent.
[Screenshots: Excel spreadsheets with the unsorted and the sorted data]
The annual amounts (in $ millions) for electronic component manufacturers have a bell-shaped distribution with a mean of 20 and a standard deviation of 7.
Question 1
The range is:
Answer 1
Range = xmax – xmin = 38 – 6 = 32.
Question 2
The median is:
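Answer 2
$i = \left(\frac{p}{100}\right) n = \left(\frac{50}{100}\right)(30) = 15$. We need to take the average of the values in the 15th and 16th positions. In position 15 we have 20 and in position 16 we have 20, therefore Median = $\frac{20 + 20}{2} = 20$.
Question 3: The data type of annual amounts is:
Answer 3: Continuous
Question 4
According to the coefficient of variation:
Answer 4
$CV = \left(\frac{s}{\bar{x}} \times 100\right)\% = \left(\frac{7}{20} \times 100\right)\% = 35\%$. The standard deviation is 35% of the average.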
Questions 5 to 8 are based on the following information:
The relationship between the age (in years) of a motorist and the speed (in km/h) of the car on the
highway is summarised in the following Excel spreadsheet:
[Screenshots: formula sheet and value sheet]
Question 5
The variance of the age of the motorists is:
Answer 5
$s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
[The numerical working uses the values in the Excel spreadsheet, which is not reproduced here.]
Question 6
The coefficient of variation of the age of the motorists is:
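Answer 6
$CV = \left(\frac{s_x}{\bar{x}} \times 100\right)\% = \left(\frac{\sqrt{s_x^2}}{\bar{x}} \times 100\right)\%$
[The numerical working uses the values in the Excel spreadsheet, which is not reproduced here.]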
Question 7
The sample covariance is:
Answer 7
Sample covariance = Population covariance $\times \frac{n}{n - 1}$
Question 8
The relationship between the age of a motorist and the speed of the car on the highway can be
described as:
(A) no linear relationship
(B) a strong negative linear relationship
(C) a weak negative linear relationship
(D) a strong positive linear relationship
(E) a weak positive linear relationship
Answer 8
r = -0.78 which is close to -1. Consequently, we have a strong negative linear relationship.
Questions 9 to 11 are based on the following information:
Consider the following set of Descriptive Statistics on time per week (in hours) spent on
campaigning for the upcoming general election for a specific political party:
Descriptive statistic   Value
$\bar{x}$               22
$s^2$                   25
$Q_1$                   18
$Q_2$                   22
$Q_3$                   26
Smallest value           8
Largest value           36
Question 9
The distribution of time per week (in hours) is:
(A) Bimodal
(B) Multimodal
(C) Symmetrical
(D) Skewed to the right
(E) Skewed to the left
Answer 9
Q1 and Q3 are equally far away from the median, therefore, the distribution is symmetrical. The boxplot, for example, will look something like this:
The median is in the middle of the box, indicating symmetry.
Question 10
Using the box and whisker plot approach, an outlier is a value greater than:
Answer 10
Upper limit = $Q_3 + (1.5)(IQR) = 26 + (1.5)(26 - 18) = 26 + 12 = 38$.
Question 11
The z-score (standardised value) for the largest value in the data set is:
Answer 11
$z = \frac{x - \bar{x}}{s} = \frac{36 - 22}{\sqrt{25}} = \frac{14}{5} = 2.8$