Introduction To Statistics
SEF1124
CHAPTER 3: INTRODUCTION TO STATISTICS
The statistical process:
[Diagram: POPULATION → plan the investigation (What? How? Who? Where?) → collect the sample → SAMPLE → analyzing (organize, present, describe) → making inferences about the POPULATION]
3.1 The Nature of Probability and Statistics
STATISTICS
• the science of conducting studies to collect, organize, summarize, analyze and draw conclusions from data.
• used to analyze the results of surveys and as a tool in scientific research to make decisions based on controlled experiments.
• also useful for operations research, quality control, estimation and prediction.
POPULATION: consists of all subjects that are being studied.
SAMPLE: a group of subjects selected from a population.
VARIABLE: a characteristic of interest of each subject in the population or
sample.
DATA: values (measurement or observations) that the variables can assume.
Variables whose values are determined by chance are called random variables.
EXAMPLE 1
A polling organization wants to know whether Malaysians favour national cars over foreign
ones. What would be the population data set? What would be the sample data set?
Solution
The population data set would consist of the responses of every Malaysian. A common way
of choosing a sample data set would be to randomly call 1000 Malaysians and gather their
responses to the question of whether they favour national cars over foreign ones.
EXAMPLE 2
Suppose we are interested in measuring the mid-semester examination results of students
taking SHE1114 in CFS IIUM in the first semester of an academic year. What would be the
population data set? What would be the sample data set?
Solution
The population data set would be the results of all students who sit for the mid-semester
examination for SHE1114 in the first semester. The sample data set would be the results
of 100 randomly selected students who take the examination in that semester.
A statistical exercise normally consists of 4 stages:
1. Collection of data by counting or measuring.
2. Presentation of the data in a convenient form.
3. Analysis of the collected data.
4. Interpretation of the analysis results and making conclusions.
Two types of statistics:
DESCRIPTIVE STATISTICS
• Consists of the collection, organization, summarization and presentation of data.
• Describes a situation. Data are presented in the form of charts, graphs or tables.
• Makes use of graphical techniques and numerical descriptive measures, such as the average, to summarize and present the data.
• Eg: The national census conducted by the Malaysian government every 5 or 10 years. The results of this census give some information regarding the average age, income and other characteristics of the Malaysian population.
INFERENTIAL STATISTICS
• Consists of generalizing from samples to populations, performing hypothesis tests, determining relationships among variables and making predictions.
• Inferences are made from samples to populations.
• Uses probability, that is, the chance of an event occurring.
• The area of inferential statistics called hypothesis testing is a decision-making process for evaluating claims about a population, based on information obtained from samples.
• Eg: A researcher may want to know whether a new skin lotion containing aloe vera will reduce skin problems in children. For this study, two groups of young children would be selected. One group would be given the lotion containing aloe vera and the other would be given a normal lotion without aloe vera. The results are then observed by experts to assess the effectiveness of the new product.
EXAMPLE 3
A study conducted at Manatee Community College revealed that students who attended
class 95% to 100% of the time usually received an A in the class. Students who attended
class 80% to 90% of the time usually received a B or C in the class. Students who attended
class less than 80% of the time usually received a D or an F or eventually withdrew from
the class.
Based on this:
(a) What are the variables under study?
(b) What are the data in the study?
(c) Which type of statistics was used?
(d) What is the population under study?
(e) Was a sample collected?
(f) From the information given, comment on the relationship between the variables.
Solution
(a) Grades, attendance
(b) Specific grades, attendance records
(c) Descriptive
(d) Students at Manatee Community College
(e) Most probably
(f) The better the attendance, the higher the grade
Variables and Types of Data
[Diagram: TYPES OF DATA divide into QUALITATIVE and QUANTITATIVE; quantitative data are either DISCRETE or CONTINUOUS. LEVEL OF MEASUREMENT: NOMINAL, ORDINAL, INTERVAL, RATIO.]
QUALITATIVE VARIABLES:
• variables that can be placed into distinct categories, according to some characteristic or attribute.
• non-numeric
• eg: gender, colour, religion, workplace etc.
QUANTITATIVE VARIABLES:
• variables that are numerical and can be ordered or ranked. Quantitative variables can be further classified into two groups, namely discrete and continuous.
DISCRETE VARIABLES:
• assume values that can be counted, or for which there is a fixed set of values.
• Eg: the number of children in a family, shoe size etc.
CONTINUOUS VARIABLES:
• can assume an infinite number of values between any two specific values, obtained by measuring.
• Eg: height, weight, temperature etc.
EXAMPLE 4
Classify each variable as qualitative or quantitative. If the variable is quantitative, further
classify it as discrete or continuous.
(a) Number of times students in a hostel wash their clothes in a week
(b) State of origin of members in a club in CFS IIUM
(c) Weights of new born babies in a hospital
(d) Hijab colour of students in group 3 of SHE1114
Solution
(a) Quantitative, discrete
(b) Qualitative
(c) Quantitative, continuous
(d) Qualitative
Levels of measurement
NOMINAL LEVEL OF MEASUREMENT

is the lowest of the four ways to characterize data. Nominal means "in name only"
and that should help to remember what this level is all about.

deals with names, categories, or labels.

data are qualitative.

eg: eye colour, gender, yes or no responses to a survey, favorite breakfast cereal etc.

data can't be ordered in a meaningful way, and it makes no sense to calculate things
such as means and standard deviations.
ORDINAL LEVEL OF MEASUREMENT

the next level after nominal.

data at this level can be ordered, but there are no meaningful differences between
the data ranks.

eg: a list of the top ten cities to live (the cities are ranked from one to ten, but
differences between the cities don't make much sense), letter grades (A could be
higher than a B, but without any other information, there is no way of knowing how
much better an A is from a B), man’s build (small, medium, large) etc.

as with the nominal level, data at the ordinal level should not be used in calculations.
INTERVAL LEVEL OF MEASUREMENT

has all the characteristics of the nominal and ordinal scales, but in addition it is based on
predetermined equal intervals.

deals with data that can be ordered, and in which differences between the data do
make sense. Data at this level do not have a true zero (a natural starting point).

The Fahrenheit and Celsius scales of temperatures are both examples of data at the
interval level of measurement. You can talk about 30 degrees being 60 degrees less
than 90 degrees, so differences do make sense. However 0 degrees (in both scales)
cold as it may be does not represent the total absence of temperature.

data at the interval level can be used in calculations.
RATIO LEVEL OF MEASUREMENT

the fourth and highest level of measurement is the ratio level.

data at the ratio level possess all of the features of the interval level, in addition to a
true zero value. Because zero means the absence of the quantity, it now makes sense to
compare the ratios of measurements. Phrases such as "four times" and "twice" are
meaningful at the ratio level.

eg: distances, in any system of measurement give us data at the ratio level. A
measurement such as 0 feet does make sense, as it represents no length.
Furthermore 2 feet is twice as long as 1 foot. So ratios can be formed between the
data.

sums and differences can be calculated, as well as ratios. One measurement can be
divided by any nonzero measurement, and a meaningful number will result.
EXAMPLE 5
Identify the following as nominal level, ordinal level, interval level, or ratio level data.
(a) Percentage scores on a Math exam.
(b) Letter grades on an English essay.
(c) Flavors of yogurt.
(d) Instructors classified as: Easy, Difficult or Impossible.
(e) Employee evaluations classified as : Excellent, Average, Poor.
(f) Religions.
(g) Political parties.
(h) Commuting times to school.
(i) Years (AD) of important historical events.
(j) Ages (in years) of statistics students.
Data collection and Sampling Techniques
Sampling:

the process of selecting a number of individuals for a study in such a way that the
individuals represent the larger group from which they were selected.

to use a sample to gather information about a population.

a random sample is a sample selected in such a way that every subject in the
population has an equal chance of being selected.
4 basic methods of random sampling:
o Random Sampling: Subjects are selected by random numbers.
o Systematic Sampling: Subjects are selected by using every kth subject
after the first subject is randomly selected from 1 through k.
o Stratified Sampling: Subjects are selected by dividing up the population
into groups (strata), and subjects within each group are randomly selected.
Eg.: We divide the population into 5 groups, then we randomly take subjects from
each group to become our sample.
o Cluster Sampling: Subjects are selected by using intact groups (clusters) that are
representative of the population. Eg.: We divide the population into 5
groups, then we take 2 whole groups to become our sample. That means 2 groups
of subjects represent 5 groups of subjects.
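The four sampling methods can be sketched in Python as follows. This is only an illustration under stated assumptions: the population of 100 numbered subjects, the split into 5 groups of 20, the fixed seed and all function names are hypothetical and do not come from the notes.

```python
import random

def simple_random_sample(population, n, rng):
    """Select n subjects so that every subject is equally likely to be chosen."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Pick a random start among the first k subjects, then take every k-th subject."""
    start = rng.randrange(k)  # index 0..k-1, i.e. subject 1..k
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Randomly select n_per_stratum subjects from every stratum (group)."""
    sample = []
    for group in strata:
        sample.extend(rng.sample(group, n_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole clusters and keep every subject in them."""
    chosen = rng.sample(clusters, n_clusters)
    return [subject for cluster in chosen for subject in cluster]

rng = random.Random(0)                    # fixed seed so the run is repeatable
population = list(range(1, 101))          # hypothetical subjects numbered 1..100
groups = [population[i:i + 20] for i in range(0, 100, 20)]  # 5 groups of 20

srs = simple_random_sample(population, 10, rng)
sys_sample = systematic_sample(population, 10, rng)
strat = stratified_sample(groups, 2, rng)
clus = cluster_sample(groups, 2, rng)
```

Note the difference between the last two functions: the stratified sample takes a few subjects from every group, while the cluster sample takes every subject from a few groups.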
EXERCISE 3.1
1. A firm wanted to keep a database on the heights, gender, marital status, blood types
and highest qualifications of new employees. For that purpose, a study was
conducted.
(a) What would be the population? What would be the sample?
(b) Identify the variables.
(c) Which variables are qualitative, which are quantitative? If the variable is
quantitative, is it discrete or continuous?
2. State whether each of the following variables is discrete or continuous.
(a) Number of calls received by the 999 operators every day for a month.
(b) Life expectancy of 500 government pensioners chosen at random.
(c) Cost of prepaid top-ups among students in a college for a month.
(d) Temperature of coffee served in a bistro for breakfast over a week.
(e) Amount of nasi lemak kampung sold in the month of May at a roadside stall.
3. Classify each set of data as discrete or continuous.
(a) The number of suitcases lost by an airline.
(b) The height of corn plants.
(c) The number of ears of corn produced.
(d) The number of green M&M's in a bag.
(e) The time it takes for a car battery to die.
(f) The production of tomatoes by weight.
4. Identify the following as nominal level, ordinal level, interval level, or ratio level
data.
(a) Lecturers classified according to subjects taught
(b) IQ scores of 100 children below the age of 15 diagnosed with Down’s Syndrome
(c) Number of phone calls received during a water disruption in Klang
(d) Heights of saplings in a greenhouse after three weeks
(e) Order of F1 drivers completing a race
(f) Marital status of applicants to a job vacancy
(g) Salaries of fresh graduates in the country
3.2 Frequency Distributions and Graphs
A frequency distribution is the organization of raw data in table form, using classes and
frequencies.
There are three types of frequency distribution.
1. Categorical frequency distribution
Used when data can be placed in specific categories
EXAMPLE 6
The following data represent the colour of men’s shirts purchased in the men’s department
of a large department store. Construct a frequency distribution for the data. (W = White, BL
= Blue, BR = Brown, Y =Yellow, G = Grey)
W   W   BL  Y   W   W   W   G   BL  BL
BR  BL  W   G   Y   Y   BR  BL  BR  W
BL  BL  W   G   W   BL  BR  W   BR  BL
W   BL  BL  W   W   W   BL  W   W   BR
Y   BR  BL  BR  G   G   Y   BR  Y   G
(A complete categorical distribution must have class, frequency and percentage columns in the
table)
Shirt Colour   Frequency   Percentage
White
Blue
Brown
Yellow
Grey
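As a check on a hand tally, the table for Example 6 can also be produced with a short Python sketch. It assumes the 50 colour codes are read in the order listed above; the variable names are illustrative only.

```python
from collections import Counter

# The 50 shirt colour codes from Example 6, in the order listed.
shirts = """
W W BL Y W W W G BL BL
BR BL W G Y Y BR BL BR W
BL BL W G W BL BR W BR BL
W BL BL W W W BL W W BR
Y BR BL BR G G Y BR Y G
""".split()

names = {"W": "White", "BL": "Blue", "BR": "Brown", "Y": "Yellow", "G": "Grey"}
counts = Counter(shirts)          # frequency of each category
total = sum(counts.values())      # total number of shirts

print(f"{'Shirt colour':<14}{'Frequency':>10}{'Percentage':>12}")
for code, name in names.items():
    freq = counts[code]
    print(f"{name:<14}{freq:>10}{100 * freq / total:>11.0f}%")
print(f"{'Total':<14}{total:>10}{100:>11}%")
```

The percentage column is each frequency divided by the total number of observations, multiplied by 100.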
2. Grouped frequency distribution
When the range of the data is large, the data must be grouped into classes.
Grouped data are a collection of data in a more condensed form, where the data set is made
into groups of suitable size. These groups are known as data classes. The number of values
from the set in each class makes the frequency of that class.
To construct a frequency distribution for grouped data, we must first determine the classes.
We can use the following guidelines when forming the classes:
1. There should be between 5 and 20 classes.
2. The class width should preferably be an odd number. When the class limits are
integers, this ensures that the class midpoints are integers instead of decimals.
3. The classes must be mutually exclusive. This means that no data value can fall into
two different classes
4. The classes must be all inclusive or exhaustive. This means that all data values must
be included.
5. The classes must be continuous. There are no gaps in a frequency distribution.
Classes that have no values in them must be included (unless it's the first or last
class which are dropped).
6. The classes must be equal in width. The exception here is the first or last class.
Next, we use the following guidelines to create a grouped frequency distribution:
1. Find the largest and smallest values
2. Compute the range
Range = Maximum - Minimum
3. Select the number of classes desired. This is usually between 5 and 20.
4. Find the class width by dividing the range by the number of classes and rounding up.
$\text{class width} = \dfrac{\text{range}}{\text{number of classes}}$
5. Pick a suitable starting point less than or equal to the minimum value.
6. To find the upper limit of the first class, subtract one from the lower limit of the
second class.
7. Find the boundaries by subtracting 0.5 units from the lower limits and adding 0.5
units to the upper limits. The boundaries are also half-way between the upper
limit of one class and the lower limit of the next class.
8. Tally the data.
9. Find the frequencies.
10. Find the cumulative frequencies.
11. If necessary, find the relative frequencies and/or relative cumulative frequencies.
The following table lists the important terminology we use when describing data in a
frequency distribution.
Terminology: Description
Class interval: Each class is bounded by two figures, which are called class limits. The figure on the left side of a class is called its lower limit and that on its right is called its upper limit.
Lower class limit: The least value that can belong to a class.
Upper class limit: The greatest value that can belong to a class.
Class width: The difference between the upper (or lower) class limits of consecutive classes. All classes should have the same class width.
Class width = Upper class boundary - Lower class boundary
= Lower class limit of next class - Lower class limit of one class
= Upper class limit of next class - Upper class limit of one class
Class boundaries: The average of the upper limit of one class and the lower limit of the next class.
Class midpoint: The middle value of each data class. To find the class midpoint, average the upper and lower class limits, or the upper and lower class boundaries.
$\text{class midpoint} = \dfrac{\text{lower} + \text{upper}}{2}$
EXAMPLE 7
The following data show the ages of patients diagnosed with metastatic carcinoma of the
bone at an oncology ward over a period of two years.
45  31  46  25  57  39  42  55  20  37
40  59  11  38  34  22  62  33  48  43
57  37  43  51  29  41  35  66  45  32
44  47  42  46  54  65  17  35  53  27
38  22  33  39  45  32  43  41  57  45
Construct a frequency distribution for the data using 6 equal classes, showing the class
boundaries and midpoints.
Solution
Range = 66 – 11 = 55
Class width = Range/number of classes
= 55/6 ≈ 9.17 (rounded up to 10)
Ages      Boundaries     Midpoints   Frequency
10 – 19   9.5 – 19.5     14.5        2
20 – 29   19.5 – 29.5    24.5        6
30 – 39   29.5 – 39.5    34.5        14
40 – 49   39.5 – 49.5    44.5        17
50 – 59   49.5 – 59.5    54.5        8
60 – 69   59.5 – 69.5    64.5        3
Total                                50
Cumulative frequency: the number of data elements in any given class and all previous
classes.
Relative frequency: the ratio of the frequency of any given class to the sum of
frequencies.
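The construction steps can be sketched in Python using the ages from Example 7. The starting point of 10 follows the solution above; the variable names and the dictionary layout are illustrative choices, not part of the notes.

```python
import math

ages = [45, 31, 46, 25, 57, 39, 42, 55, 20, 37,
        40, 59, 11, 38, 34, 22, 62, 33, 48, 43,
        57, 37, 43, 51, 29, 41, 35, 66, 45, 32,
        44, 47, 42, 46, 54, 65, 17, 35, 53, 27,
        38, 22, 33, 39, 45, 32, 43, 41, 57, 45]

num_classes = 6
data_range = max(ages) - min(ages)            # maximum minus minimum
width = math.ceil(data_range / num_classes)   # range / number of classes, rounded up
start = 10                                    # starting point at or below the minimum

table = []
cumulative = 0
for i in range(num_classes):
    lower = start + i * width                 # lower class limit
    upper = lower + width - 1                 # upper class limit (whole-number data)
    freq = sum(lower <= a <= upper for a in ages)
    cumulative += freq
    table.append({
        "class": (lower, upper),
        "boundaries": (lower - 0.5, upper + 0.5),
        "midpoint": (lower + upper) / 2,
        "frequency": freq,
        "cumulative": cumulative,
        "relative": freq / len(ages),
    })

for row in table:
    print(row)
```

The cumulative and relative frequencies defined above come out of the same loop, so the whole table can be built in a single pass over the classes.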
EXAMPLE 8
The lengths in mm of a batch of 40 spindles manufactured on a day were measured with
the following results:
20.90  20.57  20.86  20.74  20.82  20.63  20.53  20.89
20.75  20.65  20.71  21.03  20.72  20.41  20.49  20.75
20.79  20.65  21.08  20.89  20.50  20.88  20.97  20.78
20.61  20.92  21.07  21.16  20.80  20.77  20.82  20.72
20.60  20.90  20.86  20.68  20.75  20.88  20.56  20.94
Construct a frequency distribution for the data using 8 equal classes, showing the class
boundaries, midpoints, cumulative frequencies and relative frequencies.
Solution
Range = 21.16 – 20.41 = 0.75
Class width = Range/number of classes
= 0.75/8 ≈ 0.09 (rounded up to 0.1)
Lengths         Boundaries        Midpoints   Frequency   Cum. freq.   Rel. freq.
20.40 – 20.49   20.395 – 20.495   20.445      1           1            0.025
20.50 – 20.59   20.495 – 20.595   20.545      4           5            0.10
20.60 – 20.69   20.595 – 20.695   20.645      6           11           0.15
20.70 – 20.79   20.695 – 20.795   20.745      10          21           0.25
20.80 – 20.89   20.795 – 20.895   20.845      9           30           0.225
20.90 – 20.99   20.895 – 20.995   20.945      6           36           0.15
21.00 – 21.09   20.995 – 21.095   21.045      3           39           0.075
21.10 – 21.19   21.095 – 21.195   21.145      1           40           0.025
3. Ungrouped frequency distribution
Used when the range of the data is small.
Ungrouped frequency distribution is used for data which have been obtained in their
original form, also called raw data. When a set of ungrouped data is arranged in ascending
or descending order, the set is called an array.
To construct the frequency distribution for ungrouped data, we take each observation from
the data, one at a time, and indicate the frequency (the number of times the observation
has occurred in the data) by small line, called tally marks. For convenience, we write tally
marks in bunches of five, the fifth one crossing the fourth diagonally. We may choose to
omit the tally marks from the frequency distribution.
In the table so formed, the sum of all the frequencies is equal to the total number of
observations in the given data.
EXAMPLE 9
The marks obtained by 25 students in a class in a certain examination are given
below:
25, 8, 37, 16, 45, 40, 29, 12, 42, 25, 14, 16, 16, 20, 10, 36, 33, 24, 25, 35, 11, 30, 45, 48, 40
If these marks are arranged in ascending order, we get the following array:
8, 10, 11, 12, 14, 16, 16, 16, 20, 24, 25, 25, 25, 29, 30, 33, 35, 36, 37, 40, 40, 42, 45, 45, 48
EXAMPLE 10
The number of years in service of faculty members at the Mathematics Department is given
below:
7 8 5 4 9 8 5 7 6 8 9 6 7 9 8 7 9 9 6 5 8 9 4 5 5 8 9 6
From these data, we may construct a frequency distribution table, as given below:
Years served   Frequency
4              2
5              5
6              4
7              4
8              6
9              7
Total          28
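The tally for Example 10 can be reproduced with a short Python sketch. It assumes the fused entry "98" in the listed data stands for the two separate values 9 and 8, which is the reading that matches the total of 28 in the table; the variable names are illustrative.

```python
from collections import Counter

# Years in service from Example 10 (the fused "98" is read as 9 and 8).
years = [7, 8, 5, 4, 9, 8, 5, 7, 6, 8, 9, 6, 7, 9,
         8, 7, 9, 9, 6, 5, 8, 9, 4, 5, 5, 8, 9, 6]

frequency = Counter(years)            # counts how often each value occurs
table = sorted(frequency.items())     # array order: ascending by value

print("Years served  Frequency")
for value, freq in table:
    print(f"{value:>12}  {freq:>9}")
print(f"{'Total':>12}  {sum(frequency.values()):>9}")
```

Sorting the counted values gives the same ordering as the array described above, and the frequencies sum to the number of observations.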
Graphical Representation of Data
Histogram
A histogram is a graphical representation of the information in a frequency table using a
bar graph.
The histogram should have the variable being measured in the data set as its horizontal
axis, and the class frequency as the vertical axis. Each data class will be represented by a
vertical bar whose height is the frequency of the class and whose width is the class width.
i)
x-axis: class boundary
y-axis: frequency
ii)
x-axis: class boundary
y-axis: relative frequency
EXAMPLE 11
Construct a histogram for the data in Example 7.
Solution
Ages      Boundaries     Midpoints   Frequency
10 – 19   9.5 – 19.5     14.5        2
20 – 29   19.5 – 29.5    24.5        6
30 – 39   29.5 – 39.5    34.5        14
40 – 49   39.5 – 49.5    44.5        17
50 – 59   49.5 – 59.5    54.5        8
60 – 69   59.5 – 69.5    64.5        3
Total                                50
[Figure: histogram of the ages, with frequency (0 to 18) on the vertical axis and bars of heights 2, 6, 14, 17, 8 and 3 labelled by the class midpoints 14.5 to 64.5.]
Frequency Polygon
A frequency polygon is a line graph representation of the information in a frequency table.
Like a histogram, the vertical axis represents frequency and the horizontal axis represents
the variable being measured in the data set. To construct the graph, a point is plotted for
each class at its midpoint and with height given by the frequency of the class. The points
are then connected by straight lines.
i)
x-axis: class midpoint
y-axis: frequency
ii)
x-axis: class midpoint
y-axis: relative frequency
EXAMPLE 12
Construct a frequency polygon for the same data in Example 7.
Solution
Ages      Boundaries     Midpoints   Frequency
10 – 19   9.5 – 19.5     14.5        2
20 – 29   19.5 – 29.5    24.5        6
30 – 39   29.5 – 39.5    34.5        14
40 – 49   39.5 – 49.5    44.5        17
50 – 59   49.5 – 59.5    54.5        8
60 – 69   59.5 – 69.5    64.5        3
Total                                50
[Figure: frequency polygon of the ages, with frequency (0 to 18) on the vertical axis and the class midpoints 14.5 to 64.5 on the horizontal axis.]
Ogive
An ogive is a line graph representing the cumulative frequencies for the classes.
The vertical axis represents cumulative frequency and the horizontal axis represents the
variable being measured in the data set. To construct the graph, a point is plotted for each
class at its upper class boundary, with height given by the cumulative frequency of the
class. The points are then connected by straight lines.
i)
x-axis: class boundary
y-axis: cumulative frequency
ii)
x-axis: class boundary
y-axis: cumulative relative frequency
EXAMPLE 13
Construct an ogive for the same data in Example 7.
Solution
Ages      Boundaries     Midpoints   Frequency   Cumulative Frequency
10 – 19   9.5 – 19.5     14.5        2           2
20 – 29   19.5 – 29.5    24.5        6           8
30 – 39   29.5 – 39.5    34.5        14          22
40 – 49   39.5 – 49.5    44.5        17          39
50 – 59   49.5 – 59.5    54.5        8           47
60 – 69   59.5 – 69.5    64.5        3           50
Total                                50
[Figure: ogive of the ages, with cumulative frequency (0 to 60) on the vertical axis.]
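The points of the ogive for the Example 7 data can be computed with a running total. The sketch below plots nothing; it only lists the (class boundary, cumulative frequency) pairs. Starting the curve at the first lower boundary with a height of 0 is a common convention and is an assumption here, as are the variable names.

```python
from itertools import accumulate

upper_boundaries = [19.5, 29.5, 39.5, 49.5, 59.5, 69.5]
frequencies = [2, 6, 14, 17, 8, 3]

cumulative = list(accumulate(frequencies))   # running total of the frequencies
total = cumulative[-1]

# Assumed convention: the ogive starts at the first lower boundary with height 0.
ogive_points = [(9.5, 0)] + list(zip(upper_boundaries, cumulative))

# Version ii): cumulative relative frequency on the vertical axis.
relative_points = [(x, c / total) for x, c in ogive_points]

for point in ogive_points:
    print(point)
```

The last point always has a height equal to the total number of observations, or 1 for the cumulative relative frequency version.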
EXERCISE 3.2
1. In a class of 35 students, the following grade distribution was found. Construct a
histogram, frequency polygon and ogive for the data. (A=4, B=3, C=2, D=1, F=0)
Grade       0   1   2   3    4
Frequency   3   6   9   12   5
2. Using the histogram shown below, construct
i) A frequency distribution
ii) A frequency polygon
iii) An ogive
[Figure: histogram with frequency (0 to 7) on the vertical axis and class boundaries 21.5, 24.5, 27.5, 30.5, 33.5, 36.5, 39.5 and 42.5 on the horizontal axis.]
3. The following frequency distribution was obtained from the duration (in minutes)
taken by 70 candidates in a writing test.
Duration    21.2 – 21.4   21.5 – 21.7   21.8 – 22.0   22.1 – 22.3   22.4 – 22.6   22.7 – 22.9
Frequency   3             7             12            16            19            13
Find:
(a) The lower and upper boundary of the third class.
(b) The midpoint of the fifth class.
(c) The cumulative frequency of the fourth class.
(d) The relative frequency of the second class.
(e) The class width.
4.
x           15   16   17   18   19   20
Frequency   1    4    9    10   6    2
Based on the frequency distribution above, construct:
(a) a relative frequency histogram.
(b) an ogive.
5. The number of cars passing through a guard post was recorded on 40 occasions, as
shown below.
66  87  79  74  84  72  81  78  68  74
80  71  91  62  77  86  87  72  80  77
76  83  75  71  83  67  94  64  82  78
77  67  76  82  78  88  66  79  74  64
(a) Construct a frequency distribution using 7 equal classes from 60 to 94,
showing the class boundaries and midpoints.
(b) Draw a frequency polygon for the data.
6. The number of calories per serving for selected ready-to-eat cereals is listed here.
Construct a histogram, frequency polygon and ogive for the data using relative
frequency.
130  210  190  190  115  190  130  210  240  210
140  100  120  80   110  80   90   200  120  225
100  210  130  90   190  120  120  180  190  130
220  200  260  200  220  120  270  210  110  180
100  190  100  120  160  180
7. Below is a data set for the duration (in minutes) of a random sample of 24 long-distance phone calls:
1 20 10 20 12 23 3 7 18 12 4 5 15 7 29 10 18 10 10 23 4 12 8 6
(a) Construct a frequency distribution table for the data using the classes “1 to 5”, “6
to 10” etc.
(b) Construct a cumulative frequency distribution table and use it to draw up an
ogive.
8. The following table refers to the 2003 average income (in thousand Ringgit) per
year for 20 employees of company A.
Income (‘000 Ringgit)   5 – 9   10 – 14   15 – 19   20 – 24   25 – 29   30 – 34
Frequency               6       3         2         4         3         2
(a) Draw a histogram and a frequency polygon for the above data.
(b) Construct the cumulative frequency table. Hence, draw up an ogive for the above
data.
3.3 Measures of Central Tendency
The following symbols and variables will have the meanings given below (unless otherwise
specified)
Variables
x = data value
n = number of values in a sample data set
N = number of values in a population data set
f = frequency of a data class
m = midpoint of a data class
Symbol
∑ indicates the sum of all values for the following variable or expression.
Example: Using our notation, we can write the statement that the sum of the frequencies in
a frequency table should equal the number of values in the data set as follows:
∑𝑓 = 𝑛
A measure of central tendency is a value used to represent the “average” value in a data set.
The three most commonly used measures of central tendency are the mean, the median and the mode.
Mean – the sum of all data values divided by the number of values in the data set. The mean of a
sample data set is denoted by 𝑥̅ and the mean of a population data set by the Greek letter 𝜇.
The mean is the most commonly used measure of central tendency.
Median – the value which separates the largest 50% of data values from the lowest 50%.
Mode – the data value (or values) which appears the largest number of times in the set. If
no data value is repeated, we say that there is no mode.
Mean, median and mode for ungrouped data
Population mean
$\mu = \dfrac{\sum x}{\sum f} = \dfrac{\sum x}{N}$
Sample mean
$\bar{x} = \dfrac{\sum x}{\sum f} = \dfrac{\sum x}{n}$
Median
• arrange the data in ascending order.
• if n is odd, the middle value is the median.
• if n is even, the mean of the two middle values is the median.
Mode
• the value that occurs most frequently in a data set.
EXAMPLE 14
Suppose earnings from selling burgers by the roadside for the past week were as follows:
Day              Monday   Tuesday   Wednesday   Thursday   Friday
Earnings in RM   350      150       100         350        50
Calculate the mean, median and mode of the daily earnings.
Solution
Mean:
$\bar{x} = \dfrac{\sum x}{n} = \dfrac{350 + 150 + 100 + 350 + 50}{5} = \dfrac{1000}{5} = \text{RM}200$
Median:
Mode:
EXAMPLE 15
Calculate the mean, median and mode of quiz score from the data below:
1, 5, 7, 7, 6, 8, 10, 9, 5, 10, 8
Solution:
Placing the data in ascending order,
1, 5, 5, 6, 7, 7, 8, 8, 9, 10, 10
Since the number of data values is odd, the median is the middle value, which is 7.
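The three measures for the quiz scores in Example 15 can be checked with Python's standard `statistics` module. This is a minimal sketch; `multimode` (available from Python 3.8) is used because a data set may have more than one mode.

```python
import statistics

scores = [1, 5, 7, 7, 6, 8, 10, 9, 5, 10, 8]   # quiz scores from Example 15

mean = statistics.mean(scores)         # sum of the values divided by n
median = statistics.median(scores)     # middle value of the sorted data
modes = statistics.multimode(scores)   # every value with the highest count

print(f"mean = {mean:.2f}, median = {median}, modes = {modes}")
```

Here four different scores each appear twice, so the data set has several modes rather than a single one.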
Mean, median and mode for ungrouped frequency distribution
Mean
$\bar{x} = \dfrac{\sum fx}{\sum f}$
Median
• find the cumulative frequency
• location of the median: $\dfrac{\sum f}{2}$
Mode
• the value with the highest frequency
EXAMPLE 16
The masses in kg of 50 groupers ordered by a sushi restaurant have been measured as
follows.
Mass   4.2   4.3   4.4   4.5   4.6   4.7   4.8   4.9
f      1     3     7     10    12    10    5     2
Calculate the mean, median and mode mass of the fish.
Solution:
Mass    4.2   4.3   4.4   4.5   4.6   4.7   4.8   4.9
f       1     3     7     10    12    10    5     2
cum f   1     4     11    21    33    43    48    50
Since the number of data values is even, the median lies between the two middle values,
that is between the 25th and the 26th values. From the cumulative frequencies, we can see
that this value will be 4.6 kg.
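For an ungrouped frequency distribution, the same three measures can be computed directly from the table without listing all 50 values. The sketch below uses the grouper data from Example 16; the helper `value_at` is a hypothetical name, and ties for the mode are not handled.

```python
masses = [4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9]
freqs = [1, 3, 7, 10, 12, 10, 5, 2]

n = sum(freqs)                                           # sum of f
mean = sum(f * x for x, f in zip(masses, freqs)) / n     # sum of fx over sum of f

def value_at(position):
    """Return the data value at a 1-based position in the sorted data."""
    cumulative = 0
    for x, f in zip(masses, freqs):
        cumulative += f
        if position <= cumulative:
            return x
    raise ValueError("position is larger than the number of values")

if n % 2 == 0:
    # Even n: average the two middle values.
    median = (value_at(n // 2) + value_at(n // 2 + 1)) / 2
else:
    median = value_at((n + 1) // 2)

# Value with the highest frequency (the first one if several tie).
mode = masses[freqs.index(max(freqs))]

print(f"mean = {mean:.3f} kg, median = {median} kg, mode = {mode} kg")
```

The helper walks down the cumulative frequencies in the same way the solution reads them off the table to locate the 25th and 26th values.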
EXAMPLE 17
This ungrouped frequency distribution of the number of cups of coffee consumed with each
meal was obtained from a survey conducted in a restaurant. Find the mean, median and
mode.
Number of cups   0   1   2    3   4   5
Frequency        5   8   10   2   3   2
Mean, mode and median for grouped data
Mean
Population mean
$\mu = \dfrac{\sum fm}{\sum f} = \dfrac{\sum fm}{N}$
Sample mean
$\bar{x} = \dfrac{\sum fm}{\sum f} = \dfrac{\sum fm}{n}$
Median
• Find the cumulative frequency
• Find the median class (location of the median)
• The median is:
$L_m + \left[\dfrac{\frac{n}{2} - \sum f_{m-1}}{f_m}\right] w$
$L_m$ : lower boundary of the median class
$\sum f_{m-1}$ : cumulative frequency before the median class
$f_m$ : frequency of the median class
$n$ : number of data values
$w$ : class width
Mode
• find the modal class (class with the highest frequency)
• The mode is:
$L_{mo} + \left[\dfrac{\Delta_1}{\Delta_1 + \Delta_2}\right] w$
$L_{mo}$ : lower boundary of the modal class
$\Delta_1$ : difference between frequency of modal class and frequency of class before
$\Delta_2$ : difference between frequency of modal class and frequency of class after
$w$ : the class width
EXAMPLE 18
The following distribution shows the prices of items sold at a car boot sale.
Prices in RM   Frequency   Midpoints   fm
1 – 5          8           3           24
6 – 10         6           8           48
11 – 15        4           13          52
16 – 20        2           18          36
21 – 25        4           23          92
26 – 30        6           28          168
31 – 35        2           33          66
               n = 32                  ∑fm = 486
Calculate the mean, median and mode price of the sold items.
Solution
$\bar{x} = \dfrac{\sum fm}{n} = \dfrac{486}{32} = \text{RM } 15.19$
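The grouped-data formulas for the mean, median and mode can be applied to the car boot sale data with the sketch below. One assumption is labelled in the code: when the modal class is the first or last class, the missing neighbouring frequency is taken as 0. The variable names are illustrative.

```python
classes = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 25), (26, 30), (31, 35)]
freqs = [8, 6, 4, 2, 4, 6, 2]

n = sum(freqs)
width = classes[1][0] - classes[0][0]          # lower limit of next class minus this one
midpoints = [(lo + hi) / 2 for lo, hi in classes]

# Mean: sum of (frequency x midpoint) divided by n.
mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Median: the median class is the first class whose cumulative frequency reaches n/2.
cum_before = 0
for (lo, hi), f in zip(classes, freqs):
    if cum_before + f >= n / 2:
        median = (lo - 0.5) + (n / 2 - cum_before) / f * width
        break
    cum_before += f

# Mode: the modal class is the class with the highest frequency.
i = freqs.index(max(freqs))
before = freqs[i - 1] if i > 0 else 0                # assumption: 0 if no class before
after = freqs[i + 1] if i < len(freqs) - 1 else 0    # assumption: 0 if no class after
d1, d2 = freqs[i] - before, freqs[i] - after
mode = (classes[i][0] - 0.5) + d1 / (d1 + d2) * width

print(f"mean = RM {mean:.2f}, median = RM {median:.2f}, mode = RM {mode:.2f}")
```

The lower boundaries are obtained by subtracting 0.5 from the lower class limits, as in the rule for finding class boundaries.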
EXAMPLE 19
Calculate the mean, median and mode of the lengths of 40 spindles in Example 8.
Solution
Lengths         Boundaries        Frequency   Cum. freq.
20.40 – 20.49   20.395 – 20.495   1           1
20.50 – 20.59   20.495 – 20.595   4           5
20.60 – 20.69   20.595 – 20.695   6           11
20.70 – 20.79   20.695 – 20.795   10          21
20.80 – 20.89   20.795 – 20.895   9           30
20.90 – 20.99   20.895 – 20.995   6           36
21.00 – 21.09   20.995 – 21.095   3           39
21.10 – 21.19   21.095 – 21.195   1           40
The median class is the class containing the n/2 = 20th data value, that is, the fourth class.
The class width is 0.1.
Using the formula, the median is
$L_m + \left[\dfrac{\frac{n}{2} - \sum f_{m-1}}{f_m}\right] w = 20.695 + \left[\dfrac{\frac{40}{2} - 11}{10}\right](0.1) = 20.785$
EXERCISE 3.3
1. The following frequency distribution shows the numbers of books read by each of
the 28 students in a literature class.
Number of books   0 – 2   3 – 5   6 – 8   9 – 11   12 – 14
Frequency         2       6       12      5        3
(a) Find the mean, median and mode.
(b) Find the percentage of students who read:
(i) less than six books
(ii) more than nine books.
2. Eighty randomly selected light bulbs were tested to determine their lifetimes (in
hours). This frequency distribution was obtained. Find the mean, median and mode.
Class Boundaries   Frequency
52.5 – 63.5        6
63.5 – 74.5        12
74.5 – 85.5        25
85.5 – 96.5        18
96.5 – 107.5       14
107.5 – 118.5      5
3. After a month, the heights of 120 saplings were measured as follows:
Height (cm)   29.4   29.5   29.6   29.7   29.8   29.9
Frequency     6      25     34     32     18     5
(a) Find the mean, mode and median height.
(b) Draw a relative frequency histogram for the data.
4. The following data set represents the life expectancy of ten government pensioners.
96  68  78  82  74  𝑥  70  86  84  87
If the median is 81, find the value of 𝑥. Hence, find the mean.
5. The following scores were obtained by a batch of students on a Calculus test:
Scores   56-60   61-65   66-70   71-75   76-80   81-85   86-90   91-95   96-100
f        2       2       3       5       6       7       8       4       3
(a) Calculate the mean and the mode of the scores.
(b) Give a brief interpretation of the mean and the mode with reference to the
students’ performance in the test.
6. The distances in km traveled on a given day by 40 sales representatives of a direct
selling company were recorded as follows:
210  181  192  164  170  186  205  194  178  161
175  195  172  188  196  182  206  188  165  202
178  163  190  198  187  198  174  172  183  208
185  162  203  172  196  184  185  176  197  184
(a) Construct a frequency distribution using 5 equal classes
(b) Draw an ogive
(c) Calculate the mean, mode and median
(d) Give a brief interpretation of the measures in part (c)
3.4 Measures of Variation
A measure of variation describes how a set of data is spread out or scattered. Such
measures are also known as measures of dispersion or measures of spread. Variation in a
data set is the amount of difference between data values.
In a data set with little variation, almost all data values would be close to one another. The
histogram of such a data set would be narrow and tall. On the other hand, a data set with a
great deal of variation will have data values that are spread widely. The histogram of this
data set would be low and wide.
Compare the histograms for the two sets of quiz scores below.
Quiz Scores A: 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5
Quiz Scores B: 1, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10
[Figure: histograms of Quiz Scores A (left) and Quiz Scores B (right)]
The narrow and tall histogram on the left shows that Quiz Scores A have little variation.
The wide and low histogram on the right shows that Quiz Scores B have greater variation.
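The same comparison can be made numerically. The sketch below uses Python's standard statistics module and treats each set of quiz scores as a sample (divisor n − 1), which is an assumption; the helper name describe_spread is also just illustrative.

```python
import statistics

quiz_a = [3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]
quiz_b = [1, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10]


def describe_spread(scores):
    """Return the range, sample variance and sample standard deviation."""
    return {
        "range": max(scores) - min(scores),
        "variance": statistics.variance(scores),  # sample variance, divisor n - 1
        "std_dev": statistics.stdev(scores),      # square root of the sample variance
    }


spread_a = describe_spread(quiz_a)
spread_b = describe_spread(quiz_b)
print("A:", spread_a)
print("B:", spread_b)
```

All three measures come out larger for Quiz Scores B, which matches the wider, lower histogram.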
There are three measures of variation, namely the range, the variance and the standard
deviation.
Range
The range is the difference between the highest and the lowest values.
Variance
The variance measures how far the data points lie from the mean of the distribution; it is
found by averaging the squared deviations from the mean. Squaring the deviations, rather
than taking their absolute values, makes further algebraic manipulation of the data easier.
Standard Deviation
The standard deviation is the square root of the variance. It is useful because it allows the
same further calculations as the variance, yet expresses variation in the same units as the
original measurements.
Variance
Population variance: \( \sigma^2 = \dfrac{\sum (X - \mu)^2}{N} \)
Sample variance: \( s^2 \)
Standard Deviation
Population standard deviation: \( \sigma = \sqrt{\dfrac{\sum (X - \mu)^2}{N}} \)
Sample standard deviation: \( s \)
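Variance and standard deviation for ungrouped/raw data
Variance
\[ s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} \]
Standard deviation
\[ s = \sqrt{s^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \]
Where,
\( X \): data value, \( \bar{X} \): sample mean, \( n \): sample size
OR (short cut formula)
Variance
\[ s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1} \]
Standard deviation
\[ s = \sqrt{s^2} = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}} \]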
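The short cut formula follows from expanding the sum of squared deviations and substituting the sample mean \( \bar{X} = \sum X / n \). A brief derivation, included here only to show why the two forms agree:

\[
\begin{aligned}
\sum (X - \bar{X})^2 &= \sum X^2 - 2\bar{X} \sum X + n\bar{X}^2 \\
&= \sum X^2 - \frac{2(\sum X)^2}{n} + \frac{(\sum X)^2}{n} \\
&= \sum X^2 - \frac{(\sum X)^2}{n}
\end{aligned}
\]

Dividing both sides by \( n - 1 \) gives the short cut form of \( s^2 \).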
EXAMPLE 20
The normal daily temperatures (in degrees Fahrenheit) in January for 10 selected cities are
as follows. Find the variance and standard deviation.
50 37 29 54 30 61 47 38 34 61
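One way to check a hand calculation for this example is to evaluate both the definition and the short cut formula in Python. The function names below are illustrative assumptions, not part of the notes.

```python
import math

temps = [50, 37, 29, 54, 30, 61, 47, 38, 34, 61]


def sample_variance_definition(data):
    """s^2 = sum((X - mean)^2) / (n - 1)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


def sample_variance_shortcut(data):
    """s^2 = (sum(X^2) - (sum X)^2 / n) / (n - 1)."""
    n = len(data)
    return (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)


var_def = sample_variance_definition(temps)
var_short = sample_variance_shortcut(temps)
std_dev = math.sqrt(var_short)
print(round(var_def, 2), round(var_short, 2), round(std_dev, 2))
```

Here the sum of the values is 441 and the sum of their squares is 20777, so both functions give a variance of about 147.66 and a standard deviation of about 12.15.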
EXAMPLE 21
Twelve students were given an arithmetic test and the times (in minutes) to complete it
were as follows:
10 9 12 11 8 15 9 7 8 6 12 10
Find the variance and standard deviation.
Variance and standard deviation for ungrouped frequency distribution
Variance
\[ s^2 = \frac{\sum f x^2 - \frac{(\sum f x)^2}{\sum f}}{(\sum f) - 1} \]
Standard deviation
\[ s = \sqrt{\frac{\sum f x^2 - \frac{(\sum f x)^2}{\sum f}}{(\sum f) - 1}} \]
EXAMPLE 22
Calculate the variance and the standard deviation for the following data:
Years served, x    Frequency, f
4                  2
5                  5
6                  4
7                  4
8                  6
9                  7
Total              28
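The frequency-distribution formula can be applied to this table with a few lines of Python. This is a sketch for checking a hand calculation; the function name frequency_variance is an assumption.

```python
import math


def frequency_variance(values, frequencies):
    """s^2 = (sum(f*x^2) - (sum(f*x))^2 / sum(f)) / (sum(f) - 1)."""
    n = sum(frequencies)
    sum_fx = sum(f * x for x, f in zip(values, frequencies))
    sum_fx2 = sum(f * x * x for x, f in zip(values, frequencies))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


years_served = [4, 5, 6, 7, 8, 9]
frequency = [2, 5, 4, 4, 6, 7]

variance = frequency_variance(years_served, frequency)
std_dev = math.sqrt(variance)
print(round(variance, 2), round(std_dev, 2))
```

For this table the total frequency is 28, the sum of fx is 196 and the sum of fx² is 1448, giving a variance of 76/27 ≈ 2.81 and a standard deviation of about 1.68.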
Variance and standard deviation for grouped data
Variance
\[ s^2 = \frac{\sum f (x_m)^2 - \frac{(\sum f x_m)^2}{\sum f}}{(\sum f) - 1} \]
Standard deviation
\[ s = \sqrt{s^2} = \sqrt{\frac{\sum f (x_m)^2 - \frac{(\sum f x_m)^2}{\sum f}}{(\sum f) - 1}} \]
EXAMPLE 23
Calculate the variance and the standard deviation for the following data:
Ages       Frequency, f
10 – 19    2
20 – 29    6
30 – 39    14
40 – 49    17
50 – 59    8
60 – 69    3
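For grouped data, each class is represented by its midpoint before the formula is applied. The sketch below assumes the midpoint is the average of the lower and upper class limits (for example, 14.5 for the class 10 to 19); the function name grouped_variance is illustrative.

```python
import math


def grouped_variance(class_limits, frequencies):
    """Estimate s^2 from grouped data using class midpoints x_m."""
    midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
    n = sum(frequencies)
    sum_fx = sum(f * x for x, f in zip(midpoints, frequencies))
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, frequencies))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


age_classes = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]
frequency = [2, 6, 14, 17, 8, 3]

variance = grouped_variance(age_classes, frequency)
std_dev = math.sqrt(variance)
print(round(variance, 2), round(std_dev, 2))
```

With a total frequency of 50, this gives a variance of 6952/49 ≈ 141.88 and a standard deviation of about 11.91. Because the midpoints stand in for the actual ages, the result is an estimate.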
EXERCISE 3.4
1. In a class of 29 students, this distribution of quiz scores was recorded. Find the variance
and standard deviation.
Grade      Frequency
0 – 2      1
3 – 5      3
6 – 8      5
9 – 11     14
12 – 14    6
2. Eighty randomly selected light bulbs were tested to determine their lifetimes (in
hours). This frequency distribution was obtained. Find the variance and standard
deviation.
Class Boundaries    Frequency
52.5 – 63.5         6
63.5 – 74.5         12
74.5 – 85.5         25
85.5 – 96.5         18
96.5 – 107.5        14
107.5 – 118.5       5
3. These data represent the scores (in words per minute) of 25 typists on a speed test.
Find the variance and standard deviation.
Class limit    Frequency
54 – 58        2
59 – 63        5
64 – 68        8
69 – 73        0
74 – 78        4
79 – 83        5
84 – 88        1
4. Estimate the variance and the standard deviation for the data set whose frequency
distribution is given below:
Class          Frequency
3.45 – 3.47    2
3.48 – 3.50    6
3.51 – 3.53    12
3.54 – 3.56    14
3.57 – 3.59    10
3.60 – 3.62    5
3.63 – 3.65    1
5. The following sets of data were the scores of students in two groups taking SEF1134
in the second semester.
Group 1:
21  37  44  42  20  30  30  28  41  38  19  15
37  40  51  35  39  40  18  43  22  39  24  49
47  70  73  66  70  35  50  61  65  37  50  65
30  66  50  63  57  60  50  80  56  41  35  66
Group 2:
26  70  70  51  51  71
(a) Construct a frequency distribution using 5 classes for each data set, then
compare the standard deviations of the scores of the two groups.
(b) What can you conclude about the spread of the data in each group? Which
group has a bigger variation in the ability of the students?
REVIEW EXERCISE
1. Determine whether each of the following describes a population or a sample:
(a) The heights of 200 primary school students in Selangor.
(b) The number of cars sold by Perodua in the first quarter of the year.
(c) The time taken by students from groups 1 and 2 to complete an essay.
(d) The lifespan of pensioners in the country for the past 50 years.
(e) The household income of 100 residents in Putrajaya.
2. Discuss the difference between discrete variables and continuous variables.
3. Determine whether each of the following is a qualitative or quantitative variable. If it
is quantitative, identify whether it is discrete or continuous:
(a) Noon temperature (in degrees Celsius) in Kuala Lumpur for the past two
months.
(b) The responses to a survey that are either strongly agree, agree, disagree,
strongly disagree, no opinion.
(c) The number of durian trees planted in three orchards in Pahang.
(d) The weight of 100 cows measured in a feedlot farm in Gemas.
4. What type of sampling is being employed if a country is divided into economic
classes and a sample is chosen from each class to be surveyed?
5. The hours billed in a week by 30 lawyers in a prominent law firm were recorded as
follows:
52  68  75  72  32  56  38  42  34  43
62  34  35  65  54  50  60  40  58  63
44  41  47  70  75  49  55  39  78  38
(a) Construct the frequency distribution with 30 – 39 as the first class.
(b) Find the class boundaries.
(c) Estimate the mean and variance of the billed hours.
6. [Figure: histogram. Vertical axis ticks: 0 to 20 in steps of 2. Classes on the horizontal
axis: 6 – 10, 11 – 15, 16 – 20, 21 – 25, 26 – 30, 31 – 35, 36 – 40.]
Based on the histogram above, calculate the
(a) mean
(b) median
(c) standard deviation
7. Given a set of data 5, 2, 8, 14, 10, 5, 7, 10, m, n where the mean \( \bar{X} = 7 \) and the
mode is 5, find the possible values of m and n. (ans: m = 5, n = 4 or m = 4, n = 5)
8. Find the value that corresponds to the 30th percentile of the following data set:
78  82  86  88  92  97
(ans: P30 = 82)
9. The variance of a set of 8 data values \( x_1, x_2, x_3, \ldots, x_8 \) is 5.67. If \( \sum X^2 = 944.96 \),
find the mean of the data. (ans: 11.09)
10. Find Q3 for the given data set: 18, 22, 50, 15, 13, 6, 5, 12 (ans: 20)
11. The number of credits in business courses that eight applicants took is 9, 12, 15, 27,
33, p, 63, 72. Given the value that corresponds to the 75th percentile is 54, find p.
(ans: 45)
12. The mean of 5, 10, 26, 30, 45, 32, x, y is 25 where x and y are constants. If x = 16,
find the median. (ans: 28)
13. A physician is interested in studying scheduling procedures. She questions 40
patients concerning the length of time in minutes that they wait past their
scheduled appointment time. The following data are obtained:
60  10   8  45  29  18  27  33  34  38
27  25  35  25  30  37  31  35  42   3
30  36   9  50   6  31  47  53  17  23
31  28   6  12  27  16  50  52   6  19
(a) Construct a frequency distribution using 7 classes (use 3 as the lower limit of
the first class).
(b) Find the mean, mode and standard deviation. (ans: 28.15, 31.3, 14.63)
(c) Draw an ogive using relative frequency and estimate the median from the
graph.