Introduction To Statistics SEF1124

CHAPTER 3: INTRODUCTION TO STATISTICS

[Figure: The statistical process. Labels: POPULATION; Plan the Investigation (What? How? Who? Where?); Collect the Sample; SAMPLE; ANALYZING (Organize, Present, Describe); MAKING INFERENCES]

3.1 The Nature of Probability and Statistics

STATISTICS: the science of conducting studies to collect, organize, summarize, analyze and draw conclusions from data.
o used to analyze the results of surveys, and as a tool in scientific research to make decisions based on controlled experiments.
o also useful for operations, research, quality control, estimation and prediction.

POPULATION: consists of all subjects that are being studied.
SAMPLE: a group of subjects selected from a population.
VARIABLE: a characteristic of interest of each subject in the population or sample.
DATA: values (measurements or observations) that the variables can assume.

Variables whose values are determined by chance are called random variables.

EXAMPLE 1
A polling organization wants to know whether Malaysians favour national cars over foreign ones. What would be the population data set? What would be the sample data set?

Solution
The population data set would consist of the responses of every Malaysian. A common way of choosing a sample data set would be to call 1000 Malaysians at random and gather their responses to the question of whether they favour national cars over foreign ones.

EXAMPLE 2
Suppose we are interested in measuring the mid-semester examination results of students taking SHE1114 in CFS IIUM in the first semester of an academic year. What would be the population data set? What would be the sample data set?

Solution
The population data set would be the results of all students who sit for the mid-semester examination for SHE1114 in the first semester. The sample data set would be the results of, say, 100 students chosen at random from those who take the examination in that semester.
A statistical exercise normally consists of 4 stages:
1. Collection of data by counting or measuring.
2. Presentation of the data in a convenient form.
3. Analysis of the collected data.
4. Interpretation of the analysis results and making conclusions.

Two types of statistics:

DESCRIPTIVE STATISTICS
o Consists of the collection, organization, summarization and presentation of data.
o Describes a situation.
o Data are presented in the form of charts, graphs or tables.
o Makes use of graphical techniques and numerical descriptive measures, such as the average, to summarize and present the data.
o Eg: The national census conducted by the Malaysian government every 5 or 10 years. The results of this census give information about the average age, income and other characteristics of the Malaysian population.

INFERENTIAL STATISTICS
o Consists of generalizing from samples to populations, performing hypothesis tests, determining relationships among variables and making predictions.
o Inferences are made from samples to populations.
o Uses probability, that is, the chance of an event occurring.
o The area of inferential statistics called hypothesis testing is a decision-making process for evaluating claims about a population, based on information obtained from samples.
o Eg: A researcher may want to know whether a new skin lotion containing aloe vera will reduce skin problems in children. For this study, two groups of young children would be selected. One group would be given the lotion containing aloe vera and the other would be given a normal lotion without aloe vera. The results are then observed by experts to assess the effectiveness of the new product.

EXAMPLE 3
A study conducted at Manatee Community College revealed that students who attended class 95% to 100% of the time usually received an A in the class. Students who attended class 80% to 90% of the time usually received a B or C in the class.
Students who attended class less than 80% of the time usually received a D or an F or eventually withdrew from the class.
Based on this:
(a) What are the variables under study?
(b) What are the data in the study?
(c) Which type of statistics was used?
(d) What is the population under study?
(e) Was a sample collected?
(f) From the information given, comment on the relationship between the variables.

Solution
(a) Grades, attendance
(b) Specific grades, attendance records
(c) Descriptive
(d) Students at Manatee Community College
(e) Most probably
(f) The better the attendance, the higher the grade

Variables and Types of Data

[Figure: Types of data and levels of measurement. Labels: TYPES OF DATA: QUALITATIVE, QUANTITATIVE (DISCRETE, CONTINUOUS); LEVEL OF MEASUREMENT: NOMINAL, ORDINAL, INTERVAL, RATIO]

QUALITATIVE VARIABLES: variables that can be placed into distinct categories, according to some characteristic or attribute.
o non-numeric
o eg: gender, colour, religion, workplace etc.

QUANTITATIVE VARIABLES: variables that are numerical and can be ordered or ranked.
Quantitative variables can be further classified into two groups, namely discrete and continuous.

DISCRETE VARIABLES: assume values that can be counted, or for which there is a fixed set of values.
o Eg: the number of children in a family, shoe size etc.

CONTINUOUS VARIABLES: can assume an infinite number of values between any two specific values, obtained by measuring.
o Eg: height, weight, temperature etc.

EXAMPLE 4
Classify each variable as qualitative or quantitative. If the variable is quantitative, further classify it as discrete or continuous.
(a) Number of times students in a hostel wash their clothes in a week
(b) State of origin of members in a club in CFS IIUM
(c) Weights of newborn babies in a hospital
(d) Hijab colour of students in group 3 of SHE1114

Solution
(a) Quantitative, discrete
(b) Qualitative
(c) Quantitative, continuous
(d) Qualitative

Levels of measurement

NOMINAL LEVEL OF MEASUREMENT
o is the lowest of the four ways to characterize data. Nominal means "in name only", which should help you remember what this level is about.
o deals with names, categories, or labels.
o data are qualitative.
o eg: eye colour, gender, yes or no responses to a survey, favorite breakfast cereal etc.
o data can't be ordered in a meaningful way, and it makes no sense to calculate things such as means and standard deviations.

ORDINAL LEVEL OF MEASUREMENT
o the next level after nominal.
o data at this level can be ordered, but the differences between the ranks are not meaningful.
o eg: a list of the top ten cities to live in (the cities are ranked from one to ten, but differences between the cities don't make much sense), letter grades (an A is higher than a B, but without any other information there is no way of knowing how much better an A is than a B), a man's build (small, medium, large) etc.
o as with the nominal level, data at the ordinal level should not be used in calculations.

INTERVAL LEVEL OF MEASUREMENT
o has all the characteristics of the nominal and ordinal scales, and in addition is based upon predetermined equal intervals.
o deals with data that can be ordered, and in which differences between the data do make sense.
o data at this level do not have a true zero (a natural starting point).
o The Fahrenheit and Celsius scales of temperature are both examples of data at the interval level of measurement. You can talk about 30 degrees being 60 degrees less than 90 degrees, so differences do make sense.
However, 0 degrees (in both scales), cold as it may be, does not represent the total absence of temperature.
o data at the interval level can be used in calculations.

RATIO LEVEL OF MEASUREMENT
o the fourth and highest level of measurement is the ratio level.
o data at the ratio level possess all of the features of the interval level, in addition to a true zero value. Due to the presence of a zero, it now makes sense to compare the ratios of measurements. Phrases such as "four times" and "twice" are meaningful at the ratio level.
o eg: distances, in any system of measurement, give us data at the ratio level. A measurement such as 0 feet does make sense, as it represents no length. Furthermore, 2 feet is twice as long as 1 foot. So ratios can be formed between the data.
o sums and differences can be calculated, as well as ratios. One measurement can be divided by any nonzero measurement, and a meaningful number will result.

EXAMPLE 5
Identify the following as nominal level, ordinal level, interval level, or ratio level data.
(a) Percentage scores on a Math exam.
(b) Letter grades on an English essay.
(c) Flavors of yogurt.
(d) Instructors classified as: Easy, Difficult or Impossible.
(e) Employee evaluations classified as: Excellent, Average, Poor.
(f) Religions.
(g) Political parties.
(h) Commuting times to school.
(i) Years (AD) of important historical events.
(j) Ages (in years) of statistics students.

Data collection and Sampling Techniques

Sampling:
o the process of selecting a number of individuals for a study in such a way that the individuals represent the larger group from which they were selected.
o the purpose is to use a sample to gather information about a population.
o a random sample is a sample selected in such a way that every subject in the population has an equal chance of being selected.

4 basic methods of random sampling:
o Random Sampling: subjects are selected by random numbers.
o Systematic Sampling: subjects are selected by using every kth number after the first subject is randomly selected from 1 through k.
o Stratified Sampling: subjects are selected by dividing up the population into groups (strata), and subjects within each group are randomly selected. Eg: We divide the population into 5 groups, then we take subjects from each group to form our sample.
o Cluster Sampling: subjects are selected by using an intact group that is representative of the population. Eg: We divide the population into 5 groups, then we take 2 whole groups to form our sample. That means 2 groups of subjects represent all 5 groups of subjects.

EXERCISE 3.1

1. A firm wanted to keep a database on the heights, gender, marital status, blood types and highest qualifications of new employees. For that purpose, a study was conducted.
(a) What would be the population? What would be the sample?
(b) Identify the variables.
(c) Which variables are qualitative, and which are quantitative? If the variable is quantitative, is it discrete or continuous?

2. State whether each of the following variables is discrete or continuous.
(a) Number of calls received by the 999 operators every day for a month.
(b) Life expectancy of 500 government pensioners chosen at random.
(c) Cost of prepaid top-ups among students in a college for a month.
(d) Temperature of coffee served in a bistro for breakfast over a week.
(e) Amount of nasi lemak kampung sold in the month of May at a roadside stall.

3. Classify each set of data as discrete or continuous.
(a) The number of suitcases lost by an airline.
(b) The height of corn plants.
(c) The number of ears of corn produced.
(d) The number of green M&M's in a bag.
(e) The time it takes for a car battery to die.
(f) The production of tomatoes by weight.

4. Identify the following as nominal level, ordinal level, interval level, or ratio level data.
(a) Lecturers classified according to subjects taught
(b) IQ scores of 100 children below the age of 15 diagnosed with Down's Syndrome
(c) Number of phone calls received during a water disruption in Klang
(d) Heights of saplings in a green house after three weeks
(e) Order of F1 drivers completing a race
(f) Marital status of applicants to a job vacancy
(g) Salaries of fresh graduates in the country

3.2 Frequency Distributions and Graphs

A frequency distribution is the organization of raw data in table form, using classes and frequencies. There are three types of frequency distribution.

1. Categorical frequency distribution
Used when data can be placed in specific categories.

EXAMPLE 6
The following data represent the colour of men's shirts purchased in the men's department of a large department store. Construct a frequency distribution for the data. (W = White, BL = Blue, BR = Brown, Y = Yellow, G = Grey)

W   W   BL  Y   W   W   W   G   BL  BL
BR  BL  W   G   Y   Y   BR  BL  BR  W
BL  BL  W   G   W   BL  BR  W   BR  BL
W   BL  BL  W   W   W   BL  W   W   BR
Y   BR  BL  BR  G   G   Y   BR  Y   G

(A complete categorical distribution must have class, frequency and percentage columns in the table.)

Shirt Colour   Frequency   Percentage
White
Blue
Brown
Yellow
Grey

2. Grouped frequency distribution
When the range of the data is large, the data must be grouped into classes.
Grouped data are a collection of data in a more condensed form, where the data set is divided into groups of suitable size. These groups are known as data classes. The number of values from the set that fall in each class is the frequency of that class.

To construct a frequency distribution for grouped data, we must first determine the classes. We can use the following guidelines when forming the classes:
1. There should be between 5 and 20 classes.
2. The class width should be an odd number. This will guarantee that the class midpoints are integers instead of decimals.
3. The classes must be mutually exclusive.
   This means that no data value can fall into two different classes.
4. The classes must be all inclusive or exhaustive. This means that all data values must be included.
5. The classes must be continuous. There are no gaps in a frequency distribution. Classes that have no values in them must be included (unless it is the first or last class, which is dropped).
6. The classes must be equal in width. The exception here is the first or last class.

Next, we use the following guidelines to create a grouped frequency distribution:
1. Find the largest and smallest values.
2. Compute the range: Range = Maximum - Minimum.
3. Select the number of classes desired. This is usually between 5 and 20.
4. Find the class width by dividing the range by the number of classes and rounding up:
   $\text{class width} = \dfrac{\text{range}}{\text{number of classes}}$
5. Pick a suitable starting point less than or equal to the minimum value.
6. To find the upper limit of the first class, subtract one from the lower limit of the second class.
7. Find the boundaries by subtracting 0.5 units from the lower limits and adding 0.5 units to the upper limits. The boundaries are also half-way between the upper limit of one class and the lower limit of the next class.
8. Tally the data.
9. Find the frequencies.
10. Find the cumulative frequencies.
11. If necessary, find the relative frequencies and/or relative cumulative frequencies.

The following table lists the important terminology we use when describing data in a frequency distribution.

Terminology: Description
Class-Interval: Each class is bounded by two figures, which are called class limits. The figure on the left side of a class is called its lower limit and that on its right is called its upper limit.
Lower class limit: The least value that can belong to a class.
Upper class limit: The greatest value that can belong to a class.
Class width: The difference between the upper (or lower) class limits of consecutive classes. All classes should have the same class width.
   Class width = Upper class boundary - Lower class boundary
               = Lower class limit of next class - Lower class limit of this class
               = Upper class limit of next class - Upper class limit of this class
Class boundaries: The average of the upper limit of one class and the lower limit of the next class.
Class midpoint: The middle value of each data class. To find the class midpoint, average the upper and lower class limits, or the upper and lower class boundaries.
   $\text{class midpoint} = \dfrac{\text{lower} + \text{upper}}{2}$

EXAMPLE 7
The following data show the ages of patients diagnosed with metastatic carcinoma of the bone at an oncology ward over a period of two years.

45  31  46  25  57  39  42  55  20  37
40  59  11  38  34  22  62  33  48  43
57  37  43  51  29  41  35  66  45  32
44  47  42  46  54  65  17  35  53  27
38  22  33  39  45  32  43  41  57  45

Construct a frequency distribution for the data using 6 equal classes, showing the class boundaries and midpoints.

Solution
Range = 66 - 11 = 55
Class width = Range / number of classes = 55/6 ≈ 9.17, rounded up to 10

Ages      Boundaries     Midpoints   Frequency
10 - 19    9.5 - 19.5    14.5         2
20 - 29   19.5 - 29.5    24.5         6
30 - 39   29.5 - 39.5    34.5        14
40 - 49   39.5 - 49.5    44.5        17
50 - 59   49.5 - 59.5    54.5         8
60 - 69   59.5 - 69.5    64.5         3
Total                                50

Cumulative frequency: the number of data elements in any given class and all previous classes.
Relative frequency: the ratio of the frequency of any given class to the sum of frequencies.
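The construction steps can be checked with a short script. The following is a minimal Python sketch, not part of the original notes, that rebuilds the Example 7 table and adds the cumulative and relative frequencies. The variable names are illustrative, the starting point 10 is taken from the example rather than computed, and the class limits assume whole-number data.

```python
import math

# Ages from Example 7.
ages = [
    45, 31, 46, 25, 57, 39, 42, 55, 20, 37,
    40, 59, 11, 38, 34, 22, 62, 33, 48, 43,
    57, 37, 43, 51, 29, 41, 35, 66, 45, 32,
    44, 47, 42, 46, 54, 65, 17, 35, 53, 27,
    38, 22, 33, 39, 45, 32, 43, 41, 57, 45,
]

num_classes = 6
data_range = max(ages) - min(ages)            # 66 - 11 = 55
width = math.ceil(data_range / num_classes)   # 55 / 6 is about 9.17, rounded up to 10
start = 10                                    # starting point used in Example 7 (chosen by hand)

rows = []
cumulative = 0
for i in range(num_classes):
    lower = start + i * width                 # lower class limit
    upper = lower + width - 1                 # upper class limit (whole-number data assumed)
    freq = sum(1 for a in ages if lower <= a <= upper)
    cumulative += freq
    rows.append({
        "class": f"{lower} - {upper}",
        "boundaries": (lower - 0.5, upper + 0.5),
        "midpoint": (lower + upper) / 2,
        "freq": freq,
        "cum_freq": cumulative,
        "rel_freq": freq / len(ages),
    })

for r in rows:
    print(r["class"], r["boundaries"], r["midpoint"],
          r["freq"], r["cum_freq"], round(r["rel_freq"], 2))
```

The frequencies printed by the loop should agree with the table in the solution; the last cumulative frequency equals the number of data values.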
EXAMPLE 8
The lengths in mm of a batch of 40 spindles manufactured on a day were measured with the following results:

20.90  20.57  20.86  20.74  20.82  20.63  20.53  20.89  20.75  20.65
20.71  21.03  20.72  20.41  20.49  20.75  20.79  20.65  21.08  20.89
20.50  20.88  20.97  20.78  20.61  20.92  21.07  21.16  20.80  20.77
20.82  20.72  20.60  20.90  20.86  20.68  20.75  20.88  20.56  20.94

Construct a frequency distribution for the data using 8 equal classes, showing the class boundaries, midpoints, cumulative frequencies and relative frequencies.

Solution
Range = 21.16 - 20.41 = 0.75
Class width = Range / number of classes = 0.75/8 ≈ 0.09, rounded up to 0.1

Lengths         Boundaries         Midpoints   Frequency   Cum. freq.   Rel. freq.
20.40 - 20.49   20.395 - 20.495    20.445       1           1           0.025
20.50 - 20.59   20.495 - 20.595    20.545       4           5           0.10
20.60 - 20.69   20.595 - 20.695    20.645       6          11           0.15
20.70 - 20.79   20.695 - 20.795    20.745      10          21           0.25
20.80 - 20.89   20.795 - 20.895    20.845       9          30           0.225
20.90 - 20.99   20.895 - 20.995    20.945       6          36           0.15
21.00 - 21.09   20.995 - 21.095    21.045       3          39           0.075
21.10 - 21.19   21.095 - 21.195    21.145       1          40           0.025

3. Ungrouped frequency distribution
Used when the range of the data is small.
An ungrouped frequency distribution is used for data which have been obtained in their original form, also called raw data. When a set of ungrouped data is arranged in ascending or descending order, the set is called an array.

To construct the frequency distribution for ungrouped data, we take each observation from the data, one at a time, and indicate the frequency (the number of times the observation has occurred in the data) by small lines, called tally marks. For convenience, we write tally marks in bunches of five, the fifth one crossing the first four diagonally. We may choose to omit the tally marks from the frequency distribution. In the table so formed, the sum of all the frequencies is equal to the total number of observations in the given data.
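The tallying procedure can be sketched in a few lines of Python. This is an illustrative sketch rather than part of the original notes: it uses the standard library's collections.Counter on the years-in-service values that appear in Example 10, and prints plain strokes in bunches of five instead of crossed tally marks.

```python
from collections import Counter

# Years in service of the 28 faculty members in Example 10.
years = [7, 8, 5, 4, 9, 8, 5, 7, 6, 8, 9, 6, 7, 9,
         8, 7, 9, 9, 6, 5, 8, 9, 4, 5, 5, 8, 9, 6]

counts = Counter(years)          # maps each observed value to its frequency
table = sorted(counts.items())   # (value, frequency) pairs in ascending order
total = sum(counts.values())     # must equal the number of observations

for value, freq in table:
    # Plain strokes grouped in fives; a simplification of crossed tally marks.
    bunches = ["|||||"] * (freq // 5) + ["|" * (freq % 5)]
    print(value, " ".join(bunches).strip(), freq)
print("Total", total)
```

Sorting the counted values gives the array order described above, and the final total is a quick check that no observation was missed.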
EXAMPLE 9
The marks obtained by 25 students in a class in a certain examination are given below:

25, 8, 37, 16, 45, 40, 29, 12, 42, 25, 14, 16, 16, 20, 10, 36, 33, 24, 25, 35, 11, 30, 45, 48

If these marks are arranged in ascending order, we get the following array:

8, 10, 11, 12, 14, 16, 16, 16, 20, 24, 25, 25, 25, 29, 30, 33, 35, 36, 37, 40, 40, 42, 45, 45, 48

EXAMPLE 10
The number of years in service of faculty members at the Mathematics Department is given below:

7  8  5  4  9  8  5  7  6  8  9  6  7  9
8  7  9  9  6  5  8  9  4  5  5  8  9  6

From this data, we may construct a frequency distribution table, as given below:

Years served   Frequency
4               2
5               5
6               4
7               4
8               6
9               7
Total          28

Graphical Representation of Data

Histogram
A histogram is a graphical representation of the information in a frequency table using a bar graph. The histogram should have the variable being measured in the data set as its horizontal axis, and the class frequency as the vertical axis. Each data class is represented by a vertical bar whose height is the frequency of the class and whose width is the class width.
i) x-axis: class boundary; y-axis: frequency
ii) x-axis: class boundary; y-axis: relative frequency

EXAMPLE 11
Construct a histogram for the data in Example 7.

Solution

Ages      Boundaries     Midpoints   Frequency
10 - 19    9.5 - 19.5    14.5         2
20 - 29   19.5 - 29.5    24.5         6
30 - 39   29.5 - 39.5    34.5        14
40 - 49   39.5 - 49.5    44.5        17
50 - 59   49.5 - 59.5    54.5         8
60 - 69   59.5 - 69.5    64.5         3
Total                                50

[Figure: histogram of the ages in Example 7. x-axis: ages, with the class midpoints 14.5, 24.5, 34.5, 44.5, 54.5, 64.5 marked; y-axis: frequency; bar heights 2, 6, 14, 17, 8, 3]

Frequency Polygon
A frequency polygon is a line graph representation of the information in a frequency table. Like a histogram, the vertical axis represents frequency and the horizontal axis represents the variable being measured in the data set.
To construct the graph, a point is plotted for each class at its midpoint and with height given by the frequency of the class. The points are then connected by straight lines.
i) x-axis: class midpoint; y-axis: frequency
ii) x-axis: class midpoint; y-axis: relative frequency

EXAMPLE 12
Construct a frequency polygon for the same data in Example 7.

Solution

Ages      Boundaries     Midpoints   Frequency
10 - 19    9.5 - 19.5    14.5         2
20 - 29   19.5 - 29.5    24.5         6
30 - 39   29.5 - 39.5    34.5        14
40 - 49   39.5 - 49.5    44.5        17
50 - 59   49.5 - 59.5    54.5         8
60 - 69   59.5 - 69.5    64.5         3
Total                                50

[Figure: frequency polygon for Example 7. x-axis: class midpoints 14.5, 24.5, 34.5, 44.5, 54.5, 64.5; y-axis: frequency, 0 to 18]

Ogive
An ogive is a line graph representing the cumulative frequencies for the classes. The vertical axis represents cumulative frequency and the horizontal axis represents the variable being measured in the data set.
To construct the graph, a point is plotted for each class at its upper class boundary and with height given by the cumulative frequency of the class. The points are then connected by straight lines.
i) x-axis: class boundary; y-axis: cumulative frequency
ii) x-axis: class boundary; y-axis: cumulative relative frequency

EXAMPLE 13
Construct an ogive for the same data in Example 7.

Solution

Ages      Boundaries     Midpoints   Frequency   Cumulative Frequency
10 - 19    9.5 - 19.5    14.5         2           2
20 - 29   19.5 - 29.5    24.5         6           8
30 - 39   29.5 - 39.5    34.5        14          22
40 - 49   39.5 - 49.5    44.5        17          39
50 - 59   49.5 - 59.5    54.5         8          47
60 - 69   59.5 - 69.5    64.5         3          50
Total                                50

[Figure: ogive for Example 7. x-axis labelled 14.5, 24.5, 34.5, 44.5, 54.5, 64.5; y-axis: cumulative frequency, 0 to 60]

EXERCISE 3.2

1. In a class of 35 students, the following grade distribution was found. Construct a histogram, frequency polygon and ogive for the data. (A=4, B=3, C=2, D=1, F=0)

Grade       0   1   2   3    4
Frequency   3   6   9   12   5

2. Using the histogram shown below.
Construct
i) A frequency distribution
ii) A frequency polygon
iii) An ogive

[Figure: histogram. x-axis marked 21.5, 24.5, 27.5, 30.5, 33.5, 36.5, 39.5, 42.5; y-axis: frequency, 0 to 7]

3. The following frequency distribution was obtained from the duration (in minutes) taken by 70 candidates in a writing test.

Duration      Frequency
21.2 - 21.4    3
21.5 - 21.7    7
21.8 - 22.0   12
22.1 - 22.3   16
22.4 - 22.6   19
22.7 - 22.9   13

Find:
(a) The lower and upper boundary of the third class.
(b) The midpoint of the fifth class.
(c) The cumulative frequency of the fourth class.
(d) The relative frequency of the second class.
(e) The class width.

4.
x           15   16   17   18   19   20
Frequency    1    4    9   10    6    2

Based on the frequency distribution above, construct:
(a) a relative frequency histogram.
(b) an ogive.

5. The number of cars passing through a guard post was recorded on 40 occasions, as shown below.

66  87  79  74  84  72  81  78  68  74
80  71  91  62  77  86  87  72  80  77
76  83  75  71  83  67  94  64  82  78
77  67  76  82  78  88  66  79  74  64

(a) Construct a frequency distribution using 7 equal classes from 60 to 94, showing the class boundaries and midpoints.
(b) Draw a frequency polygon for the data.

6. The number of calories per serving for selected ready-to-eat cereals is listed here. Construct a histogram, frequency polygon and ogive for the data using relative frequency.

130  210  190  190  115  190  130  210  240  210
140  100  120   80  110   80   90  200  120  225
100  210  130   90  190  120  120  180  190  130
220  200  260  200  220  120  270  210  110  180
100  190  100  120  160  180

7. Below is a data set for the duration (in minutes) of a random sample of 24 long-distance phone calls:

 1  20  10  20  12  23   3   7  18  12   4   5
15   7  29  10  18  10  10  23   4  12   8   6

(a) Construct a frequency distribution table for the data using the classes "1 to 5", "6 to 10" etc.
(b) Construct a cumulative frequency distribution table and use it to draw an ogive.

8. The following table refers to the 2003 average income (in thousand Ringgit) per year for 20 employees of company A.

Income ('000 Ringgit)   Frequency
 5 - 9                   6
10 - 14                  3
15 - 19                  2
20 - 24                  4
25 - 29                  3
30 - 34                  2

(a) Draw a histogram and a frequency polygon for the above data.
(b) Construct the cumulative frequency table. Hence, draw an ogive for the above data.

3.3 Measures of Central Tendency

The following symbols and variables will have the meanings given below (unless otherwise specified).

Variables
x = data value
n = number of values in a sample data set
N = number of values in a population data set
f = frequency of a data class
m = midpoint of a data class

Symbol
∑ indicates the sum of all values for the following variable or expression.

Example: Using our notation, we can write the statement that the sum of the frequencies in a frequency table should equal the number of values in the data set as follows:
$\sum f = n$

A measure of central tendency is a value used to represent the "average" value in a data set. The three most commonly used measures of central tendency are:

Mean: the sum of all data values divided by the number of values in the data set. The mean is the most commonly used measure of central tendency. The mean of a sample data set is denoted by $\bar{x}$ and the mean of a population data set by the Greek letter $\mu$.
Median: the value which separates the largest 50% of data values from the lowest 50%.
Mode: the data value (or values) which appears the largest number of times in the set. If no data value is repeated, we say that there is no mode.

Mean, median and mode for ungrouped data

Population mean
$\mu = \dfrac{\sum x}{\sum f} = \dfrac{\sum x}{N}$

Sample mean
$\bar{x} = \dfrac{\sum x}{\sum f} = \dfrac{\sum x}{n}$

Median
o arrange the data in ascending order.
o if n is odd, the middle value is the median.
o if n is even, the mean of the two middle values is the median.
Mode
o the value that occurs most frequently in a data set.

EXAMPLE 14
Suppose earnings from selling burgers by the roadside for the past week were as follows:

Day              Monday   Tuesday   Wednesday   Thursday   Friday
Earnings in RM   350      150       100         350        50

Calculate the mean, median and mode of the daily earnings.

Solution
Mean: $\bar{x} = \dfrac{\sum x}{n} = \dfrac{350 + 150 + 100 + 350 + 50}{5} = \dfrac{1000}{5}$ = RM200
Median:
Mode:

EXAMPLE 15
Calculate the mean, median and mode of the quiz scores from the data below:
1, 5, 7, 7, 6, 8, 10, 9, 5, 10, 8

Solution:
Placing the data in ascending order,
1, 5, 5, 6, 7, 7, 8, 8, 9, 10, 10
Since the number of data values is odd, the median is the middle value, which is 7.

Mean, median and mode for ungrouped frequency distribution

Mean
$\bar{x} = \dfrac{\sum fx}{\sum f}$

Median
o find the cumulative frequency
o location of the median: $\dfrac{\sum f}{2}$

Mode
o the value with the highest frequency

EXAMPLE 16
The masses in kg of 50 groupers ordered by a sushi restaurant have been measured as follows.

Mass   4.2   4.3   4.4   4.5   4.6   4.7   4.8   4.9
f      1     3     7     10    12    10    5     2

Calculate the mean, median and mode mass of the fish.

Solution:

Mass    4.2   4.3   4.4   4.5   4.6   4.7   4.8   4.9
f       1     3     7     10    12    10    5     2
cum f   1     4     11    21    33    43    48    50

Since the number of data values is even, the median lies between the two middle values, that is, between the 25th and the 26th values. From the cumulative frequencies, we can see that both of these values are 4.6 kg, so the median is 4.6 kg.

EXAMPLE 17
This ungrouped frequency distribution of the number of cups of coffee consumed with each meal was obtained from a survey conducted in a restaurant. Find the mean, median and mode.
Number of cups   0   1   2    3   4   5
Frequency        5   8   10   2   3   2

Mean, mode and median for grouped data

Mean
Population mean
$\mu = \dfrac{\sum fm}{\sum f} = \dfrac{\sum fm}{N}$
Sample mean
$\bar{x} = \dfrac{\sum fm}{\sum f} = \dfrac{\sum fm}{n}$

Median
o Find the cumulative frequency
o Find the median class (location of the median)
o The median is:
$L_m + \left[\dfrac{\frac{n}{2} - \sum f_{m-1}}{f_m}\right] w$

$L_m$ : lower boundary of the median class
$\sum f_{m-1}$ : cumulative frequency before the median class
$f_m$ : frequency of the median class
$n$ : number of data values
$w$ : class width

Mode
o find the modal class (class with the highest frequency)
o The mode is:
$L_{mo} + \left[\dfrac{\Delta_1}{\Delta_1 + \Delta_2}\right] w$

$L_{mo}$ : lower boundary of the modal class
$\Delta_1$ : difference between the frequency of the modal class and the frequency of the class before
$\Delta_2$ : difference between the frequency of the modal class and the frequency of the class after
$w$ : the class width

EXAMPLE 18
The following distribution shows the prices of items sold at a car boot sale.

Prices in RM   Frequency   Midpoints   fm
 1 - 5         8            3           24
 6 - 10        6            8           48
11 - 15        4           13           52
16 - 20        2           18           36
21 - 25        4           23           92
26 - 30        6           28          168
31 - 35        2           33           66
               n = 32                  $\sum fm$ = 486

Calculate the mean, median and mode price of the sold items.

Solution
$\bar{x} = \dfrac{\sum fm}{n} = \dfrac{486}{32}$ = RM 15.19

EXAMPLE 19
Calculate the mean, median and mode of the lengths of the 40 spindles in Example 8.

Solution

Lengths         Boundaries         Frequency   Cum. freq.
20.40 - 20.49   20.395 - 20.495     1           1
20.50 - 20.59   20.495 - 20.595     4           5
20.60 - 20.69   20.595 - 20.695     6          11
20.70 - 20.79   20.695 - 20.795    10          21
20.80 - 20.89   20.795 - 20.895     9          30
20.90 - 20.99   20.895 - 20.995     6          36
21.00 - 21.09   20.995 - 21.095     3          39
21.10 - 21.19   21.095 - 21.195     1          40

The median class is the class containing the n/2 = 20th data value, that is, the fourth class. The class width is 0.1.
Using the formula, the median is
$L_m + \left[\dfrac{\frac{n}{2} - \sum f_{m-1}}{f_m}\right] w = 20.695 + \left[\dfrac{\frac{40}{2} - 11}{10}\right](0.1) = 20.695 + 0.09 = 20.785$

EXERCISE 3.3

1. The following frequency distribution shows the numbers of books read by each of the 28 students in a literature class.
Number of books   0 - 2   3 - 5   6 - 8   9 - 11   12 - 14
Frequency         2       6       12      5        3

(a) Find the mean, median and mode.
(b) Find the percentage of students who read: (i) less than six books (ii) more than nine books.

2. Eighty randomly selected light bulbs were tested to determine their lifetimes (in hours). This frequency distribution was obtained. Find the mean, median and mode.

Class Boundaries   Frequency
 52.5 - 63.5        6
 63.5 - 74.5       12
 74.5 - 85.5       25
 85.5 - 96.5       18
 96.5 - 107.5      14
107.5 - 118.5       5

3. After a month, the heights of 120 saplings were measured as follows:

Height (cm)   29.4   29.5   29.6   29.7   29.8   29.9
Frequency     6      25     34     32     18     5

(a) Find the mean, mode and median height.
(b) Draw a relative frequency histogram for the data.

4. The following data set represents the life expectancy of ten government pensioners.

96  68  78  82  74  $x$  70  86  84  87

If the median is 81, find the value of $x$. Hence, find the mean.

5. The following scores were obtained by a batch of students on a Calculus test:

Scores   56-60   61-65   66-70   71-75   76-80   81-85   86-90   91-95   96-100
f        2       2       3       5       6       7       8       4       3

(a) Calculate the mean and the mode of the scores.
(b) Give a brief interpretation of the mean and the mode with reference to the students' performance in the test.

6. The distances in km traveled on a given day by 40 sales representatives of a direct selling company were recorded as follows:

210  181  192  164  170  186  205  194  178  161
175  195  172  188  196  182  206  188  165  202
178  163  190  198  187  198  174  172  183  208
185  162  203  172  196  184  185  176  197  184

(a) Construct a frequency distribution using 5 equal classes
(b) Draw an ogive
(c) Calculate the mean, mode and median
(d) Give a brief interpretation of the measures in part (c)

3.4 Measures of Variation

A measure of variation is a measure that describes how a set of data is spread out or scattered. Such measures are also known as measures of dispersion or measures of spread.
Variation in a data set is the amount of difference between data values. In a data set with little variation, almost all data values are close to one another, and the histogram of such a data set is narrow and tall. A data set with a great deal of variation has data values that are spread widely, and its histogram is low and wide. Compare the histograms for the two sets of quiz scores below.

Quiz Scores A: 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5
Quiz Scores B: 1, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10

[Figure: histogram of Quiz Scores A (left) and histogram of Quiz Scores B (right)]

The narrow and tall histogram on the left shows that Quiz Scores A have little variation. The wide and low histogram on the right shows that Quiz Scores B have greater variation.

There are three measures of variation, namely the range, the variance and the standard deviation.

Range
The range is the difference between the highest and the lowest values.

Variance
Variance describes how far the data points lie from the mean of the distribution; it is determined by averaging the squared deviations from the mean. Squaring the differences, instead of taking their absolute values, makes further algebraic manipulation of the data easier.

Standard Deviation
Standard deviation is the square root of the variance. It offers the same flexibility as the variance in further calculations, yet expresses variation in the same units as the original measurements.
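As a rough illustration of the three measures described above, the following Python sketch computes the range, variance and standard deviation of Quiz Scores A and B. It assumes the sample form of the variance (dividing by n − 1); the function names are chosen here for illustration only and are not part of the notes.

```python
from math import sqrt


def data_range(values):
    """Range: highest value minus lowest value."""
    return max(values) - min(values)


def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Sample standard deviation: square root of the sample variance."""
    return sqrt(sample_variance(values))


quiz_a = [3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]
quiz_b = [1, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10]

for name, scores in (("A", quiz_a), ("B", quiz_b)):
    print(
        "Quiz Scores", name,
        "range:", data_range(scores),
        "variance:", round(sample_variance(scores), 2),
        "standard deviation:", round(sample_std(scores), 2),
    )
```

All three measures come out larger for Quiz Scores B than for Quiz Scores A, which agrees with the wider histogram.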
Variance
Population variance: $\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$
Sample variance: $s^2$

Standard Deviation
Population standard deviation: $\sigma = \sqrt{\dfrac{\sum (X - \mu)^2}{N}}$
Sample standard deviation: $s$

Variance and standard deviation for ungrouped/raw data

Variance

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$

Standard deviation

$$s = \sqrt{s^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$

where $X$: data value, $\bar{X}$: sample mean, $n$: sample size

OR (shortcut formula)

Variance

$$s^2 = \frac{\sum X^2 - \dfrac{(\sum X)^2}{n}}{n - 1}$$

Standard deviation

$$s = \sqrt{s^2} = \sqrt{\frac{\sum X^2 - \dfrac{(\sum X)^2}{n}}{n - 1}}$$

EXAMPLE 20
The normal daily temperatures (in degrees Fahrenheit) in January for 10 selected cities are as follows. Find the variance and standard deviation.

50 37 29 54 30 61 47 38 34 61

EXAMPLE 21
Twelve students were given an arithmetic test and the times (in minutes) to complete it were as follows:

10 9 12 11 8 15 9 7 8 6 12 10

Find the variance and standard deviation.

Variance and standard deviation for ungrouped frequency distribution

Variance

$$s^2 = \frac{\sum f x^2 - \dfrac{(\sum f x)^2}{\sum f}}{(\sum f) - 1}$$

Standard deviation

$$s = \sqrt{\frac{\sum f x^2 - \dfrac{(\sum f x)^2}{\sum f}}{(\sum f) - 1}}$$

EXAMPLE 22
Calculate the variance and the standard deviation for the following data:

Years served, x   4   5   6   7   8   9   Total
Frequency, f      2   5   4   4   6   7   28

Variance and standard deviation for grouped data

Variance

$$s^2 = \frac{\sum f (x_m)^2 - \dfrac{(\sum f x_m)^2}{\sum f}}{(\sum f) - 1}$$

Standard deviation

$$s = \sqrt{s^2} = \sqrt{\frac{\sum f (x_m)^2 - \dfrac{(\sum f x_m)^2}{\sum f}}{(\sum f) - 1}}$$

where $x_m$ is the class midpoint.

EXAMPLE 23
Calculate the variance and the standard deviation for the following data:

Ages           10 – 19   20 – 29   30 – 39   40 – 49   50 – 59   60 – 69
Frequency, f   2         6         14        17        8         3

EXERCISE 3.4
1. In a class of 29 students, this distribution of quiz scores was recorded. Find the variance and standard deviation.

Grade       0 – 2   3 – 5   6 – 8   9 – 11   12 – 14
Frequency   1       3       5       14       6

2. Eighty randomly selected light bulbs were tested to determine their lifetimes (in hours). This frequency distribution was obtained.
Find the variance and standard deviation.

Class Boundaries   Frequency
52.5 – 63.5        6
63.5 – 74.5        12
74.5 – 85.5        25
85.5 – 96.5        18
96.5 – 107.5       14
107.5 – 118.5      5

3. These data represent the scores (in words per minute) of 25 typists on a speed test. Find the variance and standard deviation.

Class limits   54 – 58   59 – 63   64 – 68   69 – 73   74 – 78   79 – 83   84 – 88
Frequency      2         5         8         0         4         5         1

4. Estimate the variance and the standard deviation for the data set whose frequency distribution is given below:

Class         Frequency
3.45 – 3.47   2
3.48 – 3.50   6
3.51 – 3.53   12
3.54 – 3.56   14
3.57 – 3.59   10
3.60 – 3.62   5
3.63 – 3.65   1

5. The following sets of data are the scores of students in two groups taking SEF1134 in the second semester.

Group 1: 21 37 44 42 20 30 30 28 41 38 19 15 37 40 51 35 39 40 18 43 22 39 24 49 47 70 73 66 70 35 50 61 65 37 50 65 30 66 50 63 57 60 50 80 56 41 35 66
Group 2: 26 70 70 51 51 71

(a) Construct a frequency distribution using 5 classes for each data set, then compare the standard deviations of the scores of the two groups.
(b) What can you conclude about the spread of the data in each group? Which group has a bigger variation in the ability of the students?

REVIEW EXERCISE
1. Determine whether each of the following describes a population or a sample:
(a) The heights of 200 primary school students in Selangor.
(b) The number of cars sold by Perodua in the first quarter of the year.
(c) The time taken by students from groups 1 and 2 to complete an essay.
(d) The lifespan of pensioners in the country for the past 50 years.
(e) The household income of 100 residents in Putrajaya.

2. Discuss the difference between discrete variables and continuous variables.

3. Determine whether each of the following is a qualitative or a quantitative variable. If it is quantitative, identify whether it is discrete or continuous:
(a) Noon temperature (in degrees Celsius) in Kuala Lumpur for the past two months.
(b) The responses to a survey that are either strongly agree, agree, disagree, strongly disagree or no opinion.
(c) The number of durian trees planted in three orchards in Pahang.
(d) The weight of 100 cows measured in a feedlot farm in Gemas.

4. What type of sampling is being employed if a country is divided into economic classes and a sample is chosen from each class to be surveyed?

5. The hours billed in a week by 30 lawyers in a prominent law firm were recorded as follows:

52 68 75 72 32 56 38 42 34 43
62 34 35 65 54 50 60 40 58 63
44 41 47 70 75 49 55 39 78 38

(a) Construct the frequency distribution with 30 – 39 as the first class.
(b) Find the class boundaries.
(c) Estimate the mean and variance of the billed hours.

6. [Figure: histogram with vertical axis from 0 to 20 in steps of 2 and classes 6 – 10, 11 – 15, 16 – 20, 21 – 25, 26 – 30, 31 – 35, 36 – 40]

Based on the histogram above, calculate the
(a) Mean
(b) Median
(c) Standard deviation

7. Given a set of data 5, 2, 8, 14, 10, 5, 7, 10, m, n where $\bar{X} = 7$ and mode = 5. Find the possible values of m and n.
(ans: m = 5, n = 4 or m = 4, n = 5)

8. Find the value that corresponds to the 30th percentile of the following data set:
78 82 86 88 92 97
(ans: P30 = 82)

9. Given that the variance of the set of 8 data $x_1, x_2, x_3, \ldots, x_8$ is 5.67 and $\sum X^2 = 944.96$, find the mean of the data.
(ans: 11.09)

10. Find Q3 for the given data set: 18, 22, 50, 15, 13, 6, 5, 12
(ans: 20)

11. The number of credits in business courses that eight applicants took is 9, 12, 15, 27, 33, p, 63, 72. Given that the value that corresponds to the 75th percentile is 54, find p.
(ans: 45)

12. The mean of 5, 10, 26, 30, 45, 32, x, y is 25, where x and y are constants. If x = 16, find the median.
(ans: 28)

13. A physician is interested in studying scheduling procedures. She questions 40 patients concerning the length of time in minutes that they wait past their scheduled appointment time.
The following data are obtained:

60 10 8 45 29 18 27 33 34 38
27 25 35 25 30 37 31 35 42 3
30 36 9 50 6 31 47 53 17 23
31 28 6 12 27 16 50 52 6 19

(a) Construct a frequency distribution by using 7 classes (use 3 as the lower limit of the first class).
(b) Find the mean, mode and standard deviation.
(ans: 28.15, 31.3, 14.63)
(c) Draw an ogive by using relative frequency and estimate the median from the graph.
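
The grouped-data formulas used in this chapter can be checked numerically. The Python sketch below is one possible way to do so, using the price distribution of Example 18 and the spindle lengths of Example 19. The function names are illustrative, and the median class is taken, as in Example 19, to be the first class whose cumulative frequency reaches n/2.

```python
def grouped_mean(midpoints, freqs):
    """Mean of grouped data: sum of f*m divided by sum of f."""
    return sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)


def grouped_variance(midpoints, freqs):
    """Sample variance of grouped data by the shortcut formula."""
    n = sum(freqs)
    sum_fx = sum(f * m for f, m in zip(freqs, midpoints))
    sum_fx2 = sum(f * m * m for f, m in zip(freqs, midpoints))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


def grouped_median(lower_boundaries, freqs, width):
    """Median of grouped data: L_m + [(n/2 - cumulative before) / f_m] * w."""
    n = sum(freqs)
    cumulative = 0  # cumulative frequency before the current class
    for lower, f in zip(lower_boundaries, freqs):
        if f > 0 and cumulative + f >= n / 2:
            return lower + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("frequency table is empty")


# Example 18: class midpoints and frequencies of the prices (RM)
price_midpoints = [3, 8, 13, 18, 23, 28, 33]
price_freqs = [8, 6, 4, 2, 4, 6, 2]

# Example 19: lower class boundaries and frequencies of the spindle lengths
spindle_lower = [20.395, 20.495, 20.595, 20.695, 20.795, 20.895, 20.995, 21.095]
spindle_freqs = [1, 4, 6, 10, 9, 6, 3, 1]

price_mean = grouped_mean(price_midpoints, price_freqs)
price_variance = grouped_variance(price_midpoints, price_freqs)
spindle_median = grouped_median(spindle_lower, spindle_freqs, 0.1)

print("Example 18 mean:", round(price_mean, 2))  # 15.19, as in the worked solution
print("Example 18 sample variance:", round(price_variance, 2))
print("Example 19 median:", round(spindle_median, 3))
```

The same functions can be applied to any of the grouped frequency distributions in the exercises once the midpoints or lower boundaries have been listed.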