Preparatory Mathematics: Collecting and Presenting Data (2015)

1. Data Presentation and Summary Statistics

Section A: Working with Data

An important part of statistical work is presenting data in a clear way from which conclusions can be drawn.

Example: Each of 40 people is asked how many episodes of “The Sopranos” they had seen. (40 is the sample size, which we will discuss later.)

Answers:
0, 2, 1, 7, 9, 3, 0, 1, 1, 1, 0, 4, 1, 5, 1, 9, 3, 2, 0, 3, 1, 6, 1, 0, 2, 5, 7, 1, 3, 4, 1, 0, 2, 7, 1, 3, 0, 1, 3, 1

The answers are given as a list, but this is not the most helpful way to present the information. The “No. of Sopranos Episodes” varies from person to person and is known as the variable. It is convenient to denote a variable by a letter such as X. Each different value of the variable occurs with a particular frequency, i.e. the number of people in each category. The frequency is always denoted by the letter f (small f). We could present the information in a Frequency Table.

X    0    1    2    3    4 or more
f    7    13   4    6    10

Discrete Data

Data is said to be discrete when it can only take a certain set of distinct values.

Examples of Discrete Data:
The number of children in a family: 0, 1, 2, etc.
Hourly pay (in euro and cent) in a certain job where the minimum pay is €8.50 per hour: 8.50, 8.70, 8.95, etc.

We will consider how to present discrete data clearly.

Example: The number of defective items on a production line on 20 successive days is given below.

12 14 9 13 11 15 10 12 14 13
12 11 13 12 15 9 14 12 13 10

When the figures are presented in this way, it is difficult to make much sense of them. A frequency table makes clearer how the values in the set of data are spread out.

Frequency Tables

In the example above, the variable is the number of defective items during a day. Let X stand for the variable, i.e. X = the number of defective items on a day.
The number of times each value of the variable occurs is the frequency of the value, denoted by f. A frequency table has two columns labelled X and f. For discrete data we proceed as follows:

1. Find the lowest and highest values in the set of data. In our example these are 9 and 15.
2. Draw the table with two columns, one each for the variable and the frequency. Label each column clearly.
3. Enter each value of the variable in the X column from the lowest to the highest. Enter the number of times each value occurs in the f column opposite the appropriate value.

The frequency table for our example is shown below.

X (Defective Items)    f (Number of Days)
9                      2
10                     2
11                     2
12                     5
13                     4
14                     3
15                     2

Section B: Presenting Data

Histograms

A Histogram displays the data from a frequency table graphically. Some people prefer tables and others pictures! Put the values of the variable on the horizontal axis (or x-axis) and measure the frequency on the vertical axis. Draw a rectangle above each value, with the height of the rectangle indicating how often the corresponding value occurs.

[Histogram: "Defective Items per Day". Horizontal axis: Defective Items in a Day (X), from 9 to 15. Vertical axis: Frequency (f) in Days, from 0 to 6.]

Some important points when drawing graphs:
1. Your graph must have a title.
2. You must label each axis.
3. You must put units on each axis.
4. If possible, indicate where the data comes from in a footnote.

From a Histogram, it is easy to read off which value of the variable occurs most often. It is simply the value corresponding to the tallest rectangle. This value is called the mode of the data set. In the above example the mode is 12 defective items: this occurred on more days than any other value.

The mode of a data set is an example of a summary statistic. It is a single number extracted from the data which can tell us a lot about the data as a whole. We will see more summary statistics as we go.
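As a cross-check on the frequency table steps and the mode described above, here is a minimal Python sketch. It is only an illustration: the names `defective`, `frequency_table`, `mode_value` and `mode_freq` are invented for this example and are not part of the notes.

```python
from collections import Counter

# Daily defective-item counts from the example above (20 successive days).
defective = [12, 14, 9, 13, 11, 15, 10, 12, 14, 13,
             12, 11, 13, 12, 15, 9, 14, 12, 13, 10]


def frequency_table(values):
    """Return (X, f) pairs for every whole-number value from the lowest to the highest."""
    counts = Counter(values)
    # Step 1: find the lowest and highest values; step 3: list every value in between.
    return [(x, counts[x]) for x in range(min(values), max(values) + 1)]


table = frequency_table(defective)

# The mode is the value with the largest frequency (the tallest rectangle).
mode_value, mode_freq = max(table, key=lambda row: row[1])

if __name__ == "__main__":
    print("X    f")
    for x, f in table:
        print(f"{x:<4} {f}")
    print("Mode:", mode_value)
```

Listing every value between the lowest and the highest, rather than only the values that occur, means a value with frequency 0 would still get its own row, which is what a histogram needs.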
Continuous Data and Classes

Data is continuous if it can take any value in some range. If you have a lot of discrete data whose values are very close to each other, then that is usually treated as continuous data too.

Examples of Continuous Data:
1. The weights of components in a manufacturing process.
2. The length of time it takes a component to break down.
3. The percentage returns on different investments.

For continuous data, or discrete data with a large number of values, it is necessary to group the values in some way before drawing up a frequency table. This gives us a grouped frequency table.

Note: We have already seen one grouped frequency table. In our very first example (on Sopranos Episodes) we grouped the values 4, 5, 6, and so on into a single entry in the frequency table: “4 or more”.

Example: The salaries (in €000’s) of 50 employees in a large corporation are given below.

10.2 21.2 23.49 43.78 42.23 35.7 10.2 16.61 25.55 62.21
12.1 33.76 17.12 37.76 68.37 19.2 34.12 49.16 18.81 11.2
17.5 41.65 55.12 49.12 27.89 44.21 29.43 39.17 38.71 39.87
15.5 21.0 35.63 33.34 52.3 22.3 34.47 41.25 56.72 41.23
25.6 38.72 37.73 30.0 19.54 28.6 32.12 29.15 47.76 28.87

Question: What type of data are these, continuous or discrete?

A straightforward frequency table would be useless here as most of the values occur with a frequency of either 1 or 2. We organise the numbers in the list into groups or classes.

1. Find the lowest and highest values in the data set.
2. Choose a class size so that the number of classes is between 5 and 15. The class size itself should be a convenient number.
3. It is vital that classes do not overlap.

In our example the lowest salary is 10.2 and the highest is 68.37. Classes of width 10 or 5 would be appropriate. We will take classes of width 10.
Our first class will be salaries from 10 to 20, the second from 20 to 30 and so on (in €000’s). The extreme values in a class (such as 10 and 20 in the first class) are known as class boundaries. To prevent the classes from overlapping we must decide between two alternatives:

(a) Include the lower class boundary in each class and exclude the upper
OR
(b) Exclude the lower class boundary from each class and include the upper.

If we do not make this choice, then we will not know whether the number 20 belongs to the first class or the second class. We will have similar problems for 30, 40, 50, and so on.

We will take option (a). The first class is salaries “from 10 to under 20 (€000’s)”. 10 belongs to this class but 20 does not. The second class is salaries “from 20 to under 30 (€000’s)”. And so on.

Grouped Frequency Table:

X (Salaries in €000’s)    Tally    f (No. of Employees)
From 10 to under 20
From 20 to under 30
From 30 to under 40
From 40 to under 50
From 50 to under 60
From 60 to under 70

The Tally column is there to speed up your counting. As you read through the data, put a mark in the relevant Tally box. Group the marks in fives as you go: make four upright marks (1111) and draw the fifth mark as a horizontal line through them. Why Tally first? It means you only read ONCE through the data and ensures you neither miss a value nor count it twice. Now add up your Tally marks to get your frequency column.

Note: A more mathematical notation for “From 10 to under 20” that you might see is “10 ≤ X < 20”. We will stick to the English version!

Histograms for Grouped Data

Drawing a Histogram for grouped data is much the same as it was for simple discrete data. Again the variable is plotted on the x-axis with each frequency represented by a rectangle of the correct height. However, this time we have a rectangle for each class in the frequency table as opposed to a rectangle for each individual value of the variable.
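The grouping rule adopted above, option (a), can also be sketched in Python as a way of checking a hand tally. This is only an illustrative sketch: the function `grouped_frequencies` and its parameters `start`, `width` and `n_classes` are hypothetical names chosen for this example, and the salary list is copied from the example in the notes.

```python
# Salaries in €000's, copied from the example above (50 employees).
salaries = [
    10.2, 21.2, 23.49, 43.78, 42.23, 35.7, 10.2, 16.61, 25.55, 62.21,
    12.1, 33.76, 17.12, 37.76, 68.37, 19.2, 34.12, 49.16, 18.81, 11.2,
    17.5, 41.65, 55.12, 49.12, 27.89, 44.21, 29.43, 39.17, 38.71, 39.87,
    15.5, 21.0, 35.63, 33.34, 52.3, 22.3, 34.47, 41.25, 56.72, 41.23,
    25.6, 38.72, 37.73, 30.0, 19.54, 28.6, 32.12, 29.15, 47.76, 28.87,
]


def grouped_frequencies(values, start, width, n_classes):
    """Count the values in each class "from lower to under upper".

    Option (a): the lower class boundary is included and the upper is excluded,
    so a value of exactly 20 is counted in "from 20 to under 30".
    """
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)  # which class the value falls into
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} lies outside the chosen classes")
        counts[index] += 1
    return counts


counts = grouped_frequencies(salaries, start=10, width=10, n_classes=6)

if __name__ == "__main__":
    for i, f in enumerate(counts):
        lower = 10 + 10 * i
        print(f"From {lower} to under {lower + 10}: {f}")
```

Like the Tally column, the loop reads through the data only once. The frequencies it prints should add up to 50, the number of employees.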
The Histogram for the salary data is below.

[Histogram: "Company Salaries". Horizontal axis: Salaries in €000's (X), classes "from 10 to under 20" up to "from 60 to under 70". Vertical axis: Frequency (f), from 0 to 16.]

Practice Example: A group of 25 people are asked for their weight to the nearest lb. Here are the answers.

145, 143, 161, 156, 159, 159, 154, 153, 167, 155, 151, 146, 148, 160, 134, 143, 155, 157, 142, 171, 146, 163, 161, 153, 172

A Grouped Frequency Table for these data is set out below, to be filled in.

X (Weights in lbs)       Tally    f (frequency)
From 130 to under 140
From 140 to under 150
From 150 to under 160
From 160 to under 170
From 170 to under 180

[Histogram: "Weights". Horizontal axis: Weights in lbs (X), classes "From 130 to under 140" up to "From 170 to under 180". Vertical axis: Frequency (f), from 0 to 12.]

Section C: Summary Statistics

“Population” and “Sample”

The word “population” refers to a complete data set.

Examples:
If we are dealing with rolls of a die, the population is the results of every roll of the die ever made.
If we are dealing with the heights of men in Ireland, the population is the heights of every man in the country.
If we are dealing with the % carbon content of pieces of steel from a production run, the population might be all of the pieces produced on one run. It may also be all pieces which might be produced by that production process into the future.

PROBLEM: It is often difficult or impossible to work with the population as a whole. In the last example, we will never know the % carbon content of every steel piece. We select a sample from the population and work with that instead.

In manufacturing, samples from a production run are usually tested for quality control purposes.
This sampling and testing may be destructive and/or time consuming, so you obviously don’t want to test everything!

Examples of Samples:
Roll a die 250 times and record the results.
Select 1000 men at random and measure their heights.
Select 10 steel pieces at random from a day’s production run and test their % carbon content.

The sample should represent the whole population.

Examples of Bad Sampling:
The population is the voting intentions of every voter in Ireland. The sample (opinion poll) is the voting intentions of 100 voters attending the Labour Party Conference.
The population is the heights of all the men in Ireland. The sample is the heights of 1000 male Gardaí.
The population is the % carbon content of all the steel pieces in a day’s production run. The sample is the first 10 steel pieces in the day’s production run.

Statistics and Parameters

A parameter is a number that you calculate using the entire population. For example, calculating the mode from the whole population would give us a parameter. A parameter is constant for the population.

A statistic is a number that we calculate using only a sample.

Parameters are difficult, or more usually impossible, to calculate! Statistics can always be calculated. A statistic can vary from one sample to another and is not constant for the population. A statistic is used to estimate a parameter.

The obvious question is “when is a statistic a good approximation for the parameter, i.e. how big does my sample need to be?” We will look at this problem next semester. For now, we will simply calculate some statistics.

A summary statistic is a statistic that sums up the data in a sample: it tells you something about the data as a whole. There are two types of summary statistic:
1. Measures of Location (also called Central Tendency)
2. Measures of Dispersion

Measures of Location

An average is a point within a group of data which is central to the group, and around which the other values are distributed.
It is therefore a measure of central tendency: a measure which starts to summarise the data by fixing one point as the centre. The position of the central item fixes the location of the distribution, and averages are therefore sometimes called measures of location.

There are three measures of location that we will discuss: the mode, the median, and the mean.

Mode: the most frequently occurring value.
Median: the “middle” value when all the numbers are arranged in order. (Half of the values in the data set lie at or below it and half lie at or above it.)
Mean: the “average” value in the familiar sense. The mean is the arithmetic average: add up the values and divide by however many of them there are.

The mode, the median, and the mean all give an indication of where the data is situated. The idea in each case is to pick one number that is representative of the data set as a whole.

Example: For the following set of numbers, calculate the mode, the median, and the mean.

12, 11, 14, 7, 0, 1, 11, 8, 11, 2, 3

Firstly, write the numbers in ascending order:

0, 1, 2, 3, 7, 8, 11, 11, 11, 12, 14

Mode = 11 (the number that occurs most often; writing the numbers in ascending order groups the 11’s together, making it easier to identify the mode)

Median = 8 (the middle number; only when the numbers are written in ascending order can we see which one is in the middle)

To calculate the mean we add up the numbers and divide by how many numbers we have:

Add up the numbers: 12+11+14+7+0+1+11+8+11+2+3 = 80

There are 11 numbers in the list. This is n, the sample size.

Mean = 80/11 = 7.27 (to two decimal places)

Formula for the mean: $\bar{x} = \dfrac{\sum x}{n}$

Note: The symbol $\sum$ means to “add up”.
Note: The symbol $\bar{x}$ is always used to denote the mean of a sample of numbers.
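The worked example above can be checked with a short Python sketch. It uses the standard library's `statistics` module for the median and mode, and computes the mean directly as the sum of the values divided by n; the variable names are illustrative only.

```python
import statistics

# The eleven numbers from the worked example above.
data = [12, 11, 14, 7, 0, 1, 11, 8, 11, 2, 3]

ordered = sorted(data)             # ascending order makes the mode and median easy to see
n = len(data)                      # the sample size
mean = sum(data) / n               # add up the values and divide by how many there are
median = statistics.median(data)   # the middle value of the ordered list
mode = statistics.mode(data)       # the most frequently occurring value

if __name__ == "__main__":
    print("Ordered:", ordered)
    print("n =", n)
    print("Mean =", round(mean, 2))
    print("Median =", median)
    print("Mode =", mode)
```

Run as written, this should reproduce the values found by hand: a mean of 7.27 (to two decimal places), a median of 8 and a mode of 11.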
Example: Find the mean for the salary data given before, i.e. the salaries (in €000’s) of the 50 employees in a large corporation listed under “Continuous Data and Classes”.

Calculating the Mode from a Histogram

The mode of a data set can easily be found using a Histogram of the data. The mode is just the value (or class of values) with the tallest rectangle in the Histogram. When it is a class of values it is called the Modal Class.

Example: Below is the histogram for the Defective Items data. It is easy to pick out the mode of the data set.

[Histogram: "Defective Items Data". Horizontal axis: Number of Defective Items (X), from 9 to 15. Vertical axis: Frequency (f), from 0 to 6.]

Mode =

What is the Modal Class for the Salaries data?

What is the Modal Class for the Weights data?

Estimating the Median for larger data sets

To find the median of a set of numbers, arrange all of the numbers in ascending order and pick out the number which is half way along the list. When there are a large number of data, it is easiest to arrange them in ascending order by putting them into a spreadsheet and using the sort function (see the lab notes for how to do this).

Example: Recall that a group of 25 people are asked for their weight to the nearest lb. Here are the answers.
145, 143, 161, 156, 159, 159, 154, 153, 167, 155, 151, 146, 148, 160, 134, 143, 155, 157, 142, 171, 146, 163, 161, 153, 172

Estimate the median.

Solution:
Median ≈

Example: Find the median for the salary data given before, i.e. the salaries (in €000’s) of 50 employees in a large corporation.

Solution: Enter the data into a spreadsheet and use the sort function to rearrange it in ascending order. The spreadsheet has three columns: List, Position on list, and Reverse Position on list. The sorted List column reads:

10.2 10.2 11.2 12.1 15.5 16.61 17.12 17.5 18.81 19.2
19.54 21 21.2 22.3 23.49 25.55 25.6 27.89 28.6 28.87
29.15 29.43 30 32.12 33.34 33.76 34.12 34.47 35.63 35.7
37.73 37.76 38.71 38.72 39.17 39.87 41.23 41.25 41.65 42.23
43.78 44.21 47.76 49.12 49.16 52.3 55.12 56.72 62.21 68.37

Since there are 50 values, an even number, there is no single middle value; the median is taken as the average of the 25th and 26th values on the list.

Median =

Advantages/Disadvantages of the Three Measures of Location

It should be noted that for a reasonably large set of real data that is roughly symmetric, the three measures of location will be more or less the same.

Mean
Advantages:
Takes all the numbers into account
Easy to calculate
Familiar
Useful mathematical properties
Disadvantages:
Affected by large values
Can be a number not in the actual data set

When to use the mean: if all of the values in a data set are roughly equal, the mean is the best number to use as a summary statistic.

Mode
Advantages:
Simple to calculate
Unaffected by large values
Disadvantages:
May be unrepresentative of the whole data set
Might not be unique. For example, consider the list of numbers 1, 2, 2, 2, 4, 4, 7, 7, 7, 9. Both 2 and 7 are modes for that set!
A further advantage of the mode: it is useful for non-numerical data as well.

When to use the mode: when nearly all the values in the data set are the same, or for a small data set.

Median
Advantages:
Unaffected by large values
Easy to calculate
Disadvantages:
May not be representative of the whole data set

When to use the median: when the data set has a few very large or very small numbers.

Skewed Data (http://www.mathsisfun.com/data/skewness.html)

Data can be "skewed", meaning it tends to have a long tail on one side or the other.

[Figure: three distributions labelled Negative Skew, No Skew, and Positive Skew.]

Negative Skew
Why is it called negative skew? Because the long "tail" is on the negative side of the peak. People sometimes say it is "skewed to the left" (the long tail is on the left hand side). The mean is also on the left of the peak.

Not Skewed
A Normal Distribution is not skewed. It is perfectly symmetrical, and the mean is exactly at the peak.

Positive Skew
Positive skew is when the long tail is on the positive side of the peak, and some people say it is "skewed to the right". The mean is on the right of the peak value.

Calculating Skewness
"Skewness" (the amount of skew) can be calculated; for example, you could use the SKEW() function in Excel.

Some other resources: http://www.statisticshowto.com/skewed-distribution/

Measures of Dispersion

While the mode, median, and mean help us to summarise a set of data, they tell us nothing about how spread out the values in the set are.

Example: Look at the following two sets of numbers

22, 22, 23, 24, 25, 27, 27
18, 20, 21, 24, 27, 28, 30

For both, the mean is approximately 24 (170/7 ≈ 24.3 for the first set and 168/7 = 24 for the second), but the second is obviously more spread out.

We will look at two ways of describing the amount of spread in a set of data:
1. Range
2. Standard Deviation

Range

The range is just the difference between the lowest and highest numbers in the data set.

Example: 22, 22, 23, 24, 25, 27, 27
Range = 27 − 22 = 5

Advantages
1. Easy to calculate

Disadvantages
1. Only uses two numbers from the set
2. One outlier can distort the range and make it useless as a measure if all the other values are close together.

The range, as a measure of dispersion, is not extensively used. It has some usefulness for certain “quality control” systems, where values (e.g. temperature) cannot be allowed to stray outside a certain range.

Standard Deviation

The standard deviation is the most important and widely used measure of dispersion. It measures, roughly, how far the numbers in a data set typically lie from the mean of the set.

Example: Calculate the standard deviation of the sample data set

177, 180, 181, 185, 187

1. First we calculate the mean.
   Mean = $\bar{x}$ =
2. Next subtract the mean from each number in the original set.
   177 − ___ =
   180 − ___ =
   181 − ___ =
   185 − ___ =
   187 − ___ =
3. Now square each of these “deviations” from the mean.
4. Next calculate the average (mean) of these squared deviations. (This average is known as the sample variance and is denoted by $s^2$.)
   $s^2$ =
5. Finally, to compensate for having squared all of the original deviations, take the square root of the answer from part 4. This is the standard deviation of the set of numbers.
   Standard deviation s = $\sqrt{s^2}$ =

Note: For a small sample such as this we should really use the “sample standard deviation” formula, which in step 4 above would divide by n − 1 = 4 rather than n = 5. For larger samples, there is no great difference between using n and n − 1, so we will use n and not worry about it.

Example: Find the mean and the standard deviation of:

3   3.2   4.1   4.5   5.5

Solution:

Steps:
1. First we calculate the mean.
2.
Next subtract the mean from each number in the original set.
3. Now square each of these “deviations” from the mean.
4. Next calculate the average (mean) of these squared deviations.
   $s^2$ =
5. Finally, to get the standard deviation of the set of numbers, take the square root of the answer from part 4. This is
   s =

Formula for Standard Deviation

$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}}$   (again, use n − 1 if n is small)

Note: This is the standard deviation formula when frequencies are not known.

x      mean    x − mean    (x − mean)^2
3
3.2
4.1
4.5
5.5

$\sum x$ =
$\bar{x}$ =
$\sum (x - \bar{x})^2$ =
and $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}}$ =

Example: The number of days with an average temperature below 6°C is shown below. Calculate the mean and standard deviation for this data.

Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
18    12    11    6     9     4     2     0     2     6     12    17

Solution: We have 12 numbers here: 18, 12, 11, 6, 9, 4, 2, 0, 2, 6, 12, 17, so here n = 12.

Standard Deviation: $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}}$

x      $x - \bar{x}$    $(x - \bar{x})^2$
18
12
11
6
9
4
2
0
2
6
12
17

$\sum x$ =
$\bar{x}$ =
$\sum (x - \bar{x})^2$ =
and $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}}$ =

This method is fine for small data sets, and in the lab, with the use of an Excel spreadsheet, we can use it for large data sets also.

Six sigma picture: http://nursingplanet.com/biostatistics/normal_curve.jpg (from http://nursingplanet.com/biostatistics/normal_distribution_and_probability.html)

Problem Sheet: Data Presentation and Summary Statistics

1.
Find the mean, median and mode of the following sets of data:
(a) 5, 7, 10, 19, 7, 8, 16, 11, 9, 12, 14.
(b) 127 121 122 138 123 129 124 128 132 124.

2. Find the range and standard deviation for each of the data sets in question 1.

3. A firm has a team of 5 maintenance technicians who service machines leased to customers. The following frequency distribution records, for a 40 day period, the number of days on which each number of technicians was out on service calls:

Number of Technicians    Number of Days
0                        3
1                        10
2                        8
3                        7
4                        6
5                        6

Plot a histogram for this data.
What is the mode for these data?
Find the mean number of technicians out on call per day.
By writing down a list of these numbers, calculate the median for these data.
Comment on the results for the mean and median.

PS 1(a)

Note: the table below uses the list 5, 7, 7, 8, 9, 10, 11, 12, 12, 16, 19, i.e. it has 12 where question 1(a) lists 14. With the list as printed in question 1(a), the sum is 118, the mean is 118/11 ≈ 10.73 and 7 is the only mode; the median (10) and the range (14) are unchanged.

Count  Back count  x    meanx    (x-meanx)     (x-meanx)^2
1      11          5    10.5455  -5.54545455   30.75206612
2      10          7    10.5455  -3.54545455   12.57024793
3      9           7    10.5455  -3.54545455   12.57024793
4      8           8    10.5455  -2.54545455   6.479338843
5      7           9    10.5455  -1.54545455   2.388429752
6      6           10   10.5455  -0.54545455   0.297520661
7      5           11   10.5455  0.454545455   0.20661157
8      4           12   10.5455  1.454545455   2.115702479
9      3           12   10.5455  1.454545455   2.115702479
10     2           16   10.5455  5.454545455   29.75206612
11     1           19   10.5455  8.454545455   71.47933884

116 is the sum of the x
170.7272727 is the sum of (x-meanx)^2
n = 11
Mean = 10.54545455
Standard Deviation = 3.939627033
Mode = 7 and 12
Median = 10
Range = 19 − 5 = 14

PS 1(b)

Count  Back count  x     meanx  (x-meanx)  (x-meanx)^2
1      10          121   126.8  -5.8       33.64
2      9           122   126.8  -4.8       23.04
3      8           123   126.8  -3.8       14.44
4      7           124   126.8  -2.8       7.84
5      6           124   126.8  -2.8       7.84
6      5           127   126.8  0.2        0.04
7      4           128   126.8  1.2        1.44
8      3           129   126.8  2.2        4.84
9      2           132   126.8  5.2        27.04
10     1           138   126.8  11.2       125.44

1268 is the sum of the x
245.6 is the sum of (x-meanx)^2
n = 10
Mean = 126.8
Standard Deviation = 4.955804677
Mode = 124
Median = (124+127)/2 = 125.5
Range = 138 − 121 = 17

Temperature example from the notes

x    meanx  (x-meanx)  (x-meanx)^2
18   8.25   9.75       95.0625
12   8.25   3.75       14.0625
11   8.25   2.75       7.5625
6    8.25   -2.25      5.0625
9    8.25   0.75       0.5625
4    8.25   -4.25      18.0625
2    8.25   -6.25      39.0625
0    8.25   -8.25      68.0625
2    8.25   -6.25      39.0625
6    8.25   -2.25      5.0625
12   8.25   3.75       14.0625
17   8.25   8.75       76.5625

99 is the sum of the x
382.25 is the sum of (x-meanx)^2
n = 12
Mean = 8.25
Standard Deviation = 5.643949563
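The divide-by-n calculation used in the tables above can be reproduced with a short Python sketch, shown here for the temperature data. This is an illustration under stated assumptions rather than the method used in the lab: the function `std_dev` and its `sample` flag are hypothetical names. The standard library's `statistics.pstdev` (divide by n) and `statistics.stdev` (divide by n − 1) are used only as a cross-check.

```python
import math
import statistics

# Number of days per month with an average temperature below 6°C (Jan to Dec).
days_below_6c = [18, 12, 11, 6, 9, 4, 2, 0, 2, 6, 12, 17]


def std_dev(values, sample=False):
    """Standard deviation following the five steps in the notes.

    With sample=False the squared deviations are averaged over n;
    with sample=True they are divided by n - 1 instead.
    """
    n = len(values)
    mean = sum(values) / n                                   # step 1: the mean
    squared_deviations = [(x - mean) ** 2 for x in values]   # steps 2 and 3
    divisor = n - 1 if sample else n
    variance = sum(squared_deviations) / divisor             # step 4: s squared
    return math.sqrt(variance)                               # step 5: square root


s_n = std_dev(days_below_6c)                       # divide by n, as in the tables
s_n_minus_1 = std_dev(days_below_6c, sample=True)  # divide by n - 1

if __name__ == "__main__":
    print("Mean:", sum(days_below_6c) / len(days_below_6c))
    print("s (divide by n):", round(s_n, 4))
    print("s (divide by n - 1):", round(s_n_minus_1, 4))
    print("Library check:", statistics.pstdev(days_below_6c),
          statistics.stdev(days_below_6c))
```

For these 12 values the divide-by-n result should agree with the 5.6439 in the table, while the n − 1 version comes out slightly larger, which illustrates the earlier note that the choice matters most for small samples.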