Statistics

Introduction: In many real-life situations, it is helpful to describe data by a single number that is most representative of the entire collection of numbers. Ways of characterizing a data distribution include:

Measures of Central Tendency. Describe the center point of a data set with a single value.
Measures of Dispersion. Describe how far individual data values have strayed from the mean.

Note that some measures of central tendency and variability are inappropriate for qualitative variables.

Mean: The mean (or average) of a set of data values is the sum of all of the data values divided by the number of data values.

\[ \text{Mean} = \frac{\text{Sum of all data values}}{\text{Number of data values}}, \qquad \bar{x} = \frac{\sum x_i f_i}{N} \]

Where: \(\bar{x}\) ('x bar') is the mean of the set of x values, \(\sum x_i f_i\) is the sum of all the \(x_i f_i\) values (each value \(x_i\) multiplied by its frequency \(f_i\)), and \(N\) is the number of data values in the population.

Example: A fruit-seller has the following daily sales (in $) for five consecutive days: 100, 120, 125, 100, 130. Determine his average daily sales.
Solution: Mean = (100 + 120 + 125 + 100 + 130) / 5 = 575 / 5 = 115. Thus, the average daily sale of the fruit-seller is $115.

We calculate the statistical mean of a list of numbers in order to find the general tendency of the numbers in the list.

Example: Find the mean number of minutes per day spent on Facebook: 75, 36, 0, 94, 56.
Solution: Mean = (75 + 36 + 0 + 94 + 56) / 5 = 261 / 5 = 52.2 minutes.

Characteristics of the mean:
Every value in the distribution contributes to the value of the mean.
The mean is very sensitive to extreme scores. An extreme score can pull the mean in one direction or the other, making it less representative of the set of scores and less useful as a measure of central tendency.
The arithmetic mean is affected by a change of both origin and scale.
Its value may not actually exist in the data (e.g., for the data set 2, 3, 4 and 5, the mean is 3.5).

Remember that the word average means only the one measure that best represents a set of scores, and that there are many different types of averages.
Which type of average you use depends on the question that you are asking and the type of data you are trying to summarize.

Exercises
The heights (cm) of the students in a class are: 195, 192, 192, 150, 174, 186, 159, 156, 189, 168, 156, 168, 150, 186, 192, 162, 183, 174, 189, 159
Their mean height is: ____ cm

Median: The median is the middle value when the data is arranged in order of size. In other words, the median divides the whole set of values into two parts such that half of the observations are less than or equal to it and half are greater than or equal to it.

Example: Find the median of the following set of data: 2, 3, 5, 3, 4, 3, 6
Step 1. Rewrite the numbers in ascending order: 2, 3, 3, 3, 4, 5, 6
Step 2. There are 7 values in the data set, so the median is the fourth value. The median is 3.

If the total number of given values, \(n\), is an odd number, then there exists only one middlemost value, namely the \(\frac{n+1}{2}\)-th value in the ordered arrangement, and it represents the median of the values.

Example: Find the median for the following set of values: 2, 4, 3, 5, 1, 4, 5, 3, 3
Step 1. Rank the data in ascending order as follows: 1, 2, 3, 3, 3, 4, 4, 5, 5
Step 2. Because the number of values in this set is odd (nine), there are four values before and four values after the median in the ranked list. Therefore, the median is the fifth value, 3.

If the total number of given values, \(n\), is an even number, the median may not be uniquely determined. In fact, any value between the two middle values, namely the \(\frac{n}{2}\)-th and \(\left(\frac{n}{2}+1\right)\)-th values in the ordered arrangement, may be taken as the median. To obtain a definite value, the arithmetic mean of the \(\frac{n}{2}\)-th and \(\left(\frac{n}{2}+1\right)\)-th values is regarded as the median of the set of values, by convention.

Example: Find the median for the following set of values: 0, 2, 3, 5, 4, 4, 5, 3
Step 1. Rank the data in ascending order as follows: 0, 2, 3, 3, 4, 4, 5, 5
Step 2. Because the number of values in this set is even (eight), the median is the midpoint between the fourth and the fifth values, 3 and 4. The median is (3 + 4) / 2 = 3.5.
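The odd and even rules above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `median_by_rank` is an assumption made for this example, not something from the notes, and the lists are the example values used in this section (the even case uses the ranked list 0, 2, 3, 3, 4, 4, 5, 5).

```python
from statistics import median  # standard-library reference implementation


def median_by_rank(values):
    """Median via the rank rule: the middle value for odd n,
    the mean of the two middle values for even n."""
    ordered = sorted(values)          # Step 1: rank the data in ascending order
    n = len(ordered)
    mid = n // 2                      # 0-based index of the (n/2 + 1)-th value
    if n % 2 == 1:
        return ordered[mid]           # odd n: the (n + 1)/2-th value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the two middle values


seven_values = [2, 3, 5, 3, 4, 3, 6]
nine_values = [2, 4, 3, 5, 1, 4, 5, 3, 3]
eight_ranked = [0, 2, 3, 3, 4, 4, 5, 5]

print(median_by_rank(seven_values))   # 3
print(median_by_rank(nine_values))    # 3
print(median_by_rank(eight_ranked))   # 3.5
```

The standard library's `statistics.median` follows the same convention for even-sized data, so it can be used to cross-check hand calculations.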
The median for grouped data is slightly more difficult to compute. We know that the median occurs in the particular class interval for which the cumulative frequency reaches \(N/2\). On observing the cumulative frequencies (of the less-than type, say), we can obtain the class interval that contains the median. In fact, the cumulative frequency for this interval is the first one that is greater than or equal to \(N/2\).

Marks     Number of students
0-10      2
10-20     12
20-30     22
30-40     8
40-50     6

Here \(N = 50\), so \(N/2 = 25\). The less-than cumulative frequencies are 2, 14, 36, 44 and 50, so the median lies in the class interval 20-30.

Advantages:
It is very easy to calculate.
The median is unaffected by extreme scores.

Disadvantages:
It may not correspond to any observed value (e.g., for the data set 2, 3, 4 and 5, the median is 3.5).
It cannot be manipulated algebraically.

Exercise: Find the median for the following set of scores: 1, 8, 10, 8, 4, 10, 6, 3, 7, 3, 5, 5, 6, 1, 3, 10, 0, 7, 9
Median: ____

Mode: The mode of a set of data is the value or values which occur most often.

Steps to determine the mode:
Step 1. Count the number of times each value in the set occurs.
Step 2. If one value occurs more times than any other, it is the mode. If two or more values tie for the most occurrences, they are all modes of the set. If all values occur the same number of times, there is no mode.

Example: Find the mode of: 2, 3, 4, 4, 2, 3, 4
Number 2 occurs 2 times, number 3 occurs 2 times, and number 4 occurs 3 times. So the number with the most occurrences is 4, and it is the mode of this distribution.

Another method for determining the mode is to use the empirical relation between mean, median and mode, which is found to hold for unimodal distributions that do not deviate much from symmetry. The relation is:

\[ \text{Mean} - \text{Mode} = 3\,(\text{Mean} - \text{Median}), \quad \text{that is,} \quad \text{Mode} = 3\,\text{Median} - 2\,\text{Mean} \]

Mode for grouped data. In the computation of the value of the mode for grouped data, it is necessary to identify the class interval that contains the mode. This interval, called the modal class, contains the highest frequency in the distribution.

Example: This table shows the monthly income of different families in a particular locality. Find the income earned by the largest number of families.
Income        Families
1000-2000     10
2000-3000     14
3000-4000     10
4000-5000     12

The modal class is 2000-3000.

Advantages:
It is applicable to nominal data.
It is unaffected by extreme values.

Disadvantages:
It may not be unique in a set of data.
It cannot be manipulated using the rules of algebra.

Exercises
1. Find the mode of the following scores: 0, 10, 1, 0, 4, 0, 0, 4, 0, 9, 0, 6, 8, 4, 1, 2, 7, 6, 9, 8, 10, 0, 4, 5, 3
Mode: ____

2. A farmer has 50 chickens. After weighing them all he got the following amounts (in grams):
1800 2700 3000 2500 2900 1900 3000 3400 2900 3400
2300 1500 3100 1500 1800 1900 2700 2600 2400 2200
3500 2100 1700 2000 2500 2900 2700 1700 2700 3100
1600 3100 2000 3200 1800 1800 3200 2000 3000 1900
2500 2400 3500 3200 1500 2100 1900 2000 1800 1600
Find the median: ____

Range: The range is the difference between the highest value and the lowest value of the given set of observations:

Range = maximum value - minimum value.

Example: The heights of a sample of five people are 180, 183, 190, 179 and 180 cm. Find the range.
Maximum value = 190
Minimum value = 179
Range = 190 - 179 = 11 cm

Properties:
It is easy to understand.
It is simple to calculate.
It does not depend on all observations; it is based only on the largest and the smallest among them.
It is highly affected by extreme values.
It does not take into account the form of the distribution.

Mean Deviation and its Coefficient: The mean deviation (also called average deviation) of a set of \(N\) numbers \(X_1, X_2, \ldots, X_N\) is abbreviated MD and is defined by

\[ MD = \frac{\sum_{j=1}^{N} \left| X_j - \bar{X} \right|}{N} \]

where \(\bar{X}\) is the arithmetic mean of the numbers and \(\left| X_j - \bar{X} \right|\) is the absolute value of the deviation of \(X_j\) from \(\bar{X}\).

Example: Find the mean deviation of the set 2, 3, 4, 5, 6.
Solution: The arithmetic mean is \(\bar{X} = 20/5 = 4\), so MD = (2 + 1 + 0 + 1 + 2) / 5 = 6 / 5 = 1.2.

Properties:
The mean deviation is based on all the observations.
It shows the dispersion of values around the measure of central tendency.
It is easy to compute.
The average of the signed deviations from the mean is always zero in any data set. The MD avoids this problem by using absolute values to eliminate negative signs.
The mean deviation is a better measure of absolute dispersion than the range and the quartile deviation.

Variance: The variance is a numerical index describing the dispersion of a set of scores around the mean of the distribution. The variance is calculated as the average of the squared deviations from the mean.

Formula for variance:

\[ s^2 = \frac{\sum x_i^2 f_i}{N} - \bar{x}^2 \]

Example: A couple has six children whose ages are 6, 8, 10, 12, 14 and 16. Find the variance in ages.
Solution: The mean is \(\bar{x} = 66/6 = 11\) and \(\sum x_i^2 = 36 + 64 + 100 + 144 + 196 + 256 = 796\), so \(s^2 = 796/6 - 11^2 = 35/3 \approx 11.67\).

Example: The following table gives the frequency distribution of the number of computers sold during the past 12 weeks at a computer store.

Computers sold     Frequency (f)
[0-4)              2
[4-8)              3
[8-12)             4
[12-16)            2
[16-20)            1

Calculate the variance.
Solution: Using the class midpoints 2, 6, 10, 14 and 18 with \(N = 12\), the mean is \(\bar{x} = 108/12 = 9\) and \(\sum x_i^2 f_i = 1232\), so \(s^2 = 1232/12 - 9^2 \approx 21.67\).

Standard deviation: The standard deviation is simply the square root of the variance and gives the spread of the sample or population about the mean. That is,

\[ s = \sqrt{s^2} = \sqrt{\frac{\sum x_i^2 f_i}{N} - \bar{x}^2} \]

The standard deviation plays a dominant role in the study of variation in data. It is a very widely used measure of dispersion. Among statistical tools, the mean is the first in importance and the standard deviation is the second.

Example: A couple has six children whose ages are 6, 8, 10, 12, 14 and 16. Find the standard deviation in ages.
1. The population mean is: \(\bar{x} = 66/6 = 11\).
2. The variance is: \(s^2 = 796/6 - 11^2 = 35/3 \approx 11.67\).
3. Find the positive square root of the variance: \(s = \sqrt{35/3} \approx 3.42\).

Properties:
The standard deviation is in the same units as the original observations.
The standard deviation is independent of a change of origin but not of scale.

Coefficient of variation: The coefficient of variation (symbol CV), also referred to as the relative standard deviation, is defined as the ratio of the standard deviation to the mean of the data set. It is used to express the standard deviation as a percentage of the mean.
Mathematically, the coefficient of variation of a sample is calculated using the following equation:

\[ CV = \frac{s}{\bar{x}} \times 100\% \]

The coefficient of variation is especially useful when comparing data sets that have different units, because the coefficient of variation is a dimensionless number. So when comparing data sets with different units or widely different means, one should use the coefficient of variation instead of the standard deviation.

Example: A national sampling of prices for new and used houses found that the mean price for a new house is $120,000 with a standard deviation of $6100, and that the mean price for a used house is $50,000 with a standard deviation equal to $3150. In terms of absolute deviation, the standard deviation of price for new houses is nearly twice that of used houses. However, in terms of relative variation, there is more relative variation in the price of used houses than in new houses.
The CV for used houses is 3150 / 50,000 × 100% = 6.3%.
The CV for new houses is 6100 / 120,000 × 100% ≈ 5.08%.

Properties:
When the mean value is near zero, the coefficient of variation is sensitive to small changes in the mean, limiting its usefulness.
The coefficient of variation is independent of a change of scale but not of origin.

Exercises
The mean and standard deviation of the heights of a group of teenagers are found to be 138 and 1.25 cm, while the same measures for their parents are 189 and 7.9 cm.
The coefficient of variation of the teenagers is: ____
The coefficient of variation of their parents is: ____
There is more variability among: teenagers / parents
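The house-price comparison from the coefficient of variation example can be checked with a short Python sketch. The helper name `coefficient_of_variation` is an assumption made for this illustration; the figures are the means and standard deviations quoted in the example.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a percentage of the mean."""
    if mean == 0:
        # The ratio is undefined when the mean is zero (and unstable near zero).
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return std_dev / mean * 100


# Figures from the house-price example (in dollars).
cv_new = coefficient_of_variation(6100, 120_000)
cv_used = coefficient_of_variation(3150, 50_000)

print(f"CV for new houses:  {cv_new:.2f}%")   # 5.08%
print(f"CV for used houses: {cv_used:.2f}%")  # 6.30%
print("More relative variation:", "used" if cv_used > cv_new else "new")
```

Because the result is a ratio of two quantities in the same units, the same function works unchanged for heights in centimetres or prices in dollars.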