Download INTRODUCTION - Department of Mathematics and Statistics

INTRODUCTION STATISTICS deals with the - collection - description - interpretation of DATA. Data may arise naturally as observations of “things as they are”, or as a result of a controlled designed experiment. The collection of data must be done carefully and according to certain rules. We will explore one of these methods, called SIMPLE RANDOM SAMPLING in the next few pages. Beyond the technical details of collecting data in a specific discipline, the methods used will determine the possible uses to which the data can be meaningfully applied. The important thing to keep in mind is that data collected incorrectly may invalidate any conclusion we may draw from it. Once collected, it is often useful to describe the data through displays or summaries that bring attention to certain characteristics which may be of interest to us. The techniques used for such descriptions will involve us in the area of DESCRIPTIVE STATISTICS. The description of data is not usually an end in itself. We may be interested in what the data can tell us about the group from which it was drawn, but which remains mostly unsampled. The methods employed in dealing with such generalizations are known under the label of INFERENTIAL STATISTICS. In dealing with the questions raised by Inferential Statistics we are led to define two fundamental concepts, SAMPLE and POPULATION. (1) POPULATIONS AND SAMPLES POPULATION: the set of all measurements or objects of interest in a particular study. QUESTION: If the population is the thing that interests an investigator, why doesn’t she just simply examine each member of the population. ANSWER: Thus to obtain information about the population the investigator must use a __________. SAMPLE: a subset of the population. In a study, an investigator should use a SIMPLE RANDOM SAMPLE (srs) SIMPLE RANDOM SAMPLE (srs): a sample chosen in such a way that each member of the population has an equal chance of being chosen. REASON for using a srs: We can now refine our definition of “inferential statistics” and say that INFERENTIAL STATISTICS deals with ways of using the sample to draw conclusions about the population (from which it was drawn). (2) EXAMPLE (POPULATION AND SAMPLE) : A manufacturer of light bulbs wishes to estimate the mean (average) lifetime of 40 watt bulbs his company produces. He obtains a srs of the lifetime of four such bulbs. 300(hrs), 450, 500, 400. What is the population? What is the sample? Estimate the mean lifetime of the 40 watt bulbs produced by this company. Would you expect your estimate to be accurate? Explain. EXAMPLE (POPULATION AND SAMPLE): A candidate for political office wishes to assess his chances in the next election by estimating the proportion of voters in his district who will vote for him. He obtained a srs of 1000 voters and found that 600 voters in the sample intended to vote for him and 400 against him? What is the population? What is the sample? Estimate the proportion of voters in the population that intend to vote for this candidate. Would you expect your estimate to be accurate? Explain. (3) TYPES OF DATA In statistics one works with several kinds of data. NUMERICAL or QUANTITATIVE DATA is data measured on a numerical scale. It is used to tell us how much or how many. There are two types of numerical data. CONTINUOUS DATA is data such as length or weight, which in theory, can be measured to any degree of accuracy. DISCRETE DATA is data whose possible values are isolated points on the number line. For example, data such as the number of children in families or number of fish caught by fisherman is discrete data. CATEGORICAL or QUALITATIVE DATA is data that records qualities of persons or objects, such as names, gender, or student numbers. Categorical data, which has a natural ordering is called ORDINAL DATA. For example if each student in a statistics class is assigned one of the grades A, B, C, D or F, the resulting data is ordinal data. EXAMPLE: State what type of data would be recorded in each of the following cases. (a) In a sample of 1000 Canadian voters each is classified as being Liberal, Reform, NDP, or other. _______________________________and________________________ (b) A biologist measures the height of tomato plants. __________________________and______________________________ (c) In the automobile industry, quality control classifies each paint job as excellent, good, fair or poor. _______________________________and____________________________ (4) EXPLORATORY DATA ANALYSIS Frequency and Relative Frequency Distributions A set of raw data gives us very little information as to how the observations are distributed. For example, consider the following sample of grades from a large mathematics class (the grades have been ordered from lowest to highest for convenience). 55 65 69 73 77 83 55 66 69 73 77 83 55 66 69 73 78 83 56 56 57 57 57 66 67 67 67 67 70 70 70 71 71 74 74 74 74 76 78 79 79 79 80 84 84 85 85 87 58 67 72 76 80 87 59 67 72 76 80 88 59 68 72 76 80 88 60 68 72 76 81 89 60 68 73 76 82 92 60 69 73 76 82 92 61 69 73 77 82 94 65 69 73 77 83 What is the distribution of these grades? Are there more B’s than C’s? Do the grades clump around a certain letter grade? One method of describing a sample is to divide it into classes or categories and to find the number of sample values in each class (the CLASS FREQUENCY) and/or the proportion of sample values in each class (the CLASS RELATIVE FREQUENCY). Divide the sample of grades into five classes, say , [50, 60), [60, 70), [70, 80), [80, 90), and [90, 100), and find the frequency and relative frequency for each class. Frequency and Relative Frequency Distribution Class Interval [50, 60) Class Midpoint (mi) 55 Class Frequency (fi) 11 Class Relative Frequency (fi/n) 11/95 = .1158 [60, 70) [70, 80) [80, 90) [90, 100) n= 95/95 = 1.000 Notes: 1. For a class [a, b). ‘a’ is called the LOWER CLASS BOUNDARY (LCB) ,‘b’ is called the UPPER CLASS BOUNDARY (UCB). For example, for the second class [60, 70): LCB =________, UCB=____________ (5) 2. The CLASS MID-POINT is the average of the lower and upper class boundaries. For example, for the second class [60, 70), m2 = (LCB + UCB)/2 = The mid – point represents a typical value in the class. 3. The CLASS Width is CW = UCB – LCB. So for the second class CW = UCB – LCB = Frequency and Relative Frequency Histograms A FREQUENCY (RELATIVE FREQUENCY) HISTOGRAM of a sample is simply a picture of the frequency ( relative frequency) distribution of that sample. Example (FREQUENCY HISTOGRAM): To draw a frequency histogram of our sample of 95 grades from a large mathematics class we proceed as follows: (a) Form an xy coordinate system with the class boundaries on the horizontal axis and the frequencies marked on the vertical axis. (b) Above each class draw a rectangle, whose height is equal to the class frequency. (6) To draw a relative frequency histogram, mark the relative frequencies on the vertical axis and above each class draw a rectangle whose height is equal to the class relative frequency. Notes: 1. The class (relative) frequencies are proportional to the AREAS of the rectangles above the class. For example the area of the rectangle above class 2 is roughly twice the area of the rectangle above class 1. Thus the class (relative) frequency of class 2 is roughly twice that of class 1. 2. The class width determines the number of classes. This choice will affect the “look” of the histogram. For example, what would the frequency histogram of the class grades look like if we used one class [50, 100). (7) The Histogram on Minitab (a) MTB > hist c1 [Draws a histogram of the sample in c1; Minitab chooses the classes] (b) MTB > hist c1; MTB > start m; MTB > incr w. [ Draws a histogram of the sample in c1 with starting mid-point ‘m’ and class width (CW) ‘w’] Note: The Minitab Output gives the class mid-points mi and the class frequencies (counts) fi . By hand we obtain the class width. CW= the distance between successive mid-points. The lower and upper boundaries of the ith class are given by [LCB, UCB) = [mi - .5CW, mi + .5 CW) Example ( rem.mtw): A part of sleep called REM sleep is associated with dreaming. The data below gives the fraction of time spent in REM sleep for a random sample of 40 adults. 0.23 0.18 0.24 0.24 0.27 0.13 0.15 0.35 0.18 0.18 0.17 0.14 0.19 0.21 0.20 0.14 0.21 0.18 0.18 0.28 0.23 0.24 0.17 0.24 0.17 0.17 0.12 0.23 0.18 0.28 0.10 0.20 0.19 0.20 0.31 0.26 0.23 0.17 0.18 0.17 (8) A. MTB > hist c1 Histogram of %REM N = 40 Midpoint Count 0.10 1 * 0.12 1 * 0.14 3 *** 0.16 1 * 0.18 13 ************* 0.20 5 ***** 0.22 2 ** 0.24 8 ******** 0.26 1 * 0.28 3 *** 0.30 0 0.32 1 * 0.34 0 0.36 1 * (9) B. MTB > hist c1; SUBC> start .08; SUBC> incr .04. Histogram Histogram of %REM N = 40 Midpoint 0.0800 0.1200 0.1600 0.2000 0.2400 0.2800 0.3200 0.3600 Count 0 3 9 14 8 4 1 1 *** ********* ************** ******** **** * * (10) QUESTIONS: 1. With reference to histogram A answer the following: (a) What is the class width? (b) What are the boundaries of the class width? (c) What percentage of the subjects in the sample spent 23% or more of their sleep in REM sleep? 2. With reference to histogram B answer the following: (a) Find the boundaries of the class containing the greatest number of sample values? (b) How many sample values fall in the last four classes? (c) What is the shape of this histogram? 3. Suppose you wanted to use Minitab to form a histogram with the lower boundary of the first class equal to 1 and a class width equal to 6. What Minitab Commands would you use? (11) Ideally we want a histogram that gives us as much information as possible. If we have too few classes we lose too much individual information; if we have too many we lose overall information. MTB > hist c1; SUBC> start .1; SUBC> incr .18. Histogram of %REM Midpoint 0.100 0.280 N = 40 Count 19 21 ******************* ********************* (12) MTB > hist c1; SUBC> start .1; SUBC> incr .005. Histogram of %REM N = 40 Midpoint Count 0.10000 1 * 0.10500 0 0.11000 0 0.11500 0 0.12000 1 * 0.12500 0 0.13000 1 * 0.13500 0 0.14000 2 ** 0.14500 0 0.15000 1 * 0.15500 0 0.16000 0 0.16500 0 0.17000 6 ****** 0.17500 0 0.18000 7 ******* 0.18500 0 0.19000 2 ** 0.19500 0 0.20000 3 *** 0.20500 0 0.21000 2 ** 0.21500 0 0.22000 0 0.22500 0 0.23000 4 **** 0.23500 0 0.24000 4 **** 0.24500 0 0.25000 0 0.25500 0 0.26000 1 * 0.26500 0 0.27000 1 * 0.27500 0 0.28000 2 ** 0.28500 0 0.29000 0 0.29500 0 0.30000 0 0.30500 0 0.31000 1 * 0.31500 0 0.32000 0 0.32500 0 0.33000 0 0.33500 0 0.34000 0 0.34500 0 0.35000 1 * (13) Histograms for Discrete and Categorical Data For data which is discrete, individual data values may, in some instances be used as classes. Example: The table below gives the number of children in a random sample of 40 Canadian families. Number of Children Number of Families 0 1 2 3 4 8 12 15 4 1 (14) For a categorical data, the categories themselves may be used as classes. Exercise: The table below gives the final grades to a class of 100 students in an elementary statistics course: Grades Number of students A 15 B 25 C 30 (a) What type of data is presented here? (b) Draw a histogram of this data? (15) D 20 F 10 Some Typical Sample Shapes (1) A LEFT SKEWED sample is one whose histogram ( or stem and leaf plot) has a long left tail; the sample values tend to cluster at the right end of the scale and taper off at the lower end. (2) A RIGHT SKEWED sample is one whose histogram (or stem and leaf plot) has a long right tail; the sample values tend to cluster at the left end of the scale and taper off at the higher end. (3) A SYMMETRIC sample is one whose histogram ( or stem and leaf plot) is distributed approximately the same on each side of some central value. (4) A particular type of symmetric sample is a BELL- SHAPED; all bell shaped samples are symmetric, but not all symmetric samples are bell shaped. (16) The Stem and Leaf Plot Although a histogram tells us how many observations fall in a particular class, we lose information about the different values within a class. For example, in the histogram B of the REM sleep data the class [.26, .30) includes exactly 4 sample values; but unless we go back to the original data, we would not know that these values are .26, .27, .28 and .28. Note that the mid-point for this class is .28 and yet all four sample values in the class are at or below this mid-point. A different method for displaying data which overcomes some of these difficulties, and which is easy to do, even by hand, is the Stem and Leaf Plot. Recall that the position of an individual digit in a number tells us the value the digit represents. For example, consider the number 962.78. For this number 9 is 100’s digit, 6 is 10’s digit, 2 is 1’s digit, 7 is .1’s digit and 8 is .01’s digit. The stem and leaf plot uses selected digits ( called stems) to group the sample into classes. The individual observations are then represented by the stem and the next significant digit. This is called the “truncation” method. EXAMPLE: Consider the following sample data of English Scholastic Aptitude Test (SAT) scores. 638 574 627 621 705 690 522 612 594 581 640 653 638 760 491 The data here ranges from 491 to 760. Thus it is reasonable to group the sample using the 100’s digit. The number “638” is plotted as 63, (8 is truncated). “6” is called a stem. The stem unit is SU=100 since the stem digit is the 100’s digit. “3” is called a Leaf. The leaf unit is LU=10, since the leaf digit is 10’s digit. “63” represents 63 leaf units; thus 63=63(LU)=63(10)=630. Thus although a value read from the plot contains more information about the observation than is available from a histogram, it may only be an approximation to the actual sample value because of truncation. [ 63 = 630 approximates the actual value of 638]. (17) Notes: 1. Stem unit = 10 (Leaf unit) i.e. SU=10(LU). 2. The leaf always consists of a single digit [anything more is truncated]. To illustrate these ideas, we plot the sample as follows: 638 574 627 621 705 690 522 612 594 581 640 653 638 760 491 Initial Plot: Stems Leaves Final Plot: Stems Leaves LU= 10, SU = 100. This plot is called one leaf category per stem plot (LCPS). The number of LCPS merely gives the number of lines for which the stem value is the same. Two leaf category per stem plot: Stems Leaves LU= 10, SU=100. (18) Five Leaf Category per stem plot: STEMS LEAVES LU=10, SU=100. Note: The increment of a stem and leaf plot is the distance from one line to the next. INCR= SU/#LCPS. Notice that the stem together with the increment represent classes. In a one leaf category per stem plot the increment is 100. Therefore, this plot represent the following 4 classes. [400,500), [500,600), [600,700), [700, 800) While a histogram gives the number of sample values in each class, the stem and leaf display plots the (truncated) sample values belonging to each class. (19) The Stem and Leaf Plot on Minitab MTB> stem c1 [Draws a stem and leaf plot of the sample in c1] Note: The Minitab output gives only the leaf unit(LU). The stem unit (SU) can be obtained by the rule: SU=10*LU Example: Consider the sample data of English Scholastic Aptitude Test (SAT) scores 638 574 627 621 705 690 522 612 594 581 640 653 638 760 491. MTB> stem c1 Stem-and-leaf of C1 Leaf Unit = 10 1 2 5 (6) 4 2 1 4 5 5 6 6 7 7 N=15 9 2 789 122334 59 0 6 Leaf Unit: LU=10 Stem Unit: SU=10*LU=100 #LCPS=2 INCREMENT=SU/#LCPS = 100/2 = 50. Values are easily read from the plot. For example the smallest sample value is 49 = 49(LU) = 49*10 = 490 Note: 1. Minitab also displays a depth column. The numbers in this column count cumulatively the number of leaves from the bottom and top lines as long as the count is  n/2. If there is a row remaining ( the middle row) the number of leaves in this row is given in parentheses [ In the example above (6)]. 2. The scale ( i.e. the stem unit and number of leaf categories per stem ) can be controlled using the subcommand INCRement. This is illustrated below. (20) To obtain a stem and leaf plot with stem unit SU and 1,2 or 5 leaf categories per stem use the STEM command with the subcommand INCRement where INCR= SU/1 for a plot with stem unit SU and 1 LCPS = SU/2 for a plot with stem unit SU and 2 LCPS =SU/5 for a plot with stem unit SU and 5 LCPS Example: Consider the English SAT scores of the previous example. Suppose we want a stem and leaf plot with SU=100. (a) MTB> stem c1; SUBC>incr 100. Stem-and-leaf of C1 N=15 Leaf Unit = 10 1 5 (8) 2 4 9 5 2789 6 12233459 7 06 (b) MTB> stem c1; SUBC> incr 50. Stem-and-leaf of C1 N=15 Leaf Unit =10 1 2 5 (6) 4 2 1 4 9 5 2 5 789 6 122334 6 59 7 0 7 6 (c) MTB> stem c1; SUBC> incr 20. Stem-and-leaf of C1 N=15 Leaf Unit= 10 1 4 9 1 5 2 5 2 2 5 3 5 7 5 89 5 6 1 (4) 6 2233 5 6 45 3 6 3 6 9 2 7 0 1 7 1 7 1 7 6 (21) Example: weights ( in ounces) of 30 major league baseballs are 5.08 5.21 5.26 5.26 5.04 5.21 5.28 5.04 5.17 5.12 5.30 5.07 5.06 5.17 5.24 5.13 5.26 5.16 5.09 5.22 5.06 5.09 5.24 5.22 5.22 5.11 5.23 5.06 5.27 5.13 Construct a stem and leaf plot with stem unit .10 and with two leaf categories per stem. MTB> stem c1; SUBC> incr .05. Stem-and-leaf of C1 N=30 Leaf Unit =0.010 1 9 13 (3) 14 6 1 50 50 51 51 52 52 53 44 6667899 1233 677 11222344 66678 0 QUESTIONS: 1(a) What value does “52  3” on the plot above represent? (b) Find the sample range R. . 2. Suppose you wish to use Minitab to draw a stem and leaf plot. What increment would you use to obtain a plot with the following specifications? (a) Leaf Unit = 1 and one leaf category per stem. SU= INCR= (b) Stem Unit = 10 and 5 leaf categories per stem. INCR= (22) MEASURES OF THE LOCATION or CENTRE OF A SAMPLE Sample Mean: Let a sample has n observations X1 , X2 , . . . , Xn . The sample mean is denoted by the symbolX and is defined as follows:  is the upper case Greek letter “sigma”. In statistics it means to sum. Thus Xi means sum of the sample values Xi . (23) Example: Stop watches are tested for reliability by counting the number of cycles (on-off-restart) till some part of the mechanism fails. The data below gives the failure time ( in thousands of cycles) for a random sample of a certain type. 22, 2, 12, 18, 16 X = Xi / n = 70/5 = 14 __________________________________________________________ 0 5 10 15 20 25 Sample Median : The value above and below which approximately 50% of the sample falls is called the median. It is denoted by the symbol m. Steps in the calculation of the sample median (1) Order the sample values from smallest to largest. (2) Calculate the approximate position of m: AP(m)=(50/100)n, where n is the sample size. (3) From (2) calculate the exact position of m as follows: (i) If AP(m) is not a whole number: Pos(m) = AP(m) rounded up to the next whole number. (ii) If AP(m) is a whole number: Pos(m)= AP(m) + .5 (4) RESULT: m is the value at Pos(m). If Pos(m) is not a whole number, there is no observation in that position. In that case m is the average of the values in the two positions on either side of Pos(m). (24) Example: Consider the stop watch data of the previous example. For this sample, (1) Ordered Sample: 2 12 16 18 22 (2) AP(m)= (.50)n = (.50)5 = 2.5 (3) Pos(m) = 3 (4) m=16 Interpretation: Approximately 50% of the failure times in the sample fall below m=16. Example: Incomes ( in $1000) of a sample of 8 residents of a hypothetical town are: 29 16 17 19 964 26 10 29 = Xi / n = 1110/8 = 138.75 Now we calculate sample median for this sample: (1) Ordered Sample: 10 16 17 19 26 29 29 964 (2) AP(m) = (.50)n = (3) Pos(m) = (4) m = Interpretation: Approximately 50% incomes are below 22.5 thousands in the sample. Note: Data values like “964” in the above example are called OUTLIERS. These are extreme data values which are either (a) infrequently occurring members of the population, or (b) sample values that do not belong to the population such as measurement errors, recording errors, or measurements from the wrong population. In case (a) outliers should not be removed from the sample. In case (b) they should be removed. (25) Questions: 1. How sensitive to outliers is (a) The sample meanX ? (c) The sample median m? 2. In general which is a better measure of the centre ( or location) of a sample when the sample is (a) Skewed or has outliers ______________________ (b) symmetric __________________________ 3. What does each of (a)-(c) below indicate about the shape of the sample? (left skewed, right skewed or symmetric): (a) X is well above m? ________________ (b) X is well below m?___________________ (c) X is close to m ?___________________ (26) The Sample Mean and the Sample Median on Minitab Consider the data from our previous example: Incomes ( in $1000) of a sample of 8 residents of a hypothetical town. C1: 10 16 17 19 26 29 29 964 MTB> mean C1 Mean = 138.75 MTB> medi C1 Median = 22.5 MTB> desc C1 C1 N MEAN MEDIAN 8 139 22 MIN MAX Q1 10 964 16 TRMEAN 139 Q3 29 STDEV 334 Note: To see how the outlier “964” affects the sample mean sample without the outlier. C2: 10 16 17 19 26 29 SEMEAN 118 , consider the 29 MTB> mean c2 Mean = 20.857 MTB>medi c2 Median = 19.00 Notice that the removal of the outlier has changed the sample mean from 138.75 to 20.857 while the sample median has changed little [from 22.5 to 19] (27) MEASURES OF VARIATION (SPREAD) OF A SAMPLE VARIATION: a measure of how far the sample values are from their central value. There are many such measures. One convenient way is as follows: Total Variation: SSTO=  ( )2. The average value of the total variation for a sample is known as the Sample Variance: s2 = SSTO n 1 Note that in calculating the average here we divide by n-1 rather than n. It turns out that a sample variance defined this way has better statistical properties. Since the total variation and the sample variance are calculated as squares of the observations, their unit of measurement is the square of the unit of measurement of the data ( for example if the data is measured in centimeters (cm) then SSTO and s2 are measured in centimeters2 (cm2)). To obtain a measure of variation which has the same units of measurement as the data we take the square root of s2 to get the Sample Standard Deviation: s =  s2 (28) Example: The stopwatch data from a previous example is given below: 22 We have calculated _______ SSTO = sample mean. 2 12 18 16 X =14. We now calculate SSTO, s2 and s. ______________ _____________ , Total squared deviations of the sample values from the s2 = SSTO/n-1 = = , roughly the average distance2 of the sample values from the sample mean. s= = Some sort of “average distance” of the sample values from the sample mean. (29) Computational Formula for Total variation The computation of total variation can actually be done without calculating An alternative formula is SSTO =  xi2 - (xi)2/n For the stopwatch data __________ _______________ SSTO = s2= SSTO/n-1 = s=  Example: Below is a srs of 36 grades of students in a mathematics course 57 83 75 79 60 75 60 51 57 47 89 50 62 54 96 66 60 55 75 78 59 62 57 77 For this sample n=36, x=2352, 56 65 52 78 73 62 68 64 64 68 57 61  x2 = 158,210 Sample mean = = SSTO = = = s2= SSTO/n-1 = ;s= (30) SAMPLE PERCENTILES The median splits the sample evenly into two halves. We can define measures that divide the sample into parts of different size. The rth Percentile of a sample is a value Pr such that (approximately) r% of the sample falls below Pr. Example: If a students’s score on a university entrance exam is the 84th percentile, this means that approximately 84% of all the scores in the exam are below this student’s score. STEPS IN THE CALCULATION OF Pr FOR A SAMPLE OF SIZE n (1) Order the sample values from smallest to largest. (2) Calculate the approximate position of Pr: AP(Pr)=(r/100)n (3) From (2) calculate the exact position of Pr as follows: If AP(r) is not a whole number: Pos(Pr) = AP(Pr) rounded up to the next whole number. If AP(Pr) is a whole number: Pos(Pr) = AP(Pr)+.5 (4) RESULT: ‘Pr’ is the value at Pos(Pr). If Pos(Pr) is not a whole number, there is no observation in that position. In that case, Pr is the average of the values in the two positions on either side of Pos(Pr). Example: Below are SAT (scholastic aptitude test) mathematics scores for a srs of 32 university applicants. [the sample has been ordered for convenience] 484 490 506 509 523 532 539 539 544 545 550 558 578 580 591 593 610 610 630 634 641 647 648 655 662 673 682 688 693 726 745 780 (i) 60th Percentile: AP(P60) = ; Pos(P60 ) = P60 = Interpretation: (ii) 25th Percentile: AP(P25)= ; Pos(P25) = P25 = Interpretation: (31) SPECIAL PERCENTILES First Quartile: Q1 = P25 Second Quartile: Q2 = P50 = m Third Quartile: Q3 = P75 Example: Below are the scores of a profile test given by a psychologist to random sample of 30 convicted felons. The scores have been ordered from smallest to largest. 146 165 171 179 181 184 190 191 192 192 192 193 195 196 196 197 198 199 200 200 201 203 204 205 206 213 215 221 232 247 First Quartile: Second Quartile: Third Quartile: (32) The Boxplot (33) Example: Consider Psychological profile scores for 30 convicted felons. For this sample we know that m = 196.5, Q1 = 191 and Q3 = 204 Inner Fences: LIF = HIF = Outer Fences: LOF = HOF = Sketch : (34) The Boxplot on Minitab MTB>gstd MTB> boxp C1 [draws a boxplot for the sample in C1] Note: 1. On the boxplot scale one space (sp) “-“ is given by sp = ( the distance between successive tick marks [+] )/10 2. Values read on the boxplot scale are approximate. Example: Consider the psychological profile score data MTB > boxp C1 ------------I + I--------* O ------+---------+---------+---------+---------+---------+------C1 140 160 180 200 220 240 O * * Question: From the boxplot scale determine the ( approximate) values of the sample median, first and third quartiles and possible and probable outliers. sp= Q1 = ; m= ; Q3 = Possible Outliers: Probable Outliers: (35) Identifying Outliers in a Stem-and–leaf Plot Minitab gives us an option when we use the stem command to identify outliers as calculated in the boxplot. Using the STEM command with the subcommand TRIM one can form a stem and leaf plot with the outliers trimmed away and shown on special lines labeled LO and HI. In this way one can examine the shape of the sample with the outliers removed. Example: Scientists use ph (power of Hydrogen) as a numerical measure of acidity with a ph=7.00 being neutral. A scientist suspects a ph meter is defective. He takes a random sample of 45 measurements with this meter on a neutral substance. The data have been entered into C1. 8.32 6.19 4.61 5.49 6.76 6.13 10.28 6.86 6.09 8.35 5.89 7.05 9.05 5.82 7.22 6.45 6.56 5.88 5.38 8.85 7.14 5.90 5.79 6.60 5.95 9.00 5.99 6.58 6.49 7.24 7.21 6.99 3.46 6.62 3.26 6.02 6.98 6.21 7.22 6.72 5.96 8.26 6.07 3.76 11.17 MTB > stem C1; SUBC> trim. Stem-and-leaf of PHlevel Leaf Unit = 0.10 LO 4 6 14 22 (9) 14 8 8 5 4 4 5 5 6 6 7 7 8 8 9 HI N = 45 32, 34, 37, Low Outliers(values below the LIF) 6 34 78889999 00011244 555677899 012222 233 8 00 102, 111, High Outliers(values above the HIF) (36) The outliers are the values beyond the inner fences. We can show this using the approximate sample values read from the plot. See the calculations below: To show the “HI” values are outliers, we need calculate the high inner fence. AP(Q1) =(25/100)45 = 11.25, Pos(Q1) = 12 Q1 = AP(Q3) = (75/100)45 = 33.75; Pos(Q3) = 34 Q3 = IQR = HIF= The values 10.2 and 11.1 are both bigger than 9.15 and are plotted as “HI” outliers in the stem and leaf plot. Similarly, LIF = The values 3.2, 3.4 and 3.7 are all smaller than LIF and are plotted as “LO” outliers in the previous stem and leaf plot. (37) Bivariate Data All the data sets we have encountered up to now deal with measurements of one variable on each of several “individuals” (i.e. incomes of individuals, test scores etc.). Such type of data is called univariate data. In many situations it may be of interest to take measurements of two variables on each of several individuals. For example we may be interested in both the income of an individual and the number of hours of television watched per week. In such a case each observation will consist of two values ( a pair of measurements) (income, number of hours watched). Such type of data is called bivariate data and is often denoted in the form; (x,y), the “x-observation” and “y-observation”. A bivariate sample of ‘n’ observations would then be written as (x1,y2), (x2,y2),. . ., (xn,yn) The reason we take observations on two variables is that we are interested in exploring the relationship between the variables. For example “ Do poorer people tend to watch more TV”. The simplest possible relationship between two variables x and y is that of a straight line. (38) Measuring Association Between Variables Here are some typical examples of association: Positive linear relationship between x and y: Curvilinear relationship between x and y: Negative linear relationship between x and y: There are many measures of association in statistics and the term CORRELATION may be applied to such measures. We will study a specific type of correlation called PEARSON’S SAMPLE CORRELATION COEFFICIENT which measures the strength of the linear relationship between the two numerical variables. (39) Pearson’s Sample Correlation Coefficient r Consider the following bivariate data on two class quizzes: x y 1 5 1 4 2 4 4 2 5 1 5 2 2 2 4 4 sxy = = = sxy is called the sample covariance and is a measure of the linear relationship between x and y. To make this value independent of the units of measurement we divide by sx = 1.690 and sy =1.414. The scatter plot of this sample data is as follows: (40) Pearson’s sample correlation coefficient r is r = _______________________ =_______________________ = Note: r is always between –1 and +1. The nearer r is to –1, the closer the plotted points are to a straight line with negative slope. The nearer r is to +1, the closer the plotted points are to a straight line with positive slope. ( If r is equal to –1 or +1 then all the points are on a straight line). Note: Pearson’s sample correlation coefficient r is a measure of the linear relationship between two variables X and Y. Other relationship may exist which are not detected by r. Example: Consider the sample of pairs (-1,1), (0,0), (1,1). We use Minitab to calculate Pearson’s sample correlation coefficient r. C1: -1 0 1 C2: 1 0 1 MTB> corr c1 c2 Correlations (Pearson) Correlation of X and Y =0.000 The correlation r is 0. Thus there is no linear component to the relationship between X and Y; note however that for each pair (x,y), y=x2, so there is another type of relationship between X and Y ( a quadratic relationship). (41) Correlation and Causality Just because the value of r is close to 1, does not of itself mean that x “causes” y. A strong ( negative or positive) correlation does not necessarily imply a cause and effect relationship between X and Y. Often there is a third hidden variable which creates an apparent relationship between X and Y. Consider the example below. Example: A sample of students from a grade school were given a vocabulary test. A high positive correlation was found between X= “student’s height” and Y= “student’s score on the test” Should one infer that growing taller will increase one’s vocabulary? Explain. Also indicate a third variable which could offer a plausible explanation for this apparent relationship. Note: A cause and effect relationship is best established by an experiment in which other variables which influence X and Y are controlled. Question: In the example above, how might a more controlled experiment be performed? (42) Fitting a Straight Line: The Least Squares (Regression) Line A set of points (bivariate observations) may or may not have a linear relationship. If we want to draw a straight line which “fits” these point, which line fits best? Our choice depends on what we use to define a “good” fit. Given a sample of bivariate data (x1, y1), (x2,y2), . . .(xn,yn), the least squares line L is the line fitted to the points in such a way that d12+d22+…+dn2 is as small as possible. The equation of the least squares line L is given by y =b0 + b1x, where b1 = (syr/sx) and b0 = Note: 1. The sign of the slope (b1) is the same as the sign of the sample correlation coefficient. 2. The least squares line always passes through the point ( (43) ). Example: Consider the previous example of scores on two class quizzes. Sample data is (1,5), (1,4), (2,4), (4,2), (5,1), (5,2), (2,2), (4,4). In our calculation of r=-.717, we also found that x =3, y =3, sx =1.690 and sy = 1.414 Thus, b1 = (syr/sx) = b0 = = = ŷ = = b0 + b1x SKETCH: Note: 1.The least squares line only fits the points used in its calculation. Do not be tempted to extend it below the smallest x-value or above the largest x-value. New observations with x-values outside this range may not fall anywhere close to the fitted line. So if a situation requires you to extend the line, do so with caution. 2. In the language of regression r2 is known as the COEFFICIENT OF DETERMINATION. It is used in regression to measure how well the least squares line fits the points. r2 satisfies the following inequality 0 r2 1 (44) Pearson’s Sample Correlation Coefficient and the Least Squares Line on Minitab Example: The data below gives the cost estimates and actual costs (in millions) of a random sample of 10 construction projects at a large industrial facility. Estimate(X) 44.277 2.737 7.004 22.444 18.843 46.514 3.165 21.327 42.337 7.737 Actual(Y) 51.174 9.683 14.827 22.159 26.537 50.281 15.550 23.896 50.144 13.567 To find Pearson’s sample correlation coefficient ‘r’ for this data and the equation of the least squares line using Minitab, proceed as follows. Name C1 as EST(X) and C2 as ACT(Y), then enter the x-data into C1 and the corresponding y-data into C2. Then Click: Stat Regression  Fitted Line Plot Type in C2 in the Response (Y) box and type in C1 in the Predictor(X) box. Now click OK. Regression Plot ACT(Y) = 7.49228 + 0.937658 EST(X) S = 3.49312 R-Sq = 96.0 % R-Sq(adj) = 95.5 % 50 ACT(Y) 40 30 20 10 0 10 20 30 EST(X) (45) 40 50 From the output we find that the coefficient of determination r2 = .960 and the equation of the least squares line is ŷ =7.49228+.937658x Questions: (1) What is the value of the sample correlation coefficient? Interpret it. (2) Predict the actual cost of a project whose estimated cost is 15 million dollars. (3) Is it safe to use this data to predict the actual cost of a project estimated to cost 80 million dollars? Explain. (4) Estimate the average change in the actual cost of a project when the estimated cost increases by 10 million. (46) PROBABILTY In life an action (or an “experiment”) will produce one of a number of possible outcomes, but the exact nature of outcome may not be precisely predicted. Thus when we toss a coin we know the outcome will be “heads” or “tails” but we will not be able to predict on any given toss which of the two possibilities will actually occur. In real life few things can be predicted exactly and there are situations in which some outcomes may be more likely than others. For example you know that you are more likely to do well in this course than to purchase a winning Loto 6-49 ticket! We can formalize this feeling about the likelihood of an event by giving it a value which expresses the level of uncertainty associated with the event. A PROBABILITY is a number between 0 and 1, which measures how likely an event is to occur. The SAMPLE SPACE is the collection of all possible outcomes that can occur in an experiment. The identification of all the possible outcomes in S is often the first step in determining probabilities. An EVENT is a subset of the sample space S. Events are usually denoted by upper case letters (A,B,C,…) and the probability of an event E is denoted by P(E). If all the outcomes in the experiment are “equally likely”, then the probability of an event is given by P(Event) = Number of outcomes in the event/Number of outcomes in S = N(Event)/N(S) In many experiments we can use a “tree diagram” to list the possible outcomes of the experiment (i.e. the Sample Space S) and the probabilities of each outcome. Example : A “ fair” coin is tossed. List the sample space S and the probabilities of each outcome. (47) Example: A fair four sided die is rolled twice. (a) Use a tree diagram to list the sample space S. (b) List and find the probability of each of the following events. A = “ a sum of 4 or less” B = “ the second roll is a 2” C= “ a sum of 7” D = “ a sum greater than 8” (48) Operations on Events and Associated Rules of Probability 1. P(S) =1, P() =0 [Since, P(S) = N(S)/N(S) =1 and P() = 0/N(S) =0] Here  is the symbol for the so-called “empty set”, the event that contains no outcomes; so N() =0. 2. Complement: The complement of A is the event that A does not occur. It is made up of all outcomes which are not in A. Complement Rule: (49) 3. Intersection: A and B [AB] The intersection of two events A and B is the event that both A and B occur. It is made up of all the outcomes which are common to A and B. (50) 4. Union: A or B [AUB] The union of two events A and B is the event that A or B or both occur. It is made up of all the outcomes which are in A together with all the outcomes which are in B. Addition Rule : P(A or B) = P(A) + P(B) – P(A and B) (51) 5. Mutually Exclusive Events Two events A and B are mutually exclusive (or disjoint) if their intersection is empty, that is they have no outcomes in common. Practically speaking this means that A and B cannot both occur in a given experiment. This is written as A and B =. Addition Rule for Mutually Exclusive Events: P(A or B) = P(A) + P(B) (52) Although some what artificial, examples with balls of different colors and numbers being drawn from boxes are useful in highlighting some of the concepts and rules we have introduced. Here is a typical example. Example: Box I contains two balls one red and one black, box II contains three balls numbered 1,2 and 3. A ball is randomly selected from each box. (a) List the sample space. (b) List and find the probability of the following events: (i) A = “ an odd number is drawn”, (ii) B = “ a black ball is drawn”, (iii) A and B, (iv) A or B, (v) Complement of A. (c) Are the events A and B mutually exclusive? (53) Previous Example continued: (54) Condition Probability: The conditional probability of an event A given that event B has occurred is defined by P(AB) = P( A and B)/P( B) Note: This definition implies that for any events A and B P( A and B) = P(AB)P(B) = P(BA)P(A) This statement is called the Multiplication Rule. Example: Suppose two cards are selected at random from a deck one at a time and without replacement. Find the probability that (a) an ace is drawn first and a king is drawn second. (b) two queens are drawn. (55) Example: In a large insurance agency, 60% of the customers have automobile insurance 40% of the customers have homeowners insurance 75% of the customers have one type or the other. (a) Find the proportion of customers with both types of insurance. (b) Find the probability that a customer has homeowner insurance given that he has automobile insurance. (56) Independence Sometimes additional information is not relevant and does not lead to an update of the probability. Example: A group of 200 voters in a certain electoral poll are classified according to gender on one hand and candidate preference on the other. Male Female Total A 20 40 60 B 36 44 80 C 24 36 60 Total 80 120 200 Probability a voter favours candidate A given that the voter is a male is P(A/M) = 20/80 =.25 Notice that the probability a randomly chosen voter favours A is P(A)= 60/200 = .30 While 30% of this group of voters favour candidate A, a different percentage of the male voters (namely 25%) favour candidate A. Thus knowledge of the gender of the voter is useful and leads us to update our probability of candidate A preference. The events A = “a voter favours candidate A” and M=”a voter is a male” are called DEPENDENT EVENTS because P(AM)P(A). Now look at the probability a voter favours candidate C given that the voter is a male. P(CM) = 24/80 = .30 Notice that the probability a voter favours candidate C is P(C) = 60/200 = .30 Now in this case knowledge of the gender of the voter gives us no additional information regarding preference for candidate C, and does not lead to an updating of this probability ( while 30% of this group of voters favour candidate C, the same percentage of the male voters also favour candidate C). The events C = “ a voter favours candidate C” and M=”a voter is a male” are called INDEPENDENT EVENTS because P(CM) = P(C). (57) There are a number of equivalent ways of defining independence. Two events A and B are said to be INDEPENDENT if any of the following statements hold: (a) P(AB) = P(A) (b) P(BA) = P(B) (c) P(A and B) = P(A)P(B). Statement (c) is the MULTIPLICATION RULE FOR INDEPENDENT EVENTS. Example: Suppose A is the event an adult is a male and B is the event an adult favours capital punishment. Further suppose that in a certain country, 50% of the adults are male, 80% of the adults favour capital punishment, and 40% of the adults are male and favour capital punishment. For this country, answer the following: (a) Are the events A and B independent? Explain. (b) Are the events A and B mutually exclusive? Explain. (58) Example: For a certain population of individuals 52% are overweight, 30% have high blood pressure, 20% are both overweight and have high blood pressure. Is the fact that a person is overweight independent of his/her blood pressure? Example: An unfair coin has a probability of 0.7 of coming up “heads” when tossed. Suppose it is tossed in such a way that the outcome of each toss is independent of the outcome of any other toss. (a) If the coin is tossed n=2 times, find the probability of obtaining two heads. (b) If the coin is tossed n = 3 times, find the probability of obtaining two heads followed by a tail. (59) POPULATIONS AND RANDOM VARIABLES Often the things we sample from a population are actually measurements of some characteristic. These measurements will naturally vary in some random way from observation to observation. This type of measurement, taken at random from a population, is called a RANDOM VARIABLE, and we will use the symbol X to refer to it. Example: Hospital records show that for patients undergoing a certain surgical procedure, the length of stay in hospital was 2 days for 10% of the patients, 4 days 20% of the patients, 5 days 40% of the patients, and 6 days 30% of the patients. Let X= the length of stay for a randomly selected patient. We can “model” the population of these measurements by using the “PROBABILITY DISTRIBUTION” of the random variable X. This is simply a table which gives the probability that the value of X that we pick will be a certain number. Probability Distribution of X x P(X=x) or p(x) 2 4 5 6 We obtain the following probabilities from this probability distribution. 1. P(X=4) = 2. P(X=2 or X=5) 3. P(X 4) = 4. P(X>3) = 5. P( 2 < X  6) = 6. P( X < 4 or X  5) = (60) Total The Mean, Variance and Standard Deviation of a Random Variable X MEAN: The population mean (or the mean of the distribution of X) is denoted by  or x and is given by the formula  =  xp(x) Variance: The population variance ( or the variance of the distribution of X) is denoted by 2 [ or x2 ] and is given by the formula: 2 =  (x - )2p(x) = x2p(x) - 2 The population standard deviation ( or the standard deviation of the distribution of X) is denoted by  or x and is defined as  = 2 Example: Consider the probability distribution of the random variable X in the previous example. x p(x) 2 .1 4 .2 5 .4 6 .3  = xp(x) = = [the mean (average) length of stay for patients undergoing this surgical procedure] 2 =  (x -  )2p(x) = = = (61) A Counting Rule Definition: n! ( read as “n factorial”) means the following: n! = n(n-1)(n-2)...(3)(2)(1) Note: By definition 0! = 1. Example: (a) 3! = (b) 5! = An Important Formula: Let Cnx ( read “ n choose x”) denote the number of ways of obtaining “ x heads in n tosses” of a coin. Then Cnx = n! x!(n  x)! The number Cnx is called a binomial coefficient. Example: How many ways are there to obtain (a) 2 heads in 3 tosses of a coin? (b) 0 heads in 3 tosses of a coin? (c) 12 heads in 15 tosses of a coin? Note: The formula used above is merely a short hand way of calculating something that would otherwise require us to draw a tree diagram. For example, the answer to (a) could have been obtained as follows: (62) THE BINOMIAL EXPERIMENT Conditions for a Binomial Experiment 1. n identical trials are performed, where n is determined prior to the experiment A Coin Tossing Experiment 1. A fair coin is tossed 10 times exactly the same way each time. So 10 identical trials are performed. 2. The trials are independent; so the outcome of any particular toss does not influence the outcome of any other toss. 3. Each trial has two possible outcomes, heads (success) and tails (failure) 4. At each trial: P(success) = p =.5 P(failure) = q = .5, since the coin is fair. The Binomial Random Variable is X = the total number of heads in the 10 tosses. 2. The trials are independent; so the outcome of any particular trial does not influence the outcome of any other trial. 3. Each trial has two possible outcomes which we call “success” and “failure”. 4. The probability of “success” is same for each trial and is denoted by “p”. Thus at each trial: P(success) = p P(failure) = 1 –p =q. The Binomial Random Variable is X= the total number of successes in the n trials. Example of Binomial Experiment: A manufacturer supplies a certain component for automobile companies. 2% of the components produced are defective. A srs of 30 components is examined. Of interest is the number of defective components in the sample. 1. n = , identical trials are performed. A trial consists of ____________________________________________________ 2. The trails are independent. 3. Each trial has two possible outcomes, Success: _______________________________________________ Failure:________________________________________________ 4. At each trial : P(success) =p = P(failure) = q = The Binomial Random Variable is : X= (63) Note: In general If X is a Binomial(n,p) ;then, P(X=x) = p(x) = Cnxpxqn-x; x = 0,1,2,…,n; q = 1-p. n! where, Cnx = x!(n  x)! Example: A biased coin which has P(heads) = p =.7 and p(tails) = q = .3 is tossed 3 times. The coin is tossed in such a way that the outcomes on each toss are independent. We obtain the probability of 0,1,2 and 3 heads using the above formula: (i) P(X=0) = (ii) P(X=1) = (iii) P( X=2) = (iv) P( X=3) = (64) Probability Histogram for a Binomial Distribution: Consider our binomial n=3, p=.7 random variable X above. x p(x) 0 .027 1 .189 2 .441 3 .343 Questions: 1. What is the shape of this histogram?______________________ 2. What do you think the shape of this histogram would be if (a) p < .50?__________________________________ (b) p =.50?____________________________________ (65) Mean and Variance of a Binomial Random Variable Again consider our binomial n=3, p=.7 random variable X above. Using the usual formula for  and 2 , we obtain  =  xp(x) = 2 =  (x -  )2 p(x) = = Fact: It can be shown that for a Binomial random variable X, the mean , the variance 2 and the standard deviation  can be calculated using the following formulas:  = np, 2 = npq, and  =  npq Thus in the example above:  = np = 3(.7) = 2.1 2 = npq = = (66) Example: In a multiple choice test, there are 5 questions each with three possible answers. For each question a student chooses an answer at random. Find the probability that she gets (a) exactly 3 correct answers (b) three or fewer correct. Let X = the number of correct guesses out of 5. X is binomial , n = _____________, p = _______________ P(X=x) = p(x) = (a) P(X=3) = (b) P(X3) = (67) Example: A variety of seed has a 40% chance of germinating. Ten seeds are planted. You are interested in X = the number of seeds out of ten that germinates. Then X is _______________, n = ____________, p = ______________ (a) find the mean (expected) number of seeds that will germinate. (b) the variance and standard deviation. (c) The probability that exactly 6 seeds germinate. (d) The probability that fewer than 6 seeds germinate. (68) (e) The probability at most six seeds germinate. (f) The probability at least 6 seeds germinate. (g) The probability that more than 6 seeds germinate. (69) CONTINUOUS RANDOM VARIABLES The Binomial random variable is called a discrete random variable because it takes on discrete values 0, 1, 2, 3, ..,n. Consider the probability histogram of X, Binomial n =3, p =.7. The areas under the histogram are related to the probability distribution. P(X=2) = P(X 2) = The total area under this probability histogram is p(0) + p(1) + p(2) + p(3) = 1 A continuous random variable X is one which represents measurements that (theoretically) can be made to any degree of accuracy. For example suppose X = the weight (in kg) of a randomly chosen newborn baby. Depending on the accuracy of our scale the weight X of a randomly selected baby could be recorded either as 3 or 3.3 or 3.26 or 3.258 etc.. The probability histogram for such a variable has to be formed in a very different way than for a discrete random variable. For example, in the case of “X=the birth weight of a newborn”, we could take a very large sample from the population of newborns, measure the sample birth weights very accurately (with many decimal of accuracy) and form a histogram of the birth weights using classes of small width. If the vertical scale is adjusted so the total area under the histogram is one then area under the histogram can be used to calculate (approximately) the probabilities. The larger the sample and the more accurate our measurements, the more accurate will be these probabilities. In the illustration below a srs of 812 babies was used to form a probability histogram. (70) Birth Weight (kg) of 812 Newborns Frequency 30 20 10 0 1 2 3 4 5 WT(kg) Let X be the birth weight of a randomly chosen newborn. P(X3) = Shaded Area. The larger the sample, the smaller we can make these rectangles, and the “smoother” will be the resulting histogram. Thus a “model” of the distribution could be obtained by fitting a curve to such a histogram and using the area under the curve to calculate the probabilities ( this curve is called a Probability Density Function). Thus P(X3) is the area under the curve to the left of 3. Thus total area under the curve must be 1. Note: As we know there are many different shapes among various populations (e.g. left skewed, right skewed, symmetric etc.). In the class of bell-shaped curves there is a specific one which is called normal curve ( there is a mathematical formula which defined it exactly). If a population can be modeled by this certain bell –shaped curve the population is said to have a Normal ( or Gaussian) Distribution. (71) The Normal Distribution A Normal ( or Gaussian) Population is one that can be modeled by a certain bellshaped curve called the normal curve. The population of weights of newborn babies described above is an example of such a population. In describing such a population, we need to know two quantities,  and  .  stands for the population mean and  stands for the population standard deviation. In our example  is the mean (average) weight of all newborn babies in the population and  measures the spread of the population values about the mean  . A Normal (or Gaussian) random variable X represents a randomly chosen measurement sampled from this population. Probabilities about X are found by finding the appropriate areas under the curve i.e. P(Xx) = the area under the normal curve to the left of x. Note: (a) The total area under the curve is 1. (b) For a normal random variable P(X=x) is always 0. Practically speaking this means that if we can measure observations very accurately, the chances of finding a newborn weighing exactly 3.0000000000kg, say, is very small. Thus for all practical purposes (X=3) =0. One consequence of this is that P(Xx) = P(X<x). The population of Z-scores of a normal population is called the Standard Normal Population. A randomly chosen measurement from the standard normal population is denoted by Z. Z is simply the Z-score of a normal random variable X. X  Z=  For the standard normal random variable Z: Z = 0 and Z = 1. We will show later that most of the time an observed value of Z will fall between –3 and +3. (72) Probabilities for the Standard Normal Probabilities for the standard normal distribution can be obtained from the Table A on pages T-2 and T-3. Examples: (a) (i) P(Z1.5) = (ii) P (Z>1.5) = (b) (i) P(Z  -1.5) = (ii) P(Z-1.5) = (c) P(-1.5  Z  1.5) = (d) P(-1.5 < Z < 2.21) = (73) Probabilities for Normal Random Variables in General Example: The heart rate of patients suffering from heart disease is normally distributed with a mean of 97 beats per minute and a standard deviation of 18 beats per minute. For a randomly chosen patient, find the probability the heart rate is (a) below 80 (b) more than 140, (c) between 55 and 90. Let X = the heart rate of a randomly selected patient;  =97,  = 18. (a) P(X< 80) = (b) P(X>140) = (c) P(55<X<90) = (74) Example: Let X be any normal random variable with mean  and standard deviation . What is the probability that X is within two standard deviations of the mean? First note that the statement “ X is within two standard deviations of the mean” means that X lies between  - 2 and  + 2. Thus P( X is within two standard deviations of the mean) = P ( - 2 < X <  + 2) = (75) Note: Similarly, (i) P(X is within one standard deviations of the mean) = (ii) P(X is within three standard deviations of the mean) = (76) Percentiles of the Standard Normal Distribution (Using the Normal Tables backwards) (a) Find the 95th percentile of the standard normal distribution i.e. find the value of z0 such that P( Z  z0 ) = .95 ( b) Find z0 such that P ( Z  z0 ) = .41 (c)Find z0 such that P( z0  Z  0 ) = .1 (d) Find z0 such that P( -z0  Z  z0 ) = .95 (77) Example: Let X be a normal random variable with mean  = 100 and standard deviation =10. Find x0 such that (a) P(X  x0 ) =.80 [i.e. x0 is the 80th percentile of X]. (b) P ( X  x0 ) = .025 . (78) Example : The scores on the Scholastic Aptitude Test ( SAT) for verbal ability of high school seniors is normally distributed with mean 430 and standard deviation 100. What score must a student attain in order to be in the top 5% of all the students who took the test? Example: The time to first failure for a certain model of television set is normally distributed with mean 5 years and standard deviation 1.56 years. If the manufacturer wishes to repair only 10% of the sets sold, for how long should he guarantee his product? (79) Normal Approximation to Binomial Probabilities Let X be a binomial random variable with parameters n and p. Mean and standard deviation of this distribution are given by   = np and  =  npq , q = 1-p. Then if both np  10 and nq  10 the binomial probabilities may be closely approximated by Normal probabilities in the following way. For a = 0,1,2,3…n P(X  a)  P(Z  a  .5  np npq ) Example: An automotive plant employs workers and suffers a daily absentee rate of 1%. (a) What is the mean ( or expected value) of absentees on a given day? (b) What is the variance and standard deviation of the absentees on a given day? (c) Find the probability that on a given day (i) at most 30 of the workers are absent, (ii) at least 60 of the workers are absent, (iii) between 30 and 60 (inclusive) of the workers are absent. Solution: Let, X = the number of workers absent on a given day. Therefore, X is binomial with n =5000 and p =.01. a.  = np = b. 2 = npq = = c. Since the values for n and p are not in Table C (Binomial Probabilities) and since using the binomial formula would take a very long time ( not to mention the difficulty of the calculation when n = 5000), we use the normal approximation to the binomial. Check: np = ; nq = (80) (i) P(X 30)  (ii) P( X  60 ) = (iii) P ( 30  X  60 ) = (81) Example : A travel agency promotes vacation packages by telephoning households at random in the evening hours. Historically, only 65% of heads of households are at home when the agency calls. Suppose that 30 households are phoned in a given evening. Find the probability that the agency will find (a) between 15 and 25 households, inclusively, with the head of the household at home. (b) fewer than 23 households with the head of the household at home. (c) P( 17 < X < 28) (82) The Central Limit Theorem Consider the following probability distribution . The random variable X is the length of stay in the hospital for a randomly selected patient undergoing a certain medical procedure. X: 2 4 5 6 p(x): .1 .2 .4 .3 The mean and standard deviation of this distribution are  = 4.8 and  = 1.166. Knowing the appropriate model for a population is like knowing the entire population itself. In a real life situation we would not know which model would be the appropriate one for the population of interest, and we certainly would not know the actual value of . The problem then is to ESTIMATE  in some way. One way to do this is to take a simple random sample from the population and then use it to calculate a value of the sample meanX . If we proceeded in this way, we would become aware of the fact that each time we took a new sample from the population and calculated X we would get a different value for our estimate of . Thus it is important to try to understand a bit more about the behaviour ofX. Mean, Variance and Standard Deviation of the Sample MeanX Given: 1. A population with mean  and standard deviation  . 2. A random sample of size n from the population: X1 , X2 , … Xn with sample mean X = (X1 +X2 +… Xn)/n Then:  =  [ the mean of the distribution ofX is equal to the population mean] X  =  / n [ the sd of the distribution of X is equal to ( the population sd)/n] X 2 =  2 /n [the variance of the distribution ofX is equal to X (the population variance)/n ] (83) The Central Limit Theorem (CLT) is stated as follows: Given a large random sample of size n from a population with mean  and standard deviation , then The sample mean X is approximately normally distributed with mean and standard deviation given by  =  X  =  / n X Note: 1. n >30 is usually large enough for the CLT to apply. 2. If the population from which we sample is normal thenX is exactly normally distributed with mean and standard deviation as above for any sample size. Example: The population of white males aged 35-44 in Canada has a mean systolic blood pressure of 130 with a standard deviation of 8. Find the probability that a random sample of 50 white males has a mean systolic blood pressure (a) below 133, (b) above 128, (c) within 2 of the population mean. Solution: Here we are asked to calculate probabilities about the sample meanX. This suggests we use the CLT. Population: Systolic blood pressures of white Canadian males aged 35-44. Population Mean  = 130; Population Stdev  = 8. Since n =50 > 30, the CLT tells us thatX is approximately normally distributed with  =  = 130 X (a) P ( X < 133 ) = (84) and  =  / n X (b) P( X > 128) = (c) Notice that “ X within 2 of the population mean “ means that X lies between 128 and 132. Thus, P ( 128 < X < 132 ) = (85) Example: The scores on a Standardized Reading Test are known to be normally distributed with a mean of  = 10 and standard deviation of  = 1.5. (a) Find the probability that a single individual selected at random scores between 9.4 and 10.6 on this test. (b) Find the probability that a random sample of 25 students has a mean score between 9.4 and 10.6 on this test. (c) Find the probability that a random sample of 64 students has a mean score between 9.4 and 10.6 on this test. Solution: Let X be the score of the randomly selected individual. Then X is normal  = 10,  = 1.5. (a) P( 9.4 < X < 10.6 ) = (86) (b) Since the population the sample came from is normal, then for any n ( hence for n =25), X is exactly normal with  = 10 and  = 1 .5/5 =.3 X X P( 9.4 < X < 10.6 ) = (c) For the same reason as in (b) above, for n =64,X is exactly normal ,  =  = 10 X and  =  / n = 1.5/8 = 0.1875. X Therefore, P( 9.4 <X < 10.6 ) = (87)

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download INTRODUCTION - Department of Mathematics and Statistics